skills/tilegym-converting-cutile-to-julia/BENCHMARK.md
Lines changed: 36 additions & 15 deletions
@@ -7,14 +7,19 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s
 ## Evaluation Summary

 - Skill: `tilegym-converting-cutile-to-julia`
-- Evaluation date: 2026-05-29
+- Evaluation date: 2026-06-10
 - NVSkills-Eval profile: `external`
+- Environment: `astra-sandbox`
+- Dataset: 5 evaluation tasks
+- Attempts per task: 1
+- Pass threshold: 50%
 - Overall verdict: FAIL
-- Tier 3 live agent evaluation: not available in this report
+The skill should be reviewed before NVSkills-Eval publication. **Skill owners should address the applicable findings below and rerun NVSkills-Eval to refresh this benchmark.**

 ## Agents Used

-- Tier 3 agent details were not available in this report.
-`accuracy` (Accuracy): grades final-answer correctness against the reference answer.
+- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
+- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
+- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.

 ## Test Tasks

-Tier 3 evaluation task details were not available in this report.
+The benchmark dataset contained 5 evaluation tasks:
+
+- Positive tasks: 1 tasks where the skill was expected to activate.
+- Negative tasks: 4 tasks where no skill was expected.
+- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.
+
+Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases.

 ## Results

-Tier 3 dimension rollup was not available in this report.
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 5 | 100% (+0%) | 100% (+0%) |
+| Correctness | 5 | 100% (+20%) | 99% (+14%) |
+| Discoverability | 5 | 100% (+20%) | 99% (+8%) |
+| Effectiveness | 5 | 99% (+18%) | 96% (+18%) |
+| Efficiency | 5 | 96% (+13%) | 97% (+7%) |
+
+Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.

 ## Tier 1: Static Validation Summary

-Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 20 total findings.
+Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 16 total findings.

 Top findings:

 - MEDIUM QUALITY/quality_correctness: No documented scripts in table format (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
 - MEDIUM QUALITY/quality_correctness: Instructions don't mention 'run_script' (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
 - MEDIUM QUALITY/quality_efficiency: Deeply nested references in debugging.md (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
 - MEDIUM SCHEMA/body_recommended_section: Missing recommended section: '## Examples' (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
-- MEDIUM SECURITY/Unknown (SDI-2): A code translation skill (Python to Julia GPU kernel conversion) should not need to output shell commands as part of its (`skill-card.md:29`)
+- MEDIUM SECURITY/Unknown (SQP-2): The workflow instructs the agent to auto-proceed through all phases without user confirmation, including writing files t (`translations/workflow.md:26`)

 ## Tier 2: Deduplication Summary

 Tier 2 validation reported findings. NVSkills-Eval ran 2 checks and found 6 total findings.

 Top findings:

-- HIGH DUPLICATE/duplicate: Duplicate content found across references/testing.md and translations/workflow.md:
-"### Step 2: Register in `julia/test/runtests.jl`" in references/testing.md (lines 67-79)
-vs "### Step 2: Register in `julia/test/runtests.jl`" in translations/workflow.md (lines 355-363) (`references/testing.md:67`)
 - HIGH DUPLICATE/duplicate: Duplicate content found within references/critical-rules.md:
 "# Critical Rules for cuTile Python → Julia Conversion" in references/critical-rules.md (lines 32-33)
 vs "# Critical Rules for cuTile Python → Julia Conversion" in references/critical-rules.md (lines 34-36) (`references/critical-rules.md:32`)
 - HIGH DUPLICATE/duplicate: Duplicate content found across references/api-mapping.md and references/critical-rules.md and translations/workflow.md:
 "## Memory Layout Considerations" in references/api-mapping.md (lines 233-248)
 vs "# Critical Rules for cuTile Python → Julia Conversion" in references/critical-rules.md (lines 8-8)
 vs "### Step 4: Memory Layout Considerations" in translations/workflow.md (lines 288-305) (`references/api-mapping.md:233`)
+- HIGH DUPLICATE/duplicate: Duplicate content found across references/testing.md and translations/workflow.md:
+"### Step 2: Register in `julia/test/runtests.jl`" in references/testing.md (lines 67-79)
+vs "### Step 2: Register in `julia/test/runtests.jl`" in translations/workflow.md (lines 355-363) (`references/testing.md:67`)
 - HIGH DUPLICATE/duplicate: Duplicate content found across SKILL.md and references/testing.md and translations/workflow.md:
 "# Run tests" in SKILL.md (lines 92-100)
 vs "### Step 1: Create test file `julia/test/test_<op>.jl`" in references/testing.md (lines 43-48)
@@ -75,7 +100,3 @@ Top findings:
 "# Run a single test file directly" in references/testing.md (lines 32-34)
 vs "# Run a single test file directly" in translations/workflow.md (lines 106-108)
 vs "# Run a single test file" in translations/workflow.md (lines 370-379) (`references/testing.md:32`)
-
-## Publication Recommendation
-
-The skill should be reviewed before NVSkills-Eval publication. Skill owners should address the findings above and rerun NVSkills-Eval to refresh this benchmark.
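
Reviewer note: the 1 positive / 4 negative / 0 unlabeled split reported under "Test Tasks" follows from the `expected_skill` rule stated there. The rule can be sketched as below. This is a minimal illustration and not NVSkills-Eval's own code: the entry shape, the task ids, and the reading of "unlabeled" as an entry with no `expected_skill` key at all are assumptions.

```python
from collections import Counter


def classify_task(entry: dict) -> str:
    """Label one dataset entry by the activation it expects.

    Assumption: an entry without an `expected_skill` key counts as unlabeled.
    """
    if "expected_skill" not in entry:
        return "unlabeled"
    # `expected_skill: null` in the dataset becomes None once loaded.
    return "negative" if entry["expected_skill"] is None else "positive"


def summarize(entries: list) -> dict:
    """Count positive, negative, and unlabeled tasks in a fixed order."""
    counts = Counter(classify_task(e) for e in entries)
    return {k: counts.get(k, 0) for k in ("positive", "negative", "unlabeled")}


# Hypothetical dataset mirroring the reported composition.
tasks = [
    {"id": "t1", "expected_skill": "tilegym-converting-cutile-to-julia"},
    {"id": "t2", "expected_skill": None},
    {"id": "t3", "expected_skill": None},
    {"id": "t4", "expected_skill": None},
    {"id": "t5", "expected_skill": None},
]

print(summarize(tasks))  # {'positive': 1, 'negative': 4, 'unlabeled': 0}
```

With a single positive task, the positive-activation scores in this report rest on one sample per agent, which is worth keeping in mind when reading the results table.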
skills/tilegym-converting-cutile-to-julia/skill-card.md
Lines changed: 32 additions & 5 deletions
@@ -9,7 +9,7 @@ NVIDIA <br>
 ### License/Terms of Use: <br>
 CC-BY-4.0 AND Apache-2.0 <br>
 ## Use Case: <br>
-Developers and engineers converting cuTile Python GPU kernels to cuTile.jl Julia equivalents, porting or translating kernel code, or debugging and optimizing existing Julia cuTile translations. <br>
+Developers and engineers converting cuTile Python GPU kernels to cuTile.jl Julia equivalents, porting kernel implementations across languages, or debugging and optimizing existing Julia cuTile translations. <br>

 ### Deployment Geography for Use: <br>
 Global <br>
@@ -19,19 +19,28 @@ Risk: Review before execution as proposals could introduce incorrect or misleadi
 Mitigation: Review and scan skill before deployment. <br>

 ## Reference(s): <br>
-- [API Mapping (Python to Julia)](references/api-mapping.md) <br>
-`accuracy`: Grades final-answer correctness against the reference answer. <br>
+- `goal_accuracy`: Checks whether the overall user task completed successfully. <br>
+- `behavior_check`: Verifies expected behavior steps, including safety expectations. <br>
+- `token_efficiency`: Compares token usage with and without the skill. <br>
+
+

+## Evaluation Results: <br>
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 5 | 100% (+0%) | 100% (+0%) |
+| Correctness | 5 | 100% (+20%) | 99% (+14%) |
+| Discoverability | 5 | 100% (+20%) | 99% (+8%) |
+| Effectiveness | 5 | 99% (+18%) | 96% (+18%) |
+| Efficiency | 5 | 96% (+13%) | 97% (+7%) |

 ## Skill Version(s): <br>
-v1.3.0-19-g8da79ba (source: git tag) <br>
+v1.3.0 (source: git tag) <br>

 ## Ethical Considerations: <br>
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>
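
Reviewer note: the parenthesized values in the Evaluation Results table are described in BENCHMARK.md as uplift versus the no-skill baseline, shown only when baseline data is available. A minimal sketch of how such a cell might be rendered, assuming scores are fractions in [0, 1] and uplift is a percentage-point difference. `format_cell` and the sample inputs are hypothetical and not taken from NVSkills-Eval.

```python
from typing import Optional


def format_cell(with_skill: float, baseline: Optional[float]) -> str:
    """Render a score cell such as '100% (+20%)'.

    Assumption: uplift is the skill-assisted score minus the no-skill
    baseline, in percentage points, and is omitted without a baseline.
    """
    score = f"{round(with_skill * 100)}%"
    if baseline is None:
        return score
    uplift = round((with_skill - baseline) * 100)
    # The '+' format flag forces an explicit sign, so zero renders as '+0'.
    return f"{score} ({uplift:+d}%)"


# Illustrative inputs only; the report does not publish baseline scores.
print(format_cell(1.0, 0.8))   # 100% (+20%)
print(format_cell(1.0, 1.0))   # 100% (+0%)
print(format_cell(0.96, None)) # 96%
```

Under this reading, a `+0%` cell means the agent already scored the same without the skill, which is how the Security row would be interpreted.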