Commit 656cab2

Attach NVSkills validation signatures
Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
1 parent bac3196 commit 656cab2

9 files changed

Lines changed: 209 additions & 71 deletions

skills/tilegym-converting-cutile-to-julia/BENCHMARK.md

Lines changed: 36 additions & 15 deletions
@@ -7,14 +7,19 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s
 ## Evaluation Summary
 
 - Skill: `tilegym-converting-cutile-to-julia`
-- Evaluation date: 2026-05-29
+- Evaluation date: 2026-06-10
 - NVSkills-Eval profile: `external`
+- Environment: `astra-sandbox`
+- Dataset: 5 evaluation tasks
+- Attempts per task: 1
+- Pass threshold: 50%
 - Overall verdict: FAIL
-- Tier 3 live agent evaluation: not available in this report
+The skill should be reviewed before NVSkills-Eval publication. **Skill owners should address the applicable findings below and rerun NVSkills-Eval to refresh this benchmark.**
 
 ## Agents Used
 
-- Tier 3 agent details were not available in this report.
+- `claude-code`
+- `codex`
 
 ## Metrics Used
 
@@ -28,44 +33,64 @@ Reported benchmark dimensions:
 
 Underlying evaluation signals used in this run:
 
-- No Tier 3 evaluation signal details were available in this report.
+- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
+- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
+- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
+- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.
 
 ## Test Tasks
 
-Tier 3 evaluation task details were not available in this report.
+The benchmark dataset contained 5 evaluation tasks:
+
+- Positive tasks: 1 tasks where the skill was expected to activate.
+- Negative tasks: 4 tasks where no skill was expected.
+- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.
+
+Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases.
 
 ## Results
 
-Tier 3 dimension rollup was not available in this report.
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 5 | 100% (+0%) | 100% (+0%) |
+| Correctness | 5 | 100% (+20%) | 99% (+14%) |
+| Discoverability | 5 | 100% (+20%) | 99% (+8%) |
+| Effectiveness | 5 | 99% (+18%) | 96% (+18%) |
+| Efficiency | 5 | 96% (+13%) | 97% (+7%) |
+
+Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
 
 ## Tier 1: Static Validation Summary
 
-Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 20 total findings.
+Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 16 total findings.
 
 Top findings:
 
 - MEDIUM QUALITY/quality_correctness: No documented scripts in table format (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
 - MEDIUM QUALITY/quality_correctness: Instructions don't mention 'run_script' (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
 - MEDIUM QUALITY/quality_efficiency: Deeply nested references in debugging.md (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
 - MEDIUM SCHEMA/body_recommended_section: Missing recommended section: '## Examples' (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
-- MEDIUM SECURITY/Unknown (SDI-2): A code translation skill (Python to Julia GPU kernel conversion) should not need to output shell commands as part of its (`skill-card.md:29`)
+- MEDIUM SECURITY/Unknown (SQP-2): The workflow instructs the agent to auto-proceed through all phases without user confirmation, including writing files t (`translations/workflow.md:26`)
 
 ## Tier 2: Deduplication Summary
 
 Tier 2 validation reported findings. NVSkills-Eval ran 2 checks and found 6 total findings.
 
 Top findings:
 
-- HIGH DUPLICATE/duplicate: Duplicate content found across references/testing.md and translations/workflow.md:
-"### Step 2: Register in `julia/test/runtests.jl`" in references/testing.md (lines 67-79)
-vs "### Step 2: Register in `julia/test/runtests.jl`" in translations/workflow.md (lines 355-363) (`references/testing.md:67`)
 - HIGH DUPLICATE/duplicate: Duplicate content found within references/critical-rules.md:
 "# Critical Rules for cuTile Python → Julia Conversion" in references/critical-rules.md (lines 32-33)
 vs "# Critical Rules for cuTile Python → Julia Conversion" in references/critical-rules.md (lines 34-36) (`references/critical-rules.md:32`)
 - HIGH DUPLICATE/duplicate: Duplicate content found across references/api-mapping.md and references/critical-rules.md and translations/workflow.md:
 "## Memory Layout Considerations" in references/api-mapping.md (lines 233-248)
 vs "# Critical Rules for cuTile Python → Julia Conversion" in references/critical-rules.md (lines 8-8)
 vs "### Step 4: Memory Layout Considerations" in translations/workflow.md (lines 288-305) (`references/api-mapping.md:233`)
+- HIGH DUPLICATE/duplicate: Duplicate content found across references/testing.md and translations/workflow.md:
+"### Step 2: Register in `julia/test/runtests.jl`" in references/testing.md (lines 67-79)
+vs "### Step 2: Register in `julia/test/runtests.jl`" in translations/workflow.md (lines 355-363) (`references/testing.md:67`)
 - HIGH DUPLICATE/duplicate: Duplicate content found across SKILL.md and references/testing.md and translations/workflow.md:
 "# Run tests" in SKILL.md (lines 92-100)
 vs "### Step 1: Create test file `julia/test/test_<op>.jl`" in references/testing.md (lines 43-48)
@@ -75,7 +100,3 @@ Top findings:
 "# Run a single test file directly" in references/testing.md (lines 32-34)
 vs "# Run a single test file directly" in translations/workflow.md (lines 106-108)
 vs "# Run a single test file" in translations/workflow.md (lines 370-379) (`references/testing.md:32`)
-
-## Publication Recommendation
-
-The skill should be reviewed before NVSkills-Eval publication. Skill owners should address the findings above and rerun NVSkills-Eval to refresh this benchmark.
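
Note on the task-composition rule added to BENCHMARK.md: entries with `expected_skill` set count as positive cases, and entries with `expected_skill: null` count as negative cases. The rule could be sketched roughly as below. This is an illustration only, not NVSkills-Eval code: the function names, the dict-shaped entries, the sample data, and the choice to treat a missing `expected_skill` key as "unlabeled" are all assumptions.

```python
from collections import Counter

LABELS = ("positive", "negative", "unlabeled")


def classify_task(entry: dict) -> str:
    """Label one dataset entry by its `expected_skill` field.

    Assumption: a missing key means the intent could not be inferred.
    """
    if "expected_skill" not in entry:
        return "unlabeled"
    # An explicit null (None in Python) means no skill should activate.
    return "negative" if entry["expected_skill"] is None else "positive"


def summarize(entries: list[dict]) -> dict[str, int]:
    """Count entries per label, always reporting all three labels."""
    counts = Counter(classify_task(e) for e in entries)
    return {label: counts.get(label, 0) for label in LABELS}


# Hypothetical dataset shaped like the reported 1 positive / 4 negative split.
SAMPLE_TASKS = [
    {"id": "t1", "expected_skill": "tilegym-converting-cutile-to-julia"},
    {"id": "t2", "expected_skill": None},
    {"id": "t3", "expected_skill": None},
    {"id": "t4", "expected_skill": None},
    {"id": "t5", "expected_skill": None},
]

if __name__ == "__main__":
    print(summarize(SAMPLE_TASKS))
```

With a single positive task, the positive-case dimensions rest on one sample per agent, which is worth keeping in mind when reading the percentages.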

skills/tilegym-converting-cutile-to-julia/skill-card.md

Lines changed: 32 additions & 5 deletions
@@ -9,7 +9,7 @@ NVIDIA <br>
 ### License/Terms of Use: <br>
 CC-BY-4.0 AND Apache-2.0 <br>
 ## Use Case: <br>
-Developers and engineers converting cuTile Python GPU kernels to cuTile.jl Julia equivalents, porting or translating kernel code, or debugging and optimizing existing Julia cuTile translations. <br>
+Developers and engineers converting cuTile Python GPU kernels to cuTile.jl Julia equivalents, porting kernel implementations across languages, or debugging and optimizing existing Julia cuTile translations. <br>
 
 ### Deployment Geography for Use: <br>
 Global <br>
@@ -19,19 +19,28 @@ Risk: Review before execution as proposals could introduce incorrect or misleadi
 Mitigation: Review and scan skill before deployment. <br>
 
 ## Reference(s): <br>
-- [API Mapping (Python to Julia)](references/api-mapping.md) <br>
+- [API Mapping Reference](references/api-mapping.md) <br>
 - [Critical Rules](references/critical-rules.md) <br>
 - [Debugging Guide](references/debugging.md) <br>
 - [Testing Patterns](references/testing.md) <br>
 - [Conversion Workflow](translations/workflow.md) <br>
 
 
 ## Skill Output: <br>
-**Output Type(s):** [Code, Files] <br>
-**Output Format:** [Julia source files (.jl)] <br>
+**Output Type(s):** [Code, Shell commands] <br>
+**Output Format:** [Julia source files and shell commands] <br>
 **Output Parameters:** [1D] <br>
 **Other Properties Related to Output:** [None] <br>
 
+## Evaluation Agents Used: <br>
+- claude-code <br>
+- codex <br>
+
+
+
+## Evaluation Tasks: <br>
+5 evaluation tasks (1 positive skill-activation, 4 negative) under NVSkills-Eval external profile in astra-sandbox environment. <br>
+
 ## Evaluation Metrics Used: <br>
 Reported benchmark dimensions: <br>
 - Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. <br>
@@ -40,10 +49,28 @@ Reported benchmark dimensions: <br>
 - Effectiveness: Checks whether the agent performs measurably better with the skill than without it. <br>
 - Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work. <br>
 
+Underlying evaluation signals used in this run: <br>
+- `security`: Checks for unsafe operations, secret leakage, and unauthorized access. <br>
+- `skill_execution`: Verifies that the agent loaded the expected skill and workflow. <br>
+- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage. <br>
+- `accuracy`: Grades final-answer correctness against the reference answer. <br>
+- `goal_accuracy`: Checks whether the overall user task completed successfully. <br>
+- `behavior_check`: Verifies expected behavior steps, including safety expectations. <br>
+- `token_efficiency`: Compares token usage with and without the skill. <br>
+
+
 
+## Evaluation Results: <br>
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 5 | 100% (+0%) | 100% (+0%) |
+| Correctness | 5 | 100% (+20%) | 99% (+14%) |
+| Discoverability | 5 | 100% (+20%) | 99% (+8%) |
+| Effectiveness | 5 | 99% (+18%) | 96% (+18%) |
+| Efficiency | 5 | 96% (+13%) | 97% (+7%) |
 
 ## Skill Version(s): <br>
-v1.3.0-19-g8da79ba (source: git tag) <br>
+v1.3.0 (source: git tag) <br>
 
 ## Ethical Considerations: <br>
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>
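
Note on reading the results tables in both files: each cell pairs a skill-assisted score with an uplift in parentheses, for example `100% (+20%)`. The sketch below parses such a cell and derives an implied no-skill baseline. It assumes the uplift is an absolute difference in percentage points (score minus baseline); the diff does not state this, and a relative uplift would give different baselines. The names `parse_cell` and `CELL_PATTERN` are hypothetical.

```python
import re

# Matches cells such as "100% (+20%)" or "96% (+13%)".
CELL_PATTERN = re.compile(r"(\d+)%\s*\(([+-]\d+)%\)")


def parse_cell(cell: str) -> tuple[int, int, int]:
    """Return (score, uplift, implied_baseline) for one results-table cell.

    Assumption: uplift is in percentage points, so baseline = score - uplift.
    """
    match = CELL_PATTERN.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized cell format: {cell!r}")
    score = int(match.group(1))
    uplift = int(match.group(2))
    return score, uplift, score - uplift


if __name__ == "__main__":
    # Cells copied from the Correctness row of the table.
    for agent, cell in (("claude-code", "100% (+20%)"), ("codex", "99% (+14%)")):
        score, uplift, baseline = parse_cell(cell)
        print(f"{agent}: score={score}% uplift={uplift:+d} baseline={baseline}%")
```

Under that assumption, the Correctness row implies baselines of 80% for `claude-code` and 85% for `codex`.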
