NVIDIA
diff --git a/skills/tilegym-converting-cutile-to-julia/BENCHMARK.md b/skills/tilegym-converting-cutile-to-julia/BENCHMARK.md
Lines changed: 36 additions & 15 deletions
diff --git a/skills/tilegym-converting-cutile-to-julia/skill-card.md b/skills/tilegym-converting-cutile-to-julia/skill-card.md
Lines changed: 32 additions & 5 deletions
@@ -7,14 +7,19 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s
 ## Evaluation Summary
 
 - Skill: `tilegym-converting-cutile-to-julia`
-- Evaluation date: 2026-05-29
+- Evaluation date: 2026-06-10
 - NVSkills-Eval profile: `external`
+- Environment: `astra-sandbox`
+- Dataset: 5 evaluation tasks
+- Attempts per task: 1
+- Pass threshold: 50%
 - Overall verdict: FAIL
-- Tier 3 live agent evaluation: not available in this report
+The skill should be reviewed before NVSkills-Eval publication. **Skill owners should address the applicable findings below and rerun NVSkills-Eval to refresh this benchmark.**
 
 ## Agents Used
 
-- Tier 3 agent details were not available in this report.
+- `claude-code`
+- `codex`
 
 ## Metrics Used
 
@@ -28,44 +33,64 @@ Reported benchmark dimensions:
 
 Underlying evaluation signals used in this run:
 
-- No Tier 3 evaluation signal details were available in this report.
+- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
+- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
+- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
+- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.
 
 ## Test Tasks
 
-Tier 3 evaluation task details were not available in this report.
+The benchmark dataset contained 5 evaluation tasks:
+
+- Positive tasks: 1 task where the skill was expected to activate.
+- Negative tasks: 4 tasks where no skill was expected.
+- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.
+
+Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases.
 
 ## Results
 
-Tier 3 dimension rollup was not available in this report.
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 5 | 100% (+0%) | 100% (+0%) |
+| Correctness | 5 | 100% (+20%) | 99% (+14%) |
+| Discoverability | 5 | 100% (+20%) | 99% (+8%) |
+| Effectiveness | 5 | 99% (+18%) | 96% (+18%) |
+| Efficiency | 5 | 96% (+13%) | 97% (+7%) |
+
+Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
 
 ## Tier 1: Static Validation Summary
 
-Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 20 total findings.
+Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 16 total findings.
 
 Top findings:
 
 - MEDIUM QUALITY/quality_correctness: No documented scripts in table format (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
 - MEDIUM QUALITY/quality_correctness: Instructions don't mention 'run_script' (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
 - MEDIUM QUALITY/quality_efficiency: Deeply nested references in debugging.md (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
 - MEDIUM SCHEMA/body_recommended_section: Missing recommended section: '## Examples' (`skills/tilegym-converting-cutile-to-julia/SKILL.md`)
-- MEDIUM SECURITY/Unknown (SDI-2): A code translation skill (Python to Julia GPU kernel conversion) should not need to output shell commands as part of its (`skill-card.md:29`)
+- MEDIUM SECURITY/Unknown (SQP-2): The workflow instructs the agent to auto-proceed through all phases without user confirmation, including writing files t (`translations/workflow.md:26`)
 
 ## Tier 2: Deduplication Summary
 
 Tier 2 validation reported findings. NVSkills-Eval ran 2 checks and found 6 total findings.
 
 Top findings:
 
-- HIGH DUPLICATE/duplicate: Duplicate content found across references/testing.md and translations/workflow.md:
-  "### Step 2: Register in `julia/test/runtests.jl`" in references/testing.md (lines 67-79)
-  vs "### Step 2: Register in `julia/test/runtests.jl`" in translations/workflow.md (lines 355-363) (`references/testing.md:67`)
 - HIGH DUPLICATE/duplicate: Duplicate content found within references/critical-rules.md:
   "# Critical Rules for cuTile Python → Julia Conversion" in references/critical-rules.md (lines 32-33)
   vs "# Critical Rules for cuTile Python → Julia Conversion" in references/critical-rules.md (lines 34-36) (`references/critical-rules.md:32`)
 - HIGH DUPLICATE/duplicate: Duplicate content found across references/api-mapping.md and references/critical-rules.md and translations/workflow.md:
   "## Memory Layout Considerations" in references/api-mapping.md (lines 233-248)
   vs "# Critical Rules for cuTile Python → Julia Conversion" in references/critical-rules.md (lines 8-8)
   vs "### Step 4: Memory Layout Considerations" in translations/workflow.md (lines 288-305) (`references/api-mapping.md:233`)
+- HIGH DUPLICATE/duplicate: Duplicate content found across references/testing.md and translations/workflow.md:
+  "### Step 2: Register in `julia/test/runtests.jl`" in references/testing.md (lines 67-79)
+  vs "### Step 2: Register in `julia/test/runtests.jl`" in translations/workflow.md (lines 355-363) (`references/testing.md:67`)
 - HIGH DUPLICATE/duplicate: Duplicate content found across SKILL.md and references/testing.md and translations/workflow.md:
   "# Run tests" in SKILL.md (lines 92-100)
   vs "### Step 1: Create test file `julia/test/test_<op>.jl`" in references/testing.md (lines 43-48)
@@ -75,7 +100,3 @@ Top findings:
   "# Run a single test file directly" in references/testing.md (lines 32-34)
   vs "# Run a single test file directly" in translations/workflow.md (lines 106-108)
   vs "# Run a single test file" in translations/workflow.md (lines 370-379) (`references/testing.md:32`)
-
-## Publication Recommendation
-
-The skill should be reviewed before NVSkills-Eval publication. Skill owners should address the findings above and rerun NVSkills-Eval to refresh this benchmark.
 
@@ -9,7 +9,7 @@ NVIDIA <br>
 ### License/Terms of Use: <br>
 CC-BY-4.0 AND Apache-2.0 <br>
 ## Use Case: <br>
-Developers and engineers converting cuTile Python GPU kernels to cuTile.jl Julia equivalents, porting or translating kernel code, or debugging and optimizing existing Julia cuTile translations. <br>
+Developers and engineers converting cuTile Python GPU kernels to cuTile.jl Julia equivalents, porting kernel implementations across languages, or debugging and optimizing existing Julia cuTile translations. <br>
 
 ### Deployment Geography for Use: <br>
 Global <br>
@@ -19,19 +19,28 @@ Risk: Review before execution as proposals could introduce incorrect or misleadi
 Mitigation: Review and scan skill before deployment. <br>
 
 ## Reference(s): <br>
-- [API Mapping (Python to Julia)](references/api-mapping.md) <br>
+- [API Mapping Reference](references/api-mapping.md) <br>
 - [Critical Rules](references/critical-rules.md) <br>
 - [Debugging Guide](references/debugging.md) <br>
 - [Testing Patterns](references/testing.md) <br>
 - [Conversion Workflow](translations/workflow.md) <br>
 
 
 ## Skill Output: <br>
-**Output Type(s):** [Code, Files] <br>
-**Output Format:** [Julia source files (.jl)] <br>
+**Output Type(s):** [Code, Shell commands] <br>
+**Output Format:** [Julia source files and shell commands] <br>
 **Output Parameters:** [1D] <br>
 **Other Properties Related to Output:** [None] <br>
 
+## Evaluation Agents Used: <br>
+- claude-code <br>
+- codex <br>
+
+
+
+## Evaluation Tasks: <br>
+5 evaluation tasks (1 positive skill-activation, 4 negative) run under the NVSkills-Eval `external` profile in the `astra-sandbox` environment. <br>
+
 ## Evaluation Metrics Used: <br>
 Reported benchmark dimensions: <br>
 - Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. <br>
@@ -40,10 +49,28 @@ Reported benchmark dimensions: <br>
 - Effectiveness: Checks whether the agent performs measurably better with the skill than without it. <br>
 - Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work. <br>
 
+Underlying evaluation signals used in this run: <br>
+- `security`: Checks for unsafe operations, secret leakage, and unauthorized access. <br>
+- `skill_execution`: Verifies that the agent loaded the expected skill and workflow. <br>
+- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage. <br>
+- `accuracy`: Grades final-answer correctness against the reference answer. <br>
+- `goal_accuracy`: Checks whether the overall user task completed successfully. <br>
+- `behavior_check`: Verifies expected behavior steps, including safety expectations. <br>
+- `token_efficiency`: Compares token usage with and without the skill. <br>
+
+
 
+## Evaluation Results: <br>
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 5 | 100% (+0%) | 100% (+0%) |
+| Correctness | 5 | 100% (+20%) | 99% (+14%) |
+| Discoverability | 5 | 100% (+20%) | 99% (+8%) |
+| Effectiveness | 5 | 99% (+18%) | 96% (+18%) |
+| Efficiency | 5 | 96% (+13%) | 97% (+7%) |
 
 ## Skill Version(s): <br>
-v1.3.0-19-g8da79ba (source: git tag) <br>
+v1.3.0 (source: git tag) <br>
 
 ## Ethical Considerations: <br>
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>
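
As an illustration of the task-composition rule stated in BENCHMARK.md (entries with `expected_skill` set count as positive, entries with `expected_skill: null` count as negative), the following Python sketch tallies a small dataset. The `classify_task` and `summarize` helpers, the task `id` values, and the treatment of a missing `expected_skill` key as unlabeled are assumptions made for illustration; they are not taken from NVSkills-Eval itself.

```python
from collections import Counter


def classify_task(entry: dict) -> str:
    """Hypothetical helper: label one dataset entry by activation intent."""
    # Assumption: an entry with no `expected_skill` key at all is "unlabeled".
    if "expected_skill" not in entry:
        return "unlabeled"
    # `expected_skill: null` (None in Python) marks a negative activation case.
    if entry["expected_skill"] is None:
        return "negative"
    # Any non-null value marks a positive skill-activation case.
    return "positive"


def summarize(entries: list) -> dict:
    """Hypothetical helper: count entries per label, always reporting all three."""
    counts = Counter(classify_task(entry) for entry in entries)
    return {label: counts.get(label, 0) for label in ("positive", "negative", "unlabeled")}


# Illustrative dataset mirroring the reported composition (ids are made up).
tasks = [
    {"id": "task-1", "expected_skill": "tilegym-converting-cutile-to-julia"},
    {"id": "task-2", "expected_skill": None},
    {"id": "task-3", "expected_skill": None},
    {"id": "task-4", "expected_skill": None},
    {"id": "task-5", "expected_skill": None},
]

print(summarize(tasks))
```

With one positive and four negative entries, the tally matches the 1 / 4 / 0 composition reported in this benchmark.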