---
name: validate
description: Run the validation framework against one or more trace IDs, traces in a date range, or all traces in a session
argument-hint: <trace_id(s) | --from <datetime> --to <datetime> | --session-id <id>>
allowed-tools: Bash, Read, Write
---

## Instructions

Run the validation framework using the provided input. The skill supports:
- **Single trace**: `/validate <trace_id>`
- **Multiple traces**: `/validate <id1>,<id2>,<id3>`
- **Date range**: `/validate --from 2026-02-10T00:00:00+00:00 --to 2026-02-15T23:59:59+00:00`
- **Date range with user filter**: `/validate --from <datetime> --to <datetime> --user-id <user_id>`
- **Session ID**: `/validate --session-id <langfuse_session_id>`

---

### Step 1: Determine Input Mode and Run batch_validate.py

**If `$ARGUMENTS` is empty or blank**, read the latest trace ID from the persistent state file before proceeding:

```bash
python3 -c "
import json, pathlib
# Walk up from CWD to find the .claude directory
d = pathlib.Path.cwd()
while d != d.parent:
    candidate = d / '.claude' / 'state' / 'current_trace.json'
    if candidate.exists():
        print(json.loads(candidate.read_text())['trace_id'])
        break
    d = d.parent
"
```

Use the printed trace ID as `$ARGUMENTS` for the rest of this step.

First, resolve the project root directory and the script path:

```bash
# PROJECT_ROOT is the current working directory (the repo root containing .altimate-code/ or .claude/)
PROJECT_ROOT="$(pwd)"
VALIDATE_SCRIPT="$(find "$PROJECT_ROOT/.altimate-code/skills/validate" "$HOME/.altimate-code/skills/validate" "$PROJECT_ROOT/.claude/skills/validate" "$HOME/.claude/skills/validate" -name "batch_validate.py" 2>/dev/null | head -1)"
```

Parse `$ARGUMENTS` to determine the mode and construct the command:
- If it contains `--session-id` → session mode: `uv run --with python-dotenv --with langfuse python "$VALIDATE_SCRIPT" --project-root "$PROJECT_ROOT" --session-id "<session_id>"`
- If it contains `--from` → date range mode: `uv run --with python-dotenv --with langfuse python "$VALIDATE_SCRIPT" --project-root "$PROJECT_ROOT" --from-time "<from>" --to-time "<to>"` (add `--user-id` if present)
- If it contains commas (and NOT `--from` or `--session-id`) → multiple trace IDs: `uv run --with python-dotenv --with langfuse python "$VALIDATE_SCRIPT" --project-root "$PROJECT_ROOT" --trace-ids "$ARGUMENTS"`
- Otherwise → single trace ID: `uv run --with python-dotenv --with langfuse python "$VALIDATE_SCRIPT" --project-root "$PROJECT_ROOT" --trace-ids "$ARGUMENTS"`
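
The mode-selection rules above can be sketched as a small helper. This is a sketch only — the actual parsing is done inline by the agent, and the function name `parse_mode` is hypothetical:

```python
def parse_mode(arguments: str) -> str:
    """Map the raw $ARGUMENTS string to one of the four input modes.

    Order matters: flag checks must come before the comma check,
    since datetime arguments may themselves never contain commas
    but a flag string could.
    """
    args = arguments.strip()
    if "--session-id" in args:
        return "session"
    if "--from" in args:
        return "date_range"  # --user-id, if present, is passed through unchanged
    if "," in args:
        return "multi_trace"
    return "single_trace"
```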

Run the command:

```bash
uv run --with python-dotenv --with langfuse python "$VALIDATE_SCRIPT" --project-root "$PROJECT_ROOT" <appropriate_args>
```

The script will:
- Query Langfuse for date range traces (if applicable)
- Call the validation API for each trace
- Write raw JSON results to `logs/batch_validation_<timestamp>.json`
- Create a report folder `logs/batch_validation_<timestamp>/` containing:
  - Individual per-trace markdown reports: `trace_1_<id>.md`, `trace_2_<id>.md`, ...
  - Cross-trace summary: `SUMMARY.md`
- Output JSON to stdout

**IMPORTANT**: The stdout output may be very large. Read the output carefully. The JSON structure is:
```json
{
  "total_traces": N,
  "results": [
    {
      "trace_id": "...",
      "status_code": 200,
      "result": {
        "trace_id": "...",
        "status": "success",
        "error_count": 0,
        "criteria_results": {
          "Groundedness": {"text_response": "...", "input_tokens": ..., "output_tokens": ..., "total_tokens": ..., "model_name": "..."},
          "Validity": {...},
          "Coherence": {...},
          "Utility": {...},
          "Tool Validation": {...}
        },
        "observation_count": N,
        "elapsed_seconds": N
      }
    }
  ],
  "summary": {
    "Groundedness": {"average_score": N, "min_score": N, "max_score": N, ...},
    ...
  },
  "log_file": "logs/batch_validation_...",
  "report_dir": "logs/batch_validation_<timestamp>"
}
```
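
As a minimal sketch of how this output can be consumed (the field names come from the JSON structure documented above; the helper name `extract_scores` is an illustration, not part of batch_validate.py):

```python
import json


def extract_scores(batch_output: str) -> dict:
    """Pull per-trace criteria results out of batch_validate.py's stdout JSON.

    Returns {trace_id: criteria_results}, keeping only traces where the
    API call returned 200 and the result status is "success".
    """
    data = json.loads(batch_output)
    scores = {}
    for entry in data["results"]:
        result = entry.get("result", {})
        if entry.get("status_code") == 200 and result.get("status") == "success":
            scores[result["trace_id"]] = result.get("criteria_results", {})
    return scores
```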

---

### Step 2: For Each Trace - Semantic Matching (Groundedness Post-Processing)

For EACH trace in the results array, apply semantic matching to Groundedness:

1. Parse the `criteria_results.Groundedness.text_response` and identify all **failed claims** whose `input_transformation_type` is `string_match`.
2. If there are claims identified:
   2.1. **For each claim**, check whether `claim_text` and `source_data` are semantically the same.
      - Two statements are considered **semantically the same** if they express the same information about the same topic, even when the wording differs.
      - Two statements are considered **semantically different** if they talk about different topics or convey conflicting information.
      - If semantically the same → update the claim's status to `SUCCESS`.
   2.2. Re-count the claims whose status is `FAILURE`.
   2.3. Update `failed_count` with the re-counted number.
   2.4. Re-calculate `OverallScore` as `round(((total_claims - failed_count) / total_claims) * 5, 2)`, where `total_claims` is the total number of claims.
3. If no claims are identified, do nothing.
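
The recalculation in step 2.4 can be sketched as follows (a sketch under the assumption that each claim is a dict with a `status` field, as described above; `recalc_overall_score` is a hypothetical name):

```python
def recalc_overall_score(claims: list) -> float:
    """Re-derive the Groundedness overall score after semantic matching.

    Counts remaining FAILURE claims and maps the pass ratio onto a 0-5
    scale, rounded to two decimals, per the formula in step 2.4.
    """
    total_claims = len(claims)
    failed_count = sum(1 for c in claims if c["status"] == "FAILURE")
    return round(((total_claims - failed_count) / total_claims) * 5, 2)
```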

**This semantic-matching pass is needed because the deterministic tool performs only string matching, not semantic comparison.**

When doing this task, first generate a sequence of steps as a plan, then execute it step by step for consistency.

---

### Step 3: For Each Trace - Semantic Reason Generation (Groundedness Post-Processing)

For EACH trace in the results array, apply semantic reason generation to Groundedness:

1. Parse the `criteria_results.Groundedness.text_response` and identify all **claims**.
2. If there are claims identified, then **for each claim**:
   2.1. If the claim status is `SUCCESS` → generate a brief but complete reason explaining **why it succeeded** (e.g. the claim matches the source data, the value is within acceptable error) and update the claim's `reason` field with the generated reason.
      - REMEMBER to include full supporting detail in the reason: the tool-calculated value as well as the actual claimed value.
   2.2. If the claim status is `FAILURE` → generate a brief but complete reason explaining **why it failed** (e.g. the claimed value differs from the source data, the error exceeds the threshold) and update the claim's `reason` field with the generated reason.
      - REMEMBER to include full supporting detail in the reason: the tool-calculated value as well as the actual claimed value.
3. If no claims are identified, do nothing.

**This ensures every claim has a human-readable, semantically generated reason regardless of its outcome.**

When doing this task, first generate a sequence of steps as a plan, then execute it step by step for consistency.

---

### Step 4: Present Per-Trace Results

For EACH trace, present the results in the following format:

---

## Trace: `<trace_id>`

### Criteria Summary Table

| Criteria | Status | Score |
|---|---|---|
| **Groundedness** | <status> | <score>/5 |
| **Validity** | <status> | <score>/5 |
| **Coherence** | <status> | <score>/5 |
| **Utility** | <status> | <score>/5 |
| **Tool Validation** | <status> | <score>/5 |

Note: **treat 'RIGHT NODE' as 'SUCCESS' and 'WRONG NODE' as 'FAILURE', if present.**

### Per-Criteria Node Results

For **Validity**, **Coherence**, and **Utility**, show a node-level breakdown table:

| Node | Score | Status |
|---|---|---|
| <node_name> | <score> | <status> |

### Individual Criteria Results

#### Groundedness

Generate a summary of the generated groundedness response detailing strengths and weaknesses.

Then display **ALL the claims** in a markdown table with these columns:

| # | Source Tool | Source Data | Input Data | Claim Text | Claimed | Input | Conversion Statement | Calculated | Error | Status | Reason |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <claim_id> | <source_tool_id> | <source_data> | <input_data> | <claim_text> | <claimed_value> <claim_unit> | <input_data> | <input-to-claim conversion statement> | <calculated_claim> <claim_unit> | <error in claim as %> | SUCCESS/FAILURE | <reason> |

Then show a separate **Failed Claims Summary** in markdown table format with only the failed claims:

| # | Claim | Claimed | Source Tool ID | Actual Text | Actual Data | Error | Root Cause |
|---|---|---|---|---|---|---|---|
| <claim_id> | <claim_text> | <claimed_value> | <source_tool_id> | <source_data> | <input_data> | <error %> | <reasoning> |

REMEMBER to generate each value COMPLETELY. DO NOT TRUNCATE.

#### Validity
Generate a summary of the generated validity response detailing strengths and weaknesses.

#### Coherence
Generate a summary of the generated coherence response detailing strengths and weaknesses.

#### Utility
Generate a summary of the generated utility response detailing strengths and weaknesses.

#### Tool Validation
Generate a summary of the generated tool validation response detailing strengths and weaknesses.

Then display all the tool details in a markdown table:

| # | Tool Name | Tool Status |
|---|---|---|
| <id> | <tool_name> | <tool_status> |

REMEMBER to generate each value completely. NO TRUNCATION.

After presenting each trace result, write it to a markdown file inside the report directory. Read `report_dir` from the batch_validate.py JSON output. Use the trace index (1-based) and the first 12 characters of the trace ID for the filename:

```bash
cat > "<report_dir>/trace_<N>_<first_12_chars_of_id>.md" <<'TRACE_EOF'
<full per-trace result output from above>
TRACE_EOF
```
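
The filename convention above can be expressed as a one-liner (a sketch; `trace_report_name` is a hypothetical helper name, not part of batch_validate.py):

```python
def trace_report_name(index: int, trace_id: str) -> str:
    """Per-trace report filename: 1-based index plus the first 12 chars of the trace ID."""
    return f"trace_{index}_{trace_id[:12]}.md"
```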

---

### Step 5: Cross-Trace Comprehensive Summary (for all evaluations)

After presenting all individual trace results, generate a comprehensive summary:

#### Overall Score Summary

| Criteria | Average Score | Min | Max | Traces Evaluated |
|---|---|---|---|---|
| **Groundedness** | <avg>/5 | <min>/5 | <max>/5 | <count> |
| **Validity** | <avg>/5 | <min>/5 | <max>/5 | <count> |
| **Coherence** | <avg>/5 | <min>/5 | <max>/5 | <count> |
| **Utility** | <avg>/5 | <min>/5 | <max>/5 | <count> |
| **Tool Validation** | <avg>/5 | <min>/5 | <max>/5 | <count> |

Use the scores AFTER the semantic matching corrections from Step 2, and the reasons AFTER the semantic reason generation from Step 3.

#### Per-Trace Score Breakdown

| Trace ID | Groundedness | Validity | Coherence | Utility | Tool Validation |
|---|---|---|---|---|---|
| <id> | <score>/5 | <score>/5 | <score>/5 | <score>/5 | <score>/5 |
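
The per-criterion aggregation feeding the Overall Score Summary can be sketched as below. The output keys mirror the `summary` fields in the batch_validate.py JSON; the helper name `summarize_criterion` is an assumption for illustration:

```python
def summarize_criterion(per_trace_scores: list) -> dict:
    """Aggregate one criterion's corrected scores across traces.

    Takes the post-semantic-matching scores (one float per trace) and
    returns the average/min/max plus the trace count for the table.
    """
    return {
        "average_score": round(sum(per_trace_scores) / len(per_trace_scores), 2),
        "min_score": min(per_trace_scores),
        "max_score": max(per_trace_scores),
        "traces_evaluated": len(per_trace_scores),
    }
```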

#### Category-Wise Analysis

For EACH category, provide:
- **Common Strengths**: Patterns of success observed across traces
- **Common Weaknesses**: Recurring issues found across traces
- **Recommendations**: Actionable improvements based on the analysis

After generating the overall summary, write it to `SUMMARY.md` inside the report directory:

```bash
cat > "<report_dir>/SUMMARY.md" <<'SUMMARY_MD_EOF'
<full cross-trace summary output from above>
SUMMARY_MD_EOF
```

---

### Step 6: Log Summary to File

Append the comprehensive summary (with semantic matching corrections and semantic reasons) to the log file. Read the `log_file` path from the batch_validate.py output and append:

```bash
cat >> "<log_file_path>" <<'SUMMARY_EOF'

=== COMPREHENSIVE SUMMARY (with semantic matching corrections and semantic reasons) ===
<paste the cross-trace summary and per-trace corrected scores here>
SUMMARY_EOF
```

This ensures the log file contains both the raw API results and the post-processed summary with semantic matching corrections and semantic reasons.