Commit eb25894

feat: validation feature inclusion into altimate-code

1 parent 5740133 commit eb25894

File tree: 13 files changed, +1014 −17 lines
.gitignore — 30 additions & 1 deletion

```diff
@@ -23,8 +23,37 @@ target
 .scripts
 .direnv/
 
+# Python
+__pycache__/
+*.pyc
+*.pyo
+*.egg-info/
+
+# SQLite databases (feedback store creates these at runtime)
+*.db
+
+# Runtime logs
+*.log
+logs/
+
+# Large intermediate files at repo root (generated during benchmark runs)
+/queries.json
+/queries_1k.json
+/results/
+
+# Local runtime config
+.altimate-code/
+
+# Commit message scratch files
+.github/meta/
+
+# Experiment / simulation artifacts
+/data/
+/experiments/
+/models/
+/simulation/
+
 # Local dev files
 opencode-dev
-logs/
 *.bun-build
 tsconfig.tsbuildinfo
```
Lines changed: 3 additions & 0 deletions

```json
{
  "hooks": {}
}
```
Lines changed: 271 additions & 0 deletions

---
name: validate
description: Run the validation framework against one or more trace IDs, traces in a date range, or all traces in a session
argument-hint: <trace_id(s) | --from <datetime> --to <datetime> | --session-id <id>>
allowed-tools: Bash, Read, Write
---
## Instructions

Run the validation framework using the provided input. The skill supports:

- **Single trace**: `/validate <trace_id>`
- **Multiple traces**: `/validate <id1>,<id2>,<id3>`
- **Date range**: `/validate --from 2026-02-10T00:00:00+00:00 --to 2026-02-15T23:59:59+00:00`
- **Date range with user filter**: `/validate --from <datetime> --to <datetime> --user-id <user_id>`
- **Session ID**: `/validate --session-id <langfuse_session_id>`

---
### Step 1: Determine Input Mode and Run batch_validate.py

**If `$ARGUMENTS` is empty or blank**, read the latest trace ID from the persistent state file before proceeding:

```bash
python3 -c "
import json, pathlib

# Walk up from CWD to find the .claude directory
d = pathlib.Path.cwd()
while d != d.parent:
    candidate = d / '.claude' / 'state' / 'current_trace.json'
    if candidate.exists():
        print(json.loads(candidate.read_text())['trace_id'])
        break
    d = d.parent
"
```

Use the printed trace ID as `$ARGUMENTS` for the rest of this step.
First, resolve the project root directory and the script path:

```bash
# PROJECT_ROOT is the current working directory (the repo root containing .altimate-code/ or .claude/)
PROJECT_ROOT="$(pwd)"
VALIDATE_SCRIPT="$(find "$PROJECT_ROOT/.altimate-code/skills/validate" "$HOME/.altimate-code/skills/validate" "$PROJECT_ROOT/.claude/skills/validate" "$HOME/.claude/skills/validate" -name "batch_validate.py" 2>/dev/null | head -1)"
```

Parse `$ARGUMENTS` to determine the mode and construct the command:

- If it contains `--session-id` → session mode: `uv run --with python-dotenv --with langfuse python "$VALIDATE_SCRIPT" --project-root "$PROJECT_ROOT" --session-id "<session_id>"`
- If it contains `--from` → date-range mode: `uv run --with python-dotenv --with langfuse python "$VALIDATE_SCRIPT" --project-root "$PROJECT_ROOT" --from-time "<from>" --to-time "<to>"` (add `--user-id` if present)
- If it contains commas (and NOT `--from` or `--session-id`) → multiple trace IDs: `uv run --with python-dotenv --with langfuse python "$VALIDATE_SCRIPT" --project-root "$PROJECT_ROOT" --trace-ids "$ARGUMENTS"`
- Otherwise → single trace ID: `uv run --with python-dotenv --with langfuse python "$VALIDATE_SCRIPT" --project-root "$PROJECT_ROOT" --trace-ids "$ARGUMENTS"`
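The dispatch described above can be sketched as a small shell helper. This is illustrative only, assuming the four modes listed; the `detect_mode` function is hypothetical and not part of batch_validate.py or the skill:

```shell
# Hypothetical sketch of the mode dispatch described above.
detect_mode() {
  args="$1"
  case "$args" in
    *--session-id*) echo "session" ;;  # session mode takes priority
    *--from*)       echo "range"   ;;  # date-range mode (optionally with --user-id)
    *,*)            echo "multi"   ;;  # comma-separated trace IDs
    *)              echo "single"  ;;  # one trace ID
  esac
}
```

For example, `detect_mode "id1,id2"` prints `multi`, which would map to the `--trace-ids` invocation.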
Run the command:

```bash
uv run --with python-dotenv --with langfuse python "$VALIDATE_SCRIPT" --project-root "$PROJECT_ROOT" <appropriate_args>
```

The script will:

- Query Langfuse for date-range traces (if applicable)
- Call the validation API for each trace
- Write raw JSON results to `logs/batch_validation_<timestamp>.json`
- Create a report folder `logs/batch_validation_<timestamp>/` containing:
  - Individual per-trace markdown reports: `trace_1_<id>.md`, `trace_2_<id>.md`, ...
  - Cross-trace summary: `SUMMARY.md`
- Output JSON to stdout

**IMPORTANT**: The stdout output may be very large. Read it carefully. The JSON structure is:

```json
{
  "total_traces": N,
  "results": [
    {
      "trace_id": "...",
      "status_code": 200,
      "result": {
        "trace_id": "...",
        "status": "success",
        "error_count": 0,
        "criteria_results": {
          "Groundedness": {"text_response": "...", "input_tokens": ..., "output_tokens": ..., "total_tokens": ..., "model_name": "..."},
          "Validity": {...},
          "Coherence": {...},
          "Utility": {...},
          "Tool Validation": {...}
        },
        "observation_count": N,
        "elapsed_seconds": N
      }
    }
  ],
  "summary": {
    "Groundedness": {"average_score": N, "min_score": N, "max_score": N, ...},
    ...
  },
  "log_file": "logs/batch_validation_...",
  "report_dir": "logs/batch_validation_<timestamp>"
}
```
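As a minimal sketch of consuming that structure, assuming the field names shown above (`headline_scores` is a hypothetical helper, not part of the framework):

```python
import json

def headline_scores(batch_output: str) -> dict:
    """Map each criteria name to its average_score from the summary block.

    Assumes the JSON shape documented above ('summary' keyed by criteria,
    each carrying an 'average_score').
    """
    data = json.loads(batch_output)
    return {name: stats["average_score"] for name, stats in data["summary"].items()}

# Hypothetical sample in the documented shape:
sample = json.dumps({
    "total_traces": 2,
    "results": [],
    "summary": {
        "Groundedness": {"average_score": 4.5, "min_score": 4.0, "max_score": 5.0},
        "Validity": {"average_score": 3.0, "min_score": 2.0, "max_score": 4.0},
    },
    "log_file": "logs/batch_validation_example.json",
})
# headline_scores(sample) == {"Groundedness": 4.5, "Validity": 3.0}
```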
---

### Step 2: For Each Trace - Semantic Matching (Groundedness Post-Processing)

For EACH trace in the results array, apply semantic matching to Groundedness:

1. Parse the `criteria_results.Groundedness.text_response` and identify all **failed claims** whose `input_transformation_type` is `string_match`.
2. If any such claims are identified:
   - 2.1. **For each claim**, check whether `claim_text` and `source_data` are semantically the same.
     - Two statements are considered **semantically the same** if they talk about the same topics.
     - Two statements are considered **semantically different** if they talk about different topics.
     - If semantically the same → update the claim status to `SUCCESS`.
   - 2.2. Re-count the claims whose status is still `FAILURE`.
   - 2.3. Update `failed_count` with the re-counted number.
   - 2.4. Re-calculate OverallScore as `round(((total number of claims - failed_count) / total number of claims) * 5, 2)`.
3. If no such claims are identified, do nothing.

**This post-processing exists because the deterministic tool does not perform semantic matching.**

When doing this task, first generate a sequence of steps as a plan, then execute it step by step for consistency.
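The recalculation in step 2.4 can be sketched as follows, assuming each claim carries a `status` field as described above (`corrected_overall_score` is an illustrative name, not part of the framework):

```python
def corrected_overall_score(claims: list[dict]) -> float:
    """Recompute the Groundedness OverallScore after semantic matching.

    Assumes any string_match failures judged semantically equivalent
    have already been flipped to 'SUCCESS' in `claims`.
    """
    failed_count = sum(1 for c in claims if c["status"] == "FAILURE")
    total = len(claims)
    return round(((total - failed_count) / total) * 5, 2)
```

For example, four claims with one still failing after semantic matching yields `round((3 / 4) * 5, 2)`, i.e. 3.75.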
---

### Step 3: For Each Trace - Semantic Reason Generation (Groundedness Post-Processing)

For EACH trace in the results array, apply semantic reason generation to Groundedness:

1. Parse the `criteria_results.Groundedness.text_response` and identify all **claims**.
2. If any claims are identified, then **for each claim**:
   - 2.1. If the claim status is `SUCCESS` → generate a brief but complete reason explaining **why it succeeded** (e.g. the claim matches the source data, the value is within acceptable error, etc.) and update the claim's `reason` field with the generated reason.
     - REMEMBER to include full proof details in the reason: the tool-calculated value as well as the actual claimed value.
   - 2.2. If the claim status is `FAILURE` → generate a brief but complete reason explaining **why it failed** (e.g. the claimed value differs from the source data, the error exceeds the threshold, etc.) and update the claim's `reason` field with the generated reason.
     - REMEMBER to include full proof details in the reason: the tool-calculated value as well as the actual claimed value.
3. If no claims are identified, do nothing.

**This ensures every claim has a human-readable, semantically generated reason regardless of its outcome.**

When doing this task, first generate a sequence of steps as a plan, then execute it step by step for consistency.
---

### Step 4: Present Per-Trace Results

For EACH trace, present the results in the following format:

---

## Trace: `<trace_id>`

### Criteria Summary Table (markdown table)

| Criteria | Status | Score |
|---|---|---|
| **Groundedness** | <status> | <score>/5 |
| **Validity** | <status> | <score>/5 |
| **Coherence** | <status> | <score>/5 |
| **Utility** | <status> | <score>/5 |
| **Tool Validation** | <status> | <score>/5 |

Note: **treat 'RIGHT NODE' as 'SUCCESS' and 'WRONG NODE' as 'FAILURE' if present.**

### Per-Criteria Node Results (markdown table)

For **Validity**, **Coherence**, and **Utility**, show a node-level breakdown table:

| Node | Score | Status |
|---|---|---|
| <node_name> | <score> | <status> |

### Individual Criteria Results

#### Groundedness

Generate a summary of the groundedness response detailing strengths and weaknesses.

Then display **ALL claims** in markdown table format with these columns:

| # | Source Tool | Source Data | Input Data | Claim Text | Claimed | Input | Conversion Statement | Calculated | Error | Status | Reason |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <claim_id> | <source_tool_id> | <source_data> | <input_data> | <claim_text> | <claimed_value> <claim_unit> | <input_data> | <input-to-claim conversion statement> | <calculated_claim> <claim_unit> | <error in claim as %> | SUCCESS/FAILURE | <reason> |

Then show a separate **Failed Claims Summary** (markdown table) with only the failed claims:

| # | Claim | Claimed | Source Tool ID | Actual Text | Actual Data | Error | Root Cause |
|---|---|---|---|---|---|---|---|
| <claim_id> | <claim_text> | <claimed_value> | <source_tool_id> | <source_data> | <input_data> | <error %> | <reasoning> |

REMEMBER to generate each value COMPLETELY. DO NOT TRUNCATE.

#### Validity

Generate a summary of the validity response detailing strengths and weaknesses.

#### Coherence

Generate a summary of the coherence response detailing strengths and weaknesses.

#### Utility

Generate a summary of the utility response detailing strengths and weaknesses.

#### Tool Validation

Generate a summary of the tool validation response detailing strengths and weaknesses.

Then display all tool details in markdown table format:

| # | Tool Name | Tool Status |
|---|---|---|
| <id> | <tool_name> | <tool_status> |

REMEMBER to generate each value completely. NO TRUNCATION.

After presenting each trace result, write it to a markdown file inside the report directory. Read `report_dir` from the batch_validate.py JSON output. Use the trace index (1-based) and the first 12 characters of the trace ID for the filename:

```bash
cat > "<report_dir>/trace_<N>_<first_12_chars_of_id>.md" <<'TRACE_EOF'
<full per-trace result output from above>
TRACE_EOF
```
---

### Step 5: Cross-Trace Comprehensive Summary (for all evaluations)

After presenting all individual trace results, generate a comprehensive summary:

#### Overall Score Summary (markdown table)

| Criteria | Average Score | Min | Max | Traces Evaluated |
|---|---|---|---|---|
| **Groundedness** | <avg>/5 | <min>/5 | <max>/5 | <count> |
| **Validity** | <avg>/5 | <min>/5 | <max>/5 | <count> |
| **Coherence** | <avg>/5 | <min>/5 | <max>/5 | <count> |
| **Utility** | <avg>/5 | <min>/5 | <max>/5 | <count> |
| **Tool Validation** | <avg>/5 | <min>/5 | <max>/5 | <count> |

Use the scores AFTER the semantic matching corrections from Step 2, and the reasons AFTER the semantic reason generation from Step 3.
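The avg/min/max columns can be derived per criterion roughly as follows. This is a sketch operating on the corrected per-trace scores; the `criteria_stats` helper is hypothetical, not part of batch_validate.py:

```python
def criteria_stats(scores: list[float]) -> dict:
    """Aggregate one criterion's corrected scores across all traces."""
    return {
        "average": round(sum(scores) / len(scores), 2),
        "min": min(scores),
        "max": max(scores),
        "traces_evaluated": len(scores),
    }
```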
#### Per-Trace Score Breakdown (markdown table)

| Trace ID | Groundedness | Validity | Coherence | Utility | Tool Validation |
|---|---|---|---|---|---|
| <id> | <score>/5 | <score>/5 | <score>/5 | <score>/5 | <score>/5 |

#### Category-Wise Analysis

For EACH category, provide:

- **Common Strengths**: Patterns of success observed across traces
- **Common Weaknesses**: Recurring issues found across traces
- **Recommendations**: Actionable improvements based on the analysis

After generating the overall summary, write it to `SUMMARY.md` inside the report directory:

```bash
cat > "<report_dir>/SUMMARY.md" <<'SUMMARY_MD_EOF'
<full cross-trace summary output from above>
SUMMARY_MD_EOF
```
---

### Step 6: Log Summary to File

Append the comprehensive summary (with semantic matching corrections and semantic reasons) to the log file. Read the `log_file` path from the batch_validate.py output and append:

```bash
cat >> "<log_file_path>" <<'SUMMARY_EOF'

=== COMPREHENSIVE SUMMARY (with semantic matching corrections and semantic reasons) ===
<paste the cross-trace summary and per-trace corrected scores here>
SUMMARY_EOF
```

This ensures the log file contains both the raw API results and the post-processed summary with semantic matching corrections and semantic reasons.