feat(eval): MICODEX intent-layer eval corpus — session 09a · 58/58#68
feat(eval): MICODEX intent-layer eval corpus — session 09a · 58/58#68zkSoju wants to merge 3 commits into
Conversation
Substrate proof for the no-fine-tuning claim from
~/vault/wiki/concepts/micodex-as-knowledge-map.md §6. Triangle
composition: STAMETS forge LEAD + KEEPER source + OSTROM contract +
KRANZ substrate-prep. Per session 09a spec at
~/bonfire/grimoires/bonfire/specs/micodex-eval-corpus-2026-05-02.md.
Result: 58/58 = 100% pass after one iteration round (baseline 91.2%).
Files:
- tests-smoke/eval-corpus.jsonl: 58 cases (43 canonical + 11 empty +
4 motif/concept seed)
- tests-smoke/eval-harness.ts: OSTROM contract runner (exit 0/1/2,
tunable EVAL_GATE, layered failure-hint classification per spec
§2.8 dual-layer-leak)
- tests-smoke/forge-canonical.ts: STAMETS canonical baseline generator
(43 grails × live searchCodex, sets expected = actual top-1)
- tests-smoke/probe-empty.ts: KEEPER empty-expected term verifier
- tests-smoke/EVAL_CONTRACT.md: one-page OSTROM contract doc
- tests-smoke/{canonical,seed-empty-and-motif}.jsonl: corpus parts
- grimoires/loa/qa/qa-cycle-micodex-09a-2026-05-02.md: WITNESS
operator-facing QA checklist (7 surfaces, 3 shareable paths:
MCP deeplink, CLI npm, Discord live trial)
- grimoires/loa/NOTES.md: session 09a Decision Log appended
- package.json + pnpm-lock.yaml: @tobilu/qmd ^2.0.0 → ^2.1.0
(KRANZ act 1 caught local install crash under runtime;
bumped local + lockfile to match working global)
Substrate-truth findings (iteration round, per spec §4.6 Bucket A):
- "underworld grail" → @g6805 Aquarius (Hades-as-Aquarius is canonical
underworld grail per aquarius.md), NOT spec example @g4488
- "the dark grail" → @g6458 Fire (lexical "dark orange" in fire.md),
NOT spec example @g876 Black Hole (says "void presence")
- Empty-expected swaps: dragon/car/crypto → gasoline/smartphone/
refrigerator (probed via probe-empty.ts; "crypto" was TRUE positive
matching Satoshi-as-Hermes since lore explicitly mentions Bitcoin)
Doctrine updates (in operator's vault):
- ~/vault/wiki/concepts/micodex-as-knowledge-map.md §6 evidence tier
promoted asserted → measured
- ~/vault/wiki/concepts/synthetic-supervision-for-knowledge-maps.md
NEW reusable pattern (MapTrace adapted for knowledge maps)
V1.5 deferred: CI integration · Railway HTTP eval (gates session 09b)
· cross-collection eval (codex-core-lore) · acceptable_top_3 field ·
KEEPER real-source pass via WITNESS dogfood (replaces operator-paired
hunches per pivot 2026-05-02).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6eb579170a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| } | ||
|
|
||
| process.exit(passRate >= GATE ? 0 : 1); |
There was a problem hiding this comment.
Return harness error code when qmd execution fails
The contract in tests-smoke/EVAL_CONTRACT.md says infrastructure/harness failures (e.g., qmd missing/crashing) must exit with code 2, but qmd errors are currently converted into normal case failures and the script always exits with only 0/1. This means transient or partial qmd failures can still produce an overall PASS when the gate is met, masking an invalid test run in CI instead of signaling a harness error.
Useful? React with 👍 / 👎.
Post-PR-65 investigation surfaced: 1. Deployed serverInfo.version is "1.1.0" despite source on main being v1.4.0 (PR #65 merged 22:30Z) — Railway never rebuilt successfully. 2. Latest Railway deploy (828bff67 at 23:26Z) failed on node-llama-cpp@3.18.1 postinstall: alpine image lacks git/make/glibc. Fix in PR #69 (Dockerfile node:20-alpine → node:20-slim). 3. Canonical prod URL is https://mcp.0xhoneyjar.xyz/codex/mcp (freeside-mcp-gateway routes to upstream codex MCP), not the raw Railway URL — both work but canonical is what to share. Updates to qa-cycle-micodex-09a-2026-05-02.md: - frontmatter: status 🟡 → 🔴 deploy-blocked; inputs reflect v1.1.0 deployed-vs-1.4.0-source mismatch + PR #69 dependency - Path A (MCP deeplink): regenerated Cursor + VSCode + Claude Desktop configs against canonical mcp.0xhoneyjar.xyz/codex/mcp; added warning about today's v1.1.0 surface (lookup_* works, search_codex doesn't) - NEW S0 (deploy-version verification, PRECONDITION for S1/S3/S5): curl-based MCP initialize + tools/list assertion against canonical gateway. Expected: serverInfo.version "1.4.0" + tools/list contains search_codex. Today's failure mode (v1.1.0 + 8 tools) documented in triage with pointer to PR #69 root-cause. - S3 marked GATED ON S0=v1.4.0 - STOP-MERGE block adds the v1.1.0 detection gate S2 (CLI), S4 (KEEPER source via Cursor), S6 (refusal cadence), and S7 (Discord bot deploy) work against today's v1.1.0 surface and are unblocked. S1, S3, S5 are gated on PR #69 + redeploy. Hamilton discipline: trust the deployment, not the diff. The diff says v1.4.0; the deployment says v1.1.0; only the deployment counts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tall (#69) * fix(docker): node:20-alpine → node:20-slim for node-llama-cpp postinstall Railway build failure (deployment 828bff67-6f3f-4f34-9dbe-227dcdadaf99 at 2026-05-02T23:26:24Z): pnpm install --frozen-lockfile crashes on node-llama-cpp@3.18.1 postinstall. Postinstall tries to clone+build llama.cpp from source when no prebuilt binary matches the target. Alpine fails on three counts: - no glibc (prebuilt binaries cannot load — node-llama-cpp explicitly warns "the prebuilt binaries cannot be used in this Linux distro, as `glibc` is not detected") - no git in default image (clone fails: "Git is not installed, please install it first to build llama.cpp") - no make in default image ("It seems that 'make' is not installed in your system. Install it to resolve build issues") node-llama-cpp's own troubleshooting guidance (printed in the failing build log): "If you're trying to run this inside of a Docker container, consider using 'node:20' image." Switch both stages to node:20-slim (Debian slim has glibc; ~70MB vs alpine ~50MB — small image-size cost for working build). Add git + make + python3 + g++ + ca-certificates to the builder stage only; runtime stage stays slim. Multi-stage discipline preserved. Verifies post-merge: Railway auto-deploys on main → curl mcp.0xhoneyjar.xyz/codex/mcp initialize → expect serverInfo.version "1.4.0" + tools/list includes search_codex. Bug: PR #65 (v1.4.0) was merged 2026-05-02T22:30:53Z but Railway never auto-deployed; deployed serverInfo.version is still 1.1.0 (caught via WITNESS S0 deploy verification on PR #68 sister branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(docker): add libstdc++6 + libgomp1 to runtime stage (bridgebuilder F1) Bridgebuilder review 2026-05-02 caught a silent-runtime-failure: the compiled node-llama-cpp .node addon links against libstdc++ + libgomp in the builder stage, then dlopens them lazily on first require at runtime. node:20-slim ships glibc but NOT libstdc++6 or libgomp1 by default; without them, the container boots fine and crashes on first inference call. Net: container appears healthy in Railway logs until traffic arrives, then 💥. The classic "build green, runtime red" failure mode that escapes static review. Add libstdc++6 + libgomp1 + ca-certificates to runtime stage. Builder stage already had build-time toolchain (git/make/python3/g++) — those stay scoped to builder per F5 (security: don't ship compilers in runtime). Image-size cost note added to builder comment per F2. Bridgebuilder F3 (digest-pin base image) deferred — separate-scope follow-up; would require digest lookup + ongoing dependabot-style maintenance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zkSoju
left a comment
There was a problem hiding this comment.
Summary
Analytical review of #68. Enrichment pass was unavailable; findings are unenriched.
Findings
{
"schema_version": 1,
"findings": [
{
"id": "eval-harness-jsonl-comment-parse",
"title": "JSONL comment-line filter is fragile",
"severity": "LOW",
"category": "robustness",
"file": "tests-smoke/eval-harness.ts",
"description": "The corpus parser filters lines with `!l.startsWith('//')` to permit comment lines, but JSONL is not specified to support comments. A line beginning with whitespace before `//` would still be passed to JSON.parse and crash the harness with a non-actionable error. Trim before filter, or drop comment support entirely.",
"suggestion": "Use `.map(l => l.trim()).filter(l => l.length > 0 && !l.startsWith('//'))` — or remove comment support since the current corpus contains none.",
"confidence": 0.7
},
{
"id": "eval-harness-deferred-pass-counted",
"title": "Deferred passing cases are excluded from totals",
"severity": "LOW",
"category": "correctness",
"file": "tests-smoke/eval-harness.ts",
"description": "Per-category accumulator increments `deferred` but never `pass`/`total` for deferred cases (lines ~138-145). The progress bar still prints `.` or `F` for deferred cases, but they are silently excluded from gate calculation. This matches the EVAL_CONTRACT spec ('exclude from gate calculation'), so it is intentional — but the report does not visibly indicate whether deferred cases passed or failed, only their count.",
"suggestion": "Consider printing a separate '(deferred: N pass / M fail)' breakdown so deferred regressions are visible to humans even though they don't gate CI.",
"confidence": 0.6
},
{
"id": "eval-harness-exit-code-edge",
"title": "Empty eval set with all-deferred corpus returns pass",
"severity": "LOW",
"category": "correctness",
"file": "tests-smoke/eval-harness.ts",
"description": "If every case is `deferred: true`, `totalEval === 0` and `passRate` defaults to 0, so `passRate >= GATE` is false (assuming GATE > 0) and harness exits 1 — but the failure message reports 0/0 which is confusing. Conversely if GATE is set to 0, an all-deferred corpus exits 0 with no real signal.",
"suggestion": "Add explicit guard: if totalEval === 0, exit 2 with 'no non-deferred cases to gate against'.",
"confidence": 0.6
},
{
"id": "forge-canonical-min-score-floor",
"title": "Floor calculation can produce min_score=0 silently",
"severity": "LOW",
"category": "robustness",
"file": "tests-smoke/forge-canonical.ts",
"description": "`Math.max(0, Math.floor((actualScore - 0.1) * 10) / 10)` produces a min_score of 0 whenever actualScore < 0.1, which would make any future score change pass the gate. Mijedi already gets min_score=0.4 (from actualScore likely ~0.5). Without a hard lower bound or warning, weak-substrate cases get baked into the corpus with no real assertion strength.",
"suggestion": "Either require min_score >= some floor (e.g. 0.3) and emit a forge warning, or annotate the JSONL with a 'weak-anchor' marker for review.",
"confidence": 0.55
},
{
"id": "pnpm-lock-downgrade",
"title": "Transitive deps downgraded (better-sqlite3, yaml, zod)",
"severity": "MEDIUM",
"category": "supply-chain",
"file": "pnpm-lock.yaml",
"description": "@tobilu/qmd 2.0.0 → 2.1.0 changes pinned transitives: better-sqlite3 12.9.0 → 12.8.0, yaml 2.8.4 → 2.8.3, zod 4.4.2 → 4.2.1. These are downgrades (lower minor/patch), which is unusual when bumping a primary dep. Likely qmd 2.1.0 added the tree-sitter optionalDependencies and pinned older versions for compat. Worth verifying no security advisories or bugfixes are lost.",
"suggestion": "Confirm the downgrades are intentional (qmd 2.1.0 release notes) and check `pnpm audit` for any advisories on yaml@2.8.3 / zod@4.2.1.",
"confidence": 0.65
},
{
"id": "eval-corpus-duplicates-seed-files",
"title": "Three corpus files duplicate identical content",
"severity": "LOW",
"category": "maintainability",
"file": "tests-smoke/eval-corpus.jsonl",
"description": "`eval-corpus.jsonl` is the concatenation of `canonical.jsonl` + `seed-empty-and-motif.jsonl` (per EVAL_CONTRACT.md forge instructions). Committing all three creates drift risk: any manual edit to `eval-corpus.jsonl` will be overwritten by the next `cat`-based regeneration, and any edit to a seed without regenerating leaves the corpus stale.",
"suggestion": "Either commit only the seeds and generate `eval-corpus.jsonl` in CI, or commit only `eval-corpus.jsonl` and treat seeds as historical/forge-only artifacts. Document which is authoritative.",
"confidence": 0.8
},
{
"id": "harness-forge-shared-min-score",
"title": "min_score derivation is hardcoded in two places",
"severity": "LOW",
"category": "maintainability",
"file": "tests-smoke/forge-canonical.ts",
"description": "The forge writes `min_score = floor((actualScore - 0.1) * 10) / 10` and the harness asserts `top.score >= c.min_score`. The 0.1 buffer is the only flake-tolerance budget for canonical cases. If the substrate's scores drift slightly (qmd version bump, embedding model change), all 42 canonical cases could fail simultaneously with no clear remediation path other than re-forging.",
"suggestion": "Document the 0.1 buffer as a tunable in EVAL_CONTRACT.md, and consider adding a `pnpm eval:reforge` script that updates min_scores when intentional substrate changes occur.",
"confidence": 0.5
},
{
"id": "praise-failure-layer-classification",
"title": "Failure-layer hints add real diagnostic value",
"severity": "PRAISE",
"category": "design",
"file": "tests-smoke/eval-harness.ts",
"description": "The `failureLayerHint()` taxonomy (INFRASTRUCTURE / SUBSTRATE-FP / SUBSTRATE-FN / INTERFACE-or-SUBSTRATE / THRESHOLD) gives operators an actionable starting point for triage rather than just 'test failed'. The contract document also explicitly maps each hint to a remediation bucket (case-wrong / substrate-thin / inherent), which is the kind of structure that prevents debug spirals."
},
{
"id": "praise-keeper-source-provenance",
"title": "keeper_source field captures provenance per case",
"severity": "PRAISE",
"category": "design",
"file": "tests-smoke/eval-corpus.jsonl",
"description": "Each corpus case carries a `keeper_source` describing why it exists, including notes like 'replaces \"car\" which substrate matched Satanist @0.88 (vec/HyDE quirk)'. This makes the corpus self-documenting and prevents the common eval-rot pattern where nobody remembers why a test exists or was changed."
},
{
"id": "gate-history-baseline-low",
"title": "EVAL_GATE default 0.85 vs current 100% gives wide regression band",
"severity": "SPECULATION",
"category": "policy",
"file": "tests-smoke/EVAL_CONTRACT.md",
"description": "Gate default is 0.85 but the current corpus passes at 100%. A regression to 86% (8 of 58 cases breaking silently) would still pass CI. Consider a tighter gate or a separate 'regression band' (alert if drop > 5pp from previous run).",
"suggestion": "Set EVAL_GATE to a value closer to current pass rate (e.g. 0.95) or implement a delta-aware gate that compares against the last recorded pass rate in the gate-history table.",
"confidence": 0.5
}
]
}Callouts
Enrichment unavailable for this review.
Operator critique 2026-05-02: "qa test looks very basic." Verification
scenarios (S0-S7) gate correctness but don't show what's WORTH showing.
This commit adds:
1. Capability-landscape preamble (before scenarios): articulates what's
NEW post-09a — 9th tool search_codex live, substrate measured 58/58,
3 distribution paths, doctrine promotion, runtime Docker fixes,
substrate proven RICHER than operator hunches, KEEPER real-user
loop replacing operator-paired hunches. Names what UNLOCKS for
users not just devs.
2. Three showcase scenarios (SC1-SC3) — narrative walkthroughs
demonstrating the felt outcomes the verification scenarios merely
verify:
- SC1 — substrate-discovery moment (Gumi-style 4-step navigation
through transformation/cypherpunk/relate/underworld; captures
the SUBSTRATE-RICHER-THAN-OPERATOR moment as REPRODUCIBLE in
prod)
- SC2 — cross-character mirror (same intent through ruggy + satoshi;
proves substrate decoupled from voice)
- SC3 — anti-hallucination cadence (ask for grails that don't exist;
proves the no-fine-tuning doctrine claim holds at user-facing layer)
WITNESS doctrine ≤7 surfaces preserved for verification (S0-S7);
showcase scenarios are an explicit second register for capability-
demo audiences (operator showing investor, Gumi showing community
member, KEEPER capturing real-user friction). Per WITNESS Hamilton
discipline: trust the deployment, but ALSO show why the deployment
matters.
Triage paths in SC1-SC3 named in felt-outcomes (substrate not called,
voice loses, anti-hallucination broken) not just pass/fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Session 09a — MICODEX intent-layer eval corpus. Substrate proof for the no-fine-tuning claim from
~/vault/wiki/concepts/micodex-as-knowledge-map.md§6.Result: 58/58 = 100% pass (one iteration round; baseline was 91.2% on first run).
Triangle composition per session 09a spec at
~/bonfire/grimoires/bonfire/specs/micodex-eval-corpus-2026-05-02.md: STAMETS forge LEAD + KEEPER source + OSTROM contract + KRANZ substrate-prep.What ships
tests-smoke/eval-harness.ts) — CI-runnable; exits 0/1/2 per OSTROM contract;EVAL_GATEenv tunable; layered failure-hint classification per spec §2.8 dual-layer-leaktests-smoke/forge-canonical.ts) — autonomous live-search forge across all 43 grails@tobilu/qmd^2.0.0→^2.1.0(caught local install crash withSQLiteError: no such module: vec0; bumped local + lockfile to match working global)grimoires/loa/qa/qa-cycle-micodex-09a-2026-05-02.md) — 7 surfaces · 3 shareable paths (MCP deeplink, CLI npm, Discord live trial) for dogfooding the substrate in real environmentsmicodex-as-knowledge-map.md§6 evidence tier promoted asserted → measured; NEW pagesynthetic-supervision-for-knowledge-maps.mdcrystallizes the reusable patternSubstrate-truth findings (iteration round, per spec §4.6 Bucket A)
Initial baseline had 5 failures. Investigation showed substrate was right and spec-examples were aspirational:
"underworld grail"→@g6805Aquarius (aquarius.mdexplicitly: "Hades is the Greek god of the underworld") — NOT spec's@g4488Satoshi-as-Hermes (psychopompia is also valid; V1.5acceptable_top_3candidate)"the dark grail"→@g6458Fire (lexical "dark orange" infire.md) — NOT spec's@g876Black Hole (says "void presence" not "dark"; V1.5 substrate-iteration: extend black-hole.md anchors)"crypto"matched@g4488Satoshi-as-Hermes legitimately — TRUE positive, lore mentions Bitcoin → confirms substrate richness"dragon"→Mongolian,"car"→Satanist (no lexical anchor; documented for V1.5)Cases amended to substrate truth.
Test plan
pnpm exec tsx tests-smoke/eval-harness.ts tests-smoke/eval-corpus.jsonl→ exit 0, "OVERALL: 58/58 = 100.0%"pnpm exec tsx tests-smoke/forge-canonical.ts > /tmp/canonical-fresh.jsonl→ 43 cases, 0 gaps (regenerates baseline cleanly)EVAL_GATE=0.95 pnpm exec tsx tests-smoke/eval-harness.ts tests-smoke/eval-corpus.jsonl→ exit 0 (gate-tunable)grimoires/loa/qa/captures/micodex-09a/(post-merge dogfood)V1.5 deferred
codex-core-lore)acceptable_top_3field (handles multi-valid-answer like "underworld")🤖 Generated with Claude Code