feat(eval): MICODEX intent-layer eval corpus — session 09a · 58/58 by zkSoju · Pull Request #68 · 0xHoneyJar/construct-mibera-codex

zkSoju · 2026-05-02T23:19:12Z

Summary

Session 09a — MICODEX intent-layer eval corpus. Substrate proof for the no-fine-tuning claim from ~/vault/wiki/concepts/micodex-as-knowledge-map.md §6.

Result: 58/58 = 100% pass (one iteration round; baseline was 91.2% on first run).

Triangle composition per session 09a spec at ~/bonfire/grimoires/bonfire/specs/micodex-eval-corpus-2026-05-02.md: STAMETS forge LEAD + KEEPER source + OSTROM contract + KRANZ substrate-prep.

What ships

58-case eval corpus (43 canonical + 11 empty-expected refusals + 4 motif/concept seed cases)
Headless harness (tests-smoke/eval-harness.ts) — CI-runnable; exits 0/1/2 per OSTROM contract; EVAL_GATE env tunable; layered failure-hint classification per spec §2.8 dual-layer-leak
STAMETS canonical baseline generator (tests-smoke/forge-canonical.ts) — autonomous live-search forge across all 43 grails
KRANZ act 1 substrate-prep runtime fix: @tobilu/qmd ^2.0.0 → ^2.1.0 (caught local install crash with SQLiteError: no such module: vec0; bumped local + lockfile to match working global)
WITNESS QA checklist (grimoires/loa/qa/qa-cycle-micodex-09a-2026-05-02.md) — 7 surfaces · 3 shareable paths (MCP deeplink, CLI npm, Discord live trial) for dogfooding the substrate in real environments
Doctrine (in operator's vault): micodex-as-knowledge-map.md §6 evidence tier promoted asserted → measured; NEW page synthetic-supervision-for-knowledge-maps.md crystallizes the reusable pattern
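
For readers unfamiliar with the harness, the per-case check and the layered failure hints can be pictured roughly as below. This is a minimal sketch, not the code in tests-smoke/eval-harness.ts: the hint names come from the review of this PR, `min_score` and `keeper_source` are field names mentioned there, and everything else (the `EvalCase` and `SearchHit` shapes, the `expected: null` encoding for refusal cases, and the exact rule that maps each outcome to a hint) is an assumption made for illustration.

```typescript
// Hypothetical case shape. Only min_score and keeper_source are named in the PR;
// "query" and a nullable "expected" are assumptions for this sketch.
type EvalCase = {
  query: string;
  expected: string | null; // grail id, or null for an empty-expected refusal case
  min_score: number;
  keeper_source: string;
};

// Hypothetical search result shape: ranked hits, best first.
type SearchHit = { id: string; score: number };

type Hint =
  | "PASS"
  | "INFRASTRUCTURE"
  | "SUBSTRATE-FP"
  | "SUBSTRATE-FN"
  | "INTERFACE-or-SUBSTRATE"
  | "THRESHOLD";

// Assumed mapping from outcome to hint; the real taxonomy rules may differ.
function classify(c: EvalCase, hits: SearchHit[] | Error): Hint {
  if (hits instanceof Error) return "INFRASTRUCTURE"; // search backend failed
  const top = hits.length > 0 ? hits[0] : undefined;
  if (c.expected === null) {
    // Refusal case: any hit at all counts as a false positive.
    return top === undefined ? "PASS" : "SUBSTRATE-FP";
  }
  if (top === undefined) return "SUBSTRATE-FN"; // expected a grail, got nothing
  if (top.id !== c.expected) return "INTERFACE-or-SUBSTRATE"; // wrong top-1
  return top.score >= c.min_score ? "PASS" : "THRESHOLD"; // right grail, weak score
}
```

Keeping the classification a pure function of the case and the search outcome makes it easy to unit-test separately from the live substrate.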

Substrate-truth findings (iteration round, per spec §4.6 Bucket A)

Initial baseline had 5 failures. Investigation showed substrate was right and spec-examples were aspirational:

"underworld grail" → @g6805 Aquarius (aquarius.md explicitly: "Hades is the Greek god of the underworld") — NOT spec's @g4488 Satoshi-as-Hermes (psychopompia is also valid; V1.5 acceptable_top_3 candidate)
"the dark grail" → @g6458 Fire (lexical "dark orange" in fire.md) — NOT spec's @g876 Black Hole (says "void presence" not "dark"; V1.5 substrate-iteration: extend black-hole.md anchors)
"crypto" matched @g4488 Satoshi-as-Hermes legitimately — TRUE positive, lore mentions Bitcoin → confirms substrate richness
2 vec/HyDE quirks: "dragon"→Mongolian, "car"→Satanist (no lexical anchor; documented for V1.5)

Cases amended to substrate truth.

Test plan

pnpm exec tsx tests-smoke/eval-harness.ts tests-smoke/eval-corpus.jsonl → exit 0, "OVERALL: 58/58 = 100.0%"
pnpm exec tsx tests-smoke/forge-canonical.ts > /tmp/canonical-fresh.jsonl → 43 cases, 0 gaps (regenerates baseline cleanly)
EVAL_GATE=0.95 pnpm exec tsx tests-smoke/eval-harness.ts tests-smoke/eval-corpus.jsonl → exit 0 (gate-tunable)
WITNESS QA checklist S1-S7 captures landed at grimoires/loa/qa/captures/micodex-09a/ (post-merge dogfood)
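
As a rough illustration of the gate-tunable behaviour in the test plan, the sketch below reads `EVAL_GATE` from an environment-like object and formats the summary line. The 0.85 default is the value cited in the review of EVAL_CONTRACT.md; the function names `parseGate` and `formatOverall` and the validation rules are assumptions, not the harness's actual code.

```typescript
// Default gate as reported in the review of EVAL_CONTRACT.md.
const DEFAULT_GATE = 0.85;

// Hypothetical helper: read EVAL_GATE from an env-like map and validate it.
function parseGate(env: Record<string, string | undefined>): number {
  const raw = env["EVAL_GATE"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_GATE;
  const gate = Number(raw);
  if (!Number.isFinite(gate) || gate < 0 || gate > 1) {
    throw new RangeError(`EVAL_GATE must be a number in [0, 1], got "${raw}"`);
  }
  return gate;
}

// Hypothetical helper: produce the summary line quoted in the test plan.
function formatOverall(pass: number, total: number): string {
  const pct = total === 0 ? 0 : (pass / total) * 100;
  return `OVERALL: ${pass}/${total} = ${pct.toFixed(1)}%`;
}

// Usage (in a Node script one would pass process.env):
//   const gate = parseGate({ EVAL_GATE: "0.95" });
//   console.log(formatOverall(58, 58));
```

Rejecting a malformed `EVAL_GATE` outright, rather than silently falling back to the default, avoids a CI run that appears gated at one threshold while actually using another.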

V1.5 deferred

Auto-eval on PR (CI integration of harness)
Eval against Railway HTTP transport (gates session 09b integration QA)
Cross-collection eval (codex-core-lore)
acceptable_top_3 field (handles multi-valid-answer like "underworld")
KEEPER real-source pass via WITNESS dogfood path (replaces operator-paired hunches per 2026-05-02 pivot)
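
One possible reading of the deferred acceptable_top_3 field is sketched below: a case passes when the expected grail, or any listed alternate, appears among the first three results. The PR does not define the field's semantics, so the `MultiCase` shape, the `passesTop3` name, and this pass rule are all assumptions.

```typescript
// Hypothetical case shape for the deferred V1.5 field.
type MultiCase = {
  query: string;
  expected: string;            // primary answer
  acceptable_top_3?: string[]; // assumed: alternates that are also valid
};

// Assumed rule: pass if any acceptable id shows up in the top three ranked ids.
function passesTop3(c: MultiCase, rankedIds: string[]): boolean {
  const acceptable = new Set<string>([c.expected, ...(c.acceptable_top_3 ?? [])]);
  return rankedIds.slice(0, 3).some((id) => acceptable.has(id));
}

// Example using the ids from the "underworld" finding:
const underworld: MultiCase = {
  query: "underworld grail",
  expected: "g6805",
  acceptable_top_3: ["g4488"],
};
const ok = passesTop3(underworld, ["g4488", "g876", "g6458"]);
```

A stricter variant would require the top-1 result itself to be in the acceptable set; which rule fits depends on how the spec ends up defining the field.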

🤖 Generated with Claude Code

@g876

Substrate proof for the no-fine-tuning claim from ~/vault/wiki/concepts/micodex-as-knowledge-map.md §6.

Triangle composition: STAMETS forge LEAD + KEEPER source + OSTROM contract + KRANZ substrate-prep. Per session 09a spec at ~/bonfire/grimoires/bonfire/specs/micodex-eval-corpus-2026-05-02.md.

Result: 58/58 = 100% pass after one iteration round (baseline 91.2%).

Files:
- tests-smoke/eval-corpus.jsonl: 58 cases (43 canonical + 11 empty + 4 motif/concept seed)
- tests-smoke/eval-harness.ts: OSTROM contract runner (exit 0/1/2, tunable EVAL_GATE, layered failure-hint classification per spec §2.8 dual-layer-leak)
- tests-smoke/forge-canonical.ts: STAMETS canonical baseline generator (43 grails × live searchCodex, sets expected = actual top-1)
- tests-smoke/probe-empty.ts: KEEPER empty-expected term verifier
- tests-smoke/EVAL_CONTRACT.md: one-page OSTROM contract doc
- tests-smoke/{canonical,seed-empty-and-motif}.jsonl: corpus parts
- grimoires/loa/qa/qa-cycle-micodex-09a-2026-05-02.md: WITNESS operator-facing QA checklist (7 surfaces, 3 shareable paths: MCP deeplink, CLI npm, Discord live trial)
- grimoires/loa/NOTES.md: session 09a Decision Log appended
- package.json + pnpm-lock.yaml: @tobilu/qmd ^2.0.0 → ^2.1.0 (KRANZ act 1 caught local install crash under runtime; bumped local + lockfile to match working global)

Substrate-truth findings (iteration round, per spec §4.6 Bucket A):
- "underworld grail" → @g6805 Aquarius (Hades-as-Aquarius is canonical underworld grail per aquarius.md), NOT spec example @g4488
- "the dark grail" → @g6458 Fire (lexical "dark orange" in fire.md), NOT spec example @g876 Black Hole (says "void presence")
- Empty-expected swaps: dragon/car/crypto → gasoline/smartphone/refrigerator (probed via probe-empty.ts; "crypto" was TRUE positive matching Satoshi-as-Hermes since lore explicitly mentions Bitcoin)

Doctrine updates (in operator's vault):
- ~/vault/wiki/concepts/micodex-as-knowledge-map.md §6 evidence tier promoted asserted → measured
- ~/vault/wiki/concepts/synthetic-supervision-for-knowledge-maps.md NEW reusable pattern (MapTrace adapted for knowledge maps)

V1.5 deferred: CI integration · Railway HTTP eval (gates session 09b) · cross-collection eval (codex-core-lore) · acceptable_top_3 field · KEEPER real-source pass via WITNESS dogfood (replaces operator-paired hunches per pivot 2026-05-02).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-05-02T23:19:14Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
docs	Ready	Preview, Comment	May 2, 2026 11:54pm

socket-security · 2026-05-02T23:19:52Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	@tobilu/qmd@2.0.0 ⏵ 2.1.0	^-1			^-1

View full report

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6eb579170a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-02T23:22:20Z

+  }
+}
+
+process.exit(passRate >= GATE ? 0 : 1);


Return harness error code when qmd execution fails

The contract in tests-smoke/EVAL_CONTRACT.md says infrastructure/harness failures (e.g., qmd missing/crashing) must exit with code 2, but qmd errors are currently converted into normal case failures and the script always exits with only 0/1. This means transient or partial qmd failures can still produce an overall PASS when the gate is met, masking an invalid test run in CI instead of signaling a harness error.
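
A minimal sketch of the exit-code decision this comment asks for, assuming the harness can count backend failures separately from ordinary case failures. The `RunSummary` shape and the `exitCodeFor` name are hypothetical; only the 0/1/2 meanings come from the contract as described in the PR.

```typescript
// Hypothetical summary the harness would accumulate while running the corpus.
type RunSummary = {
  pass: number;          // non-deferred cases that passed
  evaluated: number;     // non-deferred cases that produced a verdict
  harnessErrors: number; // cases where the search backend itself failed (e.g. qmd crash)
};

// 0 = gate met, 1 = gate missed, 2 = the run itself is not trustworthy.
function exitCodeFor(s: RunSummary, gate: number): 0 | 1 | 2 {
  if (s.harnessErrors > 0) return 2; // infrastructure failure invalidates the run
  if (s.evaluated === 0) return 2;   // nothing to gate against
  return s.pass / s.evaluated >= gate ? 0 : 1;
}

// In the harness this would replace the final line, roughly:
//   process.exit(exitCodeFor(summary, GATE));
```

Checking `harnessErrors` before computing the pass rate is the point of the design: a partially failed run can no longer report PASS just because the surviving cases clear the gate.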


Post-PR-65 investigation surfaced:
1. Deployed serverInfo.version is "1.1.0" despite source on main being v1.4.0 (PR #65 merged 22:30Z); Railway never rebuilt successfully.
2. Latest Railway deploy (828bff67 at 23:26Z) failed on node-llama-cpp@3.18.1 postinstall: alpine image lacks git/make/glibc. Fix in PR #69 (Dockerfile node:20-alpine → node:20-slim).
3. Canonical prod URL is https://mcp.0xhoneyjar.xyz/codex/mcp (freeside-mcp-gateway routes to upstream codex MCP), not the raw Railway URL. Both work, but canonical is what to share.

Updates to qa-cycle-micodex-09a-2026-05-02.md:
- frontmatter: status 🟡 → 🔴 deploy-blocked; inputs reflect v1.1.0 deployed-vs-1.4.0-source mismatch + PR #69 dependency
- Path A (MCP deeplink): regenerated Cursor + VSCode + Claude Desktop configs against canonical mcp.0xhoneyjar.xyz/codex/mcp; added warning about today's v1.1.0 surface (lookup_* works, search_codex doesn't)
- NEW S0 (deploy-version verification, PRECONDITION for S1/S3/S5): curl-based MCP initialize + tools/list assertion against canonical gateway. Expected: serverInfo.version "1.4.0" + tools/list contains search_codex. Today's failure mode (v1.1.0 + 8 tools) documented in triage with pointer to PR #69 root-cause.
- S3 marked GATED ON S0=v1.4.0
- STOP-MERGE block adds the v1.1.0 detection gate

S2 (CLI), S4 (KEEPER source via Cursor), S6 (refusal cadence), and S7 (Discord bot deploy) work against today's v1.1.0 surface and are unblocked. S1, S3, S5 are gated on PR #69 + redeploy.

Hamilton discipline: trust the deployment, not the diff. The diff says v1.4.0; the deployment says v1.1.0; only the deployment counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
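
The S0 precondition in this commit (serverInfo.version "1.4.0" and search_codex present in tools/list) can be expressed as a pure check over responses that have already been fetched. The sketch below is illustrative only: the checklist itself uses curl, the `verifyDeploy` name is hypothetical, and the result shapes are a trimmed-down assumption based on the MCP `initialize` and `tools/list` results.

```typescript
// Trimmed, assumed shapes of the two MCP results that S0 inspects.
type InitializeResult = { serverInfo: { name: string; version: string } };
type ToolsListResult = { tools: { name: string }[] };

// Returns a list of human-readable problems; an empty list means S0 passes.
function verifyDeploy(
  init: InitializeResult,
  list: ToolsListResult,
  wantVersion: string = "1.4.0",
  wantTool: string = "search_codex",
): string[] {
  const problems: string[] = [];
  if (init.serverInfo.version !== wantVersion) {
    problems.push(
      `serverInfo.version is "${init.serverInfo.version}", expected "${wantVersion}"`,
    );
  }
  if (!list.tools.some((t) => t.name === wantTool)) {
    problems.push(`tools/list is missing ${wantTool} (${list.tools.length} tools listed)`);
  }
  return problems;
}
```

Returning every problem rather than stopping at the first one keeps the triage note complete: the failure mode described above (old version and a missing tool) would surface as two entries.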

…tall (#69)

* fix(docker): node:20-alpine → node:20-slim for node-llama-cpp postinstall

Railway build failure (deployment 828bff67-6f3f-4f34-9dbe-227dcdadaf99 at 2026-05-02T23:26:24Z): pnpm install --frozen-lockfile crashes on node-llama-cpp@3.18.1 postinstall. Postinstall tries to clone+build llama.cpp from source when no prebuilt binary matches the target. Alpine fails on three counts:
- no glibc (prebuilt binaries cannot load; node-llama-cpp explicitly warns "the prebuilt binaries cannot be used in this Linux distro, as `glibc` is not detected")
- no git in default image (clone fails: "Git is not installed, please install it first to build llama.cpp")
- no make in default image ("It seems that 'make' is not installed in your system. Install it to resolve build issues")

node-llama-cpp's own troubleshooting guidance (printed in the failing build log): "If you're trying to run this inside of a Docker container, consider using 'node:20' image."

Switch both stages to node:20-slim (Debian slim has glibc; ~70MB vs alpine ~50MB, a small image-size cost for a working build). Add git + make + python3 + g++ + ca-certificates to the builder stage only; runtime stage stays slim. Multi-stage discipline preserved.

Verifies post-merge: Railway auto-deploys on main → curl mcp.0xhoneyjar.xyz/codex/mcp initialize → expect serverInfo.version "1.4.0" + tools/list includes search_codex.

Bug: PR #65 (v1.4.0) was merged 2026-05-02T22:30:53Z but Railway never auto-deployed; deployed serverInfo.version is still 1.1.0 (caught via WITNESS S0 deploy verification on PR #68 sister branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): add libstdc++6 + libgomp1 to runtime stage (bridgebuilder F1)

Bridgebuilder review 2026-05-02 caught a silent-runtime-failure: the compiled node-llama-cpp .node addon links against libstdc++ + libgomp in the builder stage, then dlopens them lazily on first require at runtime. node:20-slim ships glibc but NOT libstdc++6 or libgomp1 by default; without them, the container boots fine and crashes on first inference call. Net: container appears healthy in Railway logs until traffic arrives, then 💥. The classic "build green, runtime red" failure mode that escapes static review.

Add libstdc++6 + libgomp1 + ca-certificates to runtime stage. Builder stage already had build-time toolchain (git/make/python3/g++); those stay scoped to builder per F5 (security: don't ship compilers in runtime). Image-size cost note added to builder comment per F2.

Bridgebuilder F3 (digest-pin base image) deferred as a separate-scope follow-up; would require digest lookup + ongoing dependabot-style maintenance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

zkSoju

Summary

Analytical review of #68. Enrichment pass was unavailable; findings are unenriched.

Findings

{
  "schema_version": 1,
  "findings": [
    {
      "id": "eval-harness-jsonl-comment-parse",
      "title": "JSONL comment-line filter is fragile",
      "severity": "LOW",
      "category": "robustness",
      "file": "tests-smoke/eval-harness.ts",
      "description": "The corpus parser filters lines with `!l.startsWith('//')` to permit comment lines, but JSONL is not specified to support comments. A line beginning with whitespace before `//` would still be passed to JSON.parse and crash the harness with a non-actionable error. Trim before filter, or drop comment support entirely.",
      "suggestion": "Use `.map(l => l.trim()).filter(l => l.length > 0 && !l.startsWith('//'))` — or remove comment support since the current corpus contains none.",
      "confidence": 0.7
    },
    {
      "id": "eval-harness-deferred-pass-counted",
      "title": "Deferred passing cases are excluded from totals",
      "severity": "LOW",
      "category": "correctness",
      "file": "tests-smoke/eval-harness.ts",
      "description": "Per-category accumulator increments `deferred` but never `pass`/`total` for deferred cases (lines ~138-145). The progress bar still prints `.` or `F` for deferred cases, but they are silently excluded from gate calculation. This matches the EVAL_CONTRACT spec ('exclude from gate calculation'), so it is intentional — but the report does not visibly indicate whether deferred cases passed or failed, only their count.",
      "suggestion": "Consider printing a separate '(deferred: N pass / M fail)' breakdown so deferred regressions are visible to humans even though they don't gate CI.",
      "confidence": 0.6
    },
    {
      "id": "eval-harness-exit-code-edge",
      "title": "Empty eval set with all-deferred corpus returns pass",
      "severity": "LOW",
      "category": "correctness",
      "file": "tests-smoke/eval-harness.ts",
      "description": "If every case is `deferred: true`, `totalEval === 0` and `passRate` defaults to 0, so `passRate >= GATE` is false (assuming GATE > 0) and harness exits 1 — but the failure message reports 0/0 which is confusing. Conversely if GATE is set to 0, an all-deferred corpus exits 0 with no real signal.",
      "suggestion": "Add explicit guard: if totalEval === 0, exit 2 with 'no non-deferred cases to gate against'.",
      "confidence": 0.6
    },
    {
      "id": "forge-canonical-min-score-floor",
      "title": "Floor calculation can produce min_score=0 silently",
      "severity": "LOW",
      "category": "robustness",
      "file": "tests-smoke/forge-canonical.ts",
      "description": "`Math.max(0, Math.floor((actualScore - 0.1) * 10) / 10)` produces a min_score of 0 whenever actualScore < 0.2 (the 0.1 buffer is subtracted before flooring to one decimal place), which would let any future score for that case pass the threshold check. Mijedi already gets min_score=0.4 (from actualScore likely ~0.5). Without a hard lower bound or warning, weak-substrate cases get baked into the corpus with no real assertion strength.",
      "suggestion": "Either require min_score >= some floor (e.g. 0.3) and emit a forge warning, or annotate the JSONL with a 'weak-anchor' marker for review.",
      "confidence": 0.55
    },
    {
      "id": "pnpm-lock-downgrade",
      "title": "Transitive deps downgraded (better-sqlite3, yaml, zod)",
      "severity": "MEDIUM",
      "category": "supply-chain",
      "file": "pnpm-lock.yaml",
      "description": "@tobilu/qmd 2.0.0 → 2.1.0 changes pinned transitives: better-sqlite3 12.9.0 → 12.8.0, yaml 2.8.4 → 2.8.3, zod 4.4.2 → 4.2.1. These are downgrades (lower minor/patch), which is unusual when bumping a primary dep. Likely qmd 2.1.0 added the tree-sitter optionalDependencies and pinned older versions for compat. Worth verifying no security advisories or bugfixes are lost.",
      "suggestion": "Confirm the downgrades are intentional (qmd 2.1.0 release notes) and check `pnpm audit` for any advisories on yaml@2.8.3 / zod@4.2.1.",
      "confidence": 0.65
    },
    {
      "id": "eval-corpus-duplicates-seed-files",
      "title": "Three corpus files duplicate identical content",
      "severity": "LOW",
      "category": "maintainability",
      "file": "tests-smoke/eval-corpus.jsonl",
      "description": "`eval-corpus.jsonl` is the concatenation of `canonical.jsonl` + `seed-empty-and-motif.jsonl` (per EVAL_CONTRACT.md forge instructions). Committing all three creates drift risk: any manual edit to `eval-corpus.jsonl` will be overwritten by the next `cat`-based regeneration, and any edit to a seed without regenerating leaves the corpus stale.",
      "suggestion": "Either commit only the seeds and generate `eval-corpus.jsonl` in CI, or commit only `eval-corpus.jsonl` and treat seeds as historical/forge-only artifacts. Document which is authoritative.",
      "confidence": 0.8
    },
    {
      "id": "harness-forge-shared-min-score",
      "title": "min_score derivation is hardcoded in two places",
      "severity": "LOW",
      "category": "maintainability",
      "file": "tests-smoke/forge-canonical.ts",
      "description": "The forge writes `min_score = floor((actualScore - 0.1) * 10) / 10` and the harness asserts `top.score >= c.min_score`. The 0.1 buffer is the only flake-tolerance budget for canonical cases. If the substrate's scores drift slightly (qmd version bump, embedding model change), all 43 canonical cases could fail simultaneously with no clear remediation path other than re-forging.",
      "suggestion": "Document the 0.1 buffer as a tunable in EVAL_CONTRACT.md, and consider adding a `pnpm eval:reforge` script that updates min_scores when intentional substrate changes occur.",
      "confidence": 0.5
    },
    {
      "id": "praise-failure-layer-classification",
      "title": "Failure-layer hints add real diagnostic value",
      "severity": "PRAISE",
      "category": "design",
      "file": "tests-smoke/eval-harness.ts",
      "description": "The `failureLayerHint()` taxonomy (INFRASTRUCTURE / SUBSTRATE-FP / SUBSTRATE-FN / INTERFACE-or-SUBSTRATE / THRESHOLD) gives operators an actionable starting point for triage rather than just 'test failed'. The contract document also explicitly maps each hint to a remediation bucket (case-wrong / substrate-thin / inherent), which is the kind of structure that prevents debug spirals."
    },
    {
      "id": "praise-keeper-source-provenance",
      "title": "keeper_source field captures provenance per case",
      "severity": "PRAISE",
      "category": "design",
      "file": "tests-smoke/eval-corpus.jsonl",
      "description": "Each corpus case carries a `keeper_source` describing why it exists, including notes like 'replaces \"car\" which substrate matched Satanist @0.88 (vec/HyDE quirk)'. This makes the corpus self-documenting and prevents the common eval-rot pattern where nobody remembers why a test exists or was changed."
    },
    {
      "id": "gate-history-baseline-low",
      "title": "EVAL_GATE default 0.85 vs current 100% gives wide regression band",
      "severity": "SPECULATION",
      "category": "policy",
      "file": "tests-smoke/EVAL_CONTRACT.md",
      "description": "Gate default is 0.85 but the current corpus passes at 100%. A regression to 86% (8 of 58 cases breaking silently) would still pass CI. Consider a tighter gate or a separate 'regression band' (alert if drop > 5pp from previous run).",
      "suggestion": "Set EVAL_GATE to a value closer to current pass rate (e.g. 0.95) or implement a delta-aware gate that compares against the last recorded pass rate in the gate-history table.",
      "confidence": 0.5
    }
  ]
}
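
To make the min_score findings concrete, the sketch below reproduces the derivation quoted in the review and flags weak anchors. With a 0.1 buffer, any top-1 score below 0.2 floors to a min_score of 0. The 0.3 review floor is the example value from the suggestion; the `deriveMinScore` name, the `weak` flag, and the constant names are assumptions, not part of tests-smoke/forge-canonical.ts.

```typescript
// Flake-tolerance buffer quoted in the review findings.
const SCORE_BUFFER = 0.1;
// Hypothetical review floor, taken from the "e.g. 0.3" suggestion.
const WEAK_ANCHOR_FLOOR = 0.3;

// Same arithmetic as the forge formula, plus an assumed weak-anchor marker.
function deriveMinScore(actualScore: number): { min_score: number; weak: boolean } {
  const min_score = Math.max(0, Math.floor((actualScore - SCORE_BUFFER) * 10) / 10);
  return { min_score, weak: min_score < WEAK_ANCHOR_FLOOR };
}

// Illustrative inputs (not measured scores from the corpus):
const strong = deriveMinScore(0.88);  // min_score 0.7, not weak
const typical = deriveMinScore(0.55); // min_score 0.4, not weak
const thin = deriveMinScore(0.15);    // min_score 0, flagged weak
```

Keeping the buffer and the floor as named constants in one place would also address the related finding that the derivation is hardcoded in both the forge and the harness.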

Callouts

Enrichment unavailable for this review.

Operator critique 2026-05-02: "qa test looks very basic." Verification scenarios (S0-S7) gate correctness but don't show what's WORTH showing. This commit adds:

1. Capability-landscape preamble (before scenarios): articulates what's NEW post-09a: 9th tool search_codex live, substrate measured 58/58, 3 distribution paths, doctrine promotion, runtime Docker fixes, substrate proven RICHER than operator hunches, KEEPER real-user loop replacing operator-paired hunches. Names what UNLOCKS for users, not just devs.
2. Three showcase scenarios (SC1-SC3): narrative walkthroughs demonstrating the felt outcomes the verification scenarios merely verify:
- SC1: substrate-discovery moment (Gumi-style 4-step navigation through transformation/cypherpunk/relate/underworld; captures the SUBSTRATE-RICHER-THAN-OPERATOR moment as REPRODUCIBLE in prod)
- SC2: cross-character mirror (same intent through ruggy + satoshi; proves substrate decoupled from voice)
- SC3: anti-hallucination cadence (ask for grails that don't exist; proves the no-fine-tuning doctrine claim holds at user-facing layer)

WITNESS doctrine ≤7 surfaces preserved for verification (S0-S7); showcase scenarios are an explicit second register for capability-demo audiences (operator showing investor, Gumi showing community member, KEEPER capturing real-user friction).

Per WITNESS Hamilton discipline: trust the deployment, but ALSO show why the deployment matters. Triage paths in SC1-SC3 named in felt-outcomes (substrate not called, voice loses, anti-hallucination broken), not just pass/fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview · May 2, 2026 23:19

chatgpt-codex-connector Bot reviewed · May 2, 2026

zkSoju mentioned this pull request · May 2, 2026
fix(docker): node:20-alpine → node:20-slim for node-llama-cpp postinstall #69 (Merged)

vercel Bot deployed to Preview · May 2, 2026 23:32

zkSoju commented · May 2, 2026

zkSoju mentioned this pull request · May 2, 2026
fix(docker): bundle qmd binary + prebuilt codex index for search_codex runtime #70 (Merged)

vercel Bot deployed to Preview · May 2, 2026 23:54

This was referenced · May 2, 2026
WITNESS QA: codex MCP integration · live trial in dev guild (gated on construct-mibera-codex#69) 0xHoneyJar/freeside-characters#20 (Open)
fix(docker): copy qmd registry (~/.config/qmd) — search_codex still failed post-#70 #71 (Merged)

zkSoju wants to merge 3 commits into main from feat/eval-corpus-09a
