Skip to content

Codify recurring Codex transcript and bugfix findings #622

Description

@dcramer

Summary

Codex transcript review and commit-history analysis surfaced several recurring, mechanically detectable failure patterns that should be codified through existing repo checks. The goal is to capture only clear wins: low-noise checks that would have prevented repeated review findings, bugfix follow-ups, or command/workflow mistakes.

Methodology

I scanned local Codex transcript data from /Users/dcramer/.codex and then cross-checked the themes against repository commit history.

Transcript coverage:

  • 1,239 Codex transcript JSONL files total.
  • 469 files with exact cwd /Users/dcramer/src/junior.
  • 581 files with Junior-like cwd, including Codex worktrees.
  • 797 sessions that materially reference Junior.
  • Final reference-inclusive pass found 7,870 Junior-relevant messages and 3,048 finding/review-like messages.

Signals used:

  • Structured JSONL parsing of session metadata, user messages, assistant messages, tool commands, and command failures.
  • Filtering for finding/review-like text: severity labels, impact:, fix:, violates, review task prompts, and subagent notifications.
  • Command-pattern scans for known repo workflow mistakes such as evals through raw Vitest or -- forwarding.
  • Commit-history scan over 1,645 visible commits, including 657 fix-like commits.

Strong commit-history clusters:

  • Slack/message/routing: 147 fix-like commits.
  • Auth/credentials: 85.
  • Sandbox/egress: 83.
  • Plugin/runtime boundaries: 72.
  • Observability: 66.
  • Deploy/build/package tracing: 64.
  • SQL/conversation metadata: 55.
  • Queue/task execution: 45.

Existing check surfaces to reuse:

  • pnpm lint, already runs oxlint, ast-grep scan, and package:lint.
  • sgconfig.yml with rules in ast-grep/rules.
  • Existing custom script precedent: scripts/check-release-config.mjs.
  • pnpm typecheck, focused package scripts, and targeted component/integration tests.

Tasks

  • Add an eval command/documentation check.

    • Detect and fail on eval guidance/examples that use raw pnpm exec vitest, pnpm exec vitest ... *.eval.ts, pnpm --filter @sentry/junior-evals evals -- ..., or evals ... -- -t.
    • Scope to AGENTS.md, packages/junior-evals/README.md, docs, specs, and eval package files.
    • Wire into pnpm lint.
    • Evidence: repeated transcript command drift plus fix: Route Sentry telemetry and simplify evals, fix(evals): Use runtime adapter overrides, and fix(junior-evals): Align harness with eval types.
  • Add an eval harness/schema drift check.

    • Detect old repo-local eval result surfaces such as result.output, assistant_posts, channel_posts, observed_tool_invocations, local transcript/event-log schemas, and contract/allow rubric remnants.
    • Prefer an ast-grep or focused text/AST check over broad grep where practical.
    • Wire into pnpm lint.
    • Evidence: eval cleanup transcripts and Hex auth gate blocks redirect; wrong data source for product analytics #615 cleanup work around native vitest-evals primitives.
  • Add a changed-spec metadata check.

    • For changed specs/**/*.md, require Last Edited and a changelog entry to be updated.
    • Use a small custom script with a base-ref fallback, following the style of scripts/check-release-config.mjs.
    • Wire into pnpm lint or pnpm docs:check.
    • Evidence: repeated review findings for stale Last Edited metadata after spec changes.
  • Add ast-grep rules for forbidden runtime singleton/test mutation patterns.

    • Ban new exported set*ForTests and reset*ForTests APIs in runtime code.
    • Ban new process-wide mutable runtime behavior globals such as top-level default*Store/singleton caches outside explicit allowlisted modules.
    • Scope first to packages/junior/src.
    • Evidence: transcript findings around mutable process-wide stores and repo rules banning test-only singleton mutation APIs.
  • Add a production composition-root import boundary check.

    • Ban imports of @/chat/app/production outside explicit app/composition-root allowlists.
    • Use oxlint no-restricted-imports or ast-grep, whichever produces the clearer exception model.
    • Evidence: repeated runtime-boundary findings where worker/runtime paths bypass injected app-scoped services.
  • Add a plugin/runtime boundary schema ownership check.

    • Start with packages/junior-plugin-api/src/**/*.
    • Require exported non-trivial plugin/runtime boundary contracts to be schema-owned or explicitly allowlisted.
    • Prefer checking for adjacent exported Zod schemas and inferred public types where practical.
    • Evidence: 221 runtime-boundary/schema finding-like transcript messages across 108 sessions, plus prompt hook and malformed-response bugfixes.
  • Add a public API barrel/export check.

    • Ban export * from package roots except explicit allowlists.
    • Start with packages/*/src/index.ts.
    • Keep existing intentional exports only after review.
    • Evidence: review findings around export * from "./prompt" publishing dormant plugin APIs and export * from "./json" adding unused public surface.
  • Add an env/config docs alignment check.

    • Compare known config/env schema names with packages/docs/src/content/docs/reference/config-and-env.md and deploy docs.
    • Start with env vars owned by packages/junior/src/chat/config.ts and SQL/plugin database config.
    • Evidence: commit history repeatedly required config docs updates alongside env/config changes, including SQL driver/env fixes.
  • Add narrow auth/egress structured-signal checks.

    • Ban heuristic parsing of sandbox stderr/output for auth outcomes in auth/egress modules when typed auth signals are available.
    • Do not ban all fallback logic; keep this scoped to auth/egress signal handling.
    • Evidence: fix(auth): consume egress auth signals unconditionally, remove heuristic detection, fix(sandbox): always consume egress auth signals regardless of exit code, and fix(auth): remove provider fallback.
  • Add SQL/conversation boundary checks and invariant tests.

    • Static checks: ban generic raw row claims like query<T>() and raw SQL execution outside provider/migration modules.
    • Test invariants: backfill is non-destructive, SQL execution state is monotonic, transient work/lease fields do not leak into conversation metadata records.
    • Evidence: dense SQL/conversation fix chain around projection isolation, monotonic execution, leases, migrations, and backfill.
  • Keep Slack routing/message behavior primarily in integration/eval coverage, not static lint.

    • Static lint should only cover hard boundary mistakes, such as forbidden imports or old Slack test helpers.
    • Evidence: Slack has the largest fix cluster, but most issues are behavior contracts rather than syntax-level mistakes.
  • Keep broad "no fallback" as review guidance, not a generic linter.

    • Codify only narrow typed-boundary variants, such as auth/egress structured signals.
    • Evidence: transcript count is high, but broad fallback detection would be noisy and intent-dependent.

Recommended First Slice

  1. Eval command/documentation check.
  2. Eval harness/schema drift check.
  3. Changed-spec metadata check.
  4. Runtime singleton/test mutation ast-grep rules.
  5. Production composition-root import boundary check.

These are the clearest low-noise wins and fit the existing pnpm lint/ast-grep setup.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions