Codify recurring Codex transcript and bugfix findings

## Summary

Codex transcript review and commit-history analysis surfaced several recurring, mechanically detectable failure patterns that should be codified through existing repo checks. The goal is to capture only clear wins: low-noise checks that would have prevented repeated review findings, bugfix follow-ups, or command/workflow mistakes.

## Methodology

I scanned local Codex transcript data from `/Users/dcramer/.codex` and then cross-checked the themes against repository commit history.

Transcript coverage:

- 1,239 Codex transcript JSONL files total.
- 469 files with exact cwd `/Users/dcramer/src/junior`.
- 581 files with Junior-like cwd, including Codex worktrees.
- 797 sessions that materially reference Junior.
- Final reference-inclusive pass found 7,870 Junior-relevant messages and 3,048 finding/review-like messages.

Signals used:

- Structured JSONL parsing of session metadata, user messages, assistant messages, tool commands, and command failures.
- Filtering for finding/review-like text: severity labels, `impact:`, `fix:`, `violates`, review task prompts, and subagent notifications.
- Command-pattern scans for known repo workflow mistakes such as evals through raw Vitest or `--` forwarding.
- Commit-history scan over 1,645 visible commits, including 657 fix-like commits.

Strong commit-history clusters:

- Slack/message/routing: 147 fix-like commits.
- Auth/credentials: 85.
- Sandbox/egress: 83.
- Plugin/runtime boundaries: 72.
- Observability: 66.
- Deploy/build/package tracing: 64.
- SQL/conversation metadata: 55.
- Queue/task execution: 45.

Existing check surfaces to reuse:

- `pnpm lint`, already runs `oxlint`, `ast-grep scan`, and `package:lint`.
- `sgconfig.yml` with rules in `ast-grep/rules`.
- Existing custom script precedent: `scripts/check-release-config.mjs`.
- `pnpm typecheck`, focused package scripts, and targeted component/integration tests.

## Tasks

- [ ] Add an eval command/documentation check.
  - Detect and fail on eval guidance/examples that use raw `pnpm exec vitest`, `pnpm exec vitest ... *.eval.ts`, `pnpm --filter @sentry/junior-evals evals -- ...`, or `evals ... -- -t`.
  - Scope to `AGENTS.md`, `packages/junior-evals/README.md`, docs, specs, and eval package files.
  - Wire into `pnpm lint`.
  - Evidence: repeated transcript command drift plus `fix: Route Sentry telemetry and simplify evals`, `fix(evals): Use runtime adapter overrides`, and `fix(junior-evals): Align harness with eval types`.

- [ ] Add an eval harness/schema drift check.
  - Detect old repo-local eval result surfaces such as `result.output`, `assistant_posts`, `channel_posts`, `observed_tool_invocations`, local transcript/event-log schemas, and `contract`/`allow` rubric remnants.
  - Prefer an ast-grep or focused text/AST check over broad grep where practical.
  - Wire into `pnpm lint`.
  - Evidence: eval cleanup transcripts and GH-615 cleanup work around native `vitest-evals` primitives.

- [ ] Add a changed-spec metadata check.
  - For changed `specs/**/*.md`, require `Last Edited` and a changelog entry to be updated.
  - Use a small custom script with a base-ref fallback, following the style of `scripts/check-release-config.mjs`.
  - Wire into `pnpm lint` or `pnpm docs:check`.
  - Evidence: repeated review findings for stale `Last Edited` metadata after spec changes.

- [ ] Add ast-grep rules for forbidden runtime singleton/test mutation patterns.
  - Ban new exported `set*ForTests` and `reset*ForTests` APIs in runtime code.
  - Ban new process-wide mutable runtime behavior globals such as top-level `default*Store`/singleton caches outside explicit allowlisted modules.
  - Scope first to `packages/junior/src`.
  - Evidence: transcript findings around mutable process-wide stores and repo rules banning test-only singleton mutation APIs.

- [ ] Add a production composition-root import boundary check.
  - Ban imports of `@/chat/app/production` outside explicit app/composition-root allowlists.
  - Use `oxlint no-restricted-imports` or ast-grep, whichever produces the clearer exception model.
  - Evidence: repeated runtime-boundary findings where worker/runtime paths bypass injected app-scoped services.

- [ ] Add a plugin/runtime boundary schema ownership check.
  - Start with `packages/junior-plugin-api/src/**/*`.
  - Require exported non-trivial plugin/runtime boundary contracts to be schema-owned or explicitly allowlisted.
  - Prefer checking for adjacent exported Zod schemas and inferred public types where practical.
  - Evidence: 221 runtime-boundary/schema finding-like transcript messages across 108 sessions, plus prompt hook and malformed-response bugfixes.

- [ ] Add a public API barrel/export check.
  - Ban `export *` from package roots except explicit allowlists.
  - Start with `packages/*/src/index.ts`.
  - Keep existing intentional exports only after review.
  - Evidence: review findings around `export * from "./prompt"` publishing dormant plugin APIs and `export * from "./json"` adding unused public surface.

- [ ] Add an env/config docs alignment check.
  - Compare known config/env schema names with `packages/docs/src/content/docs/reference/config-and-env.md` and deploy docs.
  - Start with env vars owned by `packages/junior/src/chat/config.ts` and SQL/plugin database config.
  - Evidence: commit history repeatedly required config docs updates alongside env/config changes, including SQL driver/env fixes.

- [ ] Add narrow auth/egress structured-signal checks.
  - Ban heuristic parsing of sandbox stderr/output for auth outcomes in auth/egress modules when typed auth signals are available.
  - Do not ban all fallback logic; keep this scoped to auth/egress signal handling.
  - Evidence: `fix(auth): consume egress auth signals unconditionally, remove heuristic detection`, `fix(sandbox): always consume egress auth signals regardless of exit code`, and `fix(auth): remove provider fallback`.

- [ ] Add SQL/conversation boundary checks and invariant tests.
  - Static checks: ban generic raw row claims like `query<T>()` and raw SQL execution outside provider/migration modules.
  - Test invariants: backfill is non-destructive, SQL execution state is monotonic, transient work/lease fields do not leak into conversation metadata records.
  - Evidence: dense SQL/conversation fix chain around projection isolation, monotonic execution, leases, migrations, and backfill.

- [ ] Keep Slack routing/message behavior primarily in integration/eval coverage, not static lint.
  - Static lint should only cover hard boundary mistakes, such as forbidden imports or old Slack test helpers.
  - Evidence: Slack has the largest fix cluster, but most issues are behavior contracts rather than syntax-level mistakes.

- [ ] Keep broad "no fallback" as review guidance, not a generic linter.
  - Codify only narrow typed-boundary variants, such as auth/egress structured signals.
  - Evidence: transcript count is high, but broad fallback detection would be noisy and intent-dependent.

## Recommended First Slice

1. Eval command/documentation check.
2. Eval harness/schema drift check.
3. Changed-spec metadata check.
4. Runtime singleton/test mutation ast-grep rules.
5. Production composition-root import boundary check.

These are the clearest low-noise wins and fit the existing `pnpm lint`/`ast-grep` setup.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Codify recurring Codex transcript and bugfix findings #622

Summary

Methodology

Tasks

Recommended First Slice

Metadata

Assignees

Labels

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Codify recurring Codex transcript and bugfix findings #622

Description

Summary

Methodology

Tasks

Recommended First Slice

Metadata

Metadata

Assignees

Labels

Fields

Projects

Milestone

Relationships

Development

Issue actions