diff --git a/CHANGELOG.md b/CHANGELOG.md index 0769d7d8ea..1bce3443b7 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,16 @@ # Changelog +## [0.12.5.0] - 2026-03-26 — Fix Codex Hangs: 30-Minute Waits Are Gone + +Three bugs in `/codex` caused 30+ minute hangs with zero output during plan reviews and adversarial checks. All three are fixed. + +### Fixed + +- **Plan files now visible to Codex sandbox.** Codex runs sandboxed to the repo root and couldn't see plan files at `~/.claude/plans/`. It would waste 10+ tool calls searching before giving up. Now the plan content is embedded directly in the prompt, and referenced source files are listed so Codex reads them immediately. +- **Streaming output actually streams.** Python's stdout buffering meant zero output visible until the process exited. Added `PYTHONUNBUFFERED=1`, `python3 -u`, and `flush=True` on every print call across all three Codex modes. +- **Sane reasoning effort defaults.** Replaced hardcoded `xhigh` (23x more tokens, known 50+ min hangs per OpenAI issues #8545, #8402, #6931) with per-mode defaults: `high` for review and challenge, `medium` for consult. Users can override with `--xhigh` flag when they want maximum reasoning. +- **`--xhigh` override works in all modes.** The override reminder was missing from challenge and consult mode instructions. Found by adversarial review. + ## [0.12.4.0] - 2026-03-26 — Full Commit Coverage in /ship When you ship a branch with 12 commits spanning performance work, dead code removal, and test infra, the PR should mention all three. It wasn't. The CHANGELOG and PR summary biased toward whatever happened most recently, silently dropping earlier work. diff --git a/VERSION b/VERSION index 01bd748cde..cce9c8ee1e 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.12.4.0 +0.12.5.0 diff --git a/codex/SKILL.md b/codex/SKILL.md index 0b0e587ba3..2cabff5c6d 100644 --- a/codex/SKILL.md +++ b/codex/SKILL.md @@ -407,6 +407,14 @@ Parse the user's input to determine which mode to run: - Otherwise, ask: "What would you like to ask Codex?" 4. `/codex ` — **Consult mode** (Step 2C), where the remaining text is the prompt +**Reasoning effort override:** If the user's input contains `--xhigh` anywhere, +note it and remove it from the prompt text before passing to Codex. When `--xhigh` +is present, use `model_reasoning_effort="xhigh"` for all modes regardless of the +per-mode default below. Otherwise, use the per-mode defaults: +- Review (2A): `high` — bounded diff input, needs thoroughness +- Challenge (2B): `high` — adversarial but bounded by diff +- Consult (2C): `medium` — large context, interactive, needs speed + --- ## Step 2A: Review Mode @@ -420,13 +428,15 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt) 2. Run the review (5-minute timeout): ```bash -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex review --base -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR" ``` +If the user passed `--xhigh`, use `"xhigh"` instead of `"high"`. + Use `timeout: 300000` on the Bash call. If the user provided custom instructions (e.g., `/codex review focus on security`), pass them as the prompt argument: ```bash -codex review "focus on security" --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex review "focus on security" --base -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR" ``` 3. Capture the output. Then parse cost from stderr: @@ -563,8 +573,11 @@ With focus (e.g., "security"): "Review the changes on this branch against the base branch. Run `git diff origin/` to see the diff. Focus specifically on SECURITY. Your job is to find every way an attacker could exploit this code. Think about injection vectors, auth bypasses, privilege escalation, data exposure, and timing attacks. Be adversarial." 2. Run codex exec with **JSONL output** to capture reasoning traces and tool calls (5-minute timeout): + +If the user passed `--xhigh`, use `"xhigh"` instead of `"high"`. + ```bash -codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>/dev/null | python3 -c " +codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>/dev/null | PYTHONUNBUFFERED=1 python3 -u -c " import sys, json for line in sys.stdin: line = line.strip() @@ -577,17 +590,17 @@ for line in sys.stdin: itype = item.get('type','') text = item.get('text','') if itype == 'reasoning' and text: - print(f'[codex thinking] {text}') - print() + print(f'[codex thinking] {text}', flush=True) + print(flush=True) elif itype == 'agent_message' and text: - print(text) + print(text, flush=True) elif itype == 'command_execution': cmd = item.get('command','') - if cmd: print(f'[codex ran] {cmd}') + if cmd: print(f'[codex ran] {cmd}', flush=True) elif t == 'turn.completed': usage = obj.get('usage',{}) tokens = usage.get('input_tokens',0) + usage.get('output_tokens',0) - if tokens: print(f'\ntokens used: {tokens}') + if tokens: print(f'\ntokens used: {tokens}', flush=True) except: pass " ``` @@ -636,20 +649,34 @@ ls -t ~/.claude/plans/*.md 2>/dev/null | xargs grep -l "$(basename $(pwd))" 2>/d ``` If no project-scoped match, fall back to `ls -t ~/.claude/plans/*.md 2>/dev/null | head -1` but warn: "Note: this plan may be from a different project — verify before sending to Codex." -Read the plan file and prepend the persona to the user's prompt: + +**IMPORTANT — embed content, don't reference path:** Codex runs sandboxed to the repo +root (`-C`) and cannot access `~/.claude/plans/` or any files outside the repo. You MUST +read the plan file yourself and embed its FULL CONTENT in the prompt below. Do NOT tell +Codex the file path or ask it to read the plan file — it will waste 10+ tool calls +searching and fail. + +Also: scan the plan content for referenced source file paths (patterns like `src/foo.ts`, +`lib/bar.py`, paths containing `/` that exist in the repo). If found, list them in the +prompt so Codex reads them directly instead of discovering them via rg/find. + +Prepend the persona to the user's prompt: "You are a brutally honest technical reviewer. Review this plan for: logical gaps and unstated assumptions, missing error handling or edge cases, overcomplexity (is there a simpler approach?), feasibility risks (what could go wrong?), and missing dependencies or sequencing issues. Be direct. Be terse. No compliments. Just the problems. +Also review these source files referenced in the plan: . THE PLAN: -" +" 4. Run codex exec with **JSONL output** to capture reasoning traces (5-minute timeout): +If the user passed `--xhigh`, use `"xhigh"` instead of `"medium"`. + For a **new session:** ```bash -codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c " +codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c " import sys, json for line in sys.stdin: line = line.strip() @@ -659,31 +686,31 @@ for line in sys.stdin: t = obj.get('type','') if t == 'thread.started': tid = obj.get('thread_id','') - if tid: print(f'SESSION_ID:{tid}') + if tid: print(f'SESSION_ID:{tid}', flush=True) elif t == 'item.completed' and 'item' in obj: item = obj['item'] itype = item.get('type','') text = item.get('text','') if itype == 'reasoning' and text: - print(f'[codex thinking] {text}') - print() + print(f'[codex thinking] {text}', flush=True) + print(flush=True) elif itype == 'agent_message' and text: - print(text) + print(text, flush=True) elif itype == 'command_execution': cmd = item.get('command','') - if cmd: print(f'[codex ran] {cmd}') + if cmd: print(f'[codex ran] {cmd}', flush=True) elif t == 'turn.completed': usage = obj.get('usage',{}) tokens = usage.get('input_tokens',0) + usage.get('output_tokens',0) - if tokens: print(f'\ntokens used: {tokens}') + if tokens: print(f'\ntokens used: {tokens}', flush=True) except: pass " ``` For a **resumed session** (user chose "Continue"): ```bash -codex exec resume "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c " - +codex exec resume "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c " + " ``` @@ -718,7 +745,14 @@ Session saved — run /codex again to continue this conversation. agentic coding model). This means as OpenAI ships newer models, /codex automatically uses them. If the user wants a specific model, pass `-m` through to codex. -**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible. +**Reasoning effort (per-mode defaults):** +- **Review (2A):** `high` — bounded diff input, needs thoroughness but not max tokens +- **Challenge (2B):** `high` — adversarial but bounded by diff size +- **Consult (2C):** `medium` — large context (plans, codebase), interactive, needs speed + +`xhigh` uses ~23x more tokens than `high` and causes 50+ minute hangs on large context +tasks (OpenAI issues #8545, #8402, #6931). Users can override with `--xhigh` flag +(e.g., `/codex review --xhigh`) when they want maximum reasoning and are willing to wait. **Web search:** All codex commands use `--enable web_search_cached` so Codex can look up docs and APIs during review. This is OpenAI's cached index — fast, no extra cost. diff --git a/codex/SKILL.md.tmpl b/codex/SKILL.md.tmpl index 77021c8237..4a8fbbe846 100644 --- a/codex/SKILL.md.tmpl +++ b/codex/SKILL.md.tmpl @@ -67,6 +67,14 @@ Parse the user's input to determine which mode to run: - Otherwise, ask: "What would you like to ask Codex?" 4. `/codex ` — **Consult mode** (Step 2C), where the remaining text is the prompt +**Reasoning effort override:** If the user's input contains `--xhigh` anywhere, +note it and remove it from the prompt text before passing to Codex. When `--xhigh` +is present, use `model_reasoning_effort="xhigh"` for all modes regardless of the +per-mode default below. Otherwise, use the per-mode defaults: +- Review (2A): `high` — bounded diff input, needs thoroughness +- Challenge (2B): `high` — adversarial but bounded by diff +- Consult (2C): `medium` — large context, interactive, needs speed + --- ## Step 2A: Review Mode @@ -80,13 +88,15 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt) 2. Run the review (5-minute timeout): ```bash -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex review --base -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR" ``` +If the user passed `--xhigh`, use `"xhigh"` instead of `"high"`. + Use `timeout: 300000` on the Bash call. If the user provided custom instructions (e.g., `/codex review focus on security`), pass them as the prompt argument: ```bash -codex review "focus on security" --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex review "focus on security" --base -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR" ``` 3. Capture the output. Then parse cost from stderr: @@ -158,8 +168,11 @@ With focus (e.g., "security"): "Review the changes on this branch against the base branch. Run `git diff origin/` to see the diff. Focus specifically on SECURITY. Your job is to find every way an attacker could exploit this code. Think about injection vectors, auth bypasses, privilege escalation, data exposure, and timing attacks. Be adversarial." 2. Run codex exec with **JSONL output** to capture reasoning traces and tool calls (5-minute timeout): + +If the user passed `--xhigh`, use `"xhigh"` instead of `"high"`. + ```bash -codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>/dev/null | python3 -c " +codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>/dev/null | PYTHONUNBUFFERED=1 python3 -u -c " import sys, json for line in sys.stdin: line = line.strip() @@ -172,17 +185,17 @@ for line in sys.stdin: itype = item.get('type','') text = item.get('text','') if itype == 'reasoning' and text: - print(f'[codex thinking] {text}') - print() + print(f'[codex thinking] {text}', flush=True) + print(flush=True) elif itype == 'agent_message' and text: - print(text) + print(text, flush=True) elif itype == 'command_execution': cmd = item.get('command','') - if cmd: print(f'[codex ran] {cmd}') + if cmd: print(f'[codex ran] {cmd}', flush=True) elif t == 'turn.completed': usage = obj.get('usage',{}) tokens = usage.get('input_tokens',0) + usage.get('output_tokens',0) - if tokens: print(f'\ntokens used: {tokens}') + if tokens: print(f'\ntokens used: {tokens}', flush=True) except: pass " ``` @@ -231,20 +244,34 @@ ls -t ~/.claude/plans/*.md 2>/dev/null | xargs grep -l "$(basename $(pwd))" 2>/d ``` If no project-scoped match, fall back to `ls -t ~/.claude/plans/*.md 2>/dev/null | head -1` but warn: "Note: this plan may be from a different project — verify before sending to Codex." -Read the plan file and prepend the persona to the user's prompt: + +**IMPORTANT — embed content, don't reference path:** Codex runs sandboxed to the repo +root (`-C`) and cannot access `~/.claude/plans/` or any files outside the repo. You MUST +read the plan file yourself and embed its FULL CONTENT in the prompt below. Do NOT tell +Codex the file path or ask it to read the plan file — it will waste 10+ tool calls +searching and fail. + +Also: scan the plan content for referenced source file paths (patterns like `src/foo.ts`, +`lib/bar.py`, paths containing `/` that exist in the repo). If found, list them in the +prompt so Codex reads them directly instead of discovering them via rg/find. + +Prepend the persona to the user's prompt: "You are a brutally honest technical reviewer. Review this plan for: logical gaps and unstated assumptions, missing error handling or edge cases, overcomplexity (is there a simpler approach?), feasibility risks (what could go wrong?), and missing dependencies or sequencing issues. Be direct. Be terse. No compliments. Just the problems. +Also review these source files referenced in the plan: . THE PLAN: -" +" 4. Run codex exec with **JSONL output** to capture reasoning traces (5-minute timeout): +If the user passed `--xhigh`, use `"xhigh"` instead of `"medium"`. + For a **new session:** ```bash -codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c " +codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c " import sys, json for line in sys.stdin: line = line.strip() @@ -254,31 +281,31 @@ for line in sys.stdin: t = obj.get('type','') if t == 'thread.started': tid = obj.get('thread_id','') - if tid: print(f'SESSION_ID:{tid}') + if tid: print(f'SESSION_ID:{tid}', flush=True) elif t == 'item.completed' and 'item' in obj: item = obj['item'] itype = item.get('type','') text = item.get('text','') if itype == 'reasoning' and text: - print(f'[codex thinking] {text}') - print() + print(f'[codex thinking] {text}', flush=True) + print(flush=True) elif itype == 'agent_message' and text: - print(text) + print(text, flush=True) elif itype == 'command_execution': cmd = item.get('command','') - if cmd: print(f'[codex ran] {cmd}') + if cmd: print(f'[codex ran] {cmd}', flush=True) elif t == 'turn.completed': usage = obj.get('usage',{}) tokens = usage.get('input_tokens',0) + usage.get('output_tokens',0) - if tokens: print(f'\ntokens used: {tokens}') + if tokens: print(f'\ntokens used: {tokens}', flush=True) except: pass " ``` For a **resumed session** (user chose "Continue"): ```bash -codex exec resume "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c " - +codex exec resume "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c " + " ``` @@ -313,7 +340,14 @@ Session saved — run /codex again to continue this conversation. agentic coding model). This means as OpenAI ships newer models, /codex automatically uses them. If the user wants a specific model, pass `-m` through to codex. -**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible. +**Reasoning effort (per-mode defaults):** +- **Review (2A):** `high` — bounded diff input, needs thoroughness but not max tokens +- **Challenge (2B):** `high` — adversarial but bounded by diff size +- **Consult (2C):** `medium` — large context (plans, codebase), interactive, needs speed + +`xhigh` uses ~23x more tokens than `high` and causes 50+ minute hangs on large context +tasks (OpenAI issues #8545, #8402, #6931). Users can override with `--xhigh` flag +(e.g., `/codex review --xhigh`) when they want maximum reasoning and are willing to wait. **Web search:** All codex commands use `--enable web_search_cached` so Codex can look up docs and APIs during review. This is OpenAI's cached index — fast, no extra cost. diff --git a/office-hours/SKILL.md b/office-hours/SKILL.md index 6e1a5927b9..f66092367f 100644 --- a/office-hours/SKILL.md +++ b/office-hours/SKILL.md @@ -714,7 +714,7 @@ Write the full prompt (context block + instructions) to this file. Use the mode- ```bash TMPERR_OH=$(mktemp /tmp/codex-oh-err-XXXXXXXX) -codex exec "$(cat "$CODEX_PROMPT_FILE")" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_OH" +codex exec "$(cat "$CODEX_PROMPT_FILE")" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_OH" ``` Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: diff --git a/package.json b/package.json index c06c150b7f..1964b7132a 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "gstack", - "version": "0.12.3.0", + "version": "0.12.5.0", "description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.", "license": "MIT", "type": "module", diff --git a/plan-ceo-review/SKILL.md b/plan-ceo-review/SKILL.md index 4449c987a1..9ca6f1b180 100644 --- a/plan-ceo-review/SKILL.md +++ b/plan-ceo-review/SKILL.md @@ -1091,7 +1091,7 @@ THE PLAN: ```bash TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX) -codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_PV" +codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_PV" ``` Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index 8d2bd80087..93a3a8f1e3 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -749,7 +749,7 @@ THE PLAN: ```bash TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX) -codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_PV" +codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_PV" ``` Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr: diff --git a/review/SKILL.md b/review/SKILL.md index 591fbeb4f7..2e0951016c 100644 --- a/review/SKILL.md +++ b/review/SKILL.md @@ -979,7 +979,7 @@ Claude's structured review already ran. Now add a **cross-model adversarial chal ```bash TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) -codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV" +codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_ADV" ``` Set the Bash tool's `timeout` parameter to `300000` (5 minutes). Do NOT use the `timeout` shell command — it doesn't exist on macOS. After the command completes, read stderr: @@ -1024,7 +1024,7 @@ Claude's structured review already ran. Now run **all three remaining passes** f **1. Codex structured review (if available):** ```bash TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex review --base -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR" ``` Set the Bash tool's `timeout` parameter to `300000` (5 minutes). Do NOT use the `timeout` shell command — it doesn't exist on macOS. Present output under `CODEX SAYS (code review):` header. diff --git a/scripts/gen-skill-docs.ts b/scripts/gen-skill-docs.ts index 172c0b6d0e..750a43969b 100644 --- a/scripts/gen-skill-docs.ts +++ b/scripts/gen-skill-docs.ts @@ -2196,7 +2196,7 @@ Write the full prompt (context block + instructions) to this file. Use the mode- \`\`\`bash TMPERR_OH=$(mktemp /tmp/codex-oh-err-XXXXXXXX) -codex exec "$(cat "$CODEX_PROMPT_FILE")" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_OH" +codex exec "$(cat "$CODEX_PROMPT_FILE")" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_OH" \`\`\` Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr: @@ -2280,7 +2280,7 @@ Claude's structured review already ran. Now add a **cross-model adversarial chal \`\`\`bash TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) -codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV" +codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_ADV" \`\`\` Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. After the command completes, read stderr: @@ -2325,7 +2325,7 @@ Claude's structured review already ran. Now run **all three remaining passes** f **1. Codex structured review (if available):** \`\`\`bash TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex review --base -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR" \`\`\` Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. Present output under \`CODEX SAYS (code review):\` header. @@ -2435,7 +2435,7 @@ THE PLAN: \`\`\`bash TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX) -codex exec "" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_PV" +codex exec "" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_PV" \`\`\` Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr: diff --git a/scripts/resolvers/review.ts b/scripts/resolvers/review.ts index 423002aa13..9a9954c7b4 100644 --- a/scripts/resolvers/review.ts +++ b/scripts/resolvers/review.ts @@ -292,7 +292,7 @@ Write the full prompt (context block + instructions) to this file. Use the mode- \`\`\`bash TMPERR_OH=$(mktemp /tmp/codex-oh-err-XXXXXXXX) -codex exec "$(cat "$CODEX_PROMPT_FILE")" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_OH" +codex exec "$(cat "$CODEX_PROMPT_FILE")" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_OH" \`\`\` Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr: @@ -376,7 +376,7 @@ Claude's structured review already ran. Now add a **cross-model adversarial chal \`\`\`bash TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) -codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV" +codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_ADV" \`\`\` Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. After the command completes, read stderr: @@ -421,7 +421,7 @@ Claude's structured review already ran. Now run **all three remaining passes** f **1. Codex structured review (if available):** \`\`\`bash TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex review --base -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR" \`\`\` Set the Bash tool's \`timeout\` parameter to \`300000\` (5 minutes). Do NOT use the \`timeout\` shell command — it doesn't exist on macOS. Present output under \`CODEX SAYS (code review):\` header. @@ -531,7 +531,7 @@ THE PLAN: \`\`\`bash TMPERR_PV=$(mktemp /tmp/codex-planreview-XXXXXXXX) -codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_PV" +codex exec "" -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_PV" \`\`\` Use a 5-minute timeout (\`timeout: 300000\`). After the command completes, read stderr: diff --git a/ship/SKILL.md b/ship/SKILL.md index 6d8f3b6afd..5ea3026422 100644 --- a/ship/SKILL.md +++ b/ship/SKILL.md @@ -1469,7 +1469,7 @@ Claude's structured review already ran. Now add a **cross-model adversarial chal ```bash TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX) -codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV" +codex exec "Review the changes on this branch against the base branch. Run git diff origin/ to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -C "$(git rev-parse --show-toplevel)" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_ADV" ``` Set the Bash tool's `timeout` parameter to `300000` (5 minutes). Do NOT use the `timeout` shell command — it doesn't exist on macOS. After the command completes, read stderr: @@ -1514,7 +1514,7 @@ Claude's structured review already ran. Now run **all three remaining passes** f **1. Codex structured review (if available):** ```bash TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX) -codex review --base -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR" +codex review --base -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR" ``` Set the Bash tool's `timeout` parameter to `300000` (5 minutes). Do NOT use the `timeout` shell command — it doesn't exist on macOS. Present output under `CODEX SAYS (code review):` header. diff --git a/test/skill-validation.test.ts b/test/skill-validation.test.ts index 655a454b42..7bb163d84e 100644 --- a/test/skill-validation.test.ts +++ b/test/skill-validation.test.ts @@ -1325,7 +1325,7 @@ describe('Codex skill', () => { expect(content).toContain('fall back to the Claude adversarial subagent'); // Review log uses new skill name expect(content).toContain('adversarial-review'); - expect(content).toContain('xhigh'); + expect(content).toContain('reasoning_effort="high"'); expect(content).toContain('ADVERSARIAL REVIEW SYNTHESIS'); }); @@ -1335,7 +1335,7 @@ describe('Codex skill', () => { expect(content).toContain('< 50'); expect(content).toContain('200+'); expect(content).toContain('adversarial-review'); - expect(content).toContain('xhigh'); + expect(content).toContain('reasoning_effort="high"'); expect(content).toContain('Investigate and fix'); });