When the test command has no mode flags, all four modes become the same command #258

## What happens

GEAK turns the user's test command into a COMMANDMENT with four modes: `--correctness`, `--benchmark`, `--full-benchmark`, and `--profile`.

But if the user's command has none of GEAK's mode flags - for example a plain:

```
timeout 600 python op_tests/test_rmsnorm2dFusedAddQuant.py
```

then GEAK puts the **same unchanged command** into all four modes, so every mode re-runs exactly the same thing. In one real run, the generated COMMANDMENT.md contained four identical `timeout 600 python ...` bodies.

## Why it happens

- `src/minisweagent/run/preprocess_v3/tools.py:111` - `_substitute_mode_flag`: when the command has no known flag, it returns the command unchanged (line 127). So nothing distinguishes the four modes.
- `_make_tool_commandment_from_user_command` then emits that same command for every mode.
- The harness-synthesis path (`tools.py:184`, `_try_synthesize_shell_contract_harness`) only triggers for compound `cmd_a && cmd_b` commands (`if "&&" not in cmd: return None`), so a single plain command never gets a proper per-mode harness.
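
To make the pass-through concrete, here is a minimal sketch of the behavior described above. It is not the code in `tools.py`: the names `KNOWN_MODE_FLAGS`, `substitute_mode_flag_sketch`, and `build_modes_sketch` are hypothetical, and the real `_substitute_mode_flag` may have a different signature and matching rule. The sketch only assumes what this issue states: a command with no known flag comes back unchanged.

```python
# Hypothetical stand-in for the behavior described in this issue.
# The real _substitute_mode_flag in tools.py may differ in signature and details.
KNOWN_MODE_FLAGS = ("--correctness", "--benchmark", "--full-benchmark", "--profile")


def substitute_mode_flag_sketch(cmd: str, target_flag: str) -> str:
    """Swap any known mode flag in cmd for target_flag; pass through otherwise."""
    tokens = cmd.split()
    if not any(tok in KNOWN_MODE_FLAGS for tok in tokens):
        # No known flag: the command is returned unchanged (the reported pass-through).
        return cmd
    return " ".join(target_flag if tok in KNOWN_MODE_FLAGS else tok for tok in tokens)


def build_modes_sketch(cmd: str) -> dict[str, str]:
    """Build one command per mode by substituting each flag in turn."""
    return {
        flag.lstrip("-"): substitute_mode_flag_sketch(cmd, flag)
        for flag in KNOWN_MODE_FLAGS
    }


if __name__ == "__main__":
    plain = "timeout 600 python op_tests/test_rmsnorm2dFusedAddQuant.py"
    for mode, command in build_modes_sketch(plain).items():
        print(f"{mode}: {command}")
```

Run on the plain command from this report, the sketch prints the same command for all four modes, which is the duplication seen in COMMANDMENT.md.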

## Real-world impact

Two problems showed up in a kernel run because of this:

1. **Wrong operation benchmarked.** The harness defaults to `--mode 1` (plain `rmsnorm`), but one kernel was `add_rmsnorm`, which needs `--mode 2`. Since no mode was ever chosen, that kernel was measured on the wrong op.
2. **Timeouts.** With no shape/size arguments chosen per mode, every mode ran the harness's full default sweep and hit the 600s limit before optimization could begin.

## Expected vs actual

- **Expected:** the four modes are genuinely different - correctness runs a small quick check, benchmark/full-benchmark use the right op and sizes, profile wraps the benchmark. If a mode truly cannot be built, say so clearly.
- **Actual:** all four modes are identical copies of the user's command, and a missing flag silently produces duplicates instead of a warning.

## Suggested direction

Let the preprocessor's orchestrator LLM author the actual command for each of the four modes (choosing the right op / `--mode`, small shapes that finish under the timeout, and iteration counts that honor `GEAK_BENCHMARK_ITERATIONS`), instead of mechanically copying one command into all four. When a mode cannot be authored, emit a visible warning rather than a silent duplicate.
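
As a possible starting point for the "visible warning" part, the sketch below groups modes by their command and warns when two or more modes share one. The function names and the `modes` dictionary shape (mode name to command string) are assumptions for illustration, not existing GEAK APIs.

```python
import warnings
from collections import defaultdict


def find_duplicate_modes(modes: dict[str, str]) -> list[list[str]]:
    """Return groups of mode names whose commands are identical.

    Commands are compared after collapsing whitespace, so cosmetic
    spacing differences do not hide a duplicate.
    """
    by_command: dict[str, list[str]] = defaultdict(list)
    for mode, command in modes.items():
        by_command[" ".join(command.split())].append(mode)
    return [names for names in by_command.values() if len(names) > 1]


def warn_on_duplicate_modes(modes: dict[str, str]) -> list[list[str]]:
    """Emit one warning per group of identical modes and return the groups."""
    groups = find_duplicate_modes(modes)
    for names in groups:
        warnings.warn(
            f"COMMANDMENT modes {', '.join(names)} share an identical command; "
            "per-mode behavior was not authored.",
            stacklevel=2,
        )
    return groups
```

A check like this could run just before COMMANDMENT.md is written, so a missing flag surfaces as a warning instead of four silent copies.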
