When the test command has no mode flags, all four modes become the same command #258

@sdubagun-amd

What happens

GEAK turns the user's test command into a COMMANDMENT with four modes: --correctness, --benchmark, --full-benchmark, and --profile.

But if the user's command contains none of GEAK's mode flags, for example a plain invocation such as:

timeout 600 python op_tests/test_rmsnorm2dFusedAddQuant.py

then GEAK puts the same unchanged command into all four modes. Every mode just re-runs the exact same thing. In a real run, the generated COMMANDMENT.md had four identical timeout 600 python ... bodies.

Why it happens

  • src/minisweagent/run/preprocess_v3/tools.py:111 - _substitute_mode_flag: when the command has no known flag, it returns the command unchanged (line 127). So nothing distinguishes the four modes.
  • _make_tool_commandment_from_user_command then emits that same command for every mode.
  • The harness-synthesis path (tools.py:184, _try_synthesize_shell_contract_harness) only triggers for compound cmd_a && cmd_b commands (if "&&" not in cmd: return None), so a single plain command never gets a proper per-mode harness.
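To make the failure path concrete, here is a minimal sketch of the behavior described in the bullets above. It is not the code in tools.py: the function names (substitute_mode_flag, try_synthesize_harness, make_commandment) and the token-based flag matching are assumptions made for illustration. Only the two behaviors come from this issue: a command with no known flag is returned unchanged, and harness synthesis is skipped unless the command contains "&&".

```python
# Illustrative sketch only; NOT the code in tools.py. All names are hypothetical.
MODE_FLAGS = ("--correctness", "--benchmark", "--full-benchmark", "--profile")


def substitute_mode_flag(cmd, mode_flag):
    """Replace any known mode flag in cmd with mode_flag.

    If cmd contains no known flag, it is returned unchanged, which is the
    fallback behavior this issue describes.
    """
    tokens = cmd.split()
    if not any(tok in MODE_FLAGS for tok in tokens):
        return cmd
    return " ".join(mode_flag if tok in MODE_FLAGS else tok for tok in tokens)


def try_synthesize_harness(cmd):
    """Stand-in for the synthesis gate: only compound commands qualify."""
    if "&&" not in cmd:
        return None
    return cmd.split("&&")  # placeholder; real synthesis is out of scope here


def make_commandment(cmd):
    """Build one command body per mode, keyed by mode name."""
    return {flag.lstrip("-"): substitute_mode_flag(cmd, flag) for flag in MODE_FLAGS}


plain = "timeout 600 python op_tests/test_rmsnorm2dFusedAddQuant.py"
modes = make_commandment(plain)
print(len(set(modes.values())))       # 1: all four bodies are identical
print(try_synthesize_harness(plain))  # None: no "&&", so no harness is built
```

With a command that does carry a flag (say, a hypothetical python t.py --benchmark), the same sketch yields four distinct bodies, so the duplication is specific to the no-flag fallback.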

Real-world impact

Two problems showed up in a kernel run because of this:

  1. Wrong operation benchmarked. The harness defaults to --mode 1 (plain rmsnorm), but one kernel was add_rmsnorm, which needs --mode 2. Since no mode was ever chosen, that kernel was measured on the wrong op.
  2. Timeouts. With no shape/size arguments chosen per mode, every mode ran the harness's full default sweep and hit the 600s limit before optimization could begin.

Expected vs actual

  • Expected: the four modes are genuinely different - correctness runs a small quick check, benchmark/full-benchmark use the right op and sizes, profile wraps the benchmark. If a mode truly cannot be built, say so clearly.
  • Actual: all four modes are identical copies of the user's command, and a missing flag silently produces duplicates instead of a warning.

Suggested direction

Let the preprocessor's orchestrator LLM author the actual command for each of the four modes (choosing the right op / --mode, small shapes that finish under the timeout, and iteration counts that honor GEAK_BENCHMARK_ITERATIONS), instead of mechanically copying one command into all four. When a mode cannot be authored, emit a visible warning rather than a silent duplicate.
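For the "visible warning rather than a silent duplicate" part, one possible guard is sketched below. It assumes the generated commands are available as a dict mapping mode name to command body; that shape, and the helper names find_duplicate_modes and warn_on_duplicate_modes, are hypothetical and not taken from the GEAK codebase. The check compares bodies after collapsing whitespace and warns once per group of modes that share a body.

```python
import warnings


def find_duplicate_modes(commands):
    """Group mode names by command body; return only groups with 2+ modes.

    `commands` is assumed to be a dict of mode name -> command string.
    Whitespace is normalized so trivially reformatted copies still match.
    """
    by_body = {}
    for mode, body in commands.items():
        by_body.setdefault(" ".join(body.split()), []).append(mode)
    return {body: modes for body, modes in by_body.items() if len(modes) > 1}


def warn_on_duplicate_modes(commands):
    """Emit one warning per duplicated body; return True if any were found."""
    dupes = find_duplicate_modes(commands)
    for body, modes in dupes.items():
        warnings.warn(
            f"modes {', '.join(modes)} share the same command: {body}",
            stacklevel=2,
        )
    return bool(dupes)
```

A guard like this would only surface the problem; it does not author distinct per-mode commands, which is the larger change proposed above.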
