Context
Refs #2791 and uses #2851 as the reference/proof branch for the larger command-strategy direction.
The command refactor is being split into smaller, mergeable layers. Before moving more command ownership and routing code, we should add an acceptance-test layer that describes the expected user-visible behavior in product language and then backs that language with executable tests.
Why this is useful
-
Owner-written acceptance criteria
The owner or feature proponent can write the expected behavior directly in the issue using Gherkin-style Given / When / Then. Engineering can then map those scenarios to executable tests without translating the acceptance criteria into a completely different format.
-
Real integration visibility
Existing unit tests and narrower integration tests are still necessary, but they usually validate one layer at a time. The acceptance layer should show the complete flow across public borders:
- user prompt / composer input
- LLM provider request
- mocked LLM tool decision
- tool registry and dispatch
- real tool execution
- Task List / WorkBench update
- Statusline behavior when applicable
- BlueWhale activity indicator while work is running
- tool result returned to the LLM in the next request
- final assistant output rendered to the user
-
Refactor safety before architecture changes
The command-strategy refactor will move code across boundaries. Acceptance tests protect behavior before the structure changes, so smaller PRs can prove they preserved the observable workflow.
-
Executable documentation
A Gherkin scenario documents the feature in language the owner can review, while the step definitions prove the behavior through real public interfaces.
-
Better failure narratives
When a scenario fails, the failure should say which part of the user-visible flow broke: LLM request, tool selection, Task List state, Statusline/BlueWhale feedback, tool result, follow-up LLM request, or rendered answer.
Proposed layers
Layer 1: non-interactive public border
Use codewhale-tui exec --auto --output-format stream-json with a local mocked LLM endpoint.
This can assert:
- CodeWhale sends the user prompt to the mocked LLM.
- The mocked LLM requests a tool, for example
list_dir with {"path":"."}.
- CodeWhale emits a public
tool_use event.
- CodeWhale executes the real tool against an offline temp workspace.
- CodeWhale emits a public
tool_result event containing the expected entries.
- CodeWhale sends the tool result back to the mocked LLM in the next
/v1/chat/completions request.
- CodeWhale emits the final assistant answer.
Layer 2: interactive TUI screen border
Use the PTY/frame-capture harness to drive the real TUI like a user.
This should assert:
-
The user prompt appears in the composer before submit.
-
The Task List shows the running tool row:
Then the task list should show:
| status | marker | tool | input |
| running | [~] | list_dir | . |
-
The BlueWhale activity indicator moves while the tool or turn is running.
-
The Statusline shows the active turn state if this workflow affects it.
-
After completion, the Task List shows:
Then the task list should show:
| status | marker | tool | input |
| completed | ✓ | list_dir | . |
-
The BlueWhale activity indicator stops or changes to the completed state.
-
The rendered transcript/output includes the formatted assistant answer.
Layer 3: non-happy paths
Add scenarios for:
- LLM requests an unknown tool.
- Tool returns an error.
- Tool returns an empty result.
- LLM produces malformed tool arguments.
- The follow-up LLM answer omits the expected formatted summary.
Example Gherkin
Feature: Tool call lifecycle
Scenario: Happy path lists the current directory through a tool
Given an offline CodeWhale workspace containing:
| path | kind |
| README.md | file |
| notes.txt | file |
| src | folder |
And the mocked LLM will request the "list_dir" tool with:
| path |
| . |
And the mocked LLM will answer after the tool result:
| content |
| The directory contains README.md, notes.txt, and src/. |
When the user asks "list the current directory"
Then CodeWhale should send the user request to the mocked LLM
And the task list should show a running tool:
| status | marker | tool | input |
| running | [~] | list_dir | . |
And the BlueWhale activity indicator should move while the tool is running
And the Statusline should show the active turn state if this workflow updates it
And the tool result should return directory entries:
| entry | kind |
| README.md | file |
| notes.txt | file |
| src | folder |
And CodeWhale should send the tool result back to the mocked LLM
And the task list should show a completed tool:
| status | marker | tool | input |
| completed | ✓ | list_dir | . |
And the BlueWhale activity indicator should stop or show completed state
And the public output should include "The directory contains README.md, notes.txt, and src/."
Open questions
- What exact Statusline text or state should be expected for this workflow?
- What is the most stable public signal for BlueWhale movement: frame-to-frame text/position changes, animation phase, or a dedicated accessible/status marker?
- Should Windows ConPTY be enabled for this PTY test layer now, or should the first screen-level scenario stay Unix-only until the Windows input/rendering path is audited?
Paulo Aboim Pinto
Context
Refs #2791 and uses #2851 as the reference/proof branch for the larger command-strategy direction.
The command refactor is being split into smaller, mergeable layers. Before moving more command ownership and routing code, we should add an acceptance-test layer that describes the expected user-visible behavior in product language and then backs that language with executable tests.
Why this is useful
Owner-written acceptance criteria
The owner or feature proponent can write the expected behavior directly in the issue using Gherkin-style
Given / When / Then. Engineering can then map those scenarios to executable tests without translating the acceptance criteria into a completely different format.Real integration visibility
Existing unit tests and narrower integration tests are still necessary, but they usually validate one layer at a time. The acceptance layer should show the complete flow across public borders:
Refactor safety before architecture changes
The command-strategy refactor will move code across boundaries. Acceptance tests protect behavior before the structure changes, so smaller PRs can prove they preserved the observable workflow.
Executable documentation
A Gherkin scenario documents the feature in language the owner can review, while the step definitions prove the behavior through real public interfaces.
Better failure narratives
When a scenario fails, the failure should say which part of the user-visible flow broke: LLM request, tool selection, Task List state, Statusline/BlueWhale feedback, tool result, follow-up LLM request, or rendered answer.
Proposed layers
Layer 1: non-interactive public border
Use
codewhale-tui exec --auto --output-format stream-jsonwith a local mocked LLM endpoint.This can assert:
list_dirwith{"path":"."}.tool_useevent.tool_resultevent containing the expected entries./v1/chat/completionsrequest.Layer 2: interactive TUI screen border
Use the PTY/frame-capture harness to drive the real TUI like a user.
This should assert:
The user prompt appears in the composer before submit.
The Task List shows the running tool row:
The BlueWhale activity indicator moves while the tool or turn is running.
The Statusline shows the active turn state if this workflow affects it.
After completion, the Task List shows:
The BlueWhale activity indicator stops or changes to the completed state.
The rendered transcript/output includes the formatted assistant answer.
Layer 3: non-happy paths
Add scenarios for:
Example Gherkin
Open questions
Paulo Aboim Pinto