Enhancement: add Gherkin acceptance E2E coverage for tool lifecycle

﻿## Context

Refs #2791 and uses #2851 as the reference/proof branch for the larger command-strategy direction.

The command refactor is being split into smaller, mergeable layers. Before moving more command ownership and routing code, we should add an acceptance-test layer that describes the expected user-visible behavior in product language and then backs that language with executable tests.

## Why this is useful

1. **Owner-written acceptance criteria**

   The owner or feature proponent can write the expected behavior directly in the issue using Gherkin-style `Given / When / Then`. Engineering can then map those scenarios to executable tests without translating the acceptance criteria into a completely different format.

2. **Real integration visibility**

   Existing unit tests and narrower integration tests are still necessary, but they usually validate one layer at a time. The acceptance layer should show the complete flow across public borders:

   - user prompt / composer input
   - LLM provider request
   - mocked LLM tool decision
   - tool registry and dispatch
   - real tool execution
   - Task List / WorkBench update
   - Statusline behavior when applicable
   - BlueWhale activity indicator while work is running
   - tool result returned to the LLM in the next request
   - final assistant output rendered to the user

3. **Refactor safety before architecture changes**

   The command-strategy refactor will move code across boundaries. Acceptance tests protect behavior before the structure changes, so smaller PRs can prove they preserved the observable workflow.

4. **Executable documentation**

   A Gherkin scenario documents the feature in language the owner can review, while the step definitions prove the behavior through real public interfaces.

5. **Better failure narratives**

   When a scenario fails, the failure should say which part of the user-visible flow broke: LLM request, tool selection, Task List state, Statusline/BlueWhale feedback, tool result, follow-up LLM request, or rendered answer.

## Proposed layers

### Layer 1: non-interactive public border

Use `codewhale-tui exec --auto --output-format stream-json` with a local mocked LLM endpoint.

This can assert:

- CodeWhale sends the user prompt to the mocked LLM.
- The mocked LLM requests a tool, for example `list_dir` with `{"path":"."}`.
- CodeWhale emits a public `tool_use` event.
- CodeWhale executes the real tool against an offline temp workspace.
- CodeWhale emits a public `tool_result` event containing the expected entries.
- CodeWhale sends the tool result back to the mocked LLM in the next `/v1/chat/completions` request.
- CodeWhale emits the final assistant answer.

### Layer 2: interactive TUI screen border

Use the PTY/frame-capture harness to drive the real TUI like a user.

This should assert:

- The user prompt appears in the composer before submit.
- The Task List shows the running tool row:

  ```gherkin
  Then the task list should show:
    | status  | marker | tool     | input |
    | running | [~]    | list_dir | .     |
  ```

- The BlueWhale activity indicator moves while the tool or turn is running.
- The Statusline shows the active turn state if this workflow affects it.
- After completion, the Task List shows:

  ```gherkin
  Then the task list should show:
    | status    | marker | tool     | input |
    | completed | ✓      | list_dir | .     |
  ```

- The BlueWhale activity indicator stops or changes to the completed state.
- The rendered transcript/output includes the formatted assistant answer.

### Layer 3: non-happy paths

Add scenarios for:

- LLM requests an unknown tool.
- Tool returns an error.
- Tool returns an empty result.
- LLM produces malformed tool arguments.
- The follow-up LLM answer omits the expected formatted summary.

## Example Gherkin

```gherkin
Feature: Tool call lifecycle
  Scenario: Happy path lists the current directory through a tool
    Given an offline CodeWhale workspace containing:
      | path      | kind   |
      | README.md | file   |
      | notes.txt | file   |
      | src       | folder |
    And the mocked LLM will request the "list_dir" tool with:
      | path |
      | .    |
    And the mocked LLM will answer after the tool result:
      | content                                               |
      | The directory contains README.md, notes.txt, and src/. |
    When the user asks "list the current directory"
    Then CodeWhale should send the user request to the mocked LLM
    And the task list should show a running tool:
      | status  | marker | tool     | input |
      | running | [~]    | list_dir | .     |
    And the BlueWhale activity indicator should move while the tool is running
    And the Statusline should show the active turn state if this workflow updates it
    And the tool result should return directory entries:
      | entry     | kind   |
      | README.md | file   |
      | notes.txt | file   |
      | src       | folder |
    And CodeWhale should send the tool result back to the mocked LLM
    And the task list should show a completed tool:
      | status    | marker | tool     | input |
      | completed | ✓      | list_dir | .     |
    And the BlueWhale activity indicator should stop or show completed state
    And the public output should include "The directory contains README.md, notes.txt, and src/."
```

## Open questions

- What exact Statusline text or state should be expected for this workflow?
- What is the most stable public signal for BlueWhale movement: frame-to-frame text/position changes, animation phase, or a dedicated accessible/status marker?
- Should Windows ConPTY be enabled for this PTY test layer now, or should the first screen-level scenario stay Unix-only until the Windows input/rendering path is audited?

Paulo Aboim Pinto


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancement: add Gherkin acceptance E2E coverage for tool lifecycle #2886

Context

Why this is useful

Proposed layers

Layer 1: non-interactive public border

Layer 2: interactive TUI screen border

Layer 3: non-happy paths

Example Gherkin

Open questions

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Enhancement: add Gherkin acceptance E2E coverage for tool lifecycle #2886

Description

Context

Why this is useful

Proposed layers

Layer 1: non-interactive public border

Layer 2: interactive TUI screen border

Layer 3: non-happy paths

Example Gherkin

Open questions

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions