Skip to content

Enhancement: add Gherkin acceptance E2E coverage for tool lifecycle #2886

@aboimpinto

Description

@aboimpinto

Context

Refs #2791 and uses #2851 as the reference/proof branch for the larger command-strategy direction.

The command refactor is being split into smaller, mergeable layers. Before moving more command ownership and routing code, we should add an acceptance-test layer that describes the expected user-visible behavior in product language and then backs that language with executable tests.

Why this is useful

  1. Owner-written acceptance criteria

    The owner or feature proponent can write the expected behavior directly in the issue using Gherkin-style Given / When / Then. Engineering can then map those scenarios to executable tests without translating the acceptance criteria into a completely different format.

  2. Real integration visibility

    Existing unit tests and narrower integration tests are still necessary, but they usually validate one layer at a time. The acceptance layer should show the complete flow across public borders:

    • user prompt / composer input
    • LLM provider request
    • mocked LLM tool decision
    • tool registry and dispatch
    • real tool execution
    • Task List / WorkBench update
    • Statusline behavior when applicable
    • BlueWhale activity indicator while work is running
    • tool result returned to the LLM in the next request
    • final assistant output rendered to the user
  3. Refactor safety before architecture changes

    The command-strategy refactor will move code across boundaries. Acceptance tests protect behavior before the structure changes, so smaller PRs can prove they preserved the observable workflow.

  4. Executable documentation

    A Gherkin scenario documents the feature in language the owner can review, while the step definitions prove the behavior through real public interfaces.

  5. Better failure narratives

    When a scenario fails, the failure should say which part of the user-visible flow broke: LLM request, tool selection, Task List state, Statusline/BlueWhale feedback, tool result, follow-up LLM request, or rendered answer.

Proposed layers

Layer 1: non-interactive public border

Use codewhale-tui exec --auto --output-format stream-json with a local mocked LLM endpoint.

This can assert:

  • CodeWhale sends the user prompt to the mocked LLM.
  • The mocked LLM requests a tool, for example list_dir with {"path":"."}.
  • CodeWhale emits a public tool_use event.
  • CodeWhale executes the real tool against an offline temp workspace.
  • CodeWhale emits a public tool_result event containing the expected entries.
  • CodeWhale sends the tool result back to the mocked LLM in the next /v1/chat/completions request.
  • CodeWhale emits the final assistant answer.

Layer 2: interactive TUI screen border

Use the PTY/frame-capture harness to drive the real TUI like a user.

This should assert:

  • The user prompt appears in the composer before submit.

  • The Task List shows the running tool row:

    Then the task list should show:
      | status  | marker | tool     | input |
      | running | [~]    | list_dir | .     |
  • The BlueWhale activity indicator moves while the tool or turn is running.

  • The Statusline shows the active turn state if this workflow affects it.

  • After completion, the Task List shows:

    Then the task list should show:
      | status    | marker | tool     | input |
      | completed | ✓      | list_dir | .     |
  • The BlueWhale activity indicator stops or changes to the completed state.

  • The rendered transcript/output includes the formatted assistant answer.

Layer 3: non-happy paths

Add scenarios for:

  • LLM requests an unknown tool.
  • Tool returns an error.
  • Tool returns an empty result.
  • LLM produces malformed tool arguments.
  • The follow-up LLM answer omits the expected formatted summary.

Example Gherkin

Feature: Tool call lifecycle
  Scenario: Happy path lists the current directory through a tool
    Given an offline CodeWhale workspace containing:
      | path      | kind   |
      | README.md | file   |
      | notes.txt | file   |
      | src       | folder |
    And the mocked LLM will request the "list_dir" tool with:
      | path |
      | .    |
    And the mocked LLM will answer after the tool result:
      | content                                               |
      | The directory contains README.md, notes.txt, and src/. |
    When the user asks "list the current directory"
    Then CodeWhale should send the user request to the mocked LLM
    And the task list should show a running tool:
      | status  | marker | tool     | input |
      | running | [~]    | list_dir | .     |
    And the BlueWhale activity indicator should move while the tool is running
    And the Statusline should show the active turn state if this workflow updates it
    And the tool result should return directory entries:
      | entry     | kind   |
      | README.md | file   |
      | notes.txt | file   |
      | src       | folder |
    And CodeWhale should send the tool result back to the mocked LLM
    And the task list should show a completed tool:
      | status    | marker | tool     | input |
      | completed | ✓      | list_dir | .     |
    And the BlueWhale activity indicator should stop or show completed state
    And the public output should include "The directory contains README.md, notes.txt, and src/."

Open questions

  • What exact Statusline text or state should be expected for this workflow?
  • What is the most stable public signal for BlueWhale movement: frame-to-frame text/position changes, animation phase, or a dedicated accessible/status marker?
  • Should Windows ConPTY be enabled for this PTY test layer now, or should the first screen-level scenario stay Unix-only until the Windows input/rendering path is audited?

Paulo Aboim Pinto

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingdocumentationImprovements or additions to documentationenhancementNew feature or request

    Projects

    Status
    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions