refactor(web): per-block chat timeline + Layer 1 aggregator #271

Merged: mgoldsborough merged 9 commits into main from feat/chronological-turn-segments on May 23, 2026

@mgoldsborough mgoldsborough commented May 23, 2026

Summary

Rewrites the assistant chat rendering from a single per-turn pill to a per-block inline timeline. Each thing the model emits — text, reasoning, tool calls — renders at the spot it streamed; a small <LiveCursor> covers the gaps between blocks. The chip label that summarizes a phase of work now comes from a pure aggregateGroup function in tool-display, not inline React logic.

Originally opened as a narrower segmentation fix (commit 85114fa); the subsequent 8 commits replaced that approach with the per-block + aggregator design documented here. If squash-merging, please use this body as the commit message.

What this fixes

  • Hoisted ordering. Before: a single pill anchored at the top of every assistant message, ahead of the text. After: blocks render in stream order, so text → tool → text reads as exactly that.
  • Reasoning was disposable. Before: the visibility gate hid reasoning-only pills once streamingState left thinking. After: the chip stays as a clickable "Thought · N tokens" artifact persistently.
  • Inconsistent loading states. Before: running chip was a boxed surface; LiveCursor was inline text — same conceptual state, different shapes. After: one inline shape (spinner + brand color); the box appears only when the user explicitly expands a chip.
  • "Used tools" generic fallback. Before: any mix of tool names degraded to "Used tools". After: cross-tool verb agreement (e.g. three different *_search tools all sharing verb "Searched") produces "Searched"; when verbs truly disagree the label becomes count-led ("3 actions") instead of verb-shaped scaffolding.
  • Self-corrected errors flagged as failures. Before: any errored call escalated the chip head to red. After: the head reflects the terminal outcome — error → … → success reads as success (recovery). Per-call rows in the body still show their individual tones.
  • No-input calls rendered as empty rows. Before: current_user() and friends showed just ● 7ms with no identity. After: the tool name fills the slot when no input summary exists.
  • Card-in-card on reasoning expansion. Before: opening a reasoning chip showed a bordered panel with an inner bordered <pre>. After: flat panel with normalized padding.

New architecture

| Module | Responsibility |
| --- | --- |
| web/src/components/BlockTimeline.tsx (new) | Iterates a message's blocks[], folds contiguous reasoning + tool blocks into one phase chip per activity slice, renders text and chips in stream order. Hosts ActivityChip, ToolRow, ToolCallRow, LiveCursor, ToolWidgets. Shares one label-composition kernel (formatGroupLabel) between phase chips and tool rows so fallback / tense / count-suffix rules stay aligned. |
| web/src/lib/tool-display/aggregate.ts (new) | Layer 1 of the tool-display aggregation stack. Pure function: aggregateGroup(descriptions[]) → GroupDescription with verb, object, subject, count, totalMs, tone, and verbIsFallback. Documents what's intentionally absent (Layer 2 verb taxonomy, Layer 3 plugin renderers) and why. |
| MessageList.tsx | Now delegates assistant-body rendering to <BlockTimeline>. ~75 lines of inline aggregation + widget plumbing removed (−89/+14). |
| index.css | New .live-cursor styles; reasoning content de-bordered to fix card-in-card; running chip no longer gets box treatment (only [data-expanded="true"] does). |

Removed

  • TurnActivityPill.tsx (~515 lines) — per-turn aggregation, visibility gate, trailing-pill logic, head-label state machine.
  • tool-display/turn.ts (~113 lines) — segmentTurn, groupTurn, describeTurn, TimelineEntry, TurnSegment, TurnSummary. Aggregation moved to aggregate.ts; segmentation became unnecessary once blocks render directly.
  • TurnActivityPill.test.tsx (~471 lines) — replaced by BlockTimeline.test.tsx (22 tests covering the same scenarios at the new architecture's seams).

UX rules that emerged

  1. Faithful timeline — blocks render at the spot they streamed; no hoisting.
  2. One chip per phase — contiguous reasoning + tool blocks (no text between) collapse to one collapsible chip whose body lists each row.
  3. Self-stating chips — chip status (running / done / error) derives from its own contents; no separate turn-level status surface.
  4. One live cursor for the gaps — message-level live indicator only renders when no block is absorbing the state.
  5. Honest fallbacks — when the aggregator can't pick a verb, the chip says "N actions" (count-led) rather than dressing up the unknown.
  6. Terminal outcome wins — recovered errors don't escalate the chip head; the user still sees them per-row on expand.
  7. Rows are self-identifying — a per-call row always shows the tool name (falling back from the input preview when there is none).

Known heuristic limit

The "terminal outcome wins" rule (#6) is a heuristic: it correctly reads error → retry → success as recovery, but it can't distinguish that from independent operation failed → unrelated operation succeeded within the same phase — e.g. [delete error, search ok] would show the chip head as ok even though the delete genuinely failed. The error is still visible on the per-call row inside, so the user can find it on expand; the head just doesn't shout about it. The tradeoff is deliberate: false negatives on independent trailing successes are rarer and less costly than false positives on every recovered error.

Files

12 changed, +1689 / −1241:

  • new web/src/components/BlockTimeline.tsx (737 lines)
  • new web/src/lib/tool-display/aggregate.ts (137 lines)
  • new web/test/BlockTimeline.test.tsx (438 lines, 22 tests)
  • new web/test/aggregate.test.ts (303 lines, 29 tests)
  • modified web/src/components/MessageList.tsx (+14 / −89)
  • modified web/src/components/MessageInput.tsx (comment reference)
  • modified web/src/index.css (+30 / −14)
  • modified web/src/lib/tool-display/index.ts
  • modified web/src/lib/tool-display/types.ts (removed unused types)
  • deleted web/src/components/TurnActivityPill.tsx
  • deleted web/src/lib/tool-display/turn.ts
  • deleted web/test/TurnActivityPill.test.tsx

Test plan

  • bun run verify:static (format, lint, tsc strict for src + web, all custom checks)
  • bun run test:unit — 3015 pass (server)
  • bun run test:web — 290 pass (web), 51 new tests across BlockTimeline.test.tsx (22) and aggregate.test.ts (29)
  • Manual: stream a turn with text → tools → text; verify pills render between text spans, not above.
  • Manual: stream a long-running tool; verify spinner on the chip during work, no separate LiveCursor below.
  • Manual: trigger a self-correcting tool flow (error → retry → success); verify chip head is muted (not red), error visible on row expand.
  • Manual: trigger a synapse-app tool that returns a ui:// resource; verify the inline widget renders below the chip in its activity slice, not at message top.
  • Manual: trigger a phase where multiple tools mix verbs; verify head reads "N actions · subject" (no "×N" suffix and no "Worked" verb).

Deliberately deferred

Per the file header in aggregate.ts:

  • Layer 2 — verb taxonomy. Lifting surface verbs into semantic categories (e.g. "Searched"/"Listed"/"Looked up" → SEARCH) so synonyms cluster before majority voting. Cheap, deterministic improvement when we have evidence verbs don't agree often enough to matter.
  • Layer 3 — group renderer plugins. A registerGroupRenderer API parallel to registerToolRenderer so domain bundles can describe their own cross-tool workflows. No bundle has asked for this yet; building it on spec would be speculative API surface.

Both are documented inline so the next person to open aggregate.ts knows what's absent and why.

The single TurnActivityPill anchored at the top of every assistant
message hoisted reasoning + tool activity above any text — so a turn
that streamed [text, tools, text] read as [pill, text, text], with the
pill sitting ahead of the work it summarized.

Split the turn into chronological slices via the new `segmentTurn`:
text blocks become their own slices; contiguous reasoning + tool blocks
coalesce into `activity` slices. Each activity slice gets its own pill;
the message body interleaves pills with prose in stream order.

The live "Thinking…" / "Calling X…" state attaches to the last activity
slice; when the model emits text and is then preparing the next tool, a
trailing live pill carries the indicator until the next reasoning/tool
block lands and a new activity slice absorbs it.

Cross-block tool grouping (the Mercury repro) is preserved within each
slice — segmentation happens before grouping, so a tool used both
before and after a text block reads as two phases of work rather than
silently merging across the boundary.

First-principles rewrite of the assistant turn rendering. The old single
TurnActivityPill aggregated all reasoning + tool calls into one head at
the top of the message; this hid the timeline (text rendered below an
already-completed pill) and threw away the reasoning surface entirely
once streamingState left "thinking" (the visibility gate dropped
reasoning-only pills, so the user lost the ability to drill into what
the model thought).

New model:

  1. Every block (text / reasoning / tool) renders inline as its own
     element at the spot it streamed. No per-turn aggregation. The
     summary IS the order.
  2. Non-text blocks render as chips that are self-stating: a reasoning
     chip carries its own "Thinking…" / "Thought · N tokens" label and
     is always clickable on settled turns; a tool chip carries its own
     spinner / past-tense + duration and per-call detail in the body.
  3. Consecutive tool blocks whose calls share a single tool name fold
     into one ×N chip. Reasoning or text between tool blocks always
     breaks the fold — those represent separate phases of work.
  4. A single LiveCursor at the bottom of the message body covers the
     transitions the engine spends between blocks: pre-first-block
     warm-up (Thinking…), tool being built before the block lands
     (Calling X…), post-tool-result digest (Analyzing…). When a block
     is actively absorbing the state (streaming text or running tool),
     the cursor hides — the block has its own visual.
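
The cursor rule in item 4 can be sketched as a small pure function. This is only an illustration: the state names `idle`, `calling`, and `analyzing`, the `blockIsAbsorbing` flag, and the function name `liveCursorLabel` are assumptions, since the commit message only names the "thinking" state and the three labels.

```typescript
// Hypothetical state shape; only "thinking" is named in the source.
type StreamingState =
  | { kind: "idle" }
  | { kind: "thinking" }
  | { kind: "calling"; toolName: string }
  | { kind: "analyzing" };

// Returns the label the cursor should show, or null when it should hide.
function liveCursorLabel(
  state: StreamingState,
  blockIsAbsorbing: boolean,
): string | null {
  // A streaming text block or running tool chip already has its own visual.
  if (blockIsAbsorbing) return null;
  switch (state.kind) {
    case "thinking":
      return "Thinking…";
    case "calling":
      return `Calling ${state.toolName}…`;
    case "analyzing":
      return "Analyzing…";
    case "idle":
      return null;
  }
}
```

Keeping the decision in one function means the component only has to render or skip a single element.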

Removed:

  - TurnActivityPill (~520 lines): per-turn aggregation, visibility
    gate, trailing-pill logic, head-label state machine.
  - segmentTurn / TurnSegment / TimelineEntry / TurnSummary / groupTurn
    / describeTurn (~150 lines): grouping across reasoning boundaries
    misrepresented the timeline; segmentation became unnecessary once
    each block renders directly.
  - AssistantTurnBody / ActivitySegmentView (~170 lines from the
    chronological PR): superseded by direct block iteration.

Net: ~650 lines smaller. 18 new BlockTimeline tests cover the
scenarios — chronological order, consecutive-name fold (and refusal to
fold across reasoning), tone transitions, all six streamingState gap
combinations, reasoning persistence on settled turns.

Two problems addressed:

1. Contiguous reasoning + tool blocks rendered as two stacked chips —
   "Thought · 25 tokens" and "Used tools · headlines news ×4 · 441ms"
   sitting next to each other for what is plainly one phase of work.
   The user expected one collapsible chip.

2. Expanding the reasoning chip showed a card-in-card: the body's outer
   bordered panel wrapped an inner bordered reasoning pre block, with
   asymmetric padding between them.

New rendering model:

  - foldBlocks now partitions the turn into `text` slices and `activity`
    slices. An `activity` slice is a contiguous run of reasoning + tool
    blocks (any text breaks the phase). Each activity slice maps to one
    chip whose body lists the rows in stream order.
  - Within a slice, consecutive same-name tool blocks still fold into
    one tool row with ×N. Reasoning rows each keep their own row.
  - Chip head label: pure-reasoning slices read "Thought · N tokens"
    (or "Thinking…" while live). Tool-bearing slices lead with the
    dominant tool verb + subject + count + duration; reasoning token
    count rides along as a footnote.
  - Single-row slices (just reasoning, or one tool group) render the
    content directly in the chip body — no nested row chrome. Multi-row
    slices render row-by-row, each independently expandable.
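
A minimal sketch of the partitioning described above, under assumed shapes. The `Block`, `Row`, and `Slice` types are hypothetical stand-ins; only the name `foldBlocks` and the folding rules come from the commit message.

```typescript
// Hypothetical block shape; only `kind` and the tool `name` matter for folding.
type Block =
  | { kind: "text"; text: string }
  | { kind: "reasoning"; text: string }
  | { kind: "tool"; name: string };

type Row =
  | { kind: "reasoning"; text: string }
  | { kind: "tool"; name: string; count: number };

type Slice =
  | { kind: "text"; text: string }
  | { kind: "activity"; rows: Row[] };

function foldBlocks(blocks: Block[]): Slice[] {
  const slices: Slice[] = [];
  for (const block of blocks) {
    if (block.kind === "text") {
      // Any text ends the current phase and becomes its own slice.
      slices.push({ kind: "text", text: block.text });
      continue;
    }
    // Reuse the open activity slice, or start a new one.
    let current: Slice | undefined = slices[slices.length - 1];
    if (current === undefined || current.kind !== "activity") {
      current = { kind: "activity", rows: [] };
      slices.push(current);
    }
    const lastRow: Row | undefined = current.rows[current.rows.length - 1];
    if (
      block.kind === "tool" &&
      lastRow !== undefined &&
      lastRow.kind === "tool" &&
      lastRow.name === block.name
    ) {
      // Consecutive same-name tool blocks fold into one row with ×N.
      lastRow.count += 1;
    } else if (block.kind === "tool") {
      current.rows.push({ kind: "tool", name: block.name, count: 1 });
    } else {
      // Reasoning rows each keep their own row.
      current.rows.push({ kind: "reasoning", text: block.text });
    }
  }
  return slices;
}
```

With this shape, a slice whose `rows` has length one can render its content directly in the chip body, and longer slices render row by row.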

CSS: `.turn-pill__reasoning` no longer paints its own border/background;
it inherits the chip body's panel. Padding normalized via
`.turn-pill__reasoning-wrap` / `.turn-pill__tool-wrap`.

The chip head had inline aggregation: when tool names matched it used
the call's verb; otherwise it bailed to a hard-coded "Used tools" /
"Working" string. Three calls of different *_search-shaped tools (which
share verb "Searched" via inferVerb) lost the verb signal and read as
the generic fallback.

Extract the aggregation into a pure function: `aggregateGroup` takes
N `ToolDescription`s and returns one `GroupDescription` with verb,
object, subject, count, total duration, and tone. Rules:

  - verb       majority verb if it covers >50% of calls; else "Worked"
  - object     only when every non-null value agrees
  - subject    same rule as object — agreement or null
  - totalMs    sum of known durations (null when none)
  - tone       running > error > ok

Pure, deterministic, never fails. The chip head composes its label
from this shape; React does no aggregation logic of its own.
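
The rules listed above could look roughly like the following. This is a sketch of the rules as stated in this commit message, not the contents of aggregate.ts: the exact field types, the `durationMs` field name, and the `agreed` helper are assumptions.

```typescript
// Hypothetical shapes: field names follow the commit message, types are assumed.
type Tone = "running" | "error" | "ok";

interface ToolDescription {
  verb: string;
  object: string | null;
  subject: string | null;
  durationMs: number | null; // assumed field name
  tone: Tone;
}

interface GroupDescription {
  verb: string;
  object: string | null;
  subject: string | null;
  count: number;
  totalMs: number | null;
  tone: Tone;
}

// Returns the shared value when every non-null entry agrees, otherwise null.
function agreed(values: Array<string | null>): string | null {
  const present = values.filter((v): v is string => v !== null);
  if (present.length === 0) return null;
  return present.every((v) => v === present[0]) ? present[0] : null;
}

function aggregateGroup(descriptions: ToolDescription[]): GroupDescription {
  const count = descriptions.length;

  // Majority verb: must cover strictly more than half of the calls.
  const tally = new Map<string, number>();
  descriptions.forEach((d) => tally.set(d.verb, (tally.get(d.verb) ?? 0) + 1));
  let verb = "Worked";
  tally.forEach((n, candidate) => {
    if (n * 2 > count) verb = candidate;
  });

  // Sum only the known durations; null when none are known.
  const known = descriptions
    .map((d) => d.durationMs)
    .filter((ms): ms is number => ms !== null);
  const totalMs = known.length > 0 ? known.reduce((a, b) => a + b, 0) : null;

  // Tone precedence at this commit: running > error > ok.
  const tone: Tone = descriptions.some((d) => d.tone === "running")
    ? "running"
    : descriptions.some((d) => d.tone === "error")
      ? "error"
      : "ok";

  return {
    verb,
    object: agreed(descriptions.map((d) => d.object)),
    subject: agreed(descriptions.map((d) => d.subject)),
    count,
    totalMs,
    tone,
  };
}
```

Because at most one verb can cover more than half of the calls, the order in which the tally is walked does not affect the result, which keeps the function deterministic.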

User-visible effect: in the screenshot scenario (three search-shaped
tools with different names), the chip head now reads
"Searched · news headlines · ×3 · 423ms" instead of "Used tools · …".

Architecture: this is Layer 1 of a deliberately layered design. Verb
synonymy (e.g. "Searched" / "Looked up" / "Fetched" collapsing into
one category) is a future Layer 2 taxonomy; a registry-style group
renderer plugin is a future Layer 3. Neither is in this commit — both
are documented in the aggregate.ts file header so future maintainers
know what's intentionally absent and why.

Three different shapes were showing up for the same conceptual state
("model is doing something"):

  - Running activity chip: boxed surface with border + card background
  - LiveCursor: inline text with spinner, no chrome
  - Settled chip + LiveCursor: muted text chip then inline cursor below

Two changes to land on one loading treatment:

1. CSS — drop the box treatment on running chips. The box now only
   appears when the user has explicitly expanded a chip (a real
   surface to read). Active state is communicated by the spinner and
   brand color, the same as <LiveCursor>. In-flight chrome is now
   consistent regardless of which surface owns the work.

2. Aggregator — when the verb falls back to "Worked" (no majority),
   suppress the agreed object as well. Pairing a fallback verb with
   an object reads as nonsense: "Worked manage tools" pretends we
   know what happened when the verb just admitted we don't. Subject
   stays (it comes from the user's input and is true regardless of
   which tools ran), and the count + duration + token suffix
   continue to communicate scale.

Net: one "loading" shape across the chat — spinner + brand color +
label, inline, no box. Box appears only when the user is reading
content inside an expanded chip.

The chip head was painting red whenever any call in the group errored,
even when the model self-corrected and the final outcome was success.
Real agentic example: model called `filters` (errored), `add`ed the
missing tool, then called `filter` (succeeded). The user got what they
asked for, but the chip read "error" and pulled attention to a fixed
problem.

The first-principles cost: when the chip head flags every recovered
error, users learn to ignore the red icon, and a genuine terminal
failure no longer surfaces clearly.

Rule change in aggregateGroup:

  before: any running → running; any error → error; else ok
  after:  any running → running; else the LAST call's tone

Recovery (error → success) reads as success in the head. Terminal
failure (success → error, or all errors) reads as error. The per-call
rows in the chip body still show their own tones, so the user can
expand and see exactly what failed and what recovered — the chip head
just stops crying wolf about it.
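
The before/after rule can be written as two small functions over the per-call tones. The names `toneBefore` and `toneAfter` are hypothetical, and treating an empty group as "ok" is an assumption the commit message does not address.

```typescript
type Tone = "running" | "error" | "ok";

// Rule before this commit: any error in the group escalates the head.
function toneBefore(tones: Tone[]): Tone {
  if (tones.some((t) => t === "running")) return "running";
  if (tones.some((t) => t === "error")) return "error";
  return "ok";
}

// Rule after this commit: a running call still wins; otherwise the
// last call's tone decides the head.
function toneAfter(tones: Tone[]): Tone {
  if (tones.some((t) => t === "running")) return "running";
  // Assumption: an empty group reads as "ok".
  return tones.length > 0 ? tones[tones.length - 1] : "ok";
}

// Recovery: error then success.
const recovered: Tone[] = ["error", "ok"];
// Terminal failure: success then error.
const failed: Tone[] = ["ok", "error"];

console.log(toneBefore(recovered), toneAfter(recovered)); // error ok
console.log(toneBefore(failed), toneAfter(failed)); // error error
```

The same two-element input `["error", "ok"]` also models the case where an unrelated call succeeds after a genuine failure, which the head alone cannot tell apart from recovery; the per-call rows remain the place to check.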
Calls with no input arguments (e.g. current_user(), list_active_apps())
rendered as bare "● 7ms" rows — a dot, a duration, and nothing else.
The reader couldn't tell what ran without expanding the row.

The row label was conditional on the input summary being non-null;
when summarizeInput returned null (empty input), the label was
omitted entirely. Surface the stripped tool name as a fallback so
every row identifies its call.

A row's job is to be self-identifying. Input preview is the
*differentiator* between sibling calls, but when there isn't one the
tool name is still a true identity. Duration is metadata; it should
never be the only thing on the row.
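
A sketch of the fallback, with heavy assumptions: the `summarizeInput` below is a simplified stand-in for the real helper, `rowLabel` and `stripToolName` are hypothetical names, and "stripped" is assumed to mean dropping a namespace prefix separated by a double underscore.

```typescript
// Simplified stand-in for summarizeInput: null when the call has no arguments.
function summarizeInput(input: Record<string, unknown>): string | null {
  const keys = Object.keys(input);
  if (keys.length === 0) return null;
  return keys.map((k) => `${k}: ${String(input[k])}`).join(", ");
}

// Assumption: "stripped" means removing a prefix such as "server__".
function stripToolName(name: string): string {
  const idx = name.lastIndexOf("__");
  return idx === -1 ? name : name.slice(idx + 2);
}

// The input preview differentiates sibling calls; the tool name is the
// fallback identity when there is no preview.
function rowLabel(toolName: string, input: Record<string, unknown>): string {
  return summarizeInput(input) ?? stripToolName(toolName);
}

console.log(rowLabel("current_user", {})); // current_user
console.log(rowLabel("search", { q: "news" })); // q: news
```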
"Worked" was verb-shaped scaffolding — it occupied the verb slot but
carried no real signal. When the group has no majority verb, the
strongest truth we have is the count: "N actions happened, here's what
was consistent across them."

  before: ● Worked · ×3 · news headlines · 423ms
  after:  ● 3 actions · news headlines · 423ms

The aggregator now exposes `verbIsFallback` so renderers can detect
the fallback path without comparing against a magic string. The chip
head and tool row both swap to the count-led label when the flag is
true, and suppress the redundant `×N` suffix (the count is already
in the label).

Why this beats the alternatives:

  - More informative: the count is real signal; a generic verb is not.
  - More honest: the fallback path admits we can't characterize the
    work; a verb word pretends we can.
  - Removes redundancy: count was being shown twice (in the verb-paired
    `×N` and conceptually as the multiplicity).
  - Rhythm preserved: the dot icon anchors the eye, not the verb
    shape — a noun-led label still reads as a chip row.

The verb stays a real verb when a majority exists (the common case);
fallback only fires when the model genuinely mixed verbs.
@mgoldsborough mgoldsborough changed the title from "fix(web): render assistant turn segments in stream order" to "refactor(web): per-block chat timeline + Layer 1 aggregator" May 23, 2026
…chip labels

ActivityChip's chipHead and ToolRow each independently computed
`verbIsFallback ? "${count} actions" : object ? "${verb} ${object}" : verb`
plus the `running ? verbPresent : verb` tense pick. The decision must
stay aligned across both surfaces; without a shared kernel, future
changes to fallback rendering would have to land in two places and
silently drift if one was forgotten.

Pull both call sites onto one formatGroupLabel(group, { running }) →
{ label, showCountSuffix } helper. `running` is computed by the
caller because the phase chip combines group.tone with the
reasoning-tail-streaming flag (state the aggregator can't see),
while rows just pass group.tone === "running".

Net: ~10 lines of new helper, ~22 lines of duplication removed.
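
A sketch of what the shared helper might look like, built from the two expressions quoted in this commit message. The `GroupLabelInput` interface is a hypothetical subset of `GroupDescription`, placing `verbPresent` on the group is an assumption, and so is the rule that the ×N suffix appears only for more than one call.

```typescript
// Hypothetical subset of GroupDescription used for labelling.
interface GroupLabelInput {
  verb: string; // past tense, e.g. "Searched"
  verbPresent: string; // present tense, e.g. "Searching" (assumed field)
  object: string | null;
  count: number;
  verbIsFallback: boolean;
}

interface GroupLabel {
  label: string;
  showCountSuffix: boolean;
}

function formatGroupLabel(
  group: GroupLabelInput,
  opts: { running: boolean },
): GroupLabel {
  // Fallback path: lead with the count and drop the redundant ×N suffix.
  if (group.verbIsFallback) {
    return { label: `${group.count} actions`, showCountSuffix: false };
  }
  // The caller decides `running`, since it may depend on state the
  // aggregator cannot see.
  const verb = opts.running ? group.verbPresent : group.verb;
  const label = group.object ? `${verb} ${group.object}` : verb;
  // Assumption: the suffix is only shown when there is more than one call.
  return { label, showCountSuffix: group.count > 1 };
}
```

A phase chip would pass its combined running flag, while a row would pass `group.tone === "running"`, so both surfaces go through the same decision.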
@mgoldsborough mgoldsborough added the qa-reviewed label (QA review completed with no critical issues) May 23, 2026
@mgoldsborough mgoldsborough merged commit a0addc4 into main May 23, 2026
4 checks passed
@mgoldsborough mgoldsborough deleted the feat/chronological-turn-segments branch May 23, 2026 21:42