refactor(web): per-block chat timeline + Layer 1 aggregator #271
Merged
The single TurnActivityPill anchored at the top of every assistant message hoisted reasoning + tool activity above any text, so a turn that streamed [text, tools, text] read as [pill, text, text], with the pill sitting ahead of the work it summarized.

Split the turn into chronological slices via the new `segmentTurn`: text blocks become their own slices; contiguous reasoning + tool blocks coalesce into `activity` slices. Each activity slice gets its own pill; the message body interleaves pills with prose in stream order.

The live "Thinking…" / "Calling X…" state attaches to the last activity slice; when the model emits text and is then preparing the next tool, a trailing live pill carries the indicator until the next reasoning/tool block lands and a new activity slice absorbs it.

Cross-block tool grouping (the Mercury repro) is preserved within each slice: segmentation happens before grouping, so a tool used both before and after a text block reads as two phases of work rather than silently merging across the boundary.
First-principles rewrite of the assistant turn rendering. The old single
TurnActivityPill aggregated all reasoning + tool calls into one head at
the top of the message; this hid the timeline (text rendered below an
already-completed pill) and threw away the reasoning surface entirely
once streamingState left "thinking" (the visibility gate dropped
reasoning-only pills, so the user lost the ability to drill into what
the model thought).
New model:
1. Every block (text / reasoning / tool) renders inline as its own
element at the spot it streamed. No per-turn aggregation. The
summary IS the order.
2. Non-text blocks render as chips that are self-stating: a reasoning
chip carries its own "Thinking…" / "Thought · N tokens" label and
is always clickable on settled turns; a tool chip carries its own
spinner / past-tense + duration and per-call detail in the body.
3. Consecutive tool blocks whose calls share a single tool name fold
into one ×N chip. Reasoning or text between tool blocks always
breaks the fold — those represent separate phases of work.
4. A single LiveCursor at the bottom of the message body covers the
transitions the engine spends between blocks: pre-first-block
warm-up (Thinking…), tool being built before the block lands
(Calling X…), post-tool-result digest (Analyzing…). When a block
is actively absorbing the state (streaming text or running tool),
the cursor hides — the block has its own visual.
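The consecutive-name fold from rule 3 above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `StreamBlock` and `InlineChip` shapes and the `foldSameNameTools` name are hypothetical, and the sketch assumes each tool block carries exactly one tool name.

```typescript
// Hypothetical block shape: the PR does not show its real types, and this
// sketch assumes one tool name per tool block.
type StreamBlock =
  | { kind: "text"; text: string }
  | { kind: "reasoning"; tokens: number }
  | { kind: "tool"; name: string };

// What gets rendered inline: text and reasoning pass through unchanged,
// tool blocks gain a count for the "×N" suffix.
type InlineChip =
  | { kind: "text"; text: string }
  | { kind: "reasoning"; tokens: number }
  | { kind: "tool"; name: string; count: number };

function foldSameNameTools(blocks: StreamBlock[]): InlineChip[] {
  const chips: InlineChip[] = [];
  for (const block of blocks) {
    if (block.kind !== "tool") {
      // Text or reasoning always starts a new element, which also
      // breaks any fold that was in progress.
      chips.push(block);
      continue;
    }
    const prev = chips[chips.length - 1];
    if (prev !== undefined && prev.kind === "tool" && prev.name === block.name) {
      // Directly adjacent tool block with the same name: bump the count.
      prev.count += 1;
    } else {
      chips.push({ kind: "tool", name: block.name, count: 1 });
    }
  }
  return chips;
}
```

Because the fold only looks at the immediately preceding element, a reasoning or text block between two same-name tool blocks yields two separate chips, matching the "separate phases of work" rule.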
Removed:
- TurnActivityPill (~520 lines): per-turn aggregation, visibility
gate, trailing-pill logic, head-label state machine.
- segmentTurn / TurnSegment / TimelineEntry / TurnSummary / groupTurn
/ describeTurn (~150 lines): grouping across reasoning boundaries
misrepresented the timeline; segmentation became unnecessary once
each block renders directly.
- AssistantTurnBody / ActivitySegmentView (~170 lines from the
chronological PR): superseded by direct block iteration.
Net: ~650 lines smaller. 18 new BlockTimeline tests cover the
scenarios — chronological order, consecutive-name fold (and refusal to
fold across reasoning), tone transitions, all six streamingState gap
combinations, reasoning persistence on settled turns.
Two problems addressed:
1. Contiguous reasoning + tool blocks rendered as two stacked chips —
"Thought · 25 tokens" and "Used tools · headlines news ×4 · 441ms"
sitting next to each other for what is plainly one phase of work.
The user expected one collapsible chip.
2. Expanding the reasoning chip showed a card-in-card: the body's outer
bordered panel wrapped an inner bordered reasoning pre block, with
asymmetric padding between them.
New rendering model:
- foldBlocks now partitions the turn into `text` slices and `activity`
slices. An `activity` slice is a contiguous run of reasoning + tool
blocks (any text breaks the phase). Each activity slice maps to one
chip whose body lists the rows in stream order.
- Within a slice, consecutive same-name tool blocks still fold into
one tool row with ×N. Reasoning rows each keep their own row.
- Chip head label: pure-reasoning slices read "Thought · N tokens"
(or "Thinking…" while live). Tool-bearing slices lead with the
dominant tool verb + subject + count + duration; reasoning token
count rides along as a footnote.
- Single-row slices (just reasoning, or one tool group) render the
content directly in the chip body — no nested row chrome. Multi-row
slices render row-by-row, each independently expandable.
CSS: `.turn-pill__reasoning` no longer paints its own border/background;
it inherits the chip body's panel. Padding normalized via
`.turn-pill__reasoning-wrap` / `.turn-pill__tool-wrap`.
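The text/activity partition described above can be sketched as below. This is an illustration under assumptions, not the actual `foldBlocks`: the `TurnBlock`, `ActivityRow`, and `Slice` shapes and the `partitionTurn` name are hypothetical, and the sketch covers only the slicing step (the same-name tool fold inside a slice is left out).

```typescript
// Hypothetical shapes; the real types in the PR are not shown.
type TurnBlock =
  | { kind: "text"; text: string }
  | { kind: "reasoning"; tokens: number }
  | { kind: "tool"; name: string };

// Rows inside an activity slice are the non-text blocks.
type ActivityRow = Exclude<TurnBlock, { kind: "text" }>;

type Slice =
  | { kind: "text"; text: string }
  | { kind: "activity"; rows: ActivityRow[] };

function partitionTurn(blocks: TurnBlock[]): Slice[] {
  const slices: Slice[] = [];
  for (const block of blocks) {
    if (block.kind === "text") {
      // Any text block ends the current phase of work.
      slices.push({ kind: "text", text: block.text });
      continue;
    }
    const last = slices[slices.length - 1];
    if (last !== undefined && last.kind === "activity") {
      // Contiguous reasoning + tool blocks share one slice, in stream order.
      last.rows.push(block);
    } else {
      slices.push({ kind: "activity", rows: [block] });
    }
  }
  return slices;
}
```

Each `activity` slice would then map to one chip, with single-row slices rendering their content directly and multi-row slices rendering row by row.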
The chip head had inline aggregation: when tool names matched it used the call's verb; otherwise it bailed to a hard-coded "Used tools" / "Working" string. Three calls of different *_search-shaped tools (which share verb "Searched" via inferVerb) lost the verb signal and read as the generic fallback.

Extract the aggregation into a pure function: `aggregateGroup` takes N `ToolDescription`s and returns one `GroupDescription` with verb, object, subject, count, total duration, and tone. Rules:
- verb: majority verb if it covers >50% of calls; else "Worked"
- object: only when every non-null value agrees
- subject: same rule as object (agreement or null)
- totalMs: sum of known durations (null when none)
- tone: running > error > ok
Pure, deterministic, never fails. The chip head composes its label from this shape; React does no aggregation logic of its own.

User-visible effect: in the screenshot scenario (three search-shaped tools with different names), the chip head now reads "Searched · news headlines · ×3 · 423ms" instead of "Used tools · …".

Architecture: this is Layer 1 of a deliberately layered design. Verb synonymy (e.g. "Searched" / "Looked up" / "Fetched" collapsing into one category) is a future Layer 2 taxonomy; a registry-style group renderer plugin is a future Layer 3. Neither is in this commit; both are documented in the aggregate.ts file header so future maintainers know what's intentionally absent and why.
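The aggregation rules listed in this commit can be sketched as a pure function along these lines. It is a minimal sketch as of this commit's rules (later commits change the tone rule and the "Worked" fallback); the field names on `ToolDescriptionSketch`, in particular `durationMs`, and the helper names are assumptions, since the real `aggregate.ts` is not shown.

```typescript
type Tone = "running" | "error" | "ok";

// Assumed per-call shape; only the output fields are named in the commit.
interface ToolDescriptionSketch {
  verb: string;
  object: string | null;
  subject: string | null;
  durationMs: number | null;
  tone: Tone;
}

interface GroupDescriptionSketch {
  verb: string;
  object: string | null;
  subject: string | null;
  count: number;
  totalMs: number | null;
  tone: Tone;
}

// Returns the shared value when every non-null value agrees, else null.
function agreedValue(values: (string | null)[]): string | null {
  const known = values.filter((v): v is string => v !== null);
  const first = known[0];
  if (first === undefined) return null;
  return known.every((v) => v === first) ? first : null;
}

// Returns the verb covering more than 50% of calls, else "Worked".
function majorityVerb(verbs: string[]): string {
  for (const candidate of verbs) {
    const n = verbs.filter((v) => v === candidate).length;
    if (n * 2 > verbs.length) return candidate;
  }
  return "Worked";
}

function aggregateGroupSketch(calls: ToolDescriptionSketch[]): GroupDescriptionSketch {
  const durations = calls
    .map((c) => c.durationMs)
    .filter((d): d is number => d !== null);
  // Tone precedence at this commit: running > error > ok.
  const tone: Tone = calls.some((c) => c.tone === "running")
    ? "running"
    : calls.some((c) => c.tone === "error")
      ? "error"
      : "ok";
  return {
    verb: majorityVerb(calls.map((c) => c.verb)),
    object: agreedValue(calls.map((c) => c.object)),
    subject: agreedValue(calls.map((c) => c.subject)),
    count: calls.length,
    totalMs: durations.length > 0 ? durations.reduce((a, b) => a + b, 0) : null,
    tone,
  };
}
```

The function has no failure path: an empty input simply yields the fallback verb, null fields, a count of 0, and tone "ok".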
Three different shapes were showing up for the same conceptual state
("model is doing something"):
- Running activity chip: boxed surface with border + card background
- LiveCursor: inline text with spinner, no chrome
- Settled chip + LiveCursor: muted text chip then inline cursor below
Two changes to land on one loading treatment:
1. CSS — drop the box treatment on running chips. The box now only
appears when the user has explicitly expanded a chip (a real
surface to read). Active state is communicated by the spinner and
brand color, the same as <LiveCursor>. In-flight chrome is now
consistent regardless of which surface owns the work.
2. Aggregator — when the verb falls back to "Worked" (no majority),
suppress the agreed object as well. Pairing a fallback verb with
an object reads as nonsense: "Worked manage tools" pretends we
know what happened when the verb just admitted we don't. Subject
stays (it comes from the user's input and is true regardless of
which tools ran), and the count + duration + token suffix
continue to communicate scale.
Net: one "loading" shape across the chat — spinner + brand color +
label, inline, no box. Box appears only when the user is reading
content inside an expanded chip.
The chip head was painting red whenever any call in the group errored, even when the model self-corrected and the final outcome was success. Real agentic example: model called `filters` (errored), `add`ed the missing tool, then called `filter` (succeeded). The user got what they asked for, but the chip read "error" and pulled attention to a fixed problem.

The first-principles cost: when the chip head flags every recovered error, users learn to ignore the red icon, and a genuine terminal failure no longer surfaces clearly.

Rule change in aggregateGroup:
before: any running → running; any error → error; else ok
after: any running → running; else the LAST call's tone

Recovery (error → success) reads as success in the head. Terminal failure (success → error, or all errors) reads as error. The per-call rows in the chip body still show their own tones, so the user can expand and see exactly what failed and what recovered; the chip head just stops crying wolf about it.
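The before/after tone rule can be compared side by side in a small sketch. The function names and the `CallTone` type are hypothetical, and returning "ok" for an empty group is an assumption the commit message does not state.

```typescript
type CallTone = "running" | "error" | "ok";

// Old rule: any running -> running; any error -> error; else ok.
function headToneBefore(tones: CallTone[]): CallTone {
  if (tones.some((t) => t === "running")) return "running";
  if (tones.some((t) => t === "error")) return "error";
  return "ok";
}

// New rule: any running -> running; else the last call's tone.
function headToneAfter(tones: CallTone[]): CallTone {
  if (tones.some((t) => t === "running")) return "running";
  const last = tones[tones.length - 1];
  // Assumption: an empty group reads as "ok".
  return last === undefined ? "ok" : last;
}
```

For the recovery sequence `["error", "ok", "ok"]` the old rule yields "error" and the new rule yields "ok"; for a terminal failure such as `["ok", "error"]` both yield "error".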
Calls with no input arguments (e.g. current_user(), list_active_apps()) rendered as bare "● 7ms" rows — a dot, a duration, and nothing else. The reader couldn't tell what ran without expanding the row. The row label was conditional on the input summary being non-null; when summarizeInput returned null (empty input), the label was omitted entirely. Surface the stripped tool name as a fallback so every row identifies its call. A row's job is to be self-identifying. Input preview is the *differentiator* between sibling calls, but when there isn't one the tool name is still a true identity. Duration is metadata; it should never be the only thing on the row.
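The fallback described above amounts to a one-line choice, sketched here. `summarizeInputSketch` is a hypothetical stand-in for the PR's `summarizeInput` (whose real summarizing logic is not shown), and `rowLabel` assumes the caller passes the already-stripped tool name.

```typescript
// Hypothetical stand-in: summarizes the first argument, or returns null
// when the call had no input arguments.
function summarizeInputSketch(input: Record<string, unknown>): string | null {
  const firstKey = Object.keys(input)[0];
  if (firstKey === undefined) return null;
  return `${firstKey}: ${String(input[firstKey])}`;
}

// Every row gets an identity: the input preview when there is one,
// otherwise the (already stripped) tool name.
function rowLabel(strippedToolName: string, input: Record<string, unknown>): string {
  const summary = summarizeInputSketch(input);
  return summary !== null ? summary : strippedToolName;
}
```

With this shape, a call such as `current_user()` is labeled by its tool name instead of rendering as a bare duration.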
"Worked" was verb-shaped scaffolding — it occupied the verb slot but
carried no real signal. When the group has no majority verb, the
strongest truth we have is the count: "N actions happened, here's what
was consistent across them."
before: ● Worked · ×3 · news headlines · 423ms
after: ● 3 actions · news headlines · 423ms
The aggregator now exposes `verbIsFallback` so renderers can detect
the fallback path without comparing against a magic string. The chip
head and tool row both swap to the count-led label when the flag is
true, and suppress the redundant `×N` suffix (the count is already
in the label).
Why this beats the alternatives:
- More informative: the count is real signal; a generic verb is not.
- More honest: the fallback path admits we can't characterize the
work; a verb word pretends we can.
- Removes redundancy: count was being shown twice (in the verb-paired
`×N` and conceptually as the multiplicity).
- Rhythm preserved: the dot icon anchors the eye, not the verb
shape — a noun-led label still reads as a chip row.
The verb stays a real verb when a majority exists (the common case);
fallback only fires when the model genuinely mixed verbs.
…chip labels
ActivityChip's chipHead and ToolRow each independently computed
verbIsFallback ? "${count} actions" : object ? "${verb} ${object}" : verb
plus the running ? verbPresent : verb tense pick. The decision must
stay aligned across both surfaces; without a shared kernel, future
changes to fallback rendering would have to land in two places and
silently drift if one was forgotten.
Pull both call sites onto one formatGroupLabel(group, { running }) →
{ label, showCountSuffix } helper. `running` is computed by the
caller because the phase chip combines group.tone with the
reasoning-tail-streaming flag (state the aggregator can't see),
while rows just pass group.tone === "running".
Net: ~10 lines of new helper, ~22 lines of duplication removed.
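The shared kernel could look roughly like the sketch below, built from the expression quoted in this commit message. The `GroupLabelInput` shape is an assumption (in particular, `verbPresent` living on the group), and the real helper may apply further rules to `showCountSuffix` that are not described here.

```typescript
// Assumed subset of the group description that the label needs.
interface GroupLabelInput {
  verb: string; // past tense, e.g. "Searched"
  verbPresent: string; // present tense, e.g. "Searching"
  object: string | null;
  count: number;
  verbIsFallback: boolean;
}

interface GroupLabel {
  label: string;
  showCountSuffix: boolean;
}

function formatGroupLabelSketch(
  group: GroupLabelInput,
  opts: { running: boolean },
): GroupLabel {
  if (group.verbIsFallback) {
    // Count-led label; the "×N" suffix would repeat the count, so hide it.
    return { label: `${group.count} actions`, showCountSuffix: false };
  }
  // Tense pick: the caller decides what "running" means for its surface.
  const verb = opts.running ? group.verbPresent : group.verb;
  const label = group.object ? `${verb} ${group.object}` : verb;
  return { label, showCountSuffix: true };
}
```

A phase chip would pass its own combined running flag, while a tool row would pass `group.tone === "running"`, so both surfaces get the same fallback and tense behavior from one place.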
Summary
Rewrites the assistant chat rendering from a single per-turn pill to a per-block inline timeline. Each thing the model emits (text, reasoning, tool calls) renders at the spot it streamed; a small `<LiveCursor>` covers the gaps between blocks. The chip label that summarizes a phase of work now comes from a pure `aggregateGroup` function in `tool-display`, not inline React logic.

Originally opened as a narrower segmentation fix (commit 85114fa); the subsequent 8 commits replaced that approach with the per-block + aggregator design documented here. If squash-merging, please use this body as the commit message.

What this fixes
- A turn that streams `text → tool → text` reads as exactly that.
- Before: the reasoning chip disappeared once `streamingState` left `thinking`. After: the chip stays as a clickable "Thought · N tokens" artifact persistently.
- A verb majority (e.g. three `*_search` tools all sharing verb "Searched") produces "Searched"; when verbs truly disagree the label becomes count-led ("3 actions") instead of verb-shaped scaffolding.
- `error → … → success` reads as success (recovery). Per-call rows in the body still show their individual tones.
- Before: `current_user()` and friends showed just `● 7ms` with no identity. After: the tool name fills the slot when no input summary exists.
- Before: expanding the reasoning chip showed a card-in-card, a bordered panel wrapping a bordered `<pre>`. After: flat panel with normalized padding.

New architecture
- `web/src/components/BlockTimeline.tsx` (new): folds contiguous reasoning + tool blocks in `blocks[]` into one phase chip per activity slice, and renders text and chips in stream order. Hosts `ActivityChip`, `ToolRow`, `ToolCallRow`, `LiveCursor`, `ToolWidgets`. Shares one label-composition kernel (`formatGroupLabel`) between phase chips and tool rows so fallback / tense / count-suffix rules stay aligned.
- `web/src/lib/tool-display/aggregate.ts` (new): `aggregateGroup(descriptions[]) → GroupDescription` with `verb`, `object`, `subject`, `count`, `totalMs`, `tone`, and `verbIsFallback`. Documents what's intentionally absent (Layer 2 verb taxonomy, Layer 3 plugin renderers) and why.
- `MessageList.tsx`: `<BlockTimeline>`; ~75 lines of inline aggregation + widget plumbing removed (−89/+14).
- `index.css`: `.live-cursor` styles; reasoning content de-bordered to fix card-in-card; running chip no longer gets box treatment (only `[data-expanded="true"]` does).

Removed
- `TurnActivityPill.tsx` (~515 lines): per-turn aggregation, visibility gate, trailing-pill logic, head-label state machine.
- `tool-display/turn.ts` (~113 lines): `segmentTurn`, `groupTurn`, `describeTurn`, `TimelineEntry`, `TurnSegment`, `TurnSummary`. Aggregation moved to `aggregate.ts`; segmentation became unnecessary once blocks render directly.
- `TurnActivityPill.test.tsx` (~471 lines): replaced by `BlockTimeline.test.tsx` (22 tests covering the same scenarios at the new architecture's seams).

UX rules that emerged
Known heuristic limit
The "terminal outcome wins" rule (#6) is a heuristic: it correctly reads `error → retry → success` as recovery, but it can't distinguish that from `independent operation failed → unrelated operation succeeded` within the same phase. For example, `[delete error, search ok]` would show the chip head as ok even though the delete genuinely failed. The error is still visible on the per-call row inside, so the user can find it on expand; the head just doesn't shout about it. The tradeoff is deliberate: false negatives on independent trailing successes are rarer and less costly than false positives on every recovered error.

Files
12 changed, +1689 / −1241:
- `web/src/components/BlockTimeline.tsx` (737 lines)
- `web/src/lib/tool-display/aggregate.ts` (137 lines)
- `web/test/BlockTimeline.test.tsx` (438 lines, 22 tests)
- `web/test/aggregate.test.ts` (303 lines, 29 tests)
- `web/src/components/MessageList.tsx` (+14 / −89)
- `web/src/components/MessageInput.tsx` (comment reference)
- `web/src/index.css` (+30 / −14)
- `web/src/lib/tool-display/index.ts`
- `web/src/lib/tool-display/types.ts` (removed unused types)
- `web/src/components/TurnActivityPill.tsx`
- `web/src/lib/tool-display/turn.ts`
- `web/test/TurnActivityPill.test.tsx`

Test plan
- `bun run verify:static` (format, lint, tsc strict for src + web, all custom checks)
- `bun run test:unit`: 3015 pass (server)
- `bun run test:web`: 290 pass (web), 51 new tests across `BlockTimeline.test.tsx` (22) and `aggregate.test.ts` (29)
- `text → tools → text`; verify pills render between text spans, not above.
- `ui://` resource; verify the inline widget renders below the chip in its activity slice, not at message top.

Deliberately deferred
Per the file header in `aggregate.ts`:
- Layer 2 verb taxonomy: verb synonymy (e.g. "Searched" / "Looked up" / "Fetched" collapsing into one category).
- Layer 3 plugin renderers: a `registerGroupRenderer` API parallel to `registerToolRenderer` so domain bundles can describe their own cross-tool workflows. No bundle has asked for this yet; building it on spec would be speculative API surface.

Both are documented inline so the next person to open `aggregate.ts` knows what's absent and why.