refactor(web): per-block chat timeline + Layer 1 aggregator by mgoldsborough · Pull Request #271 · NimbleBrainInc/nimblebrain

mgoldsborough · 2026-05-23T18:46:18Z

Summary

Rewrites the assistant chat rendering from a single per-turn pill to a per-block inline timeline. Each thing the model emits — text, reasoning, tool calls — renders at the spot it streamed; a small <LiveCursor> covers the gaps between blocks. The chip label that summarizes a phase of work now comes from a pure aggregateGroup function in tool-display, not inline React logic.

Originally opened as a narrower segmentation fix (commit 85114fa); the subsequent 8 commits replaced that approach with the per-block + aggregator design documented here. If squash-merging, please use this body as the commit message.

What this fixes

Hoisted ordering. Before: a single pill anchored at the top of every assistant message, ahead of the text. After: blocks render in stream order, so text → tool → text reads as exactly that.
Reasoning was disposable. Before: the visibility gate hid reasoning-only pills once streamingState left thinking. After: the chip stays as a clickable "Thought · N tokens" artifact persistently.
Inconsistent loading states. Before: running chip was a boxed surface; LiveCursor was inline text — same conceptual state, different shapes. After: one inline shape (spinner + brand color); the box appears only when the user explicitly expands a chip.
"Used tools" generic fallback. Before: any mix of tool names degraded to "Used tools". After: cross-tool verb agreement (e.g. three different *_search tools all sharing verb "Searched") produces "Searched"; when verbs truly disagree the label becomes count-led ("3 actions") instead of verb-shaped scaffolding.
Self-corrected errors flagged as failures. Before: any errored call escalated the chip head to red. After: the head reflects the terminal outcome — error → … → success reads as success (recovery). Per-call rows in the body still show their individual tones.
No-input calls rendered as empty rows. Before: current_user() and friends showed just ● 7ms with no identity. After: the tool name fills the slot when no input summary exists.
Card-in-card on reasoning expansion. Before: opening a reasoning chip showed a bordered panel with an inner bordered <pre>. After: flat panel with normalized padding.

New architecture

| Module | Responsibility |
| --- | --- |
| `web/src/components/BlockTimeline.tsx` (new) | Iterates a message's `blocks[]`, folds contiguous reasoning + tool blocks into one phase chip per activity slice, renders text and chips in stream order. Hosts `ActivityChip`, `ToolRow`, `ToolCallRow`, `LiveCursor`, `ToolWidgets`. Shares one label-composition kernel (`formatGroupLabel`) between phase chips and tool rows so fallback / tense / count-suffix rules stay aligned. |
| `web/src/lib/tool-display/aggregate.ts` (new) | Layer 1 of the tool-display aggregation stack. Pure function: `aggregateGroup(descriptions[]) → GroupDescription` with `verb`, `object`, `subject`, `count`, `totalMs`, `tone`, and `verbIsFallback`. Documents what's intentionally absent (Layer 2 verb taxonomy, Layer 3 plugin renderers) and why. |
| `MessageList.tsx` | Now delegates assistant-body rendering to `<BlockTimeline>`. ~75 lines of inline aggregation + widget plumbing removed (−89/+14). |
| `index.css` | New `.live-cursor` styles; reasoning content de-bordered to fix card-in-card; running chip no longer gets box treatment (only `[data-expanded="true"]` does). |

Removed

TurnActivityPill.tsx (~515 lines) — per-turn aggregation, visibility gate, trailing-pill logic, head-label state machine.
tool-display/turn.ts (~113 lines) — segmentTurn, groupTurn, describeTurn, TimelineEntry, TurnSegment, TurnSummary. Aggregation moved to aggregate.ts; segmentation became unnecessary once blocks render directly.
TurnActivityPill.test.tsx (~471 lines) — replaced by BlockTimeline.test.tsx (22 tests covering the same scenarios at the new architecture's seams).

UX rules that emerged

Faithful timeline — blocks render at the spot they streamed; no hoisting.
One chip per phase — contiguous reasoning + tool blocks (no text between) collapse to one collapsible chip whose body lists each row.
Self-stating chips — chip status (running / done / error) derives from its own contents; no separate turn-level status surface.
One live cursor for the gaps — message-level live indicator only renders when no block is absorbing the state.
Honest fallbacks — when the aggregator can't pick a verb, the chip says "N actions" (count-led) rather than dressing up the unknown.
Terminal outcome wins — recovered errors don't escalate the chip head; the user still sees them per-row on expand.
Rows are self-identifying — a per-call row always shows the tool name (falling back from the input preview when there is none).

Known heuristic limit

The "terminal outcome wins" rule (#6) is a heuristic: it correctly reads error → retry → success as recovery, but it can't distinguish that from independent operation failed → unrelated operation succeeded within the same phase — e.g. [delete error, search ok] would show the chip head as ok even though the delete genuinely failed. The error is still visible on the per-call row inside, so the user can find it on expand; the head just doesn't shout about it. The tradeoff is deliberate: false negatives on independent trailing successes are rarer and less costly than false positives on every recovered error.

Files

12 changed, +1689 / −1241:

new web/src/components/BlockTimeline.tsx (737 lines)
new web/src/lib/tool-display/aggregate.ts (137 lines)
new web/test/BlockTimeline.test.tsx (438 lines, 22 tests)
new web/test/aggregate.test.ts (303 lines, 29 tests)
modified web/src/components/MessageList.tsx (+14 / −89)
modified web/src/components/MessageInput.tsx (comment reference)
modified web/src/index.css (+30 / −14)
modified web/src/lib/tool-display/index.ts
modified web/src/lib/tool-display/types.ts (removed unused types)
deleted web/src/components/TurnActivityPill.tsx
deleted web/src/lib/tool-display/turn.ts
deleted web/test/TurnActivityPill.test.tsx

Test plan

bun run verify:static (format, lint, tsc strict for src + web, all custom checks)
bun run test:unit — 3015 pass (server)
bun run test:web — 290 pass (web), 51 new tests across BlockTimeline.test.tsx (22) and aggregate.test.ts (29)
Manual: stream a turn with text → tools → text; verify pills render between text spans, not above.
Manual: stream a long-running tool; verify spinner on the chip during work, no separate LiveCursor below.
Manual: trigger a self-correcting tool flow (error → retry → success); verify chip head is muted (not red), error visible on row expand.
Manual: trigger a synapse-app tool that returns a ui:// resource; verify the inline widget renders below the chip in its activity slice, not at message top.
Manual: trigger a phase where multiple tools mix verbs; verify head reads "N actions · subject" (no "×N" suffix and no "Worked" verb).

Deliberately deferred

Per the file header in aggregate.ts:

Layer 2 — verb taxonomy. Lifting surface verbs into semantic categories (e.g. "Searched"/"Listed"/"Looked up" → SEARCH) so synonyms cluster before majority voting. Cheap, deterministic improvement when we have evidence verbs don't agree often enough to matter.
Layer 3 — group renderer plugins. A registerGroupRenderer API parallel to registerToolRenderer so domain bundles can describe their own cross-tool workflows. No bundle has asked for this yet; building it on spec would be speculative API surface.

Both are documented inline so the next person to open aggregate.ts knows what's absent and why.

The single TurnActivityPill anchored at the top of every assistant message hoisted reasoning + tool activity above any text — so a turn that streamed [text, tools, text] read as [pill, text, text], with the pill sitting ahead of the work it summarized. Split the turn into chronological slices via the new `segmentTurn`: text blocks become their own slices; contiguous reasoning + tool blocks coalesce into `activity` slices. Each activity slice gets its own pill; the message body interleaves pills with prose in stream order. The live "Thinking…" / "Calling X…" state attaches to the last activity slice; when the model emits text and is then preparing the next tool, a trailing live pill carries the indicator until the next reasoning/tool block lands and a new activity slice absorbs it. Cross-block tool grouping (the Mercury repro) is preserved within each slice — segmentation happens before grouping, so a tool used both before and after a text block reads as two phases of work rather than silently merging across the boundary.

First-principles rewrite of the assistant turn rendering. The old single TurnActivityPill aggregated all reasoning + tool calls into one head at the top of the message; this hid the timeline (text rendered below an already-completed pill) and threw away the reasoning surface entirely once streamingState left "thinking" (the visibility gate dropped reasoning-only pills, so the user lost the ability to drill into what the model thought).

New model:

1. Every block (text / reasoning / tool) renders inline as its own element at the spot it streamed. No per-turn aggregation. The summary IS the order.
2. Non-text blocks render as chips that are self-stating: a reasoning chip carries its own "Thinking…" / "Thought · N tokens" label and is always clickable on settled turns; a tool chip carries its own spinner / past-tense + duration and per-call detail in the body.
3. Consecutive tool blocks whose calls share a single tool name fold into one ×N chip. Reasoning or text between tool blocks always breaks the fold, because those represent separate phases of work.
4. A single LiveCursor at the bottom of the message body covers the transitions the engine spends between blocks: pre-first-block warm-up (Thinking…), tool being built before the block lands (Calling X…), post-tool-result digest (Analyzing…). When a block is actively absorbing the state (streaming text or running tool), the cursor hides, since the block has its own visual.

Removed:

- TurnActivityPill (~520 lines): per-turn aggregation, visibility gate, trailing-pill logic, head-label state machine.
- segmentTurn / TurnSegment / TimelineEntry / TurnSummary / groupTurn / describeTurn (~150 lines): grouping across reasoning boundaries misrepresented the timeline; segmentation became unnecessary once each block renders directly.
- AssistantTurnBody / ActivitySegmentView (~170 lines from the chronological PR): superseded by direct block iteration.

Net: ~650 lines smaller.

18 new BlockTimeline tests cover the scenarios: chronological order, consecutive-name fold (and refusal to fold across reasoning), tone transitions, all six streamingState gap combinations, reasoning persistence on settled turns.
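The fold in rule 3 could be sketched roughly as follows. This is an illustration only: the `Block` and `Folded` shapes and the `foldSameName` name are assumptions made for this example, not the types or helpers actually used in `BlockTimeline.tsx`.

```typescript
// Hypothetical block shapes for illustration; the real types may differ.
type Block =
  | { kind: "text"; text: string }
  | { kind: "reasoning"; text: string }
  | { kind: "tool"; name: string };

type Folded =
  | { kind: "text"; text: string }
  | { kind: "reasoning"; text: string }
  | { kind: "tool"; name: string; count: number };

// Fold consecutive tool blocks that share one tool name into a single
// entry with a count. Any text or reasoning block in between breaks the
// fold, because the previous output entry is then no longer a tool entry.
function foldSameName(blocks: Block[]): Folded[] {
  const out: Folded[] = [];
  for (const block of blocks) {
    const prev = out[out.length - 1];
    if (
      block.kind === "tool" &&
      prev !== undefined &&
      prev.kind === "tool" &&
      prev.name === block.name
    ) {
      prev.count += 1;
      continue;
    }
    out.push(block.kind === "tool" ? { ...block, count: 1 } : block);
  }
  return out;
}
```

Under these assumptions, `[search, search, reasoning, search]` folds to three entries: a ×2 tool entry, the reasoning entry, and a separate ×1 tool entry.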

Two problems addressed:

1. Contiguous reasoning + tool blocks rendered as two stacked chips ("Thought · 25 tokens" and "Used tools · headlines news ×4 · 441ms" sitting next to each other) for what is plainly one phase of work. The user expected one collapsible chip.
2. Expanding the reasoning chip showed a card-in-card: the body's outer bordered panel wrapped an inner bordered reasoning pre block, with asymmetric padding between them.

New rendering model:

- foldBlocks now partitions the turn into `text` slices and `activity` slices. An `activity` slice is a contiguous run of reasoning + tool blocks (any text breaks the phase). Each activity slice maps to one chip whose body lists the rows in stream order.
- Within a slice, consecutive same-name tool blocks still fold into one tool row with ×N. Reasoning rows each keep their own row.
- Chip head label: pure-reasoning slices read "Thought · N tokens" (or "Thinking…" while live). Tool-bearing slices lead with the dominant tool verb + subject + count + duration; reasoning token count rides along as a footnote.
- Single-row slices (just reasoning, or one tool group) render the content directly in the chip body, with no nested row chrome. Multi-row slices render row-by-row, each independently expandable.

CSS: `.turn-pill__reasoning` no longer paints its own border/background; it inherits the chip body's panel. Padding normalized via `.turn-pill__reasoning-wrap` / `.turn-pill__tool-wrap`.
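A minimal sketch of the text / activity partition described above, assuming a simplified block shape. The names `StreamBlock`, `Slice`, and `partitionSlices` are hypothetical stand-ins; the commit's own function is `foldBlocks`, whose real signature is not shown here.

```typescript
// Assumed, simplified block shape; the real blocks carry more data.
type StreamBlock = { kind: "text" | "reasoning" | "tool"; id: string };

type Slice =
  | { kind: "text"; block: StreamBlock }
  | { kind: "activity"; blocks: StreamBlock[] };

// Partition a turn into slices in stream order. Each text block becomes
// its own slice; a contiguous run of reasoning + tool blocks becomes one
// activity slice. A text block always ends the current activity slice.
function partitionSlices(blocks: StreamBlock[]): Slice[] {
  const slices: Slice[] = [];
  for (const block of blocks) {
    if (block.kind === "text") {
      slices.push({ kind: "text", block });
      continue;
    }
    const last = slices[slices.length - 1];
    if (last !== undefined && last.kind === "activity") {
      last.blocks.push(block);
    } else {
      slices.push({ kind: "activity", blocks: [block] });
    }
  }
  return slices;
}
```

With this shape, a turn that streamed reasoning, tool, text, tool yields three slices: one activity slice with two blocks, one text slice, and a second activity slice. Each activity slice would then map to one chip.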

The chip head had inline aggregation: when tool names matched it used the call's verb; otherwise it bailed to a hard-coded "Used tools" / "Working" string. Three calls of different *_search-shaped tools (which share verb "Searched" via inferVerb) lost the verb signal and read as the generic fallback.

Extract the aggregation into a pure function: `aggregateGroup` takes N `ToolDescription`s and returns one `GroupDescription` with verb, object, subject, count, total duration, and tone. Rules:

- verb: majority verb if it covers >50% of calls; else "Worked"
- object: only when every non-null value agrees
- subject: same rule as object (agreement or null)
- totalMs: sum of known durations (null when none)
- tone: running > error > ok

Pure, deterministic, never fails. The chip head composes its label from this shape; React does no aggregation logic of its own.

User-visible effect: in the screenshot scenario (three search-shaped tools with different names), the chip head now reads "Searched · news headlines · ×3 · 423ms" instead of "Used tools · …".

Architecture: this is Layer 1 of a deliberately layered design. Verb synonymy (e.g. "Searched" / "Looked up" / "Fetched" collapsing into one category) is a future Layer 2 taxonomy; a registry-style group renderer plugin is a future Layer 3. Neither is in this commit; both are documented in the aggregate.ts file header so future maintainers know what's intentionally absent and why.
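The rules listed for this commit might look roughly like the sketch below. The field names on `ToolDescription` (in particular `durationMs`) are assumptions for illustration, and the tone rule shown is the one stated in this commit (running > error > ok), which a later commit replaces.

```typescript
type Tone = "running" | "error" | "ok";

// Assumed input shape; the real ToolDescription may carry more fields.
interface ToolDescription {
  verb: string;
  object: string | null;
  subject: string | null;
  durationMs: number | null;
  tone: Tone;
}

interface GroupDescription {
  verb: string;
  object: string | null;
  subject: string | null;
  count: number;
  totalMs: number | null;
  tone: Tone;
}

// Shared value when every non-null entry agrees; otherwise null.
function agreed(values: (string | null)[]): string | null {
  const known = values.filter((v): v is string => v !== null);
  const first = known[0];
  if (first === undefined) return null;
  return known.every((v) => v === first) ? first : null;
}

// A verb wins only if it covers strictly more than half of the calls.
function majorityVerb(calls: ToolDescription[]): string | null {
  for (const call of calls) {
    const votes = calls.filter((other) => other.verb === call.verb).length;
    if (votes * 2 > calls.length) return call.verb;
  }
  return null;
}

function aggregateGroup(calls: ToolDescription[]): GroupDescription {
  const known = calls
    .map((c) => c.durationMs)
    .filter((d): d is number => d !== null);
  const tone: Tone = calls.some((c) => c.tone === "running")
    ? "running"
    : calls.some((c) => c.tone === "error")
      ? "error"
      : "ok";
  return {
    verb: majorityVerb(calls) ?? "Worked",
    object: agreed(calls.map((c) => c.object)),
    subject: agreed(calls.map((c) => c.subject)),
    count: calls.length,
    totalMs: known.length > 0 ? known.reduce((a, b) => a + b, 0) : null,
    tone,
  };
}
```

Because the function only reads its input and returns a fresh object, it stays easy to unit-test without rendering any React component, which appears to be the motivation for extracting it.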

Three different shapes were showing up for the same conceptual state ("model is doing something"):

- Running activity chip: boxed surface with border + card background
- LiveCursor: inline text with spinner, no chrome
- Settled chip + LiveCursor: muted text chip then inline cursor below

Two changes to land on one loading treatment:

1. CSS: drop the box treatment on running chips. The box now only appears when the user has explicitly expanded a chip (a real surface to read). Active state is communicated by the spinner and brand color, the same as <LiveCursor>. In-flight chrome is now consistent regardless of which surface owns the work.
2. Aggregator: when the verb falls back to "Worked" (no majority), suppress the agreed object as well. Pairing a fallback verb with an object reads as nonsense: "Worked manage tools" pretends we know what happened when the verb just admitted we don't. Subject stays (it comes from the user's input and is true regardless of which tools ran), and the count + duration + token suffix continue to communicate scale.

Net: one "loading" shape across the chat (spinner + brand color + label, inline, no box). Box appears only when the user is reading content inside an expanded chip.

The chip head was painting red whenever any call in the group errored, even when the model self-corrected and the final outcome was success. Real agentic example: model called `filters` (errored), `add`ed the missing tool, then called `filter` (succeeded). The user got what they asked for, but the chip read "error" and pulled attention to a fixed problem.

The first-principles cost: when the chip head flags every recovered error, users learn to ignore the red icon, and a genuine terminal failure no longer surfaces clearly.

Rule change in aggregateGroup:

before: any running → running; any error → error; else ok
after: any running → running; else the LAST call's tone

Recovery (error → success) reads as success in the head. Terminal failure (success → error, or all errors) reads as error. The per-call rows in the chip body still show their own tones, so the user can expand and see exactly what failed and what recovered; the chip head just stops crying wolf about it.
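The before/after tone rules can be compared side by side in a small sketch. The function names `toneBefore` and `toneAfter` are hypothetical, and treating an empty group as "ok" is an assumption the commit message does not state.

```typescript
type Tone = "running" | "error" | "ok";

// Earlier rule: any running call wins, then any error, else ok.
function toneBefore(tones: Tone[]): Tone {
  if (tones.some((t) => t === "running")) return "running";
  if (tones.some((t) => t === "error")) return "error";
  return "ok";
}

// New rule: any running call wins; otherwise the last call's tone decides.
// An empty group is assumed to read as "ok" here.
function toneAfter(tones: Tone[]): Tone {
  if (tones.some((t) => t === "running")) return "running";
  const last = tones[tones.length - 1];
  return last ?? "ok";
}

// Recovery: the head no longer escalates.
const recovered: Tone[] = ["error", "ok"];
console.log(toneBefore(recovered), toneAfter(recovered)); // "error" "ok"

// Terminal failure still surfaces under both rules.
const failed: Tone[] = ["ok", "error"];
console.log(toneBefore(failed), toneAfter(failed)); // "error" "error"
```

The same sketch also shows the known limit of the heuristic: an independent failure followed by an unrelated success is indistinguishable from a recovery, so `toneAfter(["error", "ok"])` returns "ok" in both cases.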

Calls with no input arguments (e.g. current_user(), list_active_apps()) rendered as bare "● 7ms" rows — a dot, a duration, and nothing else. The reader couldn't tell what ran without expanding the row. The row label was conditional on the input summary being non-null; when summarizeInput returned null (empty input), the label was omitted entirely. Surface the stripped tool name as a fallback so every row identifies its call. A row's job is to be self-identifying. Input preview is the *differentiator* between sibling calls, but when there isn't one the tool name is still a true identity. Duration is metadata; it should never be the only thing on the row.
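The fallback amounts to a null-coalescing step on the row label. In the sketch below, `summarizeInput` is a stand-in with an assumed signature and an assumed "first non-empty string value" policy (the real helper is named in the commit but not shown), and `rowLabel` is a hypothetical name.

```typescript
// Stand-in for the real summarizeInput: returns the first non-empty
// string value of the input, or null when there is nothing to show.
// The real helper's policy may differ; this is an assumption.
function summarizeInput(input: Record<string, unknown>): string | null {
  for (const key of Object.keys(input)) {
    const value = input[key];
    if (typeof value === "string" && value.length > 0) return value;
  }
  return null;
}

// A row always identifies its call: prefer the input preview, and fall
// back to the (already stripped) tool name when there is no preview.
function rowLabel(toolName: string, input: Record<string, unknown>): string {
  return summarizeInput(input) ?? toolName;
}

console.log(rowLabel("current_user", {})); // "current_user"
console.log(rowLabel("search", { query: "news headlines" })); // "news headlines"
```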

"Worked" was verb-shaped scaffolding: it occupied the verb slot but carried no real signal. When the group has no majority verb, the strongest truth we have is the count: "N actions happened, here's what was consistent across them."

before: ● Worked · ×3 · news headlines · 423ms
after: ● 3 actions · news headlines · 423ms

The aggregator now exposes `verbIsFallback` so renderers can detect the fallback path without comparing against a magic string. The chip head and tool row both swap to the count-led label when the flag is true, and suppress the redundant `×N` suffix (the count is already in the label).

Why this beats the alternatives:

- More informative: the count is real signal; a generic verb is not.
- More honest: the fallback path admits we can't characterize the work; a verb word pretends we can.
- Removes redundancy: count was being shown twice (in the verb-paired `×N` and conceptually as the multiplicity).
- Rhythm preserved: the dot icon anchors the eye, not the verb shape, so a noun-led label still reads as a chip row.

The verb stays a real verb when a majority exists (the common case); fallback only fires when the model genuinely mixed verbs.

…chip labels

ActivityChip's chipHead and ToolRow each independently computed `verbIsFallback ? "${count} actions" : object ? "${verb} ${object}" : verb` plus the `running ? verbPresent : verb` tense pick. The decision must stay aligned across both surfaces; without a shared kernel, future changes to fallback rendering would have to land in two places and silently drift if one was forgotten.

Pull both call sites onto one `formatGroupLabel(group, { running }) → { label, showCountSuffix }` helper. `running` is computed by the caller because the phase chip combines group.tone with the reasoning-tail-streaming flag (state the aggregator can't see), while rows just pass `group.tone === "running"`.

Net: ~10 lines of new helper, ~22 lines of duplication removed.
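A sketch of what such a shared kernel could look like, built from the two expressions quoted in the commit message. The `GroupLabelInput` shape is an assumption (in particular, that `verbPresent` lives on the group), and showing the ×N suffix only when `count > 1` is a guess at the non-fallback behavior, not something the commit states.

```typescript
// Assumed subset of the group description needed for labelling.
interface GroupLabelInput {
  verb: string; // past tense, e.g. "Searched"
  verbPresent: string; // present tense, e.g. "Searching" (assumed field)
  object: string | null;
  count: number;
  verbIsFallback: boolean;
}

interface GroupLabel {
  label: string;
  showCountSuffix: boolean;
}

function formatGroupLabel(
  group: GroupLabelInput,
  opts: { running: boolean },
): GroupLabel {
  // Fallback path: count-led label, and no ×N suffix because the count
  // is already part of the label.
  if (group.verbIsFallback) {
    return { label: `${group.count} actions`, showCountSuffix: false };
  }
  // Tense pick is driven by the caller-supplied running flag.
  const verb = opts.running ? group.verbPresent : group.verb;
  const label = group.object ? `${verb} ${group.object}` : verb;
  // Assumption: the suffix is only useful for more than one call.
  return { label, showCountSuffix: group.count > 1 };
}
```

Passing `running` in from the caller keeps the helper free of streaming state, so the phase chip and the tool row can each decide what "running" means for them while sharing every other labelling rule.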

mgoldsborough added 8 commits May 23, 2026 08:45

mgoldsborough changed the title ~~fix(web): render assistant turn segments in stream order~~ refactor(web): per-block chat timeline + Layer 1 aggregator May 23, 2026

mgoldsborough added the qa-reviewed label (QA review completed with no critical issues) May 23, 2026

mgoldsborough merged commit a0addc4 into main May 23, 2026
4 checks passed

mgoldsborough deleted the feat/chronological-turn-segments branch May 23, 2026 21:42

mgoldsborough merged 9 commits into main from feat/chronological-turn-segments
