Design refactor: live transcription, runtime echo cancellation, per-track live-write pipeline #11

## 1. Summary

This umbrella captures three design directions we're exploring for Blackbox v1 (live transcription, runtime echo cancellation, live-write substrate), the engine-landscape evidence collected on 2026-04-24, and the invariants any proposed change must preserve. Not all decisions are made. Sections below distinguish **evidence** (facts about the current landscape) from **direction** (today's leaning and what would change it). A fresh session picking this up should start at section 11 to rebuild context before acting.

## 2. Current-state observations

Three pressure points in the current design. Observations, not a case being built.

1. **Transcription is offline-only and brittle at length.** The current path (`TranscriptionService.swift` via Soniox async) uploads the finished M4A, queues for async processing, and returns a transcript minutes later. Recordings over roughly one hour have been observed to fail at the provider's finalize step. There is no transcript available during the call itself.

2. **Echo cancellation runs as a post-recording pass.** `AECProcessor.swift` produces `audio-processed.m4a` alongside `audio.m4a`, forcing the playback UI (`MainWindowView.swift:149-151`) to branch on which file exists. The post-pass is fire-and-forget and has a silent-failure path. The user's mental model has to accommodate both artifacts.

3. **The live-writer carries complexity specific to AVAssetWriter's constraints.** Several architectural constraints in `docs/specification.md` (D8 silence gap filling, tail padding, leading silence, first-buffer-format immutability) exist because AVAssetWriter collapses PTS gaps and freezes track format on the first appended sample. This complexity is not a product requirement; it is a tax paid for using AVAssetWriter as the live target.

## 3. Invariants to preserve

Hard constraints any candidate change must respect. Non-negotiable.

- **No silent audio loss mid-call.** Users are on the call and unaware that recording is happening; the system must self-heal without their intervention.
- **Automatic recovery from device switches.** D12's three-layer mic recovery (AVAudioEngineConfigurationChange + kAudioHardwarePropertyDefaultInputDevice listener + buffer-arrival watchdog) stays.
- **Crash-safe partial recordings.** Any point-in-time file on disk must be recoverable to the extent audio was captured.
- **Independent failure domains.** One sub-component failing (system audio, mic, transcript, AEC) cannot take down the others.
- **Existing product surfaces stay.** D3 polling call detection, D6 permission-loss handling, D9 device latency compensation, AudioMonitor restart budget + permission-terminal state, AudioRecorder actor + DispatchSerialQueue executor, SCStream display-wide system audio capture, hardware smoke tests.

## 4. Design philosophy

- **D2 independent failure domains remain load-bearing.** No new cross-component coupling.
- **Audio capture never loses samples.** Every other pipeline output is best-effort with graceful degradation.
- **Prefer a simpler failure-mode story over a smaller line count.** Lines are cheap; silent-failure paths are expensive.
- **Accept and label quality limits.** Limitations should be documented rather than hidden.

## 5. Design surfaces under consideration

Three areas where change is being evaluated. Candidate directions, not committed decisions.

### 5.A. Transcription timing

**Current:** offline-after-stop, single provider (Soniox async).

**Observed pain:** no live feedback during call; 1hr+ finalize failure; upload+queue latency on long files.

**Candidate direction:** on-device streaming ASR + speaker diarization alongside the recorder, with a user-selectable cloud fallback retained for failure cases, legacy recordings, and explicit user preference. Transcript is a sidecar file (extending the current `transcript.json` convention) and may gain provenance metadata (engine, language, live vs cloud, confidence).
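
A minimal sketch of what the provenance metadata could look like, assuming it is modelled as a `Codable` block. The current `transcript.json` shape is not reproduced in this issue, so every type and field name here (`TranscriptProvenance`, `TranscriptSource`, `engineVersion`, and so on) is hypothetical and only illustrates the four attributes listed above.

```swift
import Foundation

// Hypothetical provenance block for the transcript sidecar.
// None of these names come from the existing transcript.json schema.
enum TranscriptSource: String, Codable {
    case live   // produced on-device during the call
    case cloud  // produced by the cloud fallback after stop
}

struct TranscriptProvenance: Codable, Equatable {
    var engine: String          // e.g. "FluidAudio" or "Soniox"
    var engineVersion: String?  // nil when the engine does not report one
    var language: String?       // language tag, nil when undetected
    var source: TranscriptSource
    var confidence: Double?     // nil when the engine reports none
}

// Encodes with stable key order so sidecar diffs stay readable.
func encodeProvenance(_ provenance: TranscriptProvenance) throws -> Data {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
    return try encoder.encode(provenance)
}

func decodeProvenance(_ data: Data) throws -> TranscriptProvenance {
    try JSONDecoder().decode(TranscriptProvenance.self, from: data)
}
```

Keeping the optional fields nullable lets legacy transcripts, which carry no provenance, decode without a migration step if the block itself is made optional on the parent type.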

**Tradeoffs of the candidate:**
- On-device eliminates network failure domain in the recording path (respects D2).
- On-device adds model-weight storage cost (DMG or first-launch download, see section 8).
- Multilingual streaming today is sliding-window over batch, not true cache-aware. Chunk cadence on the order of 1-2 seconds.
- Mixed-language calls (EN+RU+UK) have known accuracy weakness until upstream language-hint API lands.
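
To make the sliding-window tradeoff concrete: a batch model is re-run over the most recent window of audio each time a new hop of samples arrives, instead of carrying encoder state forward as a cache-aware model would. The sketch below only computes which sample ranges would be re-decoded. It is not FluidAudio's implementation, and `slidingWindows`, the window length, and the hop length are all illustrative assumptions.

```swift
// Sample-index ranges a sliding-window recognizer would decode as audio arrives.
// One range is emitted per hop; each range covers at most `windowSamples` samples.
func slidingWindows(totalSamples: Int, windowSamples: Int, hopSamples: Int) -> [Range<Int>] {
    precondition(windowSamples > 0 && hopSamples > 0, "window and hop must be positive")
    var windows: [Range<Int>] = []
    var end = hopSamples
    while end <= totalSamples {
        // Early windows are shorter because less audio than one full window exists yet.
        windows.append(max(0, end - windowSamples)..<end)
        end += hopSamples
    }
    return windows
}

// Toy numbers: 10 samples of audio, window of 4, hop of 2.
let toyWindows = slidingWindows(totalSamples: 10, windowSamples: 4, hopSamples: 2)
print(toyWindows)
```

Consecutive windows overlap, so the same audio is decoded more than once. That repeated work is the cost of not having a cache-aware model, and the hop length is what sets the 1-2 second cadence.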

**What would change the preference:** Apple SpeechTranscriber adding Ukrainian; FluidAudio shipping a multilingual cache-aware streaming model; mlx-audio-swift reaching production maturity; a cloud provider shipping native multi-channel diarization with a multi-hour session cap.

### 5.B. Echo cancellation timing

**Current:** post-recording CoreML DTLN-aec 256-unit pass writing `audio-processed.m4a`.

**Observed pain:** two-file user model, UI branching on file existence, silent failure path, fire-and-forget lifecycle.

**Candidate direction:** runtime AEC in the capture path, consuming the mic signal with system audio as reference (same reference relationship as today's post-pass, different lifecycle). Output is a single authoritative file per recording; post-recording pass and its sidecar artifact go away. Explicitly **not** Voice-Processed I/O (D7 rejected VPIO because it silences SCStream capture); a streaming DTLN-family AEC processes captured signals without touching the output path.

**Tradeoffs of the candidate:**
- Collapses two-file model to one.
- Moves compute from post-stop to realtime, adding steady-state CPU/ANE load during the call.
- Needs verification that the streaming AEC library does not collide with SCStream in practice.

**What would change the preference:** no drop-in RT-safe AEC model available; SCStream interaction turns out to be problematic; battery cost on M1/M2 is unacceptable.

### 5.C. Live-write substrate

**Current:** AVAssetWriter multi-track M4A live-writer with `movieFragmentInterval = 10s` for crash safety.

**Observed pain:** D8, tail padding, leading silence, first-buffer-format immutability all exist to work around AVAssetWriter's constraints. A single track failure puts the whole writer in `.failed` state, potentially losing both tracks.

**Candidate direction:** parallel per-track raw PCM (WAV) files during capture, muxed to a single multi-track M4A at stop. Periodic WAV header rewrite (every ~10s matching current fragment cadence) preserves "natively playable at any point in time" guarantee. Final user-visible artifact unchanged.
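
A minimal sketch of the periodic header rewrite, assuming 16-bit integer PCM and the canonical 44-byte RIFF/WAVE header. `ScratchWAV`, `wavHeader`, and `checkpoint` are hypothetical names; a real writer might store float samples (a different format tag) and would need the caller to schedule `checkpoint()` on the ~10s cadence.

```swift
import Foundation

extension Data {
    // Appends an integer in little-endian byte order, as RIFF requires.
    mutating func appendLE<T: FixedWidthInteger>(_ value: T) {
        var le = value.littleEndian
        append(Data(bytes: &le, count: MemoryLayout<T>.size))
    }
}

// Canonical 44-byte RIFF/WAVE header for integer PCM (format tag 1).
func wavHeader(sampleRate: UInt32, channels: UInt16, bitsPerSample: UInt16, dataBytes: UInt32) -> Data {
    let blockAlign = channels * (bitsPerSample / 8)
    var h = Data()
    h.append(Data("RIFF".utf8)); h.appendLE(36 + dataBytes)   // RIFF chunk size
    h.append(Data("WAVE".utf8))
    h.append(Data("fmt ".utf8)); h.appendLE(UInt32(16))       // fmt chunk size
    h.appendLE(UInt16(1))                                     // PCM
    h.appendLE(channels)
    h.appendLE(sampleRate)
    h.appendLE(sampleRate * UInt32(blockAlign))               // byte rate
    h.appendLE(blockAlign)
    h.appendLE(bitsPerSample)
    h.append(Data("data".utf8)); h.appendLE(dataBytes)        // data chunk size
    return h
}

final class ScratchWAV {
    private let handle: FileHandle
    private let sampleRate: UInt32
    private let channels: UInt16
    private let bitsPerSample: UInt16
    private var dataBytes: UInt32 = 0  // RIFF sizes are 32-bit; this sketch ignores the ~4 GiB limit

    init(url: URL, sampleRate: UInt32, channels: UInt16, bitsPerSample: UInt16) throws {
        self.sampleRate = sampleRate
        self.channels = channels
        self.bitsPerSample = bitsPerSample
        let empty = wavHeader(sampleRate: sampleRate, channels: channels,
                              bitsPerSample: bitsPerSample, dataBytes: 0)
        FileManager.default.createFile(atPath: url.path, contents: empty)
        handle = try FileHandle(forUpdating: url)
        try handle.seekToEnd()
    }

    // Appends raw interleaved PCM after the header.
    func append(_ pcm: Data) throws {
        try handle.write(contentsOf: pcm)
        dataBytes += UInt32(pcm.count)
    }

    // Rewrites the header with the sizes written so far, then flushes to disk.
    func checkpoint() throws {
        let header = wavHeader(sampleRate: sampleRate, channels: channels,
                               bitsPerSample: bitsPerSample, dataBytes: dataBytes)
        try handle.seek(toOffset: 0)
        try handle.write(contentsOf: header)
        try handle.synchronize()
        try handle.seekToEnd()
    }
}
```

After a crash, the file plays up to the last checkpoint as-is; samples appended after that checkpoint are still on disk and can be recovered by recomputing the two size fields from the file length.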

**Tradeoffs of the candidate:**
- Deletes D8 (WAV has no PTS to collapse), tail padding, leading silence, first-buffer-format immutability.
- Independent per-track failure domains (one track failing cannot corrupt the other).
- Handles mid-stream format changes by closing old WAV and opening new; mux concatenates.
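- Temporary disk footprint during recording: ~1.5 GB of raw PCM scratch for a 2-hour stereo 48kHz call (consistent with 16-bit samples: 48,000 × 2 channels × 2 bytes × 7,200 s ≈ 1.38 GB; 32-bit float scratch would be roughly double), released after mux at stop.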
- 5-15 seconds of mux pass at stop before the M4A is finalized.
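
The scratch-disk figure above can be checked with a few lines of arithmetic. The sample widths are assumptions, since this issue does not state the scratch sample format; `pcmBytes` is a hypothetical helper.

```swift
// Bytes of raw PCM produced by a capture of the given shape.
func pcmBytes(sampleRate: Int, channels: Int, bytesPerSample: Int, seconds: Int) -> Int {
    sampleRate * channels * bytesPerSample * seconds
}

let twoHours = 2 * 60 * 60  // 7,200 seconds

// Two channels total, whether stored as one stereo file or two mono per-track files.
let int16Bytes = pcmBytes(sampleRate: 48_000, channels: 2, bytesPerSample: 2, seconds: twoHours)
let float32Bytes = pcmBytes(sampleRate: 48_000, channels: 2, bytesPerSample: 4, seconds: twoHours)

print(int16Bytes)    // 1382400000, about 1.38 GB
print(float32Bytes)  // 2764800000, about 2.76 GB
```

If the capture path hands over 32-bit float buffers and they are written unconverted, the footprint doubles relative to the 16-bit case, which is worth settling before treating scratch disk as a non-issue.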

**What would change the preference:** AVAssetWriter gains per-input failure isolation and PTS-gap preservation (neither plausible); scratch disk becomes a real constraint on target hardware.

## 6. Engine landscape (2026-04-24 snapshot)

Evidence per option, then a decision matrix, then today's leaning.

### 6.1 Evaluated options (evidence only, no verdicts)

**FluidAudio** - Swift Package, Apache-2.0, CoreML + Apple Neural Engine.
- 25 European languages including Ukrainian (7.2% WER on FLEURS-uk, 5.1% WER on CoVoST-uk) via Parakeet TDT v3. Russian also 7.2% WER.
- Multilingual streaming = sliding-window over batch Parakeet TDT v3 (~1-2s chunk cadence), not cache-aware.
- True cache-aware streaming available only for English (Parakeet EOU 120M).
- Live diarization: streaming Sortformer (4 speakers, ~1s latency, stronger identity stability) or LS-EEND (10 speakers, 100ms finalized + 900ms preview).
- v0.14.x as of 2026-04-24. 10 releases in prior 30 days (rapid churn).
- No language-hint API yet (tracked upstream).

**Apple SpeechTranscriber** - `Speech` framework, macOS 26.0+, on-device.
- Runtime-verified on macOS 26.5: `SpeechTranscriber.supportedLocales` returns 30 locales (varieties of EN, DE, ES, FR, IT, JA, KO, PT, ZH, yue).
- Ukrainian (`uk_UA`) and Russian (`ru_RU`) not present.
- Apple does not ship first-party on-device speaker diarization on macOS.

**Apple Foundation Models** - text-only on-device LLM, macOS 26.0+. Not an ASR option. Could be useful later for post-transcript summarization or titling.

**Soniox async** (currently integrated) - REST upload+poll+fetch.
- Ukrainian supported.
- Known limitation: finalize step observed to fail on recordings roughly beyond 1 hour.

**Soniox realtime WebSocket** - distinct from their async.
- 5-hour hard session cap.
- Diarization in realtime is mono acoustic detection only (not channel-based); docs explicitly flag accuracy as lower than async.
- No native session-resume protocol; drops = new session, context lost.
- ~$0.12/hr streaming (vs $0.10/hr async).

**Google Cloud Speech-to-Text Chirp 3** - STT v2 API.
- Ukrainian GA in supported-languages table across `eu`, `us`, `us-central1`, `europe-west4`, `asia-southeast1`, `global` regions.
- Streaming supported; diarization supported.
- Whether streaming + Ukrainian + diarization simultaneously cover the same recognition path needs verification (markdown extract lost the feature-column markers).

**Gemini Live API** - bidirectional WebSocket.
- Audio-only session cap: 15 minutes without compression.
- Requires session-resumption token dance to extend.
- "70 supported languages" claimed; Ukrainian not explicitly confirmed in overview.
- No documented speaker diarization.

**Gladia realtime** - WebSocket.
- Native multi-channel demux: `channels: 2` in config splits stereo stream and tags each utterance with `channel`. Unique among realtime providers; enables mic-left + system-right → per-speaker labeling in a single session.
- pyannoteAI Precision-2 diarization under the hood.
- ~$0.25-$0.75/hr depending on tier.

**Deepgram Nova-3 streaming** - WebSocket.
- `multichannel=true` returns separate per-channel messages, up to 20 channels. Billed per channel.
- Nova-3 claims 53.4% WER reduction vs competitors on streaming.
- ~$0.0077/min = $0.462/hr Nova-3 PAYG.

**mlx-audio-swift** - Swift companion to mlx-audio Python library.
- MIT license, 589 stars, v0.1.2, macOS 14+, Swift 5.9+ (not Swift 6).
- Supports Parakeet v3 (multilingual incl. Ukrainian inherited).
- Streaming capability documented (`generateStream` API).
- Experimental pre-1.0 maturity; no production shipping app reference.

### 6.2 Decision matrix (for primary live-transcription path)

| Option | Ukrainian | Live streaming | Diarization (live) | On-device | D2 compatible | 2h+ session durable | Swift integration |
|---|---|---|---|---|---|---|---|
| FluidAudio Parakeet v3 + Sortformer/LS-EEND | Yes (7.2% WER) | Sliding-window, ~1-2s chunks | Yes | Yes | Yes | Yes (no session boundary) | Native Swift package |
| Apple SpeechTranscriber | No | Yes | No | Yes | Yes | Yes | Native |
| Google Chirp 3 | Yes (GA batch) | Yes (streaming TBV for uk) | Yes (TBV for uk) | No | No | TBV | gRPC, no Swift SDK |
| Soniox realtime | Yes | Yes | Mono acoustic only | No | No | No (5h hard cap, no resume) | WebSocket (URLSession) |
| Gemini Live | TBV | Yes | No | No | No | No (15min cap) | WebSocket (URLSession) |
| Gladia | Likely | Yes | Native 2-channel | No | No | TBV | WebSocket |
| Deepgram Nova-3 | Likely | Yes | `multichannel=true` | No | No | TBV (no documented cap) | WebSocket |
| mlx-audio-swift Parakeet v3 | Yes (inherited) | Claimed | TBV | Yes | Yes | Yes | Experimental |

TBV = to be verified.
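
The matrix can also be filtered mechanically. The sketch below encodes the six requirement columns (everything except Swift integration) and keeps only rows where every cell is a confirmed yes. Collapsing "TBV", "Likely", and "Claimed" into `unverified`, and counting Soniox's mono acoustic diarization as a yes, are encoding choices made for this illustration; `EngineOption` and `Cell` are hypothetical names.

```swift
// Tri-state cell: `unverified` covers TBV, "Likely", and "Claimed" entries.
enum Cell { case yes, no, unverified }

struct EngineOption {
    let name: String
    let ukrainian, live, diarization, onDevice, d2, durable: Cell

    // True only when every requirement column is a confirmed yes.
    var meetsAll: Bool {
        [ukrainian, live, diarization, onDevice, d2, durable].allSatisfy { $0 == .yes }
    }
}

let options = [
    EngineOption(name: "FluidAudio Parakeet v3 + Sortformer/LS-EEND",
                 ukrainian: .yes, live: .yes, diarization: .yes, onDevice: .yes, d2: .yes, durable: .yes),
    EngineOption(name: "Apple SpeechTranscriber",
                 ukrainian: .no, live: .yes, diarization: .no, onDevice: .yes, d2: .yes, durable: .yes),
    EngineOption(name: "Google Chirp 3",
                 ukrainian: .yes, live: .unverified, diarization: .unverified, onDevice: .no, d2: .no, durable: .unverified),
    EngineOption(name: "Soniox realtime",  // diarization is mono acoustic only
                 ukrainian: .yes, live: .yes, diarization: .yes, onDevice: .no, d2: .no, durable: .no),
    EngineOption(name: "Gemini Live",
                 ukrainian: .unverified, live: .yes, diarization: .no, onDevice: .no, d2: .no, durable: .no),
    EngineOption(name: "Gladia",
                 ukrainian: .unverified, live: .yes, diarization: .yes, onDevice: .no, d2: .no, durable: .unverified),
    EngineOption(name: "Deepgram Nova-3",
                 ukrainian: .unverified, live: .yes, diarization: .yes, onDevice: .no, d2: .no, durable: .unverified),
    EngineOption(name: "mlx-audio-swift Parakeet v3",
                 ukrainian: .yes, live: .unverified, diarization: .unverified, onDevice: .yes, d2: .yes, durable: .yes),
]

let passing = options.filter(\.meetsAll).map(\.name)
print(passing)
```

Re-running the filter after editing a cell is a quick way to see whether a changed fact actually moves an option onto or off the candidate list.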

### 6.3 Today's leaning

**Primary path:** FluidAudio leads on the evidence above. It is the only option that satisfies all hard requirements (Ukrainian + live + diarization + on-device + D2 + multi-hour durability) today. Expected cost: ~1-2 second chunk cadence (vs sub-second from cloud realtime), mixed-language accuracy tax until language-hint lands upstream.

**Cloud fallback:** Soniox async stays (currently integrated). Google Chirp 3 is a credible alternate worth adding once streaming+Ukrainian+diarization coverage is verified.

**Re-verify this leaning if any of the following changes:**
- Apple adds Ukrainian to SpeechTranscriber in a future macOS release.
- FluidAudio ships a multilingual cache-aware streaming model.
- mlx-audio-swift reaches production maturity (v1.0+, Swift 6, shipping app references).
- A cloud provider ships native multi-channel diarization with a 2h+ session cap and confirmed Ukrainian.

## 7. Dismissed for v1 with reasons

Explicit to prevent relitigation without new evidence.

- **Apple SpeechTranscriber:** no Ukrainian in the macOS 26.5 runtime locale list (verified). No first-party Apple diarization on macOS. Reconsider if Apple adds `uk_UA`.
- **Apple Foundation Models:** not an ASR framework. Usable only for post-transcript text work.
- **Soniox realtime WebSocket:** the 5-hour hard session cap, combined with the lack of a native resume protocol, conflicts with multi-hour call recording; realtime diarization is mono acoustic only. Reconsider if Soniox raises the cap and adds channel-based diarization.
- **Gemini Live API:** 15-minute audio-only session cap plus no documented speaker diarization make this a non-starter for call recording today. Reconsider if both constraints are lifted.
- **Gladia:** credible for cloud-primary (native 2-channel demux is uniquely well-suited to mic+system labeling), but kept out of v1 because cloud-primary violates D2. Keep on radar for a future "cloud-primary with no on-device dependency" configuration if demand exists.
- **Deepgram Nova-3 streaming:** similar to Gladia (strong cloud realtime, multichannel supported), same D2 concern. Keep on radar.
- **mlx-audio-swift:** too early (v0.1.2, experimental, Swift 5.9+, no production references). Reconsider if it reaches v1.0+ with Swift 6 compatibility and shipping app adoption.

## 8. Implementation reference facts

Hard-to-rediscover specifics. The things that actually slow a fresh session down if lost.

### 8.1 FluidAudio integration

- Package: https://github.com/FluidInference/FluidAudio (Swift Package, Apache-2.0)
- Pin a specific minor version (v0.14.x baseline as of 2026-04-24); release cadence is high (10 releases in prior 30 days), do not track `main`.
- Multilingual ASR entry point: `SlidingWindowAsrManager` with Parakeet TDT v3 (sliding-window, not cache-aware).
- Live diarization entry points: `LSEENDDiarizer` (up to 10 speakers) or `SortformerDiarizer(config: .default)` (4 speakers, stronger identity stability).
- Model loader pattern: `SortformerModels.loadFromHuggingFace(config: .default)` etc.
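
A sketch of the version pin in a `Package.swift` manifest, assuming the dependency is consumed through SwiftPM directly. The package name, platform floor, tools version, lower-bound tag `0.14.0`, and the `FluidAudio` product name are all assumptions to confirm against the repository before use.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "BlackboxTranscription",   // hypothetical package name
    platforms: [.macOS(.v14)],       // assumption; match the app's real deployment target
    dependencies: [
        // Accepts 0.14.x patch releases and refuses 0.15.0 or later.
        // "0.14.0" is a placeholder lower bound, not a verified tag.
        .package(url: "https://github.com/FluidInference/FluidAudio",
                 .upToNextMinor(from: "0.14.0")),
    ],
    targets: [
        .target(
            name: "BlackboxTranscription",
            // Product name assumed to match the repository name.
            dependencies: [.product(name: "FluidAudio", package: "FluidAudio")]
        ),
    ]
)
```

`.upToNextMinor` is the requirement that matches "pin a minor version" here: patch fixes still arrive, while a minor bump during a period of rapid releases requires a deliberate manifest change.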

### 8.2 HuggingFace model IDs

- ASR: `FluidInference/parakeet-tdt-0.6b-v3-coreml`
- Streaming diarizer (Sortformer variant): `FluidInference/diar-streaming-sortformer-coreml` (CC-BY-4.0)
- If LS-EEND is chosen instead, locate its repo in FluidInference's HuggingFace org.

### 8.3 Parakeet runtime-required file subset

The full HuggingFace repo is 2.69 GB but contains legacy variants.

**Runtime-required (newest):** `Decoder.mlmodelc`, `Encoder.mlmodelc`, `JointDecisionv3.mlmodelc`, `MelEncoder.mlmodelc` (int-8 quantized), `Preprocessor.mlmodelc`, `parakeet_v3_vocab.json`.

**Legacy (skip):** `JointDecision.mlmodelc` (v1), `JointDecisionv2.mlmodelc`, `JointDecisionv3.mlpackage` (duplicate of the `.mlmodelc`), `Melspectrogram_15s.mlmodelc`, `ParakeetEncoder_15s.mlmodelc`, `ParakeetDecoder.mlmodelc`, `RNNTJoint.mlmodelc`, `parakeet_vocab.json` (pre-v3), `mlpackages/` folder.

Estimated runtime subset: 600 MB to 1.2 GB (not yet precisely measured; drives the bundle-vs-download decision).
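
A small measurement sketch for replacing that estimate with a real number, assuming the HuggingFace repo has already been downloaded to a local directory. `sizeOnDisk` and `runtimeSubsetBytes` are hypothetical helper names; the file list is copied from the runtime-required set above.

```swift
import Foundation

// Total bytes under `url`. Recurses into directories, since .mlmodelc is a directory bundle.
func sizeOnDisk(_ url: URL) -> Int {
    let keys: Set<URLResourceKey> = [.isRegularFileKey, .fileSizeKey]
    if let values = try? url.resourceValues(forKeys: keys), values.isRegularFile == true {
        return values.fileSize ?? 0
    }
    guard let walker = FileManager.default.enumerator(at: url, includingPropertiesForKeys: Array(keys)) else {
        return 0
    }
    var total = 0
    for case let file as URL in walker {
        let values = try? file.resourceValues(forKeys: keys)
        if values?.isRegularFile == true { total += values?.fileSize ?? 0 }
    }
    return total
}

// Runtime-required entries from the list above; legacy variants are deliberately absent.
let runtimeRequired = [
    "Decoder.mlmodelc", "Encoder.mlmodelc", "JointDecisionv3.mlmodelc",
    "MelEncoder.mlmodelc", "Preprocessor.mlmodelc", "parakeet_v3_vocab.json",
]

// Sums only the runtime-required entries; missing entries count as zero.
func runtimeSubsetBytes(in repoDir: URL) -> Int {
    runtimeRequired.reduce(0) { $0 + sizeOnDisk(repoDir.appendingPathComponent($1)) }
}
```

Calling `runtimeSubsetBytes(in: URL(fileURLWithPath: "/path/to/parakeet-repo"))` and dividing by 1,000,000 gives the figure in MB. This is logical file size, so the installed footprint may differ slightly on disk.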

### 8.4 License inventory (three distinct licenses in dependency graph)

- FluidAudio library: Apache-2.0 (permissive).
- Parakeet TDT v3 CoreML model: derived from NVIDIA's Parakeet TDT v3; inherits NVIDIA's license terms.
- Streaming Sortformer diarizer: CC-BY-4.0 (attribution required, commercial use permitted).

Attribution placement (About panel / LICENSE file / settings) is a product decision.

### 8.5 Runtime AEC target model

- Current batch AEC: DTLN-aec 256-unit CoreML (via `DTLNAec256` package in `AECProcessor.swift`).
- Runtime replacement target: DTLN-aec **512-unit** `.large` variant (streaming member of the same family).

### 8.6 Apple SpeechTranscriber runtime re-check snippet

To re-verify whether Apple has added Ukrainian (or any other locale) to `SpeechTranscriber.supportedLocales` on a current macOS 26.x machine:

```swift
import Foundation
import Speech

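// SpeechTranscriber requires macOS 26.0 or later.
@available(macOS 26.0, *)
@main struct Main {
    static func main() async {
        // supportedLocales is an async property, hence the await.
        let locales = await SpeechTranscriber.supportedLocales
        let ids = locales.map { $0.identifier }.sorted()
        print("count=\(ids.count)")
        for id in ids { print(id) }
        // Prefix match covers any regional variant of the language.
        print("has_uk=\(ids.contains { $0.hasPrefix("uk") })")
        print("has_ru=\(ids.contains { $0.hasPrefix("ru") })")
    }
}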
```

Save as `locales.swift`, then compile and run with `swiftc -parse-as-library locales.swift -o locales && ./locales`. Baseline on 2026-04-24 (macOS 26.5): 30 locales, no uk, no ru.

### 8.7 First-launch download UX

PR #3 already shipped a first-launch model download with progress UI. Whichever branch of bundle-vs-download wins, the existing UX is reusable rather than rewritten.

## 9. Open questions

### 9.1 Verification items (answerable with research)

- **Actual Parakeet runtime-subset size.** Drives bundle-vs-download. Measure the runtime-required files listed in section 8.3.
- **Chirp 3 streaming + Ukrainian + diarization simultaneously.** Confirm from Chirp 3 model docs, not the supported-languages table alone.
- **Runtime AEC compatibility with SCStream.** Streaming DTLN-512 should not collide with SCStream the way VPIO does (D7), but verify explicitly before committing.
- **Sliding-window chunk cadence on real Ukrainian call audio.** Pilot on actual call audio (Ihor, Chrome Meet).
- **Sparkle delta-update behavior with bundled models.** If model is bundled in DMG, verify deltas do not re-ship the model with each release.
- **Whether LS-EEND or Sortformer is better for 2-person call identity stability.** Empirical, not documentable.

### 9.2 Product decisions (require user judgment)

- **Bundle vs first-launch download.** Driven by 9.1 sizing.
- **Cloud fallback trigger policy.** Auto-run on live failure, or explicit user action?
- **Live transcript UI placement.** Menu-bar popover, HUD, dedicated window, or main-window tab?
- **Transcript sidecar schema.** Add provenance metadata (engine version, detected language, live vs cloud, confidence) or keep current shape?
- **Mixed-language UX.** Language selector, warning banner, or no UI surface until upstream language-hint lands?
- **Cloud fallback provider list.** Keep Soniox only, or add Google Chirp 3 as a selectable alternate in v1?
- **Attribution placement for the three-license dependency graph.** About panel, LICENSE file, Settings, or multiple?

## 10. Out of scope

This umbrella explicitly does not touch:

- Capture pipeline refactor beyond the live-write substrate (SCStream, AVAudioEngine, polling call detection, permission-loss handling, device latency compensation, mic recovery, restart budget, hardware smoke test infrastructure).
- Actor + DispatchSerialQueue executor model.
- v0.7.0 CATap work (reverted in v0.8.0, D10 supersedes D5).
- Apple Foundation Models for summarization/titling (orthogonal follow-up).

## 11. Notes for a resuming session

If you are picking this up fresh without prior context:

1. **Re-run the SpeechTranscriber locale check** (section 8.6) on your current macOS. If Ukrainian is now in the list, the decision matrix in 6.2 changes and section 7's Apple dismissal reasoning needs revisiting.
2. **Check FluidAudio's current releases** at https://github.com/FluidInference/FluidAudio/releases for a multilingual cache-aware streaming model (as of 2026-04-24 only English has cache-aware streaming via Parakeet EOU). If one has landed, sliding-window is no longer the only multilingual streaming option.
3. **Re-verify Chirp 3 streaming + Ukrainian + diarization** coverage from the Chirp 3 model docs. If all three are confirmed together, Google becomes a stronger candidate for either primary (if D2 is relaxed) or fallback (always).
4. **Confirm mlx-audio-swift maturity** at https://github.com/Blaizzy/mlx-audio-swift. If v1.0+ with Swift 6 compatibility, it joins the primary-path candidate list.
5. **Re-read this issue's section 6.2 decision matrix.** If any cell value changed, update the matrix before acting.

Today's leaning (2026-04-24): primary path = FluidAudio Parakeet TDT v3 + streaming Sortformer or LS-EEND; fallback path = Soniox async (currently integrated) and potentially Google Chirp 3.

## 12. Spec impact (conditional)

When individual sub-changes land, `docs/specification.md` needs corresponding updates. Conditional so the issue does not rot as directions evolve.

- IF on-device live transcription (5.A candidate) is adopted → add a new decision entry (D13 or next available) documenting engine choice, fallback policy, and the multilingual accuracy tradeoffs.
- IF runtime AEC (5.B candidate) is adopted → replace or deprecate D7 (post-processing AEC rationale).
- IF per-track live-write (5.C candidate) is adopted → remove or substantially revise D8 (silence gap filling); the behavior becomes unnecessary rather than load-bearing. Also revise tail padding and leading silence references.
- IF the System Overview diagram needs to reflect a live transcription branch → add the on-device transcript layer to the diagram.

Sub-changes are independent; spec edits can land with each.

## 13. Related

- Supersedes #3, which explored on-device transcription against an earlier generation of local ASR APIs; the candidate engine has since replaced those with streaming-capable equivalents.

