Skip to content

Design refactor: live transcription, runtime echo cancellation, per-track live-write pipeline #11

@tenequm

Description

@tenequm

1. Summary

This umbrella captures three design directions we're exploring for Blackbox v1 (live transcription, runtime echo cancellation, live-write substrate), the engine-landscape evidence collected on 2026-04-24, and the invariants any proposed change must preserve. Not all decisions are made. Sections below distinguish evidence (facts about the current landscape) from direction (today's leaning and what would change it). A fresh session picking this up should start at section 11 to rebuild context before acting.

2. Current-state observations

Three pressure points in the current design. Observations, not a case being built.

  1. Transcription is offline-only and brittle at length. The current path (TranscriptionService.swift via Soniox async) uploads the finished M4A, queues for async processing, and returns a transcript minutes later. Recordings over roughly one hour have been observed to fail at the provider's finalize step. There is no transcript available during the call itself.

  2. Echo cancellation runs as a post-recording pass. AECProcessor.swift produces audio-processed.m4a alongside audio.m4a, forcing the playback UI (MainWindowView.swift:149-151) to branch on which file exists. The post-pass is fire-and-forget and has a silent-failure path. The user's mental model has to accommodate both artifacts.

  3. The live-writer carries complexity specific to AVAssetWriter's constraints. Several architectural constraints in docs/specification.md (D8 silence gap filling, tail padding, leading silence, first-buffer-format immutability) exist because AVAssetWriter collapses PTS gaps and freezes track format on the first appended sample. This complexity is not a product requirement; it is a tax paid for using AVAssetWriter as the live target.

3. Invariants to preserve

Hard constraints any candidate change must respect. Non-negotiable.

  • No silent audio loss mid-call. Users are on the call and unaware that recording is happening; the system must self-heal without their intervention.
  • Automatic recovery from device switches. D12's three-layer mic recovery (AVAudioEngineConfigurationChange + kAudioHardwarePropertyDefaultInputDevice listener + buffer-arrival watchdog) stays.
  • Crash-safe partial recordings. Any point-in-time file on disk must be recoverable to the extent audio was captured.
  • Independent failure domains. One sub-component failing (system audio, mic, transcript, AEC) cannot take down the others.
  • Existing product surfaces stay. D3 polling call detection, D6 permission-loss handling, D9 device latency compensation, AudioMonitor restart budget + permission-terminal state, AudioRecorder actor + DispatchSerialQueue executor, SCStream display-wide system audio capture, hardware smoke tests.

4. Design philosophy

  • D2 independent failure domains remain load-bearing. No new cross-component coupling.
  • Audio capture never loses samples. Every other pipeline output is best-effort with graceful degradation.
  • Prefer a simpler failure-mode story over a smaller line count. Lines are cheap; silent-failure paths are expensive.
  • Accept and label quality limits. Limitations should be documented rather than hidden.

5. Design surfaces under consideration

Three areas where change is being evaluated. Candidate directions, not committed decisions.

5.A. Transcription timing

Current: offline-after-stop, single provider (Soniox async).

Observed pain: no live feedback during call; 1hr+ finalize failure; upload+queue latency on long files.

Candidate direction: on-device streaming ASR + speaker diarization alongside the recorder, with a user-selectable cloud fallback retained for failure cases, legacy recordings, and explicit user preference. Transcript is a sidecar file (extending the current transcript.json convention) and may gain provenance metadata (engine, language, live vs cloud, confidence).

Tradeoffs of the candidate:

  • On-device eliminates network failure domain in the recording path (respects D2).
  • On-device adds model-weight storage cost (DMG or first-launch download, see section 8).
  • Multilingual streaming today is sliding-window over batch, not true cache-aware. Chunk cadence on the order of 1-2 seconds.
  • Mixed-language calls (EN+RU+UK) have known accuracy weakness until upstream language-hint API lands.

What would change the preference: Apple SpeechTranscriber adding Ukrainian; FluidAudio shipping a multilingual cache-aware streaming model; mlx-audio-swift reaching production maturity; a cloud provider shipping native multi-channel diarization with a multi-hour session cap.

5.B. Echo cancellation timing

Current: post-recording CoreML DTLN-aec 256-unit pass writing audio-processed.m4a.

Observed pain: two-file user model, UI branching on file existence, silent failure path, fire-and-forget lifecycle.

Candidate direction: runtime AEC in the capture path, consuming the mic signal with system audio as reference (same reference relationship as today's post-pass, different lifecycle). Output is a single authoritative file per recording; post-recording pass and its sidecar artifact go away. Explicitly not Voice-Processed I/O (D7 rejected VPIO because it silences SCStream capture); a streaming DTLN-family AEC processes captured signals without touching the output path.

Tradeoffs of the candidate:

  • Collapses two-file model to one.
  • Moves compute from post-stop to realtime, adding steady-state CPU/ANE load during the call.
  • Needs verification that the streaming AEC library does not collide with SCStream in practice.

What would change the preference: no drop-in RT-safe AEC model available; SCStream interaction turns out to be problematic; battery cost on M1/M2 is unacceptable.

5.C. Live-write substrate

Current: AVAssetWriter multi-track M4A live-writer with movieFragmentInterval = 10s for crash safety.

Observed pain: D8, tail padding, leading silence, first-buffer-format immutability all exist to work around AVAssetWriter's constraints. A single track failure puts the whole writer in .failed state, potentially losing both tracks.

Candidate direction: parallel per-track raw PCM (WAV) files during capture, muxed to a single multi-track M4A at stop. Periodic WAV header rewrite (every ~10s matching current fragment cadence) preserves "natively playable at any point in time" guarantee. Final user-visible artifact unchanged.

Tradeoffs of the candidate:

  • Deletes D8 (WAV has no PTS to collapse), tail padding, leading silence, first-buffer-format immutability.
  • Independent per-track failure domains (one track failing cannot corrupt the other).
  • Handles mid-stream format changes by closing old WAV and opening new; mux concatenates.
  • Temporary disk footprint during recording: ~1.5 GB of raw PCM scratch for a 2-hour stereo 48kHz call, released after mux at stop.
  • 5-15 seconds of mux pass at stop before the M4A is finalized.

What would change the preference: AVAssetWriter gains per-input failure isolation and PTS-gap preservation (neither plausible); scratch disk becomes a real constraint on target hardware.

6. Engine landscape (2026-04-24 snapshot)

Evidence per option, then a decision matrix, then today's leaning.

6.1 Evaluated options (evidence only, no verdicts)

FluidAudio - Swift Package, Apache-2.0, CoreML + Apple Neural Engine.

  • 25 European languages including Ukrainian (7.2% WER on FLEURS-uk, 5.1% WER on CoVoST-uk) via Parakeet TDT v3. Russian also 7.2% WER.
  • Multilingual streaming = sliding-window over batch Parakeet TDT v3 (~1-2s chunk cadence), not cache-aware.
  • True cache-aware streaming available only for English (Parakeet EOU 120M).
  • Live diarization: streaming Sortformer (4 speakers, ~1s latency, stronger identity stability) or LS-EEND (10 speakers, 100ms finalized + 900ms preview).
  • v0.14.x as of 2026-04-24. 10 releases in prior 30 days (rapid churn).
  • No language-hint API yet (tracked upstream).

Apple SpeechTranscriber - Speech framework, macOS 26.0+, on-device.

  • Runtime-verified on macOS 26.5: SpeechTranscriber.supportedLocales returns 30 locales (varieties of EN, DE, ES, FR, IT, JA, KO, PT, ZH, yue).
  • Ukrainian (uk_UA) and Russian (ru_RU) not present.
  • Apple does not ship first-party on-device speaker diarization on macOS.

Apple Foundation Models - text-only on-device LLM, macOS 26.0+. Not an ASR option. Could be useful later for post-transcript summarization or titling.

Soniox async (currently integrated) - REST upload+poll+fetch.

  • Ukrainian supported.
  • Known limitation: finalize step observed to fail on recordings roughly beyond 1 hour.

Soniox realtime WebSocket - distinct from their async.

  • 5-hour hard session cap.
  • Diarization in realtime is mono acoustic detection only (no channel-based); docs explicitly flag accuracy as lower than async.
  • No native session-resume protocol; drops = new session, context lost.
  • ~$0.12/hr streaming (vs $0.10/hr async).

Google Cloud Speech-to-Text Chirp 3 - STT v2 API.

  • Ukrainian GA in supported-languages table across eu, us, us-central1, europe-west4, asia-southeast1, global regions.
  • Streaming supported; diarization supported.
  • Whether streaming + Ukrainian + diarization simultaneously cover the same recognition path needs verification (markdown extract lost the feature-column markers).

Gemini Live API - bidirectional WebSocket.

  • Audio-only session cap: 15 minutes without compression.
  • Requires session-resumption token dance to extend.
  • "70 supported languages" claimed; Ukrainian not explicitly confirmed in overview.
  • No documented speaker diarization.

Gladia realtime - WebSocket.

  • Native multi-channel demux: channels: 2 in config splits stereo stream and tags each utterance with channel. Unique among realtime providers; enables mic-left + system-right → per-speaker labeling in a single session.
  • pyannoteAI Precision-2 diarization under the hood.
  • ~$0.25-$0.75/hr depending on tier.

Deepgram Nova-3 streaming - WebSocket.

  • multichannel=true returns separate per-channel messages, up to 20 channels. Billed per channel.
  • Nova-3 claims 53.4% WER reduction vs competitors on streaming.
  • ~$0.0077/min = $0.462/hr Nova-3 PAYG.

mlx-audio-swift - Swift companion to mlx-audio Python library.

  • MIT license, 589 stars, v0.1.2, macOS 14+, Swift 5.9+ (not Swift 6).
  • Supports Parakeet v3 (multilingual incl. Ukrainian inherited).
  • Streaming capability documented (generateStream API).
  • Experimental pre-1.0 maturity; no production shipping app reference.

6.2 Decision matrix (for primary live-transcription path)

Option Ukrainian Live streaming Diarization (live) On-device D2 compatible 2h+ session durable Swift integration
FluidAudio Parakeet v3 + Sortformer/LS-EEND Yes (7.2% WER) Sliding-window, ~1-2s chunks Yes Yes Yes Yes (no session boundary) Native Swift package
Apple SpeechTranscriber No Yes No Yes Yes Yes Native
Google Chirp 3 Yes (GA batch) Yes (streaming TBV for uk) Yes (TBV for uk) No No TBV gRPC, no Swift SDK
Soniox realtime Yes Yes Mono acoustic only No No No (5h cap) WebSocket (URLSession)
Gemini Live TBV Yes No No No No (15min cap) WebSocket (URLSession)
Gladia Likely Yes Native 2-channel No No TBV WebSocket
Deepgram Nova-3 Likely Yes Multichannel=true No No No doc'd cap WebSocket
mlx-audio-swift Parakeet v3 Yes (inherited) Claimed TBV Yes Yes Yes Experimental

TBV = to be verified.

6.3 Today's leaning

Primary path: FluidAudio leads on the evidence above. It is the only option that satisfies all hard requirements (Ukrainian + live + diarization + on-device + D2 + multi-hour durability) today. Expected cost: ~1-2 second chunk cadence (vs sub-second from cloud realtime), mixed-language accuracy tax until language-hint lands upstream.

Cloud fallback: Soniox async stays (currently integrated). Google Chirp 3 is a credible alternate worth adding once streaming+Ukrainian+diarization coverage is verified.

Re-verify this leaning if any of the following changes:

  • Apple adds Ukrainian to SpeechTranscriber in a future macOS release.
  • FluidAudio ships a multilingual cache-aware streaming model.
  • mlx-audio-swift reaches production maturity (v1.0+, Swift 6, shipping app references).
  • A cloud provider ships native multi-channel diarization with a 2h+ session cap and confirmed Ukrainian.

7. Dismissed for v1 with reasons

Explicit to prevent relitigation without new evidence.

  • Apple SpeechTranscriber: no Ukrainian in the macOS 26.5 runtime locale list (verified). No first-party Apple diarization on macOS. Reconsider if Apple adds uk_UA.
  • Apple Foundation Models: not an ASR framework. Usable only for post-transcript text work.
  • Soniox realtime WebSocket: 5-hour hard session cap plus no native resume conflicts with multi-hour call recording; realtime diarization is mono acoustic only. Reconsider if Soniox raises the cap and adds channel-based diarization.
  • Gemini Live API: 15-minute audio-only session cap plus no documented speaker diarization make this a non-starter for call recording today. Reconsider if both constraints are lifted.
  • Gladia: credible for cloud-primary (native 2-channel demux is uniquely well-suited to mic+system labeling), but kept out of v1 because cloud-primary violates D2. Keep on radar for a future "cloud-primary with no on-device dependency" configuration if demand exists.
  • Deepgram Nova-3 streaming: similar to Gladia (strong cloud realtime, multichannel supported), same D2 concern. Keep on radar.
  • mlx-audio-swift: too early (v0.1.2, experimental, Swift 5.9+, no production references). Reconsider if it reaches v1.0+ with Swift 6 compatibility and shipping app adoption.

8. Implementation reference facts

Hard-to-rediscover specifics. The things that actually slow a fresh session down if lost.

8.1 FluidAudio integration

  • Package: https://github.com/FluidInference/FluidAudio (Swift Package, Apache-2.0)
  • Pin a specific minor version (v0.14.x baseline as of 2026-04-24); release cadence is high (10 releases in prior 30 days), do not track main.
  • Multilingual ASR entry point: SlidingWindowAsrManager with Parakeet TDT v3 (sliding-window, not cache-aware).
  • Live diarization entry points: LSEENDDiarizer (up to 10 speakers) or SortformerDiarizer(config: .default) (4 speakers, stronger identity stability).
  • Model loader pattern: SortformerModels.loadFromHuggingFace(config: .default) etc.

8.2 HuggingFace model IDs

  • ASR: FluidInference/parakeet-tdt-0.6b-v3-coreml
  • Streaming diarizer (Sortformer variant): FluidInference/diar-streaming-sortformer-coreml (CC-BY-4.0)
  • If LS-EEND is chosen instead, locate its repo in FluidInference's HuggingFace org.

8.3 Parakeet runtime-required file subset

The full HuggingFace repo is 2.69 GB but contains legacy variants.

Runtime-required (newest): Decoder.mlmodelc, Encoder.mlmodelc, JointDecisionv3.mlmodelc, MelEncoder.mlmodelc (int-8 quantized), Preprocessor.mlmodelc, parakeet_v3_vocab.json.

Legacy (skip): JointDecision.mlmodelc (v1), JointDecisionv2.mlmodelc, JointDecisionv3.mlpackage (duplicate of the .mlmodelc), Melspectrogram_15s.mlmodelc, ParakeetEncoder_15s.mlmodelc, ParakeetDecoder.mlmodelc, RNNTJoint.mlmodelc, parakeet_vocab.json (pre-v3), mlpackages/ folder.

Estimated runtime subset: 600 MB to 1.2 GB (not yet precisely measured; drives the bundle-vs-download decision).

8.4 License inventory (three distinct licenses in dependency graph)

  • FluidAudio library: Apache-2.0 (permissive).
  • Parakeet TDT v3 CoreML model: derived from NVIDIA's Parakeet TDT v3; inherits NVIDIA's license terms.
  • Streaming Sortformer diarizer: CC-BY-4.0 (attribution required, commercial use permitted).

Attribution placement (About panel / LICENSE file / settings) is a product decision.

8.5 Runtime AEC target model

  • Current batch AEC: DTLN-aec 256-unit CoreML (via DTLNAec256 package in AECProcessor.swift).
  • Runtime replacement target: DTLN-aec 512-unit .large variant (streaming member of the same family).

8.6 Apple SpeechTranscriber runtime re-check snippet

To re-verify whether Apple has added Ukrainian (or any other locale) to SpeechTranscriber.supportedLocales on a current macOS 26.x machine:

import Foundation
import Speech

@available(macOS 26.0, *)
@main struct Main {
    static func main() async {
        let locales = await SpeechTranscriber.supportedLocales
        let ids = locales.map { $0.identifier }.sorted()
        print("count=\(ids.count)")
        for id in ids { print(id) }
        print("has_uk=\(ids.contains { $0.hasPrefix("uk") })")
    }
}

Compile with swiftc -parse-as-library locales.swift -o locales && ./locales. Baseline on 2026-04-24 (macOS 26.5): 30 locales, no uk, no ru.

8.7 First-launch download UX

PR #3 already shipped a first-launch model download with progress UI. Whichever branch of bundle-vs-download wins, the existing UX is reusable rather than rewritten.

9. Open questions

9.1 Verification items (answerable with research)

  • Actual Parakeet runtime-subset size. Drives bundle-vs-download. Measure the runtime-required files listed in section 8.3.
  • Chirp 3 streaming + Ukrainian + diarization simultaneously. Confirm from Chirp 3 model docs, not the supported-languages table alone.
  • Runtime AEC compatibility with SCStream. Streaming DTLN-512 should not collide with SCStream the way VPIO does (D7), but verify explicitly before committing.
  • Sliding-window chunk cadence on real Ukrainian call audio. Pilot on actual call audio (Ihor, Chrome Meet).
  • Sparkle delta-update behavior with bundled models. If model is bundled in DMG, verify deltas do not re-ship the model with each release.
  • Whether LS-EEND or Sortformer is better for 2-person call identity stability. Empirical, not documentable.

9.2 Product decisions (require user judgment)

  • Bundle vs first-launch download. Driven by 9.1 sizing.
  • Cloud fallback trigger policy. Auto-run on live failure, or explicit user action?
  • Live transcript UI placement. Menu-bar popover, HUD, dedicated window, or main-window tab?
  • Transcript sidecar schema. Add provenance metadata (engine version, detected language, live vs cloud, confidence) or keep current shape?
  • Mixed-language UX. Language selector, warning banner, or no UI surface until upstream language-hint lands?
  • Cloud fallback provider list. Keep Soniox only, or add Google Chirp 3 as a selectable alternate in v1?
  • Attribution placement for the three-license dependency graph. About panel, LICENSE file, Settings, or multiple?

10. Out of scope

This umbrella explicitly does not touch:

  • Capture pipeline refactor beyond the live-write substrate (SCStream, AVAudioEngine, polling call detection, permission-loss handling, device latency compensation, mic recovery, restart budget, hardware smoke test infrastructure).
  • Actor + DispatchSerialQueue executor model.
  • v0.7.0 CATap work (reverted in v0.8.0, D10 supersedes D5).
  • Apple Foundation Models for summarization/titling (orthogonal follow-up).

11. Notes for a resuming session

If you are picking this up fresh without prior context:

  1. Re-run the SpeechTranscriber locale check (section 8.6) on your current macOS. If Ukrainian is now in the list, the decision matrix in 6.2 changes and section 7's Apple dismissal reasoning needs revisiting.
  2. Check FluidAudio's current releases at https://github.com/FluidInference/FluidAudio/releases for a multilingual cache-aware streaming model (as of 2026-04-24 only English has cache-aware streaming via Parakeet EOU). If one has landed, sliding-window is no longer the only multilingual streaming option.
  3. Re-verify Chirp 3 streaming + Ukrainian + diarization coverage from the Chirp 3 model docs. If all three are confirmed together, Google becomes a stronger candidate for either primary (if D2 is relaxed) or fallback (always).
  4. Confirm mlx-audio-swift maturity at https://github.com/Blaizzy/mlx-audio-swift. If v1.0+ with Swift 6 compatibility, it joins the primary-path candidate list.
  5. Re-read this issue's section 6.2 decision matrix. If any cell value changed, update the matrix before acting.

Today's leaning (2026-04-24): primary path = FluidAudio Parakeet TDT v3 + streaming Sortformer or LS-EEND; fallback path = Soniox async (currently integrated) and potentially Google Chirp 3.

12. Spec impact (conditional)

When individual sub-changes land, docs/specification.md needs corresponding updates. Conditional so the issue does not rot as directions evolve.

  • IF on-device live transcription (5.A candidate) is adopted → add a new decision entry (D13 or next available) documenting engine choice, fallback policy, and the multilingual accuracy tradeoffs.
  • IF runtime AEC (5.B candidate) is adopted → replace or deprecate D7 (post-processing AEC rationale).
  • IF per-track live-write (5.C candidate) is adopted → remove or substantially revise D8 (silence gap filling); the behavior becomes unnecessary rather than load-bearing. Also revise tail padding and leading silence references.
  • IF the System Overview diagram needs to reflect a live transcription branch → add the on-device transcript layer to the diagram.

Sub-changes are independent; spec edits can land with each.

13. Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions