diff --git a/docs/reference/voice-flow-mic-mute-drain.md b/docs/reference/voice-flow-mic-mute-drain.md
new file mode 100644
index 0000000..3bfb792
--- /dev/null
+++ b/docs/reference/voice-flow-mic-mute-drain.md
@@ -0,0 +1,73 @@
+# Voice Flow — Mic Mute Drain Window (2026-05-19)
+
+**Component:** `src/app.js` — ClawdbotMode streaming response handler + `playNextAudio()`
+**Branch / PR:** `fix/mic-mute-hold-during-tts-pending` → see GitHub PR
+**Origin incident:** 2026-05-19, mic captured tail of TTS audio as user speech mid-response.
+
+---
+
+## Symptom
+
+During a long streamed response, the user reported the mic was "off in timing with the actual audio playback" — STT picked up the end of the TTS being played, then the mic appeared hot before the next TTS chunk played. The Action Console showed:
+
+```
+Response complete (1381 chars, LLM: 39679ms)
+🔊 Playing TTS (TTS: 0ms)         ← chunk 1
+🔊 Playing TTS (TTS: 0ms)         ← chunk 2
+                                  ← 22-second silent gap
+🔊 Playing TTS (TTS: 22021ms)     ← chunk 3 (generation took 22s)
+🔊 Playing TTS (TTS: 22068ms)
+🔊 Playing TTS (TTS: 22069ms)
+🔊 Playing TTS (TTS: 23637ms)
+🔊 Playing TTS (TTS: 23644ms)
+🔊 Playing TTS (TTS: 23646ms)
+🔊 Playing TTS (TTS: 23650ms)
+```
+
+## Root Cause
+
+`ClawdbotMode.playNextAudio()` ran an 800ms drain timer when the audio queue emptied. That window was a debounce to handle short inter-sentence gaps. But Groq Orpheus TTS has been observed taking **22–25 seconds** to generate a single chunk under load. The 800ms drain timer fired long before the next chunk arrived, the empty queue triggered `onListening()` → `stt.resume()` → mic hot. When the late chunk finally played, the mic captured it as speech.
+
+## Fix
+
+Extend the drain window dynamically based on whether the **server response stream is still open**:
+
+| Stream state | Drain wait |
+|---|---|
+| `_streamingResponseActive = true` (chunks may still arrive) | **30,000 ms** |
+| `_streamingResponseActive = false` (stream ended) | **800 ms** (unchanged) |
+
+New flag `_streamingResponseActive` (declared in constructor):
+- Set `true` immediately before the `fetch(?stream=1)` call
+- Set `false` in the streaming handler's `finally{}` block
+- When the stream ends and a long drain timer is pending against an empty queue, that block collapses the pending timer and re-invokes `playNextAudio()` so the short-window drain fires and the mic returns promptly
+
+## Why 30s
+
+Worst observed Orpheus gen latency in the incident was ~24s. 30s gives margin without crossing into "something's actually wrong" territory. If a chunk truly never arrives, the existing `INACTIVITY_TIMEOUT_MS = 60000` in the stream reader aborts the request, after which the `finally{}` clears `_streamingResponseActive` and the short-window drain releases the mic.
+
+## What did NOT change
+
+- SpeechRecognition lifecycle — still the same single instance, still `abort()` on mute, `start()` on resume. Per project rule, NEVER destroy/recreate SR instances.
+- The 800ms inter-sentence debounce is preserved for the normal case (post-stream drain).
+- AudioContext + queue ordering — untouched.
+- `_textDoneReceived` flag and interject logic — untouched.
+- PTT and wake-word flows — untouched.
+
+## Rollback
+
+Single commit. `git revert ` returns the file to pre-fix behavior — every drain falls back to 800ms unconditionally and the new flag is unused (declared as `false`, never read).
+
+## Monitoring
+
+Things to watch after deploy:
+1. **Echo captures decrease** — search `[VoiceSession] Ignoring transcript during TTS` in browser logs. Should drop on long responses.
+2. **Mic-hot timing matches audio playback** — Action Console "Playing TTS" lines should always precede `LISTENING` status transitions for streamed responses with audio.
+3. **Stop button behavior** — should remain visible the entire time TTS is in-flight, even during 20+ second Orpheus generation gaps.
+4. **No new stuck-in-listening states** — if `_streamingResponseActive` ever leaks `true` after a stream ends, the mic would stay muted indefinitely. The `finally{}` block and the 60s inactivity timeout both guard against this; verify by checking that long responses fully release the mic afterward.
+
+## Related
+
+- `src/providers/WebSpeechSTT.js` — `mute()` / `resume()` semantics (no changes here)
+- `src/core/VoiceSession.js` — `onSpeakingChange` handler (no changes here)
+- Server-side TTS chunk timing — see openclaw response chunking + Groq Orpheus provider in OpenVoiceUI/`tts_providers/`
diff --git a/src/app.js b/src/app.js
index b261e0c..a9afde8 100644
--- a/src/app.js
+++ b/src/app.js
@@ -3462,6 +3462,15 @@ connectAiradio();
         // fetch. Checked in sendMessage() to fall through to the normal
         // fresh-request path instead of interject.
         this._textDoneReceived = false;
+        // Drain-timer extension: when the server response stream is still
+        // open, more TTS chunks may arrive with LONG gaps (Groq Orpheus
+        // takes 20-25s per chunk under load). The default 800ms drain
+        // briefly empties the audio queue between chunks → STT resumes →
+        // mic captures the next chunk as user speech. While this flag is
+        // true, playNextAudio() uses a 30s drain instead. The stream's
+        // finally{} clears the flag, after which the normal 800ms drain
+        // releases the mic promptly. Origin: 2026-05-19 echo capture bug.
+        this._streamingResponseActive = false;
 
         // Use shared STT instance instead of creating a new one
         // This prevents conflicts with VoiceConversation's STT
@@ -3943,6 +3952,7 @@ connectAiradio();
             const gatewayAgentId = localStorage.getItem('gateway_agent_id') || null;
             this._fetchAbortController = new AbortController();
             this._textDoneReceived = false;  // new stream — reset the race-window guard
+            this._streamingResponseActive = true;  // stream open — TTS chunks may arrive with long gaps; see constructor note
             const response = await fetch(`${this.config.serverUrl}/api/conversation?stream=1`, {
                 method: 'POST',
                 signal: this._fetchAbortController.signal,
@@ -4648,6 +4658,17 @@ connectAiradio();
             if (_inactivityTimer) clearTimeout(_inactivityTimer);
             this._sending = false;
             this._fetchAbortController = null;
+            // Stream is done. Future drain timer fires should use the short
+            // 800ms wait again. If an extended-wait drain timer is currently
+            // pending and the queue is empty, collapse it to the short window
+            // so the mic returns promptly after the response ends.
+            // See constructor note on _streamingResponseActive.
+            this._streamingResponseActive = false;
+            if (this._drainTimer && this.audioQueue.length === 0) {
+                clearTimeout(this._drainTimer);
+                this._drainTimer = null;
+                this.playNextAudio();  // re-run drain logic with short wait now
+            }
             // Safety net: if no audio was queued/played, STT never gets restarted
             // via onListening callback. Ensure mic comes back after a short delay.
             // Only fires if call is still active (_voiceActive) — prevents restart after hang-up.
@@ -4924,6 +4945,16 @@ connectAiradio();
             // Don't immediately transition to listening — more TTS chunks
             // may be in-flight from streamed sentences. Wait briefly and
             // check again so the stop button doesn't flash between sentences.
+            //
+            // 2026-05-19: extend the drain window while the server response
+            // stream is still open. Groq Orpheus has been observed taking
+            // 22-25 SECONDS to generate a single TTS chunk under load; the
+            // old 800ms wait empties the queue between chunks, STT resumes,
+            // and the mic captures the late chunk as user speech (echo).
+            // While _streamingResponseActive is true, wait up to 30s. The
+            // stream's finally{} clears the flag and re-arms the short
+            // timer so the mic releases promptly after the stream ends.
+            const drainMs = this._streamingResponseActive ? 30000 : 800;
             if (!this._drainTimer) {
                 this._drainTimer = setTimeout(() => {
                     this._drainTimer = null;
@@ -4956,7 +4987,7 @@ connectAiradio();
                             }, 600);
                         }
                     }
-                }, 800); // 800ms grace period for next TTS chunk to arrive
+                }, drainMs);
             }
             return;
         }
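
The interaction between the three touched sites (flag set before `fetch`, flag cleared in `finally{}`, and `drainMs` selection in `playNextAudio()`) can be sketched in isolation. The following is a minimal, hypothetical model and not the real `ClawdbotMode`: the class name `DrainModel`, the method `onStreamEnd()`, the constants, and the injectable `schedule`/`cancel` parameters are all assumptions made for illustration, and the timer callback is simplified relative to the one in `src/app.js`.

```javascript
// Hypothetical, simplified model of the drain logic in this diff.
// Names here (DrainModel, onStreamEnd, schedule, cancel) are illustrative only.
const LONG_DRAIN_MS = 30000; // wait used while the response stream is open
const SHORT_DRAIN_MS = 800;  // inter-sentence debounce once the stream has ended

class DrainModel {
  // schedule/cancel default to the global timers; they are injectable so the
  // behaviour can be exercised without waiting on real time.
  constructor({
    onListening,
    schedule = (fn, ms) => setTimeout(fn, ms),
    cancel = (id) => clearTimeout(id),
  }) {
    this.audioQueue = [];
    this.streamingResponseActive = false;
    this.drainTimer = null;
    this.onListening = onListening;
    this.schedule = schedule;
    this.cancel = cancel;
  }

  playNextAudio() {
    if (this.audioQueue.length > 0) {
      this.audioQueue.shift(); // real code would start playback here
      return;
    }
    // Queue is empty: pick the wait based on whether more chunks may arrive.
    const drainMs = this.streamingResponseActive ? LONG_DRAIN_MS : SHORT_DRAIN_MS;
    if (this.drainTimer === null) {
      this.drainTimer = this.schedule(() => {
        this.drainTimer = null;
        if (this.audioQueue.length > 0) {
          this.playNextAudio(); // a late chunk arrived during the wait
        } else {
          this.onListening();   // nothing arrived: release the mic
        }
      }, drainMs);
    }
  }

  // Mirrors what the finally{} block does in the diff: clear the flag, then
  // collapse a pending long wait into the short one if the queue is empty.
  onStreamEnd() {
    this.streamingResponseActive = false;
    if (this.drainTimer !== null && this.audioQueue.length === 0) {
      this.cancel(this.drainTimer);
      this.drainTimer = null;
      this.playNextAudio();
    }
  }
}
```

In this sketch, calling `playNextAudio()` on an empty queue while `streamingResponseActive` is `true` arms a 30,000 ms timer; a subsequent `onStreamEnd()` cancels it and re-arms an 800 ms timer, which is the "collapse" behaviour the `finally{}` hunk describes.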