Merged
73 changes: 73 additions & 0 deletions docs/reference/voice-flow-mic-mute-drain.md
@@ -0,0 +1,73 @@
# Voice Flow — Mic Mute Drain Window (2026-05-19)

**Component:** `src/app.js` — ClawdbotMode streaming response handler + `playNextAudio()`
**Branch / PR:** `fix/mic-mute-hold-during-tts-pending` → see GitHub PR
**Origin incident:** 2026-05-19, mic captured tail of TTS audio as user speech mid-response.

---

## Symptom

During a long streamed response, the user reported that the mic was "off in timing with the actual audio playback": STT picked up the tail of the TTS audio being played, and the mic then appeared hot before the next TTS chunk played. The Action Console showed:

```
Response complete (1381 chars, LLM: 39679ms)
🔊 Playing TTS (TTS: 0ms) ← chunk 1
🔊 Playing TTS (TTS: 0ms) ← chunk 2
← 22-second silent gap
🔊 Playing TTS (TTS: 22021ms) ← chunk 3 (generation took 22s)
🔊 Playing TTS (TTS: 22068ms)
🔊 Playing TTS (TTS: 22069ms)
🔊 Playing TTS (TTS: 23637ms)
🔊 Playing TTS (TTS: 23644ms)
🔊 Playing TTS (TTS: 23646ms)
🔊 Playing TTS (TTS: 23650ms)
```

## Root Cause

`ClawdbotMode.playNextAudio()` started an 800ms drain timer whenever the audio queue emptied. That window was a debounce for short inter-sentence gaps, but Groq Orpheus TTS has been observed taking **22–25 seconds** to generate a single chunk under load. The 800ms timer therefore fired long before the next chunk arrived, and the still-empty queue triggered `onListening()` → `stt.resume()`, leaving the mic hot. When the late chunk finally played, the mic captured it as speech.
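
The race can be checked on paper with a small timeline model. The sketch below is illustrative only: `firstEarlyMicOpen` and the gap values are hypothetical and not part of `src/app.js`. It takes the silent gaps that follow each played chunk (time from the queue emptying until the next chunk arrives) and reports the first gap that outlasts the drain window.

```javascript
// Hypothetical model of the drain race; not code from src/app.js.
// gapsMs[i] is the silence between the queue emptying after chunk i+1
// and the arrival of chunk i+2.
function firstEarlyMicOpen(gapsMs, drainMs) {
  for (let i = 0; i < gapsMs.length; i++) {
    // The drain timer starts when the queue empties. If the next chunk
    // takes longer than drainMs, the timer fires first and the mic opens.
    if (gapsMs[i] > drainMs) {
      return { afterChunk: i + 1, gapMs: gapsMs[i] };
    }
  }
  return null; // every chunk arrived inside the drain window
}

// Assumed gap pattern, loosely based on the incident log:
// a short gap after chunk 1, then roughly 22 s after chunk 2.
const incidentGaps = [50, 22000];
console.log(firstEarlyMicOpen(incidentGaps, 800));   // { afterChunk: 2, gapMs: 22000 }
console.log(firstEarlyMicOpen(incidentGaps, 30000)); // null
```

With the 800ms window the mic opens during the gap after chunk 2; with a 30,000ms window the same gap is absorbed.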

## Fix

Extend the drain window dynamically based on whether the **server response stream is still open**:

| Stream state | Drain wait |
|---|---|
| `_streamingResponseActive = true` (chunks may still arrive) | **30,000 ms** |
| `_streamingResponseActive = false` (stream ended) | **800 ms** (unchanged) |

New flag `_streamingResponseActive` (declared in constructor):
- Set `true` immediately before the `fetch(?stream=1)` call
- Set `false` in the streaming handler's `finally{}` block
- When the stream ends while a long drain timer is pending against an empty queue, the `finally{}` block cancels that timer and re-invokes `playNextAudio()`, so the 800ms drain takes over and the mic returns promptly
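
The flag and timer interplay described above can be sketched in isolation. This is a minimal sketch, not the code in `src/app.js`: the class name `DrainSketch`, the `endStream()` method (standing in for the streaming handler's `finally{}` block), the injectable timer functions, and `lastDrainMs` are all assumptions made so the logic can run without real audio or real delays.

```javascript
// Minimal sketch of the drain-window logic. Only _streamingResponseActive,
// _drainTimer, audioQueue, playNextAudio and onListening mirror names from
// the real code; everything else here is hypothetical.
class DrainSketch {
  constructor({ setTimer = setTimeout, clearTimer = clearTimeout, onListening = () => {} } = {}) {
    this.audioQueue = [];
    this._drainTimer = null;
    this._streamingResponseActive = false;
    this._setTimer = setTimer;
    this._clearTimer = clearTimer;
    this._onListening = onListening;
    this.lastDrainMs = null; // recorded for inspection only
  }

  playNextAudio() {
    if (this.audioQueue.length > 0) {
      this.audioQueue.shift(); // the real code would start playback here
      return;
    }
    // Queue is empty: wait before handing the mic back.
    const drainMs = this._streamingResponseActive ? 30000 : 800;
    if (!this._drainTimer) {
      this.lastDrainMs = drainMs;
      this._drainTimer = this._setTimer(() => {
        this._drainTimer = null;
        if (this.audioQueue.length === 0) this._onListening();
        else this.playNextAudio();
      }, drainMs);
    }
  }

  // Stands in for the streaming handler's finally{} block.
  endStream() {
    this._streamingResponseActive = false;
    if (this._drainTimer && this.audioQueue.length === 0) {
      this._clearTimer(this._drainTimer);
      this._drainTimer = null;
      this.playNextAudio(); // re-arm with the short window
    }
  }
}

// Usage with fake timers so nothing actually waits 30 s.
const timers = [];
const sketch = new DrainSketch({
  setTimer: (fn, ms) => { const t = { fn, ms }; timers.push(t); return t; },
  clearTimer: (t) => { const i = timers.indexOf(t); if (i !== -1) timers.splice(i, 1); },
});
sketch._streamingResponseActive = true; // set just before the fetch
sketch.playNextAudio();
console.log(sketch.lastDrainMs); // 30000
sketch.endStream();
console.log(sketch.lastDrainMs); // 800
```

Injecting the timer functions is a choice made for this sketch; it keeps the 30,000ms branch testable without waiting on a real clock.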

## Why 30s

The worst Orpheus generation latency observed in the incident was ~24s (23,650 ms in the Action Console log). 30s gives margin without crossing into "something's actually wrong" territory. If a chunk truly never arrives, the existing `INACTIVITY_TIMEOUT_MS = 60000` in the stream reader aborts the request, after which the `finally{}` clears `_streamingResponseActive` and the short-window drain releases the mic.
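
For readers unfamiliar with the inactivity-abort pattern, here is a hedged sketch of how such a watchdog is commonly built. Only the constant `INACTIVITY_TIMEOUT_MS = 60000` and the use of an `AbortController` come from this document; `createInactivityWatchdog`, `touch`, and `stop` are hypothetical names, and the real stream reader may be structured differently.

```javascript
// Hypothetical inactivity watchdog. The actual stream reader in src/app.js
// is not shown in this document and may differ.
const INACTIVITY_TIMEOUT_MS = 60000;

function createInactivityWatchdog(controller, timeoutMs = INACTIVITY_TIMEOUT_MS,
                                  setTimer = setTimeout, clearTimer = clearTimeout) {
  let timer = null;
  const stop = () => {
    if (timer !== null) {
      clearTimer(timer);
      timer = null;
    }
  };
  // Call touch() every time a chunk arrives; it restarts the countdown.
  const touch = () => {
    stop();
    timer = setTimer(() => {
      timer = null;
      controller.abort(); // rejects the pending fetch/read, so finally{} runs
    }, timeoutMs);
  };
  return { touch, stop };
}

// Usage with a fake timer so the example does not wait 60 s.
const controller = new AbortController();
let pending = null;
const watchdog = createInactivityWatchdog(
  controller,
  INACTIVITY_TIMEOUT_MS,
  (fn) => { pending = fn; return 1; },
  () => { pending = null; }
);
watchdog.touch();
pending(); // simulate the timeout elapsing with no data
console.log(controller.signal.aborted); // true
```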

## What did NOT change

- SpeechRecognition lifecycle — still the same single instance, still `abort()` on mute, `start()` on resume. Per project rule, NEVER destroy/recreate SR instances.
- The 800ms inter-sentence debounce is preserved for the normal case (post-stream drain).
- AudioContext + queue ordering — untouched.
- `_textDoneReceived` flag and interject logic — untouched.
- PTT and wake-word flows — untouched.

## Rollback

Single commit. `git revert <sha>` returns the file to pre-fix behavior: every drain falls back to 800ms unconditionally, and the `_streamingResponseActive` flag is removed along with every read and write of it.

## Monitoring

Things to watch after deploy:
1. **Echo captures decrease** — search `[VoiceSession] Ignoring transcript during TTS` in browser logs. Should drop on long responses.
2. **Mic-hot timing matches audio playback** — Action Console "Playing TTS" lines should always precede `LISTENING` status transitions for streamed responses with audio.
3. **Stop button behavior** — should remain visible the entire time TTS is in-flight, even during 20+ second Orpheus generation gaps.
4. **No new stuck-mic states**: if `_streamingResponseActive` ever leaks `true` after a stream ends, each later queue drain would hold the mic muted for the full 30s window instead of 800ms. The `finally{}` block and the 60s inactivity timeout both guard against this; verify by checking that long responses fully release the mic afterward.
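
For item 1, a rough count of echo captures in an exported browser log can be obtained with a helper like the one below. This is a sketch under assumptions: `countEchoCaptures` is a hypothetical name, and it assumes the log is available as plain text with one entry per line. Only the marker string is taken from this document.

```javascript
// Hypothetical helper; only the marker string comes from this document.
const ECHO_MARKER = '[VoiceSession] Ignoring transcript during TTS';

// Counts log lines that contain the echo-capture marker.
function countEchoCaptures(logText) {
  return logText
    .split('\n')
    .filter((line) => line.includes(ECHO_MARKER))
    .length;
}
```

Comparing the count for similar long responses before and after deploy gives a quick signal on whether echo captures dropped.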

## Related

- `src/providers/WebSpeechSTT.js` — `mute()` / `resume()` semantics (no changes here)
- `src/core/VoiceSession.js` — `onSpeakingChange` handler (no changes here)
- Server-side TTS chunk timing — see openclaw response chunking + Groq Orpheus provider in OpenVoiceUI/`tts_providers/`
33 changes: 32 additions & 1 deletion src/app.js
@@ -3462,6 +3462,15 @@ connectAiradio();
// fetch. Checked in sendMessage() to fall through to the normal
// fresh-request path instead of interject.
this._textDoneReceived = false;
// Drain-timer extension: while the server response stream is still
// open, more TTS chunks may arrive after LONG gaps (Groq Orpheus
// takes 22-25s per chunk under load). With the default 800ms drain,
// the timer expires while the audio queue is empty between chunks →
// STT resumes → mic captures the next chunk as user speech. While
// this flag is true, playNextAudio() uses a 30s drain instead. The
// stream's finally{} clears the flag, after which the normal 800ms
// drain releases the mic promptly. Origin: 2026-05-19 echo capture bug.
this._streamingResponseActive = false;

// Use shared STT instance instead of creating a new one
// This prevents conflicts with VoiceConversation's STT
@@ -3943,6 +3952,7 @@ connectAiradio();
const gatewayAgentId = localStorage.getItem('gateway_agent_id') || null;
this._fetchAbortController = new AbortController();
this._textDoneReceived = false; // new stream — reset the race-window guard
this._streamingResponseActive = true; // stream open — TTS chunks may arrive with long gaps; see constructor note
const response = await fetch(`${this.config.serverUrl}/api/conversation?stream=1`, {
method: 'POST',
signal: this._fetchAbortController.signal,
@@ -4648,6 +4658,17 @@ connectAiradio();
if (_inactivityTimer) clearTimeout(_inactivityTimer);
this._sending = false;
this._fetchAbortController = null;
// Stream is done. Future drain timer fires should use the short
// 800ms wait again. If an extended-wait drain timer is currently
// pending and the queue is empty, collapse it to the short window
// so the mic returns promptly after the response ends.
// See constructor note on _streamingResponseActive.
this._streamingResponseActive = false;
if (this._drainTimer && this.audioQueue.length === 0) {
clearTimeout(this._drainTimer);
this._drainTimer = null;
this.playNextAudio(); // re-run drain logic with short wait now
}
// Safety net: if no audio was queued/played, STT never gets restarted
// via onListening callback. Ensure mic comes back after a short delay.
// Only fires if call is still active (_voiceActive) — prevents restart after hang-up.
@@ -4924,6 +4945,16 @@ connectAiradio();
// Don't immediately transition to listening — more TTS chunks
// may be in-flight from streamed sentences. Wait briefly and
// check again so the stop button doesn't flash between sentences.
//
// 2026-05-19: extend the drain window while the server response
// stream is still open. Groq Orpheus has been observed taking
// 22-25 SECONDS to generate a single TTS chunk under load; the
// old 800ms wait expires while the queue is empty between chunks,
// STT resumes, and the mic captures the late chunk as user speech
// (echo). While _streamingResponseActive is true, wait up to 30s.
// The stream's finally{} clears the flag and re-arms the short
// timer so the mic releases promptly after the stream ends.
const drainMs = this._streamingResponseActive ? 30000 : 800;
if (!this._drainTimer) {
this._drainTimer = setTimeout(() => {
this._drainTimer = null;
@@ -4956,7 +4987,7 @@ connectAiradio();
}, 600);
}
}
}, 800); // 800ms grace period for next TTS chunk to arrive
}, drainMs);
}
return;
}