Claude/voice conversion realtime skbru #1

Open
punitst wants to merge 2 commits into main from claude/voice-conversion-realtime-Skbru

Conversation


@punitst punitst commented Jan 29, 2026

Summary by CodeRabbit

  • New Features
    • Added real-time voice conversion capability with WebSocket streaming support
    • Introduced voice selection and customization via reference audio
    • Added voice management endpoints to list and configure available voices
    • Enabled live audio conversion in configurable frame sizes with status tracking


Implements a complete voice conversion system based on Google's StreamVC
architecture for low-latency (~70-100ms) voice-to-voice conversion.

Components:
- CausalContentEncoder: Extracts speaker-independent linguistic features
- F0Extractor: Causal pitch extraction with whitening to prevent identity leakage
- SpeakerEncoder: d-vector style speaker embedding extraction
- StreamingDecoder: FiLM-conditioned decoder for target voice synthesis
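The F0 whitening mentioned above can be illustrated with a minimal sketch (not the repository's code; it assumes unvoiced frames are marked with 0 Hz):

```python
import numpy as np

def whiten_f0(f0: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize voiced pitch values to zero mean / unit variance.

    Absolute pitch is a strong speaker cue; whitening keeps only the
    relative contour, so the source speaker's identity cannot leak
    into the decoder through F0.
    """
    voiced = f0 > 0  # assumption: 0 Hz marks unvoiced frames
    out = np.zeros_like(f0)
    mean = f0[voiced].mean()
    std = f0[voiced].std() + eps
    out[voiced] = (f0[voiced] - mean) / std
    return out

f0 = np.array([0.0, 200.0, 220.0, 180.0, 0.0])  # Hz, unvoiced at the edges
whitened = whiten_f0(f0)
```

A truly causal implementation would use running statistics rather than whole-utterance mean and variance; the batch version above only shows the idea.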

The architecture uses causal convolutions throughout to enable true streaming
inference, compatible with the existing PersonaPlex WebSocket infrastructure.
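Causal here means each output sample depends only on current and past inputs, which is achieved by left-padding the input instead of padding symmetrically. A minimal pure-Python sketch of the idea (illustrative, not the model code):

```python
def causal_conv1d(x, kernel):
    """1-D convolution where y[t] depends only on x[0..t].

    Left-padding with kernel_size - 1 zeros (instead of symmetric
    padding) means the layer adds no lookahead, so it can run on a
    live stream with zero algorithmic latency.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    # y[t] = sum_j kernel[j] * x[t - j]
    return [sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
            for t in range(len(x))]

y = causal_conv1d([1.0, 2.0, 3.0], [1.0, 1.0])  # y[t] = x[t] + x[t-1]
```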

https://claude.ai/code/session_01RMegy6932bUxBLMShQnkm5
Server (vc_server.py):
- WebSocket endpoint for real-time voice conversion at /api/vc
- HTTP endpoints for voice management (/api/vc/voices, /api/vc/set_voice)
- Reference mode for voice cloning from user audio
- Pre-computed voice embedding support
- Uses same protocol as main PersonaPlex server for compatibility
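The wire format implied by the hook and server code quoted in the review below is a one-byte message type followed by the payload (0x00 handshake, 0x01 audio, 0x03 control, 0x05 error). A minimal framing sketch, assuming that tag layout:

```python
# Message-type tags as used by the client hook (assumed from the review diffs)
MSG_HANDSHAKE, MSG_AUDIO, MSG_CONTROL, MSG_ERROR = 0x00, 0x01, 0x03, 0x05

def pack_frame(msg_type: int, payload: bytes) -> bytes:
    """Prefix the payload with its one-byte message type."""
    return bytes([msg_type]) + payload

def unpack_frame(frame: bytes) -> tuple[int, bytes]:
    """Split a binary WebSocket frame back into (type, payload)."""
    return frame[0], frame[1:]

frame = pack_frame(MSG_CONTROL, b"reference_ready")
msg_type, payload = unpack_frame(frame)
```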

Client (useVoiceConversion.ts):
- React hook for voice conversion with status tracking
- Microphone capture with Opus encoding
- Reference audio collection for voice cloning
- Available voices fetching from server


coderabbitai Bot commented Jan 29, 2026

📝 Walkthrough

Introduces a complete real-time voice conversion system comprising a React client hook with WebSocket integration, a PyTorch streaming voice conversion model with content/F0/speaker encoding, and an aiohttp-based server supporting WebSocket streaming and HTTP voice management endpoints.

Changes

  • Client Voice Conversion Hook — client/src/pages/Conversation/hooks/useVoiceConversion.ts
    New React hook exposing WebSocket-based voice conversion with lifecycle management, microphone recording via opus-recorder, configurable callbacks, and methods to start/stop conversion, set the target voice, handle reference audio, and fetch available voices.
  • Voice Conversion Model — moshi/moshi/models/voice_conversion.py
    End-to-end streaming voice conversion model with ContentEncoder (causal downsampling), F0Extractor (pitch extraction with statistics), SpeakerEncoder (speaker embedding), FiLMLayer (speaker conditioning), StreamingDecoder (synthesis with upsampling), and StreamVCModel integrating all components with speaker management and streaming conversion methods.
  • Voice Conversion Server — moshi/moshi/vc_server.py
    WebSocket/HTTP server for real-time voice conversion featuring VCServerState for model management, handshake and streaming protocol handling, reference audio collection for target-voice derivation, Opus encoding for efficient frame transmission, HTTP endpoints for voice listing and setting, and device auto-selection with warmup initialization.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client (React Hook)
    participant Mic as Microphone/Opus Recorder
    participant WebSocket as WebSocket Server
    participant Model as Voice Conversion Model
    participant SpeakerEnc as Speaker Encoder

    Client->>WebSocket: Connect WebSocket
    WebSocket-->>Client: Handshake OK
    
    Client->>Mic: Start recording
    Mic->>Client: Audio frames (20ms)
    
    alt Reference Mode
        Client->>WebSocket: Send reference audio chunk
        WebSocket->>SpeakerEnc: Compute mel from reference
        SpeakerEnc->>Model: Extract speaker embedding
        Model->>WebSocket: Reference processed
        WebSocket-->>Client: Status: reference_ready
    end
    
    Client->>Mic: Continue recording
    Mic->>Client: Audio frame
    Client->>WebSocket: Send audio data
    
    WebSocket->>Model: Process audio chunk
    Model->>Model: Content extraction
    Model->>Model: F0 extraction + speaker conditioning
    Model->>Model: Streaming decode (synthesis)
    Model-->>WebSocket: Converted audio
    
    WebSocket->>WebSocket: Encode with Opus
    WebSocket-->>Client: Send converted frame
    Client->>Client: onConvertedAudio callback
    
    Client->>Mic: Stop recording
    Client->>WebSocket: Close connection
    WebSocket-->>Client: Disconnected
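The "configurable frame sizes" step in the flow above amounts to a rolling PCM buffer that emits fixed-size frames as soon as enough samples have accumulated. A stand-alone sketch (the frame_chunks helper is hypothetical, not the server's code):

```python
def frame_chunks(buffer, incoming, frame_size):
    """Append incoming samples to the buffer and return (frames, leftover).

    Each returned frame has exactly frame_size samples; anything shorter
    stays in the buffer until the next call.
    """
    buffer = buffer + incoming
    frames = []
    while len(buffer) >= frame_size:
        frames.append(buffer[:frame_size])
        buffer = buffer[frame_size:]
    return frames, buffer

frames, rest = frame_chunks([], list(range(7)), frame_size=3)
# frames == [[0, 1, 2], [3, 4, 5]], rest == [6]
```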

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰✨ A hop through streaming voices bright,
With models trained to get it right,
WebSockets dance and embeddings sing,
Real-time conversion on the wing!
~🥕

🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive)
❌ Failed checks
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 53.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title Check — ❓ Inconclusive: the title 'Claude/voice conversion realtime skbru' is only partially related to the changeset; it is vague and reuses the branch name rather than describing the main change. Resolution: revise the title to clearly describe the feature, e.g. 'Add real-time voice conversion hook and streaming server'.
✅ Passed checks
  • Description Check — ✅ Passed: skipped because CodeRabbit's high-level summary is enabled.




@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62981a418f


Comment on lines +239 to +244
recorder.ondataavailable = (opusData: ArrayBuffer) => {
  if (
    socketRef.current &&
    socketRef.current.readyState === WebSocket.OPEN &&
    status === "converting"
  ) {

P1: Avoid stale status gating that blocks audio send

The recorder.ondataavailable callback closes over the status value captured when start() was invoked. After the server handshake updates state to "converting", this callback still sees the old value (e.g., "connecting"), so the status === "converting" guard stays false and no audio is ever sent. This means the voice conversion stream never receives microphone data unless start() is called again. Consider storing status in a ref or removing the status gate so the callback sees the latest state.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@client/src/pages/Conversation/hooks/useVoiceConversion.ts`:
- Around line 182-206: The switch cases in useVoiceConversion.ts leak
block-scoped variables (controlMsg, errorMsg) across cases; wrap each case body
in its own block (e.g., case 0x03: { ... break; } and case 0x05: { ... break; })
or otherwise ensure const declarations are scoped locally so controlMsg and
errorMsg cannot be accessed by other cases—keep existing calls to updateStatus,
onConvertedAudio, and onError unchanged and just add braces around the 0x03 and
0x05 case bodies.
- Around line 239-251: The recorder.ondataavailable handler is using the stale
React state variable status (captured at start) so the check status ===
"converting" never becomes true; create a mutable ref (e.g., statusRef) to
mirror status, update statusRef.current whenever status changes, and replace the
closure check with statusRef.current === "converting" (keeping existing
socketRef and recorder.ondataavailable logic) so the callback sees the latest
state and will send audio when converting.
- Around line 106-126: In useVoiceConversion's fetchVoices, the URL assembly is
fragile: parse serverUrl with the URL constructor (if present), map ws->http and
wss->https by checking protocol, then build the voices endpoint from
urlObj.origin + (optional basePath if your server uses a path prefix) or simply
urlObj.origin + '/api/vc/voices' so you don't rely on string replace of
"/api/vc"; fall back to a relative "/api/vc/voices" when serverUrl is falsy;
update variables baseUrl/url and keep setAvailableVoices(voices) and return
semantics unchanged.

In `@moshi/moshi/models/voice_conversion.py`:
- Around line 586-608: The _create_mel_basis method contains an unused import
"import numpy as np" which should be removed; edit the _create_mel_basis
function to delete the numpy import line so that only torch operations remain
(keep variables like n_fft, n_mels, fmin, fmax, mel_low/mel_high, mel_points,
hz_points, bin_points, and the mel_basis loops intact).
- Around line 206-220: Change the mutable default list for the strides parameter
to None in the __init__ of the VoiceConversion class and similarly in
StreamingDecoder.__init__ and StreamVCModel.__init__; inside each constructor,
set self.strides = strides if strides is not None else [2, 4, 5, 6] (or the
class-specific default), then compute self.hop_length = math.prod(self.strides)
(or update any existing hop_length assignment) so no function has a mutable list
as a default argument and behavior remains identical.
- Around line 424-430: The load_speaker_embedding function uses torch.load(path,
map_location=device) which can unpickle arbitrary objects; update the call to
pass weights_only=True (torch.load(path, map_location=device,
weights_only=True)) to restrict loading to tensor data only, keep the existing
dict handling for 'speaker_embedding'/'embedding', and ensure set_target_speaker
is called with the resulting tensor; also add a brief error message or exception
handling around torch.load to surface failures when untrusted or incompatible
files are provided.

In `@moshi/moshi/vc_server.py`:
- Around line 239-247: Wrap the model inference call to self.voice_converter
inside a try/except around the torch.no_grad() block so any exceptions (e.g.,
CUDA OOM or runtime errors) are caught; on exception, log the error (including
the exception text), notify the client about the failure (via whatever
client/error channel this loop uses), optionally torch.cuda.empty_cache() if
OOM, and skip or break processing for that chunk instead of letting the loop
crash; ensure you still handle cleanup such as not calling
opus_writer.append_pcm when converted is unavailable.
🧹 Nitpick comments (7)
client/src/pages/Conversation/hooks/useVoiceConversion.ts (1)

158-265: Consider waiting for WebSocket connection before starting the recorder.

The recorder starts immediately after WebSocket creation (Line 253), but the WebSocket may not be open yet. While the readyState check prevents sending to a closed socket, any audio captured before connection completes is lost. Additionally, if the WebSocket fails to connect, the recorder continues running unnecessarily.

♻️ Suggested improvement

Wrap the WebSocket setup in a promise and start the recorder only after connection:

await new Promise<void>((resolve, reject) => {
  ws.onopen = () => {
    console.log("Voice conversion WebSocket connected");
    resolve();
  };
  ws.onerror = (error) => {
    console.error("Voice conversion WebSocket error:", error);
    onError?.("WebSocket connection error");
    updateStatus("error");
    reject(error);
  };
});

// Now start the recorder after WebSocket is connected
const recorder = new Recorder({ ... });
moshi/moshi/models/voice_conversion.py (2)

508-512: Consider using zip(..., strict=True) to catch length mismatches.

While decoder_blocks and film_layers should always have the same length, using strict=True would catch any future bugs that cause a mismatch.

♻️ Proposed fix
-        for block, film in zip(self.decoder_blocks, self.film_layers):
+        for block, film in zip(self.decoder_blocks, self.film_layers, strict=True):
             x = block(x)
             x = film(x, speaker_emb)

360-371: Unused loop variable i - rename to _.

♻️ Proposed fix
-        for i in range(num_layers):
+        for _ in range(num_layers):
             out_ch = hidden_channels
moshi/moshi/vc_server.py (4)

95-102: Use logger.exception to include stack trace for debugging.

When catching exceptions, logger.exception automatically includes the stack trace, which is helpful for debugging failed embedding loads.

♻️ Proposed fix
             if os.path.exists(path):
                 try:
                     self.voice_converter.load_target_speaker(path)
                     logger.info(f"loaded voice embedding: {path}")
                     return True
                 except Exception as e:
-                    logger.error(f"failed to load voice embedding {path}: {e}")
+                    logger.exception(f"failed to load voice embedding {path}")

         return False

264-296: Lock held for entire WebSocket session limits concurrency to one client.

The async with self.lock at Line 264 wraps the entire WebSocket session, meaning only one client can perform voice conversion at a time. Other clients will block until the session ends.

If this is intentional (e.g., GPU memory constraints), consider adding a comment. Otherwise, narrow the lock scope to just the model inference:

♻️ Suggested improvement

Move the lock inside process_loop to only protect model inference:

async def process_loop():
    # ... existing setup ...
    while all_pcm_data.shape[-1] >= process_size:
        chunk = all_pcm_data[:process_size]
        all_pcm_data = all_pcm_data[process_size:]
        
        chunk_tensor = torch.from_numpy(chunk).float().to(self.device)
        chunk_tensor = chunk_tensor.unsqueeze(0).unsqueeze(0)
        
        async with self.lock:  # Only lock during inference
            with torch.no_grad():
                converted = self.voice_converter(chunk_tensor)
        
        converted_pcm = converted[0, 0].cpu().numpy()
        opus_writer.append_pcm(converted_pcm)

Note: This requires making process_loop aware of the lock, so it may need restructuring.
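The narrowing pattern itself can be shown with a stand-alone asyncio sketch (a trivial stand-in computation replaces the model call; names are illustrative, not the server's):

```python
import asyncio

lock = asyncio.Lock()

async def handle_session(frames, results):
    # Per-session setup and I/O need no lock; only the (stand-in)
    # model inference is serialized across concurrent sessions.
    for frame in frames:
        async with lock:             # narrow scope: one inference at a time
            converted = frame * 2    # stand-in for the model forward pass
        results.append(converted)    # post-processing runs outside the lock

async def main():
    results = []
    await asyncio.gather(
        handle_session([1, 2], results),
        handle_session([10, 20], results),
    )
    return sorted(results)

print(asyncio.run(main()))
```

With the lock scoped this way, two clients can hold WebSocket sessions simultaneously and only contend during inference.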


283-291: Unused variable done - use underscore prefix.

♻️ Proposed fix
-            done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
+            _done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

432-434: Unconventional placement of torch.no_grad() context.

Wrapping the entire main() in torch.no_grad() works but is unconventional. Typically, no_grad() is applied at specific inference points. Since the server handles requests indefinitely, this global scope is fine, but a comment explaining the intent would improve clarity.

Comment on lines +106 to +126
const fetchVoices = useCallback(async (): Promise<string[]> => {
  try {
    const baseUrl = serverUrl?.replace(/^wss?:/, "http") || "";
    const url = baseUrl
      ? `${baseUrl.replace("/api/vc", "")}/api/vc/voices`
      : "/api/vc/voices";

    const response = await fetch(url);
    if (!response.ok) {
      throw new Error(`Failed to fetch voices: ${response.statusText}`);
    }

    const data = await response.json();
    const voices = data.voices || [];
    setAvailableVoices(voices);
    return voices;
  } catch (error) {
    console.error("Failed to fetch voices:", error);
    return [];
  }
}, [serverUrl]);

⚠️ Potential issue | 🟡 Minor

Fragile URL construction may fail for HTTPS or custom paths.

The URL manipulation at Lines 108-111 has edge cases:

  1. The regex ^wss?: won't match https:, so HTTPS serverUrls won't be converted properly
  2. If serverUrl doesn't contain /api/vc, the path construction fails
🔧 Proposed fix
 const fetchVoices = useCallback(async (): Promise<string[]> => {
   try {
-    const baseUrl = serverUrl?.replace(/^wss?:/, "http") || "";
-    const url = baseUrl
-      ? `${baseUrl.replace("/api/vc", "")}/api/vc/voices`
-      : "/api/vc/voices";
+    let url = "/api/vc/voices";
+    if (serverUrl) {
+      const wsUrl = new URL(serverUrl, window.location.href);
+      wsUrl.protocol = wsUrl.protocol.replace(/^ws/, "http");
+      wsUrl.pathname = "/api/vc/voices";
+      url = wsUrl.toString();
+    }

     const response = await fetch(url);

Comment on lines +182 to +206
switch (messageType) {
  case 0x00: // Handshake
    console.log("Voice conversion handshake received");
    updateStatus(referenceMode ? "awaiting_reference" : "converting");
    break;

  case 0x01: // Audio
    onConvertedAudio?.(payload.buffer);
    break;

  case 0x03: // Control
    const controlMsg = new TextDecoder().decode(payload);
    if (controlMsg === "awaiting_reference") {
      updateStatus("awaiting_reference");
    } else if (controlMsg === "reference_ready") {
      updateStatus("converting");
    }
    break;

  case 0x05: // Error
    const errorMsg = new TextDecoder().decode(payload);
    onError?.(errorMsg);
    updateStatus("error");
    break;
}

⚠️ Potential issue | 🟡 Minor

Wrap switch case declarations in blocks to prevent scope leakage.

The controlMsg (Line 193) and errorMsg (Line 202) declarations inside switch cases without blocks can be accessed by other cases, potentially causing temporal dead zone errors.

🔧 Proposed fix
           case 0x03: // Control
-            const controlMsg = new TextDecoder().decode(payload);
-            if (controlMsg === "awaiting_reference") {
-              updateStatus("awaiting_reference");
-            } else if (controlMsg === "reference_ready") {
-              updateStatus("converting");
-            }
+            {
+              const controlMsg = new TextDecoder().decode(payload);
+              if (controlMsg === "awaiting_reference") {
+                updateStatus("awaiting_reference");
+              } else if (controlMsg === "reference_ready") {
+                updateStatus("converting");
+              }
+            }
             break;

           case 0x05: // Error
-            const errorMsg = new TextDecoder().decode(payload);
-            onError?.(errorMsg);
-            updateStatus("error");
+            {
+              const errorMsg = new TextDecoder().decode(payload);
+              onError?.(errorMsg);
+              updateStatus("error");
+            }
             break;
🧰 Tools
🪛 Biome (2.3.13)

[error] 193-193: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Safe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)


[error] 202-202: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Safe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)


Comment on lines +239 to +251
recorder.ondataavailable = (opusData: ArrayBuffer) => {
  if (
    socketRef.current &&
    socketRef.current.readyState === WebSocket.OPEN &&
    status === "converting"
  ) {
    // Send audio with message type byte
    const message = new Uint8Array(opusData.byteLength + 1);
    message[0] = 0x01;
    message.set(new Uint8Array(opusData), 1);
    socketRef.current.send(message.buffer);
  }
};

⚠️ Potential issue | 🔴 Critical

Stale closure bug: status check will always fail.

The ondataavailable callback captures status from the closure when start() is called. At that point, status is "connecting", not "converting". Since the callback isn't updated when React state changes, the condition status === "converting" will never be true, and audio will never be sent.

Use a ref to track the current status instead:

🐛 Proposed fix

Add a ref to track current status:

 const socketRef = useRef<WebSocket | null>(null);
 const recorderRef = useRef<Recorder | null>(null);
 const targetVoiceRef = useRef<string | null>(null);
+const statusRef = useRef<VCStatus>("idle");

 // Update status and notify
 const updateStatus = useCallback(
   (newStatus: VCStatus) => {
     setStatus(newStatus);
+    statusRef.current = newStatus;
     onStatusChange?.(newStatus);
   },
   [onStatusChange]
 );

Then in the callback:

       recorder.ondataavailable = (opusData: ArrayBuffer) => {
         if (
           socketRef.current &&
           socketRef.current.readyState === WebSocket.OPEN &&
-          status === "converting"
+          statusRef.current === "converting"
         ) {

Comment on lines +206 to +220
def __init__(
    self,
    in_channels: int = 1,
    channels: int = 64,
    latent_dim: int = 64,
    num_residual: int = 3,
    strides: List[int] = [2, 4, 5, 6],  # Total: 240x
):
    super().__init__()

    self.in_channels = in_channels
    self.channels = channels
    self.latent_dim = latent_dim
    self.strides = strides
    self.hop_length = math.prod(strides)  # 240

⚠️ Potential issue | 🟡 Minor

Avoid mutable default argument for strides parameter.

Using a mutable list as a default argument can lead to unexpected behavior if modified. Use None and initialize inside the function.

🔧 Proposed fix
 def __init__(
     self,
     in_channels: int = 1,
     channels: int = 64,
     latent_dim: int = 64,
     num_residual: int = 3,
-    strides: List[int] = [2, 4, 5, 6],  # Total: 240x
+    strides: Optional[List[int]] = None,  # Total: 240x
 ):
     super().__init__()
+    if strides is None:
+        strides = [2, 4, 5, 6]

     self.in_channels = in_channels

Also apply the same pattern to StreamingDecoder.__init__ (Line 451) and StreamVCModel.__init__ (Line 542).

🧰 Tools
🪛 Ruff (0.14.14)

212-212: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)
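A stand-alone demonstration of what the B006 warning above guards against — the default list is created once, at def time, and silently shared across calls:

```python
def collect_bad(item, acc=[]):      # default list is created once, at def time
    acc.append(item)
    return acc

def collect_good(item, acc=None):   # the None-default pattern suggested above
    if acc is None:
        acc = []                    # fresh list on every call
    acc.append(item)
    return acc

assert collect_bad(1) == [1]
assert collect_bad(2) == [1, 2]     # surprise: state leaked from the first call
assert collect_good(1) == [1]
assert collect_good(2) == [2]       # no shared state
```

For the strides parameter here the shared list happens to be harmless as long as nobody mutates it, which is why the finding is tagged Minor; the None pattern removes the hazard entirely.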


Comment on lines +424 to +430
def load_speaker_embedding(self, path: str, device: torch.device):
    """Load a pre-computed speaker embedding from file."""
    embedding = torch.load(path, map_location=device)
    if isinstance(embedding, dict):
        embedding = embedding.get('speaker_embedding', embedding.get('embedding'))
    self.set_target_speaker(embedding)
    return embedding

⚠️ Potential issue | 🟠 Major


Security risk: torch.load without weights_only=True can execute arbitrary code from untrusted checkpoint files.

The pickle-based loading at line 426 should specify weights_only=True to restrict unpickling to tensor data only. Voice embeddings may come from external or user-provided sources. While PyTorch 2.6+ defaults to this behavior, older versions allow full pickle deserialization, which is inherently unsafe.

Suggested fix
 def load_speaker_embedding(self, path: str, device: torch.device):
     """Load a pre-computed speaker embedding from file."""
-    embedding = torch.load(path, map_location=device)
+    embedding = torch.load(path, map_location=device, weights_only=True)
     if isinstance(embedding, dict):
         embedding = embedding.get('speaker_embedding', embedding.get('embedding'))
     self.set_target_speaker(embedding)
     return embedding

For maximum safety, also ensure only trusted checkpoints are loaded and PyTorch is kept up-to-date (2.10.0+).
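The error-handling part of this advice can be sketched without torch; `load_embedding` and the `loader` callable are illustrative names, with `loader` standing in for `lambda p: torch.load(p, map_location=device, weights_only=True)`:

```python
def load_embedding(path, loader):
    """Load a speaker embedding, surfacing a clear error for bad files.

    `loader` stands in for torch.load(..., weights_only=True); this is an
    illustrative sketch, not the PR's final code.
    """
    try:
        obj = loader(path)
    except Exception as e:
        raise RuntimeError(f"failed to load speaker embedding from {path}: {e}") from e
    if isinstance(obj, dict):
        # Same key fallback as the function under review.
        obj = obj.get('speaker_embedding', obj.get('embedding'))
    if obj is None:
        raise RuntimeError(f"no speaker embedding found in {path}")
    return obj

# Dict-style and raw-tensor-style checkpoints both resolve:
print(load_embedding("a.pt", lambda p: {"speaker_embedding": [0.1, 0.2]}))  # [0.1, 0.2]
print(load_embedding("b.pt", lambda p: [0.3, 0.4]))                         # [0.3, 0.4]
```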

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def load_speaker_embedding(self, path: str, device: torch.device):
        """Load a pre-computed speaker embedding from file."""
        embedding = torch.load(path, map_location=device, weights_only=True)
        if isinstance(embedding, dict):
            embedding = embedding.get('speaker_embedding', embedding.get('embedding'))
        self.set_target_speaker(embedding)
        return embedding
🤖 Prompt for AI Agents
In `@moshi/moshi/models/voice_conversion.py` around lines 424 - 430, The
load_speaker_embedding function uses torch.load(path, map_location=device) which
can unpickle arbitrary objects; update the call to pass weights_only=True
(torch.load(path, map_location=device, weights_only=True)) to restrict loading
to tensor data only, keep the existing dict handling for
'speaker_embedding'/'embedding', and ensure set_target_speaker is called with
the resulting tensor; also add a brief error message or exception handling
around torch.load to surface failures when untrusted or incompatible files are
provided.

Comment on lines +586 to +608
    def _create_mel_basis(self) -> torch.Tensor:
        """Create mel filterbank."""
        import numpy as np
        n_fft = 1024
        n_mels = 80
        fmin = 0
        fmax = self.sample_rate // 2

        # Create mel filterbank using librosa-style computation
        mel_low = self._hz_to_mel(fmin)
        mel_high = self._hz_to_mel(fmax)
        mel_points = torch.linspace(mel_low, mel_high, n_mels + 2)
        hz_points = self._mel_to_hz(mel_points)
        bin_points = torch.floor((n_fft + 1) * hz_points / self.sample_rate).long()

        mel_basis = torch.zeros(n_mels, n_fft // 2 + 1)
        for i in range(n_mels):
            for j in range(int(bin_points[i]), int(bin_points[i + 1])):
                mel_basis[i, j] = (j - bin_points[i]) / (bin_points[i + 1] - bin_points[i])
            for j in range(int(bin_points[i + 1]), int(bin_points[i + 2])):
                mel_basis[i, j] = (bin_points[i + 2] - j) / (bin_points[i + 2] - bin_points[i + 1])

        return mel_basis

⚠️ Potential issue | 🟡 Minor

Unused numpy import inside method.

Line 588 imports numpy as np, but np is never used in the function; all operations use torch. This is dead code.

🔧 Proposed fix
 def _create_mel_basis(self) -> torch.Tensor:
     """Create mel filterbank."""
-    import numpy as np
     n_fft = 1024
     n_mels = 80
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def _create_mel_basis(self) -> torch.Tensor:
        """Create mel filterbank."""
        n_fft = 1024
        n_mels = 80
        fmin = 0
        fmax = self.sample_rate // 2

        # Create mel filterbank using librosa-style computation
        mel_low = self._hz_to_mel(fmin)
        mel_high = self._hz_to_mel(fmax)
        mel_points = torch.linspace(mel_low, mel_high, n_mels + 2)
        hz_points = self._mel_to_hz(mel_points)
        bin_points = torch.floor((n_fft + 1) * hz_points / self.sample_rate).long()

        mel_basis = torch.zeros(n_mels, n_fft // 2 + 1)
        for i in range(n_mels):
            for j in range(int(bin_points[i]), int(bin_points[i + 1])):
                mel_basis[i, j] = (j - bin_points[i]) / (bin_points[i + 1] - bin_points[i])
            for j in range(int(bin_points[i + 1]), int(bin_points[i + 2])):
                mel_basis[i, j] = (bin_points[i + 2] - j) / (bin_points[i + 2] - bin_points[i + 1])

        return mel_basis
🤖 Prompt for AI Agents
In `@moshi/moshi/models/voice_conversion.py` around lines 586 - 608, The
_create_mel_basis method contains an unused import "import numpy as np" which
should be removed; edit the _create_mel_basis function to delete the numpy
import line so that only torch operations remain (keep variables like n_fft,
n_mels, fmin, fmax, mel_low/mel_high, mel_points, hz_points, bin_points, and the
mel_basis loops intact).
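The filterbank math in the snippet under review can also be sanity-checked in dependency-free Python. The sketch below mirrors the torch loops one-for-one; HTK-style mel formulas and a 24 kHz sample rate are assumptions here, since `_hz_to_mel`/`_mel_to_hz` and `self.sample_rate` are defined elsewhere in the file:

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; the PR's _hz_to_mel may differ)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def create_mel_basis(sample_rate=24000, n_fft=1024, n_mels=80):
    # Mirrors the torch loops from the snippet under review, one-for-one.
    fmin, fmax = 0, sample_rate // 2
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    mel_points = [lo + i * (hi - lo) / (n_mels + 1) for i in range(n_mels + 2)]
    bin_points = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    basis = [[0.0] * (n_fft // 2 + 1) for _ in range(n_mels)]
    for i in range(n_mels):
        a, b, c = bin_points[i], bin_points[i + 1], bin_points[i + 2]
        for j in range(a, b):  # rising edge of triangle i
            basis[i][j] = (j - a) / (b - a)
        for j in range(b, c):  # falling edge
            basis[i][j] = (c - j) / (c - b)
    return basis

basis = create_mel_basis()
print(len(basis), len(basis[0]))  # 80 513
```

Note the loops never divide by zero even when adjacent bin indices collide, because an empty `range` skips the body; that property carries over from the original torch version.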

Comment thread moshi/moshi/vc_server.py
Comment on lines +239 to +247
                    # Voice conversion
                    with torch.no_grad():
                        converted = self.voice_converter(chunk_tensor)

                    # Convert back to numpy
                    converted_pcm = converted[0, 0].cpu().numpy()

                    # Send to opus writer
                    opus_writer.append_pcm(converted_pcm)

⚠️ Potential issue | 🟡 Minor

Add error handling for model inference to gracefully report failures.

If self.voice_converter(chunk_tensor) raises an exception (e.g., CUDA OOM, numerical errors), the process loop crashes without notifying the client. Consider wrapping the inference in try/except and sending an error message.
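The guarded shape can be sketched without torch; `convert_chunk` and `flaky` below are illustrative stand-ins for the inference call, returning an error string instead of raising so the loop can report the failure and decide whether to skip the chunk or stop:

```python
def convert_chunk(model, chunk):
    """Run one inference step, returning (pcm, error) instead of raising.

    `model`/`chunk` are illustrative stand-ins for
    self.voice_converter(chunk_tensor); in the real loop the caller would
    log the error, send an error frame to the client, and skip
    opus_writer.append_pcm when pcm is None.
    """
    try:
        pcm = model(chunk)
    except MemoryError:
        # for CUDA OOM you would also call torch.cuda.empty_cache() here
        return None, "out of memory"
    except Exception as e:
        return None, f"voice conversion failed: {e}"
    return pcm, None

def flaky(chunk):
    # Stand-in model that fails on one input.
    if chunk == "bad":
        raise RuntimeError("numerical error")
    return chunk.upper()

print(convert_chunk(flaky, "ok"))   # ('OK', None)
print(convert_chunk(flaky, "bad"))  # (None, 'voice conversion failed: numerical error')
```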

🔧 Proposed fix
                     # Voice conversion
-                    with torch.no_grad():
-                        converted = self.voice_converter(chunk_tensor)
-
-                    # Convert back to numpy
-                    converted_pcm = converted[0, 0].cpu().numpy()
-
-                    # Send to opus writer
-                    opus_writer.append_pcm(converted_pcm)
+                    try:
+                        with torch.no_grad():
+                            converted = self.voice_converter(chunk_tensor)
+                        converted_pcm = converted[0, 0].cpu().numpy()
+                        opus_writer.append_pcm(converted_pcm)
+                    except Exception as e:
+                        clog.log("error", f"voice conversion failed: {e}")
+                        await ws.send_bytes(b"\x05" + str(e).encode("utf-8"))
+                        return
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
                    # Voice conversion
                    try:
                        with torch.no_grad():
                            converted = self.voice_converter(chunk_tensor)
                        converted_pcm = converted[0, 0].cpu().numpy()
                        opus_writer.append_pcm(converted_pcm)
                    except Exception as e:
                        clog.log("error", f"voice conversion failed: {e}")
                        await ws.send_bytes(b"\x05" + str(e).encode("utf-8"))
                        return
🤖 Prompt for AI Agents
In `@moshi/moshi/vc_server.py` around lines 239 - 247, Wrap the model inference
call to self.voice_converter inside a try/except around the torch.no_grad()
block so any exceptions (e.g., CUDA OOM or runtime errors) are caught; on
exception, log the error (including the exception text), notify the client about
the failure (via whatever client/error channel this loop uses), optionally
torch.cuda.empty_cache() if OOM, and skip or break processing for that chunk
instead of letting the loop crash; ensure you still handle cleanup such as not
calling opus_writer.append_pcm when converted is unavailable.
