Claude/voice conversion realtime skbru #1

Open
punitst wants to merge 2 commits into main from claude/voice-conversion-realtime-Skbru

Conversation


@punitst punitst commented Jan 29, 2026

Summary by CodeRabbit

  • New Features
    • Added real-time voice conversion capability with WebSocket streaming support
    • Introduced voice selection and customization via reference audio
    • Added voice management endpoints to list and configure available voices
    • Enabled live audio conversion in configurable frame sizes with status tracking


Implements a complete voice conversion system based on Google's StreamVC
architecture for low-latency (~70-100ms) voice-to-voice conversion.

Components:
- CausalContentEncoder: Extracts speaker-independent linguistic features
- F0Extractor: Causal pitch extraction with whitening to prevent identity leakage
- SpeakerEncoder: d-vector style speaker embedding extraction
- StreamingDecoder: FiLM-conditioned decoder for target voice synthesis
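The F0 whitening mentioned above can be illustrated with a minimal sketch (not the repository's code; it assumes unvoiced frames are marked with 0 Hz):

```python
import numpy as np

def whiten_f0(f0: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize voiced pitch values to zero mean / unit variance.

    Absolute pitch is a strong speaker cue; whitening keeps only the
    relative contour, so the source speaker's identity cannot leak
    into the decoder through F0.
    """
    voiced = f0 > 0  # assumption: 0 Hz marks unvoiced frames
    out = np.zeros_like(f0)
    mean = f0[voiced].mean()
    std = f0[voiced].std() + eps
    out[voiced] = (f0[voiced] - mean) / std
    return out

f0 = np.array([0.0, 200.0, 220.0, 180.0, 0.0])  # Hz, unvoiced at the edges
whitened = whiten_f0(f0)
```

A truly causal implementation would use running statistics rather than whole-utterance mean and variance; the batch version above only shows the idea.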

The architecture uses causal convolutions throughout to enable true streaming
inference, compatible with the existing PersonaPlex WebSocket infrastructure.
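Causal here means each output sample depends only on current and past inputs, which is achieved by left-padding the input instead of padding symmetrically. A minimal pure-Python sketch of the idea (illustrative, not the model code):

```python
def causal_conv1d(x, kernel):
    """1-D convolution where y[t] depends only on x[0..t].

    Left-padding with kernel_size - 1 zeros (instead of symmetric
    padding) means the layer adds no lookahead, so it can run on a
    live stream with zero algorithmic latency.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    # y[t] = sum_j kernel[j] * x[t - j]
    return [sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
            for t in range(len(x))]

y = causal_conv1d([1.0, 2.0, 3.0], [1.0, 1.0])  # y[t] = x[t] + x[t-1]
```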

https://claude.ai/code/session_01RMegy6932bUxBLMShQnkm5
Server (vc_server.py):
- WebSocket endpoint for real-time voice conversion at /api/vc
- HTTP endpoints for voice management (/api/vc/voices, /api/vc/set_voice)
- Reference mode for voice cloning from user audio
- Pre-computed voice embedding support
- Uses same protocol as main PersonaPlex server for compatibility
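The wire format implied by the hook and server code quoted in the review below is a one-byte message type followed by the payload (0x00 handshake, 0x01 audio, 0x03 control, 0x05 error). A minimal framing sketch, assuming that tag layout:

```python
# Message-type tags as used by the client hook (assumed from the review diffs)
MSG_HANDSHAKE, MSG_AUDIO, MSG_CONTROL, MSG_ERROR = 0x00, 0x01, 0x03, 0x05

def pack_frame(msg_type: int, payload: bytes) -> bytes:
    """Prefix the payload with its one-byte message type."""
    return bytes([msg_type]) + payload

def unpack_frame(frame: bytes) -> tuple[int, bytes]:
    """Split a binary WebSocket frame back into (type, payload)."""
    return frame[0], frame[1:]

frame = pack_frame(MSG_CONTROL, b"reference_ready")
msg_type, payload = unpack_frame(frame)
```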

Client (useVoiceConversion.ts):
- React hook for voice conversion with status tracking
- Microphone capture with Opus encoding
- Reference audio collection for voice cloning
- Available voices fetching from server


coderabbitai Bot commented Jan 29, 2026

📝 Walkthrough

Introduces a complete real-time voice conversion system comprising a React client hook with WebSocket integration, a PyTorch streaming voice conversion model with content/F0/speaker encoding, and an aiohttp-based server supporting WebSocket streaming and HTTP voice management endpoints.

Changes

  • Client Voice Conversion Hook — client/src/pages/Conversation/hooks/useVoiceConversion.ts
    New React hook exposing WebSocket-based voice conversion with lifecycle management, microphone recording via opus-recorder, configurable callbacks, and methods to start/stop conversion, set the target voice, handle reference audio, and fetch available voices.
  • Voice Conversion Model — moshi/moshi/models/voice_conversion.py
    End-to-end streaming voice conversion model with ContentEncoder (causal downsampling), F0Extractor (pitch extraction with statistics), SpeakerEncoder (speaker embedding), FiLMLayer (speaker conditioning), StreamingDecoder (synthesis with upsampling), and StreamVCModel integrating all components with speaker management and streaming conversion methods.
  • Voice Conversion Server — moshi/moshi/vc_server.py
    WebSocket/HTTP server for real-time voice conversion featuring VCServerState for model management, handshake and streaming protocol handling, reference audio collection for target-voice derivation, Opus encoding for efficient frame transmission, HTTP endpoints for voice listing and setting, and device auto-selection with warmup initialization.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client (React Hook)
    participant Mic as Microphone/Opus Recorder
    participant WebSocket as WebSocket Server
    participant Model as Voice Conversion Model
    participant SpeakerEnc as Speaker Encoder

    Client->>WebSocket: Connect WebSocket
    WebSocket-->>Client: Handshake OK
    
    Client->>Mic: Start recording
    Mic->>Client: Audio frames (20ms)
    
    alt Reference Mode
        Client->>WebSocket: Send reference audio chunk
        WebSocket->>SpeakerEnc: Compute mel from reference
        SpeakerEnc->>Model: Extract speaker embedding
        Model->>WebSocket: Reference processed
        WebSocket-->>Client: Status: reference_ready
    end
    
    Client->>Mic: Continue recording
    Mic->>Client: Audio frame
    Client->>WebSocket: Send audio data
    
    WebSocket->>Model: Process audio chunk
    Model->>Model: Content extraction
    Model->>Model: F0 extraction + speaker conditioning
    Model->>Model: Streaming decode (synthesis)
    Model-->>WebSocket: Converted audio
    
    WebSocket->>WebSocket: Encode with Opus
    WebSocket-->>Client: Send converted frame
    Client->>Client: onConvertedAudio callback
    
    Client->>Mic: Stop recording
    Client->>WebSocket: Close connection
    WebSocket-->>Client: Disconnected
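The "configurable frame sizes" step in the flow above amounts to a rolling PCM buffer that emits fixed-size frames as soon as enough samples have accumulated. A stand-alone sketch (the frame_chunks helper is hypothetical, not the server's code):

```python
def frame_chunks(buffer, incoming, frame_size):
    """Append incoming samples to the buffer and return (frames, leftover).

    Each returned frame has exactly frame_size samples; anything shorter
    stays in the buffer until the next call.
    """
    buffer = buffer + incoming
    frames = []
    while len(buffer) >= frame_size:
        frames.append(buffer[:frame_size])
        buffer = buffer[frame_size:]
    return frames, buffer

frames, rest = frame_chunks([], list(range(7)), frame_size=3)
# frames == [[0, 1, 2], [3, 4, 5]], rest == [6]
```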

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰✨ A hop through streaming voices bright,
With models trained to get it right,
WebSockets dance and embeddings sing,
Real-time conversion on the wing!
~🥕

🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive)
❌ Failed checks
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 53.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title Check — ❓ Inconclusive: the title 'Claude/voice conversion realtime skbru' is only partially related to the changeset; it is vague and reuses the branch name rather than describing the main change. Resolution: revise the title to clearly describe the feature, e.g. 'Add real-time voice conversion hook and streaming server'.
✅ Passed checks
  • Description Check — ✅ Passed: skipped because CodeRabbit's high-level summary is enabled.




@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62981a418f


Comment on lines +239 to +244
recorder.ondataavailable = (opusData: ArrayBuffer) => {
  if (
    socketRef.current &&
    socketRef.current.readyState === WebSocket.OPEN &&
    status === "converting"
  ) {

P1: Avoid stale status gating that blocks audio send

The recorder.ondataavailable callback closes over the status value captured when start() was invoked. After the server handshake updates state to "converting", this callback still sees the old value (e.g., "connecting"), so the status === "converting" guard stays false and no audio is ever sent. This means the voice conversion stream never receives microphone data unless start() is called again. Consider storing status in a ref or removing the status gate so the callback sees the latest state.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@client/src/pages/Conversation/hooks/useVoiceConversion.ts`:
- Around line 182-206: The switch cases in useVoiceConversion.ts leak
block-scoped variables (controlMsg, errorMsg) across cases; wrap each case body
in its own block (e.g., case 0x03: { ... break; } and case 0x05: { ... break; })
or otherwise ensure const declarations are scoped locally so controlMsg and
errorMsg cannot be accessed by other cases—keep existing calls to updateStatus,
onConvertedAudio, and onError unchanged and just add braces around the 0x03 and
0x05 case bodies.
- Around line 239-251: The recorder.ondataavailable handler is using the stale
React state variable status (captured at start) so the check status ===
"converting" never becomes true; create a mutable ref (e.g., statusRef) to
mirror status, update statusRef.current whenever status changes, and replace the
closure check with statusRef.current === "converting" (keeping existing
socketRef and recorder.ondataavailable logic) so the callback sees the latest
state and will send audio when converting.
- Around line 106-126: In useVoiceConversion's fetchVoices, the URL assembly is
fragile: parse serverUrl with the URL constructor (if present), map ws->http and
wss->https by checking protocol, then build the voices endpoint from
urlObj.origin + (optional basePath if your server uses a path prefix) or simply
urlObj.origin + '/api/vc/voices' so you don't rely on string replace of
"/api/vc"; fall back to a relative "/api/vc/voices" when serverUrl is falsy;
update variables baseUrl/url and keep setAvailableVoices(voices) and return
semantics unchanged.

In `@moshi/moshi/models/voice_conversion.py`:
- Around line 586-608: The _create_mel_basis method contains an unused import
"import numpy as np" which should be removed; edit the _create_mel_basis
function to delete the numpy import line so that only torch operations remain
(keep variables like n_fft, n_mels, fmin, fmax, mel_low/mel_high, mel_points,
hz_points, bin_points, and the mel_basis loops intact).
- Around line 206-220: Change the mutable default list for the strides parameter
to None in the __init__ of the VoiceConversion class and similarly in
StreamingDecoder.__init__ and StreamVCModel.__init__; inside each constructor,
set self.strides = strides if strides is not None else [2, 4, 5, 6] (or the
class-specific default), then compute self.hop_length = math.prod(self.strides)
(or update any existing hop_length assignment) so no function has a mutable list
as a default argument and behavior remains identical.
- Around line 424-430: The load_speaker_embedding function uses torch.load(path,
map_location=device) which can unpickle arbitrary objects; update the call to
pass weights_only=True (torch.load(path, map_location=device,
weights_only=True)) to restrict loading to tensor data only, keep the existing
dict handling for 'speaker_embedding'/'embedding', and ensure set_target_speaker
is called with the resulting tensor; also add a brief error message or exception
handling around torch.load to surface failures when untrusted or incompatible
files are provided.

In `@moshi/moshi/vc_server.py`:
- Around line 239-247: Wrap the model inference call to self.voice_converter
inside a try/except around the torch.no_grad() block so any exceptions (e.g.,
CUDA OOM or runtime errors) are caught; on exception, log the error (including
the exception text), notify the client about the failure (via whatever
client/error channel this loop uses), optionally torch.cuda.empty_cache() if
OOM, and skip or break processing for that chunk instead of letting the loop
crash; ensure you still handle cleanup such as not calling
opus_writer.append_pcm when converted is unavailable.
🧹 Nitpick comments (7)
client/src/pages/Conversation/hooks/useVoiceConversion.ts (1)

158-265: Consider waiting for WebSocket connection before starting the recorder.

The recorder starts immediately after WebSocket creation (Line 253), but the WebSocket may not be open yet. While the readyState check prevents sending to a closed socket, any audio captured before connection completes is lost. Additionally, if the WebSocket fails to connect, the recorder continues running unnecessarily.

♻️ Suggested improvement

Wrap the WebSocket setup in a promise and start the recorder only after connection:

await new Promise<void>((resolve, reject) => {
  ws.onopen = () => {
    console.log("Voice conversion WebSocket connected");
    resolve();
  };
  ws.onerror = (error) => {
    console.error("Voice conversion WebSocket error:", error);
    onError?.("WebSocket connection error");
    updateStatus("error");
    reject(error);
  };
});

// Now start the recorder after WebSocket is connected
const recorder = new Recorder({ ... });
moshi/moshi/models/voice_conversion.py (2)

508-512: Consider using zip(..., strict=True) to catch length mismatches.

While decoder_blocks and film_layers should always have the same length, using strict=True would catch any future bugs that cause a mismatch.

♻️ Proposed fix
-        for block, film in zip(self.decoder_blocks, self.film_layers):
+        for block, film in zip(self.decoder_blocks, self.film_layers, strict=True):
             x = block(x)
             x = film(x, speaker_emb)

360-371: Unused loop variable i - rename to _.

♻️ Proposed fix
-        for i in range(num_layers):
+        for _ in range(num_layers):
             out_ch = hidden_channels
moshi/moshi/vc_server.py (4)

95-102: Use logger.exception to include stack trace for debugging.

When catching exceptions, logger.exception automatically includes the stack trace, which is helpful for debugging failed embedding loads.

♻️ Proposed fix
             if os.path.exists(path):
                 try:
                     self.voice_converter.load_target_speaker(path)
                     logger.info(f"loaded voice embedding: {path}")
                     return True
                 except Exception as e:
-                    logger.error(f"failed to load voice embedding {path}: {e}")
+                    logger.exception(f"failed to load voice embedding {path}")

         return False

264-296: Lock held for entire WebSocket session limits concurrency to one client.

The async with self.lock at Line 264 wraps the entire WebSocket session, meaning only one client can perform voice conversion at a time. Other clients will block until the session ends.

If this is intentional (e.g., GPU memory constraints), consider adding a comment. Otherwise, narrow the lock scope to just the model inference:

♻️ Suggested improvement

Move the lock inside process_loop to only protect model inference:

async def process_loop():
    # ... existing setup ...
    while all_pcm_data.shape[-1] >= process_size:
        chunk = all_pcm_data[:process_size]
        all_pcm_data = all_pcm_data[process_size:]
        
        chunk_tensor = torch.from_numpy(chunk).float().to(self.device)
        chunk_tensor = chunk_tensor.unsqueeze(0).unsqueeze(0)
        
        async with self.lock:  # Only lock during inference
            with torch.no_grad():
                converted = self.voice_converter(chunk_tensor)
        
        converted_pcm = converted[0, 0].cpu().numpy()
        opus_writer.append_pcm(converted_pcm)

Note: This requires making process_loop aware of the lock, so it may need restructuring.
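The narrowing pattern itself can be shown with a stand-alone asyncio sketch (a trivial stand-in computation replaces the model call; names are illustrative, not the server's):

```python
import asyncio

lock = asyncio.Lock()

async def handle_session(frames, results):
    # Per-session setup and I/O need no lock; only the (stand-in)
    # model inference is serialized across concurrent sessions.
    for frame in frames:
        async with lock:             # narrow scope: one inference at a time
            converted = frame * 2    # stand-in for the model forward pass
        results.append(converted)    # post-processing runs outside the lock

async def main():
    results = []
    await asyncio.gather(
        handle_session([1, 2], results),
        handle_session([10, 20], results),
    )
    return sorted(results)

print(asyncio.run(main()))
```

With the lock scoped this way, two clients can hold WebSocket sessions simultaneously and only contend during inference.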


283-291: Unused variable done - use underscore prefix.

♻️ Proposed fix
-            done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
+            _done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

432-434: Unconventional placement of torch.no_grad() context.

Wrapping the entire main() in torch.no_grad() works but is unconventional. Typically, no_grad() is applied at specific inference points. Since the server handles requests indefinitely, this global scope is fine, but a comment explaining the intent would improve clarity.

Comment on lines +106 to +126
const fetchVoices = useCallback(async (): Promise<string[]> => {
  try {
    const baseUrl = serverUrl?.replace(/^wss?:/, "http") || "";
    const url = baseUrl
      ? `${baseUrl.replace("/api/vc", "")}/api/vc/voices`
      : "/api/vc/voices";

    const response = await fetch(url);
    if (!response.ok) {
      throw new Error(`Failed to fetch voices: ${response.statusText}`);
    }

    const data = await response.json();
    const voices = data.voices || [];
    setAvailableVoices(voices);
    return voices;
  } catch (error) {
    console.error("Failed to fetch voices:", error);
    return [];
  }
}, [serverUrl]);

⚠️ Potential issue | 🟡 Minor

Fragile URL construction may fail for HTTPS or custom paths.

The URL manipulation at Lines 108-111 has edge cases:

  1. The regex ^wss?: won't match https:, so HTTPS serverUrls won't be converted properly
  2. If serverUrl doesn't contain /api/vc, the path construction fails
🔧 Proposed fix
 const fetchVoices = useCallback(async (): Promise<string[]> => {
   try {
-    const baseUrl = serverUrl?.replace(/^wss?:/, "http") || "";
-    const url = baseUrl
-      ? `${baseUrl.replace("/api/vc", "")}/api/vc/voices`
-      : "/api/vc/voices";
+    let url = "/api/vc/voices";
+    if (serverUrl) {
+      const wsUrl = new URL(serverUrl, window.location.href);
+      wsUrl.protocol = wsUrl.protocol.replace(/^ws/, "http");
+      wsUrl.pathname = "/api/vc/voices";
+      url = wsUrl.toString();
+    }

     const response = await fetch(url);

Comment on lines +182 to +206
switch (messageType) {
  case 0x00: // Handshake
    console.log("Voice conversion handshake received");
    updateStatus(referenceMode ? "awaiting_reference" : "converting");
    break;

  case 0x01: // Audio
    onConvertedAudio?.(payload.buffer);
    break;

  case 0x03: // Control
    const controlMsg = new TextDecoder().decode(payload);
    if (controlMsg === "awaiting_reference") {
      updateStatus("awaiting_reference");
    } else if (controlMsg === "reference_ready") {
      updateStatus("converting");
    }
    break;

  case 0x05: // Error
    const errorMsg = new TextDecoder().decode(payload);
    onError?.(errorMsg);
    updateStatus("error");
    break;
}

⚠️ Potential issue | 🟡 Minor

Wrap switch case declarations in blocks to prevent scope leakage.

The controlMsg (Line 193) and errorMsg (Line 202) declarations inside switch cases without blocks can be accessed by other cases, potentially causing temporal dead zone errors.

🔧 Proposed fix
           case 0x03: // Control
-            const controlMsg = new TextDecoder().decode(payload);
-            if (controlMsg === "awaiting_reference") {
-              updateStatus("awaiting_reference");
-            } else if (controlMsg === "reference_ready") {
-              updateStatus("converting");
-            }
+            {
+              const controlMsg = new TextDecoder().decode(payload);
+              if (controlMsg === "awaiting_reference") {
+                updateStatus("awaiting_reference");
+              } else if (controlMsg === "reference_ready") {
+                updateStatus("converting");
+              }
+            }
             break;

           case 0x05: // Error
-            const errorMsg = new TextDecoder().decode(payload);
-            onError?.(errorMsg);
-            updateStatus("error");
+            {
+              const errorMsg = new TextDecoder().decode(payload);
+              onError?.(errorMsg);
+              updateStatus("error");
+            }
             break;
🧰 Tools
🪛 Biome (2.3.13)

[error] 193-193: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Safe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)


[error] 202-202: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause.

The declaration is defined in this switch clause:

Safe fix: Wrap the declaration in a block.

(lint/correctness/noSwitchDeclarations)


Comment on lines +239 to +251
recorder.ondataavailable = (opusData: ArrayBuffer) => {
  if (
    socketRef.current &&
    socketRef.current.readyState === WebSocket.OPEN &&
    status === "converting"
  ) {
    // Send audio with message type byte
    const message = new Uint8Array(opusData.byteLength + 1);
    message[0] = 0x01;
    message.set(new Uint8Array(opusData), 1);
    socketRef.current.send(message.buffer);
  }
};

⚠️ Potential issue | 🔴 Critical

Stale closure bug: status check will always fail.

The ondataavailable callback captures status from the closure when start() is called. At that point, status is "connecting", not "converting". Since the callback isn't updated when React state changes, the condition status === "converting" will never be true, and audio will never be sent.

Use a ref to track the current status instead:

🐛 Proposed fix

Add a ref to track current status:

 const socketRef = useRef<WebSocket | null>(null);
 const recorderRef = useRef<Recorder | null>(null);
 const targetVoiceRef = useRef<string | null>(null);
+const statusRef = useRef<VCStatus>("idle");

 // Update status and notify
 const updateStatus = useCallback(
   (newStatus: VCStatus) => {
     setStatus(newStatus);
+    statusRef.current = newStatus;
     onStatusChange?.(newStatus);
   },
   [onStatusChange]
 );

Then in the callback:

       recorder.ondataavailable = (opusData: ArrayBuffer) => {
         if (
           socketRef.current &&
           socketRef.current.readyState === WebSocket.OPEN &&
-          status === "converting"
+          statusRef.current === "converting"
         ) {

Comment on lines +206 to +220
def __init__(
    self,
    in_channels: int = 1,
    channels: int = 64,
    latent_dim: int = 64,
    num_residual: int = 3,
    strides: List[int] = [2, 4, 5, 6],  # Total: 240x
):
    super().__init__()

    self.in_channels = in_channels
    self.channels = channels
    self.latent_dim = latent_dim
    self.strides = strides
    self.hop_length = math.prod(strides)  # 240

⚠️ Potential issue | 🟡 Minor

Avoid mutable default argument for strides parameter.

Using a mutable list as a default argument can lead to unexpected behavior if modified. Use None and initialize inside the function.

🔧 Proposed fix
 def __init__(
     self,
     in_channels: int = 1,
     channels: int = 64,
     latent_dim: int = 64,
     num_residual: int = 3,
-    strides: List[int] = [2, 4, 5, 6],  # Total: 240x
+    strides: Optional[List[int]] = None,  # Total: 240x
 ):
     super().__init__()
+    if strides is None:
+        strides = [2, 4, 5, 6]

     self.in_channels = in_channels

Also apply the same pattern to StreamingDecoder.__init__ (Line 451) and StreamVCModel.__init__ (Line 542).

🧰 Tools
🪛 Ruff (0.14.14)

212-212: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)
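A stand-alone demonstration of what the B006 warning above guards against — the default list is created once, at def time, and silently shared across calls:

```python
def collect_bad(item, acc=[]):      # default list is created once, at def time
    acc.append(item)
    return acc

def collect_good(item, acc=None):   # the None-default pattern suggested above
    if acc is None:
        acc = []                    # fresh list on every call
    acc.append(item)
    return acc

assert collect_bad(1) == [1]
assert collect_bad(2) == [1, 2]     # surprise: state leaked from the first call
assert collect_good(1) == [1]
assert collect_good(2) == [2]       # no shared state
```

For the strides parameter here the shared list happens to be harmless as long as nobody mutates it, which is why the finding is tagged Minor; the None pattern removes the hazard entirely.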


Comment on lines +424 to +430
def load_speaker_embedding(self, path: str, device: torch.device):
    """Load a pre-computed speaker embedding from file."""
    embedding = torch.load(path, map_location=device)
    if isinstance(embedding, dict):
        embedding = embedding.get('speaker_embedding', embedding.get('embedding'))
    self.set_target_speaker(embedding)
    return embedding

⚠️ Potential issue | 🟠 Major


Security risk: torch.load without weights_only=True can execute arbitrary code from untrusted checkpoint files.

The pickle-based loading at line 426 should specify weights_only=True to restrict unpickling to tensor data only. Voice embeddings may come from external or user-provided sources. While PyTorch 2.6+ defaults to this behavior, older versions allow full pickle deserialization, which is inherently unsafe.

Suggested fix
 def load_speaker_embedding(self, path: str, device: torch.device):
     """Load a pre-computed speaker embedding from file."""
-    embedding = torch.load(path, map_location=device)
+    embedding = torch.load(path, map_location=device, weights_only=True)
     if isinstance(embedding, dict):
         embedding = embedding.get('speaker_embedding', embedding.get('embedding'))
     self.set_target_speaker(embedding)
     return embedding

For maximum safety, also ensure only trusted checkpoints are loaded and PyTorch is kept up-to-date (2.10.0+).
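The error-handling part of this advice can be sketched without torch; `load_embedding` and the `loader` callable are illustrative names, with `loader` standing in for `lambda p: torch.load(p, map_location=device, weights_only=True)`:

```python
def load_embedding(path, loader):
    """Load a speaker embedding, surfacing a clear error for bad files.

    `loader` stands in for torch.load(..., weights_only=True); this is an
    illustrative sketch, not the PR's final code.
    """
    try:
        obj = loader(path)
    except Exception as e:
        raise RuntimeError(f"failed to load speaker embedding from {path}: {e}") from e
    if isinstance(obj, dict):
        # Same key fallback as the function under review.
        obj = obj.get('speaker_embedding', obj.get('embedding'))
    if obj is None:
        raise RuntimeError(f"no speaker embedding found in {path}")
    return obj

# Dict-style and raw-tensor-style checkpoints both resolve:
print(load_embedding("a.pt", lambda p: {"speaker_embedding": [0.1, 0.2]}))  # [0.1, 0.2]
print(load_embedding("b.pt", lambda p: [0.3, 0.4]))                         # [0.3, 0.4]
```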

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def load_speaker_embedding(self, path: str, device: torch.device):
        """Load a pre-computed speaker embedding from file."""
        embedding = torch.load(path, map_location=device, weights_only=True)
        if isinstance(embedding, dict):
            embedding = embedding.get('speaker_embedding', embedding.get('embedding'))
        self.set_target_speaker(embedding)
        return embedding
🤖 Prompt for AI Agents
In `@moshi/moshi/models/voice_conversion.py` around lines 424 - 430, The
load_speaker_embedding function uses torch.load(path, map_location=device) which
can unpickle arbitrary objects; update the call to pass weights_only=True
(torch.load(path, map_location=device, weights_only=True)) to restrict loading
to tensor data only, keep the existing dict handling for
'speaker_embedding'/'embedding', and ensure set_target_speaker is called with
the resulting tensor; also add a brief error message or exception handling
around torch.load to surface failures when untrusted or incompatible files are
provided.

Comment on lines +586 to +608
    def _create_mel_basis(self) -> torch.Tensor:
        """Create mel filterbank."""
        import numpy as np
        n_fft = 1024
        n_mels = 80
        fmin = 0
        fmax = self.sample_rate // 2

        # Create mel filterbank using librosa-style computation
        mel_low = self._hz_to_mel(fmin)
        mel_high = self._hz_to_mel(fmax)
        mel_points = torch.linspace(mel_low, mel_high, n_mels + 2)
        hz_points = self._mel_to_hz(mel_points)
        bin_points = torch.floor((n_fft + 1) * hz_points / self.sample_rate).long()

        mel_basis = torch.zeros(n_mels, n_fft // 2 + 1)
        for i in range(n_mels):
            for j in range(int(bin_points[i]), int(bin_points[i + 1])):
                mel_basis[i, j] = (j - bin_points[i]) / (bin_points[i + 1] - bin_points[i])
            for j in range(int(bin_points[i + 1]), int(bin_points[i + 2])):
                mel_basis[i, j] = (bin_points[i + 2] - j) / (bin_points[i + 2] - bin_points[i + 1])

        return mel_basis

⚠️ Potential issue | 🟡 Minor

Unused numpy import inside method.

Line 588 imports numpy as np, but np is never used in the function; all operations use torch. This is dead code.

🔧 Proposed fix
 def _create_mel_basis(self) -> torch.Tensor:
     """Create mel filterbank."""
-    import numpy as np
     n_fft = 1024
     n_mels = 80
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def _create_mel_basis(self) -> torch.Tensor:
        """Create mel filterbank."""
        n_fft = 1024
        n_mels = 80
        fmin = 0
        fmax = self.sample_rate // 2

        # Create mel filterbank using librosa-style computation
        mel_low = self._hz_to_mel(fmin)
        mel_high = self._hz_to_mel(fmax)
        mel_points = torch.linspace(mel_low, mel_high, n_mels + 2)
        hz_points = self._mel_to_hz(mel_points)
        bin_points = torch.floor((n_fft + 1) * hz_points / self.sample_rate).long()

        mel_basis = torch.zeros(n_mels, n_fft // 2 + 1)
        for i in range(n_mels):
            for j in range(int(bin_points[i]), int(bin_points[i + 1])):
                mel_basis[i, j] = (j - bin_points[i]) / (bin_points[i + 1] - bin_points[i])
            for j in range(int(bin_points[i + 1]), int(bin_points[i + 2])):
                mel_basis[i, j] = (bin_points[i + 2] - j) / (bin_points[i + 2] - bin_points[i + 1])

        return mel_basis
🤖 Prompt for AI Agents
In `@moshi/moshi/models/voice_conversion.py` around lines 586 - 608, The
_create_mel_basis method contains an unused import "import numpy as np" which
should be removed; edit the _create_mel_basis function to delete the numpy
import line so that only torch operations remain (keep variables like n_fft,
n_mels, fmin, fmax, mel_low/mel_high, mel_points, hz_points, bin_points, and the
mel_basis loops intact).
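The filterbank math in the snippet under review can also be sanity-checked in dependency-free Python. The sketch below mirrors the torch loops one-for-one; HTK-style mel formulas and a 24 kHz sample rate are assumptions here, since `_hz_to_mel`/`_mel_to_hz` and `self.sample_rate` are defined elsewhere in the file:

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; the PR's _hz_to_mel may differ)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def create_mel_basis(sample_rate=24000, n_fft=1024, n_mels=80):
    # Mirrors the torch loops from the snippet under review, one-for-one.
    fmin, fmax = 0, sample_rate // 2
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    mel_points = [lo + i * (hi - lo) / (n_mels + 1) for i in range(n_mels + 2)]
    bin_points = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    basis = [[0.0] * (n_fft // 2 + 1) for _ in range(n_mels)]
    for i in range(n_mels):
        a, b, c = bin_points[i], bin_points[i + 1], bin_points[i + 2]
        for j in range(a, b):  # rising edge of triangle i
            basis[i][j] = (j - a) / (b - a)
        for j in range(b, c):  # falling edge
            basis[i][j] = (c - j) / (c - b)
    return basis

basis = create_mel_basis()
print(len(basis), len(basis[0]))  # 80 513
```

Note the loops never divide by zero even when adjacent bin indices collide, because an empty `range` skips the body; that property carries over from the original torch version.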

Comment thread moshi/moshi/vc_server.py
Comment on lines +239 to +247
                    # Voice conversion
                    with torch.no_grad():
                        converted = self.voice_converter(chunk_tensor)

                    # Convert back to numpy
                    converted_pcm = converted[0, 0].cpu().numpy()

                    # Send to opus writer
                    opus_writer.append_pcm(converted_pcm)

⚠️ Potential issue | 🟡 Minor

Add error handling for model inference to gracefully report failures.

If self.voice_converter(chunk_tensor) raises an exception (e.g., CUDA OOM, numerical errors), the process loop crashes without notifying the client. Consider wrapping the inference in try/except and sending an error message.
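The guarded shape can be sketched without torch; `convert_chunk` and `flaky` below are illustrative stand-ins for the inference call, returning an error string instead of raising so the loop can report the failure and decide whether to skip the chunk or stop:

```python
def convert_chunk(model, chunk):
    """Run one inference step, returning (pcm, error) instead of raising.

    `model`/`chunk` are illustrative stand-ins for
    self.voice_converter(chunk_tensor); in the real loop the caller would
    log the error, send an error frame to the client, and skip
    opus_writer.append_pcm when pcm is None.
    """
    try:
        pcm = model(chunk)
    except MemoryError:
        # for CUDA OOM you would also call torch.cuda.empty_cache() here
        return None, "out of memory"
    except Exception as e:
        return None, f"voice conversion failed: {e}"
    return pcm, None

def flaky(chunk):
    # Stand-in model that fails on one input.
    if chunk == "bad":
        raise RuntimeError("numerical error")
    return chunk.upper()

print(convert_chunk(flaky, "ok"))   # ('OK', None)
print(convert_chunk(flaky, "bad"))  # (None, 'voice conversion failed: numerical error')
```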

🔧 Proposed fix
                     # Voice conversion
-                    with torch.no_grad():
-                        converted = self.voice_converter(chunk_tensor)
-
-                    # Convert back to numpy
-                    converted_pcm = converted[0, 0].cpu().numpy()
-
-                    # Send to opus writer
-                    opus_writer.append_pcm(converted_pcm)
+                    try:
+                        with torch.no_grad():
+                            converted = self.voice_converter(chunk_tensor)
+                        converted_pcm = converted[0, 0].cpu().numpy()
+                        opus_writer.append_pcm(converted_pcm)
+                    except Exception as e:
+                        clog.log("error", f"voice conversion failed: {e}")
+                        await ws.send_bytes(b"\x05" + str(e).encode("utf-8"))
+                        return
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
                    # Voice conversion
                    try:
                        with torch.no_grad():
                            converted = self.voice_converter(chunk_tensor)
                        converted_pcm = converted[0, 0].cpu().numpy()
                        opus_writer.append_pcm(converted_pcm)
                    except Exception as e:
                        clog.log("error", f"voice conversion failed: {e}")
                        await ws.send_bytes(b"\x05" + str(e).encode("utf-8"))
                        return
🤖 Prompt for AI Agents
In `@moshi/moshi/vc_server.py` around lines 239 - 247, Wrap the model inference
call to self.voice_converter inside a try/except around the torch.no_grad()
block so any exceptions (e.g., CUDA OOM or runtime errors) are caught; on
exception, log the error (including the exception text), notify the client about
the failure (via whatever client/error channel this loop uses), optionally
torch.cuda.empty_cache() if OOM, and skip or break processing for that chunk
instead of letting the loop crash; ensure you still handle cleanup such as not
calling opus_writer.append_pcm when converted is unavailable.
