Draft

Changes from 29 commits (34 commits total)
8580e09  Fix MLX dependencies and improve UX (rgr4y, Feb 3, 2026)
dceb874  Fix mac/MLX issues. Make server standalone docker (rgr4y, Feb 13, 2026)
027121a  Fix Docker (rgr4y, Feb 14, 2026)
584a0e6  Output files (rgr4y, Feb 14, 2026)
8e56e8b  fix: add multiprocessing.freeze_support() for PyInstaller (rgr4y, Feb 14, 2026)
4d02c57  fix: remove hardcoded Mac Mini M4 whisper recommendation (rgr4y, Feb 14, 2026)
8c6b8ea  feat: query HuggingFace API for model sizes before download (rgr4y, Feb 14, 2026)
b68d1a2  feat: add generation progress via SSE (rgr4y, Feb 14, 2026)
ad4210e  CLI changes (rgr4y, Feb 14, 2026)
63f7724  Merge branch 'jamiepine:main' into fix-whisper-stt-transcription (rgr4y, Feb 14, 2026)
af84ab9  chore: rename voicebox CLI to voicebox-cli for clarity (rgr4y, Feb 14, 2026)
e7a127e  feat: add EBU R128 loudness normalization (Python-only, no ffmpeg) (rgr4y, Feb 14, 2026)
fed23eb  Merge branch 'audio-normalization' into fix-whisper-stt-transcription (rgr4y, Feb 14, 2026)
5fd5f58  feat: Python-only audio normalization in CLI, simplify upload retry (rgr4y, Feb 14, 2026)
c3e5331  fix: auto-trim and normalize reference audio instead of rejecting (rgr4y, Feb 14, 2026)
d9d6598  feat: enhance deletion UX, job status filtering, and fix lint errors (rgr4y, Feb 14, 2026)
13c9236  Update python-version for local (rgr4y, Feb 14, 2026)
b8b19db  feat: add copy button, import warning, fix force-cancel queue (rgr4y, Feb 14, 2026)
04354f4  fix: only mark API emission after successful fetch (rgr4y, Feb 14, 2026)
9805969  Bump CUDA -> 12.9 & Ubuntu -> 24 (rgr4y, Feb 14, 2026)
29acbd9  Fix macOS dependency resolver conflicts for transformers/MLX/qwen-tts (rgr4y, Feb 15, 2026)
63c1c87  fix: dependencies on macos (rgr4y, Feb 15, 2026)
2219fdf  Remove model prefs (rgr4y, Feb 18, 2026)
1b42a94  Add RunPod Serverless support to voicebox (rgr4y, Feb 18, 2026)
69e31ff  fix: MLX concurrent load crash, 4-bit models, CLI improvements (rgr4y, Feb 18, 2026)
9c7e594  Merge main into serverless; remove Assets.car; fix progress toast (rgr4y, Feb 18, 2026)
81b63a7  fix: download toast polling, model stickiness, HF offline cache, tqdm… (rgr4y, Feb 18, 2026)
a6d8964  fix: CoreAudio device enumeration, input device visibility, AudioTab UX (rgr4y, Feb 18, 2026)
2e0a2ea  feat: mic device picker for voice recording, hide Audio tab on web (rgr4y, Feb 18, 2026)
beb1b3a  fix: address Copilot review comments on serverless handler (rgr4y, Feb 18, 2026)
abf7ae0  Merge branch 'serverless' into fix/audio-device-enumeration (rgr4y, Feb 18, 2026)
52962ce  fix: remove dead useAutoUpdater.ts, fix dep array, fix CFStringGetCSt… (rgr4y, Feb 18, 2026)
d5533fb  fix: health endpoint cleanup, useAutoUpdater showToast, add pre-commi… (rgr4y, Feb 18, 2026)
eb2137e  docs: clarify CoreAudio/cpal device ID matching limitation in audio_o… (rgr4y, Feb 18, 2026)
25 changes: 25 additions & 0 deletions .claude/CLAUDE.md
@@ -0,0 +1,25 @@
# Voicebox Project Notes

## CLI — voicebox-cli vs cli.py

**`voicebox/voicebox-cli`** is the real CLI. It is stdlib-only (no pip deps), self-contained, and is what users actually run. It has all commands: `server`, `voices`, `import`, `generate`/`say`, `health`, `config`, `transcribe`, `create-voice`. Config persists to `~/.config/voicebox/config.json`.

**`voicebox/backend/cli.py`** is dead code. It predates `voicebox-cli` and was superseded by it. Its only live reference is the launcher line in `setup-linux.sh`, which is intentionally left as-is. **Do not modify cli.py.**

When the user asks for CLI changes, always work on `voicebox-cli`.

## Key Architecture

- **Backend**: FastAPI (`backend/main.py`) served by uvicorn on port 17493
- **Entry points**: `server.py` (PyInstaller binary), `backend/main.py __main__` (dev)
- **Dev script**: `scripts/dev-backend-watch.sh` — loads `.env` from `voicebox/` and `../` then runs uvicorn with `--reload`
- **MLX backend**: `backend/backends/mlx_backend.py` — Apple Silicon only, uses mlx-audio. Models: `mlx-community/Qwen3-TTS-12Hz-{1.7B,0.6B}-Base-4bit`. Uses `Base` variants (not `CustomVoice` — those require a named speaker, not ref_audio).
- **PyTorch backend**: `backend/backends/pytorch_backend.py` — CUDA/CPU, uses qwen-tts
- **Logging**: stdlib `logging`. Set `LOG_LEVEL=DEBUG` env var for verbose output.
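The `LOG_LEVEL` convention above can be sketched like this (the helper and format string are illustrative; `force=True` just makes the sketch idempotent):

```python
import logging
import os


def configure_logging() -> logging.Logger:
    # LOG_LEVEL env var selects verbosity; default INFO, DEBUG for verbose output.
    level = getattr(logging, os.environ.get("LOG_LEVEL", "INFO").upper(), logging.INFO)
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # reconfigure even if a handler was already installed
    )
    return logging.getLogger("voicebox")
```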

## MLX Gotchas

- `transformers` verbosity is suppressed at module-level import in `mlx_backend.py` — do not restore or move this
- Concurrent MLX loads crash Metal (`commit an already committed command buffer`) — serialized via `_MLX_LOAD_LOCK` threading lock in `load_model_async`
- `CustomVoice` model variants require a named speaker arg; `Base` variants support arbitrary voice cloning via `ref_audio`/`ref_text`
- On 16GB unified memory, bf16 models cause swap pressure — use 4-bit quantized variants
27 changes: 27 additions & 0 deletions .dockerignore
@@ -0,0 +1,27 @@
data/
backend/venv/
node_modules/
__pycache__/
*.pyc
*.egg-info/
.claude/
.git/
.github/
.vscode/
*.md
docs/
mlx-test/
scripts/
tauri/
web/
landing/
.DS_Store
*.log
*.cache
dist/
build/
.env
.env.*
*.swp
*.swo
*~
6 changes: 6 additions & 0 deletions .gitignore
@@ -39,6 +39,8 @@ data/profiles/*
 data/generations/*
 data/projects/*
 data/voicebox.db
+data/huggingface
+data/model_prefs.json
 !data/.gitkeep
 
 # Logs
@@ -57,3 +59,7 @@ tauri/src-tauri/binaries/*
 tmp/
 temp/
 *.tmp
+output*.m4a
+package-lock.json
+.claude
+tauri/src-tauri/gen/Assets.car
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.12.12
88 changes: 88 additions & 0 deletions Dockerfile
@@ -0,0 +1,88 @@
# syntax=docker/dockerfile:1.4
# Voicebox TTS Server
# CUDA 12.9 + Python 3.12 on Ubuntu 24.04
#
# Build:
#   DOCKER_BUILDKIT=1 docker build -t voicebox .
#   DOCKER_BUILDKIT=1 docker build --build-arg CUDA=0 -t voicebox-cpu .
#   DOCKER_BUILDKIT=1 docker build --build-arg SERVERLESS=1 -t voicebox-serverless .
#
# Run:
#   docker compose up -d

ARG CUDA=1
ARG SERVERLESS=0

# --- Base stage ---
FROM nvidia/cuda:12.9.1-runtime-ubuntu24.04 AS base-cuda
FROM ubuntu:24.04 AS base-cpu

# --- Pick base based on CUDA arg ---
FROM base-cuda AS base-1
FROM base-cpu AS base-0
FROM base-${CUDA} AS base

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
        python3 \
        python3-venv \
        python3-dev \
        python3-pip \
        libsndfile1 \
        ffmpeg \
        curl \
        sox \
    && rm -rf /var/lib/apt/lists/*

# --- Dependencies stage (cached layer) ---
FROM base AS deps

ARG CUDA
WORKDIR /app

# Create virtual environment outside /app to survive volume mount
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY backend/requirements.txt ./requirements.txt

RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --upgrade pip && \
    if [ "$CUDA" = "1" ]; then \
        pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu124 && \
        pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124; \
    else \
        pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cpu && \
        pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu; \
    fi

# Source is volume-mounted at runtime (local dev) or COPYed below (serverless)
ENV HF_HOME=/app/data/huggingface
ENV PATH="/opt/venv/bin:$PATH"

# Copy source into image for non-volume-mount deployments (e.g. RunPod)
COPY backend/ /app/backend/

# --- Normal mode: FastAPI server on port 17493 ---
FROM deps AS final-0
EXPOSE 17493
HEALTHCHECK --interval=60s --timeout=5s --start-period=30s --retries=3 \
    CMD curl -f http://localhost:17493/health || exit 1
ENTRYPOINT ["/opt/venv/bin/python3", "-m", "backend.main"]
CMD ["--host", "0.0.0.0", "--port", "17493", "--data-dir", "/app/data"]

# --- Serverless mode: RunPod handler ---
FROM deps AS final-1
ENV SERVERLESS=1
HEALTHCHECK NONE
ENTRYPOINT ["/opt/venv/bin/python3", "-u", "-m", "backend.serverless_handler"]
CMD []

# --- Pick final stage based on SERVERLESS arg ---
ARG SERVERLESS
FROM final-${SERVERLESS} AS final
24 changes: 18 additions & 6 deletions Makefile
@@ -41,19 +41,29 @@ setup: setup-js setup-python ## Full project setup (all dependencies)
 	@echo -e "  Run $(YELLOW)make dev$(NC) to start development servers"
 
 setup-js: ## Install JavaScript dependencies (bun)
+	@command -v bun >/dev/null 2>&1 || { \
+		echo -e "$(YELLOW)bun not found — installing...$(NC)"; \
+		curl -fsSL https://bun.sh/install | bash; \
+	}
 	@echo -e "$(BLUE)Installing JavaScript dependencies...$(NC)"
 	bun install
 
 setup-python: $(VENV)/bin/activate ## Set up Python virtual environment and dependencies
 	@echo -e "$(BLUE)Installing Python dependencies...$(NC)"
 	$(PIP) install --upgrade pip
-	$(PIP) install -r $(BACKEND_DIR)/requirements.txt
 	@if [ "$$(uname -m)" = "arm64" ] && [ "$$(uname)" = "Darwin" ]; then \
-		echo -e "$(BLUE)Detected Apple Silicon - installing MLX dependencies...$(NC)"; \
+		echo -e "$(BLUE)Detected Apple Silicon - using MLX-compatible dependency resolution...$(NC)"; \
 		$(PIP) install -r $(BACKEND_DIR)/requirements-mlx.txt; \
+		grep -v -E "^transformers" $(BACKEND_DIR)/requirements.txt > /tmp/voicebox-requirements-filtered.txt; \
+		$(PIP) install -r /tmp/voicebox-requirements-filtered.txt; \
+		rm /tmp/voicebox-requirements-filtered.txt; \
+		$(PIP) install --no-deps git+https://github.com/QwenLM/Qwen3-TTS.git; \
+		echo -e "$(GREEN)✓ MLX backend enabled (native Metal acceleration)$(NC)"; \
+		echo -e "$(YELLOW)Note: Using transformers 5.0.0rc3 (required by MLX)$(NC)"; \
+	else \
+		$(PIP) install -r $(BACKEND_DIR)/requirements.txt; \
+		$(PIP) install git+https://github.com/QwenLM/Qwen3-TTS.git; \
 	fi
-	$(PIP) install git+https://github.com/QwenLM/Qwen3-TTS.git
 	@echo -e "$(GREEN)✓ Python environment ready$(NC)"
 
 $(VENV)/bin/activate:
@@ -72,7 +82,7 @@ setup-rust: ## Install Rust toolchain (if not present)
 # =============================================================================
 # DEVELOPMENT
 # =============================================================================
 
-.PHONY: dev dev-backend dev-frontend dev-web kill-dev
+.PHONY: dev dev-backend dev-backend-watch dev-frontend dev-web kill-dev
 
 dev: ## Start backend + desktop app (parallel)
 	@echo -e "$(BLUE)Starting development servers...$(NC)"
@@ -82,9 +92,11 @@ dev: ## Start backend + desktop app (parallel)
 	sleep 2 && $(MAKE) dev-frontend & \
 	wait
 
-dev-backend: ## Start FastAPI backend server
+dev-backend: dev-backend-watch ## Start FastAPI backend server (venv-verified, auto-reload)
+
+dev-backend-watch: ## Start backend with venv verification + Python file watching
 	@echo -e "$(BLUE)Starting backend server on http://localhost:17493$(NC)"
-	$(VENV_BIN)/uvicorn backend.main:app --reload --port 17493
+	./scripts/dev-backend-watch.sh
 
 dev-frontend: ## Start Tauri desktop app
 	@echo -e "$(BLUE)Starting Tauri desktop app...$(NC)"
161 changes: 161 additions & 0 deletions SERVERLESS.md
@@ -0,0 +1,161 @@
# RunPod Serverless Deployment

Voicebox can run as a [RunPod Serverless](https://docs.runpod.io/serverless/quickstart) worker. Workers spin up on demand, process requests, and shut down automatically — you only pay while they're running.

## How it works

The serverless image starts a RunPod handler (`serverless_handler.py`) which:

1. Launches the existing FastAPI server in a background thread
2. Waits for `/health` to respond (up to 5 min on cold start for model downloads)
3. Proxies each RunPod job as an HTTP request to the local server
4. Returns the response — JSON as-is, audio files as base64

Model idle-unloading is disabled in serverless mode (`SERVERLESS=1`). The model stays loaded for the worker's lifetime. RunPod shuts down the entire worker after the configured idle timeout.
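Steps 3 and 4 can be sketched as a pure function (the `do_request` callable and field handling here are illustrative, not the real handler's code):

```python
import base64
from typing import Callable


def handle_job(job_input: dict, do_request: Callable) -> dict:
    """Proxy one RunPod job envelope to the local server (sketch).

    do_request(method, path, body, params, headers) -> (status, raw_bytes, content_type)
    stands in for the HTTP call the real handler makes to the local FastAPI server.
    """
    status, body, content_type = do_request(
        job_input.get("method", "POST"),
        job_input["path"],  # the one required field
        job_input.get("body"),
        job_input.get("params"),
        job_input.get("headers"),
    )
    if content_type.startswith("audio/"):
        # Audio responses go back base64-encoded, flagged with is_binary.
        return {"status_code": status, "is_binary": True,
                "body_base64": base64.b64encode(body).decode("ascii")}
    return {"status_code": status, "is_binary": False,
            "body": body.decode("utf-8")}
```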

## Build the image

```bash
# From the voicebox/ directory
./scripts/serverless-build.sh

# Build + push to Docker Hub
./scripts/serverless-build.sh --push --tag youruser/voicebox-serverless:latest

# Build + push to GHCR
./scripts/serverless-build.sh --push --tag ghcr.io/youruser/voicebox-serverless:latest
```

Or manually:

```bash
DOCKER_BUILDKIT=1 docker build \
--build-arg CUDA=1 \
--build-arg SERVERLESS=1 \
-t voicebox-serverless \
.
```

## RunPod endpoint settings

When creating your endpoint on [runpod.io](https://runpod.io):

| Setting | Recommended |
| ----------------- | -------------------------------- |
| Container image | your pushed image tag |
| GPU | RTX 4090 or similar (16GB+ VRAM) |
| Idle timeout | **60 seconds** |
| Execution timeout | 600 seconds |
| Active workers | 0 (pure on-demand) |
| Max workers | 1 (increase for production) |
| FlashBoot | enabled |

**On idle timeout:** The GPU stays allocated (and billed) for the full idle window regardless of VRAM usage. Keeping the model hot and using a short idle timeout (60s) is more cost-effective than unloading the model and using a long idle timeout.
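A back-of-envelope comparison of the two strategies (all numbers are made-up placeholders, not measured figures):

```python
def billed_seconds(exec_s: float, idle_timeout_s: float, requests: int) -> float:
    # Worst case: each request runs, then the worker sits through
    # the full idle window before shutting down.
    return requests * (exec_s + idle_timeout_s)


# Hot model + short idle window: fast generation, 60 s idle.
hot = billed_seconds(exec_s=5, idle_timeout_s=60, requests=10)
# Unloaded model + long idle window: slower requests (reload), 300 s idle.
cold = billed_seconds(exec_s=35, idle_timeout_s=300, requests=10)
```

Under these hypothetical numbers the hot-model strategy bills far fewer GPU-seconds, which is the trade-off the recommendation above is making.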

## Authentication

Add your RunPod API key to the root `.env`:

```
RUNPOD_API_KEY=your_runpod_api_key_here
```

All requests to the RunPod endpoint require this key as a bearer token:

```
Authorization: Bearer $RUNPOD_API_KEY
```

## Sending requests

RunPod wraps requests in a job envelope. The handler accepts:

| Field | Type | Description |
| --------- | ------ | -------------------------------------- |
| `method` | string | HTTP method (default: `"POST"`) |
| `path` | string | Required. API path, e.g. `"/generate"` |
| `body` | object | JSON body for POST/PUT requests |
| `params` | object | Query string parameters |
| `headers` | object | Additional HTTP headers |

### Health check

```bash
curl -X POST https://api.runpod.ai/v2/$ENDPOINT_ID/runsync \
-H "Authorization: Bearer $RUNPOD_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": {"method": "GET", "path": "/health"}}'
```

### Generate speech

```bash
curl -X POST https://api.runpod.ai/v2/$ENDPOINT_ID/runsync \
-H "Authorization: Bearer $RUNPOD_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"input": {
"method": "POST",
"path": "/generate",
"body": {
"profile_id": "your-profile-id",
"text": "Hello from RunPod."
}
}
}'
```

### Download audio

Audio endpoints return base64-encoded content with `"is_binary": true`:

```json
{
"output": {
"status_code": 200,
"is_binary": true,
"body_base64": "UklGRi..."
}
}
```

Decode it:

```bash
echo "$BODY_BASE64" | base64 -d > output.wav
```

### Async jobs (long generations)

For long texts, use `/run` instead of `/runsync` to avoid the 90s sync timeout:

```bash
# Submit
JOB=$(curl -s -X POST https://api.runpod.ai/v2/$ENDPOINT_ID/run \
-H "Authorization: Bearer $RUNPOD_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": {"path": "/generate", "body": {"profile_id": "...", "text": "..."}}}')

JOB_ID=$(echo $JOB | jq -r '.id')

# Poll
curl https://api.runpod.ai/v2/$ENDPOINT_ID/status/$JOB_ID \
-H "Authorization: Bearer $RUNPOD_API_KEY"
```

## Local testing

The RunPod SDK includes a local test server:

```bash
cd voicebox/
SERVERLESS=1 python3 -m backend.serverless_handler --rp_serve_api
```

This starts a local HTTP server that simulates the RunPod job protocol at `http://localhost:8000`.

## Limitations

- **SSE streaming not supported** — RunPod jobs return a single response. Generation still works, just without real-time progress events. Use `/generate` (non-streaming) via the job body.
- **Ephemeral storage** — The SQLite database (profiles, history) is lost when the worker shuts down. Re-import voice profiles after each cold start, or attach a RunPod network volume for persistence.
- **Cold start time** — First start downloads model weights (~3–5 GB from HuggingFace). Subsequent starts with FlashBoot are much faster.