A production-grade local AI video and voice stack for a single NVIDIA DGX Spark (GB10, 128 GB UMA, ARM64, CUDA 13.0).
NeonForge combines a Next.js frontend, a FastAPI gateway, Redis/supervisor orchestration, and several local AI runtimes for voice, video, and workflow-driven generation on one machine with one GPU.
- CURRENT_STATE.md for the fastest current product summary.
- ARCHITECTURE.md for service boundaries, ports, and runtime wiring.
- VOICEOVER_STUDIO.md for the isolated long-form narration path.
- ACCEPTANCE_TESTS.md for validation and smoke-test workflows.
AI and media runtimes:
| Runtime | Purpose | Notes |
|---|---|---|
| Faster-Whisper | Audio transcription (STT) | Always-on baseline service; also used to transcribe reference audio for Fish Speech |
| F5-TTS | Text-to-speech and voiceover synthesis | Most reliable local voiceover backend |
| Fish Speech 1.5 | Higher-quality local voiceover synthesis | Optional; enabled with FISH_SPEECH_ENABLED=true |
| VoxCPM2 | Experimental local voiceover synthesis | Optional; enabled with VOXCPM2_ENABLED=true |
| LivePortrait | Face animation from image + driving video | Warm GPU runtime |
| Lip-sync | Audio-driven mouth animation | Warm GPU runtime |
| Wan 2.1 | Text-to-video generation | Lazy-start singleton |
| ComfyUI | Template-driven Character Swap and related workflows | Optional Compose profile |
| Wan 2.2 Animate UI | Separate Gradio UI for Wan experimentation | Optional sidecar UI |
Core infrastructure:
| Component | Purpose |
|---|---|
| Frontend | Director's Console web UI on :3000 |
| Gateway | Public API entry point, request routing, voiceover jobs, ComfyUI management, memory gating |
| Supervisor | Internal sidecar with Docker socket, starts/stops lazy containers |
| Redis | Job state, activity timestamps, service coordination |
| Idle Manager | Host-side systemd service, stops idle containers, thermal monitoring |
This diagram focuses on the core gateway/supervisor path. The frontend, optional voiceover runtimes, ComfyUI, and the wan-ui sidecar are omitted for readability.
```
Internet / LAN
        |
  :8080 (Gateway)
        |
+----------+----------+
|    API Gateway      | <-- job queue, memory gate, semaphores
|  (no Docker sock)   |     UMA hard limit: 80%
+----------+----------+     medium concurrency: 1
           |                heavy concurrency: 1
   +-------+-------+--------------+
   |               |              |
+--+-------+ +-----+------+ +----+-----+
|  Redis   | | Supervisor | | Whisper  |
| (state)  | | (Docker    | | (always  |
|          | |  socket)   | |  on)     |
+----------+ +-----+------+ +----------+
                   |
        +----------+----------+
        |          |          |
   +----+---+ +----+---+ +---+----+
   | F5-TTS | | Live   | | Lip    |
   | (warm) | |Portrait| | -sync  |
   |        | | (warm) | | (warm) |
   +--------+ +--------+ +--------+

+----------------------------------+
|          Wan 2.1 (lazy)          |
| Started on demand by Supervisor  |
| Stopped after 300s idle          |
+----------------------------------+
```
Always-on -- Container and model always running. Instant response.
- Frontend, Redis, Gateway, Supervisor, Whisper
Warm / on-demand model services -- Containers are typically running; runtimes may load their model on first request and destroy it after an idle timeout. `/healthz` responds immediately even when the next inference call will still trigger heavier setup work.
- F5-TTS, Fish Speech, VoxCPM2, LivePortrait, Lip-sync
Lazy-start singleton -- Container stopped by default. Gateway asks Supervisor to start it on demand. Stopped by idle_manager after 5 min of inactivity. Only one instance ever runs.
- Wan 2.1
Optional sidecars -- Started only when you enable the relevant workflow or UI.
- ComfyUI, Wan 2.2 Animate UI
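The idle-stop rule behind the warm and lazy tiers can be sketched as a small pure function. The timeout table and function name below are illustrative, not the actual `scripts/idle_manager.py` internals:

```python
import time

# Per-service idle timeouts in seconds, mirroring the defaults described
# above (illustrative names, not the real config keys).
IDLE_TIMEOUTS = {
    "wan21": 300,    # lazy singleton: stop the container after 5 min idle
    "f5tts": 1800,   # warm: destroy the model after 30 min idle
}

def should_stop(service, last_activity_ts, now=None):
    """Return True if the service has been idle longer than its timeout."""
    now = time.time() if now is None else now
    timeout = IDLE_TIMEOUTS.get(service)
    if timeout is None:
        return False  # services without a timeout are never stopped
    return (now - last_activity_ts) > timeout
```

In the real system the last-activity timestamp comes from Redis and the decision runs on the 60-second systemd timer.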
- True model destruction. When a model goes idle, the service doesn't just call `torch.cuda.empty_cache()`. It moves sub-modules to CPU, deletes the Python object, runs `gc.collect()` twice, clears the CUDA allocator, and logs the actual GB freed via `/proc/meminfo`. This is essential on UMA, where "GPU memory" and "system memory" are the same pool.
- Supervisor sidecar. The public-facing gateway has zero Docker socket access. Container lifecycle (start/stop Wan 2.1) is delegated to an internal-only supervisor that holds the socket. This limits the blast radius of a gateway compromise.
- UMA-aware memory gating. The gateway checks `/proc/meminfo` before admitting GPU jobs. Default hard limit: 80% used. Per-tier reservations ensure minimum available GB (heavy = 40 GB, medium = 10 GB, light = 2 GB). `nvidia-smi` memory reporting is "Not Supported" on DGX Spark UMA.
- GPU concurrency = 1 for all tiers. Single GPU, single device. Even medium-weight models serialize to prevent CUDA context contention and OOM on shared UMA.
- video-retalking over MuseTalk. See COMPATIBILITY_NOTES.md for the full comparison. TL;DR: video-retalking has proven quality, subprocess isolation, and lower memory usage; MuseTalk lacks ARM64 support and has an unstable API.
- Voiceover Studio stays isolated. The older Voice Studio / direct TTS flow remains separate from the newer long-form Voiceover Studio path. Voice profile uploads accept WAV, MP3, and M4A, are decoded once with `ffmpeg`, and are stored as a PCM WAV master so downstream runtimes receive canonical reference audio without extra lossy transcodes.
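The true-destruction sequence described above can be sketched without torch. Names here are illustrative; on the real services the `del`/`gc` steps are bracketed by `model.to("cpu")` beforehand and `torch.cuda.empty_cache()` afterwards:

```python
import gc
import weakref

def read_available_kb():
    """MemAvailable from /proc/meminfo, in kB (Linux-only). On UMA this is
    the honest number for both "system" and "GPU" memory."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    return 0

def destroy_model(holder, key="model"):
    """Drop the only strong reference to a loaded model and force collection.

    Omitted so the sketch runs without torch: moving sub-modules to CPU
    first, and clearing the CUDA allocator afterwards.
    """
    model = holder.pop(key, None)
    if model is None:
        return False
    ref = weakref.ref(model)   # lets us verify the object really died
    del model
    gc.collect()
    gc.collect()               # second pass catches reference cycles
    return ref() is None       # True => the Python object was truly destroyed
```

On the DGX host, sampling `read_available_kb()` before and after the destroy gives the GB actually freed, which is what the services log.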
```bash
# 1. Clone and configure
cd ~/neonforge
cp .env.example .env
# Edit .env if needed (defaults are conservative and production-safe)

# 2. Create host directories
sudo mkdir -p /srv/ai/{models,cache/hf,outputs,assets,logs}
sudo chown -R $USER:$USER /srv/ai

# 3. Start the always-on tier
docker compose up -d redis supervisor gateway frontend whisper

# 4. Start the warm tier
docker compose up -d f5tts liveportrait lipsync

# 5. (Optional) Start additional runtimes if configured
# Fish Speech uses a compose image reference rather than a local Dockerfile.
docker compose up -d fish_speech voxcpm2

# 6. Verify
python3 scripts/verify_dgx.py --smoke

# 7. (Optional) Install idle manager systemd units
sudo cp systemd/ai-idle-manager.service /etc/systemd/system/
sudo cp systemd/ai-idle-manager.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now ai-idle-manager.timer
```

Wan 2.1 starts automatically on first request via the gateway:

```bash
curl -X POST http://localhost:8080/api/v1/wan21/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a cat walking in a garden", "num_frames": 16, "width": 512, "height": 512}'
```

All requests go through http://localhost:8080.
| Method | Endpoint | Service / Description | Tier |
|---|---|---|---|
| POST | `/api/v1/whisper/transcribe` | Whisper | Light |
| POST | `/api/v1/tts/synthesize` | F5-TTS | Medium |
| POST | `/api/v1/voiceover/profiles` | Create reusable voice profile from WAV/MP3/M4A input | -- |
| GET | `/api/v1/voiceover/profiles` | List saved voice profiles | -- |
| DELETE | `/api/v1/voiceover/profiles/{profile_id}` | Delete saved voice profile | -- |
| GET | `/api/v1/voiceover/profiles/{profile_id}/sample` | Download stored reference sample | -- |
| GET | `/api/v1/voiceover/models` | List available voiceover backends | -- |
| POST | `/api/v1/voiceover/jobs` | Queue long-form voiceover render | Medium |
| GET | `/api/v1/voiceover/jobs/{job_id}` | Voiceover job status lookup | -- |
| GET | `/api/v1/voiceover/outputs` | List recent voiceover outputs | -- |
| GET | `/api/v1/voiceover/output/{job_id}` | Download finished voiceover output | -- |
| DELETE | `/api/v1/voiceover/output/{job_id}` | Delete finished voiceover output | -- |
| POST | `/api/v1/liveportrait/animate` | LivePortrait | Medium |
| POST | `/api/v1/lipsync/sync` | Lip-sync | Medium |
| POST | `/api/v1/wan21/generate` | Wan 2.1 | Heavy |
| GET | `/api/v1/comfyui/templates` | Managed ComfyUI templates | Heavy |
| GET | `/api/v1/comfyui/templates/{template_id}` | Template detail + validation | Heavy |
| GET | `/api/v1/comfyui/models` | Read-only ComfyUI model inventory | -- |
| GET | `/api/v1/comfyui/assets` | Uploaded Character Swap assets | -- |
| POST | `/api/v1/comfyui/assets/upload` | Upload Character Swap asset | -- |
| DELETE | `/api/v1/comfyui/assets/{asset_id}` | Delete uploaded Character Swap asset | -- |
| POST | `/api/v1/comfyui/jobs` | Queue template-driven ComfyUI job | Heavy |
| GET | `/api/v1/comfyui/jobs/{job_id}` | Template-driven ComfyUI job detail | Heavy |
| GET | `/api/v1/outputs/{path}` | File retrieval | -- |
| GET | `/healthz` | Gateway health | -- |
| GET | `/readyz` | Gateway + Redis readiness | -- |
| GET | `/memory` | UMA memory status + thresholds | -- |
| GET | `/services/status` | All service health | -- |
| GET | `/jobs/{job_id}` | Job status lookup | -- |
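Asynchronous endpoints return a job id that can be polled via `GET /jobs/{job_id}`. A minimal polling loop might look like the following; the terminal status names and the injected `fetch` callable are assumptions, not confirmed gateway behavior:

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}  # assumed names

def poll_job(fetch, job_id, interval=1.0, max_polls=120):
    """Poll a job record until it reaches a terminal status.

    `fetch` is any callable mapping a job id to the decoded JSON record
    (e.g. a small urllib wrapper around GET /jobs/{job_id}); injecting it
    keeps the loop testable offline.
    """
    for _ in range(max_polls):
        record = fetch(job_id)
        if record.get("status") in TERMINAL_STATUSES:
            return record
        time.sleep(interval)
    raise TimeoutError("job %s did not finish within %d polls" % (job_id, max_polls))
```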
Voiceover Studio is the isolated long-form narration path in NeonForge. It intentionally does not replace the older direct TTS flow in Voice Studio.
- Voice profiles accept `.wav`, `.mp3`, and `.m4a` uploads.
- Accepted profile uploads are normalized once with `ffmpeg` and stored as a PCM WAV master.
- New saved reference profiles always persist as `.wav`; older saved MP3/WAV profiles remain compatible.
- Ingest keeps a high-quality master reference instead of forcing `24 kHz` mono up front. Any runtime-specific conversion should happen in the model call path.
- Reference clips longer than 30 seconds are rejected when duration tools are available.
- The UI tracks active jobs across refreshes and keeps a recent-outputs list for playback, download, and cleanup.
- VoxCPM2 exposes three explicit modes in this surface:
  - `design`: no reference audio, optional style/control text
  - `clone`: reference audio only, optional style/control text
  - `continuation`: reference audio plus the exact transcript of that clip
- Vox now defaults to normal clone semantics instead of silently auto-entering continuation mode.
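The three VoxCPM2 modes above imply simple request-validation rules, sketched here with illustrative field names (the real gateway payload may differ):

```python
def validate_vox_request(mode, has_reference, transcript=None):
    """Reject invalid VoxCPM2 mode/field combinations before queueing a render."""
    if mode == "design":
        if has_reference:
            raise ValueError("design mode takes no reference audio")
    elif mode == "clone":
        if not has_reference:
            raise ValueError("clone mode requires reference audio")
    elif mode == "continuation":
        if not (has_reference and transcript):
            raise ValueError("continuation needs reference audio plus its exact transcript")
    else:
        raise ValueError("unknown VoxCPM2 mode: %s" % mode)
```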
See VOICEOVER_STUDIO.md for the product-level notes and gateway/voiceover/ for the implementation.
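The one-time `ffmpeg` normalization described for profile ingest can be sketched as an argv builder; the exact flags used by the real ingest path may differ:

```python
from pathlib import Path

def ffmpeg_ingest_cmd(src, dst):
    """Build the ffmpeg argv that decodes an uploaded WAV/MP3/M4A once
    into a PCM WAV master (16-bit little-endian samples). Run it with
    subprocess.run(cmd, check=True)."""
    return [
        "ffmpeg",
        "-y",                  # overwrite an existing output file
        "-i", str(src),        # decode whatever container/codec was uploaded
        "-c:a", "pcm_s16le",   # store a lossless PCM master
        str(dst),
    ]
```

Keeping the master lossless means any runtime-specific resampling (e.g. to 24 kHz mono) can happen later in the model call path without stacking lossy transcodes.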
NeonForge now includes a managed Character Swap flow in the Studio UI that runs ComfyUI workflows through the gateway instead of treating ComfyUI as an untracked sidecar.
- Template manifests and workflow source files live in `gateway/templates/comfyui/`.
- Uploaded Character Swap assets live under `${ASSETS_DIR}/comfyui/uploads/` and are mounted in the gateway at `/app/data/assets/comfyui/uploads/`.
- Before submission, selected uploaded assets are staged into the ComfyUI input directory at `/outputs/comfyui/input/`.
- Final videos are written by ComfyUI to `/outputs/comfyui/output/` and are exposed through the existing history/download APIs.
- The template workflow JSON is the source-of-truth graph stored in the repo.
- The runtime payload sent to ComfyUI is always an API prompt graph under `payload.prompt`.
- For full exported workflows, NeonForge converts the stored workflow to API format through ComfyUI's `/workflow/convert` endpoint at submission time, then patches only the manifest-defined runtime inputs.
- The first managed template patches:
  - `reference_image` -> node 57 -> `inputs.image`
  - `driving_video` -> node 63 -> `inputs.video`
  - `seed` -> node 27 -> `inputs.seed`
  - `steps` -> node 27 -> `inputs.steps`
  - `cfg` -> node 27 -> `inputs.cfg`
  - `denoise_strength` -> node 27 -> `inputs.denoise_strength`
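The manifest-driven patching above can be sketched as a pure function over the API prompt graph. The mapping structure below is illustrative, not the actual manifest schema:

```python
import copy

def apply_runtime_mappings(prompt, mappings, values):
    """Patch manifest-defined runtime inputs into an API prompt graph.

    `mappings` maps an input name to a (node_id, input_key) pair, mirroring
    the node-patch list above. The stored template is never mutated.
    """
    patched = copy.deepcopy(prompt)
    for name, (node_id, input_key) in mappings.items():
        if name in values:
            patched[node_id]["inputs"][input_key] = values[name]
    return patched
```

Deep-copying before patching is what keeps the repo-stored workflow the source of truth: every submission starts from the pristine graph.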
- NeonForge scans only the ComfyUI model roots from `COMFYUI_MODEL_ROOTS` (default: `/models/comfyui`, `/opt/ComfyUI/custom_nodes/comfyui_controlnet_aux/ckpts`).
- `COMFYUI_MODEL_ROOTS` accepts multiple roots separated by commas, newlines, or semicolons.
- In Docker Compose, the gateway mounts `/models` read-only for shared weights, and it falls back to a read-only supervisor-backed scan for ComfyUI container-only roots such as `comfyui_controlnet_aux/ckpts`.
- Validation is read-only. NeonForge never downloads models, never mutates the ComfyUI models directory, and never tries to "fix" missing files.
- If a template references a model that is not present, submission is rejected with a clear validation error before the workflow is queued.
- The validator reports which root satisfied each required model and includes the scanned roots in `/api/v1/comfyui/models` for debugging.
- The Wan Character Swap template validates the workflow's referenced Wan checkpoint, VAE, CLIP vision, text encoder, LoRAs, SAM2 weights, and preprocess ONNX/TorchScript files by filename.
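The read-only root scan can be sketched with `pathlib`. Matching is by filename, as described above; the function name is illustrative:

```python
from pathlib import Path

def find_model(roots, filename):
    """Return the first root containing `filename` (matched by name,
    recursively), or None. The scan is strictly read-only: it never
    downloads, moves, or creates files."""
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            continue  # missing roots are skipped, not created
        for candidate in root.rglob(filename):
            if candidate.is_file():
                return root
    return None
```

Returning the satisfying root (rather than just a boolean) is what lets the validator report which root provided each required model.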
- Add a manifest file to `gateway/templates/comfyui/<template-id>.manifest.json`.
- Add the workflow source JSON to `gateway/templates/comfyui/` and point `workflow_file` at it.
- Define `required_inputs`, `optional_params`, the `output` node expectation, and the `runtime_mappings` that patch node inputs at runtime.
- Make sure every referenced model is already present in the mounted ComfyUI model roots, then refresh the Character Swap tab or call `GET /api/v1/comfyui/templates` to see validation results.
Each service exposes:
- `GET /healthz` -- process alive (no model check)
- `GET /readyz` -- model loaded status + UMA usage
- `GET /smoke` -- lightweight deterministic test (may trigger model load)
```
neonforge/
  docker-compose.yml          # All services, volumes, networks
  .env.example                # Configuration template
  ARCHITECTURE.md             # Service boundaries, ports, and mounts
  ACCEPTANCE_TESTS.md         # Step-by-step validation + benchmark matrix
  COMPATIBILITY_NOTES.md      # ARM64 / CUDA / DGX Spark risks per service
  CURRENT_STATE.md            # Current product surfaces and practical guidance
  README.md                   # This file
  VOICEOVER_STUDIO.md         # Notes for the isolated voiceover path
  frontend/
    app/                      # Next.js routes and studio pages
    components/               # Shared UI, including Voiceover Studio
    Dockerfile
  gateway/
    app.py                    # FastAPI gateway, job queue, memory gate
    voiceover/                # Voice profiles, model registry, chunking, runner
    templates/comfyui/        # Managed ComfyUI workflow manifests
    Dockerfile
    requirements.txt
  supervisor/
    app.py                    # Internal sidecar, Docker socket holder
    Dockerfile
    requirements.txt
  services/
    whisper/
      app.py                  # Faster-whisper, always-on
      Dockerfile
      requirements.txt
    f5tts/
      app.py                  # F5-TTS, warm with true model destroy
      Dockerfile
      requirements.txt
    liveportrait/
      app.py                  # LivePortrait, warm with true model destroy
      Dockerfile
      requirements.txt
    lipsync/
      app.py                  # video-retalking/SadTalker, warm
      Dockerfile
      requirements.txt
    voxcpm2/
      app.py                  # Experimental local cloned-voice backend
      Dockerfile
      requirements.txt
    wan21/
      app.py                  # Wan 2.1, lazy singleton with true destroy
      Dockerfile
      requirements.txt
    comfyUI/
      Dockerfile              # Optional managed ComfyUI runtime
    Wan2.2-Animate/
      Dockerfile              # Optional wan-ui sidecar
  scripts/
    verify_dgx.py             # Health check, smoke tests, latency percentiles
    idle_manager.py           # Systemd-driven idle container stopper
  systemd/
    ai-idle-manager.service   # Systemd unit for idle manager
    ai-idle-manager.timer     # 60-second timer for idle checks
```
| Host Path | Container Mount | Purpose |
|---|---|---|
| `/srv/ai/models` | `/models` | Model weights (persistent) |
| `/srv/ai/cache/hf` | `/cache/hf` | HuggingFace download cache |
| `/srv/ai/outputs` | `/outputs` | Generated files (audio, video) |
| `/srv/ai/assets` | `/app/data/assets` | Voice profiles and uploaded workflow assets |
| `/srv/ai/logs` | `/logs` | Service logs |
All configuration is via `.env`. Key settings:

| Variable | Default | Description |
|---|---|---|
| `MEMORY_HARD_LIMIT` | `80` | UMA % usage that blocks new GPU jobs |
| `MEM_RESERVE_HEAVY_GB` | `40` | Min GB free to admit heavy jobs (Wan) |
| `MEM_RESERVE_MEDIUM_GB` | `10` | Min GB free to admit medium jobs |
| `MEM_RESERVE_LIGHT_GB` | `2` | Min GB free to admit light jobs |
| `ASSETS_DIR` | `/srv/ai/assets` | Persistent reference audio and uploaded workflow assets |
| `WAN21_MODEL_VARIANT` | `1.3B` | Wan model size (`1.3B` safe, `14B` risky) |
| `WAN21_IDLE_TIMEOUT` | `300` | Seconds before an idle Wan container stops |
| `F5TTS_IDLE_TIMEOUT` | `1800` | Seconds before the F5-TTS model is destroyed |
| `FISH_SPEECH_ENABLED` | `false` | Enables Fish Speech in Voiceover Studio |
| `VOXCPM2_ENABLED` | `false` | Enables VoxCPM2 in Voiceover Studio |
| `COMFYUI_MODEL_ROOTS` | `/models/comfyui,/opt/ComfyUI/custom_nodes/comfyui_controlnet_aux/ckpts` | Read-only model roots scanned for managed ComfyUI templates |
| `LIPSYNC_BACKEND` | `video-retalking` | Lip-sync backend (or `sadtalker`) |
- NVIDIA DGX Spark -- GB10 GPU, Blackwell architecture (sm_121)
- 128 GB UMA -- CPU and GPU share one memory pool
- ARM64 (aarch64) -- NVIDIA Grace CPU
- CUDA 13.0 -- Driver 580.126.09
- Single GPU -- All concurrency is serialized
Key UMA implications:
- `nvidia-smi` reports memory as "Not Supported"
- All memory monitoring uses `/proc/meminfo` (MemAvailable, SwapFree)
- Docker `--memory` limits don't cap GPU tensor allocations
- OOM is system-level (Linux OOM killer), not CUDA-level
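A minimal sketch of the `/proc/meminfo`-based admission check, using the default thresholds from the configuration table. Parsing is factored out so the logic works on sample text as well as the live file; the real gateway code may differ in detail:

```python
RESERVE_GB = {"heavy": 40, "medium": 10, "light": 2}  # per-tier minimum free GB
HARD_LIMIT_PCT = 80  # UMA usage above which all GPU jobs are blocked

def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            out[key.strip()] = int(parts[0])
    return out

def admit(meminfo, tier):
    """True if a job of this tier may start given current UMA usage."""
    total_kb = meminfo["MemTotal"]
    avail_kb = meminfo["MemAvailable"]
    used_pct = 100 * (total_kb - avail_kb) / total_kb
    if used_pct >= HARD_LIMIT_PCT:
        return False  # hard limit trips regardless of tier
    return avail_kb / (1024 * 1024) >= RESERVE_GB[tier]
```

On the live system the input is simply `open("/proc/meminfo").read()`; on UMA this single check covers both CPU and GPU pressure.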
Run the verification script:

```bash
python3 scripts/verify_dgx.py --gateway-url http://localhost:8080 --smoke --json
```

This reports:
- Service health (/healthz, /readyz)
- Smoke test results with latency percentiles (p50/p95/p99)
- Memory gate status and thresholds
- Host memory: MemAvailable, SwapFree, Buffers/Cache
- Per-container RSS via `docker stats`
- Per-process GPU memory via `nvidia-smi pmon`
- GPU temperature, utilization, power, clocks
See ACCEPTANCE_TESTS.md for full step-by-step validation including the benchmark matrix.
See COMPATIBILITY_NOTES.md for the full assessment. Top risks:
- PyTorch sm_121 kernels -- pip wheels lack Blackwell support. Use NGC containers or JIT compile.
- CTranslate2 ARM64 -- May need source build for GPU acceleration.
- InsightFace / MediaPipe ARM64 -- Limited or missing ARM64 support for LivePortrait dependencies.
- video-retalking C++ deps -- basicsr, gfpgan, facexlib need source builds on ARM64.
- Wan 14B memory -- 40-80 GB peak on 128 GB UMA is risky. Default to 1.3B.
- UMA OOM -- System-level, not CUDA-level. The Linux OOM killer may terminate unrelated processes.
If you are an LLM reading this codebase:
- Architecture: Microservices in Docker, one container per AI model. Gateway routes and queues. Supervisor manages container lifecycle.
- Docs to read first: `CURRENT_STATE.md`, `ARCHITECTURE.md`, and `VOICEOVER_STUDIO.md`.
- State: Redis stores job records and per-service activity timestamps. Models are stateless (weights loaded from `/models`, outputs written to `/outputs`).
- Memory model: UMA (shared CPU+GPU). All memory checks use `/proc/meminfo`, never `nvidia-smi`. The gateway gates job admission at 80% UMA usage.
- Concurrency: Single GPU. `gpu_heavy_sem=1`, `gpu_medium_sem=1`. Jobs within a tier serialize.
- Model lifecycle: Warm services load on first request and truly destroy (not just cache-flush) after an idle timeout. Lazy services (Wan) have their entire container stopped.
- Voiceover split: Voice Studio and Voiceover Studio are intentionally separate. The long-form voiceover path lives in `gateway/voiceover/` and `frontend/components/VoiceoverStudio.tsx`.
- Voice profile ingest: New voice profile uploads accept WAV/MP3/M4A and are normalized once to a PCM WAV master on ingest.
- Testing: Run `python3 scripts/verify_dgx.py --smoke --json` for a machine-readable health report.
- Key files: `docker-compose.yml` for topology, `gateway/app.py` for routing/scheduling, `gateway/voiceover/` for long-form narration, `supervisor/app.py` for container lifecycle, and `.env.example` for config knobs.
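The per-tier serialization noted above (`gpu_heavy_sem=1`, `gpu_medium_sem=1`) can be sketched with `asyncio.Semaphore`; the job body below stands in for a real inference call:

```python
import asyncio

async def run_gpu_job(sem, job_id, log):
    """Acquire the tier's single permit before touching the GPU."""
    async with sem:              # queued jobs wait here; nothing shares the GPU
        log.append(("start", job_id))
        await asyncio.sleep(0)   # stands in for the actual inference call
        log.append(("end", job_id))

async def demo():
    gpu_medium_sem = asyncio.Semaphore(1)   # one medium job at a time
    log = []
    await asyncio.gather(*(run_gpu_job(gpu_medium_sem, i, log) for i in range(3)))
    return log
```

Even with three jobs submitted concurrently, the log shows strict start/end alternation: no job starts before the previous one has finished, which is exactly the contention-avoidance behavior the gateway relies on.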
Private project. Not licensed for redistribution.