Claude Code = Senior Video Systems Engineer

This document is the master blueprint for building a complete AI-powered video creation system, grounded in research: academic papers, GitHub repos, and industry tools.
Goal: Build a programmable video editor with an LLM (Claude) at its core, integrating:
- Editing Engines (FFmpeg, Remotion, Editly, MoviePy)
- Timeline Frameworks (OpenTimelineIO, GStreamer GES)
- Generative AI Models (Veo, Runway, VACE, Stable Diffusion)
- Orchestration Platforms (VideoAgent, LangGraph, MCP servers)
The Vision: Natural language becomes the ultimate API for video editing. A simple prompt like "Make me a highlights reel with upbeat music and animated captions" results in Claude autonomously writing code, invoking models, and producing a polished video.
Output Types:
- Storytelling videos (narrative, explainers, short films)
- YouTube tutorials (screen recordings, code walkthroughs)
- Faceless content (stock footage + voiceover + text)
- Avatar-driven content (AI presenter)
```
┌──────────────────────────────────────────────────────────────────────────────────┐
│ VOIDSTUDIO │
├──────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ENTRY POINTS │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ CLI/API │ │ Web UI │ │ MCP Server │ │
│ │ (void cmd) │ │ (Next.js) │ │ (Anthropic) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ └───────────────────┼───────────────────┘ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────────────┐ │
│ │ AGENTIC ORCHESTRATION LAYER │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ DIRECTOR (Claude LLM) │ │ │
│ │ │ • Intent Analysis → decompose user instructions into sub-tasks │ │ │
│ │ │ • Script Generation → write voiceover, scene breakdowns │ │ │
│ │ │ • Workflow Planning → create execution graph │ │ │
│ │ │ • Tool Selection → choose which pipeline/model for each task │ │ │
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Storyboard │ │ Composer │ │ Renderer │ │ Exporter │ │ │
│ │ │ Agent │ │ Agent │ │ Agent │ │ Agent │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────────────────────────┐ │
│ │ GENERATION PIPELINES │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ VIDEO │ │ VOICE │ │ IMAGE │ │ AVATAR │ │ │
│ │ │ Pipeline │ │ Pipeline │ │ Pipeline │ │ Pipeline │ │ │
│ │ ├─────────────┤ ├─────────────┤ ├─────────────┤ ├─────────────┤ │ │
│ │ │ • Veo 3 │ │ • ElevenLabs│ │ • DALL-E 3 │ │ • HeyGen │ │ │
│ │ │ • Runway │ │ • OpenAI TTS│ │ • Flux │ │ • D-ID │ │ │
│ │ │ • VACE │ │ • Coqui │ │ • SDXL │ │ • SadTalker │ │ │
│ │ │ • SVD │ │ • Bark │ │ • Midjourney│ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ MUSIC │ │ CAPTION │ │ SCREEN │ │ │
│ │ │ Pipeline │ │ Pipeline │ │ Pipeline │ │ │
│ │ ├─────────────┤ ├─────────────┤ ├─────────────┤ │ │
│ │ │ • Suno │ │ • Whisper │ │ • Playwright│ │ │
│ │ │ • Udio │ │ • faster-w │ │ • Puppeteer │ │ │
│ │ │ • Stable A. │ │ • AssemblyAI│ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────────────┐ │
│ │ COMPOSITION ENGINE │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ REMOTION (React Video) │ │ │
│ │ │ • Timeline management • Scene composition │ │ │
│ │ │ • React components • Text animations │ │ │
│ │ │ • Parallel rendering • Keyframe animations │ │ │
│ │ │ • Interactive preview • Audio sync │ │ │
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ EDITLY (Declarative JSON) │ │ │
│ │ │ • JSON5 spec → video • Transitions & effects │ │ │
│ │ │ • Audio mixing & ducking • Text overlays │ │ │
│ │ │ • Picture-in-picture • GL shader support │ │ │
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ FFMPEG / FFMPerative │ │ │
│ │ │ • Natural language → FFmpeg • Concat/trim/split │ │ │
│ │ │ • Caption burn-in • Format conversion │ │ │
│ │ │ • Audio normalization • Batch processing │ │ │
│ │ └─────────────────────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────────────┐ │
│ │ TIMELINE INTERCHANGE │ │
│ │ ┌──────────────────────────────┐ ┌──────────────────────────────┐ │ │
│ │ │ OpenTimelineIO (OTIO) │ │ GStreamer GES │ │ │
│ │ │ • Pixar timeline format │ │ • High-level timeline API │ │ │
│ │ │ • EDL/XML/AAF export │ │ • OTIO integration │ │ │
│ │ │ • Industry standard │ │ • Python bindings │ │ │
│ │ └──────────────────────────────┘ └──────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────────────────────┐ │
│ │ ASSET MANAGEMENT │ │
│ │ • Local file storage • S3/R2 cloud storage │ │
│ │ • Asset database (SQLite) • Thumbnail generation │ │
│ │ • Project versioning • Media library │ │
│ └────────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
```
FFmpeg is the fundamental open-source multimedia engine. Claude generates FFmpeg command sequences directly or drives it through language bindings.
| Tool | Description | Link |
|---|---|---|
| FFmpeg | Core video processing | ffmpeg.org |
| ffmpeg-python | Python bindings | github.com/kkroening/ffmpeg-python |
| FFMPerative | Natural language → FFmpeg | github.com/remyxai/FFMPerative |
| ffmperative-7b | Fine-tuned Llama 2 for FFmpeg | huggingface.co/remyxai/ffmperative-7b |
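The ffmpeg-python bindings listed above expose FFmpeg's filter graph as chained Python calls. A hedged sketch (filenames and overlay coordinates are illustrative; the import is deferred so the snippet stays loadable without the package installed):

```python
def scale_and_watermark(src: str, logo: str, dst: str):
    """Scale `src` to 1080p and overlay `logo` in the top-left corner.

    Returns the assembled ffmpeg-python stream; call .run() on it to execute.
    """
    import ffmpeg  # deferred: pip install ffmpeg-python

    video = ffmpeg.input(src).filter("scale", -2, 1080)  # -2 keeps aspect ratio
    mark = ffmpeg.input(logo)
    return ffmpeg.overlay(video, mark, x=10, y=10).output(dst)
```

This is the kind of call chain Claude could emit in place of a raw command string when structured composition matters.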
FFMPerative Example:

```python
from ffmperative import ffmp

# Natural language video editing
ffmp("sample the 5th frame from '/path/to/video.mp4'")
ffmp("merge subtitles 'captions.srt' with video 'video.mp4' calling it 'video_caps.mp4'")
ffmp("resize video.mp4 to 1080p and add a watermark")
```

Remotion is an open-source React framework for creating videos programmatically, treating video elements as React components.
| Resource | Description | Link |
|---|---|---|
| Remotion Core | Main framework (24k+ stars) | github.com/remotion-dev/remotion |
| Remotion MCP | Model Context Protocol server | remotion.dev/docs/ai/mcp |
| @remotion/mcp | npm package | npmjs.com/package/@remotion/mcp |
| Rodumani | Remotion editor MCP server | github.com/smilish67/rodumani |
| react-video-editor | CapCut/Canva clone | github.com/designcombo/react-video-editor |
Remotion Features:
- React component model for video
- CSS, Canvas, SVG, WebGL support
- Variables, functions, APIs in video
- Studio (web UI) + Editor Starter kit
- Parallel rendering to .mp4 or other formats
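Remotion renders are typically kicked off via its CLI, so the orchestration layer only needs to build a shell invocation. A minimal helper (composition ID, output path, and props are hypothetical; `--props` accepts inline JSON per the Remotion CLI docs):

```python
import json

def remotion_render_cmd(composition: str, out: str, props: dict) -> list[str]:
    """Build an `npx remotion render` invocation with serialized input props."""
    return [
        "npx", "remotion", "render",
        composition, out,
        f"--props={json.dumps(props)}",  # injected into the React composition
    ]

cmd = remotion_render_cmd("Explainer", "out/final.mp4", {"title": "Welcome"})
```

A dispatcher would pass `cmd` to `subprocess.run(cmd, check=True)` once the assets referenced by the props exist.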
Editly is a declarative video editing library (Node.js + FFmpeg) driven by JSON/JSON5 specifications.
| Resource | Description | Link |
|---|---|---|
| Editly | Main library | github.com/mifi/editly |
| npm package | editly on npm | npmjs.com/package/editly |
Editly JSON5 Spec Example:

```json5
{
  outPath: "./output.mp4",
  width: 1920,
  height: 1080,
  fps: 30,
  clips: [
    {
      duration: 5,
      transition: { name: "fade" },
      layers: [
        { type: "video", path: "./intro.mp4" },
        { type: "title", text: "Welcome" }
      ]
    }
  ],
  audioFilePath: "./music.mp3"
}
```

Editly Features:
- Declarative API with JSON5
- Multiple output formats (MP4, MKV, GIF)
- Any aspect ratio (1:1, 9:16, 16:9)
- Audio mixing, crossfading, ducking
- Picture-in-picture, GL shaders
- Streaming editing (fast, low storage)
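Because Editly's input is just a spec file, Claude can build the spec as a plain Python dict and hand it to the Editly CLI. A stdlib-only sketch (assumes plain JSON is acceptable to Editly, which also reads JSON5; paths are illustrative):

```python
import json
import tempfile

spec = {
    "outPath": "./output.mp4",
    "width": 1920, "height": 1080, "fps": 30,
    "clips": [
        {"duration": 5,
         "transition": {"name": "fade"},
         "layers": [
             {"type": "video", "path": "./intro.mp4"},
             {"type": "title", "text": "Welcome"},
         ]},
    ],
    "audioFilePath": "./music.mp3",
}

def build_editly_cmd(spec: dict) -> list[str]:
    """Write the spec to a temp JSON file and return the editly CLI invocation."""
    f = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
    json.dump(spec, f)
    f.close()
    return ["npx", "editly", f.name]

cmd = build_editly_cmd(spec)
```

Running `cmd` through `subprocess.run` would then produce `output.mp4` once the referenced media exists.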
MoviePy is a Python library for scriptable video editing that leverages FFmpeg internally.
| Resource | Description | Link |
|---|---|---|
| MoviePy | Main library | github.com/Zulko/moviepy |
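A hedged sketch of a MoviePy composition using the 1.x API (`moviepy.editor` and `subclip` were renamed in 2.x); filenames are illustrative, and the import is deferred so the snippet loads without MoviePy installed:

```python
def make_title_video(src: str, text: str, dst: str, duration: float = 5.0) -> None:
    """Overlay a centered title on the first `duration` seconds of `src`."""
    # Deferred import: pip install moviepy (1.x); TextClip also needs ImageMagick.
    from moviepy.editor import CompositeVideoClip, TextClip, VideoFileClip

    clip = VideoFileClip(src).subclip(0, duration)
    title = (TextClip(text, fontsize=70, color="white")
             .set_duration(duration)
             .set_position("center"))
    CompositeVideoClip([clip, title]).write_videofile(dst, fps=30)
```

This is the natural fallback when a task is too dynamic for a declarative Editly spec but doesn't need Remotion's React layer.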
OpenTimelineIO (OTIO) is a powerful open-source timeline interchange format, originally from Pixar, for editorial cut data.
| Resource | Description | Link |
|---|---|---|
| OpenTimelineIO | Main repo (ASWF) | github.com/AcademySoftwareFoundation/OpenTimelineIO |
| Documentation | Official docs | opentimeline.io |
OTIO Features:
- Industry-standard interchange format
- C++ core with Python bindings
- Export to EDL, Final Cut Pro XML, AAF
- Supported by Adobe, Avid, DaVinci Resolve
- Plugin system (adapters, media linkers)
OTIO Use Case: Claude generates/modifies OTIO files representing sequences, then exports to industry formats.
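A sketch of that use case with the `opentimelineio` Python package (clip names, paths, and the fixed 30 fps rate are this sketch's assumptions; the import is deferred so the snippet loads without the package):

```python
def build_timeline(clips):
    """Assemble an OTIO timeline from (name, path, duration_seconds) tuples at 30 fps."""
    import opentimelineio as otio  # deferred: pip install OpenTimelineIO

    track = otio.schema.Track(name="V1")
    for name, path, seconds in clips:
        track.append(otio.schema.Clip(
            name=name,
            media_reference=otio.schema.ExternalReference(target_url=path),
            source_range=otio.opentime.TimeRange(
                start_time=otio.opentime.RationalTime(0, 30),
                duration=otio.opentime.RationalTime(seconds * 30, 30),
            ),
        ))
    # Serialize with otio.adapters.write_to_file(timeline, "project.otio")
    return otio.schema.Timeline(name="voidstudio cut", tracks=[track])
```

From the resulting `.otio` file, OTIO's adapters handle export to EDL, FCP XML, or AAF for hand-off to an NLE.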
GStreamer Editing Services (GES) is a high-level timeline API built on top of GStreamer.
| Resource | Description | Link |
|---|---|---|
| GES Documentation | Official docs | gstreamer.freedesktop.org/documentation/gst-editing-services |
| GES + OTIO | Integration blog | blogs.gnome.org/tsaunier |
GES Features:
- GESTimeline, GESLayer, GESClip objects
- Python bindings via GObject introspection
- Save/load timelines (including OTIO)
- Audio mixing, video compositing
- Used in Pitivi (open-source NLE)
| Provider | Model | Type | Best For | Link |
|---|---|---|---|---|
| Google DeepMind | Veo 3 | Text-to-video | High quality, audio sync | deepmind.google/veo |
| Runway | Gen-3 Alpha | Text/Image-to-video | Fast iteration | runwayml.com |
| Pika Labs | Pika 1.5 | Text-to-video | Motion control | pika.art |
| Luma AI | Dream Machine | Text-to-video | Cinematic quality | lumalabs.ai |
VACE is the first open-source unified model for video creation AND editing.
| Resource | Description | Link |
|---|---|---|
| VACE GitHub | Main repo | github.com/ali-vilab/VACE |
| Wan2.1 Base | Base model | github.com/Wan-Video/Wan2.1 |
| HuggingFace | Model weights | huggingface.co/Wan-AI/Wan2.1-VACE-14B |
| Paper | ICCV 2025 | ali-vilab.github.io/VACE-Page |
VACE Capabilities:
- R2V (Reference-to-Video): Generate from reference images
- V2V (Video-to-Video): Edit entire videos via text
- MV2V (Masked Video-to-Video): Edit specific regions
High-Level Tasks:
- Move-Anything
- Swap-Anything
- Reference-Anything
- Expand-Anything
- Animate-Anything
Model Versions:
- `Wan2.1-VACE-14B` (720p, highest quality)
- `Wan2.1-VACE-1.3B` (480p, faster)
Requirements: Python 3.10.13, PyTorch >= 2.5.1, CUDA 12.4
| Model | Type | Link |
|---|---|---|
| Stable Video Diffusion | Image-to-video | github.com/Stability-AI/generative-models |
| ModelScope Text2Video | Text-to-video | huggingface.co/damo-vilab |
| HuggingFace Diffusers | Multi-model library | github.com/huggingface/diffusers |
VideoAgent is a multi-agent system for video understanding, editing, and creation.
| Resource | Description | Link |
|---|---|---|
| VideoAgent | Main repo | github.com/HKUDS/VideoAgent |
| ViMax | Related project | github.com/HKUDS/AI-Creator |
VideoAgent Architecture:
- Intent Analysis - Decomposes user instructions into sub-intents
- Autonomous Tool Use - Graph-based workflow with feedback loops
- Multi-Modal Understanding - Visual queries for retrieval
Specialized Agents:
- Storyboard Agent (visual queries)
- Video agents (understanding, editing, remixing)
- Voice synthesis agents (TTS, conversion)
- Graph Router (LLM-driven orchestration)
Unique Capabilities:
- Beat-synced edits
- Meme video remaking
- Song remixes
- Cross-lingual adaptations
- 0.95 workflow success rate
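The graph-based, feedback-looped orchestration VideoAgent describes can be sketched in plain Python. Task names, the retry policy, and the self-reported success flag are illustrative placeholders, not VideoAgent's actual API:

```python
from typing import Callable

def run_workflow(graph: dict[str, list[str]],
                 tasks: dict[str, Callable[[dict], bool]],
                 start: str, state: dict, max_retries: int = 2) -> dict:
    """Walk the task graph; a failing task is retried before moving on (feedback loop)."""
    queue = [start]
    while queue:
        node = queue.pop(0)
        for _attempt in range(max_retries + 1):
            if tasks[node](state):
                break  # task self-reported success
        queue.extend(graph.get(node, []))  # enqueue downstream nodes
    return state

# Hypothetical three-stage pipeline: storyboard -> render -> export.
state = run_workflow(
    graph={"storyboard": ["render"], "render": ["export"], "export": []},
    tasks={
        "storyboard": lambda s: bool(s.setdefault("scenes", 3)),
        "render": lambda s: s.setdefault("rendered", True),
        "export": lambda s: bool(s.setdefault("path", "out.mp4")),
    },
    start="storyboard", state={},
)
```

In a real system each task would call a generation pipeline and the router itself would be LLM-driven, choosing edges rather than following a static dict.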
LAVE is a research system (IUI 2024) for LLM-powered agent assistance and language augmentation in video editing.
| Resource | Description | Link |
|---|---|---|
| Paper | arXiv 2024 | arxiv.org/abs/2402.10294 |
| Project Page | Demo & details | dgp.toronto.edu/~bryanw/lave |
| ACM DL | IUI 2024 | dl.acm.org/doi/10.1145/3640543.3645143 |
LAVE Features:
- Auto-generates language descriptions for footage
- LLM plans and executes editing tasks
- Mixed-initiative UI (agent + manual refinement)
- Uses VLMs for video content understanding
- Brainstorming, semantic search, storyboarding
LangGraph is a production-ready framework for stateful, multi-step AI workflows.
| Resource | Description | Link |
|---|---|---|
| LangGraph | Main framework | langchain-ai.github.io/langgraph |
| Deep Agents | Complex agent harness | github.com/langchain-ai/deepagents |
LangGraph Features:
- Stateful workflows with persistence
- Human-in-the-loop controls
- Memory APIs
- 1-click agent deployment
Anthropic's Model Context Protocol (MCP) allows LLMs to securely use external tools.
| Server | Description | Link |
|---|---|---|
| Video Editor MCP | Kush Agrawal's server | github.com/Kush36Agrawal/Video_Editor_MCP |
| ffmpeg-mcp-lite | Lightweight FFmpeg MCP | github.com/kevinwatt/ffmpeg-mcp-lite |
| mcp-ffmpeg | Bits Corp server | glama.ai/mcp/servers/@bitscorp-mcp/mcp-ffmpeg |
Example Usage:
User: "Trim video.mp4 from 1:30 to 2:45"
→ MCP translates to FFmpeg command
→ Executes locally
→ Returns result with progress tracking
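A hedged sketch of the translation step, from parsed times to an FFmpeg trim command. The parsing here is illustrative; real MCP servers receive structured tool arguments rather than free text:

```python
def to_seconds(ts: str) -> int:
    """'1:30' -> 90; '2:45' -> 165."""
    mins, secs = ts.split(":")
    return int(mins) * 60 + int(secs)

def trim_command(src: str, start: str, end: str, dst: str) -> list[str]:
    """Build the FFmpeg trim invocation; -c copy avoids re-encoding."""
    return ["ffmpeg", "-ss", str(to_seconds(start)), "-to", str(to_seconds(end)),
            "-i", src, "-c", "copy", dst]

cmd = trim_command("video.mp4", "1:30", "2:45", "video_trimmed.mp4")
```

The MCP server would execute `cmd` locally and stream progress back to the model.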
| Resource | Description | Link |
|---|---|---|
| Official MCP | Remotion docs integration | remotion.dev/docs/ai/mcp |
| @remotion/mcp | npm package | npmjs.com/package/@remotion/mcp |
| Rodumani | Full editor via MCP | github.com/smilish67/rodumani |
Rodumani Features:
- Media file management (upload, metadata, thumbnails)
- Timeline editing (multi-track, precise timing)
- Editing operations (trim, split, move, undo/redo)
- 2D transformations, keyframe animation
- Transition effects
| Project | Stars | What It Does | Link |
|---|---|---|---|
| ShortGPT | 5k+ | Full YouTube Shorts automation | github.com/RayVentura/ShortGPT |
| MoneyPrinterTurbo | 18k+ | One-click video generation | github.com/harry0703/MoneyPrinterTurbo |
| MoneyPrinter | 10k+ | YouTube Shorts from topic | github.com/FujiwaraChoki/MoneyPrinter |
| Mosaico | 1k+ | Python video composition with AI | github.com/FolhaSP/mosaico |
| Frame | New | AI-powered "vibe" video editor | github.com/aregrid/frame |
ShortGPT highlights:
- Uses OpenAI for script generation
- ElevenLabs + EdgeTTS for voice
- Pexels for background footage
- Auto-generates captions
- Supports 30+ languages

MoneyPrinterTurbo highlights:
- One-click video from a topic/keyword
- Auto-generates copy, materials, subtitles, and music
- Supports multiple LLMs (OpenAI, Gemini, Ollama, DeepSeek)
- High-definition, royalty-free output
- Web UI at http://127.0.0.1:8080
| Provider | Type | Best For | Link |
|---|---|---|---|
| ElevenLabs | API | Most realistic | github.com/elevenlabs/elevenlabs-python |
| OpenAI TTS | API | Fast, integrated | OpenAI API |
| Coqui TTS | Local | Free, open source | github.com/coqui-ai/TTS |
| Bark | Local | Emotions, sound effects | github.com/suno-ai/bark |
| Tool | Type | Best For | Link |
|---|---|---|---|
| Whisper | Local | Accuracy | github.com/openai/whisper |
| faster-whisper | Local | Speed (4x faster) | github.com/guillaumekln/faster-whisper |
| auto-subtitle | Local | Whisper + FFmpeg overlay | github.com/m1guelpf/auto-subtitle |
| AssemblyAI | API | Real-time | assemblyai.com |
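Whichever transcriber the caption pipeline uses, the output ultimately needs to be SRT. A stdlib-only formatter for `(start_seconds, end_seconds, text)` segments, the shape tools like faster-whisper can be reduced to (segment tuples here are this sketch's assumption, not any library's native type):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: 3.5 -> '00:00:03,500'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render an iterable of (start_s, end_s, text) tuples as an SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

The resulting `.srt` file can then be burned in via FFmpeg's `subtitles` filter or kept as a soft track.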
| Provider | Type | Link |
|---|---|---|
| Suno | Full songs | github.com/gcui-art/suno-api (unofficial) |
| Udio | DAW integration | udio.com |
| Stable Audio | Background music | stability.ai |
| Provider | Model | Type | Link |
|---|---|---|---|
| OpenAI | DALL-E 3 | API | OpenAI API |
| Stability AI | SDXL | Local/API | github.com/Stability-AI/generative-models |
| Black Forest Labs | Flux | Local/API | github.com/black-forest-labs/flux |
| HuggingFace | Diffusers | Multi-model | github.com/huggingface/diffusers |
| ComfyUI | Node workflows | Local | github.com/comfyanonymous/ComfyUI |
| Provider | Type | Link |
|---|---|---|
| HeyGen | Professional avatars | github.com/HeyGen-Official/StreamingAvatarSDK |
| D-ID | Realistic motion | d-id.com |
| SadTalker | Open source face animation | github.com/OpenTalker/SadTalker |
```
voidstudio/
├── apps/
│   ├── cli/                  # Command-line interface
│   │   ├── src/
│   │   │   ├── commands/     # CLI commands
│   │   │   ├── lib/          # Shared utilities
│   │   │   └── index.ts
│   │   └── package.json
│   │
│   ├── web/                  # Web UI (Next.js + Remotion)
│   │   ├── src/
│   │   │   ├── app/          # Next.js app router
│   │   │   ├── components/   # React components
│   │   │   └── remotion/     # Remotion compositions
│   │   └── package.json
│   │
│   └── mcp-server/           # MCP server for Claude
│       ├── src/
│       │   ├── tools/        # MCP tool definitions
│       │   └── index.ts
│       └── package.json
│
├── packages/
│   ├── core/                 # Core types and utilities
│   ├── director/             # Claude-powered orchestration
│   ├── composer/             # Scene assembly logic
│   ├── timeline/             # OTIO integration
│   └── exporter/             # FFmpeg/Editly export
│
├── pipelines/
│   ├── video/                # Veo, Runway, VACE
│   ├── voice/                # ElevenLabs, OpenAI TTS
│   ├── image/                # DALL-E, Flux, SD
│   ├── avatar/               # HeyGen, SadTalker
│   ├── music/                # Suno, Stable Audio
│   ├── caption/              # Whisper, AssemblyAI
│   └── screen/               # Playwright recording
│
├── remotion/                 # Remotion video project
│   ├── src/
│   │   ├── compositions/     # Video templates
│   │   ├── components/       # Reusable components
│   │   └── Root.tsx
│   └── remotion.config.ts
│
├── templates/                # Video template library
│   ├── explainer/
│   ├── tutorial/
│   ├── storytelling/
│   ├── short-form/
│   └── faceless/
│
├── python/                   # Python utilities
│   ├── ffmperative/          # Natural language FFmpeg
│   ├── whisper_pipeline/     # Caption generation
│   ├── vace/                 # VACE model integration
│   └── requirements.txt
│
├── config/
│   ├── providers.yaml
│   └── templates.yaml
│
└── package.json              # Root monorepo config
```
- Architecture plan
- Monorepo setup (pnpm workspaces)
- Core TypeScript packages
- Basic CLI structure
- Remotion project initialization
- FFmpeg utility layer (FFMPerative)
- Voice pipeline (ElevenLabs + OpenAI TTS)
- Image pipeline (DALL-E + Replicate)
- Caption pipeline (faster-whisper)
- Basic Editly integration
- Remotion components library
- Template system (explainer, tutorial)
- OpenTimelineIO integration
- Scene assembly logic
- Director agent (Claude)
- Workflow graph execution
- Full pipeline integration
- CLI commands complete
- Video generation (Runway, VACE)
- Avatar pipeline (HeyGen)
- Music pipeline (Suno)
- Web UI
- MCP server
```bash
# Full video creation from topic
void create "How quantum computers work" \
  --style explainer \
  --duration 60 \
  --voice elevenlabs:adam

# Script generation only
void script "The history of AI" --format youtube --output script.json

# Voice generation
void voice "Hello world" --provider elevenlabs --voice adam --output hello.mp3

# Image generation
void image "A futuristic city at sunset" --provider dalle --count 4

# Video clip generation
void clip "A robot walking through a forest" --provider runway --duration 4

# Caption generation
void caption ./video.mp4 --output ./video.srt --language en

# Render Remotion composition
void render ./projects/my-video --output ./final.mp4

# Avatar video
void avatar "Welcome to my channel" --provider heygen --avatar professional

# Natural language editing (FFMPerative style)
void edit "trim video.mp4 from 0:30 to 1:45 and add fade transitions"

# Compose from Editly spec
void compose ./spec.json5 --output ./final.mp4

# Export to OTIO
void export ./project.json --format otio --output ./project.otio
```
```bash
# .env.example

# LLM
ANTHROPIC_API_KEY=

# Voice
ELEVENLABS_API_KEY=
OPENAI_API_KEY=

# Video Generation
RUNWAY_API_KEY=
REPLICATE_API_KEY=
GCP_PROJECT_ID=
GOOGLE_APPLICATION_CREDENTIALS=

# Avatars
HEYGEN_API_KEY=
DID_API_KEY=

# Music
SUNO_COOKIE=  # Unofficial

# Storage
R2_ACCOUNT_ID=
R2_ACCESS_KEY_ID=
R2_SECRET_ACCESS_KEY=
R2_BUCKET_NAME=

# Database
DATABASE_URL=file:./data/voidstudio.db
```

| Engine | Use Case |
|---|---|
| Remotion | Complex animations, React components, interactive preview |
| Editly | Fast declarative edits, JSON specs, simple compositions |
| FFmpeg direct | Raw processing, format conversion, batch operations |
| OTIO | Industry interchange, export to NLEs |
Following LAVE and VideoAgent research:
- LLM as central planner ("Director")
- Specialized agents for different tasks
- Graph-based workflow execution
- Self-correction and feedback loops
- Mixed-initiative (AI + human refinement)
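One way the Director pattern above could bottom out in code: the LLM returns a JSON plan, and a thin dispatcher maps each step onto a pipeline function. The registry entries, plan shape, and returned strings are this sketch's assumptions, not a fixed schema:

```python
import json

# Hypothetical registry; real entries would invoke the actual generation pipelines.
PIPELINES = {
    "voice": lambda p: f"voice:{p['text'][:20]}",
    "image": lambda p: f"image:{p['prompt'][:20]}",
    "caption": lambda p: f"caption:{p['video']}",
}

def execute_plan(plan_json: str) -> list[str]:
    """Run each step of an LLM-produced plan through the pipeline registry."""
    plan = json.loads(plan_json)
    results = []
    for step in plan["steps"]:
        handler = PIPELINES[step["pipeline"]]  # tool selection made by the Director
        results.append(handler(step["params"]))
    return results

plan = json.dumps({"steps": [
    {"pipeline": "voice", "params": {"text": "Hello world"}},
    {"pipeline": "caption", "params": {"video": "out.mp4"}},
]})
results = execute_plan(plan)
```

Self-correction would wrap this loop: on a failed step, the step's error is fed back to Claude to produce a revised plan.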
- Standardized tool interface for Claude
- Secure local execution
- Progress tracking and error handling
- Future-proof (Anthropic-backed standard)
- LAVE: Wang et al., "LLM-Powered Agent Assistance and Language Augmentation for Video Editing" (IUI 2024) - arXiv:2402.10294
- VideoAgent: HKUDS, "All-in-One Agentic Framework for Video Understanding, Editing, and Remaking" - GitHub
- VACE: Alibaba, "Video All-in-one Creation and Editing" (ICCV 2025) - GitHub
- OpenTimelineIO: Pixar/ASWF - GitHub
- Approve this plan - Any changes?
- Initialize monorepo - pnpm + TypeScript
- Set up Remotion - Core composition engine
- Build FFMPerative integration - Natural language editing
- Create first template - Simple explainer
- Build Director agent - Claude orchestration
Ready to build?