
VOIDSTUDIO: AI Video Creation System

Architecture Plan v2.0

Claude Code = Senior Video Systems Engineer

This document is the master blueprint for building a complete AI-powered video creation system, grounded in research: academic papers, GitHub repos, and industry tools.


Executive Summary

Goal: Build a programmable video editor with an LLM (Claude) at its core, integrating:

  • Editing Engines (FFmpeg, Remotion, Editly, MoviePy)
  • Timeline Frameworks (OpenTimelineIO, GStreamer GES)
  • Generative AI Models (Veo, Runway, VACE, Stable Diffusion)
  • Orchestration Platforms (VideoAgent, LangGraph, MCP servers)

The Vision: Natural language becomes the ultimate API for video editing. A simple prompt like "Make me a highlights reel with upbeat music and animated captions" results in Claude autonomously writing code, invoking models, and producing a polished video.

Output Types:

  1. Storytelling videos (narrative, explainers, short films)
  2. YouTube tutorials (screen recordings, code walkthroughs)
  3. Faceless content (stock footage + voiceover + text)
  4. Avatar-driven content (AI presenter)

System Architecture

┌──────────────────────────────────────────────────────────────────────────────────┐
│                                 VOIDSTUDIO                                        │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                   │
│  ENTRY POINTS                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                        │
│  │   CLI/API    │    │   Web UI     │    │  MCP Server  │                        │
│  │  (void cmd)  │    │  (Next.js)   │    │ (Anthropic)  │                        │
│  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘                        │
│         └───────────────────┼───────────────────┘                                 │
│                             ▼                                                     │
│  ┌────────────────────────────────────────────────────────────────────────────┐  │
│  │                    AGENTIC ORCHESTRATION LAYER                              │  │
│  │  ┌─────────────────────────────────────────────────────────────────────┐   │  │
│  │  │                    DIRECTOR (Claude LLM)                             │   │  │
│  │  │  • Intent Analysis → decompose user instructions into sub-tasks      │   │  │
│  │  │  • Script Generation → write voiceover, scene breakdowns            │   │  │
│  │  │  • Workflow Planning → create execution graph                        │   │  │
│  │  │  • Tool Selection → choose which pipeline/model for each task       │   │  │
│  │  └─────────────────────────────────────────────────────────────────────┘   │  │
│  │                                                                             │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐           │  │
│  │  │ Storyboard │  │  Composer  │  │  Renderer  │  │  Exporter  │           │  │
│  │  │   Agent    │  │   Agent    │  │   Agent    │  │   Agent    │           │  │
│  │  └────────────┘  └────────────┘  └────────────┘  └────────────┘           │  │
│  └────────────────────────────────────────────────────────────────────────────┘  │
│                             │                                                     │
│         ┌───────────────────┼───────────────────┐                                │
│         ▼                   ▼                   ▼                                 │
│  ┌────────────────────────────────────────────────────────────────────────────┐  │
│  │                      GENERATION PIPELINES                                   │  │
│  │                                                                             │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐       │  │
│  │  │   VIDEO     │  │   VOICE     │  │   IMAGE     │  │   AVATAR    │       │  │
│  │  │  Pipeline   │  │  Pipeline   │  │  Pipeline   │  │  Pipeline   │       │  │
│  │  ├─────────────┤  ├─────────────┤  ├─────────────┤  ├─────────────┤       │  │
│  │  │ • Veo 3     │  │ • ElevenLabs│  │ • DALL-E 3  │  │ • HeyGen    │       │  │
│  │  │ • Runway    │  │ • OpenAI TTS│  │ • Flux      │  │ • D-ID      │       │  │
│  │  │ • VACE      │  │ • Coqui     │  │ • SDXL      │  │ • SadTalker │       │  │
│  │  │ • SVD       │  │ • Bark      │  │ • Midjourney│  │             │       │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘       │  │
│  │                                                                             │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                        │  │
│  │  │   MUSIC     │  │   CAPTION   │  │   SCREEN    │                        │  │
│  │  │  Pipeline   │  │  Pipeline   │  │  Pipeline   │                        │  │
│  │  ├─────────────┤  ├─────────────┤  ├─────────────┤                        │  │
│  │  │ • Suno      │  │ • Whisper   │  │ • Playwright│                        │  │
│  │  │ • Udio      │  │ • faster-w  │  │ • Puppeteer │                        │  │
│  │  │ • Stable A. │  │ • AssemblyAI│  │             │                        │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘                        │  │
│  └────────────────────────────────────────────────────────────────────────────┘  │
│                             │                                                     │
│                             ▼                                                     │
│  ┌────────────────────────────────────────────────────────────────────────────┐  │
│  │                      COMPOSITION ENGINE                                     │  │
│  │                                                                             │  │
│  │  ┌─────────────────────────────────────────────────────────────────────┐   │  │
│  │  │                    REMOTION (React Video)                            │   │  │
│  │  │  • Timeline management         • Scene composition                   │   │  │
│  │  │  • React components            • Text animations                     │   │  │
│  │  │  • Parallel rendering          • Keyframe animations                 │   │  │
│  │  │  • Interactive preview         • Audio sync                          │   │  │
│  │  └─────────────────────────────────────────────────────────────────────┘   │  │
│  │                                                                             │  │
│  │  ┌─────────────────────────────────────────────────────────────────────┐   │  │
│  │  │                    EDITLY (Declarative JSON)                         │   │  │
│  │  │  • JSON5 spec → video          • Transitions & effects               │   │  │
│  │  │  • Audio mixing & ducking      • Text overlays                       │   │  │
│  │  │  • Picture-in-picture          • GL shader support                   │   │  │
│  │  └─────────────────────────────────────────────────────────────────────┘   │  │
│  │                                                                             │  │
│  │  ┌─────────────────────────────────────────────────────────────────────┐   │  │
│  │  │                    FFMPEG / FFMPerative                              │   │  │
│  │  │  • Natural language → FFmpeg   • Concat/trim/split                   │   │  │
│  │  │  • Caption burn-in             • Format conversion                   │   │  │
│  │  │  • Audio normalization         • Batch processing                    │   │  │
│  │  └─────────────────────────────────────────────────────────────────────┘   │  │
│  └────────────────────────────────────────────────────────────────────────────┘  │
│                             │                                                     │
│                             ▼                                                     │
│  ┌────────────────────────────────────────────────────────────────────────────┐  │
│  │                    TIMELINE INTERCHANGE                                     │  │
│  │  ┌──────────────────────────────┐  ┌──────────────────────────────┐       │  │
│  │  │    OpenTimelineIO (OTIO)     │  │    GStreamer GES             │       │  │
│  │  │  • Pixar timeline format      │  │  • High-level timeline API   │       │  │
│  │  │  • EDL/XML/AAF export         │  │  • OTIO integration          │       │  │
│  │  │  • Industry standard          │  │  • Python bindings           │       │  │
│  │  └──────────────────────────────┘  └──────────────────────────────┘       │  │
│  └────────────────────────────────────────────────────────────────────────────┘  │
│                             │                                                     │
│                             ▼                                                     │
│  ┌────────────────────────────────────────────────────────────────────────────┐  │
│  │                       ASSET MANAGEMENT                                      │  │
│  │  • Local file storage        • S3/R2 cloud storage                         │  │
│  │  • Asset database (SQLite)   • Thumbnail generation                        │  │
│  │  • Project versioning        • Media library                               │  │
│  └────────────────────────────────────────────────────────────────────────────┘  │
│                                                                                   │
└──────────────────────────────────────────────────────────────────────────────────┘

Part 1: Editing Engines & Libraries

1.1 FFmpeg (Core Processing Engine)

The fundamental open-source multimedia engine. Claude generates FFmpeg command sequences or uses bindings.

| Tool | Description | Link |
|---|---|---|
| FFmpeg | Core video processing | ffmpeg.org |
| ffmpeg-python | Python bindings | github.com/kkroening/ffmpeg-python |
| FFMPerative | Natural language → FFmpeg | github.com/remyxai/FFMPerative |
| ffmperative-7b | Fine-tuned Llama 2 for FFmpeg | huggingface.co/remyxai/ffmperative-7b |

FFMPerative Example:

from ffmperative import ffmp

# Natural language video editing
ffmp("sample the 5th frame from '/path/to/video.mp4'")
ffmp("merge subtitles 'captions.srt' with video 'video.mp4' calling it 'video_caps.mp4'")
ffmp("resize video.mp4 to 1080p and add a watermark")
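Alongside FFMPerative, the Director can also emit plain FFmpeg argument lists itself. A minimal stdlib sketch of one common operation, joining clips via the concat demuxer (the flags used are standard FFmpeg options; nothing here actually invokes FFmpeg):

```python
import os
import shlex
import tempfile

def build_concat_command(paths: list[str], dst: str) -> tuple[list[str], str]:
    """Write an FFmpeg concat-demuxer list file and return the command to run.

    Stream-copy concat (-c copy) assumes all inputs share codec, resolution,
    and timebase; otherwise a re-encode is needed.
    """
    listing = "".join(f"file '{p}'\n" for p in paths)
    fd, list_path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write(listing)
    cmd = [
        "ffmpeg", "-y",              # overwrite output without prompting
        "-f", "concat", "-safe", "0",  # read the list file as a virtual input
        "-i", list_path,
        "-c", "copy",                # copy streams, no re-encode
        dst,
    ]
    return cmd, list_path

cmd, lst = build_concat_command(["a.mp4", "b.mp4"], "joined.mp4")
print(shlex.join(cmd))
```

In the full system this list would be handed to a subprocess runner (or an MCP tool) rather than printed.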

1.2 Remotion (React-Based Video)

Open-source React.js framework for creating videos programmatically. Treats video elements as React components.

| Resource | Description | Link |
|---|---|---|
| Remotion Core | Main framework (24k+ stars) | github.com/remotion-dev/remotion |
| Remotion MCP | Model Context Protocol server | remotion.dev/docs/ai/mcp |
| @remotion/mcp | npm package | npmjs.com/package/@remotion/mcp |
| Rodumani | Remotion editor MCP server | github.com/smilish67/rodumani |
| react-video-editor | CapCut/Canva clone | github.com/designcombo/react-video-editor |

Remotion Features:

  • React component model for video
  • CSS, Canvas, SVG, WebGL support
  • Variables, functions, APIs in video
  • Studio (web UI) + Editor Starter kit
  • Parallel rendering to .mp4 or other formats

1.3 Editly (Declarative JSON Editing)

Declarative video editing library (Node.js + FFmpeg) driven by JSON/JSON5 specifications.

| Resource | Description | Link |
|---|---|---|
| Editly | Main library | github.com/mifi/editly |
| npm package | editly on npm | npmjs.com/package/editly |

Editly JSON5 Spec Example:

{
  outPath: "./output.mp4",
  width: 1920,
  height: 1080,
  fps: 30,
  clips: [
    {
      duration: 5,
      transition: { name: "fade" },
      layers: [
        { type: "video", path: "./intro.mp4" },
        { type: "title", text: "Welcome" }
      ]
    }
  ],
  audioFilePath: "./music.mp3"
}

Editly Features:

  • Declarative API with JSON5
  • Multiple output formats (MP4, MKV, GIF)
  • Any aspect ratio (1:1, 9:16, 16:9)
  • Audio mixing, crossfading, ducking
  • Picture-in-picture, GL shaders
  • Streaming editing (fast, low storage)
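Because JSON is valid JSON5, the Director can assemble an Editly spec as a plain Python dict and serialize it with the stdlib; a sketch (file paths and the tuple shape are illustrative):

```python
import json

def editly_spec(clips, music=None, out_path="./output.mp4",
                width=1920, height=1080, fps=30):
    """Assemble an Editly edit spec from (video_path, duration, title) tuples."""
    spec = {
        "outPath": out_path,
        "width": width, "height": height, "fps": fps,
        "clips": [
            {
                "duration": duration,
                "transition": {"name": "fade"},
                "layers": [
                    {"type": "video", "path": path},
                    {"type": "title", "text": title},
                ],
            }
            for path, duration, title in clips
        ],
    }
    if music:
        spec["audioFilePath"] = music
    return spec

spec = editly_spec([("./intro.mp4", 5, "Welcome")], music="./music.mp3")
print(json.dumps(spec, indent=2))
```

Written to a file, this is exactly the kind of spec the Editly CLI consumes.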

1.4 MoviePy (Python Video Editing)

Python library for scriptable video editing, leveraging FFmpeg internally.

| Resource | Description | Link |
|---|---|---|
| MoviePy | Main library | github.com/Zulko/moviepy |

Part 2: Timeline Formats & Frameworks

2.1 OpenTimelineIO (OTIO)

Open-source timeline interchange format for editorial cut data, originally developed at Pixar and now hosted by the Academy Software Foundation.

| Resource | Description | Link |
|---|---|---|
| OpenTimelineIO | Main repo (ASWF) | github.com/AcademySoftwareFoundation/OpenTimelineIO |
| Documentation | Official docs | opentimeline.io |

OTIO Features:

  • Industry-standard interchange format
  • C++ core with Python bindings
  • Export to EDL, Final Cut Pro XML, AAF
  • Supported by Adobe, Avid, DaVinci Resolve
  • Plugin system (adapters, media linkers)

OTIO Use Case: Claude generates/modifies OTIO files representing sequences, then exports to industry formats.
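OTIO files are plain JSON tagged with OTIO_SCHEMA identifiers, so a timeline can be sketched without the library. The schema version strings below are assumptions based on current OTIO releases; in practice you would build this with the opentimelineio Python package:

```python
import json

RATE = 24  # frames per second

def rational_time(frames: int, rate: int = RATE) -> dict:
    return {"OTIO_SCHEMA": "RationalTime.1", "value": frames, "rate": rate}

def clip(name: str, start_frame: int, duration_frames: int) -> dict:
    """A minimal OTIO clip with a source range but no media reference."""
    return {
        "OTIO_SCHEMA": "Clip.2",
        "name": name,
        "source_range": {
            "OTIO_SCHEMA": "TimeRange.1",
            "start_time": rational_time(start_frame),
            "duration": rational_time(duration_frames),
        },
    }

timeline = {
    "OTIO_SCHEMA": "Timeline.1",
    "name": "voidstudio_cut",
    "tracks": {
        "OTIO_SCHEMA": "Stack.1",
        "children": [
            {
                "OTIO_SCHEMA": "Track.1",
                "kind": "Video",
                "children": [clip("intro", 0, 120), clip("main", 0, 480)],
            }
        ],
    },
}
print(json.dumps(timeline, indent=2))
```

This is the shape Claude would emit or patch before handing the file to an OTIO adapter for EDL/FCP XML/AAF export.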

2.2 GStreamer Editing Services (GES)

High-level timeline API on top of GStreamer.

| Resource | Description | Link |
|---|---|---|
| GES Documentation | Official docs | gstreamer.freedesktop.org/documentation/gst-editing-services |
| GES + OTIO | Integration blog | blogs.gnome.org/tsaunier |

GES Features:

  • GESTimeline, GESLayer, GESClip objects
  • Python bindings via GObject introspection
  • Save/load timelines (including OTIO)
  • Audio mixing, video compositing
  • Used in Pitivi (open-source NLE)

Part 3: Generative Video Models

3.1 Commercial APIs

| Provider | Model | Type | Best For | Link |
|---|---|---|---|---|
| Google | Veo 3 | Text-to-video | High quality, audio sync | deepmind.google/veo |
| Runway | Gen-3 Alpha | Text/Image-to-video | Fast iteration | runwayml.com |
| Pika Labs | Pika 1.5 | Text-to-video | Motion control | pika.art |
| Luma AI | Dream Machine | Text-to-video | Cinematic quality | lumalabs.ai |

3.2 VACE (Open Source - Alibaba)

Alibaba's open-source unified model for both video creation and editing, presented by its authors as the first of its kind.

| Resource | Description | Link |
|---|---|---|
| VACE GitHub | Main repo | github.com/ali-vilab/VACE |
| Wan2.1 Base | Base model | github.com/Wan-Video/Wan2.1 |
| HuggingFace | Model weights | huggingface.co/Wan-AI/Wan2.1-VACE-14B |
| Paper | ICCV 2025 | ali-vilab.github.io/VACE-Page |

VACE Capabilities:

  • R2V (Reference-to-Video): Generate from reference images
  • V2V (Video-to-Video): Edit entire videos via text
  • MV2V (Masked Video-to-Video): Edit specific regions

High-Level Tasks:

  • Move-Anything
  • Swap-Anything
  • Reference-Anything
  • Expand-Anything
  • Animate-Anything

Model Versions:

  • Wan2.1-VACE-14B (720p, highest quality)
  • Wan2.1-VACE-1.3B (480p, faster)

Requirements: Python 3.10.13, PyTorch >= 2.5.1, CUDA 12.4

3.3 Other Open-Source Video Models

| Model | Type | Link |
|---|---|---|
| Stable Video Diffusion | Image-to-video | github.com/Stability-AI/generative-models |
| ModelScope Text2Video | Text-to-video | huggingface.co/damo-vilab |
| HuggingFace Diffusers | Multi-model library | github.com/huggingface/diffusers |

Part 4: Agentic Frameworks

4.1 VideoAgent (HKU)

Multi-agent system for video understanding, editing, and creation.

| Resource | Description | Link |
|---|---|---|
| VideoAgent | Main repo | github.com/HKUDS/VideoAgent |
| ViMax | Related project | github.com/HKUDS/AI-Creator |

VideoAgent Architecture:

  1. Intent Analysis - Decomposes user instructions into sub-intents
  2. Autonomous Tool Use - Graph-based workflow with feedback loops
  3. Multi-Modal Understanding - Visual queries for retrieval

Specialized Agents:

  • Storyboard Agent (visual queries)
  • Video agents (understanding, editing, remixing)
  • Voice synthesis agents (TTS, conversion)
  • Graph Router (LLM-driven orchestration)

Unique Capabilities:

  • Beat-synced edits
  • Meme video remaking
  • Song remixes
  • Cross-lingual adaptations
  • Reported workflow success rate of 0.95

4.2 LAVE (Meta Research)

LLM-Powered Agent Assistance and Language Augmentation for Video Editing.

| Resource | Description | Link |
|---|---|---|
| Paper | arXiv 2024 | arxiv.org/abs/2402.10294 |
| Project Page | Demo & details | dgp.toronto.edu/~bryanw/lave |
| ACM DL | IUI 2024 | dl.acm.org/doi/10.1145/3640543.3645143 |

LAVE Features:

  • Auto-generates language descriptions for footage
  • LLM plans and executes editing tasks
  • Mixed-initiative UI (agent + manual refinement)
  • Uses VLMs for video content understanding
  • Brainstorming, semantic search, storyboarding

4.3 LangGraph (LangChain)

Production-ready framework for stateful, multi-step AI workflows.

| Resource | Description | Link |
|---|---|---|
| LangGraph | Main framework | langchain-ai.github.io/langgraph |
| Deep Agents | Complex agent harness | github.com/langchain-ai/deepagents |

LangGraph Features:

  • Stateful workflows with persistence
  • Human-in-the-loop controls
  • Memory APIs
  • 1-click agent deployment

Part 5: MCP (Model Context Protocol) Servers

Anthropic's Model Context Protocol (MCP) gives LLMs a standardized, secure interface for invoking external tools.

5.1 FFmpeg MCP Servers

| Server | Description | Link |
|---|---|---|
| Video Editor MCP | Kush Agrawal's server | github.com/Kush36Agrawal/Video_Editor_MCP |
| ffmpeg-mcp-lite | Lightweight FFmpeg MCP | github.com/kevinwatt/ffmpeg-mcp-lite |
| mcp-ffmpeg | Bits Corp server | glama.ai/mcp/servers/@bitscorp-mcp/mcp-ffmpeg |

Example Usage:

User: "Trim video.mp4 from 1:30 to 2:45"
→ MCP translates to FFmpeg command
→ Executes locally
→ Returns result with progress tracking
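The translation step above amounts to parsing timestamps and emitting a command. A stdlib sketch of what such a server might do internally (function names are illustrative, not the actual MCP tool API of any listed server):

```python
def to_seconds(ts: str) -> float:
    """Parse 'ss', 'mm:ss', or 'hh:mm:ss' into seconds."""
    seconds = 0.0
    for part in ts.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

def trim_args(src: str, dst: str, start: str, end: str) -> list[str]:
    """Build an FFmpeg argument list trimming src to [start, end]."""
    s, e = to_seconds(start), to_seconds(end)
    return [
        "ffmpeg", "-y",
        "-ss", str(s),        # input seek to the start point
        "-i", src,
        "-t", str(e - s),     # keep this many seconds
        "-c", "copy",         # stream copy; cuts snap to keyframes
        dst,
    ]

print(trim_args("video.mp4", "out.mp4", "1:30", "2:45"))
```

Note the stream-copy trade-off: `-c copy` is fast but cuts on keyframes; frame-accurate trims require re-encoding.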

5.2 Remotion MCP

| Resource | Description | Link |
|---|---|---|
| Official MCP | Remotion docs integration | remotion.dev/docs/ai/mcp |
| @remotion/mcp | npm package | npmjs.com/package/@remotion/mcp |
| Rodumani | Full editor via MCP | github.com/smilish67/rodumani |

Rodumani Features:

  • Media file management (upload, metadata, thumbnails)
  • Timeline editing (multi-track, precise timing)
  • Editing operations (trim, split, move, undo/redo)
  • 2D transformations, keyframe animation
  • Transition effects

Part 6: Full Pipeline References

Production-Ready Projects

| Project | Stars | What It Does | Link |
|---|---|---|---|
| ShortGPT | 5k+ | Full YouTube Shorts automation | github.com/RayVentura/ShortGPT |
| MoneyPrinterTurbo | 18k+ | One-click video generation | github.com/harry0703/MoneyPrinterTurbo |
| MoneyPrinter | 10k+ | YouTube Shorts from topic | github.com/FujiwaraChoki/MoneyPrinter |
| Mosaico | 1k+ | Python video composition with AI | github.com/FolhaSP/mosaico |
| Frame | New | AI-powered "vibe" video editor | github.com/aregrid/frame |

ShortGPT Breakdown

  • Uses OpenAI for script generation
  • ElevenLabs + EdgeTTS for voice
  • Pexels for background footage
  • Auto-generates captions
  • Supports 30+ languages

MoneyPrinterTurbo Breakdown

  • One-click video from topic/keyword
  • Auto-generates copy, materials, subtitles, music
  • Supports multiple LLMs (OpenAI, Gemini, Ollama, DeepSeek)
  • High-definition, royalty-free
  • Web UI at http://127.0.0.1:8080

Part 7: Voice & Audio

7.1 Text-to-Speech

| Provider | Type | Best For | Link |
|---|---|---|---|
| ElevenLabs | API | Most realistic | github.com/elevenlabs/elevenlabs-python |
| OpenAI TTS | API | Fast, integrated | OpenAI API |
| Coqui TTS | Local | Free, open source | github.com/coqui-ai/TTS |
| Bark | Local | Emotions, sound effects | github.com/suno-ai/bark |

7.2 Transcription & Captions

| Tool | Type | Best For | Link |
|---|---|---|---|
| Whisper | Local | Accuracy | github.com/openai/whisper |
| faster-whisper | Local | Speed (4x faster) | github.com/guillaumekln/faster-whisper |
| auto-subtitle | Local | Whisper + FFmpeg overlay | github.com/m1guelpf/auto-subtitle |
| AssemblyAI | API | Real-time | assemblyai.com |
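Whichever transcriber is used, its segment timestamps reduce to the same SRT serialization, which is all the caption pipeline needs to emit. A stdlib sketch (the segment tuples are illustrative):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 75.5 -> '00:01:15,500'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Serialize (start_sec, end_sec, text) tuples as an SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello world"), (2.5, 5.0, "Welcome to VOIDSTUDIO")]))
```

The resulting `.srt` file can then be burned in via the FFmpeg caption step listed in the composition engine.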

7.3 AI Music

| Provider | Type | Link |
|---|---|---|
| Suno | Full songs | github.com/gcui-art/suno-api (unofficial) |
| Udio | DAW integration | udio.com |
| Stable Audio | Background music | stability.ai |

Part 8: Image Generation

| Provider | Model | Type | Link |
|---|---|---|---|
| OpenAI | DALL-E 3 | API | OpenAI API |
| Stability AI | SDXL | Local/API | github.com/Stability-AI/generative-models |
| Black Forest Labs | Flux | Local/API | github.com/black-forest-labs/flux |
| HuggingFace | Diffusers | Multi-model | github.com/huggingface/diffusers |
| ComfyUI | Node workflows | Local | github.com/comfyanonymous/ComfyUI |

Part 9: Avatars

| Provider | Type | Link |
|---|---|---|
| HeyGen | Professional avatars | github.com/HeyGen-Official/StreamingAvatarSDK |
| D-ID | Realistic motion | d-id.com |
| SadTalker | Open source face animation | github.com/OpenTalker/SadTalker |

Project Structure

voidstudio/
├── apps/
│   ├── cli/                      # Command-line interface
│   │   ├── src/
│   │   │   ├── commands/         # CLI commands
│   │   │   ├── lib/              # Shared utilities
│   │   │   └── index.ts
│   │   └── package.json
│   │
│   ├── web/                      # Web UI (Next.js + Remotion)
│   │   ├── src/
│   │   │   ├── app/              # Next.js app router
│   │   │   ├── components/       # React components
│   │   │   └── remotion/         # Remotion compositions
│   │   └── package.json
│   │
│   └── mcp-server/               # MCP server for Claude
│       ├── src/
│       │   ├── tools/            # MCP tool definitions
│       │   └── index.ts
│       └── package.json
│
├── packages/
│   ├── core/                     # Core types and utilities
│   ├── director/                 # Claude-powered orchestration
│   ├── composer/                 # Scene assembly logic
│   ├── timeline/                 # OTIO integration
│   └── exporter/                 # FFmpeg/Editly export
│
├── pipelines/
│   ├── video/                    # Veo, Runway, VACE
│   ├── voice/                    # ElevenLabs, OpenAI TTS
│   ├── image/                    # DALL-E, Flux, SD
│   ├── avatar/                   # HeyGen, SadTalker
│   ├── music/                    # Suno, Stable Audio
│   ├── caption/                  # Whisper, AssemblyAI
│   └── screen/                   # Playwright recording
│
├── remotion/                     # Remotion video project
│   ├── src/
│   │   ├── compositions/         # Video templates
│   │   ├── components/           # Reusable components
│   │   └── Root.tsx
│   └── remotion.config.ts
│
├── templates/                    # Video template library
│   ├── explainer/
│   ├── tutorial/
│   ├── storytelling/
│   ├── short-form/
│   └── faceless/
│
├── python/                       # Python utilities
│   ├── ffmperative/              # Natural language FFmpeg
│   ├── whisper_pipeline/         # Caption generation
│   ├── vace/                     # VACE model integration
│   └── requirements.txt
│
├── config/
│   ├── providers.yaml
│   └── templates.yaml
│
└── package.json                  # Root monorepo config

Implementation Phases

Phase 1: Foundation (Week 1-2)

  • Architecture plan
  • Monorepo setup (pnpm workspaces)
  • Core TypeScript packages
  • Basic CLI structure
  • Remotion project initialization
  • FFmpeg utility layer (FFMPerative)

Phase 2: Pipelines (Week 3-4)

  • Voice pipeline (ElevenLabs + OpenAI TTS)
  • Image pipeline (DALL-E + Replicate)
  • Caption pipeline (faster-whisper)
  • Basic Editly integration

Phase 3: Composition (Week 5-6)

  • Remotion components library
  • Template system (explainer, tutorial)
  • OpenTimelineIO integration
  • Scene assembly logic

Phase 4: Orchestration (Week 7-8)

  • Director agent (Claude)
  • Workflow graph execution
  • Full pipeline integration
  • CLI commands complete

Phase 5: Advanced (Week 9+)

  • Video generation (Runway, VACE)
  • Avatar pipeline (HeyGen)
  • Music pipeline (Suno)
  • Web UI
  • MCP server

CLI Design

# Full video creation from topic
void create "How quantum computers work" \
  --style explainer \
  --duration 60 \
  --voice elevenlabs:adam

# Script generation only
void script "The history of AI" --format youtube --output script.json

# Voice generation
void voice "Hello world" --provider elevenlabs --voice adam --output hello.mp3

# Image generation
void image "A futuristic city at sunset" --provider dalle --count 4

# Video clip generation
void clip "A robot walking through a forest" --provider runway --duration 4

# Caption generation
void caption ./video.mp4 --output ./video.srt --language en

# Render Remotion composition
void render ./projects/my-video --output ./final.mp4

# Avatar video
void avatar "Welcome to my channel" --provider heygen --avatar professional

# Natural language editing (FFMPerative style)
void edit "trim video.mp4 from 0:30 to 1:45 and add fade transitions"

# Compose from Editly spec
void compose ./spec.json5 --output ./final.mp4

# Export to OTIO
void export ./project.json --format otio --output ./project.otio

Environment Variables

# .env.example

# LLM
ANTHROPIC_API_KEY=

# Voice
ELEVENLABS_API_KEY=
OPENAI_API_KEY=

# Video Generation
RUNWAY_API_KEY=
REPLICATE_API_KEY=
GCP_PROJECT_ID=
GOOGLE_APPLICATION_CREDENTIALS=

# Avatars
HEYGEN_API_KEY=
DID_API_KEY=

# Music
SUNO_COOKIE=  # Unofficial

# Storage
R2_ACCOUNT_ID=
R2_ACCESS_KEY_ID=
R2_SECRET_ACCESS_KEY=
R2_BUCKET_NAME=

# Database
DATABASE_URL=file:./data/voidstudio.db
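A minimal stdlib loader for a file in the format above; a real project would likely use a library such as python-dotenv, and this sketch deliberately skips quoting and export rules:

```python
import os
import tempfile

def load_env(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines into os.environ, skipping blanks and '#' comments."""
    loaded = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip()
    os.environ.update(loaded)
    return loaded

# Demo against a throwaway file (values are placeholders, not real keys):
cfg = "# LLM\nANTHROPIC_API_KEY=sk-test\nDATABASE_URL=file:./data/voidstudio.db\n"
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write(cfg)
    path = f.name
loaded = load_env(path)
print(loaded)
```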

Key Architecture Decisions

Why Multiple Composition Engines?

| Engine | Use Case |
|---|---|
| Remotion | Complex animations, React components, interactive preview |
| Editly | Fast declarative edits, JSON specs, simple compositions |
| FFmpeg direct | Raw processing, format conversion, batch operations |
| OTIO | Industry interchange, export to NLEs |

Why Agentic Architecture?

Following LAVE and VideoAgent research:

  • LLM as central planner ("Director")
  • Specialized agents for different tasks
  • Graph-based workflow execution
  • Self-correction and feedback loops
  • Mixed-initiative (AI + human refinement)
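The planner/agents split above can be sketched as a small dependency-graph executor using the stdlib's graphlib; the agent bodies here are stubs standing in for real pipeline calls, and the graph is one the Director would emit:

```python
from graphlib import TopologicalSorter

# Stub agents: each reads/annotates a shared context dict.
def storyboard(ctx): ctx["scenes"] = ["hook", "body", "outro"]
def composer(ctx):   ctx["timeline"] = [f"clip:{s}" for s in ctx["scenes"]]
def renderer(ctx):   ctx["render"] = f"rendered {len(ctx['timeline'])} clips"
def exporter(ctx):   ctx["output"] = "final.mp4"

AGENTS = {"storyboard": storyboard, "composer": composer,
          "renderer": renderer, "exporter": exporter}

# Edges map each task to the tasks it depends on.
GRAPH = {"composer": {"storyboard"}, "renderer": {"composer"},
         "exporter": {"renderer"}}

def run(graph: dict, ctx: dict) -> dict:
    """Execute agents in dependency order over a shared context."""
    for task in TopologicalSorter(graph).static_order():
        AGENTS[task](ctx)  # a real system adds feedback loops and retries here
    return ctx

ctx = run(GRAPH, {})
print(ctx["output"])
```

The real Director would generate the graph from intent analysis and insert self-correction edges, but the execution skeleton is this simple.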

Why MCP?

  • Standardized tool interface for Claude
  • Secure local execution
  • Progress tracking and error handling
  • Future-proof (Anthropic-backed standard)

Academic References

  1. LAVE: Wang et al. "LLM-Powered Agent Assistance and Language Augmentation for Video Editing" (IUI 2024) - arXiv:2402.10294

  2. VideoAgent: HKUDS "All-in-One Agentic Framework for Video Understanding, Editing, and Remaking" - GitHub

  3. VACE: Alibaba "Video All-in-one Creation and Editing" (ICCV 2025) - GitHub

  4. OpenTimelineIO: Pixar/ASWF - GitHub


Next Steps

  1. Approve this plan - Any changes?
  2. Initialize monorepo - pnpm + TypeScript
  3. Set up Remotion - Core composition engine
  4. Build FFMPerative integration - Natural language editing
  5. Create first template - Simple explainer
  6. Build Director agent - Claude orchestration

Ready to build?