This is the code repository for the paper:
PresentAgent: Multimodal Agent for Presentation Video Generation
Jingwei Shi*, Zeyu Zhang*†, Biao Wu*, Yanjie Liang*, Meng Fang, Ling Chen, and Yang Zhao#
*Equal contribution. †Project lead. #Corresponding author.
Paper | Colab Demo | Data | HF Paper
Note
🙋🏻♀️ To learn more about PresentAgent, please see the following presentation video, which was generated entirely by PresentAgent without any manual curation.
presentagent_presentation.mp4
If you use any content of this repo in your work, please cite our paper:
@article{shi2025presentagent,
title={PresentAgent: Multimodal Agent for Presentation Video Generation},
author={Shi, Jingwei and Zhang, Zeyu and Wu, Biao and Liang, Yanjie and Fang, Meng and Chen, Ling and Zhao, Yang},
journal={arXiv preprint arXiv:2507.04036},
year={2025}
}
2025/07/18: 👉 Our paper has been featured by Synced.
2025/07/16: 🔔 Our paper has been featured by AI-Era.
2025/07/11: 📌 Our paper has been featured by ZHIDING.
2025/07/11: 👋 Our paper has been featured by QbitAI.
2025/07/08: 📣 Our paper has been featured by CVer.
- ✅ code release
- ✅ api version
- ✅ colab demo
- ✅ paper release
- ✅ data release
- ⬜️ local version
- ⬜️ HF demo
We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document–presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats.
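To make the composition step concrete, here is a minimal sketch of how slide frames and narration clips can be stitched into a synchronized video with ffmpeg. This is an illustration of the idea, not the repository's actual implementation (which builds on PPT Presenter and MegaTTS3); it assumes same-resolution PNG slides and WAV narration files, one pair per segment.

import subprocess
from pathlib import Path

def compose_video(slide_pngs, narration_wavs, out_path="presentation.mp4"):
    """Pair each slide image with its narration clip, then concatenate the segments."""
    segments = []
    for i, (png, wav) in enumerate(zip(slide_pngs, narration_wavs)):
        seg = f"segment_{i:02d}.mp4"
        # Hold the slide on screen for exactly as long as its narration lasts (-shortest).
        subprocess.run([
            "ffmpeg", "-y", "-loop", "1", "-i", png, "-i", wav,
            "-c:v", "libx264", "-tune", "stillimage", "-c:a", "aac",
            "-pix_fmt", "yuv420p", "-shortest", seg,
        ], check=True)
        segments.append(seg)
    # Concatenate the per-slide segments without re-encoding.
    Path("segments.txt").write_text("".join(f"file '{s}'\n" for s in segments))
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "segments.txt", "-c", "copy", out_path], check=True)

The -shortest flag is what keeps each slide on screen exactly as long as its narration, which is the audio-visual alignment described above.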
Note
If you’d like to learn more about our paper, be sure to check out this YouTube video from the AI Papers Podcast Daily.
e1979ff2f5d94005746a80a9a767bfd8.mp4
Tip
🎮 Before deploying PresentAgent on your local machine, please check out our Colab demo, which is available online and ready to use.
conda create -n presentagent python=3.11
conda activate presentagent
pip install -r requirements.txt
cd presentagent/MegaTTS3
Model Download
The pretrained checkpoints are available on Google Drive or Hugging Face. Please download them and place them in presentagent/MegaTTS3/checkpoints/xxx.
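If you prefer to fetch the checkpoints from a script instead of the browser, something along these lines should work with huggingface_hub; the repo id below is an assumption about where the upstream MegaTTS3 weights live, so replace it with the repository you actually download from.

from huggingface_hub import snapshot_download

# Assumed checkpoint repository; substitute the repo id you actually download from.
snapshot_download(
    repo_id="ByteDance/MegaTTS3",
    local_dir="presentagent/MegaTTS3/checkpoints",
)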
Requirements (for Linux)
# Set the root directory
export PYTHONPATH="/path/to/MegaTTS3:$PYTHONPATH"
# [Optional] Set GPU
export CUDA_VISIBLE_DEVICES=0
# If you encounter pydantic errors during inference, check that your installed versions of pydantic and gradio are compatible.
# [Note] If you encounter bugs related to httpx, check whether your environment variable "no_proxy" contains patterns like "::"
Requirements (for Windows)
conda install -y -c conda-forge pynini==2.1.5
pip install WeTextProcessing==1.0.3
# [Optional] For GPU inference, you may need to install a specific version of PyTorch for your GPU from https://pytorch.org/.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# [Note] If you encounter bugs related to `ffprobe` or `ffmpeg`, you can install them via `conda install -c conda-forge ffmpeg`
# Set environment variable for root directory
set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # Windows
$env:PYTHONPATH="C:\path\to\MegaTTS3;$env:PYTHONPATH" # PowerShell on Windows
conda env config vars set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # For conda users
# [Optional] Set GPU
set CUDA_VISIBLE_DEVICES=0 # Windows
$env:CUDA_VISIBLE_DEVICES=0 # Powershell on Windows
- Serve Backend
Initialize your models in presentagent/backend.py:
language_model = AsyncLLM(model="Qwen2.5-72B-Instruct", api_base="http://localhost:7812/v1")
vision_model = AsyncLLM(model="gpt-4o-2024-08-06")
text_embedder = AsyncLLM(model="text-embedding-3-small")
Or use the environment variables:
export OPENAI_API_KEY="your_key"
export API_BASE="http://your_service_provider/v1"
export LANGUAGE_MODEL="Qwen2.5-72B-Instruct-GPTQ-Int4"
export VISION_MODEL="gpt-4o-2024-08-06"
export TEXT_MODEL="text-embedding-3-small"
python backend.py
- Launch Frontend
Note: the backend API endpoint is configured at presentagent/vue.config.js.
cd presentagent
npm install
npm run serve
First, upload a PPT template and the source document, then click Generate Slides to generate and download the PPT. After downloading the PPT, you can edit it as you like and then click PPT2Presentation.
After uploading the PPT, click Start Conversion to create the presentation video.
Finally, you will get the presentation video, which you can watch on the page or download.
To support the evaluation of document to presentation video generation, we curate the Doc2Present Benchmark, a diverse dataset of document–presentation video pairs spanning multiple domains. As shown in the following figure, our benchmark encompasses four representative document types (academic papers, web pages, technical blogs, and slides) paired with human-authored videos, covering diverse real-world domains like education, research, and business reports.
We collect 30 high-quality video samples from public platforms, educational repositories, and professional presentation archives. Each video follows a structured narration format, combining slide-based visuals with synchronized voiceover. We manually align each video with its source document and ensure the following conditions are met:
- The content structure of the video follows that of the document.
- The visuals convey document information in a compact, structured form.
- The narration and slides are well-aligned temporally.
The average document length is 3,000–8,000 words, while the corresponding videos range from 1 to 2 minutes and contain 5-10 slides. This setting highlights the core challenge of the task: transforming dense, domain-specific documents into effective and digestible multimodal presentations.
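For reference, each benchmark entry can be viewed as a document paired with its human-authored video and quiz; the field names in this sketch are illustrative, not the exact schema shipped with the dataset.

from dataclasses import dataclass

@dataclass
class Doc2PresentPair:
    """Hypothetical schema for one document–presentation pair in the benchmark."""
    doc_id: str
    doc_type: str       # "academic paper", "web page", "technical blog", or "slides"
    document_path: str  # source document, roughly 3,000–8,000 words
    video_path: str     # human-authored presentation video, 1–2 minutes, 5–10 slides
    quiz_path: str      # the five multiple-choice questions used for Objective Quiz Evaluation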
To assess the quality of generated presentation videos, we adopt two complementary evaluation strategies: Objective Quiz Evaluation and Subjective Scoring.
For each video, we provide the vision-language model with the complete set of slide images and the full narration transcript as a unified input—simulating how a real viewer would experience the presentation.
- In Objective Quiz Evaluation, the model answers a fixed set of factual questions to determine whether the video accurately conveys the key information from the source content.
- In Subjective Scoring, the model evaluates the video along three dimensions: the coherence of the narration, the clarity and design of the visuals, and the overall ease of understanding.
- All evaluations are conducted without ground-truth references and rely entirely on the model’s interpretation of the presented content.
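In practice, this protocol amounts to one multimodal request per quiz question or scoring prompt: the slide frames are passed as images, the transcript as text, and the judge model replies with an answer or rating. The sketch below shows one way to issue such a request through an OpenAI-compatible client; the judge model name and prompt framing are placeholders rather than the exact configuration used in the paper.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_judge(slide_paths, transcript, prompt, model="gpt-4o-2024-08-06"):
    """Send every slide frame plus the narration transcript to a VLM judge and return its reply."""
    content = [{"type": "text", "text": f"Narration transcript:\n{transcript}\n\n{prompt}"}]
    for path in slide_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}})
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content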
For Objective Quiz Evaluation, to evaluate whether a generated presentation video effectively conveys the core content of its source document, we use a fixed-question comprehension evaluation protocol. Specifically, we manually design five multiple-choice questions for each document, tailored to its content as follows:
Presentation of Web Pages | What is the main feature highlighted in the iPhone’s promotional webpage? |
---|---|
A. | A more powerful chip for faster performance |
B. | A brighter and more vibrant display |
C. | An upgraded camera system with better lenses |
D. | A longer-lasting and more efficient battery |

Presentation of Academic Paper | What primary research gap did the authors aim to address by introducing the FineGym dataset? |
---|---|
A. | Lack of low-resolution sports footage for compression studies |
B. | Need for fine-grained action understanding that goes beyond coarse categories |
C. | Absence of synthetic data to replace human annotations |
D. | Shortage of benchmarks for background context recognition |
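Quiz accuracy is then the fraction of these questions the judge answers correctly, averaged over the benchmark. A minimal way to score one video (assuming the judge is asked to reply with an option letter):

def quiz_accuracy(predicted, answer_key):
    """predicted and answer_key are lists of option letters such as ["A", "C", ...]."""
    correct = sum(p.strip().upper().startswith(a) for p, a in zip(predicted, answer_key))
    return correct / len(answer_key)

# Example: three of five questions answered correctly -> 0.6
print(quiz_accuracy(["B", "B", "A", "D", "C"], ["B", "B", "C", "D", "A"]))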
For Subjective Scoring, to evaluate the quality of generated presentation videos, we adopt a prompt-based assessment using vision-language models. The prompts are as follows:
Video | Scoring Prompt |
---|---|
Narr. Coh. | “How coherent is the narration across the video? Are the ideas logically connected and easy to follow?” |
Visual Appeal | “How would you rate the visual design of the slides in terms of layout, aesthetics, and overall quality?” |
Comp. Diff. | “How easy is it to understand the presentation as a viewer? Were there any confusing or contradictory parts?” |
Audio | Scoring Prompt |
---|---|
Narr. Coh. | “How coherent is the narration throughout the audio? Are the ideas logically structured and easy to follow?” |
Audio Appeal | “How pleasant and engaging is the narrator’s voice in terms of tone, pacing, and delivery?” |
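Each prompt is expected to elicit a 1–5 rating (the scale is implied by the score ranges reported below), and the Video and Audio columns in the results table are means over their respective dimensions. A small helper like this could extract the rating from the judge’s free-form reply and aggregate it; the parsing rule is an assumption, not the paper’s exact post-processing.

import re
from statistics import mean

def parse_rating(reply: str) -> int:
    """Pull the first digit in the 1-5 range out of the judge model's reply."""
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"No 1-5 rating found in: {reply!r}")
    return int(match.group())

# Three video dimensions and two audio dimensions, averaged separately.
video_score = mean(parse_rating(r) for r in ["5", "Score: 4", "4 - mostly clear"])
audio_score = mean(parse_rating(r) for r in ["5", "4"])
print(round(video_score, 2), round(audio_score, 2))  # 4.33 4.5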
Method | Model | Quiz Accuracy | Video Score (mean) | Audio Score (mean) |
---|---|---|---|---|
Human | Human | 0.56 | 4.47 | 4.80 |
PresentAgent | Claude-3.7-sonnet | 0.64 | 4.00 | 4.53 |
PresentAgent | Qwen-VL-Max | 0.52 | 4.47 | 4.60 |
PresentAgent | Gemini-2.5-pro | 0.52 | 4.33 | 4.33 |
PresentAgent | Gemini-2.5-flash | 0.52 | 4.33 | 4.40 |
PresentAgent | GPT-4o-Mini | 0.64 | 4.67 | 4.40 |
PresentAgent | GPT-4o | 0.56 | 3.93 | 4.47 |
3D.CoCa.Contrastive.Learners.are.3D.Captioners.mp4
DOEI.Dual.Optimization.of.Embedding.Information.for.Attention-Enhanced.Class.Activation.Maps.mp4
MMCLIP.Cross-modal.Attention.Masked.Modelling.for.Medical.Language-Image.Pre-Training.mp4
We warmly welcome you to contribute to our project by submitting pull requests; your involvement is key to keeping our work at the cutting edge. In particular, we encourage efforts to expand compatibility with the latest vision-language (VL) models and text-to-speech (TTS) models, keeping the project aligned with the most recent advances in these rapidly evolving fields.
Beyond model updates, we also invite you to explore adding new features that could enhance the project’s functionality, usability, or versatility. Whether it’s optimizing existing workflows, introducing novel tools, or addressing unmet needs in the community, your creative contributions will help make this project more robust and valuable for everyone.
We thank the authors of PPTAgent, PPT Presenter, and MegaTTS3 for their open-source code.