VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification

Official codebase for the CVPR 2026 Findings paper.

Authors
Wanyue Zhang¹, Lin Geng Foo¹, Thabo Beeler³, Rishabh Dabral¹,², Christian Theobalt¹,²

¹ Max Planck Institute for Informatics, Saarland Informatics Campus
² Saarbrücken Research Center for Visual Computing, Interaction and AI
³ Google

Links
Project Page | Paper

Teaser


[Teaser figure, columns left to right: Input Image with Trajectories | Augmentor Result | Final Result]

Overview

VHOI is organized as a two-stage pipeline:

A trajectory augmentor densifies sparse human-object trajectories into HOI-aware mask sequences.
A dense control model generates videos conditioned on the densified motion representation.

This repository is centered around these main entry points:

sat/train_video.py: training
sat/sample_video.py: batch inference / evaluation
sat/interactive_traj.py: trajectory drawing UI
sat/inference.py: fixed two-stage launcher for user-defined trajectories

Setup

The repository was tested with the following versions:

Python 3.10.0
PyTorch 2.4.0
torchvision 0.19.0
CUDA 12.1
DeepSpeed 0.15.1

Important: this repository contains a modified copy of SwissArmyTransformer under modules/SwissArmyTransformer/. For a fresh environment, you should install that local copy in editable mode. The root requirements.txt intentionally does not install SwissArmyTransformer from PyPI, because the active sat package must come from the bundled source tree.

From scratch, the recommended setup is:

conda create -n vhoi python=3.10 -y
conda activate vhoi

# Install PyTorch first with a CUDA build that matches your machine.
# Example: the setup used for our current environment.
conda install pytorch==2.4.0 torchvision==0.19.0 pytorch-cuda=12.1 -c pytorch -c nvidia

# Install the bundled SwissArmyTransformer so that `import sat` resolves to
# modules/SwissArmyTransformer/sat.
pip install -e modules/SwissArmyTransformer

# Install repository dependencies.
pip install -r requirements.txt

In a fresh environment, no manual CUDA-extension build step is needed. SAT's fused optimizer extension will be compiled automatically on the first training run.
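To confirm that the editable install took effect, a check along the following lines may help. This is a minimal sketch and not part of the repository: the helper name `is_bundled` is hypothetical, and the script assumes it is run from the repository root.

```python
import importlib.util
from pathlib import Path


def is_bundled(module_origin, repo_root):
    """Return True if module_origin lies inside repo_root/modules/SwissArmyTransformer."""
    bundled = (Path(repo_root) / "modules" / "SwissArmyTransformer").resolve()
    origin = Path(module_origin).resolve()
    return bundled in origin.parents


if __name__ == "__main__":
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("sat")
    if spec is None or spec.origin is None:
        print("sat is not importable in this environment")
    else:
        where = "bundled copy" if is_bundled(spec.origin, ".") else "some other install"
        print(f"sat resolves to {spec.origin} ({where})")
```

If the script reports some other install, uninstalling the PyPI package and rerunning the editable install is a reasonable first thing to try.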

Troubleshooting: fused_ema_adam build issues

Training requires a CUDA-12.1-compatible host compiler. If your cluster's default gcc is newer than 12 and the fused optimizer build fails, either load a compatible compiler module or install one into the conda environment, for example:

conda install -c conda-forge gcc_linux-64=12 gxx_linux-64=12

If you reuse a machine that already has cached SAT/DeepSpeed CUDA extensions from another CUDA/PyTorch environment and you hit errors around fused_ema_adam or libcudart, clear the stale cache and rerun:

rm -rf ~/.cache/torch_extensions/fused_ema_adam

This is only a troubleshooting step for mixed or previously used environments, not a normal requirement of the repository.

All training, evaluation, and inference commands below assume your current working directory is sat/:

cd sat

Model and experiment configs are stored in:

sat/configs/model/
sat/configs/ours/

Bundled example assets, prompts, masks, and trajectories are stored in:

sat/assets/

Download Links

Download model weights and put them in the sat/ckpts folder.

Download the VHOI model weights from https://huggingface.co/wanyue-zhang/vhoi/tree/main
Download the VAE and T5 models following CogVideo:
- VAE: https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
- T5: text_encoder, tokenizer
Download the Tora i2v base weights from Link. This is only necessary if you want to retrain from Tora.

The ckpts folder structure should look like:

VHOI_CODE/
└── sat/
    └── ckpts/
        ├── t5-v1_1-xxl/
        │   ├── model-00001-of-00002.safetensors
        │   └── ...
        ├── vae/
        │   └── 3d-vae.pt
        ├── tora/
        │   └── i2v/
        │       └── mp_rank_00_model_states.pt
        └── vhoi/
            ├── augmentor/
            │   └── mp_rank_00_model_states.pt
            └── dense/
                └── mp_rank_00_model_states.pt

Downloading these weights requires complying with the CogVideoX License.
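Before launching inference or training, it can be useful to verify that the checkpoint layout matches the folder tree above. The following is a minimal sketch under that assumption; the function name `missing_checkpoints` is hypothetical, and the script expects to be run from sat/.

```python
from pathlib import Path

# Relative paths taken from the folder tree above. The tora entry is optional
# because it is only needed when retraining from Tora.
REQUIRED = [
    "t5-v1_1-xxl",
    "vae/3d-vae.pt",
    "vhoi/augmentor/mp_rank_00_model_states.pt",
    "vhoi/dense/mp_rank_00_model_states.pt",
]
OPTIONAL = ["tora/i2v/mp_rank_00_model_states.pt"]


def missing_checkpoints(ckpt_root, include_optional=False):
    """Return the expected entries that do not exist under ckpt_root."""
    root = Path(ckpt_root)
    wanted = REQUIRED + (OPTIONAL if include_optional else [])
    return [rel for rel in wanted if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_checkpoints("ckpts"):
        print(f"missing: ckpts/{rel}")
```

Pass `include_optional=True` if you plan to retrain from the Tora base weights.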

Inference with user-defined trajectories

Run the integrated two-stage inference script.

cd sat
python3 inference.py

By default, inference.py uses:

trajectories from assets/user_traj_human_ref.txt and assets/user_traj_object_ref.txt
prompts from assets/augmentor_prompt.txt and assets/dense_prompt.txt
masks and sapien annotations from assets/
output folders output/stage1/ and output/stage2/

To use your own trajectories, draw them with the local drawing UI, export the files to any location, and pass them directly to inference.py:

python3 interactive_traj.py \
  --image assets/microphone.png \
  --output-dir output \
  --prefix user

python3 inference.py \
  --human-traj output/user_human.txt \
  --object-traj output/user_object.txt \
  --output-dir output \
  --seed 12345

This path still requires a high-memory GPU for generation (such as an A100 or H100).
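To compare several seeds for the same trajectories, the command above can be wrapped in a small launcher. This is a sketch only: it uses just the flags shown above, the helper names are hypothetical, and writing each seed to its own `--output-dir` is an assumption made here to avoid overwriting results. It defaults to a dry run that prints the commands.

```python
import subprocess
import sys


def build_inference_cmd(human_traj, object_traj, output_dir, seed):
    """Assemble the inference.py call using only the flags shown above."""
    return [
        sys.executable, "inference.py",
        "--human-traj", human_traj,
        "--object-traj", object_traj,
        "--output-dir", output_dir,
        "--seed", str(seed),
    ]


def run_sweep(seeds, dry_run=True):
    """Print (and optionally launch) one inference run per seed."""
    for seed in seeds:
        cmd = build_inference_cmd(
            "output/user_human.txt", "output/user_object.txt",
            f"output/seed_{seed}", seed,
        )
        print(" ".join(cmd))
        if not dry_run:
            # Runs sequentially; each call needs the checkpoints and a large GPU.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_sweep([12345, 12346, 12347])
```

Run it from sat/ and set `dry_run=False` once the printed commands look right.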

If you want to run inference on custom examples, use SAPIENS to generate the human masks and Grounded SAM 2 to generate the object masks.

Training

Data Processing

We train on processed subsets of HOIGen-1M and BEHAVE. Download the raw data from the official sources, then prepare the inputs expected by this repo.

All paths below are relative to sat/, which is also the working directory assumed by the training commands.

Downsample each source video to 49 frames.
Run SAPIENS on every frame to obtain:
- a human mask video
- per-frame segmentation .npy outputs
Run Grounded SAM 2 to obtain the object mask video for the key object in each prompt.
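The steps above do not prescribe how the 49 frames are selected. One common choice, assumed here purely for illustration, is uniform temporal spacing that keeps the first and last frame. The function name `sample_frame_indices` is hypothetical and is not part of this repository.

```python
def sample_frame_indices(num_source_frames, num_target_frames=49):
    """Pick evenly spaced frame indices, always keeping the first and last frame."""
    if num_source_frames < num_target_frames:
        raise ValueError("source clip is shorter than the target length")
    if num_target_frames == 1:
        return [0]
    # Fractional stride between consecutive samples, rounded to the nearest frame.
    step = (num_source_frames - 1) / (num_target_frames - 1)
    return [round(i * step) for i in range(num_target_frames)]


if __name__ == "__main__":
    # A 97-frame clip is reduced by taking every second frame.
    print(sample_frame_indices(97)[:3])  # [0, 2, 4]
```

The resulting indices can then be used with any video reader to extract the frames. The same indices should be applied to the source video and to the mask videos so that they stay aligned.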

Example processed assets can be found in:

assets/human_mask.mp4
assets/sapien_seg.npy
assets/object_mask.avi

Place the processed data under the following structure:

sat/
└── data/
    ├── hoigen/
    │   ├── videos/
    │   ├── person_masks/
    │   ├── object_masks/
    │   ├── sapien_data/
    │   ├── train_dense_model_video_paths.txt
    │   ├── val_dense_model_video_paths.txt
    │   ├── train_dense_model_prompts.txt
    │   ├── val_dense_model_prompts.txt
    │   ├── train_augmentor_prompts.txt
    │   ├── val_augmentor_prompts.txt
    │   ├── train_person_mask_paths.txt
    │   ├── val_person_mask_paths.txt
    │   ├── train_object_mask_paths.txt
    │   ├── val_object_mask_paths.txt
    │   ├── train_sapien_data_paths.txt
    │   └── val_sapien_data_paths.txt
    └── behave/
        └── same structure and file naming pattern as `hoigen/`

The train and validation sequence lists are:

HOIGen train: data/hoigen/train_dense_model_video_paths.txt (34,657 sequences)
HOIGen val: data/hoigen/val_dense_model_video_paths.txt (50 sequences)
BEHAVE train: data/behave/train_dense_model_video_paths.txt (4,277 sequences)
BEHAVE val: data/behave/val_dense_model_video_paths.txt (50 sequences)

The augmentor prompt files and dense-model prompt files are different for the same video sequences. Please refer to the paper appendix for the augmentor prompt construction details.
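Because each sequence is described by several parallel list files, a quick consistency check can catch misaligned lists early. The sketch below assumes that every list file holds one entry per line and that line i refers to the same sequence in all files; neither assumption is stated explicitly here, and the helper names are hypothetical.

```python
from pathlib import Path

# File-name suffixes taken from the data folder structure above.
LIST_SUFFIXES = [
    "dense_model_video_paths.txt",
    "dense_model_prompts.txt",
    "augmentor_prompts.txt",
    "person_mask_paths.txt",
    "object_mask_paths.txt",
    "sapien_data_paths.txt",
]


def count_entries(list_file):
    """Count non-empty lines in one list file."""
    with open(list_file, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def list_lengths(dataset_dir, split):
    """Map each list file of a split ('train' or 'val') to its entry count (None if absent)."""
    root = Path(dataset_dir)
    lengths = {}
    for suffix in LIST_SUFFIXES:
        path = root / f"{split}_{suffix}"
        lengths[path.name] = count_entries(path) if path.is_file() else None
    return lengths


def is_aligned(lengths):
    """True when every list file exists and all entry counts match."""
    counts = set(lengths.values())
    return None not in counts and len(counts) == 1


if __name__ == "__main__":
    for split in ("train", "val"):
        lengths = list_lengths("data/hoigen", split)
        print(split, "aligned" if is_aligned(lengths) else "MISMATCH", lengths)
```

For the bundled lists, the counts should match the sequence numbers given above, for example 50 entries per file for the HOIGen validation split.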

The .txt files under sat/data/ already contain relative paths such as data/hoigen/videos/... and data/hoigen/person_masks/.... They assume that the processed files are placed under the corresponding folders inside sat/data/. If you store the processed data somewhere else, either rewrite the paths inside those .txt files or update the YAML configs under sat/configs/ours/ to point to your own path-list files.
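If you choose to rewrite the paths, a prefix substitution over each path-list file is usually enough. The following is a minimal sketch with hypothetical helper names; it assumes one path per line and should only be applied to the `*_paths.txt` files, not to the prompt files.

```python
def rewrite_prefix(lines, old_prefix, new_prefix):
    """Replace old_prefix at the start of each path; leave other lines untouched."""
    out = []
    for line in lines:
        path = line.rstrip("\n")
        if path.startswith(old_prefix):
            path = new_prefix + path[len(old_prefix):]
        out.append(path)
    return out


def rewrite_list_file(src, dst, old_prefix, new_prefix):
    """Read a path-list file and write a copy with the prefix replaced."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    with open(dst, "w", encoding="utf-8") as f:
        f.write("\n".join(rewrite_prefix(lines, old_prefix, new_prefix)) + "\n")


if __name__ == "__main__":
    # Hypothetical example entry and target location.
    print(rewrite_prefix(["data/hoigen/videos/a.mp4"], "data/hoigen/", "/mnt/hoigen/"))
```

Writing to a new file (`dst`) rather than editing in place keeps the original lists available; the YAML configs under configs/ours/ can then point to the rewritten copies.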

Stage 1: Train the Trajectory Augmentor

For the first augmentor training run, use the embed_first model config. This starts from the 32-channel patch embed and expands it to 48 channels after loading the base checkpoint. The example below uses 4 GPUs. Set nproc_per_node to match the number of GPUs you want to use.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=4 train_video.py \
  --base configs/model/cogvideox_5b_tora_i2v_embed_first.yaml configs/ours/train_sapien_augmentor_hoigen.yaml \
  --experiment-name "augmentor"

To resume augmentor training after that, switch to the 48-channel config. Also update load in configs/ours/train_sapien_augmentor_hoigen.yaml so it points to the latest augmentor checkpoint.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=4 train_video.py \
  --base configs/model/cogvideox_5b_tora_i2v_embed.yaml configs/ours/train_sapien_augmentor_hoigen.yaml \
  --experiment-name "augmentor"

Stage 2: Train the Dense Control Model

For the dense model, use:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=4 train_video.py \
  --base configs/model/cogvideox_5b_tora_i2v.yaml configs/ours/train_dense_hoigen.yaml \
  --experiment-name "dense_model"

Metrics

We use the following external tools for evaluation:

FVD: common_metrics_on_video_quality
VBench I2V: VBench
Contact detection: ContactHands

Acknowledgements

This repository builds on the following open-source projects:

The authors would like to thank Amin Parchami for helpful advice on cluster usage.

Citation

If you find this code useful, please cite:

@article{zhang2025vhoi,
title = {VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification},
author = {Zhang, Wanyue and Foo, Lin Geng and Dabral, Rishabh and Beeler, Thabo and Theobalt, Christian},
year = {2025},
archivePrefix = {arXiv},
eprint={2512.09646},
}
