VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification
Official codebase for the CVPR 2026 Findings paper.
Authors
Wanyue Zhang1, Lin Geng Foo1, Thabo Beeler3, Rishabh Dabral1,2, Christian Theobalt1,2
1 Max Planck Institute for Informatics, Saarland Informatics Campus
2 Saarbrucken Research Center for Visual Computing, Interaction and AI
3 Google
Links
Project Page | Paper
| Input Image with Trajectories | Augmentor Result | Final Result |
VHOI is organized as a two-stage pipeline:
- A trajectory augmentor densifies sparse human-object trajectories into HOI-aware mask sequences.
- A dense control model generates videos conditioned on the densified motion representation.
This repository is centered around these main entry points:
- sat/train_video.py: training
- sat/sample_video.py: batch inference / evaluation
- sat/interactive_traj.py: trajectory drawing UI
- sat/inference.py: fixed two-stage launcher for user-defined trajectories
The current tora environment was tested with:
- Python 3.10.0
- PyTorch 2.4.0
- torchvision 0.19.0
- CUDA 12.1
- DeepSpeed 0.15.1
Important: this repository contains a modified copy of SwissArmyTransformer under modules/SwissArmyTransformer/. For a fresh environment, you should install that local copy in editable mode. The root requirements.txt intentionally does not install SwissArmyTransformer from PyPI, because the active sat package must come from the bundled source tree.
From scratch, the recommended setup is:
conda create -n vhoi python=3.10 -y
conda activate vhoi
# Install PyTorch first with a CUDA build that matches your machine.
# Example: the setup used for our current environment.
conda install pytorch==2.4.0 torchvision==0.19.0 pytorch-cuda=12.1 -c pytorch -c nvidia
# Install the bundled SwissArmyTransformer so that `import sat` resolves to
# modules/SwissArmyTransformer/sat.
pip install -e modules/SwissArmyTransformer
# Install repository dependencies.
pip install -r requirements.txt
In a fresh environment, no manual CUDA-extension build step is needed. SAT's fused optimizer extension will be compiled automatically on the first training run.
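One quick way to confirm that the editable install took effect is to ask Python where it would load `sat` from. The sketch below is not part of the repository: the helper names `is_bundled` and `check_sat` are made up for illustration, and it assumes it is run from the repository root so that `modules/SwissArmyTransformer` can be found relative to the current directory.

```python
from importlib import util
from pathlib import Path


def is_bundled(module_origin, repo_root):
    """Return True if module_origin lies inside modules/SwissArmyTransformer under repo_root."""
    bundled = (Path(repo_root) / "modules" / "SwissArmyTransformer").resolve()
    origin = Path(module_origin).resolve()
    return bundled in origin.parents


def check_sat(repo_root="."):
    """Describe where `import sat` would be loaded from, without importing it."""
    spec = util.find_spec("sat")
    if spec is None or spec.origin is None:
        # Either nothing is installed, or only a namespace directory named sat was found.
        return "sat was not found as a regular package; run: pip install -e modules/SwissArmyTransformer"
    if is_bundled(spec.origin, repo_root):
        return "OK: sat resolves to the bundled copy (" + spec.origin + ")"
    return "WARNING: sat resolves to " + spec.origin + ", not the bundled copy"


if __name__ == "__main__":
    print(check_sat("."))
```

If the warning branch is reported, a PyPI copy of SwissArmyTransformer is probably shadowing the bundled one; uninstalling it and repeating the editable install is the likely fix.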
Troubleshooting: fused_ema_adam build issues
Training requires a CUDA-12.1-compatible host compiler. If your cluster's default gcc is newer than 12 and the fused optimizer build fails, either load a compatible compiler module or install one into the conda environment, for example:
conda install -c conda-forge gcc_linux-64=12 gxx_linux-64=12
If you reuse a machine that already has cached SAT/DeepSpeed CUDA extensions from another CUDA/PyTorch environment and you hit errors around fused_ema_adam or libcudart, clear the stale cache and rerun:
rm -rf ~/.cache/torch_extensions/fused_ema_adam
This is only a troubleshooting step for mixed or previously used environments, not a normal requirement of the repository.
All training, evaluation, and inference commands below assume your current working directory is sat/:
cd sat
Model and experiment configs are stored in:
- sat/configs/model/
- sat/configs/ours/
Bundled example assets, prompts, masks, and trajectories are stored in:
- sat/assets/
Download model weights and put them in the sat/ckpts folder.
- Download vhoi model weights from https://huggingface.co/wanyue-zhang/vhoi/tree/main
- Download the VAE and T5 model following CogVideo:
  - VAE: https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
  - T5: text_encoder, tokenizer
- Download the Tora i2v base weights from Link. This is only necessary if you want to retrain from Tora.
The ckpts folder structure should look like:
VHOI_CODE/
└── sat/
    └── ckpts/
        ├── t5-v1_1-xxl/
        │   ├── model-00001-of-00002.safetensors
        │   └── ...
        ├── vae/
        │   └── 3d-vae.pt
        ├── tora/
        │   └── i2v/
        │       └── mp_rank_00_model_states.pt
        └── vhoi/
            ├── augmentor/
            │   └── mp_rank_00_model_states.pt
            └── dense/
                └── mp_rank_00_model_states.pt
Downloading this weight requires following the CogVideoX License.
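Before launching inference, it can help to check that the checkpoint folder matches the layout above. The following is a minimal sketch, not a script shipped with the repository: the function name `missing_checkpoints` and the split into required and optional entries are assumptions, with the Tora weights treated as optional because they are only needed for retraining.

```python
from pathlib import Path

# Relative paths taken from the ckpts folder layout described above.
REQUIRED = [
    "t5-v1_1-xxl",
    "vae/3d-vae.pt",
    "vhoi/augmentor/mp_rank_00_model_states.pt",
    "vhoi/dense/mp_rank_00_model_states.pt",
]
# Only needed when retraining from the Tora base weights.
OPTIONAL = ["tora/i2v/mp_rank_00_model_states.pt"]


def missing_checkpoints(ckpt_root, include_optional=False):
    """Return the expected entries that do not exist under ckpt_root."""
    root = Path(ckpt_root)
    wanted = REQUIRED + (OPTIONAL if include_optional else [])
    return [rel for rel in wanted if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumes the current working directory is sat/.
    missing = missing_checkpoints("ckpts")
    if missing:
        print("Missing checkpoint entries:")
        for rel in missing:
            print("  ckpts/" + rel)
    else:
        print("All required checkpoint entries are present.")
```

Pass `include_optional=True` to also check for the Tora i2v weights when you plan to retrain.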
- Run the integrated two-stage inference script.
cd sat
python3 inference.py
By default, inference.py uses:
- trajectories from assets/user_traj_human_ref.txt and assets/user_traj_object_ref.txt
- prompts from assets/augmentor_prompt.txt and assets/dense_prompt.txt
- masks and sapien annotations from assets/
- output folders output/stage1/ and output/stage2/
- To export new trajectories with the local drawing UI, save them anywhere you want and pass the files directly to inference.py.
python3 interactive_traj.py \
--image assets/microphone.png \
--output-dir output \
--prefix user
python3 inference.py \
--human-traj output/user_human.txt \
--object-traj output/user_object.txt \
--output-dir output \
--seed 12345
This path still requires a high-memory GPU for generation (A100 or H100).
If you want to run inference on custom examples, use SAPIENS to generate the human masks and Grounded SAM 2 to generate the object masks.
We train on processed subsets of HOIGen-1M and BEHAVE. Download the raw data from the official sources, then prepare the inputs expected by this repo.
All paths below are relative to sat/, which is also the working directory assumed by the training commands.
- Downsample each source video to 49 frames.
- Run SAPIENS on every frame to obtain:
  - a human mask video
  - per-frame segmentation .npy outputs
- Run Grounded SAM 2 to obtain the object mask video for the key object in each prompt.
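The first preprocessing step fixes the clip length at 49 frames but does not say how frames are chosen. The sketch below assumes uniform temporal sampling that always keeps the first and last frame; this is an illustrative choice, not necessarily the procedure used for the paper, and `sample_frame_indices` is a hypothetical helper name.

```python
from typing import List


def sample_frame_indices(num_source_frames: int, num_target_frames: int = 49) -> List[int]:
    """Pick evenly spaced frame indices, always including the first and last frame.

    If the source has fewer frames than requested, some indices repeat.
    """
    if num_source_frames < 1 or num_target_frames < 1:
        raise ValueError("frame counts must be positive")
    if num_target_frames == 1:
        return [0]
    last = num_source_frames - 1
    step = last / (num_target_frames - 1)
    return [round(i * step) for i in range(num_target_frames)]


if __name__ == "__main__":
    indices = sample_frame_indices(120)
    print(len(indices), indices[0], indices[-1])
```

The returned indices can then be used with whatever video reader you prefer to extract the selected frames before running SAPIENS and Grounded SAM 2.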
Example processed assets can be found in:
- assets/human_mask.mp4
- assets/sapien_seg.npy
- assets/object_mask.avi
Place the processed data under the following structure:
sat/
└── data/
    ├── hoigen/
    │   ├── videos/
    │   ├── person_masks/
    │   ├── object_masks/
    │   ├── sapien_data/
    │   ├── train_dense_model_video_paths.txt
    │   ├── val_dense_model_video_paths.txt
    │   ├── train_dense_model_prompts.txt
    │   ├── val_dense_model_prompts.txt
    │   ├── train_augmentor_prompts.txt
    │   ├── val_augmentor_prompts.txt
    │   ├── train_person_mask_paths.txt
    │   ├── val_person_mask_paths.txt
    │   ├── train_object_mask_paths.txt
    │   ├── val_object_mask_paths.txt
    │   ├── train_sapien_data_paths.txt
    │   └── val_sapien_data_paths.txt
    └── behave/
        └── same structure and file naming pattern as `hoigen/`
The train and validation sequence lists are:
- HOIGen train: data/hoigen/train_dense_model_video_paths.txt (34,657 sequences)
- HOIGen val: data/hoigen/val_dense_model_video_paths.txt (50 sequences)
- BEHAVE train: data/behave/train_dense_model_video_paths.txt (4,277 sequences)
- BEHAVE val: data/behave/val_dense_model_video_paths.txt (50 sequences)
The augmentor prompt files and dense-model prompt files are different for the same video sequences. Please refer to the paper appendix for the augmentor prompt construction details.
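Because each split is described by several parallel list files, a mismatch in their lengths is an easy mistake to make when preparing data. The sketch below counts entries per file for one split and reports whether they agree. It assumes one entry per line and that the files of a split are meant to be line-aligned, which the repository does not state explicitly; the helper names are hypothetical.

```python
from pathlib import Path

# List-file stems taken from the directory layout above.
LIST_STEMS = [
    "dense_model_video_paths",
    "dense_model_prompts",
    "augmentor_prompts",
    "person_mask_paths",
    "object_mask_paths",
    "sapien_data_paths",
]


def count_entries(path):
    """Count non-empty lines in a list file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def list_lengths(dataset_dir, split):
    """Map each list-file stem to its entry count for one split ("train" or "val")."""
    base = Path(dataset_dir)
    return {stem: count_entries(base / (split + "_" + stem + ".txt")) for stem in LIST_STEMS}


def is_aligned(lengths):
    """True when every list file reports the same number of entries."""
    return len(set(lengths.values())) == 1
```

For example, from inside sat/, `list_lengths("data/hoigen", "val")` should report the same count for every stem if the validation lists are consistent.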
The .txt files under sat/data/ already contain relative paths such as data/hoigen/videos/... and data/hoigen/person_masks/.... They assume that the processed files are placed under the corresponding folders inside sat/data/.
If you store the processed data somewhere else, either rewrite the paths inside those .txt files or update the YAML configs under sat/configs/ours/ to point to your own path-list files.
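If you choose to rewrite the path lists, the change is a prefix swap on every line. The following sketch shows one way to do it; `rebase_path` and `rebase_list_file` are hypothetical helpers, and the target prefix in the usage note is a made-up example. It is meant for the `*_paths.txt` files only, not for the prompt files.

```python
def rebase_path(line, old_prefix, new_prefix):
    """Swap old_prefix for new_prefix if the line starts with it; otherwise return it unchanged."""
    if line.startswith(old_prefix):
        return new_prefix + line[len(old_prefix):]
    return line


def rebase_list_file(src, dst, old_prefix, new_prefix):
    """Write a copy of the list file src to dst with rebased paths.

    Returns the number of lines that were changed.
    """
    with open(src, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    rebased = [rebase_path(line, old_prefix, new_prefix) for line in lines]
    with open(dst, "w", encoding="utf-8") as handle:
        if rebased:
            handle.write("\n".join(rebased) + "\n")
    return sum(1 for old, new in zip(lines, rebased) if old != new)
```

A call such as `rebase_list_file("data/hoigen/train_person_mask_paths.txt", "train_person_mask_paths_rebased.txt", "data/hoigen/", "/mnt/storage/hoigen/")` writes a new list and leaves the original untouched, so you can point the YAML config at the rebased file and keep the bundled one as a reference.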
For the first augmentor training run, use the embed_first model config. This starts from the 32-channel patch embed and expands it to 48 channels after loading the base checkpoint.
The example below uses 4 GPUs. Set nproc_per_node to match the number of GPUs you want to use.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=4 train_video.py \
--base configs/model/cogvideox_5b_tora_i2v_embed_first.yaml configs/ours/train_sapien_augmentor_hoigen.yaml \
--experiment-name "augmentor"
To resume augmentor training after that, switch to the 48-channel config.
Also update load in configs/ours/train_sapien_augmentor_hoigen.yaml so it points to the latest augmentor checkpoint.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=4 train_video.py \
--base configs/model/cogvideox_5b_tora_i2v_embed.yaml configs/ours/train_sapien_augmentor_hoigen.yaml \
--experiment-name "augmentor"
For the dense model, use:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=4 train_video.py \
--base configs/model/cogvideox_5b_tora_i2v.yaml configs/ours/train_dense_hoigen.yaml \
--experiment-name "dense_model"
We use the following external tools for evaluation:
- FVD: common_metrics_on_video_quality
- VBench I2V: VBench
- Contact detection: ContactHands
This repository builds on the following open-source projects:
The authors would like to thank Amin Parchami for helpful advice on cluster usage.
If you find this code useful, please cite:
@article{zhang2025vhoi,
  title = {VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification},
  author = {Zhang, Wanyue and Foo, Lin Geng and Dabral, Rishabh and Beeler, Thabo and Theobalt, Christian},
  year = {2025},
  archivePrefix = {arXiv},
  eprint = {2512.09646},
}


