
RosettaWYzhang/VHOI


VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification

Official codebase for the CVPR 2026 Findings paper.

Authors
Wanyue Zhang¹, Lin Geng Foo¹, Thabo Beeler³, Rishabh Dabral¹,², Christian Theobalt¹,²

¹ Max Planck Institute for Informatics, Saarland Informatics Campus
² Saarbrücken Research Center for Visual Computing, Interaction and AI
³ Google

Links
Project Page | Paper

Teaser

[Figure: three panels showing the input image with trajectories, the augmentor result, and the final result.]

Overview

VHOI is organized as a two-stage pipeline:

  1. A trajectory augmentor densifies sparse human-object trajectories into HOI-aware mask sequences.
  2. A dense control model generates videos conditioned on the densified motion representation.

[Figure: VHOI teaser]

This repository is centered around these main entry points:

  • sat/train_video.py: training
  • sat/sample_video.py: batch inference / evaluation
  • sat/interactive_traj.py: trajectory drawing UI
  • sat/inference.py: fixed two-stage launcher for user-defined trajectories

Setup

The current tora environment was tested with:

  • Python 3.10.0
  • PyTorch 2.4.0
  • torchvision 0.19.0
  • CUDA 12.1
  • DeepSpeed 0.15.1

Important: this repository contains a modified copy of SwissArmyTransformer under modules/SwissArmyTransformer/. For a fresh environment, you should install that local copy in editable mode. The root requirements.txt intentionally does not install SwissArmyTransformer from PyPI, because the active sat package must come from the bundled source tree.

From scratch, the recommended setup is:

conda create -n vhoi python=3.10 -y
conda activate vhoi

# Install PyTorch first with a CUDA build that matches your machine.
# Example: the setup used for our current environment.
conda install pytorch==2.4.0 torchvision==0.19.0 pytorch-cuda=12.1 -c pytorch -c nvidia

# Install the bundled SwissArmyTransformer so that `import sat` resolves to
# modules/SwissArmyTransformer/sat.
pip install -e modules/SwissArmyTransformer

# Install repository dependencies.
pip install -r requirements.txt

In a fresh environment, no manual CUDA-extension build step is needed. SAT's fused optimizer extension will be compiled automatically on the first training run.
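To confirm that `import sat` picks up the bundled copy rather than a PyPI install, a small check like the one below can help. It is a minimal sketch: the helper names `find_sat_origin` and `is_bundled_sat` are hypothetical and not part of this repository, and it assumes you run it from the repository root.

```python
import importlib.util
from pathlib import Path


def find_sat_origin():
    """Return the file that `import sat` would load, or None if it cannot be found."""
    spec = importlib.util.find_spec("sat")
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def is_bundled_sat(origin, repo_root):
    """True if `origin` lies inside <repo_root>/modules/SwissArmyTransformer."""
    bundled = (Path(repo_root) / "modules" / "SwissArmyTransformer").resolve()
    return bundled in Path(origin).resolve().parents


if __name__ == "__main__":
    origin = find_sat_origin()
    if origin is None:
        print("sat is not importable; run: pip install -e modules/SwissArmyTransformer")
    elif is_bundled_sat(origin, Path.cwd()):
        print(f"OK: sat resolves to the bundled copy at {origin}")
    else:
        print(f"WARNING: sat resolves to {origin}, not the bundled copy")
```

If the warning appears, uninstalling the other `sat` package and repeating the editable install is the likely fix.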

Troubleshooting: fused_ema_adam build issues

Training requires a CUDA-12.1-compatible host compiler. If your cluster's default gcc is newer than 12 and the fused optimizer build fails, either load a compatible compiler module or install one into the conda environment, for example:

conda install -c conda-forge gcc_linux-64=12 gxx_linux-64=12

If you reuse a machine that already has cached SAT/DeepSpeed CUDA extensions from another CUDA/PyTorch environment and you hit errors around fused_ema_adam or libcudart, clear the stale cache and rerun:

rm -rf ~/.cache/torch_extensions/fused_ema_adam

This is only a troubleshooting step for mixed or previously used environments, not a normal requirement of the repository.

All training, evaluation, and inference commands below assume your current working directory is sat/:

cd sat

Model and experiment configs are stored in:

  • sat/configs/model/
  • sat/configs/ours/

Bundled example assets, prompts, masks, and trajectories are stored in:

  • sat/assets/

Download Links

Download model weights and put them in the sat/ckpts folder.

The ckpts folder structure should look like:

VHOI_CODE/
└── sat/
    └── ckpts/
        ├── t5-v1_1-xxl/
        │   ├── model-00001-of-00002.safetensors
        │   └── ...
        ├── vae/
        │   └── 3d-vae.pt
        ├── tora/
        │   └── i2v/
        │       └── mp_rank_00_model_states.pt
        └── vhoi/
            ├── augmentor/
            │   └── mp_rank_00_model_states.pt
            └── dense/
                └── mp_rank_00_model_states.pt

Downloading these weights requires complying with the CogVideoX License.
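Before launching inference or training, it can be useful to verify that the checkpoint folder matches the tree above. The following is a minimal sketch; `missing_checkpoints` is a hypothetical helper, and the T5 folder is only checked as a directory because its individual shard names are not fully listed here.

```python
from pathlib import Path

# Relative paths copied from the folder tree above.
EXPECTED_FILES = [
    "vae/3d-vae.pt",
    "tora/i2v/mp_rank_00_model_states.pt",
    "vhoi/augmentor/mp_rank_00_model_states.pt",
    "vhoi/dense/mp_rank_00_model_states.pt",
]
# Checked only for existence as a directory (shard names are elided above).
EXPECTED_DIRS = ["t5-v1_1-xxl"]


def missing_checkpoints(ckpt_root):
    """Return the expected entries that are absent under `ckpt_root`."""
    root = Path(ckpt_root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    return missing


if __name__ == "__main__":
    # Assumes the current working directory is sat/.
    missing = missing_checkpoints("ckpts")
    if missing:
        print("Missing under ckpts/:")
        for entry in missing:
            print("  " + entry)
    else:
        print("All expected checkpoint entries are present.")
```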

Inference with user defined trajectories

  1. Run the integrated two-stage inference script.
cd sat
python3 inference.py

By default, inference.py uses:

  • trajectories from assets/user_traj_human_ref.txt and assets/user_traj_object_ref.txt
  • prompts from assets/augmentor_prompt.txt and assets/dense_prompt.txt
  • masks and sapien annotations from assets/
  • output folders output/stage1/ and output/stage2/
  2. To export new trajectories with the local drawing UI, save them anywhere you want and pass the files directly to inference.py.
python3 interactive_traj.py \
  --image assets/microphone.png \
  --output-dir output \
  --prefix user

python3 inference.py \
  --human-traj output/user_human.txt \
  --object-traj output/user_object.txt \
  --output-dir output \
  --seed 12345

This path still requires a high-memory GPU for generation (A100 or H100).
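To compare several random seeds for the same pair of trajectory files, the inference.py invocation above can be generated programmatically. This is a sketch that only uses the flags shown above; `build_inference_commands` is a hypothetical helper, and writing each seed to its own `output/seed_<seed>` folder is an assumption about how you may want to organize results, not a convention of this repository.

```python
import shlex


def build_inference_commands(human_traj, object_traj, seeds, output_root="output"):
    """Build one inference.py argument list per seed."""
    commands = []
    for seed in seeds:
        commands.append([
            "python3", "inference.py",
            "--human-traj", human_traj,
            "--object-traj", object_traj,
            # Assumed layout: one output folder per seed to avoid overwriting.
            "--output-dir", f"{output_root}/seed_{seed}",
            "--seed", str(seed),
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_inference_commands(
        "output/user_human.txt", "output/user_object.txt", [12345, 12346]
    ):
        print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` from inside sat/.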

If you want to run inference on custom examples, use SAPIENS to generate the human masks and Grounded SAM 2 to generate the object masks.

Training

Data Processing

We train on processed subsets of HOIGen-1M and BEHAVE. Download the raw data from the official sources, then prepare the inputs expected by this repo.

All paths below are relative to sat/, which is also the working directory assumed by the training commands.

  1. Downsample each source video to 49 frames.
  2. Run SAPIENS on every frame to obtain:
    • a human mask video
    • per-frame segmentation .npy outputs
  3. Run Grounded SAM 2 to obtain the object mask video for the key object in each prompt.
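For step 1, the README does not specify how frames are selected. One common choice is uniform temporal sampling, sketched below; treat this as an assumption rather than the exact procedure used for the released data. The function name `uniform_frame_indices` is hypothetical.

```python
def uniform_frame_indices(num_source_frames, num_target_frames=49):
    """Pick evenly spaced frame indices, always including the first and last frame."""
    if num_target_frames < 2:
        raise ValueError("num_target_frames must be at least 2")
    if num_source_frames < num_target_frames:
        raise ValueError("source video has fewer frames than the target length")
    step = (num_source_frames - 1) / (num_target_frames - 1)
    return [round(i * step) for i in range(num_target_frames)]


if __name__ == "__main__":
    indices = uniform_frame_indices(145)
    print(len(indices), indices[:4], indices[-1])
```

The resulting indices can then be used to select frames with whichever video reader you prefer before running SAPIENS and Grounded SAM 2 on the 49-frame clip.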

Example processed assets can be found in:

  • assets/human_mask.mp4
  • assets/sapien_seg.npy
  • assets/object_mask.avi

Place the processed data under the following structure:

sat/
└── data/
    ├── hoigen/
    │   ├── videos/
    │   ├── person_masks/
    │   ├── object_masks/
    │   ├── sapien_data/
    │   ├── train_dense_model_video_paths.txt
    │   ├── val_dense_model_video_paths.txt
    │   ├── train_dense_model_prompts.txt
    │   ├── val_dense_model_prompts.txt
    │   ├── train_augmentor_prompts.txt
    │   ├── val_augmentor_prompts.txt
    │   ├── train_person_mask_paths.txt
    │   ├── val_person_mask_paths.txt
    │   ├── train_object_mask_paths.txt
    │   ├── val_object_mask_paths.txt
    │   ├── train_sapien_data_paths.txt
    │   └── val_sapien_data_paths.txt
    └── behave/
        └── same structure and file naming pattern as `hoigen/`

The train and validation sequence lists are:

  • HOIGen train: data/hoigen/train_dense_model_video_paths.txt (34,657 sequences)
  • HOIGen val: data/hoigen/val_dense_model_video_paths.txt (50 sequences)
  • BEHAVE train: data/behave/train_dense_model_video_paths.txt (4,277 sequences)
  • BEHAVE val: data/behave/val_dense_model_video_paths.txt (50 sequences)

The augmentor prompt files and dense-model prompt files are different for the same video sequences. Please refer to the paper appendix for the augmentor prompt construction details.

The .txt files under sat/data/ already contain relative paths such as data/hoigen/videos/... and data/hoigen/person_masks/.... They assume that the processed files are placed under the corresponding folders inside sat/data/. If you store the processed data somewhere else, either rewrite the paths inside those .txt files or update the YAML configs under sat/configs/ours/ to point to your own path-list files.
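A quick consistency check of the path-list files can catch misplaced data before a long training run starts. The sketch below assumes that every list file holds one entry per line and that the lists for a split are aligned line by line; both are assumptions about the file format, and `check_split` is a hypothetical helper.

```python
from pathlib import Path

# File-name suffixes taken from the data tree above.
LIST_SUFFIXES = [
    "dense_model_video_paths.txt",
    "dense_model_prompts.txt",
    "augmentor_prompts.txt",
    "person_mask_paths.txt",
    "object_mask_paths.txt",
    "sapien_data_paths.txt",
]


def read_entries(path):
    """Read non-empty lines from a list file."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def check_split(dataset_dir, split, base_dir="."):
    """Return a list of problems found for one split ('train' or 'val')."""
    problems, counts = [], {}
    for suffix in LIST_SUFFIXES:
        list_file = Path(dataset_dir) / f"{split}_{suffix}"
        if not list_file.is_file():
            problems.append(f"missing list file: {list_file}")
            continue
        entries = read_entries(list_file)
        counts[list_file.name] = len(entries)
        if suffix.endswith("_paths.txt"):
            # Entries are relative paths such as data/hoigen/videos/...
            for entry in entries:
                if not (Path(base_dir) / entry).exists():
                    problems.append(f"{list_file.name}: not found: {entry}")
    if len(set(counts.values())) > 1:
        problems.append(f"line counts differ: {counts}")
    return problems


if __name__ == "__main__":
    # Assumes the current working directory is sat/.
    for problem in check_split("data/hoigen", "train")[:20]:
        print(problem)
```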

Stage 1: Train the Trajectory Augmentor

For the first augmentor training run, use the embed_first model config. This starts from the 32-channel patch embed and expands it to 48 channels after loading the base checkpoint. The example below uses 4 GPUs. Set nproc_per_node to match the number of GPUs you want to use.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=4 train_video.py \
  --base configs/model/cogvideox_5b_tora_i2v_embed_first.yaml configs/ours/train_sapien_augmentor_hoigen.yaml \
  --experiment-name "augmentor"
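The 32-to-48 channel expansion mentioned above can be pictured with a toy linear projection. A common way to widen an input layer without disturbing a pretrained model is to keep the existing weights and zero-initialize the weights for the new channels. Whether this repository initializes the new channels with zeros is an assumption; the README only states that the patch embed is expanded after loading the base checkpoint. The function names are hypothetical and the example uses plain Python lists instead of real model tensors.

```python
def expand_input_channels(weight, new_in_channels):
    """Pad each weight row (one per output feature) with zeros for the new inputs."""
    old_in_channels = len(weight[0])
    if new_in_channels < old_in_channels:
        raise ValueError("new_in_channels must not be smaller than the current width")
    pad = [0.0] * (new_in_channels - old_in_channels)
    return [list(row) + pad for row in weight]


def project(weight, x):
    """Apply the linear projection: one dot product per output feature."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


if __name__ == "__main__":
    # Toy "pretrained" projection with 32 input channels and 2 output features.
    weight32 = [[0.01 * (i + 1)] * 32, [-0.02 * (i + 1) for i in range(32)]]
    weight48 = expand_input_channels(weight32, 48)

    x32 = [float(i) for i in range(32)]
    extra16 = [5.0] * 16  # stands in for the additional conditioning channels

    # With zero-initialized new weights, the output matches the original model.
    print(project(weight32, x32) == project(weight48, x32 + extra16))
```

Under this scheme the expanded model initially behaves like the base checkpoint, and training then learns how to use the extra channels.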

To resume augmentor training after that, switch to the 48-channel config and update the load entry in configs/ours/train_sapien_augmentor_hoigen.yaml so that it points to the latest augmentor checkpoint.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=4 train_video.py \
  --base configs/model/cogvideox_5b_tora_i2v_embed.yaml configs/ours/train_sapien_augmentor_hoigen.yaml \
  --experiment-name "augmentor"

Stage 2: Train the Dense Control Model

For the dense model, use:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
torchrun --standalone --nproc_per_node=4 train_video.py \
  --base configs/model/cogvideox_5b_tora_i2v.yaml configs/ours/train_dense_hoigen.yaml \
  --experiment-name "dense_model"
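The three training invocations above differ only in the model config, the experiment config, and the experiment name. The sketch below collects that choice in one place. `build_train_command` is a hypothetical helper that uses only the flags and config paths shown above; since the README does not describe a separate resume config for the dense model, the helper reuses the same dense config in that case, which is an assumption.

```python
import shlex

AUGMENTOR_FIRST = "configs/model/cogvideox_5b_tora_i2v_embed_first.yaml"  # first run
AUGMENTOR_RESUME = "configs/model/cogvideox_5b_tora_i2v_embed.yaml"  # 48-channel
DENSE = "configs/model/cogvideox_5b_tora_i2v.yaml"

EXPERIMENTS = {
    "augmentor": ("configs/ours/train_sapien_augmentor_hoigen.yaml", "augmentor"),
    "dense": ("configs/ours/train_dense_hoigen.yaml", "dense_model"),
}


def build_train_command(stage, num_gpus=4, resume=False):
    """Return (extra environment variables, torchrun argument list) for a stage."""
    if stage not in EXPERIMENTS:
        raise ValueError(f"unknown stage: {stage!r}")
    if stage == "augmentor":
        model_cfg = AUGMENTOR_RESUME if resume else AUGMENTOR_FIRST
    else:
        model_cfg = DENSE  # assumption: same config whether or not resuming
    experiment_cfg, name = EXPERIMENTS[stage]
    env = {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}
    cmd = [
        "torchrun", "--standalone", f"--nproc_per_node={num_gpus}", "train_video.py",
        "--base", model_cfg, experiment_cfg,
        "--experiment-name", name,
    ]
    return env, cmd


if __name__ == "__main__":
    env, cmd = build_train_command("augmentor", num_gpus=4, resume=True)
    prefix = " ".join(f"{k}={v}" for k, v in env.items())
    print(prefix, shlex.join(cmd))
```

When resuming the augmentor, the load entry in the experiment YAML still has to be edited by hand; the helper only selects the model config.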

Metrics

We use the following external tools for evaluation:

Acknowledgements

This repository builds on the following open-source projects:

The authors would like to thank Amin Parchami for helpful advice on cluster usage.

Citation

If you find this code useful, please cite:

@article{zhang2025vhoi,
title = {VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification},
author = {Zhang, Wanyue and Foo, Lin Geng and Dabral, Rishabh and Beeler, Thabo and Theobalt, Christian},
year = {2025},
archivePrefix = {arXiv},
eprint={2512.09646},
}
