This is the official repository for the paper:
ReMoMask: Retrieval-Augmented Masked Motion Generation
Zhengdao Li*, Siheng Wang*, Zeyu Zhang*β, and Hao Tang#
*Equal contribution. β Project lead. #Corresponding author.
@article{li2025remomask,
title={ReMoMask: Retrieval-Augmented Masked Motion Generation},
author={Li, Zhengdao and Wang, Siheng and Zhang, Zeyu and Tang, Hao},
journal={arXiv preprint arXiv:2508.02605},
year={2025}
}
Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatiotemporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks, including HumanML3D, demonstrate state-of-the-art performance, with the FID score significantly improved to 0.095 compared to the SOTA RAG-T2M method.
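The momentum-queue idea in point 1 can be illustrated with a small, framework-free sketch. This is not the ReMoMask implementation: the class `MomentumQueue`, the function `ema_update`, and all sizes are hypothetical. It shows the generic MoCo-style pattern (a FIFO queue of past embeddings plus an exponential moving average of encoder weights), assuming that is what "momentum queues" refers to here.

```python
from collections import deque


class MomentumQueue:
    """FIFO store of past embeddings used as extra negatives (hypothetical helper)."""

    def __init__(self, max_size):
        # A bounded deque drops the oldest entries once max_size is reached.
        self.items = deque(maxlen=max_size)

    def enqueue(self, batch_embeddings):
        self.items.extend(batch_embeddings)

    def negatives(self):
        return list(self.items)


def ema_update(momentum_params, online_params, m=0.99):
    """Move the momentum encoder's weights slowly toward the online encoder's."""
    return [m * k + (1.0 - m) * q for k, q in zip(momentum_params, online_params)]


queue = MomentumQueue(max_size=8)  # queue size is chosen independently of batch size
for step in range(5):
    batch = [[float(step), float(i)] for i in range(4)]  # toy batch of 4 embeddings
    queue.enqueue(batch)

num_negatives = len(queue.negatives())  # bounded by the queue, not by the batch
updated = ema_update([1.0, 1.0], [0.0, 2.0], m=0.9)
print(num_negatives, updated)
```

The point of the pattern is that the number of negatives available to the contrastive loss is set by `max_size`, while the batch size only controls how quickly the queue is refreshed.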
- Upload our paper to arXiv and build project pages.
- Upload the code.
- Release TMR model.
- Release T2M model.
conda create -n remomask python=3.10
conda activate remomask
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
We tested our environment on both A800 and H20.
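After installation, a quick sanity check can confirm that the packages are importable from the active environment. The snippet below is a minimal sketch, not part of the repository; `check_environment` is a hypothetical helper, and the CUDA query only runs when `torch` is actually installed.

```python
import importlib.util
import sys


def check_environment(required=("torch", "torchvision", "torchaudio")):
    """Report the Python version and which required packages are importable."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in required:
        # find_spec returns None when the package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")

if report["torch"]:
    import torch

    print("torch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
```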
- RAG: Download the pretrained RAG models (coming soon) and place them at ./Part_TMR
- T2M: Download the pretrained T2M models (coming soon) and place them at ./logs/humanml3d/
Follow previous methods to prepare the evaluation models and GloVe embeddings, or download them from here (provided by MoGenTS) and place them in ./checkpoints
Follow the instructions in HumanML3D, then place the resulting dataset in ./dataset/HumanML3D.
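Because the models, evaluation files, and dataset go into several different folders, it can help to check the layout before running anything. The following is a small sketch under the assumption that all four locations mentioned above are plain directories relative to the repository root; `missing_dirs` is a hypothetical helper, not a script shipped with the code.

```python
from pathlib import Path

# Locations named in the instructions above, relative to the repository root.
EXPECTED_DIRS = [
    "Part_TMR",
    "logs/humanml3d",
    "checkpoints",
    "dataset/HumanML3D",
]


def missing_dirs(repo_root="."):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


for d in missing_dirs("."):
    print(f"missing: ./{d}")
```

An empty result only means the folders exist; it does not verify that the downloaded files inside them are complete.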
python demo.py --gpu_id 0 --ext exp1 --text_prompt "A person is walking on a circle." --checkpoints_dir logs --dataset_name humanml3d --mtrans_name pretrain_mtrans --rtrans_name pretrain_rtrans
# after training, replace pretrain_mtrans and pretrain_rtrans with the names of your own mtrans and rtrans models
Explanation:
- --repeat_times: number of replications for generation, default 1.
- --motion_length: specify the number of poses for generation.

The output will be in ./outputs/exp1/
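When generating for several prompts, the demo invocation can be assembled programmatically. The sketch below only builds the argument list from the flags shown above and prints it; it does not run the model. `build_demo_command` is a hypothetical helper, the values `3` and `120` are illustrative, and it assumes that `--repeat_times` and `--motion_length` take integer values.

```python
import shlex


def build_demo_command(prompt, ext="exp1", repeat_times=None, motion_length=None):
    """Assemble the demo.py invocation shown above as an argument list."""
    cmd = [
        "python", "demo.py",
        "--gpu_id", "0",
        "--ext", ext,
        "--text_prompt", prompt,
        "--checkpoints_dir", "logs",
        "--dataset_name", "humanml3d",
        "--mtrans_name", "pretrain_mtrans",
        "--rtrans_name", "pretrain_rtrans",
    ]
    # Optional flags are appended only when a value is given.
    if repeat_times is not None:
        cmd += ["--repeat_times", str(repeat_times)]
    if motion_length is not None:
        cmd += ["--motion_length", str(motion_length)]
    return cmd


cmd = build_demo_command(
    "A person is walking on a circle.", repeat_times=3, motion_length=120
)
print(shlex.join(cmd))  # shell-quoted form that can be pasted into a terminal
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would execute it; keeping the prompt as a single list element avoids shell quoting problems.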
python -m Part_TMR.scripts.train
Then build a RAG database for training the T2M model:
python build_rag_database.py
This produces ./database
bash run_rvq.sh vq 0 humanml3d --batch_size 256 --num_quantizers 6 --max_epoch 50 --quantize_dropout_prob 0.2 --gamma 0.1 --code_dim2d 1024 --nb_code2d 256
# using one gpu
bash run_mtrans.sh mtrans 1 0 humanml3d --vq_name pretrain_vq --batch_size 256 --max_epoch 2000 --attnj --attnt --latent_dim 512 --n_heads 8
# using multi gpus
bash run_mtrans.sh mtrans 8 0,1,2,3,4,5,6,7 humanml3d --vq_name pretrain_vq --batch_size 256 --max_epoch 2000 --attnj --attnt --latent_dim 512 --n_heads 8
# using multi gpus
bash run_rtrans.sh rtrans 2 humanml3d --batch_size 64 --vq_name vq --cond_drop_prob 0.01 --share_weight --max_epoch 2000 --attnj --attnt
# here, 2 means cuda:0,1
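The two `run_mtrans.sh` examples differ only in the GPU count and the GPU id list. The sketch below reproduces them from a single number, assuming the positional arguments are `<name> <num_gpus> <gpu_ids> <dataset>` as the examples suggest and that GPUs are numbered from 0. `mtrans_command` is a hypothetical helper and only prints the command.

```python
def mtrans_command(num_gpus, name="mtrans", dataset="humanml3d"):
    """Build the run_mtrans.sh call for GPUs 0..num_gpus-1.

    The positional argument order is inferred from the examples above.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    gpu_ids = ",".join(str(i) for i in range(num_gpus))
    return (
        f"bash run_mtrans.sh {name} {num_gpus} {gpu_ids} {dataset} "
        "--vq_name pretrain_vq --batch_size 256 --max_epoch 2000 "
        "--attnj --attnt --latent_dim 512 --n_heads 8"
    )


print(mtrans_command(1))
print(mtrans_command(8))
```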
python -m Part_TMR.scripts.test
python eval_vq.py --gpu_id 0 --name pretrain_vq --dataset_name humanml3d --ext eval --which_epoch net_best_fid.tar
# change pretrain_vq to your vq
python eval_mask.py --dataset_name humanml3d --mtrans_name pretrain_mtrans --gpu_id 0 --cond_scale 4 --time_steps 10 --ext eval --which_epoch fid
# change pretrain_mtrans to your mtrans
HumanML3D:
python eval_res.py --gpu_id 0 --dataset_name humanml3d --mtrans_name pretrain_mtrans --rtrans_name pretrain_rtrans --cond_scale 4 --time_steps 10 --ext eval --which_ckpt net_best_fid.tar --which_epoch fid --traverse_res
# change pretrain_mtrans and pretrain_rtrans to your mtrans and rtrans
KIT-ML:
python eval_res.py --gpu_id 0 --dataset_name kit --mtrans_name pretrain_mtrans_kit --rtrans_name pretrain_rtrans_kit --cond_scale 4 --time_steps 10 --ext eval --which_ckpt net_best_fid.tar --which_epoch fid --traverse_res
# change pretrain_mtrans_kit and pretrain_rtrans_kit to your mtrans and rtrans
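The evaluation commands pass `--cond_scale 4`. Assuming this flag plays the usual role of a classifier-free guidance scale, the standard combination of conditional and unconditional predictions looks like the sketch below. This is the generic formula, not code taken from the repository, and the RAG-specific variant described in the paper may differ; `guided_logits` and the toy numbers are hypothetical.

```python
def guided_logits(cond_logits, uncond_logits, cond_scale):
    """Standard classifier-free guidance.

    Starts at the unconditional prediction and moves cond_scale times
    the distance toward the conditional one.
    """
    return [u + cond_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


cond = [2.0, 0.5, -1.0]   # toy logits computed with the condition
uncond = [1.0, 0.5, 0.0]  # toy logits computed with the condition dropped
print(guided_logits(cond, uncond, cond_scale=4))
```

With `cond_scale=1` the result equals the conditional logits, and with `cond_scale=0` it equals the unconditional ones; values above 1 push the prediction further toward the condition.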
You can download Blender from [instructions](https://www.blender.org/download/lts/2-93/). Please install exactly this version. For our paper, we use `blender-2.93.18-linux-x64`.

### a. unzip it:
```bash
tar -xvf blender-2.93.18-linux-x64.tar.xz
```
cd blender-2.93.18-linux-x64
./blender --background --version
you should see: Blender 2.93.18 (hash cb886axxxx built 2023-05-22 23:33:27)
./blender --background --python-expr "import sys; import os; print('\nThe version of python is ' + sys.version.split(' ')[0])"
you should see: The version of python is 3.9.2
./blender --background --python-expr "import sys; import os; print('\nThe path to the installation of python is\n' + sys.executable)"
you should see: The path to the installation of python is /xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m ensurepip --upgrade
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install --upgrade pip
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install numpy==2.0.2
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install matplotlib==3.9.4
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install hydra-core==1.3.2
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install hydra_colorlog==1.2.0
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install moviepy==1.0.3
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install shortuuid==1.0.13
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install natsort==8.4.0
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install pytest-shutil==1.8.1
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install tqdm==4.67.1
/xxx/blender-2.93.18-linux-x64/2.93/python/bin/python3.9 -m pip install tqdm==1.17.0
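Since every install command repeats the same interpreter path, the commands can be generated from the Blender directory. This is a minimal sketch: `blender_python` and `pip_install_commands` are hypothetical helpers, the path layout is taken from the output shown above, and only three of the pins are listed for brevity. It prints the commands rather than running them.

```python
from pathlib import PurePosixPath


def blender_python(blender_dir, blender_version="2.93", python_name="python3.9"):
    """Path of the Python interpreter bundled with Blender (layout as printed above)."""
    return PurePosixPath(blender_dir) / blender_version / "python" / "bin" / python_name


def pip_install_commands(blender_dir, packages):
    """One 'pip install' command per pinned package, using Blender's own Python."""
    py = blender_python(blender_dir)
    return [f"{py} -m pip install {pkg}" for pkg in packages]


packages = ["numpy==2.0.2", "matplotlib==3.9.4", "hydra-core==1.3.2"]
for line in pip_install_commands("/xxx/blender-2.93.18-linux-x64", packages):
    print(line)
```

Replace `/xxx/` with the directory where the archive was unpacked.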
python -m fit --dir new_test_npy --save_folder new_temp_npy --cuda cuda:0
/xxx/blender-2.93.18-linux-x64/blender --background --python render.py -- --cfg=./configs/render_mld.yaml --dir=test_npy --mode=video --joint_type=HumanML3D
- --mode=video: render to an mp4 video
- --mode=sequence: render to a png image, called a sequence.
We sincerely thank the authors of the following open-source works, on which our code is based:
MoMask, MoGenTS, ReMoDiffuse, MDM, TMR, ReMoGPT
This code is distributed under a CC BY-NC-SA 4.0 license.
Note that our code depends on other libraries, including CLIP, SMPL, SMPL-X, PyTorch3D, and uses datasets that each have their own respective licenses that must also be followed.