Description
First of all, thank you for your excellent work and open-sourcing the code of OmniControl, which has brought great inspiration to my research on controllable human motion generation.
I have recently been reproducing the experiments in your ICLR 2024 paper and found that, under the settings below, the generation performance differs significantly from the reported results. I would like to ask for your advice on possible causes and solutions.
Core Problem
When I set the control signal density to 5 keyframes or more (including 49 frames/25% density and 196 frames/100% density) and specify the 6 core interactive joints mentioned in the paper as the controllable joints, the generated motion falls far short of the paper's results in both control accuracy and motion realism.
Reproduction Environment
| Item | Details |
| --- | --- |
| Hardware | NVIDIA RTX 3090 |
| OS | Ubuntu 20.04 |
| PyTorch Version | 1.13.1 |
| CUDA Version | 11.7 |
| Checkpoint | Official pre-trained `Ours (on all)` checkpoint (for all joints control) |
| Dataset | HumanML3D (processed with the official preprocessing code) |
| Inference Hyperparameters | All default values from the paper: `T=1000`, `T_s=10`, `K_e=10`, `K_l=500`; guidance strength `τ` calculated with the official formula |
Key Experimental Settings
- Controllable Joints Setting
  I set `controllable_joints = np.array([0, 10, 11, 15, 20, 21])`, which corresponds to the 6 core joints mentioned in the paper:
  - 0: pelvis
  - 10: left foot
  - 11: right foot
  - 15: head
  - 20: left wrist
  - 21: right wrist

  This is completely consistent with the joint selection in the paper's "Ours (on all)" experiments.
- Control Signal Setting
  - The control signals are extracted from the ground-truth motion sequences in the HumanML3D test set (consistent with the evaluation protocol in the paper).
  - Tested 3 density levels with 5 or more keyframes: 5 frames, 49 frames (25% density), and 196 frames (100% density).
  - The mask of the control signal is set as follows: valid values for the target joints at the keyframes, and 0 everywhere else.
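The control signal setting described above can be sketched roughly as follows. This is only an illustration, not the official OmniControl code: the function name `build_control_signal`, the array shapes, and the random keyframe choice are assumptions, and the official pipeline may apply extra steps (for example normalization) that are not shown here.

```python
import numpy as np

N_JOINTS = 22  # HumanML3D skeleton size (assumed layout: frames x joints x xyz)

controllable_joints = np.array([0, 10, 11, 15, 20, 21])


def build_control_signal(gt_joints, joints, n_keyframes, rng):
    """Hypothetical helper: keep ground-truth positions only at (keyframe, joint) pairs.

    gt_joints: array of shape (n_frames, 22, 3) with global joint positions.
    Returns (hint, mask) where hint is zero wherever mask is False.
    """
    n_frames = gt_joints.shape[0]
    # Pick distinct keyframes; density 196/196 simply selects every frame.
    keyframes = rng.choice(n_frames, size=min(n_keyframes, n_frames), replace=False)

    mask = np.zeros((n_frames, N_JOINTS, 1), dtype=bool)
    mask[np.ix_(keyframes, joints)] = True  # controlled joints at keyframes only

    hint = np.where(mask, gt_joints, 0.0)   # 0 for everything that is not controlled
    return hint, mask


rng = np.random.default_rng(0)
gt = rng.normal(size=(196, N_JOINTS, 3))    # stand-in for a ground-truth sequence
hint, mask = build_control_signal(gt, controllable_joints, 49, rng)
print(hint.shape, int(mask.sum()))
```

With 49 keyframes and 6 joints, the mask should contain 49 × 6 active entries; a different count would point to an indexing problem on my side.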
Observed Problem Phenomena
- Quantitative Performance Gap
  The evaluation metrics are far worse than the results reported in Table 1 of the paper (the case below was tested with density = 5):
  Avg. err. of the controlled joints is roughly 5-10 times the reported average value of 0.0404, and the foot skating ratio is 0.2109.
- Visualization Phenomena
  - The controlled joints (especially the wrists and feet) deviate substantially from the input control signal and do not follow the preset trajectory.
  - Severe foot sliding, unnatural limb stretching, and incoherent whole-body motion.
  - The motion semantics are inconsistent with the text prompt in some cases.
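To make the Avg. err. number above easier to compare, here is a minimal sketch of how such an error can be computed. It assumes the metric is the mean Euclidean distance between generated and target joint positions over the controlled (frame, joint) entries, with positions in metres; `control_error` is a hypothetical helper, not a function from the official evaluation code.

```python
import numpy as np


def control_error(pred, hint, mask):
    """Mean L2 distance over controlled entries (assumed definition of Avg. err.).

    pred, hint: arrays of shape (n_frames, n_joints, 3).
    mask: boolean array of shape (n_frames, n_joints, 1), True where controlled.
    """
    dist = np.linalg.norm(pred - hint, axis=-1)  # per-frame, per-joint distance
    controlled = mask[..., 0]
    return float(dist[controlled].mean())


# Toy check: two controlled entries, one of them off by a 3-4-5 cm offset.
hint = np.zeros((4, 22, 3))
mask = np.zeros((4, 22, 1), dtype=bool)
mask[1, 0] = True
mask[2, 20] = True
pred = hint.copy()
pred[1, 0] = [0.03, 0.04, 0.0]

err = control_error(pred, hint, mask)
print(err)
```

If the official metric averages differently (for example per sequence before averaging over the test set), the numbers would not be directly comparable, which may itself explain part of the gap.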
Questions to the Authors
- For the `Ours (on all)` model that supports 6-joint control, is there any special training strategy for multi-joint control in the training phase? For example, the weight of the loss function, the sampling method of the control signal for different joints, or a joint-specific guidance strength?
- When performing dense control with density >5 (49/196 frames) for multiple joints, do we need to adjust the inference hyperparameters (such as `τ` or the number of iterations `K` in spatial guidance)? Are the default parameters in the paper optimized only for single-joint control, not for multi-joint dense control?
- Is there a possible mismatch in the joint index? Is the index of the 6 core joints in the HumanML3D dataset used in the paper consistent with the SMPL-H 22-joint index I used above?
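Regarding the joint-index question, this is the sanity check I have in mind. It assumes the data follows the standard SMPL body-joint order (first 22 joints); the `SMPL_22_JOINT_NAMES` list is written out from that convention and is not taken from the OmniControl repository, so please correct me if the paper uses a different order.

```python
import numpy as np

# Standard SMPL body-joint order, first 22 joints (assumption: HumanML3D uses it).
SMPL_22_JOINT_NAMES = [
    "pelvis", "left_hip", "right_hip", "spine1", "left_knee", "right_knee",
    "spine2", "left_ankle", "right_ankle", "spine3", "left_foot", "right_foot",
    "neck", "left_collar", "right_collar", "head", "left_shoulder",
    "right_shoulder", "left_elbow", "right_elbow", "left_wrist", "right_wrist",
]

controllable_joints = np.array([0, 10, 11, 15, 20, 21])

# Map each controlled index to its name so a mismatch is visible at a glance.
selected = {int(i): SMPL_22_JOINT_NAMES[int(i)] for i in controllable_joints}
for index, name in selected.items():
    print(index, name)
```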
I can provide the complete reproduction code, full evaluation logs, and visualization videos of the generated motion at any time. Thank you again for your great work and look forward to your reply!