Hi,
I’m trying to generate a video using a text prompt and LiDAR control only, with the AV-Sample model.
The control weight is set to 1.0, so I would expect the generated video to strictly follow the control signal (LiDAR, in this case).
However, I’ve noticed that the generated video deviates from the LiDAR control even at this maximum control weight.
In contrast, when using depth control, the generated output remains much more consistent with the input control.
Here is the code I’m running:
PYTHONPATH="$(pwd)" torchrun --nproc_per_node="${NUM_GPU}" --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir "${CHECKPOINT_DIR}" \
--video_save_folder "my_output" \
--controlnet_specs "cosmos_lidar_defult.json" \
--is_av_sample \
--sigma_max 80 \
--fps 30 \
--num_gpus "${NUM_GPU}" \
--batch_input_path "waymo_reg_3_spec.json"
Here are the spec files:
cosmos_lidar_defult.json
{
    "prompt": "The video is captured from a camera mounted on a car. The camera is facing forward. The video depicts a road with a clear blue sky overhead and a few scattered clouds. The road is lined with palm trees and power lines, and there are a few cars driving in both directions. The road appears to be in a suburban area, with houses and buildings visible on either side. The weather is sunny and clear, with no signs of rain or clouds. The time of day is not specified, but the lighting suggests it is daytime.",
    "lidar": {
        "input_control": "lidar_control_vid_defult.mp4",
        "control_weight": 1.0
    }
}
waymo_reg_2_spec.json
{"prompt": "The video is captured from a camera mounted on a car. The camera is facing forward. The video depicts a road with a clear blue sky overhead and a few scattered clouds. The road is lined with palm trees and power lines, and there are a few cars driving in both directions. The road appears to be in a suburban area, with houses and buildings visible on either side. The weather is sunny and clear, with no signs of rain or clouds. The time of day is not specified, but the lighting suggests it is daytime.", "control_overrides": {"lidar": {"input_control": "lidar_control_vid_0.mp4", "control_weight": 1.0}}, "video_save_name": "0_gen_vid"}
{"prompt": "The video is captured from a camera mounted on a car. The camera is facing forward. The video depicts a road with a clear blue sky overhead and a few scattered clouds. The road is lined with palm trees and power lines, and there are a few cars driving in both directions. The road appears to be in a suburban area, with houses and buildings visible on either side. The weather is sunny and clear, with no signs of rain or clouds. The time of day is not specified, but the lighting suggests it is daytime.", "control_overrides": {"lidar": {"input_control": "lidar_control_vid_1.mp4", "control_weight": 1.0}}, "video_save_name": "1_gen_vid"}
Here are the results:
The highlighted zoomed-in regions show clear inconsistencies between the LiDAR signal and the generated video
(left to right: generated video, LiDAR control, LiDAR overlaid on generated video).
My question is:
Is there a way to make the generated video more strictly consistent with the LiDAR control signal (even if this results in a less visually pleasing generation)?
@caotians1 @pjannaty
Thanks!