Generated video not consistent with LiDAR control when using AV-Sample model (control weight = 1.0) #226

@yuvalH9

Hi,
I’m trying to generate a video using a text prompt and LiDAR control only, with the AV-Sample model.
The control weight is set to 1.0, so I would expect the generated video to strictly follow the control signal (LiDAR, in this case).

However, I’ve noticed that the generated video deviates from the LiDAR control even at this maximum control weight.
In contrast, when using depth control, the generated output remains much more consistent with the input control.

Here is the command I’m running:

PYTHONPATH="$(pwd)" torchrun --nproc_per_node="${NUM_GPU}" --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py --checkpoint_dir "${CHECKPOINT_DIR}" \
  --video_save_folder "my_output" \
  --controlnet_specs "cosmos_lidar_defult.json" \
  --is_av_sample \
  --sigma_max 80 \
  --fps 30 \
  --num_gpus "${NUM_GPU}" \
  --batch_input_path "waymo_reg_3_spec.json" 

Here are the spec files:
cosmos_lidar_defult.json

{
    "prompt": "The video is captured from a camera mounted on a car. The camera is facing forward. The video depicts a road with a clear blue sky overhead and a few scattered clouds. The road is lined with palm trees and power lines, and there are a few cars driving in both directions. The road appears to be in a suburban area, with houses and buildings visible on either side. The weather is sunny and clear, with no signs of rain or clouds. The time of day is not specified, but the lighting suggests it is daytime.",
    "lidar": {
        "input_control": "lidar_control_vid_defult.mp4",
        "control_weight": 1.0
    }
}

waymo_reg_2_spec.json

{"prompt": "The video is captured from a camera mounted on a car. The camera is facing forward. The video depicts a road with a clear blue sky overhead and a few scattered clouds. The road is lined with palm trees and power lines, and there are a few cars driving in both directions. The road appears to be in a suburban area, with houses and buildings visible on either side. The weather is sunny and clear, with no signs of rain or clouds. The time of day is not specified, but the lighting suggests it is daytime.", "control_overrides": {"lidar": {"input_control": "lidar_control_vid_0.mp4", "control_weight": 1.0}}, "video_save_name": "0_gen_vid"}
{"prompt": "The video is captured from a camera mounted on a car. The camera is facing forward. The video depicts a road with a clear blue sky overhead and a few scattered clouds. The road is lined with palm trees and power lines, and there are a few cars driving in both directions. The road appears to be in a suburban area, with houses and buildings visible on either side. The weather is sunny and clear, with no signs of rain or clouds. The time of day is not specified, but the lighting suggests it is daytime.", "control_overrides": {"lidar": {"input_control": "lidar_control_vid_1.mp4", "control_weight": 1.0}}, "video_save_name": "1_gen_vid"}
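
The batch file above holds one JSON object per line, so it can be generated rather than edited by hand. Below is a minimal sketch, assuming the same prompt is reused for every clip and that the control videos follow the `lidar_control_vid_<i>.mp4` naming shown above. The helper name `build_batch_lines` is hypothetical and is not part of cosmos_transfer1.

```python
import json


def build_batch_lines(prompt, num_clips, control_weight=1.0):
    """Return one JSON string per clip, mirroring the batch spec layout above.

    Hypothetical helper: only the keys seen in the batch file
    ("prompt", "control_overrides", "video_save_name") are emitted.
    """
    lines = []
    for i in range(num_clips):
        entry = {
            "prompt": prompt,
            "control_overrides": {
                "lidar": {
                    # Assumed naming scheme, copied from the spec lines above.
                    "input_control": f"lidar_control_vid_{i}.mp4",
                    "control_weight": control_weight,
                }
            },
            "video_save_name": f"{i}_gen_vid",
        }
        lines.append(json.dumps(entry))
    return lines


if __name__ == "__main__":
    prompt = "The video is captured from a camera mounted on a car."
    for line in build_batch_lines(prompt, num_clips=2):
        print(line)
```

Writing the printed lines to a file gives something that can be passed to `--batch_input_path`, and parsing each line with `json.loads` is a quick check that the file is well formed before launching a run.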

Here are the results:

Image Image

In the highlighted zoomed-in regions, you can see clear inconsistencies between the LiDAR signal and the generated video
(left to right: generated video, LiDAR control, LiDAR overlaid on generated video).

My question is:
Is there a way to make the generated video more strictly consistent with the LiDAR control signal (even if this results in a less visually pleasing generation)?

@caotians1 @pjannaty

Thanks!
