CV24 Fall Final Project
This is the implementation of our project: Bringing Static 3D Scenes to Life: Language-Guided Video Diffusion for Dynamic 4D Generation
The overall pipeline consists of the following three stages:
- A. Language-embedded 3DGS segmentation
- B. Motion Video Generation for Trajectory Design
- C. Depth Estimation on Generated Motion Video
The detailed implementation guide for each stage is provided below.
We implement 3DGS segmentation with reference to the official implementation of SAGA (Segment Any 3D GAussians). Please refer to language_embedded_3DGS for more information.
First, install the dependencies:
conda env create --file environment.yml
conda activate gaussian_splatting
By default, we use the public ViT-H model for SAM. Download the pre-trained checkpoint from the official Segment Anything repository and put it under ./third_party/segment-anything/sam_ckpt.
We inherit all attributes from 3DGS; more information about training the Gaussians can be found in their repo.
python train_scene.py -s <path to COLMAP or NeRF Synthetic dataset>
Then, to get the sam_masks and the corresponding mask scales, run the following commands:
python extract_segment_everything_masks.py --image_root <path to the scene data> --sam_checkpoint_path <path to the pre-trained SAM model> --downsample <1/2/4/8>
python get_scale.py --image_root <path to the scene data> --model_path <path to the pre-trained 3DGS model>
Note that downsampling is sometimes necessary when GPU memory is limited.
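To get a feel for how much the --downsample flag can help, the sketch below estimates the image size for each factor. It assumes the flag divides each image side by the factor, so the pixel count drops by the factor squared. That behavior and the helper names downsampled_size and pixel_fraction are assumptions made for illustration; check the script if exact sizes matter.

```python
def downsampled_size(width: int, height: int, factor: int) -> tuple[int, int]:
    """Return (width, height) after dividing each side by `factor`.

    Assumption: --downsample divides both sides using integer division.
    """
    if factor not in (1, 2, 4, 8):
        raise ValueError("factor must be one of 1, 2, 4, 8")
    return width // factor, height // factor


def pixel_fraction(factor: int) -> float:
    """Fraction of the original pixel count left after downsampling."""
    return 1.0 / (factor * factor)


if __name__ == "__main__":
    # 1920x1080 is only an example input resolution.
    for f in (1, 2, 4, 8):
        w, h = downsampled_size(1920, 1080, f)
        print(f"--downsample {f}: {w}x{h} ({pixel_fraction(f):.1%} of the pixels)")
```

If mask extraction runs out of memory at one factor, moving to the next larger factor cuts the pixel count to a quarter under this assumption.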
If you want to try the open-vocabulary segmentation, extract the CLIP features first:
python get_clip_features.py --image_root <path to the scene data>
python train_contrastive_feature.py -m <path to the pre-trained 3DGS model> --iterations 10000 --num_sampled_rays 1000
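The preprocessing steps above can be chained from a small driver script. The following is a minimal sketch, not part of the repository: the function name preprocessing_commands and its parameters are hypothetical, while the script names and flags are copied from the commands listed above.

```python
import shlex


def preprocessing_commands(scene_dir, model_dir, sam_ckpt, downsample=1, open_vocab=False):
    """Return the preprocessing steps as argv lists, in the order listed above."""
    steps = [
        # 1. Train the 3D Gaussians on the scene.
        ["python", "train_scene.py", "-s", scene_dir],
        # 2. Extract SAM masks (downsample to save GPU memory if needed).
        ["python", "extract_segment_everything_masks.py",
         "--image_root", scene_dir,
         "--sam_checkpoint_path", sam_ckpt,
         "--downsample", str(downsample)],
        # 3. Compute the mask scales.
        ["python", "get_scale.py", "--image_root", scene_dir, "--model_path", model_dir],
    ]
    if open_vocab:
        # Optional: CLIP features for open-vocabulary segmentation.
        steps.append(["python", "get_clip_features.py", "--image_root", scene_dir])
    # 4. Train the contrastive features.
    steps.append(["python", "train_contrastive_feature.py", "-m", model_dir,
                  "--iterations", "10000", "--num_sampled_rays", "1000"])
    return steps


if __name__ == "__main__":
    # Placeholder paths; replace them with your own.
    for cmd in preprocessing_commands("<scene>", "<model>", "<sam checkpoint>", downsample=2):
        print(shlex.join(cmd))
```

To execute the steps instead of printing them, each list can be passed to subprocess.run(cmd, check=True) so that a failing step stops the sequence.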
Currently, SAGA provides an interactive GUI (saga_gui.py) implemented with dearpygui and a Jupyter notebook (prompt_segmenting.ipynb). To run the GUI:
python saga_gui.py --model_path <path to the pre-trained 3DGS model>
For now, open-vocabulary segmentation is only implemented in the Jupyter notebook. Please refer to prompt_segmenting.ipynb for detailed instructions.
After you save segmentation results in the interactive GUI or run the scripts in prompt_segmenting.ipynb, the bitmap of the Gaussians is saved to ./segmentation_res/your_name.pt (you can choose the name yourself). To render the segmentation results on the training views (that is, get the segmented object by removing the background), run the following command:
python render.py -m <path to the pre-trained 3DGS model> --precomputed_mask <path to the segmentation results> --target scene --segment
To get the 2D rendered masks, run the following command:
python render.py -m <path to the pre-trained 3DGS model> --precomputed_mask <path to the segmentation results> --target seg
You can also render the pre-trained 3DGS model without segmentation:
python render.py -m <path to the pre-trained 3DGS model> --target scene
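The three render.py invocations above differ only in a few flags, which can be captured in one helper. This is an illustrative sketch: the function render_command and the mode names ("segmented_object", "mask_2d", "full_scene") are hypothetical labels introduced here, while the flags themselves are the ones shown in the commands above.

```python
def render_command(model_dir, mode, mask_path=None):
    """Build the render.py argv list for one of the three modes shown above."""
    # Hypothetical mode names mapped to the flags from the commands above.
    mode_flags = {
        "segmented_object": ["--target", "scene", "--segment"],
        "mask_2d": ["--target", "seg"],
        "full_scene": ["--target", "scene"],
    }
    if mode not in mode_flags:
        raise ValueError(f"unknown mode: {mode!r}")

    cmd = ["python", "render.py", "-m", model_dir]
    if mode != "full_scene":
        # Both segmentation modes need the saved Gaussian bitmap.
        if mask_path is None:
            raise ValueError(f"mode {mode!r} requires mask_path")
        cmd += ["--precomputed_mask", mask_path]
    return cmd + mode_flags[mode]


if __name__ == "__main__":
    print(" ".join(render_command("<model>", "segmented_object",
                                  "./segmentation_res/your_name.pt")))
```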
We explored three methods that are optimized specifically for generating videos of moving objects.
All three share the same issue: weak zero-shot capability. We therefore decided to use commercial diffusion-based video generation models instead, namely Runway AI and KLing AI.
Both produce equally good results. The only difference is speed: Runway AI is significantly faster than KLing AI on the same task, typically about 10 seconds versus 10 minutes, so Runway AI is preferred. Several tips are important for this stage:
- Use two images as input instead of one: the image of the initial position as the first frame and the image of the ending position as the last frame.
- The conditions you provide must be physically plausible.
- The prompt should follow the format "object+action" rather than "action+subject".
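The first and third tips can be enforced with a small helper before submitting a job. The sketch below is purely illustrative: MotionRequest and build_request are hypothetical names and do not correspond to any Runway AI or KLing AI API. Physical plausibility (the second tip) cannot be checked this way and still needs human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionRequest:
    """Inputs for one generation job (hypothetical container, not a vendor API)."""
    first_frame: str  # image of the initial position
    last_frame: str   # image of the ending position
    prompt: str       # text in "object + action" order


def build_request(first_frame, last_frame, obj, action):
    """Assemble a request that uses two keyframes and an object-first prompt."""
    obj, action = obj.strip(), action.strip()
    if not obj or not action:
        raise ValueError("both the object and the action must be non-empty")
    if first_frame == last_frame:
        raise ValueError("use distinct images for the first and last frames")
    # Object first, then action, as recommended above.
    return MotionRequest(first_frame, last_frame, f"{obj} {action}")


if __name__ == "__main__":
    # Example file names and wording are made up for illustration.
    req = build_request("start.png", "end.png", "the toy car", "rolls forward")
    print(req.prompt)
```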
The source code of the methods we tried can be found in "motion video method". The final generated results can be found on Google Drive: https://drive.google.com/drive/folders/10VHhwZ4RnGIvJjm_OQ-MOkmij5cHy1Lv?usp=drive_link.
To generate depth maps for the generated motion video, we leverage state-of-the-art monocular depth estimation models, such as MiDaS, DPT, or Depth Any Video.
Please refer to the Depth Any Video project page for more information.
Set up the environment with conda, including support for the Gradio app:
git clone https://github.com/Nightmare-n/DepthAnyVideo
cd DepthAnyVideo
# create env using conda
conda create -n dav python==3.10
conda activate dav
pip install -r requirements.txt
pip install gradio
- To run inference on an image, use the following command:
python run_infer.py --data_path ./demos/arch_2.jpg --output_dir ./outputs/ --max_resolution 2048
- To run inference on a video, use the following command:
python run_infer.py --data_path ./demos/wooly_mammoth.mp4 --output_dir ./outputs/ --max_resolution 960
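When processing many files, the two commands above can be generated from the input path. The following is a minimal sketch under stated assumptions: the helper infer_command and the extension sets are hypothetical, and the resolutions (2048 for images, 960 for videos) simply mirror the two example commands rather than any documented default.

```python
from pathlib import Path

# Assumed extension sets for illustration; extend them as needed.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
VIDEO_EXTS = {".mp4"}


def infer_command(data_path, output_dir="./outputs/"):
    """Build the run_infer.py argv list, choosing --max_resolution by file type."""
    ext = Path(data_path).suffix.lower()
    if ext in IMAGE_EXTS:
        max_resolution = 2048  # value used in the image example above
    elif ext in VIDEO_EXTS:
        max_resolution = 960   # value used in the video example above
    else:
        raise ValueError(f"unsupported file type: {ext!r}")
    return ["python", "run_infer.py",
            "--data_path", data_path,
            "--output_dir", output_dir,
            "--max_resolution", str(max_resolution)]


if __name__ == "__main__":
    for path in ("./demos/arch_2.jpg", "./demos/wooly_mammoth.mp4"):
        print(" ".join(infer_command(path)))
```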
The implementation of 3DGS segmentation refers to SAGA (Segment Any 3D GAussians), GARField, OmniSeg3D, and Gaussian Splatting.
The implementation of video diffusion refers to Runway AI and KLing AI.
The implementation of depth map generation refers to MiDaS, DPT, and Depth Any Video.
We sincerely thank them for their contributions to the community.