Animate3DGS

CV24 Fall Final Project

This is the implementation of our project: Bringing Static 3D Scenes to Life: Language-Guided Video Diffusion for Dynamic 4D Generation

The overall pipeline consists of the following three stages:

  • Stage A: Language-embedded 3DGS segmentation
  • Stage B: Motion video generation for trajectory design
  • Stage C: Depth estimation on the generated motion video

The detailed implementation guide for each stage is provided below.

Stage A: Language-embedded 3DGS segmentation

We implement 3DGS segmentation with reference to the official implementation of SAGA (Segment Any 3D Gaussians). Please refer to language_embedded_3DGS for more information.

First, install the dependencies:

conda env create --file environment.yml
conda activate gaussian_splatting

By default, we use the public ViT-H model for SAM. You can download the pre-trained checkpoint from the official segment-anything repository and put it under ./third_party/segment-anything/sam_ckpt.
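For example, the checkpoint can be fetched directly; the URL below is the ViT-H download link published in the official segment-anything repository (please verify it there before use):

mkdir -p ./third_party/segment-anything/sam_ckpt
wget -P ./third_party/segment-anything/sam_ckpt https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth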

Pre-train the 3D Gaussians

We inherit all attributes from 3DGS; more information about training the Gaussians can be found in their repo.

python train_scene.py -s <path to COLMAP or NeRF Synthetic dataset>
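
For reference, a COLMAP-format dataset is expected to follow roughly the layout used by the original 3DGS codebase (see their repo for the authoritative description):

<dataset>
|---images
|   |---<image 0>
|   |---<image 1>
|   |---...
|---sparse
    |---0
        |---cameras.bin
        |---images.bin
        |---points3D.bin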

Prepare data

Then, to get the sam_masks and corresponding mask scales, run the following commands:

python extract_segment_everything_masks.py --image_root <path to the scene data> --sam_checkpoint_path <path to the pre-trained SAM model> --downsample <1/2/4/8>
python get_scale.py --image_root <path to the scene data> --model_path <path to the pre-trained 3DGS model>

Note that downsampling is sometimes necessary due to limited GPU memory.

If you want to try open-vocabulary segmentation, extract the CLIP features first:

python get_clip_features.py --image_root <path to the scene data>

Train 3D Gaussian Affinity Features

python train_contrastive_feature.py -m <path to the pre-trained 3DGS model> --iterations 10000 --num_sampled_rays 1000

3D Segmentation

Currently, SAGA provides an interactive GUI (saga_gui.py) implemented with dearpygui, as well as a Jupyter notebook (prompt_segmenting.ipynb). To run the GUI:

python saga_gui.py --model_path <path to the pre-trained 3DGS model>

For now, open-vocabulary segmentation is only implemented in the Jupyter notebook. Please refer to prompt_segmenting.ipynb for detailed instructions.

Rendering

After saving segmentation results in the interactive GUI or running the scripts in prompt_segmenting.ipynb, the bitmap of the Gaussians will be saved in ./segmentation_res/your_name.pt (you can choose the name yourself). To render the segmentation results on the training views (i.e., obtain the segmented object by removing the background), run the following command:

python render.py -m <path to the pre-trained 3DGS model> --precomputed_mask <path to the segmentation results> --target scene --segment
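
Here, the precomputed mask is the per-Gaussian bitmap saved in the previous step. A minimal sketch for inspecting it, assuming the .pt file stores a single Boolean tensor with one entry per Gaussian:

import torch

# Hypothetical path: whatever name you chose when saving the segmentation result.
mask = torch.load("./segmentation_res/your_name.pt")
print(mask.shape, mask.dtype)                 # expected: torch.Size([N]), torch.bool
print(int(mask.sum()), "of", mask.numel(), "Gaussians kept")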

To get the 2D rendered masks, run the following command:

python render.py -m <path to the pre-trained 3DGS model> --precomputed_mask <path to the segmentation results> --target seg

You can also render the pre-trained 3DGS model without segmentation:

python render.py -m <path to the pre-trained 3DGS model> --target scene

Example Demo

Stage B: Motion Video Generation for Trajectory Design

We explored three methods that are optimized specifically for generating videos of moving objects. The results are shown below.

All of them share the same weakness: limited zero-shot capability. Eventually, we decided to use commercial diffusion-based video generation models, Runway AI and KLing AI.

Both achieve comparably good results. The main difference is that Runway AI is significantly faster than KLing AI for the same task (typically around 10 seconds vs. 10 minutes), so Runway AI is preferred. There are several important tips for this stage:

  • Use two images as input instead of only one, taking the images of the initial position and the ending position as the first and last frames.
  • The conditions provided must obey the physics of the real world.
  • The prompt should follow the "object + action" format instead of "action + subject" (e.g., "the car moves forward" rather than "move the car forward").

Example Demo

The source code of the methods we tried can be found under "motion video method". The final generated results can be found in the Google Drive folder: https://drive.google.com/drive/folders/10VHhwZ4RnGIvJjm_OQ-MOkmij5cHy1Lv?usp=drive_link.

Stage C: Depth Estimation on Generated Motion Video

To generate depth maps for the generated motion videos, we leverage state-of-the-art monocular depth estimation models such as MiDaS, DPT, or Depth Any Video.

Please refer to the Depth Any Video project page for more information.

Installation

Set up the environment with conda (including support for the Gradio app):

git clone https://github.com/Nightmare-n/DepthAnyVideo
cd DepthAnyVideo

# create env using conda
conda create -n dav python==3.10
conda activate dav
pip install -r requirements.txt
pip install gradio

Inference

To run inference on an image, use the following command:

python run_infer.py --data_path ./demos/arch_2.jpg --output_dir ./outputs/ --max_resolution 2048

To run inference on a video, use the following command:

python run_infer.py --data_path ./demos/wooly_mammoth.mp4 --output_dir ./outputs/ --max_resolution 960
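
As an alternative to Depth Any Video, the MiDaS/DPT models mentioned above can be applied per frame via torch.hub. The following is a minimal sketch of the standard torch.hub usage, not part of this repository's scripts; the frame filename is a placeholder:

import cv2
import torch

# Load MiDaS (DPT_Large) and its matching preprocessing transform from torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform

device = "cuda" if torch.cuda.is_available() else "cpu"
midas.to(device).eval()

# Read one extracted video frame (placeholder filename) and convert BGR -> RGB.
img = cv2.cvtColor(cv2.imread("frame_0000.png"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    pred = midas(transform(img).to(device))
    # Resize the prediction back to the original frame resolution.
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2], mode="bicubic", align_corners=False
    ).squeeze().cpu().numpy()

Note that per-frame estimation like this does not enforce temporal consistency across frames, which is one reason to prefer a video-aware model such as Depth Any Video.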

Example Demo

Acknowledgement

The implementation of 3DGS segmentation refers to SAGA (Segment Any 3D Gaussians), GARField, OmniSeg3D, and Gaussian Splatting.

The implementation of video diffusion refers to Runway AI and KLing AI.

The implementation of depth map generation refers to MiDaS, DPT, and Depth Any Video.

We sincerely thank them for their contributions to the community.
