
Reproducing Video Generation from RealEstate10K as in Fig. 3 #29

Open
sixiaozheng opened this issue Oct 9, 2024 · 2 comments

Comments

@sixiaozheng

I would like to express my sincere appreciation for your impressive work. The approach and results presented in your paper are inspiring, especially the generated videos that align well with the input sequences.

I have a question regarding reproducing the video generation process using RealEstate10K as depicted in Fig. 3 of your paper. Specifically, I would like to know how I can take the first frame of a RealEstate10K video and the corresponding camera pose sequence as input, render the sequence of frames, and then use the diffusion model to generate the final video.

Could you provide some guidance or example code on how to proceed with this pipeline?

@Drexubery
Owner

Hi, thanks for your interest in our work.

We use DUSt3R to process a video clip of 25 frames, which yields the camera pose and point cloud for every frame.

For your test video, you can pass the folder of video frames (it must contain 25 frames) to run_sparse.sh and delete this line:

c2ws = interp_traj(c2ws, n_inserts=ns, device=device)

Then select the frames you want with a simple indexing operation at this line:
pts3d = to_numpy(pts3d)

This should then produce render results aligned with your test video.
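A minimal sketch of that frame-selection step, assuming c2ws and pts3d are per-frame outputs of the DUSt3R stage (the variable names follow the snippets above; the exact shapes and containers in the repo may differ):

import numpy as np

# Assumption: the DUSt3R stage produced one camera-to-world pose and one point
# map per frame of the 25-frame clip. The shapes below are illustrative
# placeholders, not the repo's exact data layout.
c2ws = np.stack([np.eye(4) for _ in range(25)])        # (25, 4, 4) camera poses
pts3d = [np.zeros((480, 640, 3)) for _ in range(25)]   # per-frame point maps (hypothetical shape)

# Keep only the frame(s) to condition on, e.g. just the first frame of the clip.
ref_idx = [0]
c2ws_ref = c2ws[ref_idx]                  # selected pose(s), shape (len(ref_idx), 4, 4)
pts3d_ref = [pts3d[i] for i in ref_idx]   # selected point cloud(s)

# The selected inputs would replace the full per-frame set fed to the renderer,
# while the full 25-pose trajectory is still used as the target camera path.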

@vidit98

vidit98 commented Nov 3, 2024

Hi, thanks for your work. If we want to evaluate the model on the single-view task, then using DUSt3R to estimate the point cloud from all 25 frames might be unfair, right? Estimating the point cloud from 25 frames is much easier than estimating it from a single input frame. I understand that you need to run DUSt3R to get the reference camera trajectory.

My assumption here is that Fig. 3 and Table 1 report results for single-image-conditioned novel view synthesis.
