
Conversation

@Dencel-CleverAI

Processing the spatial upscaler before the temporal upscaler has about the same performance, but uses less RAM and gives better video quality, since artifacts/ghosting are reduced.
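
A minimal sketch of the reordering, assuming a (T, C, H, W) frame tensor. `interpolate_temporal_2x` is a naive frame-blending stand-in for the real RIFE interpolator, and none of these names are actual wgp.py functions:

```python
import torch
import torch.nn.functional as F

def interpolate_temporal_2x(frames: torch.Tensor) -> torch.Tensor:
    """Naive 2x temporal interpolation (frame blending) as a stand-in for RIFE."""
    mids = (frames[:-1] + frames[1:]) / 2                     # midpoints between neighbours
    out = torch.empty((frames.shape[0] * 2 - 1, *frames.shape[1:]), dtype=frames.dtype)
    out[0::2] = frames
    out[1::2] = mids
    return out

def postprocess(frames: torch.Tensor, scale: float = 1.5) -> torch.Tensor:
    """Spatial upscaling first, temporal interpolation second."""
    # 1) Spatial upscale runs on the original, shorter frame sequence.
    frames = F.interpolate(frames, scale_factor=scale, mode="bicubic", align_corners=False)
    # 2) Temporal interpolation afterwards, so intermediate frames are blended
    #    from already-upscaled frames instead of the upscaler magnifying ghosted blends.
    return interpolate_temporal_2x(frames)
```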

@Dencel-CleverAI
Author

Dencel-CleverAI commented Dec 20, 2025

  1. (Done) I added 20 FPS as an option for the model default frame rate. For a model that defaults to 16 FPS, 20 FPS makes the movement 1.25x faster, so it no longer looks like slow motion but like a realistic movement speed. The performance cost to reach the same video length of 5 s is also not too bad (81 frames at 16 FPS vs. 101 frames at 20 FPS); generation takes about 30% longer. The arithmetic behind these numbers is sketched after this list.

  2. (Failed) Moreover, I tried to add 3x temporal upsampling to reach 60 FPS, but I failed, as RIFE only supports power-of-two factors (2, 4, 8, 16, etc.). If someone could make 3x possible, that would be the cherry on top.

  3. (Failed) I also failed to crop the resolution: a 1280x720 video upscaled by a factor of 1.5 ends up at 1920x1088 instead of 1080p. It seems the save_video method always forces the resolution to be divisible by 16.

  4. (Done) I initially failed to make it possible that the last video stays untouched while the newly generated video is upsampled first and only then combined with it when "Continue Last Video" is set (solved in the follow-ups below). That way the whole long video doesn't need to be re-upsampled, which would require more and more RAM and take longer and longer; instead only the new part is processed, which stays manageable.

  5. (Failed) I haven't figured out a way to make custom checkpoints automatically usable, especially how to use integer-quantized .gguf checkpoints instead of .safetensors.
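
To make the numbers in points 1 and 3 concrete, here is a small sketch of the frame-count and resolution arithmetic. It assumes the model needs 4n+1 frames and that save_video rounds each dimension up to a multiple of 16; both are assumptions inferred from the observed numbers, not verified against the code:

```python
import math

def frames_for_duration(seconds: float, fps: int) -> int:
    """Smallest 4n+1 frame count covering the duration (assumed model constraint)."""
    n = math.ceil((seconds * fps - 1) / 4)
    return 4 * n + 1

def align16(x: int) -> int:
    """Round up to the next multiple of 16 (assumed save_video behaviour)."""
    return math.ceil(x / 16) * 16

print(frames_for_duration(5, 16))   # 81 frames at 16 FPS
print(frames_for_duration(5, 20))   # 101 frames at 20 FPS
print(align16(int(720 * 1.5)))      # 1088 instead of 1080
```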

Other problems I noticed:
A. The colors of the end frame of the first video and the start frame of the second video don't match. I think it has to do with the VAE encoding/decoding. However, when adding an end image, the video transitions to the correct color of that image, so why is the start wrong?

B. The motion does not match either, as the second video only gets the last frame of the first video as information. I tried passing multiple frames, but then it only generates weird noisy artifacts at the beginning. Something like motion vectors would be nice.

C. The quality of the video degrades further and further with each new generation. The only "trick" I found is to use an end image every 10 s (every second generation) to avoid this degradation, as the model re-anchors on this high-quality reference image.

@deepbeepmeep
Owner

Thx, nice idea. Do you have some sample videos that compare the spatial/temporal and temporal/spatial order?

@Dencel-CleverAI
Author

Dencel-CleverAI commented Dec 23, 2025

@deepbeepmeep Alright, will do.

I have seen in the code that there is a way to apply post-processing to a video (wgp.py line 4576). So can I generate the video without upsampling first and then apply the upsampling later? I don't see how to do that in the UI.

@Dencel-CleverAI
Author

I can't upload them here, as they are a bit too big.
https://drive.google.com/file/d/1o6Q9bfS9HRCi5zYlF8vv5MEsNKvwgNmm

I have now generated a few, always from scratch, with the specific spatial-vs-temporal order. The morphing and noise from the model itself (before upsampling) make it hard to see, but have a look at the dude's face in the I2V sample.

In the end, even if you don't see a difference here due to the noise, RAM usage is a bit lower in some cases.

@Dencel-CleverAI
Author

Dencel-CleverAI commented Dec 25, 2025

Belongs to point 4:
I was able to implement that the upsampling is only applied to the new generation, which is then merged with the last video. However, RAM usage still goes through the roof the longer the video gets, and the quality degrades with each further generation too. So it's still not usable to generate long upsampled videos. I have to find out what eats up the RAM.

@Dencel-CleverAI
Author

Dencel-CleverAI commented Dec 26, 2025

I exposed the Fit Canvas value, as it was hard-coded to 0, which means always "Resolution Budget (pixels will be reallocated to preserve the input's W/H ratio)". Besides that, one can now also choose "Outer Box Resolution (one dimension may be less to preserve the video's W/H ratio)" or "Output Resolution (input images will be cropped if the W/H ratio differs)".
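
A rough sketch of how I read the three fit modes; the function and the mode numbering are illustrative only, and the actual logic in wgp.py may differ:

```python
import math

def fit_canvas(in_w: int, in_h: int, out_w: int, out_h: int, mode: int) -> tuple[int, int]:
    """mode 0: resolution budget - keep the input W/H ratio, target the same pixel count.
    mode 1: outer box - fit inside out_w x out_h, one dimension may end up smaller.
    mode 2: output resolution - exactly out_w x out_h, cropping the input if ratios differ."""
    ratio = in_w / in_h
    if mode == 0:
        h = math.sqrt(out_w * out_h / ratio)
        return round(ratio * h), round(h)
    if mode == 1:
        scale = min(out_w / in_w, out_h / in_h)
        return round(in_w * scale), round(in_h * scale)
    return out_w, out_h

print(fit_canvas(1280, 720, 1920, 1080, 0))  # (1920, 1080): same ratio, full pixel budget
```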

Belongs to point 4:
I have found out that spatial upsampling RAM usage is low and temporal is higher. Processing a bigger source video uses even more, but saving all of it with save_video takes the most RAM. I think we should save the generated video to disk first and then merge it with the last video via ffmpeg, to avoid creating a huge tensor in RAM.
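
As a rough back-of-the-envelope check of where the RAM goes: the decoded video lives as an uncompressed frame tensor whose size grows linearly with the length (float32 RGB is assumed here; the actual dtype in the pipeline may differ), and torch.cat allocates a new tensor on top, so the old and the combined video briefly sit in memory at the same time:

```python
def video_tensor_gb(frames: int, width: int, height: int, bytes_per_value: int = 4) -> float:
    """Approximate size of an uncompressed (T, 3, H, W) video tensor in GB."""
    return frames * 3 * width * height * bytes_per_value / 1024**3

# 15 s at 40 FPS and 1920x1088 is already ~14 GB before any extra copy.
print(round(video_tensor_gb(15 * 40, 1920, 1088), 1))
```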

@Dencel-CleverAI
Author

Dencel-CleverAI commented Dec 27, 2025

Belongs to point 4:
I was finally able to fix the RAM problem with continuing videos. Instead of torch.cat, which creates an ever bigger tensor the longer the videos get and eats a lot of RAM, I used ffmpeg on the saved video sample (5 s) and merged it with the other video (10 s or longer). I kept the 5 s generation as a preview alongside the combined one, but only the combined one is shown in the UI, so that "Continue Last Video" works flawlessly.
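
A minimal sketch of that kind of merge, using ffmpeg's concat demuxer so the clips are streamed from disk rather than concatenated as one tensor in RAM (file names are placeholders, and this is not the literal code from the patch):

```python
import subprocess
import tempfile

def concat_videos(previous_path: str, new_path: str, output_path: str) -> None:
    """Merge two already-encoded clips with ffmpeg's concat demuxer (stream copy, no re-encode)."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(f"file '{previous_path}'\nfile '{new_path}'\n")
        list_path = f.name
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_path, "-c", "copy", output_path],
        check=True,
    )

# concat_videos("last_video.mp4", "new_5s_segment.mp4", "combined.mp4")
```

Note that stream copy only works when both clips share the same codec parameters, resolution and frame rate.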

No more RAM spikes and much longer upsampled videos are now possible!

@Dencel-CleverAI Dencel-CleverAI changed the title Changed order of spatial and temporal upscaling Spatial and temporal upscaling order; Continue Video less RAM Dec 27, 2025
@Dencel-CleverAI
Author

Dencel-CleverAI commented Dec 30, 2025

@deepbeepmeep You're welcome to take this into the main branch now. I haven't found any more bugs, and the remaining problems are either too complex (I would have to change too much deep down in the code) or will be addressed by future AI models anyway.

These are the changes:

I. Spatial upsampling is now done before temporal upsampling -> less RAM usage and fewer ghosting artifacts

II. Added 20 FPS as a model default to choose from the dropdown -> speeds up the movement by 1.25x and makes it feel realistic instead of slow motion

III. Exposed Fit_Canvas in the UI and saved it as a model-specific setting -> the user can choose whether the video should be scaled or cropped to match the input image/video

IV. Each individual step of the video generation is exposed in more detail in the progress bar -> the user now knows better which step takes long or uses a lot of resources during generation

V. If continue (last) video is active, it first checks whether the result of the generation would have the correct resolution and FPS before starting the actual generation process; otherwise it stops and returns a user-friendly error

VI. If continue (last) video is active, only the newly generated video is upsampled, not the last video anymore -> avoids big RAM spikes, since only short videos are upsampled instead of ever longer ones

VII. If continue (last) video is active, only the newly generated video is saved and then combined with the last video via ffmpeg -> ffmpeg only uses ~2 GB of RAM and is very fast, which enables continued video generation at any length; previously, an ever bigger tensor ate up more and more RAM, took longer and longer to process, and I could not get past 15 s of video length (3x 5 s videos)

With all these changes, I am now able to create 1920x1088 videos at 40 FPS (60 would require 3x RIFEx temporal interpolation) and have easily surpassed 2 min of length without quality loss by feeding an end image every second generation (after every 10 s). Unfortunately, this doesn't fix the color discrepancy (VAE encoding/decoding) or the motion discrepancy (only mitigated by removing the last and first frame) between generations, but that is a problem for future AI models to solve.

@deepbeepmeep
Owner

I think with the latest RAM optimizations this may no longer be needed. Have you had the chance to compare?

@Dencel-CleverAI
Author

I think with the latest RAM optimizations this may no longer be needed. Have you had the chance to compare?

Here it is, if you mean this comparison:

I can't upload them here, as they are a bit too big. https://drive.google.com/file/d/1o6Q9bfS9HRCi5zYlF8vv5MEsNKvwgNmm

I have now generated a few, always from scratch, with the specific spatial-vs-temporal order. The morphing and noise from the model itself (before upsampling) make it hard to see, but have a look at the dude's face in the I2V sample.

In the end, even if you don't see a difference here due to the noise, RAM usage is a bit lower in some cases.

As long as your latest update lets you continue videos over and over again without running out of RAM, you're good to go.
