Spatial and temporal upscaling order; Continue Video less RAM #1234
base: main
Conversation
Other problems I noticed:

B. The motion does not match, because the second video only gets the last frame of the first video as information. I tried passing multiple frames, but then it only generates weird noisy artifacts at the beginning, so something like motion vectors would be nice.

C. The quality of the video degrades further and further with each new generation. The only "trick" I came across is to use an end image every 10s (i.e. every second generation) to avoid this degradation, since the model re-anchors itself to this high-quality provided image.
thx, nice idea. Do you have some sample videos that compare spatial/temporal vs. temporal/spatial order?
@deepbeepmeep Alright, will do. I have seen in the code that there is a way to edit the post-processing of a video (wgp.py line 4576). So can I generate the video without upsampling first and then apply the upsampling later? I don't see how I can do that in the UI.
I can't upload them here, as they are a bit too big. I have generated a few now, always from scratch, with the specific spatial-vs-temporal order. The morphing and noise from the model itself (before upsampling) make it hard to see, but have a look at the dude's face in the I2V sample. In the end, even if you don't see a difference here due to the noise, the RAM usage is a bit lower in some cases.
Belongs to point 4:
I exposed the Fit Canvas value, as it was fixed to 0, which means always "Resolution Budget" (pixels will be reallocated to preserve the input's W/H ratio). Besides that, one can now choose either "Outer Box Resolution" (one dimension may be smaller to preserve the video's W/H ratio) or "Output Resolution" (input images will be cropped if the W/H ratio is different).
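For reference, here is a minimal sketch of how the three Fit Canvas modes could map input dimensions to output dimensions. The function name, the rounding, and the mode values other than 0 are assumptions for illustration, not the actual wgp.py logic:

```python
# Hypothetical sketch of the three Fit Canvas strategies described above;
# only fit_canvas == 0 ("Resolution Budget") is confirmed by the PR text,
# the other mode values and the rounding are illustrative assumptions.

def fit_dimensions(src_w, src_h, target_w, target_h, fit_canvas=0):
    """Return an output (width, height) for a source of src_w x src_h."""
    if fit_canvas == 0:
        # Resolution Budget: keep the source aspect ratio, but reallocate
        # pixels so the output has roughly target_w * target_h pixels.
        budget = target_w * target_h
        scale = (budget / (src_w * src_h)) ** 0.5
        return round(src_w * scale), round(src_h * scale)
    elif fit_canvas == 1:
        # Outer Box Resolution: fit inside target_w x target_h; one dimension
        # may come out smaller to preserve the source W/H ratio.
        scale = min(target_w / src_w, target_h / src_h)
        return round(src_w * scale), round(src_h * scale)
    else:
        # Output Resolution: force the exact target size; the input is
        # cropped elsewhere when its aspect ratio differs from the target.
        return target_w, target_h
```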
Belongs to point 4: No more RAM spikes and much longer upsampled videos are now possible!
@deepbeepmeep You're welcome to take this into the main branch now. I haven't found any more bugs, and the remaining problems are either too complex (I would have to change too much deep down in the code) or will be addressed by future AI models anyway. These are the changes:

I. Spatial upsampling is now done before temporal upsampling -> less RAM usage and fewer ghosting artifacts

II. Added 20 FPS as a model default to choose from the dropdown -> speeds up the movement by 1.25x and makes it feel realistic instead of slow-mo

III. Exposed Fit_Canvas in the UI and saved it as a model-specific setting -> the user can choose whether the video should be scaled or cropped to match the input image/video

IV. Each small step of the video generation is exposed in more detail in the progress bar -> the user now knows better what takes so long or uses a lot of resources during generation

V. If Continue (Last) Video is active, it first checks whether the result of the generation would have the correct resolution and FPS before starting the actual generation; otherwise it stops and returns a user-friendly error

VI. If Continue (Last) Video is active, only the newly generated video is upsampled, not the last video anymore -> avoids big RAM spikes by upsampling only short videos instead of long ones

VII. If Continue (Last) Video is active, only the newly generated video is saved, and it is then combined with the last video via ffmpeg -> ffmpeg only uses ~2 GB of RAM and is very fast, which enables continued video generation of any length; whereas before, a bigger and bigger tensor ate up more and more RAM, took longer and longer to process, and I could not get past 15s of video length (3x 5s videos)

With all these changes, I am now able to create 1920x1088 videos at 40 FPS (60 requires 3x RIFEx temporal interpolation) and have easily surpassed 2 min of length without quality loss by feeding an end image every second generation (after every 10s). Unfortunately, this doesn't fix the color (VAE encoding/decoding) and motion (only mitigated by removing the last and first frame) discrepancies between video generations, but that is a problem for future AI models to solve.
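To illustrate point VII, here is a minimal sketch of stitching a newly generated segment onto the previous video with ffmpeg's concat demuxer instead of growing one tensor in RAM. The helper name and file handling are assumptions, not the actual wgp.py code; stream copy also requires that both files share codec, resolution, and FPS, which matches the pre-generation check described in point V:

```python
# Sketch: append a new segment to the previous output with ffmpeg's concat
# demuxer. Stream copy (-c copy) avoids re-encoding, so RAM usage stays low
# regardless of the total video length.
import os
import subprocess
import tempfile

def concat_videos(previous_path, new_segment_path, output_path):
    """Concatenate two videos losslessly (same codec/resolution/FPS required)."""
    # The concat demuxer reads a text file listing the inputs in order.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(f"file '{os.path.abspath(previous_path)}'\n")
        f.write(f"file '{os.path.abspath(new_segment_path)}'\n")
        list_path = f.name
    try:
        subprocess.run(
            ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
             "-i", list_path, "-c", "copy", output_path],
            check=True,
        )
    finally:
        os.unlink(list_path)
```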
I think with the latest RAM optimizations this may no longer be needed. Have you had the chance to compare?
Here it is, if you mean this comparison:
As long as you can continue videos over and over again without exceeding the RAM with your latest update, you're good to go.
Processing the spatial before the temporal upscaler has about the same performance, but uses less RAM and gives better video quality, as the artifacts/ghosting are reduced.
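As a rough illustration of the ordering difference, here is a toy sketch with placeholder functions standing in for the real spatial upscaler and RIFE temporal interpolation. It only tracks frame counts and sizes and is not the actual wgp.py implementation:

```python
# Toy stand-ins: frames are represented only by their (width, height).

def upscale_spatial(frames, scale=2):
    # Placeholder spatial upscaler: each frame becomes scale x larger.
    return [(w * scale, h * scale) for (w, h) in frames]

def interpolate_temporal(frames):
    # Placeholder RIFE-like step: insert one interpolated frame per pair.
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append(((a[0] + b[0]) // 2, (a[1] + b[1]) // 2))  # "interpolated" frame
    out.append(frames[-1])
    return out

clip = [(832, 480)] * 81  # e.g. 81 frames at 832x480

spatial_first = interpolate_temporal(upscale_spatial(clip))
temporal_first = upscale_spatial(interpolate_temporal(clip))

# Same final frame count and size either way; the difference is how many
# frames pass through the spatial upscaler (81 vs. 161 here).
print(len(spatial_first), spatial_first[0])
print(len(temporal_first), temporal_first[0])
```

Under the spatial-first ordering, the upscaler only ever touches the original 81 frames, which is consistent with the lower RAM usage reported above.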