I've managed to run run_tokenflow_pnp.py on a small excerpt of my video (5 s), and it looks really cool, but when I run it on the full one (5 min) it crashes with a CUDA OOM error even when I drop the batch size down to 1.
This scaling of memory with video length, presumably caused by the extended attention, seems like a major limitation of the method, and it is not highlighted in the discussion section or anywhere else in the paper (as far as I can tell).
Is it possible to offload part of the attention computation to the CPU so that the number of frames is not a bottleneck?
That's exactly what I did in #32 (in a way).
It handled longer sequences, but not unlimited ones, since the whole attention data still has to be fed back to the GPU before the denoising-latents step (and I didn't manage to make that step work in batches).
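For reference, here is a minimal sketch of the CPU-offloading idea being discussed, not TokenFlow's actual code: keep the (large) key/value tensors on the CPU and stream them to the GPU in chunks, combining partial results with a running log-sum-exp so the output matches full attention exactly. The function name and shapes are illustrative assumptions.

```python
import torch

def chunked_attention(q, k_cpu, v_cpu, chunk_size=4096):
    """Attention where K/V live on the CPU and are streamed to q's
    device in chunks, trading transfer time for GPU memory.
    q: (n_q, d) on GPU; k_cpu, v_cpu: (n_kv, d) on CPU."""
    scale = q.shape[-1] ** -0.5
    # Running output and log-sum-exp, so chunks combine exactly
    # (online softmax, as in memory-efficient attention).
    out = torch.zeros_like(q)
    lse = torch.full((q.shape[0], 1), float("-inf"), device=q.device)
    for start in range(0, k_cpu.shape[0], chunk_size):
        # Move only one chunk of K/V onto the GPU at a time.
        k = k_cpu[start:start + chunk_size].to(q.device, non_blocking=True)
        v = v_cpu[start:start + chunk_size].to(q.device, non_blocking=True)
        scores = (q @ k.T) * scale                       # (n_q, chunk)
        chunk_lse = torch.logsumexp(scores, dim=-1, keepdim=True)
        new_lse = torch.logaddexp(lse, chunk_lse)
        # Rescale previous accumulation, then add this chunk's share.
        out = out * (lse - new_lse).exp() + (scores - new_lse).exp() @ v
        lse = new_lse
    return out
```

Peak GPU memory then depends on `chunk_size` rather than on the total number of frames, which is the bottleneck raised above; the remaining problem mentioned in #32 is doing the same streaming at the denoising step itself.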