Many thanks to the author for this outstanding work.
I am very inspired by it.
I ran into some questions while reading your paper and code on Attention Sharing across Timesteps (AST). Taking DiT-XL/2 in Fig. 5 as an example, I noticed that most layers in t0-t9 use AST. Consider layer 20 at t9: if it reuses the result cached at t8, or even as far back as t0, then the attention computations of layer 20 between the cached timestep and t9 should be redundant, similar to Fig. 3 in DeepCache, because the attention map used at layer 20 of t9 would be exactly the same as the one cached earlier. I would like to know how the compression method obtained in Fig. 5 determines from which timestep the result is cached. Is it the most recent one?
I would be very glad to get your answer!
Translated with DeepL.com (free version)
@HayuZH Your understanding is correct. One thing to note is that for AST we cache the output of the attention layer rather than the attention map. If AST is used for layer 20 over t0-t9, the attention output of t0 is cached and reused in t1-t9 (the attention computation for t1-t9 is skipped), as in the sketch below.
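A minimal sketch of this caching behavior, assuming a per-layer, per-timestep boolean AST schedule like the one visualized in Fig. 5. The names here (`CachedAttention`, `ast_cache`, `use_ast`) are illustrative only and are not taken from the actual repository.

```python
# Minimal sketch of Attention Sharing across Timesteps (AST).
# Assumption: `use_ast[t]` says whether this layer may reuse a cached
# attention output at timestep t (names are hypothetical).
import torch
import torch.nn as nn

class CachedAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ast_cache = None  # output of the most recent timestep where attention was computed

    def forward(self, x, use_ast: bool):
        if use_ast and self.ast_cache is not None:
            # AST: skip the attention computation and reuse the cached output.
            return self.ast_cache
        out, _ = self.attn(x, x, x)
        self.ast_cache = out  # cache for subsequent AST timesteps
        return out

# Toy denoising loop over timesteps t0..t9 for a single layer:
# attention is computed at t0, then its output is reused at t1..t9.
layer = CachedAttention(dim=64)
x = torch.randn(1, 16, 64)          # (batch, tokens, dim)
use_ast = [False] + [True] * 9       # compute at t0, share across t1-t9
for t, share in enumerate(use_ast):
    y = layer(x, use_ast=share)
```

In this sketch the cache always holds the output of the most recent timestep at which attention was actually computed; with AST enabled for t1-t9, that is the output from t0, matching the answer above.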