Many thanks to the author for this outstanding work.
I am very inspired by it.
I ran into some questions while reading your paper and code on Attention Sharing across Timesteps (AST). Taking DiT-XL/2 in Fig. 5 as an example, I noticed that most layers in t0-t9 use AST. Consider layer 20 at t9: if it reuses the result cached at t8, or even as far back as t0, then the attention computations of layer 20 between the cached timestep and t9 should be redundant, similar to Fig. 3 in DeepCache, because the attention map used at layer 20 of t9 would be exactly the same as the one cached earlier. I would like to know how the compression method obtained in Fig. 5 determines from which timestep the result is cached. Is it the most recent one?
I would be very glad to get your answer!
Translated with DeepL.com (free version)
@HayuZH Your understanding is correct. One thing to note is that for AST we cache the output of the attention layer rather than the attention map. If AST is used for layer 20 over t0-t9, the attention output of t0 is cached and reused in t1-t9 (the attention computation for t1-t9 is skipped), as in the sketch below.
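A minimal sketch of this caching behavior, assuming a per-layer, per-timestep boolean AST schedule like the one visualized in Fig. 5. The names here (`CachedAttention`, `ast_cache`, `use_ast`) are illustrative only and are not taken from the actual repository.

```python
# Minimal sketch of Attention Sharing across Timesteps (AST).
# Assumption: `use_ast[t]` says whether this layer may reuse a cached
# attention output at timestep t (names are hypothetical).
import torch
import torch.nn as nn

class CachedAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ast_cache = None  # output of the most recent timestep where attention was computed

    def forward(self, x, use_ast: bool):
        if use_ast and self.ast_cache is not None:
            # AST: skip the attention computation and reuse the cached output.
            return self.ast_cache
        out, _ = self.attn(x, x, x)
        self.ast_cache = out  # cache for subsequent AST timesteps
        return out

# Toy denoising loop over timesteps t0..t9 for a single layer:
# attention is computed at t0, then its output is reused at t1..t9.
layer = CachedAttention(dim=64)
x = torch.randn(1, 16, 64)          # (batch, tokens, dim)
use_ast = [False] + [True] * 9       # compute at t0, share across t1-t9
for t, share in enumerate(use_ast):
    y = layer(x, use_ast=share)
```

In this sketch the cache always holds the output of the most recent timestep at which attention was actually computed; with AST enabled for t1-t9, that is the output from t0, matching the answer above.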