Per-occurrence KV cache for transformer_block_repeat_config #19324
Open
pssrawat wants to merge 2 commits into pytorch:main from
Conversation
Summary:
Add configurable block repetition to MultimodalTransformer, enabling weight-shared depth scaling. A contiguous range of transformer layers can now be executed multiple times with shared weights.
Add block_repeat_config field to ModelArgs (list of {start, end, count} dicts)
Example params.json:
"block_repeat_config": [{"start": 5, "end": 10, "count": 2}]
Reviewed By: AdithyaSagar007
Differential Revision: D102393826
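For illustration, below is a minimal sketch (not the actual ExecuTorch implementation) of how a layer schedule could be expanded from such a config. The function name `build_layer_schedule`, the `n_layers` argument, and the assumption that `end` is exclusive are all illustrative, not taken from this diff.

```python
from typing import Dict, List


def build_layer_schedule(n_layers: int, block_repeat_config: List[Dict[str, int]]) -> List[int]:
    """Expand per-range repeats into a flat list of layer indices in execution
    order; repeated ranges reuse the same layer objects (shared weights)."""
    schedule: List[int] = []
    i = 0
    while i < n_layers:
        # Find a repeat range starting at this layer, if any.
        repeat = next((r for r in (block_repeat_config or []) if r["start"] == i), None)
        if repeat is None:
            schedule.append(i)
            i += 1
        else:
            block = list(range(repeat["start"], repeat["end"]))  # assumes `end` is exclusive
            schedule.extend(block * repeat["count"])
            i = repeat["end"]
    return schedule


# With the example config above, layers 5-9 are visited twice back to back:
# build_layer_schedule(12, [{"start": 5, "end": 10, "count": 2}])
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 10, 11]
```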
Summary: Currently, when a TransformerBlock appears multiple times in MultimodalTransformer.layer_schedule (via `args.transformer_block_repeat_config`), each visit to that layer reads and writes the same `self.attention.kv_cache` buffer. The repeated layer therefore shares its K/V history across both visits; this is a "weight-shared loop with shared KV", which is not numerically equivalent to a physically unrolled N-layer model where each duplicated layer slot owns its own K/V cache.

This diff adds an opt-in path so that each occurrence in the schedule can use its own KV cache buffer while still sharing the layer's weight Parameters, giving the same numerical inference behavior as lowering an unrolled checkpoint. The model size (with transformer_block_repeat_config) remains the same as the original model.

Differential Revision: D103962616
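As a rough illustration of the opt-in path described above, the sketch below allocates one K/V buffer per schedule occurrence while the layer weights stay shared. The class and helper names (`KVCache`, `make_per_occurrence_caches`) are hypothetical and are not the identifiers used in this diff.

```python
import torch
import torch.nn as nn


class KVCache(nn.Module):
    """One K/V buffer pair; each occurrence in the schedule owns its own instance."""

    def __init__(self, max_seq_len: int, n_heads: int, head_dim: int):
        super().__init__()
        self.register_buffer("k", torch.zeros(1, n_heads, max_seq_len, head_dim))
        self.register_buffer("v", torch.zeros(1, n_heads, max_seq_len, head_dim))


def make_per_occurrence_caches(layer_schedule, max_seq_len, n_heads, head_dim):
    """Allocate one KVCache per schedule position, so a weight-shared layer
    that appears twice gets two independent K/V histories."""
    return nn.ModuleList(KVCache(max_seq_len, n_heads, head_dim) for _ in layer_schedule)


# In the forward pass, caches would then be indexed by schedule position
# rather than by layer id, e.g.:
#
#   for pos, layer_idx in enumerate(layer_schedule):
#       h = layers[layer_idx](h, kv_cache=caches[pos])  # shared weights, per-occurrence cache
```

Only the K/V buffers are duplicated per occurrence; the attention, feed-forward, and norm Parameters remain a single shared copy, which is consistent with the statement that the model size does not change.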
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19324
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 Cancelled Jobs, 2 Unrelated Failures as of commit 167842a with merge base 1debeb6.
CANCELLED JOBS - The following jobs were cancelled. Please retry.
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Contributor
@pssrawat has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103962616.
Contributor
kimishpatel requested changes on May 6, 2026
Review automatically exported from Phabricator review in Meta.