Integrate xlstm cleanly. #35377

Open · wants to merge 147 commits into main from integrate_xlstm_clean

Conversation

kpoeppel

What does this PR do?

This PR integrates xLSTM via the xlstm library, including certain optimizations (optionally using torch.compile and CUDA graphs for speed-ups). This enables using NX-AI/xLSTM-7b without a special fork of transformers.
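For illustration, usage after the integration should look roughly like this (a sketch assuming the model is registered with the auto classes; the generation settings below are just examples):

```python
# Minimal usage sketch, assuming the checkpoint is served through the standard
# auto classes once this PR is merged (generation settings are illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NX-AI/xLSTM-7b")
model = AutoModelForCausalLM.from_pretrained("NX-AI/xLSTM-7b", torch_dtype=torch.bfloat16)

inputs = tokenizer("xLSTM is a recurrent architecture that", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```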

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests? Yes, I adapted the tests of the recurrent Mamba2 model.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker

@kpoeppel kpoeppel force-pushed the integrate_xlstm_clean branch 4 times, most recently from 4a4e347 to fe5759b Compare December 21, 2024 13:41
@stevhliu (Member) left a comment

Thanks!

@kpoeppel (Author)

How can this PR keep failing inside other models' tests, and how did those failing tests get into `main`?

@Cyrilvallez Cyrilvallez self-assigned this Jan 13, 2025
@Cyrilvallez Cyrilvallez self-requested a review January 13, 2025 10:55
@gante (Member) left a comment

comment on the text-generation-level changes: the xLSTMCache class is missing proper documentation (see MambaCache for an example) and should be added to __init__.py
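For reference, a quick way to see the pattern being asked for (MambaCache is used here only as the example to imitate):

```python
# MambaCache is importable from the top level and carries a full docstring;
# the request is for xLSTMCache to follow the same pattern.
from transformers import MambaCache

help(MambaCache)  # prints the documented Arguments / Attributes / Example sections
```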

@ArthurZucker (Collaborator) left a comment

Super grateful for the PR! In general happy to have more arch, but we want to make sure we are aligned in terms of what can be done!
🤗

Comment on lines 219 to 248
xlstm_block_config = xLSTMLargeConfig(
    vocab_size=config.vocab_size,
    embedding_dim=config.embedding_dim,
    num_blocks=config.num_blocks,
    num_heads=config.num_heads,
    use_bias=config.use_bias,
    add_out_norm=config.add_out_norm,
    norm_eps=config.norm_eps,
    norm_reduction_force_float32=config.norm_reduction_force_float32,
    # mlstm_layer
    qk_dim_factor=config.qk_dim_factor,
    v_dim_factor=config.v_dim_factor,
    # mlstm backend
    chunkwise_kernel=config.chunkwise_kernel,
    sequence_kernel=config.sequence_kernel,
    step_kernel=config.step_kernel,
    mode=config.mode,
    chunk_size=config.chunk_size,
    return_last_states=config.return_last_states,
    autocast_kernel_dtype=config.autocast_kernel_dtype,
    eps=config.eps,
    inference_state_dtype=config.inference_state_dtype,
    # feedforward
    ffn_proj_factor=config.ffn_proj_factor,
    ffn_round_up_to_multiple_of=config.ffn_round_up_to_multiple_of,
    # capping
    gate_soft_cap=config.gate_soft_cap,
    output_logit_soft_cap=config.output_logit_soft_cap,
    weight_mode=config.weight_mode,
)
Collaborator

we should align xLSTMLargeConfig to match the inputs of mLSTMBlock

Collaborator

and then we would not have to do this conversion here

Author

There are still slight deviations of the xLSTMLargeConfig compared to the xLSTMConfig in configuration_xlstm.py, so I think this conversion is actually necessary.

Collaborator

Some refactoring needed for the camel casing of classes!

Author

Is the casing ok as it is now?

@superbock superbock force-pushed the integrate_xlstm_clean branch 3 times, most recently from 4545076 to 69b9d5c Compare March 27, 2025 16:01
@kpoeppel (Author) commented Jul 2, 2025

> Hey! Super super sorry for the delay! Here is a new round of reviews! Let me know if something is unclear 🤗 Still mostly concerned about the Cache (is it needed??), unnecessary abstractions, single-letter variables, and asserts 🤗

Hey! Sorry for the delay from my side as well. I integrated all your comments. I think the xLSTMCache is still necessary (like a MambaCache or a KV cache), but as it has a different structure than these, we need a separate class.
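To illustrate the structural difference (shapes follow the xLSTMCache code quoted further down in this thread; the example sizes are made up):

```python
import torch

# Per-layer recurrent state held by the xLSTM cache (B=batch, NH=heads,
# DK/DV=key/value head dims); example sizes only, not the 7B model's config.
B, NH, DK, DV = 1, 8, 64, 128
layer_state = (
    torch.zeros(B, NH, DK, DV),  # C: matrix memory
    torch.zeros(B, NH, DK),      # n: normalizer state
    torch.zeros(B, NH, 1),       # m: stabilizer state
)
# Unlike a KV cache, nothing here grows with sequence length; unlike MambaCache
# there is no conv_state/ssm_state pair, hence the separate class.
```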

@Cyrilvallez (Member) left a comment

Hey! Sorry for the delay! Here is a new review 🤗 Let me know if something is still unclear!

Comment on lines 2176 to 2245
class xLSTMCache:
    """
    Cache for the xLSTM model, which does not have an attention mechanism or key-value states.

    Arguments:
        config (`PretrainedConfig`):
            The configuration file defining the shape-related attributes required to initialize the static cache.
        max_batch_size (`int`):
            The batch size with which the model will be used.
        dtype (`torch.dtype`, *optional*, defaults to `torch.bfloat16`):
            The default `dtype` to use when initializing the layer.
        device (`torch.device` or `str`, *optional*):
            The device on which the cache should be initialized. Should be the same as the layer.

    Attributes:
        seqlen_offset: int
        dtype: torch.dtype

    Example:

        ```python
        >>> from transformers import AutoTokenizer, xLSTMForCausalLM, xLSTMCache

        >>> model = xLSTMForCausalLM.from_pretrained("NX-AI/xLSTM-7b")
        >>> tokenizer = AutoTokenizer.from_pretrained("NX-AI/xLSTM-7b")

        >>> inputs = tokenizer(text="I am an xLSTM", return_tensors="pt")

        >>> # Prepare a cache class and pass it to model's forward
        >>> cache_params = xLSTMCache(config=model.config, max_batch_size=1, device=model.device, dtype=model.dtype)
        >>> outputs = model(**inputs, cache_params=cache_params, use_cache=True)
        >>> outputs.cache_params
        xLSTMCache()
        ```
    """

    def __init__(
        self,
        config: PretrainedConfig,
        max_batch_size: int,
        dtype: torch.dtype = torch.bfloat16,
        device: Optional[str] = None,
        **kwargs,
    ):
        self.seqlen_offset = 0
        self.dtype = dtype
        self.config = config
        self.rnn_state = {
            layer: (
                torch.zeros(
                    [max_batch_size, config.num_heads, config.qk_head_dim, config.v_head_dim],
                    dtype=dtype,
                    device=device,
                ),
                torch.zeros([max_batch_size, config.num_heads, config.qk_head_dim], dtype=dtype, device=device),
                torch.zeros([max_batch_size, config.num_heads, 1], dtype=dtype, device=device),
            )
            for layer in range(config.num_hidden_layers)
        }

    def reset(self):
        self.rnn_state = {
            layer: (
                torch.zeros_like(self.rnn_state[layer][0]),
                torch.zeros_like(self.rnn_state[layer][1]),
                torch.zeros_like(self.rnn_state[layer][2]),
            )
            for layer in self.rnn_state
        }


Member

All right, but it should be moved to the modeling file instead then, not general cache_utils

Comment on lines 1286 to 1344
for param_name, param in self.named_parameters():
    if "bias" in param_name and param is not None:
        torch.nn.init.zeros_(param)
    elif "weight" in param_name and param is not None and param.ndim > 1:
        small_init_method(self.config.hidden_size)(param)

small_init_method(self.config.hidden_size)(self.embeddings.weight)
torch.nn.init.ones_(self.out_norm.weight)

for block in self.blocks:
    torch.nn.init.ones_(block.mlstm_layer.multihead_norm.weight)
    torch.nn.init.ones_(block.norm_mlstm.weight)
    torch.nn.init.ones_(block.norm_ffn.weight)

    wang_init_method(dim=block.ffn.up_proj_dim, n_layers=self.config.num_hidden_layers)(
        block.ffn.proj_down.weight
    )
    wang_init_method(dim=self.config.hidden_size, n_layers=self.config.num_hidden_layers)(
        block.mlstm_layer.out_proj.weight
    )

    if self.config.weight_mode == "single":
        torch.nn.init.zeros_(block.mlstm_layer.ogate_preact.weight)
        torch.nn.init.zeros_(block.mlstm_layer.igate_preact.weight)
        torch.nn.init.zeros_(block.mlstm_layer.fgate_preact.weight)

        with torch.no_grad():
            block.mlstm_layer.igate_preact.bias.copy_(
                -10.0 * torch.ones_like(block.mlstm_layer.igate_preact.bias)
            )
            block.mlstm_layer.fgate_preact.bias.copy_(
                torch.linspace(
                    3.0,
                    6.0,
                    block.mlstm_layer.fgate_preact.bias.shape[-1],
                ).to(
                    device=block.mlstm_layer.fgate_preact.bias.device,
                    dtype=block.mlstm_layer.fgate_preact.bias.dtype,
                )
            )
    elif self.config.weight_mode == "fused":
        torch.nn.init.zeros_(block.mlstm_layer.ifgate_preact.weight)

        with torch.no_grad():
            block.mlstm_layer.ifgate_preact.bias[: self.config.num_heads] += (
                -block.mlstm_layer.ifgate_preact.bias[: self.config.num_heads]
                - 10.0 * torch.ones_like(block.mlstm_layer.igate_preact.bias)
            )
            block.mlstm_layer.ifgate_preact.bias[: self.config.num_heads] += (
                -block.mlstm_layer.ifgate_preact.bias[self.config.num_heads :]
                + torch.linspace(
                    3.0,
                    6.0,
                    block.mlstm_layer.fgate_preact.bias.shape[-1],
                ).to(
                    device=block.mlstm_layer.fgate_preact.bias.device,
                    dtype=block.mlstm_layer.fgate_preact.bias.dtype,
                )
            )
Member

This function is applied iteratively on each module in the model -> we should not iterate on them again, see how it is usually done in e.g. Llama (each module decides what to do)
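For context, the per-module pattern being referred to looks roughly like this (paraphrased from the usual `_init_weights` in models such as Llama; `initializer_range` is that config's field, not necessarily this PR's):

```python
import torch.nn as nn

# Sketch of the standard per-module init: PreTrainedModel calls _init_weights
# once per submodule, so no extra iteration over the model is needed.
def _init_weights(self, module):
    std = self.config.initializer_range
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
```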

Author

Since there are some special nn.Linear modules (the gates) that need a certain init, I have now added an additional utility method that gets the global name of a module within xLSTMPreTrainedModel and uses it for adaptive initialization. I hope this better matches how the init is intended to work. Otherwise I would have needed to wrap many special modules (also the FF down-projection), both within the HF code and the original xLSTM repo.
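Roughly, the idea is the following (a sketch for illustration; the helper name `_module_name` and the matching rules are simplified, not the exact PR code, and `small_init_method` / `wang_init_method` are the helpers from the diff above):

```python
import torch.nn as nn

def _module_name(model: nn.Module, module: nn.Module) -> str:
    """Return the fully qualified name of `module` inside `model`, or '' if absent."""
    for name, mod in model.named_modules():
        if mod is module:
            return name
    return ""

def _init_weights(self, module):
    name = _module_name(self, module)
    if isinstance(module, nn.Linear):
        if "gate_preact" in name:
            nn.init.zeros_(module.weight)  # gate pre-activations start at zero
        elif name.endswith("ffn.proj_down"):
            wang_init_method(dim=module.in_features, n_layers=self.config.num_hidden_layers)(module.weight)
        elif name.endswith("mlstm_layer.out_proj"):
            wang_init_method(dim=self.config.hidden_size, n_layers=self.config.num_hidden_layers)(module.weight)
        else:
            small_init_method(self.config.hidden_size)(module.weight)
        if module.bias is not None and "gate_preact" not in name:
            nn.init.zeros_(module.bias)  # gate biases get their special values separately
```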

@kpoeppel (Author) commented Jul 8, 2025

> Hey! Sorry for the delay! Here is a new review 🤗 Let me know if something is still unclear!

Thanks for the next review!

I moved the xLSTMCache to modeling_xlstm.py and resolved all other issues. However, the auto_docstring decorator now fails to work, as xLSTMCache is probably no longer global. Should I switch back to a non-auto_docstring docstring, or is there a better way to fix this?

@Cyrilvallez (Member)

I don't think it has anything to do with the class being public or not! But you can find everything you need about auto_docstring here! Basically, you only need to add a docstring for "unknown" args; for example, cache_params is unknown in the library here.
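Roughly something like this (a sketch only, not the exact final code; the decorator import path and docstring format here are assumptions, please check the linked guide):

```python
from transformers.utils import auto_docstring  # assumed import path

@auto_docstring
def forward(self, input_ids=None, cache_params=None, use_cache=None, **kwargs):
    r"""
    cache_params (`xLSTMCache`, *optional*):
        The recurrent state of the xLSTM blocks; if passed together with
        `use_cache=True`, it is updated and returned in the model output.
    """
    ...
```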

@kpoeppel (Author) commented Jul 9, 2025

There was a leftover xLSTMCache mention in the generation docs files. So all your comments should be integrated now. :)

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, xlstm
