ONNX wrapping a kv-cached model

### Describe the bug

I'm trying to export a kv-cached `TabPFNClassifier` to ONNX. I already managed to do it for the default `fit_mode`, just like the example. I want the exported model to receive a single row as input and run inference on it in real time using the cached keys and values.

I get this annoying trace:

`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient`

Coming from:

```
 File "TabPFN/src/tabpfn/model/transformer.py", line 543, in _forward
    embedded_y = self.y_encoder(
....
File "TabPFN/src/tabpfn/model/encoders.py", line 459, in forward
    out = self._transform(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "TabPFN/src/tabpfn/model/encoders.py", line 400, in _transform
    return (self.layer(x),)
            ^^^^^^^^^^^^^
  File "tabpfn/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "tabpfn/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "tabpfn/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1741, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "tabpfn/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
```

I'm not very familiar with PyTorch, so I would appreciate some guidance. From what I found online, this error mostly occurs when a layer in the model does not inherit from `nn.Module`, but that doesn't seem to be the case here.
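For reference, here is a minimal sketch of one way this error can arise even when every layer is an `nn.Module`. It rests on an assumption that has not been checked against the TabPFN source: that the weights live inside an object which is not itself an `nn.Module`, so the wrapper never registers them as parameters. `PlainHolder` and `Wrapper` are hypothetical stand-ins, and `torch.jit.trace` is used in place of `torch.onnx.export` on the assumption that both go through the same tracer.

```python
import torch
from torch import nn


class PlainHolder:
    """Hypothetical stand-in for an estimator that is not an nn.Module."""

    def __init__(self):
        self.layer = nn.Linear(4, 2)

    def forward(self, x):
        return self.layer(x)


class Wrapper(nn.Module):
    def __init__(self, holder):
        super().__init__()
        # A plain Python object is stored as an ordinary attribute, so the
        # weights inside it are not registered as parameters of the wrapper.
        self.holder = holder

    def forward(self, x):
        return self.holder.forward(x)


holder = PlainHolder()
wrapper = Wrapper(holder).eval()
x = torch.randn(1, 4)

# The wrapper reports no parameters, yet the hidden layer has two tensors
# (weight and bias) that still require grad.
n_wrapper_params = len(list(wrapper.parameters()))
n_hidden_grad = sum(p.requires_grad for p in holder.layer.parameters())
print(n_wrapper_params, n_hidden_grad)

# Tracing may reject those hidden tensors; record the outcome instead of
# assuming it.
try:
    torch.jit.trace(wrapper, x)
    error_before = None
except RuntimeError as err:
    error_before = str(err)
print("error before freezing:", error_before)

# Possible workaround: mark the hidden weights as not requiring grad so the
# tracer can embed them as constants.
for p in holder.layer.parameters():
    p.requires_grad_(False)

traced = torch.jit.trace(wrapper, x)
print(torch.allclose(traced(x), wrapper(x)))
```

If the same pattern applies to the classifier, calling `requires_grad_(False)` on the underlying model's parameters before export might be enough, but where those parameters live in TabPFN is not something this sketch establishes.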

### Steps/Code to Reproduce

Something along these lines. I know I'm missing the preprocessing stage here, but this should still work. If there's interest, I can supply an exact piece of code.
```python
import torch
from torch import nn
from tabpfn import TabPFNClassifier


class TabPFNModelWrapperWithTrainData(nn.Module):
    def __init__(self, classifier):
        super().__init__()
        self.classifier = classifier

    def forward(self, X):
        return self.classifier.forward(X, use_inference_mode=True)


# X_df, Y_df and file_name are defined elsewhere (omitted here).
classifier = TabPFNClassifier(
    model_path="tabpfn-v2-classifier.ckpt",
    n_estimators=1,
    device="cpu",
    random_state=42,
    fit_mode="fit_with_cache",
    memory_saving_mode=False,
)
classifier.fit(X_df, Y_df)

with torch.no_grad():
    # A single random row with the same number of features as the training data.
    X = torch.randn(
        (1, X_df.shape[1]),
        generator=torch.Generator().manual_seed(42),
    )
    X.requires_grad = False

    torch.onnx.export(
        TabPFNModelWrapperWithTrainData(classifier).eval(),
        (X,),  # forward() takes a single input
        f=file_name,
        input_names=["X"],
        output_names=["output"],
        opset_version=17,  # using 17 since we use torch>=2.1
    )
```

### Expected Results

I expect the export to succeed and the resulting ONNX model to work :)

### Actual Results


`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient` is raised while exporting the wrapped model.

### Versions

```shell
Building with the main branch
```
