CublasLt version is detected incorrectly, causing an assertion error

**Describe the bug**

The CublasLt version is detected incorrectly, which triggers an assertion error:

> [rank0]:   File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 836, in autocast
> [rank0]:     check_recipe_support(recipe)
> [rank0]:   File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 99, in check_recipe_support
> [rank0]:     assert recipe_supported, unsupported_reason
> [rank0]: AssertionError: CublasLt version 12.1.3.x or higher required for FP8 execution on Ada.

Similarly:
>   File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 841, in autocast
>     FP8GlobalStateManager.autocast_enter(
>   File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 579, in autocast_enter
>     assert fp8_available, reason_for_no_fp8
> AssertionError: CublasLt version 12.1.3.x or higher required for FP8 execution on Ada.
> 

My CublasLt version is 12.8.4, as detected with:
```
import ctypes
lib = ctypes.CDLL(path)  # path to libcublasLt.so.12
lib.cublasLtGetVersion.restype = ctypes.c_size_t  # the C function returns size_t
version_int = lib.cublasLtGetVersion()
```
So the installed CublasLt meets the requirement. If I comment out the two `assert` statements on lines 99 and 579, the program runs without any error, so I believe the check is reporting a false positive.
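For anyone who wants to repeat the version check, here is a minimal sketch of how the integer from `cublasLtGetVersion()` can be turned into a readable version and compared against the requirement. It assumes the integer is encoded as `major * 10000 + minor * 100 + patch`, and the threshold `120103` is my reading of "12.1.3" in the error message, not something I have confirmed in the Transformer Engine source. The helper names are made up for illustration.

```python
import ctypes

# Assumed integer form of "12.1.3" from the error message (not confirmed in TE source).
MIN_ADA_FP8_VERSION = 120103


def decode_cublaslt_version(version_int: int) -> tuple[int, int, int]:
    """Split an integer version into (major, minor, patch).

    Assumes the encoding major * 10000 + minor * 100 + patch.
    """
    major, rest = divmod(version_int, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def query_cublaslt_version(path: str) -> int:
    """Load the shared library at `path` and return cublasLtGetVersion() as an int."""
    lib = ctypes.CDLL(path)
    lib.cublasLtGetVersion.restype = ctypes.c_size_t  # the C function returns size_t
    return int(lib.cublasLtGetVersion())
```

Under that assumed encoding, version 12.8.4 corresponds to the integer `120804`, which is above the threshold. Calling `query_cublaslt_version(path)` needs the path to the `libcublasLt.so.12` that you want to inspect.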

**Steps/Code to reproduce bug**

Train a language model with FP8 enabled. Example code is available [here](https://github.com/hkproj/pytorch-transformer-distributed). Change train.py to enable FP8:

1) Add at the top of the code:
```
import transformer_engine.pytorch as te
from transformer_engine.common import recipe
```

2) Modify lines 253 to 282 as follows:
```
    # Create FP8 recipe
    fp8_format = recipe.Format.HYBRID  # or recipe.Format.E4M3
    fp8_recipe = recipe.DelayedScaling(fp8_format=fp8_format, amax_history_len=16, amax_compute_algo="max")

    for epoch in range(initial_epoch, config.num_epochs):
        torch.cuda.empty_cache()
        model.train()

        # Disable tqdm on all nodes except the rank 0 GPU on each server
        batch_iterator = tqdm(train_dataloader, desc=f"Processing Epoch {epoch:02d} on rank {config.global_rank}", disable=config.local_rank != 0)

        for batch in batch_iterator:
            # Use FP8 context manager
            with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
                encoder_input = batch['encoder_input'].to(device) # (B, seq_len)
                decoder_input = batch['decoder_input'].to(device) # (B, seq_len)
                encoder_mask = batch['encoder_mask'].to(device) # (B, 1, 1, seq_len)
                decoder_mask = batch['decoder_mask'].to(device) # (B, 1, seq_len, seq_len)

                # # Run the tensors through the encoder, decoder and the projection layer
                # encoder_output = model.module.encode(encoder_input, encoder_mask) # (B, seq_len, d_model)
                # decoder_output = model.module.decode(encoder_output, encoder_mask, decoder_input, decoder_mask) # (B, seq_len, d_model)
                # proj_output = model.module.project(decoder_output) # (B, seq_len, vocab_size)
                proj_output = model(encoder_input, encoder_mask, decoder_input, decoder_mask)

                # Compare the output with the label
                label = batch['label'].to(device) # (B, seq_len)

                # Compute the loss using a simple cross entropy
                loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))
                
            batch_iterator.set_postfix({"loss": f"{loss.item():6.3f}", "global_step": global_step})
            # Rest of the code
```


**Expected behavior**

Training runs without raising the assertion error.

**Environment overview (please complete the following information)**

 - Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Collab)]

A local workstation.

 - Method of Transformer Engine install: [pip install or from source]. Please specify exact commands you used to install.

pip install in virtual environment: `pip install --no-build-isolation transformer_engine[pytorch]` 

 - If method of install is [Docker], provide `docker pull` & `docker run` commands used

**Environment details**

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
- OS version

Ubuntu 22.04 LTS

- PyTorch version

2.9.1

- Python version

Python 3.10.12 (main, Nov  4 2025, 08:48:33) [GCC 11.4.0] on linux

- Transformer Engine version

2.10.0

- CUDA version

12.8

- CUDNN version

9.17.0.29-1

**Device details**
- GPU model

NVIDIA RTX 6000 Ada Generation

**Additional context**

I used `LD_DEBUG=libs <the training command> 2>&1 | grep libcublasLt.so.12` to confirm that the correct CublasLt library is found. Besides commenting out the two `assert` statements, another way to work around the false positive is to set `LD_PRELOAD` to `"(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/nvidia/cublas/lib/libcublasLt.so.12"`.
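As a complement to the `LD_DEBUG` check, the following sketch lists which CublasLt shared objects are actually mapped into the running Python process by reading `/proc/self/maps` (Linux only). The function names are hypothetical and this is only a diagnostic aid; it does not show which copy Transformer Engine queries for its version check.

```python
from pathlib import Path


def find_loaded_libs(maps_text: str, needle: str = "libcublasLt") -> list[str]:
    """Return sorted, de-duplicated paths from a /proc/<pid>/maps dump that contain `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, pathname (pathname may be absent).
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def loaded_cublaslt_paths() -> list[str]:
    """Linux only: inspect the memory map of the current process."""
    return find_loaded_libs(Path("/proc/self/maps").read_text())
```

Calling `loaded_cublaslt_paths()` after `import transformer_engine.pytorch` and again with `LD_PRELOAD` set should show whether a copy other than the one under `site-packages/nvidia/cublas/lib` is being loaded.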

Please ask if you need anything else for this issue.

Incorrectly detect CublasLt version and cause error in an assert #2538
