Incorrectly detect CublasLt version and cause error in an assert #2538

Description

@zzzhhh2025

Describe the bug

Transformer Engine incorrectly detects the CublasLt version, which makes an assert fail:

[rank0]: File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 836, in autocast
[rank0]: check_recipe_support(recipe)
[rank0]: File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 99, in check_recipe_support
[rank0]: assert recipe_supported, unsupported_reason
[rank0]: AssertionError: CublasLt version 12.1.3.x or higher required for FP8 execution on Ada.

Similarly:

File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 841, in autocast
FP8GlobalStateManager.autocast_enter(
File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 579, in autocast_enter
assert fp8_available, reason_for_no_fp8
AssertionError: CublasLt version 12.1.3.x or higher required for FP8 execution on Ada.

My CublasLt version is 12.8.4, as reported by:

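import ctypes
lib = ctypes.CDLL(path)  # path points at libcublasLt.so.12
lib.cublasLtGetVersion.restype = ctypes.c_size_t  # the C function returns size_t
version_int = lib.cublasLtGetVersion()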

So the installed CublasLt version meets the requirement. If I comment out the two assert statements on lines 99 and 579 of quantization.py, the program runs without any error, so I believe the assertion is a false positive.
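For reference, here is a minimal sketch of how the integer from cublasLtGetVersion() can be decoded and compared against the minimum named in the assertion message. It assumes the `major * 10000 + minor * 100 + patch` encoding used by the cuBLAS 12 headers. The helper names and the `MIN_ADA_FP8` constant are hypothetical and are not part of Transformer Engine.

```python
import ctypes

# Assumption: cuBLAS 12 packs its version as major * 10000 + minor * 100 + patch.
# Minimum taken from the assertion message ("12.1.3.x or higher").
MIN_ADA_FP8 = (12, 1, 3)


def decode_cublaslt_version(version_int):
    """Split the packed integer into a (major, minor, patch) tuple."""
    major, rest = divmod(version_int, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def meets_ada_fp8_minimum(version_int):
    """Return True if the packed version is at least MIN_ADA_FP8."""
    return decode_cublaslt_version(version_int) >= MIN_ADA_FP8


def query_cublaslt_version(path):
    """Load the library at `path` and return its decoded version.

    Hypothetical helper: `path` must point at a real libcublasLt.so.12.
    """
    lib = ctypes.CDLL(path)
    lib.cublasLtGetVersion.restype = ctypes.c_size_t
    return decode_cublaslt_version(lib.cublasLtGetVersion())
```

Under that assumption, a library reporting 12.8.4 would return the integer 120804, which passes the check.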

Steps/Code to reproduce bug

Train a language model with FP8 enabled. Example code is here. Change train.py to enable FP8:

  1. Add at the top of the code:
import transformer_engine.pytorch as te
from transformer_engine.common import recipe
  2. Modify lines 253 to 282:
    # Create FP8 recipe
    fp8_format = recipe.Format.HYBRID  # or recipe.Format.E4M3
    fp8_recipe = recipe.DelayedScaling(fp8_format=fp8_format, amax_history_len=16, amax_compute_algo="max")

    for epoch in range(initial_epoch, config.num_epochs):
        torch.cuda.empty_cache()
        model.train()

        # Disable tqdm on all nodes except the rank 0 GPU on each server
        batch_iterator = tqdm(train_dataloader, desc=f"Processing Epoch {epoch:02d} on rank {config.global_rank}", disable=config.local_rank != 0)

        for batch in batch_iterator:
            # Use FP8 context manager
            with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):            
                encoder_input = batch['encoder_input'].to(device) # (B, seq_len)
                decoder_input = batch['decoder_input'].to(device) # (B, seq_len)
                encoder_mask = batch['encoder_mask'].to(device) # (B, 1, 1, seq_len)
                decoder_mask = batch['decoder_mask'].to(device) # (B, 1, seq_len, seq_len)

                # # Run the tensors through the encoder, decoder and the projection layer
                # encoder_output = model.module.encode(encoder_input, encoder_mask) # (B, seq_len, d_model)
                # decoder_output = model.module.decode(encoder_output, encoder_mask, decoder_input, decoder_mask) # (B, seq_len, d_model)
                # proj_output = model.module.project(decoder_output) # (B, seq_len, vocab_size)
                proj_output = model(encoder_input, encoder_mask, decoder_input, decoder_mask)

                # Compare the output with the label
                label = batch['label'].to(device) # (B, seq_len)

                # Compute the loss using a simple cross entropy
                loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))
                
            batch_iterator.set_postfix({"loss": f"{loss.item():6.3f}", "global_step": global_step})
            # Rest of the code

Expected behavior

Training runs without hitting the assertion error.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Colab)]

A local workstation.

  • Method of Transformer Engine install: [pip install or from source]. Please specify exact commands you used to install.

pip install in virtual environment: pip install --no-build-isolation transformer_engine[pytorch]

  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version

Ubuntu 22.04 LTS

  • PyTorch version

2.9.1

  • Python version

Python 3.10.12 (main, Nov 4 2025, 08:48:33) [GCC 11.4.0] on linux

  • Transformer Engine version

2.10.0

  • CUDA version

12.8

  • CUDNN version

9.17.0.29-1

Device details

  • GPU model

NVIDIA RTX 6000 Ada Generation

Additional context

I ran LD_DEBUG=libs <the training command> 2>&1 | grep libcublasLt.so.12 to confirm that the correct CublasLt library is located. Besides commenting out the two assert statements, another workaround for the false positive is to set LD_PRELOAD to "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/nvidia/cublas/lib/libcublasLt.so.12".
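As a complementary check from inside the running process, the sketch below lists every libcublasLt file that is actually mapped, by parsing /proc/self/maps. It is Linux-only, the function names are hypothetical, and it only reports what is loaded. It does not show which copy Transformer Engine queries for its version check.

```python
def loaded_cublaslt_paths(maps_text):
    """Return the sorted, unique libcublasLt paths found in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (pathname may be absent).
        fields = line.split(None, 5)
        if len(fields) == 6 and "libcublasLt" in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def current_process_cublaslt_paths():
    """Read the mappings of the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return loaded_cublaslt_paths(f.read())
```

Calling current_process_cublaslt_paths() after `import transformer_engine.pytorch` and again after the first CUDA operation should show whether more than one copy of the library ends up in the process.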

Please ask if you need anything else for this issue.
