Describe the bug
Transformer Engine incorrectly detects the CublasLt version, which makes an assert fail:
[rank0]: File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 836, in autocast
[rank0]: check_recipe_support(recipe)
[rank0]: File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 99, in check_recipe_support
[rank0]: assert recipe_supported, unsupported_reason
[rank0]: AssertionError: CublasLt version 12.1.3.x or higher required for FP8 execution on Ada.
Similarly:
File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 841, in autocast
FP8GlobalStateManager.autocast_enter(
File "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/transformer_engine/pytorch/quantization.py", line 579, in autocast_enter
assert fp8_available, reason_for_no_fp8
AssertionError: CublasLt version 12.1.3.x or higher required for FP8 execution on Ada.
My CublasLt version is 12.8.4, detected using:
import ctypes
lib = ctypes.CDLL(path)  # path: location of the libcublasLt shared library
version_int = lib.cublasLtGetVersion()
So the installed CublasLt meets the 12.1.3.x requirement. If I comment out the two assert statements on lines 99 and 579, the program runs without any error, so I think the check is a false positive.
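For anyone reproducing the check above, here is a minimal sketch of how the integer returned by cublasLtGetVersion() can be decoded and compared against the 12.1.3 minimum from the error message. It assumes the cuBLAS 12.x encoding major * 10000 + minor * 100 + patch (so 12.8.4 is reported as 120804). The helper names are hypothetical and are not part of Transformer Engine.

```python
import ctypes


def decode_cublaslt_version(version_int: int) -> tuple:
    """Split the packed integer into (major, minor, patch).

    Assumes the cuBLAS 12.x encoding: major * 10000 + minor * 100 + patch.
    """
    major, rest = divmod(version_int, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def query_cublaslt_version(path: str) -> int:
    """Load the library at `path` and ask it for its version.

    `path` must point to a libcublasLt shared object; this needs CUDA libraries.
    """
    lib = ctypes.CDLL(path)
    # cublasLtGetVersion takes no arguments and returns size_t.
    lib.cublasLtGetVersion.argtypes = []
    lib.cublasLtGetVersion.restype = ctypes.c_size_t
    return lib.cublasLtGetVersion()


def meets_minimum(version_int: int, minimum: tuple = (12, 1, 3)) -> bool:
    """Compare the decoded version against the minimum named in the assertion."""
    return decode_cublaslt_version(version_int) >= minimum
```

With the library from this report, decode_cublaslt_version(query_cublaslt_version(path)) would be expected to give (12, 8, 4), which passes meets_minimum. If Transformer Engine sees a lower number at runtime, it is probably resolving a different libcublasLt than the one opened here.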
Steps/Code to reproduce bug
Train a language model with FP8 enabled. An example code is here. Change train.py to enable FP8:
- Add at the top of the code:
import transformer_engine.pytorch as te
from transformer_engine.common import recipe
- Modify lines 253 to 282 as follows:
# Create FP8 recipe
fp8_format = recipe.Format.HYBRID  # or recipe.Format.E4M3
fp8_recipe = recipe.DelayedScaling(fp8_format=fp8_format, amax_history_len=16, amax_compute_algo="max")
for epoch in range(initial_epoch, config.num_epochs):
    torch.cuda.empty_cache()
    model.train()
    # Disable tqdm on all nodes except the rank 0 GPU on each server
    batch_iterator = tqdm(train_dataloader, desc=f"Processing Epoch {epoch:02d} on rank {config.global_rank}", disable=config.local_rank != 0)
    for batch in batch_iterator:
        # Use FP8 context manager
        with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
            encoder_input = batch['encoder_input'].to(device)  # (B, seq_len)
            decoder_input = batch['decoder_input'].to(device)  # (B, seq_len)
            encoder_mask = batch['encoder_mask'].to(device)  # (B, 1, 1, seq_len)
            decoder_mask = batch['decoder_mask'].to(device)  # (B, 1, seq_len, seq_len)
            # # Run the tensors through the encoder, decoder and the projection layer
            # encoder_output = model.module.encode(encoder_input, encoder_mask)  # (B, seq_len, d_model)
            # decoder_output = model.module.decode(encoder_output, encoder_mask, decoder_input, decoder_mask)  # (B, seq_len, d_model)
            # proj_output = model.module.project(decoder_output)  # (B, seq_len, vocab_size)
            proj_output = model(encoder_input, encoder_mask, decoder_input, decoder_mask)
        # Compare the output with the label
        label = batch['label'].to(device)  # (B, seq_len)
        # Compute the loss using a simple cross entropy
        loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))
        batch_iterator.set_postfix({"loss": f"{loss.item():6.3f}", "global_step": global_step})
        # Rest of the code
Expected behavior
Training should run without raising the CublasLt version assertion.
Environment overview (please complete the following information)
- Environment location: A local workstation.
- Method of Transformer Engine install: pip install in a virtual environment, using pip install --no-build-isolation transformer_engine[pytorch]
Environment details
If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
- OS version: Ubuntu 22.04 LTS
- PyTorch version: 2.9.1
- Python version: Python 3.10.12 (main, Nov 4 2025, 08:48:33) [GCC 11.4.0] on linux
- Transformer Engine version: 2.10.0
- CUDA version: 12.8
- CUDNN version: 9.17.0.29-1
Device details
NVIDIA RTX 6000 Ada Generation
Additional context
I ran LD_DEBUG=libs <the training command> 2>&1 | grep libcublasLt.so.12 to confirm that the correct CublasLt library is located. Besides commenting out the two assert statements, another workaround for the false positive is to set LD_PRELOAD to "(omitted)/paperspace/pytorch-transformer-distributed/venv/lib/python3.10/site-packages/nvidia/cublas/lib/libcublasLt.so.12".
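As a complement to the LD_DEBUG check, the following sketch lists which libcublasLt files are actually mapped into the running Python process by parsing /proc/self/maps. It is Linux-only, and the function names are hypothetical. The idea is to call loaded_cublaslt_paths() right after importing transformer_engine.pytorch and see whether more than one copy, or an unexpected path, shows up.

```python
from pathlib import Path


def find_loaded_libs(maps_text: str, needle: str = "libcublasLt") -> list:
    """Return the unique file paths in a /proc/<pid>/maps dump that contain `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def loaded_cublaslt_paths() -> list:
    """Inspect the current process (Linux only)."""
    return find_loaded_libs(Path("/proc/self/maps").read_text())
```

If the list contains a libcublasLt outside the virtual environment, for example a system-wide CUDA install, that would be consistent with LD_PRELOAD making the assertion pass. This is only a diagnostic guess and not a confirmed root cause.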
Please ask if you need anything else for this issue.