ltx-trainer: int4-quanto can be much slower than int2-quanto for LoRA training on CUDA/bf16 #173

@pyros-projects

Description

While testing image-only LoRA training with ltx-trainer, I found that int4-quanto can be dramatically slower than int2-quanto on a single RTX 4090, even though int4 is the natural thing to try first for 24 GB cards.

In my case:

  • int4-quanto: roughly 10x slower per training step in the real LTX training loop
  • int2-quanto: roughly 1 s/step

This was surprising because the int4 run did fit in memory, so it did not look like a simple VRAM-pressure issue.
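For context, the per-step numbers above came from simple wall-clock timing of the training loop. A minimal harness for this kind of comparison looks like the following (a generic sketch, not ltx-trainer code; with a real CUDA workload you would also synchronize the device around the timed region):

```python
import time

def seconds_per_step(step_fn, warmup=3, iters=10):
    # Warm-up iterations avoid counting one-time costs (kernel compilation,
    # allocator growth). On CUDA, call torch.cuda.synchronize() before
    # reading the clock so async kernels are actually finished.
    for _ in range(warmup):
        step_fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return (time.perf_counter() - t0) / iters

# Dummy "training step" standing in for the real LTX step:
print(f"{seconds_per_step(lambda: time.sleep(0.01), warmup=1, iters=5):.3f} s/step")
```

Running this once per quantization config (int2 vs int4, same batch and resolution) is enough to reproduce the ~10x gap described above.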

What I checked

I added local device/step tracing to the trainer and confirmed:

  • the first traced training step showed cpu_involved=0
  • executed modules were on cuda:0
  • so this did not look like a hidden CPU fallback inside the LTX model forward
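The tracing was ad hoc, but the idea can be sketched with plain forward hooks (a hypothetical helper, not part of ltx-trainer; `cpu_involved` here is just a count of executed modules whose parameters live on CPU):

```python
import torch
import torch.nn as nn

def trace_devices(model):
    """Record the device(s) of each executed module's own parameters via
    forward hooks, to rule out a hidden CPU fallback during the step."""
    seen = {}
    def hook(module, args, output):
        devs = {str(p.device) for p in module.parameters(recurse=False)}
        if devs:
            seen[type(module).__name__] = devs
    handles = [m.register_forward_hook(hook) for m in model.modules()]
    return seen, handles

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
seen, handles = trace_devices(model)
model(torch.randn(4, 8))
for h in handles:
    h.remove()
cpu_involved = sum("cpu" in d for d in seen.values())
print(seen, "cpu_involved =", cpu_involved)
```

On the real trainer, all traced modules reported `cuda:0` and `cpu_involved=0`, which is what rules out the CPU-fallback hypothesis.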

I then reduced the problem outside LTX and found that on this stack:

  • qint2 selects WeightQBitsTensor
  • qint4 selects TinyGemmWeightQBitsTensor

That appears to be the relevant backend difference.
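The backend selection can be checked outside LTX with a single quantized layer (a sketch using optimum-quanto's public `quantize`/`freeze` API; the specific class names above were observed on CUDA/bf16 and may differ on other stacks or versions, so the import is guarded here):

```python
import torch
import torch.nn as nn

try:
    from optimum.quanto import quantize, freeze, qint2, qint4
    HAVE_QUANTO = True
except ImportError:  # let the sketch load even without optimum-quanto
    HAVE_QUANTO = False

def weight_backend(qtype):
    """Quantize one Linear layer and report the frozen weight's tensor class.
    On the CUDA/bf16 stack above, qint2 and qint4 resolve to different
    backend tensor classes, which is the relevant difference."""
    model = nn.Sequential(nn.Linear(64, 64))
    quantize(model, weights=qtype)
    freeze(model)
    return type(model[0].weight).__name__

if HAVE_QUANTO:
    print("qint2 ->", weight_backend(qint2))
    print("qint4 ->", weight_backend(qint4))
```

Note that the TinyGemm path is only selected on CUDA with bf16, so running this on CPU will not reproduce the split.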

I opened the upstream Quanto issue here:

Why I think this matters for LTX

The current trainer exposes int4-quanto as a practical low-VRAM option, and for 24 GB users it is a very natural choice. But on my setup it was much slower than int2-quanto, so the user-facing behavior is pretty counterintuitive.

I think it would help if ltx-trainer either:

  • documented that int4-quanto may be slower than int2-quanto for training on some CUDA/bf16 stacks, or
  • mentioned that the qint4 -> TinyGemm path may behave differently from qint2, especially for backward-heavy LoRA training workloads

Environment

  • ltx-trainer from current main as of 2026-03-22
  • optimum-quanto==0.2.7
  • torch==2.9.1+cu128
  • Python 3.12.12
  • GPU: NVIDIA GeForce RTX 4090 24 GB
  • Driver: 581.42

If useful, I can provide the exact trainer config and logs as well.
