ltx-trainer: int4-quanto can be much slower than int2-quanto for LoRA training on CUDA/bf16 #173
Description
While testing image-only LoRA training with ltx-trainer, I found that int4-quanto can be dramatically slower than int2-quanto on a single RTX 4090, even though int4 is the natural thing to try first for 24 GB cards.
In my case:
- int4-quanto: roughly ~10x slower per training step in the real LTX training loop
- int2-quanto: roughly ~1 s/step
This was surprising because int4 did fit memory and did not look like a simple VRAM pressure issue.
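For reference, the per-step numbers above were gathered with a simple wall-clock average over a few steps. A minimal sketch of such a harness is below; `step_fn` is a hypothetical stand-in for one full training step (on CUDA you would also call `torch.cuda.synchronize()` inside `step_fn` so the timing is not distorted by asynchronous kernel launches):

```python
import time

def time_steps(step_fn, n_steps=5, warmup=1):
    """Average wall-clock seconds per call to step_fn, skipping warmup calls."""
    for _ in range(warmup):
        step_fn()  # warmup absorbs one-time costs (compilation, allocator growth)
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) / n_steps

# Dummy workload standing in for one optimizer step:
avg = time_steps(lambda: sum(i * i for i in range(10_000)))
print(f"{avg:.6f} s/step")
```

Comparing the two quantization settings is then just a matter of running the same harness twice with the same config, changing only the weight qtype.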
What I checked
I added local device/step tracing to the trainer and confirmed:
- the first traced training step showed cpu_involved=0
- executed modules were on cuda:0
- so this did not look like a hidden CPU fallback inside the LTX model forward
I then reduced the problem to a minimal case outside LTX and found that on this stack:
- qint2 selects WeightQBitsTensor
- qint4 selects TinyGemmWeightQBitsTensor
That appears to be the relevant backend difference.
I opened the upstream Quanto issue here:
Why I think this matters for LTX
The current trainer exposes int4-quanto as a practical low-VRAM option, and for 24 GB users it is a very natural choice. But on my setup it was much slower than int2-quanto, so the user-facing behavior is pretty counterintuitive.
I think it would help if ltx-trainer either:
- documented that int4-quanto may be slower than int2-quanto for training on some CUDA/bf16 stacks, or
- mentioned that the qint4 -> TinyGemm path may behave differently from qint2, especially for backward-heavy LoRA training workloads
Environment
- ltx-trainer from current main as of 2026-03-22
- optimum-quanto==0.2.7
- torch==2.9.1+cu128
- Python 3.12.12
- GPU: NVIDIA GeForce RTX 4090 24 GB
- Driver: 581.42
If useful, I can provide the exact trainer config and logs as well.