ltx-trainer: int4-quanto can be much slower than int2-quanto for LoRA training on CUDA/bf16 #173
Description
While testing image-only LoRA training with ltx-trainer, I found that int4-quanto can be dramatically slower than int2-quanto on a single RTX 4090, even though int4 is the natural thing to try first for 24 GB cards.
In my case:
- int4-quanto: roughly ~10x slower per training step in the real LTX training loop
- int2-quanto: roughly ~1 s/step
This was surprising because int4 did fit memory and did not look like a simple VRAM pressure issue.
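For reference, the per-step numbers above were gathered with a simple wall-clock average over a few steps. A minimal sketch of such a harness is below; `step_fn` is a hypothetical stand-in for one full training step (on CUDA you would also call `torch.cuda.synchronize()` inside `step_fn` so the timing is not distorted by asynchronous kernel launches):

```python
import time

def time_steps(step_fn, n_steps=5, warmup=1):
    """Average wall-clock seconds per call to step_fn, skipping warmup calls."""
    for _ in range(warmup):
        step_fn()  # warmup absorbs one-time costs (compilation, allocator growth)
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) / n_steps

# Dummy workload standing in for one optimizer step:
avg = time_steps(lambda: sum(i * i for i in range(10_000)))
print(f"{avg:.6f} s/step")
```

Comparing the two quantization settings is then just a matter of running the same harness twice with the same config, changing only the weight qtype.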
What I checked
I added local device/step tracing to the trainer and confirmed:
- the first traced training step showed cpu_involved=0
- executed modules were on cuda:0
- so this did not look like a hidden CPU fallback inside the LTX model forward
I then reduced the problem to a minimal case outside LTX and found that on this stack:
- qint2 selects WeightQBitsTensor
- qint4 selects TinyGemmWeightQBitsTensor
That appears to be the relevant backend difference.
I opened the upstream Quanto issue here:
Why I think this matters for LTX
The current trainer exposes int4-quanto as a practical low-VRAM option, and for 24 GB users it is a very natural choice. But on my setup it was much slower than int2-quanto, so the user-facing behavior is pretty counterintuitive.
I think it would help if ltx-trainer either:
- documented that int4-quanto may be slower than int2-quanto for training on some CUDA/bf16 stacks, or
- mentioned that the qint4 -> TinyGemm path may behave differently from qint2, especially for backward-heavy LoRA training workloads
Environment
- ltx-trainer from current main as of 2026-03-22
- optimum-quanto==0.2.7
- torch==2.9.1+cu128
- Python 3.12.12
- GPU: NVIDIA GeForce RTX 4090 24 GB
- Driver: 581.42
If useful, I can provide the exact trainer config and logs as well.