Make sure you already checked the examples and documentation before submitting an issue.
How would you like to use ModelOpt
I’m looking at the diffusers QAD example in examples/windows/diffusers/qad_example:
https://github.com/nvidia/model-optimizer/blob/main/examples/windows/diffusers/qad_example/README.md
The example config uses:
batch_size: 1
gradient_accumulation_steps: 4
steps: 300
FSDP num_processes: 8
So the effective batch size seems to be 1 × 4 × 8 = 32.
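For reference, this is how I am computing that number. It is only a minimal sketch of the arithmetic, not code from the example: the function and variable names are my own, and it assumes that each of the 8 FSDP processes is a separate data-parallel rank and that `steps` counts optimizer updates rather than micro-batches. Please correct me if either assumption is wrong.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_processes: int) -> int:
    """Samples contributing to one optimizer update (assumes pure data parallelism)."""
    return per_device_batch * grad_accum_steps * num_processes


# Values taken from the example config quoted above.
ebs = effective_batch_size(per_device_batch=1, grad_accum_steps=4, num_processes=8)

# Assumption: `steps` counts optimizer updates, so this is the total number of samples seen.
samples_seen = ebs * 300

print(f"effective batch size: {ebs}, samples seen over 300 steps: {samples_seen}")
```

If that reading is right, the example config sees about 9,600 samples in total, which is part of why I am asking whether it is meant as a smoke test.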
Could you clarify the recommended settings for a real production-quality QAD run, not just a smoke test?
Main questions:
- What effective batch size is usually recommended for stable QAD quality recovery? Is 32 enough, or should we target 64 or higher?
- Do you have any internal or public successful QAD examples for large Diffusers/DiT models such as LTX-2 or similar-size models? If yes, what batch size, steps, and learning rate were used?
- For a full QAD recovery run on large diffusion models, roughly how many B200/B300 GPUs should users expect to need?
- Is steps=300 only intended as a demo/smoke-test config, or has it been enough in real QAD recovery cases?
Who can help?
System information
- Container used (if applicable): ?
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ?
- CPU architecture (x86_64, aarch64): ?
- GPU name (e.g. H100, A100, L40S): ?
- GPU memory size: ?
- Number of GPUs: ?
- Library versions (if applicable):
  - Python: ?
  - ModelOpt version or commit hash: ?
  - CUDA: ?
  - PyTorch: ?
  - Transformers: ?
  - TensorRT-LLM: ?
  - ONNXRuntime: ?
  - TensorRT: ?
- Any other details that may help: ?