This is the official implementation of the ICCV 2025 paper "Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation".
- Code & checkpoint upload completed
- FlexAttention finetuning option enabled
- Infinity-8B checkpoint finetuning enabled
We introduce a method for fine-tuning visual autoregressive (VAR) models tailored for subject-driven generation tasks. Our approach efficiently customizes VAR models, enabling high-quality personalized image generation.
Our experiments were conducted on a single NVIDIA A6000 GPU. Please ensure your hardware meets the following minimum specification:
- GPU Memory: ≥ 40GB
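As a quick sanity check (assuming the NVIDIA driver and `nvidia-smi` are available), you can confirm the memory of each visible GPU before training:
```bash
# Report the name and total memory of each GPU; expect at least ~40 GB (an A6000 reports ~48 GB).
nvidia-smi --query-gpu=name,memory.total --format=csv
```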
Clone the repository and install dependencies:
```bash
git clone https://github.com/jiwoogit/ARBooth.git
cd ARBooth
pip install -r requirements.txt
```
Alternatively, use our pre-configured Docker image (published on Docker Hub as `wldn0202/arbooth`):
```bash
docker pull wldn0202/arbooth:latest
```
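A minimal way to start a container from the image is sketched below; `--gpus all` assumes the NVIDIA Container Toolkit is installed, and the `/workspace` mount path is an assumption for illustration, not a documented entrypoint:
```bash
# Hypothetical invocation: mount the current checkout and open a shell inside the container.
docker run --gpus all -it --rm -v "$(pwd)":/workspace -w /workspace wldn0202/arbooth:latest bash
```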
Please download the official pretrained VAR checkpoints from Infinity's repository and organize them as follows:
```
weights/
├── infinity_2b_reg.pth
└── infinity_vae_d32_reg.pth
```
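Once downloaded, a quick shell check (no assumptions beyond the layout shown above) confirms both checkpoints are in place:
```bash
# Both files must exist under weights/ before fine-tuning.
ls -lh weights/infinity_2b_reg.pth weights/infinity_vae_d32_reg.pth
```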
You can download our fine-tuned checkpoints from Hugging Face (wldn0202/ARBooth).
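For example, they can be fetched with the Hugging Face CLI (provided by the `huggingface_hub` package); the target directory below is only an illustration, not a path the training or inference scripts require:
```bash
# Download the fine-tuned checkpoints from the wldn0202/ARBooth repository on Hugging Face.
huggingface-cli download wldn0202/ARBooth --local-dir weights/arbooth
```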
We adopt the preprocessing pipeline of DreamMatcher. Please follow their instructions for the detailed steps, or refer to the `inputs` directory.
Customize the training parameters by modifying `exp_name` and `cls_name` in the provided script, then run:
```bash
bash scripts/train_arbooth.sh
```
All training results and logs will be saved under the `LOCAL_OUT` directory.
For detailed configuration options and parameters for fine-tuning, please refer to `infinity/utils/arg_util.py`.
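For reference, the two variables edited in `scripts/train_arbooth.sh` might look like the following; the values are purely illustrative and should be replaced with your own subject and class:
```bash
# Hypothetical values inside scripts/train_arbooth.sh.
exp_name="backpack_dog"   # identifier for this run's outputs and logs
cls_name="dog"            # broad class noun describing the subject
```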
We evaluate performance with the DINO, CLIP, PRES, and DIV metrics. Update the paths in `scripts/eval_arbooth.sh` to match your training setup, then run:
```bash
bash scripts/eval_arbooth.sh
```
To generate images from your own prompts with the fine-tuned checkpoints, run:
```bash
bash scripts/infer_arbooth.sh
```
- Iteration Settings:
  - For the 2-batch configuration, 500 iterations are recommended.
  - For the 1-batch configuration, 100-150 iterations are recommended.
  - Adjust these values based on your specific input data and requirements (see the search snippet after this list).
- Class Prompt Selection:
  - The choice of class prompt (e.g., "dog", "cat") significantly impacts the final generation quality.
  - Use general, broad category nouns for optimal results.
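If you need to change the iteration count, one non-authoritative way to locate the relevant option, without assuming its exact name, is to search the argument definitions mentioned above:
```bash
# Look for iteration-related settings in the fine-tuning argument parser.
grep -n -i "iter" infinity/utils/arg_util.py
```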
This repository is built upon the following projects:
- VAR (FoundationVision/VAR)
- Infinity (FoundationVision/Infinity)
- DreamMatcher (cvlab-kaist/DreamMatcher)
- diffusers (huggingface/diffusers)
We sincerely appreciate their invaluable contributions.
If our paper or repository is helpful for your research, please cite us:
```bibtex
@article{chung2025fine,
  title={Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation},
  author={Chung, Jiwoo and Hyun, Sangeek and Kim, Hyunjun and Koh, Eunseo and Lee, MinKyu and Heo, Jae-Pil},
  journal={arXiv preprint arXiv:2504.02612},
  year={2025}
}
```
For any questions, please reach out to:
- Jiwoo Chung ([email protected])
