(ICCV 2025) 🎨 Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers



Website

Divyansh Srivastava · Xiang Zhang · He Wen · Chenru Wen · Zhuowen Tu


πŸ“– Abstract

We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.

πŸ”₯ Updates

  • 2025-12-30: Added training instructions and code
  • 2025-09-30: Inference and evaluation code is released
  • 2025-06-25: Paper accepted at ICCV 2025 πŸŽ‰πŸŽ‰

πŸš€ Quick Start

Installation

Clone the repository, then create and activate the conda environment:

conda env create -f environment.yml
conda activate LayYourScene

Demo

  1. Download the trained models and configs from Hugging Face:
git clone https://huggingface.co/dsrivastavv/Lay-Your-Scene saved_models
  2. Run demo.py to generate a layout with Lay-Your-Scene, followed by GLIGEN to convert the generated layout into an image:
python demo.py --ckpt saved_models/grit/model.pt \
                --ckpt-config saved_models/grit/config.json \
                --caption "A big boat in the ocean and a man running on beach under blue sky"

The generated layout is saved in the repository root as scene_layout.png, and the generated image as scene.png.
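demo.py produces these files itself; purely as an illustration of what a scene layout looks like, the sketch below renders a hand-written layout (the `(label, box)` format here is an assumption, not the repository's actual schema) to a preview image with Pillow:

```python
from PIL import Image, ImageDraw

# Hypothetical layout for the demo caption: (label, (x0, y0, x1, y1))
# boxes in pixel coordinates. demo.py's real output format may differ.
layout = [
    ("sky", (0, 0, 512, 160)),
    ("boat", (260, 180, 440, 300)),
    ("man", (60, 220, 120, 360)),
]

canvas = Image.new("RGB", (512, 512), "white")
draw = ImageDraw.Draw(canvas)
for label, box in layout:
    draw.rectangle(box, outline="red", width=2)      # bounding box
    draw.text((box[0] + 4, box[1] + 4), label, fill="red")  # category tag
canvas.save("scene_layout_preview.png")
```

A layout image like this is what GLIGEN then conditions on to produce the final scene.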

πŸ“Š Evaluation

  1. Layout-FID: Renders each layout as an image, mapping every object to a specific color, following the document layout generation literature; colors are assigned to account for the semantic similarity between object categories, measured with CLIP.
python -m layout_evaluation.evaluate \
    --evaluation_name coco_grounded_lfid \
    --layout_file results/coco_evaluation/layousyn/output.json \
    --evaluation_dir results/coco_evaluation/layousyn

Alternatively, generate the layouts from the trained model and evaluate them with:

python -m layousyn.evaluation.coco_evaluation \
    --ckpt saved_models/coco_grounded/model.pt \
    --ckpt-config saved_models/coco_grounded/config.json \
    --num-sampling-step 15 \
    --cfg-scales 2.0 \
    --eval-dir results/coco_evaluation/layousyn \
    --partial-file
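The rendering step behind Layout-FID can be sketched as follows. This is a minimal illustration with a fixed palette; in the actual metric, colors are assigned so that categories with similar CLIP text embeddings receive similar colors:

```python
from PIL import Image, ImageDraw

# Hypothetical fixed category-to-color map; the real metric derives
# colors from CLIP similarity between category names.
PALETTE = {"person": (230, 25, 75), "boat": (60, 180, 75), "sky": (0, 130, 200)}


def render_layout(boxes, size=(256, 256)):
    """Draw each box as a filled color block; the resulting images
    are what an FID score would be computed over."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    for label, box in boxes:
        draw.rectangle(box, fill=PALETTE.get(label, (128, 128, 128)))
    return img


img = render_layout([("sky", (0, 0, 256, 80)), ("boat", (100, 120, 200, 180))])
```

FID is then computed between the sets of rendered images for generated and ground-truth layouts.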
  2. NSR-1K evaluation: Evaluates the generated layouts on spatial and counting metrics using the NSR-1K benchmark.
# Spatial evaluation
python -m layout_evaluation.evaluate \
    --evaluation_name nsr_spatial \
    --layout_file results/nsr_1k/layousyn/spatial.json

# Counting evaluation
python -m layout_evaluation.evaluate \
    --evaluation_name nsr_counting \
    --layout_file results/nsr_1k/layousyn/counting.json

Alternatively, generate the layouts from the trained model and evaluate them with:

python -m layousyn.evaluation.nsr_counting_evaluation \
    --ckpt saved_models/coco_grounded/model.pt \
    --ckpt-config saved_models/coco_grounded/config.json \
    --num-sampling-step 15 \
    --cfg-scales 2.0 \
    --eval-dir results/nsr_1k/layousyn/counting

python -m layousyn.evaluation.nsr_spatial_evaluation \
    --ckpt saved_models/coco_grounded/model.pt \
    --ckpt-config saved_models/coco_grounded/config.json \
    --num-sampling-step 15 \
    --cfg-scales 2.0 \
    --eval-dir results/nsr_1k/layousyn/spatial
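The counting metric checks whether a generated layout contains the number of object instances the prompt asked for. A simplified sketch (the benchmark's exact scoring rule is an assumption here):

```python
from collections import Counter


def counting_accuracy(expected_counts, generated_layouts):
    """Fraction of layouts whose per-category object counts exactly
    match the counts requested by the prompt.

    expected_counts:   one dict per prompt, category -> requested count.
    generated_layouts: one list per prompt of generated category labels.
    """
    correct = 0
    for want, labels in zip(expected_counts, generated_layouts):
        if Counter(labels) == Counter(want):
            correct += 1
    return correct / len(generated_layouts)


score = counting_accuracy(
    [{"dog": 3}, {"cat": 2}],
    [["dog", "dog", "dog"], ["cat"]],  # first matches, second is short
)
```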
  3. GLIP: Follow the instructions in LayoutGPT to generate images from layouts with GLIGEN and to detect objects with GLIP.

Training

  1. Download the COCO-17, COCO-Caption-Grounded, and NSR-1K datasets and place them in the datasets directory. The directory structure should look like:
datasets
β”œβ”€β”€ coco-2017
β”œβ”€β”€ COCOCaptionGrounded
└── NSR-1K
  2. Run the following command to train the model on a single GPU. Note: the first run must use a single GPU so that the embedding caches are saved; later runs can use multiple GPUs.
export OMP_NUM_THREADS=4
torchrun --nnodes=1 --nproc_per_node=1 train.py --config configs/config.json

Note: After the embedding caches are saved, training takes around 20 hours on 2 NVIDIA RTX A5000 GPUs, running at approximately 15 steps/sec. Extracting the embeddings takes around 2 hours on COCOGroundedDataset.

  3. (Optional) Point --embed-dir at very fast storage, such as an NVMe SSD, to speed up training.
export OMP_NUM_THREADS=4
torchrun --nnodes=1 --nproc_per_node=1 train.py --num-workers 4 --model DiT-XS --embed-dir /path/to/fast-storage
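The single-GPU first run exists to populate the embedding cache: prompt embeddings are computed once and reread from disk in later (multi-GPU) runs. A minimal sketch of such a compute-once cache (the repository's actual cache layout is unknown; numpy and a stand-in encoder replace the real frozen text encoder):

```python
import hashlib
import os

import numpy as np


class EmbeddingCache:
    """Compute-once, read-many embedding cache keyed by a hash of the text."""

    def __init__(self, cache_dir, embed_fn):
        self.cache_dir = cache_dir
        self.embed_fn = embed_fn  # expensive, e.g. a frozen text encoder
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, f"{key}.npy")

    def get(self, text):
        path = self._path(text)
        if os.path.exists(path):
            return np.load(path)   # later runs: cheap disk read
        emb = self.embed_fn(text)  # first run: expensive compute
        np.save(path, emb)
        return emb


def fake_encoder(text):
    # Deterministic stand-in for a real text encoder.
    return np.full(8, float(len(text)))


cache = EmbeddingCache("embed_cache", fake_encoder)
first = cache.get("a boat in the ocean")
second = cache.get("a boat in the ocean")  # served from disk
```

Keeping the cache on fast storage (the --embed-dir tip above) matters because every training step reads from it.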

🀝 Acknowledgements

We deeply appreciate the contributions of the following projects:

We would also like to thank Xiang Zhang, Ethan Armand, and Zeyuan Chen for their valuable feedback and insightful discussions.

βœ’οΈ Citation

If you find our work useful, please consider citing:

@misc{srivastava2025layyourscenenaturalscenelayout,
      title={Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers}, 
      author={Divyansh Srivastava and Xiang Zhang and He Wen and Chenru Wen and Zhuowen Tu},
      year={2025},
      eprint={2505.04718},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.04718}, 
}
