(ICCV 2025) 🎨 Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers



Website

Divyansh Srivastava · Xiang Zhang · He Wen · Chenru Wen · Zhuowen Tu


πŸ“– Abstract

We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.

πŸ”₯ Updates

  • 2025-12-30: Added training instructions and code
  • 2025-09-30: Inference and evaluation code is released
  • 2025-06-25: Paper accepted at ICCV 2025 πŸŽ‰πŸŽ‰

πŸš€ Quick Start

Installation

Clone the repository, then create and activate the conda environment:

conda env create -f environment.yml
conda activate LayYourScene

Demo

  1. Download the trained models and configs from Hugging Face:
git clone https://huggingface.co/dsrivastavv/Lay-Your-Scene saved_models
  2. Run demo.py to generate a layout with Lay-Your-Scene, followed by GLIGEN to convert the generated layout into an image:
python demo.py --ckpt saved_models/grit/model.pt \
                --ckpt-config saved_models/grit/config.json \
                --caption "A big boat in the ocean and a man running on beach under blue sky"

The generated layout is saved in the repository root as scene_layout.png, and the generated image as scene.png.
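demo.py produces these files itself; purely as an illustration of what a scene layout looks like, the sketch below renders a hand-written layout (the `(label, box)` format here is an assumption, not the repository's actual schema) to a preview image with Pillow:

```python
from PIL import Image, ImageDraw

# Hypothetical layout for the demo caption: (label, (x0, y0, x1, y1))
# boxes in pixel coordinates. demo.py's real output format may differ.
layout = [
    ("sky", (0, 0, 512, 160)),
    ("boat", (260, 180, 440, 300)),
    ("man", (60, 220, 120, 360)),
]

canvas = Image.new("RGB", (512, 512), "white")
draw = ImageDraw.Draw(canvas)
for label, box in layout:
    draw.rectangle(box, outline="red", width=2)      # bounding box
    draw.text((box[0] + 4, box[1] + 4), label, fill="red")  # category tag
canvas.save("scene_layout_preview.png")
```

A layout image like this is what GLIGEN then conditions on to produce the final scene.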

πŸ“Š Evaluation

  1. Layout-FID: Renders each layout as an image, mapping every object to a specific color, following the document layout generation literature; colors are assigned to account for the semantic similarity between object categories, measured with CLIP.
python -m layout_evaluation.evaluate \
    --evaluation_name coco_grounded_lfid \
    --layout_file results/coco_evaluation/layousyn/output.json \
    --evaluation_dir results/coco_evaluation/layousyn

Alternatively, generate the layouts from the trained model and evaluate them with:

python -m layousyn.evaluation.coco_evaluation \
    --ckpt saved_models/coco_grounded/model.pt \
    --ckpt-config saved_models/coco_grounded/config.json \
    --num-sampling-step 15 \
    --cfg-scales 2.0 \
    --eval-dir results/coco_evaluation/layousyn \
    --partial-file
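The rendering step behind Layout-FID can be sketched as follows. This is a minimal illustration with a fixed palette; in the actual metric, colors are assigned so that categories with similar CLIP text embeddings receive similar colors:

```python
from PIL import Image, ImageDraw

# Hypothetical fixed category-to-color map; the real metric derives
# colors from CLIP similarity between category names.
PALETTE = {"person": (230, 25, 75), "boat": (60, 180, 75), "sky": (0, 130, 200)}


def render_layout(boxes, size=(256, 256)):
    """Draw each box as a filled color block; the resulting images
    are what an FID score would be computed over."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    for label, box in boxes:
        draw.rectangle(box, fill=PALETTE.get(label, (128, 128, 128)))
    return img


img = render_layout([("sky", (0, 0, 256, 80)), ("boat", (100, 120, 200, 180))])
```

FID is then computed between the sets of rendered images for generated and ground-truth layouts.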
  2. NSR-1K evaluation: Evaluates the generated layouts on spatial and counting metrics using the NSR-1K benchmark.
# Spatial evaluation
python -m layout_evaluation.evaluate \
    --evaluation_name nsr_spatial \
    --layout_file results/nsr_1k/layousyn/spatial.json

# Counting evaluation
python -m layout_evaluation.evaluate \
    --evaluation_name nsr_counting \
    --layout_file results/nsr_1k/layousyn/counting.json

Alternatively, generate the layouts from the trained model and evaluate them with:

python -m layousyn.evaluation.nsr_counting_evaluation \
    --ckpt saved_models/coco_grounded/model.pt \
    --ckpt-config saved_models/coco_grounded/config.json \
    --num-sampling-step 15 \
    --cfg-scales 2.0 \
    --eval-dir results/nsr_1k/layousyn/counting

python -m layousyn.evaluation.nsr_spatial_evaluation \
    --ckpt saved_models/coco_grounded/model.pt \
    --ckpt-config saved_models/coco_grounded/config.json \
    --num-sampling-step 15 \
    --cfg-scales 2.0 \
    --eval-dir results/nsr_1k/layousyn/spatial
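The counting metric checks whether a generated layout contains the number of object instances the prompt asked for. A simplified sketch (the benchmark's exact scoring rule is an assumption here):

```python
from collections import Counter


def counting_accuracy(expected_counts, generated_layouts):
    """Fraction of layouts whose per-category object counts exactly
    match the counts requested by the prompt.

    expected_counts:   one dict per prompt, category -> requested count.
    generated_layouts: one list per prompt of generated category labels.
    """
    correct = 0
    for want, labels in zip(expected_counts, generated_layouts):
        if Counter(labels) == Counter(want):
            correct += 1
    return correct / len(generated_layouts)


score = counting_accuracy(
    [{"dog": 3}, {"cat": 2}],
    [["dog", "dog", "dog"], ["cat"]],  # first matches, second is short
)
```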
  3. GLIP: Follow the instructions in LayoutGPT to generate images from layouts with GLIGEN and to detect objects with GLIP.

Training

  1. Download the COCO-17, COCO-Caption-Grounded, and NSR-1K datasets and place them in the datasets directory. The directory structure should look like:
datasets
β”œβ”€β”€ coco-2017
β”œβ”€β”€ COCOCaptionGrounded
└── NSR-1K
  2. Run the following command to train the model on a single GPU. Note: the first run must use a single GPU so that the embedding caches are saved; later runs can use multiple GPUs.
export OMP_NUM_THREADS=4
torchrun --nnodes=1 --nproc_per_node=1 train.py --config configs/config.json

Note: After the embedding caches are saved, training takes around 20 hours on 2 NVIDIA RTX A5000 GPUs, running at approximately 15 steps/sec. Extracting the embeddings takes around 2 hours on COCOGroundedDataset.

  3. (Optional) Point --embed-dir at very fast storage, such as an NVMe SSD, to speed up training.
export OMP_NUM_THREADS=4
torchrun --nnodes=1 --nproc_per_node=1 train.py --num-workers 4 --model DiT-XS --embed-dir /path/to/fast-storage
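The single-GPU first run exists to populate the embedding cache: prompt embeddings are computed once and reread from disk in later (multi-GPU) runs. A minimal sketch of such a compute-once cache (the repository's actual cache layout is unknown; numpy and a stand-in encoder replace the real frozen text encoder):

```python
import hashlib
import os

import numpy as np


class EmbeddingCache:
    """Compute-once, read-many embedding cache keyed by a hash of the text."""

    def __init__(self, cache_dir, embed_fn):
        self.cache_dir = cache_dir
        self.embed_fn = embed_fn  # expensive, e.g. a frozen text encoder
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, f"{key}.npy")

    def get(self, text):
        path = self._path(text)
        if os.path.exists(path):
            return np.load(path)   # later runs: cheap disk read
        emb = self.embed_fn(text)  # first run: expensive compute
        np.save(path, emb)
        return emb


def fake_encoder(text):
    # Deterministic stand-in for a real text encoder.
    return np.full(8, float(len(text)))


cache = EmbeddingCache("embed_cache", fake_encoder)
first = cache.get("a boat in the ocean")
second = cache.get("a boat in the ocean")  # served from disk
```

Keeping the cache on fast storage (the --embed-dir tip above) matters because every training step reads from it.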

🀝 Acknowledgements

We deeply appreciate the contributions of the following projects:

We would also like to thank Xiang Zhang, Ethan Armand, and Zeyuan Chen for their valuable feedback and insightful discussions.

βœ’οΈ Citation

If you find our work useful, please consider citing:

@misc{srivastava2025layyourscenenaturalscenelayout,
      title={Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers}, 
      author={Divyansh Srivastava and Xiang Zhang and He Wen and Chenru Wen and Zhuowen Tu},
      year={2025},
      eprint={2505.04718},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.04718}, 
}
