| 📃 Paper | 🤗 HuggingFace | 🧑‍💻 Code |
In this work, we introduce WAON, a large-scale and high-quality Japanese image-text pair dataset containing approximately 155 million examples, collected from Common Crawl. To evaluate its effectiveness, we also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification. In our experiments, we fine-tune SigLIP2, a strong multilingual model, on both WAON and the Japanese subset of ReLAION, one of the most influential vision-language datasets. The results demonstrate that WAON enhances model performance on WAON-Bench more efficiently than ReLAION and achieves higher accuracy across all tasks. Furthermore, the model fine-tuned on WAON achieves state-of-the-art performance on several Japanese cultural benchmarks.
This repository contains the code to construct WAON, a large-scale and high-quality Japanese image-text pair dataset for vision-language models.
Install dependencies using uv:
```bash
uv sync
source .venv/bin/activate
```

When installing fasttext, you may need to set the `CC` and `CXX` environment variables to `gcc` and `g++`, respectively:
```bash
CC=gcc CXX=g++ uv add fasttext
```

The WAON dataset is constructed through the following pipeline:
We begin by collecting URLs of WARC files from Common Crawl.
```bash
python src/crawl_mm/waon_cc/cc2url.py
```
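Conceptually, Common Crawl publishes a `warc.paths.gz` index for each snapshot that lists its WARC files. Below is a minimal sketch of this step; the snapshot ID is an illustrative assumption, not necessarily what `cc2url.py` iterates over.

```python
# Sketch: list WARC file URLs for one Common Crawl snapshot.
import gzip
import io

import requests

SNAPSHOT = "CC-MAIN-2023-50"  # assumed example snapshot
PATHS_URL = f"https://data.commoncrawl.org/crawl-data/{SNAPSHOT}/warc.paths.gz"

resp = requests.get(PATHS_URL, timeout=60)
resp.raise_for_status()

# Each line of the index is a path relative to data.commoncrawl.org.
with gzip.open(io.BytesIO(resp.content), mode="rt") as f:
    warc_urls = ["https://data.commoncrawl.org/" + line.strip() for line in f]

print(f"{len(warc_urls)} WARC files, e.g. {warc_urls[0]}")
```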
Next, we download the WARC files and extract HTML pages containing Japanese text. (In this example, we limit the number of WARC files to 1 for testing.)

```bash
python src/crawl_mm/waon_cc/url2html.py --max_num_files 1
python src/crawl_mm/waon_cc/html2goodhtml.py
```
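Given the fasttext dependency above, Japanese pages are presumably identified with a fasttext language-identification model. A minimal sketch of the idea, assuming the public `lid.176.bin` model and `warcio` for WARC parsing; the file names and threshold are placeholders, and the actual scripts may extract visible text and apply extra quality heuristics:

```python
# Sketch: stream one WARC file and keep HTML responses whose text is
# detected as Japanese by fasttext's lid.176.bin language-ID model.
import fasttext
from warcio.archiveiterator import ArchiveIterator

lid = fasttext.load_model("lid.176.bin")  # assumed model choice (fasttext.cc)

def is_japanese(text: str, threshold: float = 0.8) -> bool:
    # fasttext rejects newlines; truncate to keep prediction cheap.
    labels, probs = lid.predict(text.replace("\n", " ")[:2000])
    return labels[0] == "__label__ja" and probs[0] >= threshold

japanese_pages = []
with open("example.warc.gz", "rb") as stream:  # a file listed by cc2url.py
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        ctype = record.http_headers.get_header("Content-Type") or ""
        if "text/html" not in ctype:
            continue
        html = record.content_stream().read().decode("utf-8", errors="ignore")
        if is_japanese(html):
            url = record.rec_headers.get_header("WARC-Target-URI")
            japanese_pages.append((url, html))
```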
We then extract image–text pairs from the processed HTML files.

```bash
python src/crawl_mm/waon_cc/goodhtml2pair.py
```
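As an illustration, pair extraction can be as simple as taking `<img>` `alt` text as the caption; the real `goodhtml2pair.py` may draw captions from more sources and apply stricter rules.

```python
# Sketch: extract (image URL, caption) candidates from an HTML page via
# <img> alt text. Relative URLs are resolved against the page URL.
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def extract_pairs(html: str, page_url: str) -> list[tuple[str, str]]:
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for img in soup.find_all("img"):
        src = img.get("src")
        alt = (img.get("alt") or "").strip()
        if src and alt:  # keep only images with a non-empty caption
            pairs.append((urljoin(page_url, src), alt))
    return pairs
```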
Duplicate image URLs and captions are removed.

```bash
python src/crawl_mm/waon_cc/deduplicate_image_url.py
```
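The idea here is exact deduplication, sketched below with an in-memory set; at the scale of hundreds of millions of pairs, the actual script presumably works in a more memory-conscious way.

```python
# Sketch: exact deduplication on (image URL, caption), keeping the
# first occurrence of each pair.
def dedup_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    seen: set[tuple[str, str]] = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique
```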
Images are downloaded from the corresponding URLs.

```bash
python src/crawl_mm/waon_cc/download_images.py
```
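For reference, a single-image downloader with the basic safeguards such a step needs (timeouts, status and content-type checks); the real script almost certainly parallelizes and retries.

```python
# Sketch: download one image, rejecting non-image responses.
import requests

def download_image(url: str, path: str, timeout: float = 10.0) -> bool:
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        if not resp.headers.get("Content-Type", "").startswith("image/"):
            return False
        with open(path, "wb") as f:
            f.write(resp.content)
        return True
    except requests.RequestException:
        return False
```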
We filter out unsuitable images based on resolution and aspect ratio.

```bash
python src/crawl_mm/waon_cc/image_size_filter.py
```
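A sketch of such a filter; the thresholds below are placeholders, since the actual cutoffs are defined in `image_size_filter.py`.

```python
# Sketch: reject images that are too small or too elongated.
from PIL import Image

MIN_SIDE = 64      # assumed minimum width/height in pixels
MAX_ASPECT = 3.0   # assumed maximum long-side / short-side ratio

def keep_image(path: str) -> bool:
    with Image.open(path) as img:
        w, h = img.size
    if min(w, h) < MIN_SIDE:
        return False
    return max(w, h) / min(w, h) <= MAX_ASPECT
```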
Images containing NSFW content are filtered out using a pre-trained NSFW detection model.

```bash
python src/crawl_mm/waon_cc/nsfw_filter.py
```
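As an illustration, this is how such a filter looks with one publicly available detector on Hugging Face; the model name and threshold are assumptions, not necessarily WAON's choices.

```python
# Sketch: keep an image only if a pre-trained NSFW classifier scores it
# below a threshold. The model below is an assumed example detector.
from transformers import pipeline

nsfw_classifier = pipeline(
    "image-classification",
    model="Falconsai/nsfw_image_detection",  # assumed example model
)

def is_safe(path: str, threshold: float = 0.5) -> bool:
    scores = {r["label"]: r["score"] for r in nsfw_classifier(path)}
    return scores.get("nsfw", 0.0) < threshold
```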
We compute a perceptual hash (pHash) for each image to enable deduplication.

```bash
python src/crawl_mm/waon_cc/annotate_phash.py
```
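A perceptual hash maps visually similar images to identical or nearby bit strings, so near-duplicates can be detected without pixel-level comparison. A minimal sketch with the `imagehash` library (an assumption; the script may use its own implementation):

```python
# Sketch: compute a 64-bit perceptual hash for an image.
import imagehash
from PIL import Image

def compute_phash(path: str) -> str:
    with Image.open(path) as img:
        return str(imagehash.phash(img))
```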
Images are deduplicated using a Bloom filter based on pHash similarity.

```bash
python src/crawl_mm/waon_cc/dedup_phash_bloom.py
```
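A Bloom filter answers "have I seen this pHash before?" in constant memory per element, with a tunable false-positive rate and no false negatives: duplicates are reliably caught, at the cost of occasionally dropping a unique image. A sketch assuming the `pybloom-live` package; the capacity and error rate are placeholders.

```python
# Sketch: streaming pHash deduplication with a Bloom filter.
from pybloom_live import BloomFilter

bloom = BloomFilter(capacity=200_000_000, error_rate=0.001)  # assumed sizing

def is_new(phash: str) -> bool:
    if phash in bloom:
        return False  # (probably) seen before -> treat as duplicate
    bloom.add(phash)
    return True
```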
Each image–text pair is annotated with a CLIP similarity score using the SigLIP2-base model.

```bash
python src/crawl_mm/waon_cc/annotate_clip_score.py
```
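A sketch of scoring one pair with a SigLIP2-base checkpoint via `transformers`; the exact checkpoint name is an assumption.

```python
# Sketch: image-text similarity with a SigLIP2-base checkpoint.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CKPT = "google/siglip2-base-patch16-224"  # assumed checkpoint variant
model = AutoModel.from_pretrained(CKPT)
processor = AutoProcessor.from_pretrained(CKPT)

def clip_score(image_path: str, caption: str) -> float:
    inputs = processor(
        text=[caption],
        images=Image.open(image_path),
        padding="max_length",
        return_tensors="pt",
    )
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the image and text embeddings.
    return torch.nn.functional.cosine_similarity(
        out.image_embeds, out.text_embeds
    ).item()
```

The filtering step below then keeps only the pairs whose score clears a threshold.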
Finally, image–text pairs with low CLIP scores are removed.

```bash
python src/crawl_mm/waon_cc/filter_low_clipscore.py
```

Star us on GitHub if you find this repository useful! ⭐
If you find this work interesting, please cite our paper:
```bibtex
@misc{sugiura2025waonlargescalehighqualityjapanese,
  title={WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models},
  author={Issa Sugiura and Shuhei Kurita and Yusuke Oda and Daisuke Kawahara and Yasuo Okabe and Naoaki Okazaki},
  year={2025},
  eprint={2510.22276},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.22276},
}
```