llm-jp/WAON

WAON

Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models

| 📃Paper | 🤗HuggingFace | 🧑‍💻Code |

Introduction

In this work, we introduce WAON, a large-scale and high-quality Japanese image-text pair dataset containing approximately 155 million examples, collected from Common Crawl. To evaluate its effectiveness, we also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification. In our experiments, we fine-tune SigLIP2, a strong multilingual model, on both WAON and the Japanese subset of ReLAION, one of the most influential vision-language datasets. The results demonstrate that WAON enhances model performance on WAON-Bench more efficiently than ReLAION and achieves higher accuracy across all tasks. Furthermore, the model fine-tuned on WAON achieves state-of-the-art performance on several Japanese cultural benchmarks.

Repository Overview

This repository contains the code to construct WAON, a large-scale and high-quality Japanese image-text pair dataset for vision-language models.

Setup

Install dependencies using uv:

uv sync
source .venv/bin/activate

When installing fasttext, you may need to set the CC and CXX environment variables to gcc and g++, respectively:

CC=gcc CXX=g++ uv add fasttext

Constructing the WAON Dataset

The WAON dataset is constructed through the following pipeline:

1. Collect WARC File URLs

We begin by collecting URLs of WARC files from Common Crawl.

python src/crawl_mm/waon_cc/cc2url.py
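Conceptually, this step turns each entry of a crawl's `warc.paths` listing (which Common Crawl publishes per crawl) into an absolute download URL. A minimal sketch, assuming Common Crawl's public path-listing format; the function name is illustrative:

```python
# Common Crawl publishes one warc.paths(.gz) file per crawl; each line is a
# path relative to https://data.commoncrawl.org/.
BASE = "https://data.commoncrawl.org/"

def paths_to_urls(paths_text: str) -> list[str]:
    """Convert newline-separated WARC paths into absolute URLs."""
    return [BASE + line.strip() for line in paths_text.splitlines() if line.strip()]

listing = """\
crawl-data/CC-MAIN-2024-10/segments/1707947473735.7/warc/CC-MAIN-0001.warc.gz
crawl-data/CC-MAIN-2024-10/segments/1707947473735.7/warc/CC-MAIN-0002.warc.gz
"""
urls = paths_to_urls(listing)
print(urls[0])
```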

2. Download and Extract HTML Files

Next, we download WARC files and extract HTML pages containing Japanese text. (In this example, we limit the number of WARC files to 1 for testing.)

python src/crawl_mm/waon_cc/url2html.py --max_num_files 1
python src/crawl_mm/waon_cc/html2goodhtml.py
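The language check in the actual scripts presumably relies on the fastText language identifier installed during setup; as a self-contained illustration, a kana-based heuristic approximates the idea (the function name and threshold are illustrative):

```python
import re

# Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) are unambiguously
# Japanese scripts; CJK ideographs alone could also be Chinese, so we count
# kana only.
JA_KANA = re.compile(r"[\u3040-\u30ff]")

def looks_japanese(text: str, min_kana: int = 5) -> bool:
    """Heuristic: treat a page as Japanese if it contains enough kana."""
    return len(JA_KANA.findall(text)) >= min_kana

print(looks_japanese("これは日本語のページです。画像とキャプションを含みます。"))
print(looks_japanese("This page is entirely in English."))
```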

3. Extract Image-Text Pairs

We then extract image–text pairs from the processed HTML files.

python src/crawl_mm/waon_cc/goodhtml2pair.py
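At its simplest, pair extraction means walking the HTML and collecting image URLs together with caption candidates. A minimal sketch using stdlib `html.parser` and `<img>` alt text as the caption; the real extractor may use richer context (e.g. surrounding text or figcaption):

```python
from html.parser import HTMLParser

class ImgAltExtractor(HTMLParser):
    """Collect (image URL, caption) pairs from <img src=... alt=...> tags."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            src, alt = a.get("src"), (a.get("alt") or "").strip()
            if src and alt:  # keep only images with a non-empty caption
                self.pairs.append((src, alt))

html = '<p>記事</p><img src="https://example.com/a.jpg" alt="桜の写真"><img src="b.png" alt="">'
p = ImgAltExtractor()
p.feed(html)
print(p.pairs)
```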

4. Deduplicate Image URLs and Captions

Duplicate image URLs and captions are removed.

python src/crawl_mm/waon_cc/deduplicate_image_url.py
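In spirit, this step keeps the first record seen for each image URL and for each caption. A minimal sketch; the record field names are illustrative:

```python
def dedup(records, key):
    """Keep the first record for each distinct value of `key`."""
    seen, out = set(), []
    for r in records:
        k = r[key]
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

recs = [
    {"url": "a.jpg", "caption": "猫"},
    {"url": "a.jpg", "caption": "別の猫"},  # duplicate URL -> dropped
    {"url": "b.jpg", "caption": "猫"},      # duplicate caption -> dropped next
]
unique = dedup(dedup(recs, "url"), "caption")
print(len(unique))
```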

5. Download Images

Images are downloaded from the corresponding URLs.

python src/crawl_mm/waon_cc/download_images.py

6. Filter by Image Size

We filter out unsuitable images based on resolution and aspect ratio.

python src/crawl_mm/waon_cc/image_size_filter.py
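The filtering criterion amounts to a minimum resolution and a bound on how elongated the image may be. A sketch of the rule; the thresholds here are illustrative, not the pipeline's actual values:

```python
def keep_image(width: int, height: int, min_side: int = 64, max_aspect: float = 3.0) -> bool:
    """Reject images that are too small or too elongated."""
    if min(width, height) < min_side:
        return False  # too small to be a useful training image
    aspect = max(width, height) / min(width, height)
    return aspect <= max_aspect  # reject banner-like strips

print(keep_image(640, 480))   # typical photo
print(keep_image(1200, 90))   # banner-like, extreme aspect ratio
print(keep_image(32, 32))     # below the minimum resolution
```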

7. NSFW Filtering

Images containing NSFW content are filtered out using a pre-trained NSFW detection model.

python src/crawl_mm/waon_cc/nsfw_filter.py

8. Annotate Perceptual Hashes (pHash)

We compute perceptual hashes for each image to enable deduplication.

python src/crawl_mm/waon_cc/annotate_phash.py
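A perceptual hash maps an image to a short bit string such that visually similar images get similar bit strings. True pHash applies a DCT to a downscaled grayscale image; as a dependency-free illustration of the same idea, here is the simpler average hash (aHash) on an already-downscaled 8x8 grid:

```python
def average_hash(gray8x8) -> int:
    """gray8x8: 8x8 list of 0-255 grayscale values -> 64-bit integer hash.
    Each bit records whether a pixel is brighter than the image mean."""
    flat = [v for row in gray8x8 for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits; a small distance indicates near-duplicates."""
    return bin(a ^ b).count("1")

img = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]  # toy gradient
near = [row[:] for row in img]
near[0][0] += 3  # tiny perturbation, e.g. re-encoding noise
print(hamming(average_hash(img), average_hash(near)))  # small for near-duplicates
```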

9. Deduplicate Images by pHash

Images are deduplicated using a Bloom filter based on pHash similarity.

python src/crawl_mm/waon_cc/dedup_phash_bloom.py
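A Bloom filter makes this scalable: it answers "definitely new" or "possibly seen" in constant memory, so a duplicate hash never slips through, at the cost of occasionally discarding a unique image on a false positive. A minimal sketch of the mechanism, checking exact hash values (the pipeline's script may additionally bucket similar pHashes):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over a fixed-size bit array."""
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
kept = []
for phash in ["a1b2", "c3d4", "a1b2"]:  # third hash is a duplicate
    if phash not in bf:
        bf.add(phash)
        kept.append(phash)
print(kept)
```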

10. Annotate CLIP Scores

Each image–text pair is annotated with a CLIP similarity score using the SigLIP2-base model.

python src/crawl_mm/waon_cc/annotate_clip_score.py
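The score itself is the cosine similarity between the image embedding and the text embedding; in the pipeline these embeddings come from the SigLIP2-base encoders. A sketch of the scoring step with toy vectors standing in for real embeddings:

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

image_emb = [0.8, 0.1, 0.6]          # toy image embedding
matching_text_emb = [0.7, 0.2, 0.7]   # caption that matches the image
unrelated_text_emb = [-0.5, 0.9, -0.1]  # caption that does not

# A matching pair scores high; a mismatched pair scores low, and the next
# step discards pairs below a threshold.
print(cosine(image_emb, matching_text_emb) > cosine(image_emb, unrelated_text_emb))
```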

11. Filter by CLIP Score

Finally, image–text pairs with low CLIP scores are removed.

python src/crawl_mm/waon_cc/filter_low_clipscore.py

Citation

Star us on GitHub if you find this repository useful! ⭐

If you find this work interesting, please cite our paper:

@misc{sugiura2025waonlargescalehighqualityjapanese,
      title={WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models},
      author={Issa Sugiura and Shuhei Kurita and Yusuke Oda and Daisuke Kawahara and Yasuo Okabe and Naoaki Okazaki},
      year={2025},
      eprint={2510.22276},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.22276},
}
