Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search with Task-Centric Benchmarks

🔗 Introduction

Iceberg is a comprehensive benchmark suite for end-to-end evaluation of VSS (Vector Similarity Search) methods in realistic application settings. It spans 7 diverse datasets across key domains including image classification, face recognition, text retrieval, and recommendation systems. Each dataset contains 1M to 100M vectors enriched with task-specific labels and metrics, enabling evaluation of retrieval algorithms within full application pipelines, not just in isolated recall-speed scenarios. Iceberg benchmarks 13 state-of-the-art VSS algorithms and re-ranks them using task-centric performance metrics, uncovering substantial deviations from conventional recall/speed-based rankings. Moreover, Iceberg proposes an interpretable decision tree to guide practitioners in selecting and tuning VSS methods for specific workloads.

📚 Datasets

The dataset has been publicly released and is maintained on the Hugging Face platform.

Access Link: Iceberg-dataset

Overview

Dataset	Base Size	Dim	Query Size	Domain	Original Data Source
ImageNet-DINOv2	1,281,167	768	50,000	Image Classification	https://image-net.org/index.php
ImageNet-EVA02	1,281,167	1024	50,000	Image Classification	https://image-net.org/index.php
ImageNet-ConvNeXt	1,281,167	1536	50,000	Image Classification	https://image-net.org/index.php
Glint360K-IR101	17,091,649	512	20,000	Face Recognition	https://github.com/deepinsight/insightface/tree/master/recognition/partial_fc#glint360k
Glint360K-ViT	17,091,649	512	20,000	Face Recognition	https://github.com/deepinsight/insightface/tree/master/recognition/partial_fc#glint360k
BookCorpus	9,250,529	1024	10,000	Text Retrieval	https://huggingface.co/datasets/bookcorpus/bookcorpus
Commerce	99,085,171	48	64,111	Recommendation

Detailed Description

D1: ImageNet

ImageNet is a large-scale dataset containing millions of high-resolution images spanning thousands of object categories. Each image is annotated with ground-truth labels, either manually or semi-automatically. The dataset has been widely used in the computer vision community for model training and benchmarking, particularly for image classification tasks.

Embedding Models:

DINOv2, EVA02, ConvNeXt (the three ImageNet variants listed in the Overview table)

End Tasks:

Label Recall@K: It measures how many correct task-specific labels appear in the top-K retrieved results.
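
To make the difference between a task-centric metric and conventional recall concrete, here is a minimal sketch. It assumes Label Recall@K is the fraction of the top-K results whose label equals the query's label; the exact formula used by evaluator.py is not given in this README, and the function names and toy data below are hypothetical.

```python
from typing import Sequence


def label_recall_at_k(retrieved_labels: Sequence[int], query_label: int, k: int) -> float:
    """Fraction of the top-k retrieved items whose label equals the query's label.

    Assumed definition for illustration; the benchmark's own formula may differ.
    """
    top_k = retrieved_labels[:k]
    if not top_k:
        return 0.0
    return sum(1 for label in top_k if label == query_label) / len(top_k)


def synthetic_recall_at_k(retrieved_ids: Sequence[int], exact_ids: Sequence[int], k: int) -> float:
    """Fraction of the exact top-k neighbors that the approximate search returned."""
    exact = set(exact_ids[:k])
    if not exact:
        return 0.0
    return len(exact.intersection(retrieved_ids[:k])) / len(exact)


# Hypothetical toy data: item id -> class label.
labels = {0: 7, 1: 7, 2: 3, 3: 7, 4: 3, 5: 7}
exact_ids = [0, 2, 4]      # true nearest neighbors by distance
retrieved_ids = [1, 3, 5]  # what an approximate index returned
query_label = 7

syn = synthetic_recall_at_k(retrieved_ids, exact_ids, k=3)
lab = label_recall_at_k([labels[i] for i in retrieved_ids], query_label, k=3)
print(f"synthetic recall@3 = {syn:.2f}, label recall@3 = {lab:.2f}")
```

In this toy case the index misses every exact neighbor yet returns only correctly labeled items, which is the kind of gap between distance-based and task-based rankings that the benchmark is designed to expose.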

D2: Glint360K

Glint360K is a large-scale face dataset created by merging and cleaning multiple public face datasets to significantly expand both the number of identities and facial images.

Embedding Models:

IR101, ViT (the two Glint360K variants listed in the Overview table)

End Tasks:

Label Recall@K: It measures how many correct task-specific labels appear in the top-K retrieved results.

D3: BookCorpus

BookCorpus consists of text extracted from approximately 19,000 books spanning various domains and has been curated into a high-quality corpus. The text was segmented at the paragraph level, with each paragraph concatenated into chunks containing eight sentences. This preprocessing resulted in a base dataset of 9,250,529 paragraphs. From this corpus, 10,000 paragraphs were randomly sampled to construct the query set. The unique ID of each paragraph was used as the label for its corresponding embedding vector.

Embedding Models:

Stella: https://huggingface.co/NovaSearch/stella_en_1.5B_v5

End Tasks:

Hit@K: It measures whether the most semantically relevant paragraph is included in the top-K retrieved results.
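
Hit@K as described above can be sketched in a few lines. This assumes each query has exactly one relevant paragraph ID; the function names and sample IDs are hypothetical and not taken from the repository.

```python
from typing import Sequence


def hit_at_k(retrieved_ids: Sequence[int], relevant_id: int, k: int) -> int:
    """Return 1 if the relevant paragraph ID is among the top-k results, else 0."""
    return int(relevant_id in retrieved_ids[:k])


def mean_hit_at_k(all_retrieved: Sequence[Sequence[int]],
                  all_relevant: Sequence[int], k: int) -> float:
    """Average Hit@K over a batch of queries."""
    if not all_relevant:
        return 0.0
    hits = [hit_at_k(r, rel, k) for r, rel in zip(all_retrieved, all_relevant)]
    return sum(hits) / len(hits)


# Hypothetical results for two queries (ranked paragraph IDs).
results = [[4, 9, 1], [3, 8, 2]]
relevant = [9, 5]  # the single most relevant paragraph per query

score_k3 = mean_hit_at_k(results, relevant, k=3)
score_k1 = mean_hit_at_k(results, relevant, k=1)
print(score_k3, score_k1)
```

Because the metric is binary per query, it rewards finding the target paragraph anywhere in the top K and ignores how many other near neighbors were recovered.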

D4: Commerce

The Commerce dataset, derived from anonymized traffic logs of a major e-commerce platform, serves as a representative benchmark for large-scale e-commerce systems. Collected over several months, the dataset comprises 99,085,171 records of frequently purchased grocery items. In addition, a query set of 64,111 entries was constructed to represent user profiles and associated search keywords. Each query is linked to a sequence of high-popularity items, enabling evaluation on downstream recommendation tasks. Item IDs are used as labels throughout the dataset.

Embedding Models:

ResFlow: https://github.com/FuCongResearchSquad/ResFlow

End Tasks:

Matching Score@K: It measures whether the items retrieved for a query are both relevant and popular, taking into account the cumulative popularity of those items.
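
The README does not give a formula for Matching Score@K. One possible reading, shown purely as an illustration, sums the popularity of the top-K retrieved items that also appear in the query's linked item sequence. The function name, the popularity table, and the item IDs below are all hypothetical, and the benchmark's actual scoring may differ.

```python
from typing import Mapping, Sequence


def matching_score_at_k(retrieved_ids: Sequence[int],
                        relevant_ids: Sequence[int],
                        popularity: Mapping[int, float],
                        k: int) -> float:
    """Cumulative popularity of top-k retrieved items that are relevant to the query.

    Assumed interpretation for illustration only.
    """
    relevant = set(relevant_ids)
    return sum(popularity.get(item, 0.0)
               for item in retrieved_ids[:k] if item in relevant)


# Hypothetical popularity counts per item ID.
popularity = {10: 5.0, 11: 1.0, 12: 3.0, 13: 8.0}
relevant_items = [10, 12]         # items linked to the query
retrieved_items = [13, 10, 12, 11]  # ranked search results

score_at_3 = matching_score_at_k(retrieved_items, relevant_items, popularity, k=3)
score_at_2 = matching_score_at_k(retrieved_items, relevant_items, popularity, k=2)
print(score_at_3, score_at_2)
```

Under this reading, item 13 contributes nothing despite being the most popular, because it is not linked to the query.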

📑 Supported Algorithms

Algorithm	Metric	Type	Original Code Link
Fargo	Inner Product	Partition-based	https://github.com/Jacyhust/FARGO_VLDB23
ScaNN	Inner Product	Partition-based	https://github.com/google-research/google-research/tree/master/scann
ip-NSW	Inner Product	Graph-based	https://github.com/stanis-morozov/ip-nsw
ip-NSW+	Inner Product	Graph-based	https://github.com/jerry-liujie/ip-nsw/tree/GraphMIPS
Mobius	Inner Product	Graph-based	Our own implementation
NAPG	Inner Product	Graph-based	Our own implementation
MAG	Inner Product	Graph-based	https://github.com/ZJU-DAILY/MAG
RaBitQ	Euclidean Distance	Partition-based	https://github.com/VectorDB-NTU/RaBitQ-Library
IVFPQ	Euclidean Distance	Partition-based	https://github.com/facebookresearch/faiss
DB-LSH	Euclidean Distance	Partition-based	https://github.com/Jacyhust/DB-LSH
HNSW	Euclidean Distance	Graph-based	https://github.com/nmslib/hnswlib
NSG	Euclidean Distance	Graph-based	https://github.com/ZJULearning/nsg
Vamana	Euclidean Distance	Graph-based	https://github.com/microsoft/DiskANN
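
The algorithms above are grouped by the similarity metric they target. As a small illustration of why that grouping matters, the sketch below shows that the best match under inner product and under Euclidean distance can be different vectors when norms vary. The vectors and names are made up for the example.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """L2 distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical base vectors: one close to the query, one with a large norm.
base = {"short": [1.0, 0.0], "long": [3.0, 3.0]}
query = [1.0, 0.2]

best_ip = max(base, key=lambda name: inner_product(query, base[name]))
best_l2 = min(base, key=lambda name: euclidean(query, base[name]))
print(f"inner product picks {best_ip!r}, Euclidean distance picks {best_l2!r}")
```

Inner product favors the large-norm vector while Euclidean distance favors the nearby one, so an index built for one metric is not automatically a good fit for the other.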

🚀 Installation & Quick Start

Clone the repository

git clone project

Environment Requirements

Python 3.10+, Docker, and PyYAML.

Install the Python dependencies with pip install -r requirements.txt.

Run the benchmark

Example: running HNSW on the ImageNet dataset.

Configure the dataset (config/dataset.yaml):

```
imagenet1k_avg:
  dataset_type: imagenet
  data_pre: imagenet-1k
  train_name: convnext-avg-pool-train.bin
  test_name: convnext-avg-pool-validation.bin
  train_path: /workspace/data/imagenet-1k/convnext-avg-pool-train.bin
  test_path: /workspace/data/imagenet-1k/convnext-avg-pool-validation.bin
  prefix: convnext-avg-pool
  data_dim: 1536
  k: 100
  data_num: 1281167
  query_num: 50000
```

Configure the algorithm (config/algorithm.yaml)
```
hnsw:
  efc: 256
  M: 32
  efs: [100, 200, 300, 400, 500, 600, 800, 1000, 1500]
  type: nn
```
Configuration parameters:
- efc: HNSW build parameter (candidate list size during index construction)
- M: HNSW build parameter (maximum number of links per node)
- efs: HNSW search parameter (candidate list size at query time)
- type: distance metric type

Run the algorithm & evaluation
1. Configure the dataset and algorithm parameters in config/dataset.yaml and config/algorithm.yaml
2. Run the algorithm using: python3 run.py hnsw imagenet1k_avg --mode build/search, where the dataset name matches the entry configured in config/dataset.yaml
3. For more configuration options, refer to: python3 run.py --help
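
The algorithm configuration above lists several efs values next to a single pair of build parameters. A plausible reading is that one index is built once and then searched once per efs value. The sketch below expands the configuration that way; it mirrors the YAML as a plain Python dict, and expand_search_runs is a hypothetical helper, not a function from run.py, whose actual handling of the config is not shown in this README.

```python
from typing import Any, Dict, List

# Mirror of the hnsw entry in config/algorithm.yaml, written as a dict
# so the example has no third-party dependencies.
algorithm_config: Dict[str, Dict[str, Any]] = {
    "hnsw": {
        "efc": 256,
        "M": 32,
        "efs": [100, 200, 300, 400, 500, 600, 800, 1000, 1500],
        "type": "nn",
    }
}


def expand_search_runs(name: str, config: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Produce one search setting per efs value, sharing the same build parameters.

    Hypothetical helper for illustration only.
    """
    params = config[name]
    build = {"efc": params["efc"], "M": params["M"]}
    return [{"algorithm": name, **build, "efs": efs} for efs in params["efs"]]


runs = expand_search_runs("hnsw", algorithm_config)
for run in runs:
    print(run)
```

Sweeping the search parameter against a fixed index is the usual way to trace an accuracy versus throughput curve for a graph-based method.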

To-Do Lists

✅ Open-source code is available for the benchmarks.
✅ Docker Environment.
🔄 More real-world tasks, advanced embedding models, and new algorithms.
🔄 Visualization Interface.

📑 Pipeline

Dataset Selection -- Embedding Generation -- Benchmark Evaluation

📝 Results

Iceberg LeaderBoard 1.0

Task-centric performance versus two similarity metrics

Query Performance on Synthetic Recall@100
