Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search with Task-Centric Benchmarks
Iceberg is a comprehensive benchmark suite for end-to-end evaluation of VSS (Vector Similarity Search) methods in realistic application settings. It spans 7 diverse datasets across key domains including image classification, face recognition, text retrieval, and recommendation systems. Each dataset contains 1M to 100M vectors enriched with task-specific labels and metrics, enabling evaluation of retrieval algorithms within full application pipelines, not just in isolated recall-speed scenarios. Iceberg benchmarks 13 state-of-the-art VSS algorithms and re-ranks them using task-centric performance metrics, uncovering substantial deviations from conventional recall/speed-based rankings. Moreover, Iceberg proposes an interpretable decision tree to guide practitioners in selecting and tuning VSS methods for specific workloads.
The dataset has been publicly released and is maintained on the Hugging Face platform.
Access Link: Iceberg-dataset
| Dataset | Base Size | Dim | Query Size | Domain | Origin data source |
|---|---|---|---|---|---|
| ImageNet-DINOv2 | 1,281,167 | 768 | 50,000 | Image Classification | https://image-net.org/index.php |
| ImageNet-EVA02 | 1,281,167 | 1024 | 50,000 | Image Classification | https://image-net.org/index.php |
| ImageNet-ConvNeXt | 1,281,167 | 1536 | 50,000 | Image Classification | https://image-net.org/index.php |
| Glint360K-IR101 | 17,091,649 | 512 | 20,000 | Face Recognition | https://github.com/deepinsight/insightface/tree/master/recognition/partial_fc#glint360k |
| Glint360K-ViT | 17,091,649 | 512 | 20,000 | Face Recognition | https://github.com/deepinsight/insightface/tree/master/recognition/partial_fc#glint360k |
| BookCorpus | 9,250,529 | 1024 | 10,000 | Text Retrieval | https://huggingface.co/datasets/bookcorpus/bookcorpus |
| Commerce | 99,085,171 | 48 | 64,111 | Recommendation | |
ImageNet is a large-scale dataset containing millions of high-resolution images spanning thousands of object categories. Each image is annotated with ground-truth labels, either manually or semi-automatically. The dataset has been widely used in the computer vision community for model training and benchmarking, particularly for image classification tasks.
Embedding Models:
- DINOv2: https://huggingface.co/facebook/dinov2-base
- EVA02: https://huggingface.co/timm/eva02_large_patch14_448.mim_m38m_ft_in22k_in1k
- ConvNeXt: https://huggingface.co/timm/convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_384
End Tasks:
- Label Recall@K: It measures how many correct task-specific labels appear in the top-K retrieved results.
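
The exact formula for Label Recall@K is not spelled out here, so the following is only a minimal sketch under one plausible reading: for each query, count the fraction of its top-K retrieved neighbors whose label equals the query's ground-truth label, then average over queries. The function name `label_recall_at_k` and its inputs are hypothetical and not part of the Iceberg code base.

```python
def label_recall_at_k(retrieved_labels, query_labels, k):
    """Mean fraction of top-k neighbors whose label matches the query's label.

    retrieved_labels: one list of neighbor labels per query, best match first.
    query_labels: the ground-truth label of each query.
    Assumption: this normalisation (hits / k) is illustrative only.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    total = 0.0
    for labels, truth in zip(retrieved_labels, query_labels):
        hits = sum(1 for label in labels[:k] if label == truth)
        total += hits / k
    return total / len(query_labels)


# Toy example: two queries with four retrieved neighbors each.
retrieved = [["cat", "cat", "dog", "cat"], ["car", "bus", "bus", "bus"]]
truth = ["cat", "bus"]
score = label_recall_at_k(retrieved, truth, 2)
```

Because the score depends on labels rather than on the identity of the exact nearest neighbors, two indexes with the same conventional recall can differ on this metric.
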
Glint360K is a large-scale face dataset created by merging and cleaning multiple public face datasets to significantly expand both the number of identities and facial images.
Embedding Models:
- Resnet-IR101: https://huggingface.co/minchul/cvlface_arcface_ir101_webface4m
- ViT: https://huggingface.co/gaunernst/vit_tiny_patch8_112.arcface_ms1mv3
End Tasks:
- Label Recall@K: It measures how many correct task-specific labels appear in the top-K retrieved results.
BookCorpus consists of text extracted from approximately 19,000 books spanning various domains and has been curated into a high-quality corpus. The text was segmented at the paragraph level, with each paragraph concatenated into chunks containing eight sentences. This preprocessing resulted in a base dataset of 9,250,529 paragraphs. From this corpus, 10,000 paragraphs were randomly sampled to construct the query set. The unique ID of each paragraph was used as the label for its corresponding embedding vector.
Embedding Models:
End Tasks:
- Hit@K: It measures whether the most semantically relevant paragraph is included in the top-K retrieved results.
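
As a rough illustration of Hit@K, the sketch below assumes each query has exactly one target paragraph ID and reports the share of queries whose target appears among the top-K retrieved IDs. The name `hit_at_k` and the toy IDs are hypothetical; the benchmark's own implementation may differ.

```python
def hit_at_k(retrieved_ids, target_ids, k):
    """Fraction of queries whose target ID appears in the top-k results.

    retrieved_ids: one list of retrieved paragraph IDs per query, best first.
    target_ids: the single most relevant paragraph ID for each query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(
        1 for ids, target in zip(retrieved_ids, target_ids) if target in ids[:k]
    )
    return hits / len(target_ids)


# Toy example: three queries with three retrieved IDs each.
retrieved_ids = [[7, 3, 9], [4, 8, 1], [2, 5, 6]]
target_ids = [3, 1, 0]
hit_2 = hit_at_k(retrieved_ids, target_ids, 2)  # only the first query hits
hit_3 = hit_at_k(retrieved_ids, target_ids, 3)  # the first two queries hit
```
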
The Commerce dataset, derived from anonymized traffic logs of a major e-commerce platform, serves as a representative benchmark for large-scale e-commerce systems. Collected over several months, the dataset comprises 99,085,171 records of frequently purchased grocery items. In addition, a query set of 64,111 entries was constructed to represent user profiles and associated search keywords. Each query is linked to a sequence of high-popularity items, enabling evaluation on downstream recommendation tasks. Item IDs are used as labels throughout the dataset.
Embedding Models:
End Tasks:
- Matching Score@K: It measures whether the vectors retrieved by a query are both relevant and popular, as well as the cumulative popularity of those items.
| Algorithm | Metric | Type | Original Code Link |
|---|---|---|---|
| Fargo | Inner Product | Partition-based | https://github.com/Jacyhust/FARGO_VLDB23 |
| ScaNN | Inner Product | Partition-based | https://github.com/google-research/google-research/tree/master/scann |
| ip-NSW | Inner Product | Graph-based | https://github.com/stanis-morozov/ip-nsw |
| ip-NSW+ | Inner Product | Graph-based | https://github.com/jerry-liujie/ip-nsw/tree/GraphMIPS |
| Mobius | Inner Product | Graph-based | Our own implementation |
| NAPG | Inner Product | Graph-based | Our own implementation |
| MAG | Inner Product | Graph-based | https://github.com/ZJU-DAILY/MAG |
| RaBitQ | Euclidean Distance | Partition-based | https://github.com/VectorDB-NTU/RaBitQ-Library |
| IVFPQ | Euclidean Distance | Partition-based | https://github.com/facebookresearch/faiss |
| DB-LSH | Euclidean Distance | Partition-based | https://github.com/Jacyhust/DB-LSH |
| HNSW | Euclidean Distance | Graph-based | https://github.com/nmslib/hnswlib |
| NSG | Euclidean Distance | Graph-based | https://github.com/ZJULearning/nsg |
| Vamana | Euclidean Distance | Graph-based | https://github.com/microsoft/DiskANN |
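
The Metric column matters because the two similarity measures can rank the same vectors differently. The brute-force sketch below, written in plain Python with made-up two-dimensional vectors, shows a case where the top-1 result under inner product differs from the top-1 under Euclidean distance. The helper `top_k` is hypothetical and is not how any of the listed algorithms search.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, base, k, metric):
    """Exhaustively rank base vectors and return the indices of the best k."""
    if metric == "ip":
        # Larger inner product means more similar, so sort descending.
        key = lambda i: -inner_product(query, base[i])
    elif metric == "l2":
        # Smaller distance means more similar, so sort ascending.
        key = lambda i: euclidean(query, base[i])
    else:
        raise ValueError(f"unknown metric: {metric}")
    return sorted(range(len(base)), key=key)[:k]


base = [[1.0, 0.0], [3.0, 0.5], [0.9, 0.1]]
query = [1.0, 0.0]
best_ip = top_k(query, base, 1, "ip")  # favors the large-norm vector
best_l2 = top_k(query, base, 1, "l2")  # favors the vector identical to the query
```

Inner product rewards vectors with a large norm, which is one reason the table separates methods designed for inner product search from those designed for Euclidean distance.
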
- git clone project
- Python 3.10+; docker; pyyaml
- Run `pip install -r requirements.txt`.
Example: We use HNSW for the ImageNet dataset as an example to run the benchmark.
- Configure the dataset (`config/dataset.yaml`):

      imagenet1k_avg:
        dataset_type: imagenet
        data_pre: imagenet-1k
        train_name: convnext-avg-pool-train.bin
        test_name: convnext-avg-pool-validation.bin
        train_path: /workspace/data/imagenet-1k/convnext-avg-pool-train.bin
        test_path: /workspace/data/imagenet-1k/convnext-avg-pool-validation.bin
        prefix: convnext-avg-pool
        data_dim: 1536
        k: 100
        data_num: 1281167
        query_num: 50000
- Configure the algorithm (`config/algorithm.yaml`):

      hnsw:
        efc: 256
        M: 32
        efs: [100, 200, 300, 400, 500, 600, 800, 1000, 1500]
        type: nn

  Configuration parameters:
  - `efc`: build parameter for HNSW
  - `M`: build parameter for HNSW
  - `efs`: search parameter for HNSW
  - `type`: distance metric type
- Run the algorithm and evaluation:
  - Configure the dataset and algorithm parameters in `config/dataset.yaml` and `config/algorithm.yaml`.
  - Run the algorithm using: `python3 run.py hnsw imagenet1k_dinov2 --mode build/search`
  - For more configuration options, refer to: `python run.py --help`
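
In the `hnsw` entry above, `efs` is a list while the build parameters are scalars, which suggests one index build followed by one search pass per `efs` value. The sketch below illustrates that reading only; the dictionary mirrors the YAML by hand, and `expand_search_runs` is a hypothetical helper, not a function from `run.py`.

```python
# Hand-written mirror of the hnsw entry in config/algorithm.yaml.
algorithm_config = {
    "hnsw": {
        "efc": 256,
        "M": 32,
        "efs": [100, 200, 300, 400, 500, 600, 800, 1000, 1500],
        "type": "nn",
    }
}


def expand_search_runs(name, config):
    """Return one (build_params, efs) pair per search setting.

    Assumption: the index is built once with (efc, M) and then queried
    once for every efs value to trace out a quality/speed curve.
    """
    params = config[name]
    build_params = {"efc": params["efc"], "M": params["M"]}
    return [(build_params, efs) for efs in params["efs"]]


runs = expand_search_runs("hnsw", algorithm_config)
for build_params, efs in runs:
    print(f"build={build_params} search efs={efs}")
```
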
- Open-source code is available for the benchmarks.
- Docker environment.
- More real-world tasks, advanced embedding models, and new algorithms.
- Visualization interface.











