
CANDOR-Bench: Benchmarking In-Memory Continuous ANNS under Dynamic Open-World Streams

CANDOR-Bench (Continuous Approximate Nearest neighbor search under Dynamic Open-woRld Streams) is a benchmarking framework designed to evaluate in-memory ANNS algorithms under realistic, dynamic data stream conditions.

Table of Contents

  • Project Structure
  • Datasets and Algorithms
  • Quick Start Guide
  • Usage

Project Structure

CANDOR-Benchmark/
├── benchmark/             
├── big-ann-benchmarks/             # Core benchmarking framework (Dynamic Open-World conditions)
│   ├── benchmark/
│   │   ├── algorithms/
│   │   ├── concurrent/             # Concurrent Track
│   │   ├── congestion/             # Congestion Track
│   │   ├── main.py
│   │   ├── runner.py
│   │   └── ……
│   ├── create_dataset.py
│   ├── requirements_py3.10.txt
│   ├── logging.conf
│   ├── neurips21/
│   ├── neurips23/                  # NeurIPS'23 benchmark configurations and scripts
│   │   ├── concurrent/             # Concurrent Track
│   │   ├── congestion/             # Congestion Track
│   │   ├── filter/
│   │   ├── ood/
│   │   ├── runbooks/               # Dynamic benchmark scenario definitions (e.g., T1, T3, etc.)
│   │   ├── sparse/
│   │   ├── streaming/              
│   │   └── ……
│   └──……
├── DiskANN/                        # Integrated DiskANN-based algorithms
├── GTI/                            # Integrated GTI algorithm source
├── IP-DiskANN/                     # Integrated IP-DiskANN algorithm source
├── src/                            # Main algorithm implementations
├── include/                        # C++ header files
├── thirdparty/                     # External dependencies
├── Dockerfile                      # Docker build recipe
├── requirements.txt
├── setup.py                        # Python package setup
└── ……

Datasets and Algorithms

Our evaluation involves the following datasets and algorithms.

Summary of Datasets

| Category | Name | Description | Dimension | Data Size | Query Size |
|----------|------|-------------|-----------|-----------|------------|
| Real-world | SIFT | Image | 128 | 1M | 10K |
| | OpenImagesStreaming | Image | 512 | 1M | 10K |
| | Sun | Image | 512 | 79K | 200 |
| | SIFT100M | Image | 128 | 100M | 10K |
| | Trevi | Image | 4096 | 100K | 200 |
| | Msong | Audio | 420 | 990K | 200 |
| | COCO | Multi-Modal | 768 | 100K | 500 |
| | Glove | Text | 100 | 1.192M | 200 |
| | MSTuring | Text | 100 | 30M | 10K |
| Synthetic | Gaussian | i.i.d. values | Adjustable | 500K | 1000 |
| | Blob | Gaussian Blobs | 768 | 500K | 1000 |
| | WTE | Text | 768 | 100K | 100 |
| | Freeway | ML Constructed | 128 | 100K | 1K |

Summary of Algorithms

| Category | Algorithm Name | Description |
|----------|----------------|-------------|
| Tree-based | SPTAG | Space-partitioning tree structure for efficient data segmentation. |
| LSH-based | LSH | Data-independent hashing to reduce dimensionality and approximate nearest neighbors. |
| | LSHAPG | LSH-driven optimization using an LSB-Tree to differentiate graph regions. |
| Clustering-based | PQ | Product quantization for efficient clustering into compact subspaces. |
| | IVFPQ | Inverted index with product quantization for hierarchical clustering. |
| | OnlinePQ | Incremental updates of centroids in product quantization for streaming data. |
| | Puck | Non-orthogonal inverted indexes with multiple quantization, optimized for large-scale datasets. |
| | SCANN | Small-bit quantization to improve register utilization. |
| Graph-based | NSW | Navigable Small World graph for fast nearest neighbor search. |
| | HNSW | Hierarchical Navigable Small World graph for scalable search. |
| | FreshDiskANN | Streaming graph construction for large-scale proximity-based search with refined, robust edge pruning. |
| | MNRU | Enhances HNSW with efficient updates to prevent unreachable points in dynamic environments. |
| | Cufe | Enhances FreshDiskANN with batched neighbor expansion. |
| | Pyanns | Enhances FreshDiskANN with fixed-size huge pages for optimized memory access. |
| | IPDiskANN | Enables efficient in-place deletions for FreshDiskANN, improving update performance without reconstruction. |
| | GTI | Hybrid tree-graph indexing for efficient, dynamic high-dimensional search, with optimized updates and construction. |
| | ParlayHNSW | Parallel, deterministic HNSW for improved scalability and performance. |
| | ParlayVamana | Parallel, deterministic FreshDiskANN implementation using Vamana for graph construction, with improved performance. |

Quick Start Guide


🚨🚨 Strong Recommendation: Use Docker! 🚨🚨

We strongly recommend using Docker to build and run this project.

There are many algorithm libraries with complex dependencies. Setting up the environment locally can be difficult and error-prone. Docker provides a consistent and reproducible environment, saving you time and avoiding compatibility issues.

Note: Building the Docker image may take 10–20 minutes depending on your network and hardware.


Build With Docker

To build the project using Docker, simply use the provided Dockerfile located in the root directory. This ensures a consistent and reproducible environment for all dependencies and build steps.

  1. To initialize and update all submodules in the project, run:
git submodule update --init --recursive
  2. Build the Docker image with:
docker build -t <your-image-name> .
  3. Once the image is built, run a container from it:
docker run -it <your-image-name>
  4. After entering the container, navigate to the project directory:
cd /app/big-ann-benchmarks
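
Optionally, you can mount a host directory over the results folder when starting the container so that benchmark outputs persist after the container exits. This is a minimal sketch: the host path is illustrative, and it assumes results are written to results/ under /app/big-ann-benchmarks, as in the export step later in this guide.

# Optional: bind-mount a host directory (illustrative path) over the in-container results folder
docker run -it -v "$PWD/candor-results:/app/big-ann-benchmarks/results" <your-image-name>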

Usage

All the following operations are performed in the root directory of big-ann-benchmarks.

2.1 Preparing a Dataset

Create a small sample dataset. For example, to create a dataset of 10,000 random 20-dimensional floating-point vectors, run:

python create_dataset.py --dataset random-xs

To see a complete list of datasets, run the following:

python create_dataset.py --help

2.2 Running Algorithms on the Congestion Track

To evaluate an algorithm under the congestion track, use the following command:

python3 run.py \
  --neurips23track congestion \
  --algorithm "$ALGO" \
  --nodocker \
  --rebuild \
  --runbook_path "$PATH" \
  --dataset "$DS"
  • --neurips23track congestion: Select the congestion benchmark track.
  • --algorithm "$ALGO": Name of the algorithm to evaluate.
  • --dataset "$DS": Name of the dataset to use.
  • --runbook_path "$PATH": Path to the runbook file describing the test scenario.
  • --rebuild: Rebuild the target before running.
  • --nodocker: Run the algorithm directly in the current environment rather than inside a separate Docker container.
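
For example, a concrete invocation could look like the sketch below. The algorithm name and runbook file are hypothetical placeholders; substitute ones that actually exist under neurips23/congestion/ and neurips23/runbooks/ in your checkout. random-xs is the sample dataset created in step 2.1.

# Hypothetical example: replace the algorithm and runbook with ones present in your checkout
python3 run.py \
  --neurips23track congestion \
  --algorithm hnsw \
  --nodocker \
  --rebuild \
  --runbook_path neurips23/runbooks/example_runbook.yaml \
  --dataset random-xs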

2.3 Computing Ground Truth for Runbooks

To compute ground truth for a runbook:

  1. Clone and build the DiskANN repository (a minimal build sketch follows this list).
  2. Use the provided script to compute ground truth at various checkpoints:
python3 benchmark/congestion/compute_gt.py \
  --runbook "$PATH_TO_RUNBOOK" \
  --dataset "$DATASET_NAME" \
  --gt_cmdline_tool ~/DiskANN/build/apps/utils/compute_groundtruth
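
Step 1 is not scripted in this repository. The commands below are a minimal sketch of a typical DiskANN checkout and CMake build, assuming the usual DiskANN prerequisites (a C++ compiler, CMake, Boost, and MKL/OpenMP) are installed; consult the DiskANN README if your setup differs.

# Minimal sketch of a DiskANN build (prerequisites assumed to be installed)
git clone https://github.com/microsoft/DiskANN.git ~/DiskANN
cd ~/DiskANN && mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make -j
# On success, the ground-truth tool should be at ~/DiskANN/build/apps/utils/compute_groundtruth,
# matching the --gt_cmdline_tool path used above.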

2.4 Exporting Results

  1. To make the results available for post-processing, change the permissions of the results folder:
sudo chmod -R 777 results/
  2. The following command will summarize all result files into a single CSV file:
python data_export.py --out "$OUT" --track congestion

The --out path "$OUT" should be adjusted according to the testing scenario. Common values include:

  • gen
  • batch
  • event
  • conceptDrift
  • randomContamination
  • randomDrop
  • wordContamination
  • bulkDeletion
  • batchDeletion
  • multiModal
  • ……
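
For example, assuming the scenario label is passed directly as the --out value, summarizing the concept-drift runs would look like:

python data_export.py --out conceptDrift --track congestion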

