
CANDOR-Bench: Benchmarking In-Memory Continuous ANNS under Dynamic Open-World Streams

CANDOR-Bench (Continuous Approximate Nearest neighbor search under Dynamic Open-woRld Streams) is a benchmarking framework designed to evaluate in-memory ANNS algorithms under realistic, dynamic data stream conditions.

Table of Contents

  • Project Structure
  • Datasets and Algorithms
  • Quick Start Guide
  • Usage

Project Structure

CANDOR-Benchmark/
├── benchmark/             
├── big-ann-benchmarks/             # Core benchmarking framework (Dynamic Open-World conditions)
│   ├── benchmark/
│   │   ├── algorithms/
│   │   ├── concurrent/             # Concurrent Track
│   │   ├── congestion/             # Congestion Track
│   │   ├── main.py
│   │   ├── runner.py
│   │   └── ……
│   ├── create_dataset.py
│   ├── requirements_py3.10.txt
│   ├── logging.conf
│   ├── neurips21/
│   ├── neurips23/                  # NeurIPS'23 benchmark configurations and scripts
│   │   ├── concurrent/             # Concurrent Track
│   │   ├── congestion/             # Congestion Track
│   │   ├── filter/
│   │   ├── ood/
│   │   ├── runbooks/               # Dynamic benchmark scenario definitions (e.g., T1, T3, etc.)
│   │   ├── sparse/
│   │   ├── streaming/              
│   │   └── ……
│   └──……
├── DiskANN/                        # Integrated DiskANN-based algorithms
├── GTI/                            # Integrated GTI algorithm source
├── IP-DiskANN/                     # Integrated IP-DiskANN algorithm source
├── src/                            # Main algorithm implementations
├── include/                        # C++ header files
├── thirdparty/                     # External dependencies
├── Dockerfile                      # Docker build recipe
├── requirements.txt
├── setup.py                        # Python package setup
└── ……

Datasets and Algorithms

Our evaluation involves the following datasets and algorithms.

Summary of Datasets

| Category | Name | Description | Dimension | Data Size | Query Size |
|----------|------|-------------|-----------|-----------|------------|
| Real-world | SIFT | Image | 128 | 1M | 10K |
| | OpenImagesStreaming | Image | 512 | 1M | 10K |
| | Sun | Image | 512 | 79K | 200 |
| | SIFT100M | Image | 128 | 100M | 10K |
| | Trevi | Image | 4096 | 100K | 200 |
| | Msong | Audio | 420 | 990K | 200 |
| | COCO | Multi-Modal | 768 | 100K | 500 |
| | Glove | Text | 100 | 1.192M | 200 |
| | MSTuring | Text | 100 | 30M | 10K |
| Synthetic | Gaussian | i.i.d. values | Adjustable | 500K | 1000 |
| | Blob | Gaussian Blobs | 768 | 500K | 1000 |
| | WTE | Text | 768 | 100K | 100 |
| | Freeway | ML Constructed | 128 | 100K | 1K |

Summary of Algorithms

| Category | Algorithm Name | Description |
|----------|----------------|-------------|
| Tree-based | SPTAG | Space-partitioning tree structure for efficient data segmentation. |
| LSH-based | LSH | Data-independent hashing to reduce dimensionality and approximate nearest neighbors. |
| | LSHAPG | LSH-driven optimization using an LSB-Tree to differentiate graph regions. |
| Clustering-based | PQ | Product quantization for efficient clustering into compact subspaces. |
| | IVFPQ | Inverted index with product quantization for hierarchical clustering. |
| | OnlinePQ | Incremental updates of centroids in product quantization for streaming data. |
| | Puck | Non-orthogonal inverted indexes with multiple quantization, optimized for large-scale datasets. |
| | SCANN | Small-bit quantization to improve register utilization. |
| Graph-based | NSW | Navigable Small World graph for fast nearest neighbor search. |
| | HNSW | Hierarchical Navigable Small World graph for scalable search. |
| | FreshDiskANN | Streaming graph construction for large-scale proximity-based search with refined, robust edge pruning. |
| | MNRU | Enhances HNSW with efficient updates to prevent unreachable points in dynamic environments. |
| | Cufe | Enhances FreshDiskANN with batched neighbor expansion. |
| | Pyanns | Enhances FreshDiskANN with fixed-size huge pages for optimized memory access. |
| | IPDiskANN | Enables efficient in-place deletions for FreshDiskANN, improving update performance without reconstruction. |
| | GTI | Hybrid tree-graph indexing for efficient, dynamic high-dimensional search, with optimized updates and construction. |
| | ParlayHNSW | Parallel, deterministic HNSW for improved scalability and performance. |
| | ParlayVamana | Parallel, deterministic FreshDiskANN implementation using Vamana for graph construction, with improved performance. |

Quick Start Guide


🚨🚨 Strong Recommendation: Use Docker! 🚨🚨

We strongly recommend using Docker to build and run this project.

There are many algorithm libraries with complex dependencies. Setting up the environment locally can be difficult and error-prone. Docker provides a consistent and reproducible environment, saving you time and avoiding compatibility issues.

Note: Building the Docker image may take 10–20 minutes depending on your network and hardware.


Build With Docker

To build the project using Docker, simply use the provided Dockerfile located in the root directory. This ensures a consistent and reproducible environment for all dependencies and build steps.

  1. To initialize and update all submodules in the project, run:
git submodule update --init --recursive
  2. Build the Docker image with:
docker build -t <your-image-name> .
  3. Once the image is built, run a container from it:
docker run -it <your-image-name>
  4. After entering the container, navigate to the project directory:
cd /app/big-ann-benchmarks
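
Optionally, you can mount a host directory over the results folder when starting the container so that benchmark outputs persist after the container exits. This is a minimal sketch: the host path is illustrative, and it assumes results are written to results/ under /app/big-ann-benchmarks, as in the export step later in this guide.

# Optional: bind-mount a host directory (illustrative path) over the in-container results folder
docker run -it -v "$PWD/candor-results:/app/big-ann-benchmarks/results" <your-image-name>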

Usage

All the following operations are performed in the root directory of big-ann-benchmarks.

2.1 Preparing a Dataset

Create a small sample dataset. For example, to create a dataset of 10,000 random 20-dimensional floating-point vectors, run:

python create_dataset.py --dataset random-xs

To see a complete list of datasets, run the following:

python create_dataset.py --help

2.2 Running Algorithms on the Congestion Track

To evaluate an algorithm under the congestion track, use the following command:

python3 run.py \
  --neurips23track congestion \
  --algorithm "$ALGO" \
  --nodocker \
  --rebuild \
  --runbook_path "$PATH" \
  --dataset "$DS"
  • --neurips23track congestion: Select the congestion benchmark track.
  • --algorithm "$ALGO": Name of the algorithm to evaluate.
  • --dataset "$DS": Name of the dataset to use.
  • --runbook_path "$PATH": Path to the runbook file describing the test scenario.
  • --rebuild: Rebuild the target before running.
  • --nodocker: Run the algorithm directly in the current environment rather than inside a separate Docker container.
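
For example, a concrete invocation could look like the sketch below. The algorithm name and runbook file are hypothetical placeholders; substitute ones that actually exist under neurips23/congestion/ and neurips23/runbooks/ in your checkout. random-xs is the sample dataset created in step 2.1.

# Hypothetical example: replace the algorithm and runbook with ones present in your checkout
python3 run.py \
  --neurips23track congestion \
  --algorithm hnsw \
  --nodocker \
  --rebuild \
  --runbook_path neurips23/runbooks/example_runbook.yaml \
  --dataset random-xs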

2.3 Computing Ground Truth for Runbooks

To compute ground truth for a runbook:

  1. Clone and build the DiskANN repository (a minimal build sketch follows this list).
  2. Use the provided script to compute ground truth at various checkpoints:
python3 benchmark/congestion/compute_gt.py \
  --runbook "$PATH_TO_RUNBOOK" \
  --dataset "$DATASET_NAME" \
  --gt_cmdline_tool ~/DiskANN/build/apps/utils/compute_groundtruth
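
Step 1 is not scripted in this repository. The commands below are a minimal sketch of a typical DiskANN checkout and CMake build, assuming the usual DiskANN prerequisites (a C++ compiler, CMake, Boost, and MKL/OpenMP) are installed; consult the DiskANN README if your setup differs.

# Minimal sketch of a DiskANN build (prerequisites assumed to be installed)
git clone https://github.com/microsoft/DiskANN.git ~/DiskANN
cd ~/DiskANN && mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make -j
# On success, the ground-truth tool should be at ~/DiskANN/build/apps/utils/compute_groundtruth,
# matching the --gt_cmdline_tool path used above.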

2.4 Exporting Results

  1. To make the results available for post-processing, change the permissions of the results folder:
sudo chmod -R 777 results/
  2. The following command will summarize all result files into a single CSV file:
python data_export.py --out "$OUT" --track congestion

The --out path "$OUT" should be adjusted according to the testing scenario. Common values include:

  • gen
  • batch
  • event
  • conceptDrift
  • randomContamination
  • randomDrop
  • wordContamination
  • bulkDeletion
  • batchDeletion
  • multiModal
  • ……
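
For example, assuming the scenario label is passed directly as the --out value, summarizing the concept-drift runs would look like:

python data_export.py --out conceptDrift --track congestion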

