DiskANN-Py

DiskANN-Py is a simplified Python implementation of DiskANN, designed to handle large-scale Approximate Nearest Neighbor (ANN) search efficiently using graph-based algorithms and SSD storage.

Project Structure

```
diskann_py/
│
├── main.py                 # Entry point to run the entire DiskANN system
├── graph_construction/
│   ├── __init__.py         # Makes this directory a Python package
│   ├── graph.py            # Core graph data structure and helper functions
│   ├── greedy_search.py    # Implementation of the GreedySearch algorithm
│   ├── robust_prune.py     # Implementation of the RobustPrune algorithm
│   ├── vamana.py           # Implementation of the Vamana graph construction algorithm
│
├── disk_index/
│   ├── __init__.py         # Makes this directory a Python package
│   ├── diskann_index.py    # DiskANN index construction (partitioning, merging)
│   ├── beam_search.py      # BeamSearch implementation for querying SSD-based index
│   ├── pq_compression.py   # Product Quantization (PQ) compression and storage utilities
│
├── utils/
│   ├── __init__.py         # Makes this directory a Python package
│   ├── metrics.py          # Distance metrics (e.g., Euclidean distance)
│   ├── dataset.py          # Dataset loading and preprocessing utilities
│   ├── clustering.py       # K-means clustering for overlapping partitions
│   ├── caching.py          # Caching frequently visited nodes
│
├── tests/
│   ├── test_graph.py       # Unit tests for graph construction algorithms
│   ├── test_disk_index.py  # Unit tests for DiskANN index construction
│   ├── test_search.py      # Unit tests for BeamSearch and query performance
│
├── README.md               # Project overview and instructions
└── requirements.txt        # Required Python libraries
```

Installation

Python Version: Ensure you have Python 3.7 or newer installed.
Install Dependencies: Run the following command in the root of the project to install all required libraries:
```
pip install -r requirements.txt
```

Download and Prepare Dataset

To process datasets (e.g., siftsmall, sift1m) for use with DiskANN-Py:

Download Dataset

Linux/MacOS

```
mkdir -p ./data
cd ./data
wget ftp://ftp.irisa.fr/local/texmex/corpus/siftsmall.tar.gz
tar -zxvf siftsmall.tar.gz
```

Windows

```
mkdir ./data
cd ./data
Invoke-WebRequest -Uri "ftp://ftp.irisa.fr/local/texmex/corpus/siftsmall.tar.gz" -OutFile "siftsmall.tar.gz"
tar -zxvf siftsmall.tar.gz
```

Process Dataset

The download steps above leave you inside ./data, so return to the project root first. Then use the dataset.py utility to preprocess the dataset:

```
python utils/dataset.py --main_dir ./data --dataset_name siftsmall
```

This will create a processed/ subdirectory inside the dataset folder:

```
./data/siftsmall/processed/
    base.npy
    query.npy
    learn.npy
    groundtruth.npy
```
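
For orientation, the TEXMEX archives ship vectors as .fvecs (float32) and .ivecs (int32) files, where every record is a 4-byte little-endian dimension followed by that many 4-byte components. The sketch below shows one way such a file could be turned into an array like the .npy files listed above. `read_vecs` is a hypothetical helper written for this README; it is not necessarily how utils/dataset.py does the conversion.

```python
import numpy as np


def read_vecs(path, dtype="<f4"):
    """Read a TEXMEX .fvecs (dtype='<f4') or .ivecs (dtype='<i4') file.

    Hypothetical helper: assumes every record is [dim, c_1, ..., c_dim]
    with 4-byte little-endian fields and a constant dim across the file.
    """
    raw = np.fromfile(path, dtype="<i4")
    if raw.size == 0:
        return np.empty((0, 0), dtype=dtype)
    dim = int(raw[0])                    # first field is the vector dimension
    rows = raw.reshape(-1, dim + 1)      # one row per record
    if not np.all(rows[:, 0] == dim):
        raise ValueError(f"inconsistent vector dimensions in {path}")
    # Drop the dimension column, then reinterpret the payload bytes.
    return rows[:, 1:].copy().view(dtype)
```

With the siftsmall archive this would be used along the lines of `np.save("base.npy", read_vecs("siftsmall_base.fvecs"))`, and with `dtype="<i4"` for the ground-truth .ivecs file.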

Usage

Running the System

The entry point for running the full DiskANN pipeline is main.py. Customize it based on your dataset and configuration.

Example:
```
python main.py --dataset ./data/siftsmall/processed/ --index_file ./index/siftsmall_index
```
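
As a starting point for customizing main.py, the two flags in the command above could be parsed as in the sketch below. Only `--dataset` and `--index_file` come from this README; the function name, help text, and the choice to make both flags required are assumptions, and the real main.py may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the two command-line flags shown in the example command (sketch)."""
    parser = argparse.ArgumentParser(description="Run the DiskANN-Py pipeline")
    parser.add_argument(
        "--dataset", required=True,
        help="directory holding the processed .npy files",
    )
    parser.add_argument(
        "--index_file", required=True,
        help="path of the index to build or load",
    )
    return parser.parse_args(argv)
```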

Graph Construction

Customize algorithms like Vamana and RobustPrune in the graph_construction/ package to build the graph index.
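
To make the building blocks concrete, here is a minimal in-memory sketch of GreedySearch and RobustPrune as they are described in the DiskANN paper. The function names, the dict-of-lists graph representation, and the parameter names are assumptions chosen for illustration; they are not the API of the graph_construction/ package.

```python
import numpy as np


def dist(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))


def greedy_search(graph, points, start, query, k, list_size):
    """Walk the graph toward `query`; return (k closest found, visited set).

    graph: dict mapping node id -> list of out-neighbour ids (assumed layout).
    """
    def by_query(i):
        return dist(points[i], query)

    candidates, visited = {start}, set()
    while candidates - visited:
        best = min(candidates - visited, key=by_query)
        visited.add(best)
        candidates.update(graph[best])
        if len(candidates) > list_size:
            # Keep only the list_size candidates closest to the query.
            candidates = set(sorted(candidates, key=by_query)[:list_size])
    return sorted(candidates, key=by_query)[:k], visited


def robust_prune(graph, points, p, candidates, alpha, max_degree):
    """Replace p's out-neighbours with a diverse subset of the candidates."""
    pool = (set(candidates) | set(graph[p])) - {p}
    graph[p] = []
    while pool:
        best = min(pool, key=lambda i: dist(points[p], points[i]))
        graph[p].append(best)
        if len(graph[p]) == max_degree:
            break
        # Drop every candidate that `best` already covers (alpha-scaled).
        pool = {
            i for i in pool
            if alpha * dist(points[best], points[i]) > dist(points[p], points[i])
        }
```

In the paper's Vamana construction, each point is inserted by running the search from a fixed entry point and then pruning the visited set into that point's neighbour list. Values of alpha above 1 keep more long-range edges.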

Search on Disk

Use the disk_index/ package to build and query the DiskANN index with algorithms like BeamSearch and PQ Compression.
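
The idea behind PQ compression can be sketched in a few lines of NumPy: split each vector into equal-sized subvectors, learn a small k-means codebook per subspace, store one byte per subvector, and answer queries from a per-query lookup table. Everything below (function names, the plain Lloyd iterations, the parameter defaults) is an illustrative assumption and not the code in disk_index/pq_compression.py.

```python
import numpy as np


def train_codebooks(data, n_subvectors, n_centroids, n_iters=10, seed=0):
    """Learn one k-means codebook per subspace; returns shape (M, K, sub_dim)."""
    n, dim = data.shape
    assert dim % n_subvectors == 0 and n_centroids <= 256
    sub_dim = dim // n_subvectors
    rng = np.random.default_rng(seed)
    codebooks = np.empty((n_subvectors, n_centroids, sub_dim), dtype=data.dtype)
    for m in range(n_subvectors):
        chunk = data[:, m * sub_dim:(m + 1) * sub_dim]
        centroids = chunk[rng.choice(n, n_centroids, replace=False)].copy()
        for _ in range(n_iters):
            d2 = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(1)
            for c in range(n_centroids):
                members = chunk[assign == c]
                if len(members):            # leave empty clusters unchanged
                    centroids[c] = members.mean(0)
        codebooks[m] = centroids
    return codebooks


def encode(data, codebooks):
    """Map each vector to M one-byte centroid ids."""
    n_subvectors, _, sub_dim = codebooks.shape
    codes = np.empty((data.shape[0], n_subvectors), dtype=np.uint8)
    for m in range(n_subvectors):
        chunk = data[:, m * sub_dim:(m + 1) * sub_dim]
        d2 = ((chunk[:, None, :] - codebooks[m][None]) ** 2).sum(-1)
        codes[:, m] = d2.argmin(1)
    return codes


def asymmetric_distances(query, codes, codebooks):
    """Approximate squared distances from an uncompressed query to all codes."""
    n_subvectors, _, sub_dim = codebooks.shape
    # table[m, c] = squared distance from the query's m-th subvector to centroid c
    table = ((query.reshape(n_subvectors, 1, sub_dim) - codebooks) ** 2).sum(-1)
    return table[np.arange(n_subvectors), codes].sum(1)
```

For 128-dimensional SIFT vectors and 16 subvectors, each vector shrinks from 512 bytes of float32 to 16 bytes of codes, which is what lets the compressed vectors stay in memory while full-precision vectors live on disk.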

Utilities

The utils/ package includes helper tools for dataset preprocessing, metrics computation, clustering, and caching.
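
As an illustration of the kind of helpers involved, the sketch below computes exact squared Euclidean distances for a batch of queries and scores a result list against ground truth with recall@k. Both function names are hypothetical and are not taken from utils/metrics.py.

```python
import numpy as np


def pairwise_sq_euclidean(queries, base):
    """Squared Euclidean distance between every query row and every base row."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, clipped to guard against
    # tiny negative values caused by floating-point rounding.
    q2 = (queries ** 2).sum(1)[:, None]
    b2 = (base ** 2).sum(1)[None, :]
    return np.maximum(q2 - 2.0 * queries @ base.T + b2, 0.0)


def recall_at_k(found, groundtruth, k):
    """Fraction of the true top-k neighbours present in the returned top-k."""
    hits = sum(
        len(set(f[:k]) & set(g[:k])) for f, g in zip(found, groundtruth)
    )
    return hits / (k * len(found))
```

An exact top-k baseline can then be obtained with `np.argsort(pairwise_sq_euclidean(query, base), axis=1)[:, :k]` and compared against the index's results.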

Testing

To run unit tests for various components of the project:

```
pytest tests/
```

This will execute tests for:

- Graph construction algorithms
- Disk-based index construction
- Querying and search algorithms

Summary of Commands

Install Requirements

```
pip install -r requirements.txt
```

Download and Extract Dataset

Linux/MacOS:

```
mkdir -p ./data
cd ./data
wget ftp://ftp.irisa.fr/local/texmex/corpus/siftsmall.tar.gz
tar -zxvf siftsmall.tar.gz
```

Windows:

```
mkdir ./data
cd ./data
Invoke-WebRequest -Uri "ftp://ftp.irisa.fr/local/texmex/corpus/siftsmall.tar.gz" -OutFile "siftsmall.tar.gz"
tar -zxvf siftsmall.tar.gz
```

Process Dataset

```
python utils/dataset.py --main_dir ./data --dataset_name siftsmall
```

Key Features

Graph Construction:
- Implements state-of-the-art algorithms like Vamana and RobustPrune for ANN graph construction.
Disk-Based Indexing:
- Efficiently builds and queries ANN indices stored on SSDs.
- Includes BeamSearch and Product Quantization (PQ) for memory-efficient queries.
Utilities:
- Dataset preprocessing, clustering, caching, and distance metrics.

How to Run the Tests

To ensure that the components of the project (e.g., graph construction, search algorithms, and disk-based indexing) work correctly, unit tests are provided in the tests/ directory. Follow these steps to run the tests:

1. Install Dependencies

Before running the tests, make sure you have all the required Python libraries installed. Use the requirements.txt file to install them:

```
pip install -r requirements.txt
```

2. Run All Tests

You can run all the tests in the tests/ directory using pytest:

```
pytest tests/
```

This will execute all unit tests in the project and display the results.

3. Run Specific Test Files

If you want to run tests for a specific module, you can specify the test file. For example:

To test graph construction algorithms (e.g., Vamana):
```
pytest tests/test_graph.py
```
To test search algorithms (e.g., BeamSearch):
```
pytest tests/test_search.py
```
To test DiskANN index construction:
```
pytest tests/test_disk_index.py
```

4. View Detailed Test Output

To list each test by name with its pass/fail result and get more detailed assertion output, use the -v flag (add -s as well if you want to see print output, which pytest captures by default):

```
pytest -v tests/
```

5. Debugging with a Single Test Function

If you need to run a specific test function for debugging, use the -k flag with the test's name. For example:

```
pytest -k "test_vamana_graph_construction" -v
```

Testing Framework

The tests are built using pytest, a simple and powerful testing framework for Python. If you don’t have it installed, you can install it with:

```
pip install pytest
```

Contributing

Contributions are welcome! If you'd like to add features or fix issues, please fork the repository, make changes, and submit a pull request.

License

This project is licensed under the Apache License. See the LICENSE file for details.
