This repository contains the source code for our paper: Empowering Graph-based Approximate Nearest Neighbor Search with Adaptive Awareness Capabilities.
This paper introduces GATE, a lightweight and adaptive module that enhances graph-based Approximate Nearest Neighbor Search by identifying optimal entry points in proximity graphs using a contrastive learning-based model.
-
HVS (paper): employs a hierarchical structure of multiple PQ layers to reduce computational costs.
-
HNSW (paper): a state-of-the-art navigable small world graph.
-
LSH-APG (paper): facilitates fast ANNS using a lightweight locality-sensitive hashing framework.
-
NSG (paper): an approximation of the monotonic relative neighborhood graph.
-
NSSG (paper): an extension of NSG optimized for out-of-index queries.
The data format is: Number of vector (n) * Dimension (d).
Dataset | Base Size | Dim | Query Size | Modality |
---|---|---|---|---|
Gist1M | 1,000,000 | 960 | 1,000 | Image |
Laion3M | 3,000,000 | 512 | 1,000 | Multi |
Tiny5M | 5,000,000 | 384 | 1,000 | Image |
Sift10M | 10,000,000 | 128 | 1,000 | Image |
Text2Image10M | 10,000,000 | 200 | 100,000 | Multi |
- GCC 4.9+ with OpenMP
- Python 3.9+
- CMake 2.8+
- Boost 1.55+
$ mkdir build/ && cd build/
$ cmake ..
$ make -j
- cluster: GATE python package, including subgraph sampling, query-aware sample generation, model training and inference.
- include: C++ class interface
- src: main function implementation
- test: test codes
Firstly, we need to prepare a kNN graph. You can use Faiss and other libs.
Then run the script for indexing NSG:
./tests/test_nsg_index DATA_PATH KNNG_PATH L R C NSG_PATH
DATA_PATH
is the path of the base data infvecs
format.KNNG_PATH
is the path of the pre-built kNN graph.L
controls the quality of the NSG.R
controls the index size of the graph.C
controls the maximum candidate pool size during NSG contruction.NSG_PATH
is the path of the generated NSG index.
python -m cluster.kmeans DATA_PATH HUB_PATH
python -m cluster.subgraph2json NSG_PATH DATA_PATH HUB_PATH SUBGRAPH_DIR
DATA_PATH
is the path of the base data infvecs
format.NSG_PATH
is the path of the generated NSG index.HUB_PATH
is the path of the hub nodes extraction.SUBGRAPH_DIR
is the subgraph dir
Then, we run the graph embedding model, convert subgraph to embeddings. You can use Graph2Vec and other model.
./test_nsg_search DATA_PATH QUERY_PATH NSG_PATH SEARCH_L SEARCH_K RESULT_PATH HUB_PATH HOP_PATH
python -m cluster.experiment.combine_ep_gt HOP_PATH SAMPLE_PATH
DATA_PATH
is the path of the base data infvecs
format.QUERY_PATH
is the path of the query data infvecs
format.NSG_PATH
is the path of the pre-built NSG index in previous section.SEARCH_L
controls the quality of the search results.SEARCH_K
controls the number of result neighbors we want to query.RESULT_PATH
is the query results inivecs
format.HUB_PATH
is the path of the hub nodes extraction.HOP_PATH
is the path of the routing hop.SAMPLE_PATH
is the path of the traning sample data.
Then configure the trainning parameters, run cluster/model/__main__.py
to train the model and run cluster/model/infer.py
to get hub nodes' embeddings when model trained.
{
"input_dim": 128,
"hidden_dim": 2024,
"output_dim": 512,
"dropout": 0.05,
"batch_size": 16384,
"lr": 1e-5,
"temperature": 0.07,
"epochs": 200
}
./tests/test_gate_search DATA_PATH QUERY_PATH NSG_PATH SEARCH_L SEARCH_K EMB_PATH RESULT_PATH
DATA_PATH
is the path of the base data infvecs
format.QUERY_PATH
is the path of the query data infvecs
format.NSG_PATH
is the path of the pre-built NSG index in previous section.SEARCH_L
controls the quality of the search results.SEARCH_K
controls the number of result neighbors we want to query.HUB_PATH
is the path of the hub nodes extraction.EMB_PATH
is the path of the hub nodes' embeddings.RESULT_PATH
is the query results inivecs
format.
- QPS (for graph-based method)