From 35c698e10862d6543554493d675019b73ef086d2 Mon Sep 17 00:00:00 2001
From: "Corey J. Nolet" <cjnolet@gmail.com>
Date: Tue, 5 May 2026 12:39:25 -0400
Subject: [PATCH 1/2] Initial commit converting rst to md

---
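Note on the heading conversion: the reStructuredText sources removed below
mark headings with an underline row (`=`, `-`, `^`), and the Markdown files
added below use `#`, `##`, `###` instead. The following is a minimal sketch of
that mapping, not the tool used to produce this commit. The names
`convert_headings` and `LEVELS` are hypothetical, and the fixed
character-to-level table is an assumption read off the files in this patch
(reStructuredText itself assigns levels by order of first use).

```python
# Hypothetical helper: rewrite RST underlined headings as Markdown ATX headings.
# Assumption: "=" is level 1, "-" is level 2, "^" is level 3, as in this patch.
LEVELS = {"=": "#", "-": "##", "^": "###"}


def convert_headings(text: str) -> str:
    lines = text.split("\n")
    out = []
    i = 0
    while i < len(lines):
        title = lines[i]
        underline = lines[i + 1] if i + 1 < len(lines) else ""
        # A heading is a non-empty title followed by a row made of a single
        # known underline character, at least as long as the title.
        is_heading = (
            title.strip() != ""
            and underline != ""
            and underline[0] in LEVELS
            and set(underline) == {underline[0]}
            and len(underline) >= len(title.rstrip())
        )
        if is_heading:
            out.append(f"{LEVELS[underline[0]]} {title.strip()}")
            i += 2  # consume both the title and its underline row
        else:
            out.append(title)
            i += 1
    return "\n".join(out)
```

The length check accepts an underline as short as one character, so a
one-letter title such as `C` underlined with a single `^` is treated as a
heading as well.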
 ci/release/update-version.sh                  |   2 +-
 .../all_cuda-129_arch-aarch64.yaml            |   3 +-
 .../all_cuda-129_arch-x86_64.yaml             |   3 +-
 .../all_cuda-131_arch-aarch64.yaml            |   3 +-
 .../all_cuda-131_arch-x86_64.yaml             |   3 +-
 dependencies.yaml                             |   3 +-
 docs/source/advanced_topics.md                |  22 +
 docs/source/advanced_topics.rst               |  22 -
 docs/source/api_basics.md                     |  81 ++
 docs/source/api_basics.rst                    |  90 --
 docs/source/api_docs.md                       |  13 +
 docs/source/api_docs.rst                      |  13 -
 docs/source/api_interoperability.md           | 102 ++
 docs/source/api_interoperability.rst          | 106 --
 docs/source/build.md                          | 261 +++++
 docs/source/build.rst                         | 285 ------
 docs/source/c_api.md                          |  14 +
 docs/source/c_api.rst                         |  14 -
 docs/source/c_api/cluster.md                  |   9 +
 docs/source/c_api/cluster.rst                 |  12 -
 docs/source/c_api/cluster_kmeans_c.md         |  22 +
 docs/source/c_api/cluster_kmeans_c.rst        |  27 -
 docs/source/c_api/core_c_api.md               |  28 +
 docs/source/c_api/core_c_api.rst              |  32 -
 docs/source/c_api/distance.md                 |  20 +
 docs/source/c_api/distance.rst                |  26 -
 docs/source/c_api/neighbors.md                |  16 +
 docs/source/c_api/neighbors.rst               |  19 -
 .../source/c_api/neighbors_all_neighbors_c.md |  22 +
 .../c_api/neighbors_all_neighbors_c.rst       |  26 -
 docs/source/c_api/neighbors_bruteforce_c.md   |  38 +
 docs/source/c_api/neighbors_bruteforce_c.rst  |  42 -
 docs/source/c_api/neighbors_cagra_c.md        |  63 ++
 docs/source/c_api/neighbors_cagra_c.rst       |  67 --
 docs/source/c_api/neighbors_hnsw_c.md         |  61 ++
 docs/source/c_api/neighbors_hnsw_c.rst        |  65 --
 docs/source/c_api/neighbors_ivf_flat_c.md     |  54 ++
 docs/source/c_api/neighbors_ivf_flat_c.rst    |  58 --
 docs/source/c_api/neighbors_ivf_pq_c.md       |  54 ++
 docs/source/c_api/neighbors_ivf_pq_c.rst      |  58 --
 docs/source/c_api/neighbors_mg.md             | 250 +++++
 docs/source/c_api/neighbors_mg.rst            | 257 -----
 docs/source/c_api/neighbors_vamana_c.md       |  39 +
 docs/source/c_api/neighbors_vamana_c.rst      |  43 -
 docs/source/c_api/preprocessing.md            |  34 +
 docs/source/c_api/preprocessing.rst           |  38 -
 ...st => choosing_and_configuring_indexes.md} |  95 +-
 ...aring_indexes.rst => comparing_indexes.md} |  39 +-
 docs/source/conf.py                           |   9 +-
 docs/source/cpp_api.md                        |  15 +
 docs/source/cpp_api.rst                       |  15 -
 docs/source/cpp_api/cluster.md                |  11 +
 docs/source/cpp_api/cluster.rst               |  14 -
 docs/source/cpp_api/cluster_agglomerative.md  |  26 +
 docs/source/cpp_api/cluster_agglomerative.rst |  31 -
 docs/source/cpp_api/cluster_kmeans.md         |  38 +
 docs/source/cpp_api/cluster_kmeans.rst        |  44 -
 docs/source/cpp_api/cluster_spectral.md       |  24 +
 docs/source/cpp_api/cluster_spectral.rst      |  28 -
 docs/source/cpp_api/distance.md               |  27 +
 docs/source/cpp_api/distance.rst              |  32 -
 docs/source/cpp_api/neighbors.md              |  21 +
 docs/source/cpp_api/neighbors.rst             |  24 -
 .../source/cpp_api/neighbors_all_neighbors.md |  24 +
 .../cpp_api/neighbors_all_neighbors.rst       |  29 -
 docs/source/cpp_api/neighbors_bruteforce.md   |  40 +
 docs/source/cpp_api/neighbors_bruteforce.rst  |  44 -
 docs/source/cpp_api/neighbors_cagra.md        |  80 ++
 docs/source/cpp_api/neighbors_cagra.rst       |  84 --
 .../cpp_api/neighbors_dynamic_batching.md     |  40 +
 .../cpp_api/neighbors_dynamic_batching.rst    |  45 -
 ....rst => neighbors_epsilon_neighborhood.md} |  22 +-
 docs/source/cpp_api/neighbors_filter.md       |  15 +
 docs/source/cpp_api/neighbors_filter.rst      |  18 -
 docs/source/cpp_api/neighbors_hnsw.md         |  63 ++
 docs/source/cpp_api/neighbors_hnsw.rst        |  67 --
 docs/source/cpp_api/neighbors_ivf_flat.md     |  64 ++
 docs/source/cpp_api/neighbors_ivf_flat.rst    |  68 --
 docs/source/cpp_api/neighbors_ivf_pq.md       |  76 ++
 docs/source/cpp_api/neighbors_ivf_pq.rst      |  80 --
 docs/source/cpp_api/neighbors_mg.md           |  72 ++
 docs/source/cpp_api/neighbors_mg.rst          |  76 --
 docs/source/cpp_api/neighbors_nn_descent.md   |  32 +
 docs/source/cpp_api/neighbors_nn_descent.rst  |  37 -
 docs/source/cpp_api/neighbors_refine.md       |  16 +
 docs/source/cpp_api/neighbors_refine.rst      |  20 -
 docs/source/cpp_api/neighbors_vamana.md       |  40 +
 docs/source/cpp_api/neighbors_vamana.rst      |  44 -
 docs/source/cpp_api/preprocessing.md          |  11 +
 docs/source/cpp_api/preprocessing.rst         |  14 -
 docs/source/cpp_api/preprocessing_pca.md      |  23 +
 docs/source/cpp_api/preprocessing_pca.rst     |  27 -
 docs/source/cpp_api/preprocessing_quantize.md |  41 +
 .../source/cpp_api/preprocessing_quantize.rst |  45 -
 .../preprocessing_spectral_embedding.md       | 100 ++
 .../preprocessing_spectral_embedding.rst      | 108 ---
 docs/source/cpp_api/selection.md              |  15 +
 docs/source/cpp_api/selection.rst             |  19 -
 docs/source/cpp_api/stats.md                  |  30 +
 docs/source/cpp_api/stats.rst                 |  34 -
 .../source/cuvs_bench/{build.rst => build.md} |  34 +-
 .../cuvs_bench/{datasets.rst => datasets.md}  |  64 +-
 docs/source/cuvs_bench/index.md               | 639 ++++++++++++
 docs/source/cuvs_bench/index.rst              | 661 -------------
 docs/source/cuvs_bench/param_tuning.md        | 894 +++++++++++++++++
 docs/source/cuvs_bench/param_tuning.rst       | 918 ------------------
 docs/source/cuvs_bench/pluggable_backend.md   | 236 +++++
 docs/source/cuvs_bench/pluggable_backend.rst  | 241 -----
 ...ki_all_dataset.rst => wiki_all_dataset.md} |  53 +-
 docs/source/filtering.md                      | 109 +++
 docs/source/filtering.rst                     | 116 ---
 docs/source/getting_started.md                | 115 +++
 docs/source/getting_started.rst               | 124 ---
 docs/source/{index.rst => index.md}           |  72 +-
 docs/source/integrations.md                   |  13 +
 docs/source/integrations.rst                  |  13 -
 .../integrations/{faiss.rst => faiss.md}      |   5 +-
 docs/source/integrations/kinetica.md          |   5 +
 docs/source/integrations/kinetica.rst         |   6 -
 .../integrations/{lucene.rst => lucene.md}    |   5 +-
 .../integrations/{milvus.rst => milvus.md}    |   7 +-
 .../{all_neighbors.rst => all_neighbors.md}   |  17 +-
 .../{bruteforce.rst => bruteforce.md}         |  34 +-
 docs/source/neighbors/cagra.md                | 263 +++++
 docs/source/neighbors/cagra.rst               | 276 ------
 docs/source/neighbors/ivfflat.md              | 106 ++
 docs/source/neighbors/ivfflat.rst             | 115 ---
 docs/source/neighbors/ivfpq.md                | 126 +++
 docs/source/neighbors/ivfpq.rst               | 135 ---
 docs/source/neighbors/neighbors.md            |  19 +
 docs/source/neighbors/neighbors.rst           |  21 -
 .../neighbors/{vamana.rst => vamana.md}       |  94 +-
 docs/source/python_api.md                     |  13 +
 docs/source/python_api.rst                    |  13 -
 docs/source/python_api/cluster.md             |   9 +
 docs/source/python_api/cluster.rst            |  12 -
 docs/source/python_api/cluster_kmeans.md      |  23 +
 docs/source/python_api/cluster_kmeans.rst     |  27 -
 docs/source/python_api/distance.md            |   7 +
 docs/source/python_api/distance.rst           |  12 -
 docs/source/python_api/neighbors.md           |  16 +
 docs/source/python_api/neighbors.rst          |  19 -
 .../python_api/neighbors_all_neighbors.md     |  15 +
 .../python_api/neighbors_all_neighbors.rst    |  19 -
 .../python_api/neighbors_brute_force.md       |  28 +
 .../python_api/neighbors_brute_force.rst      |  32 -
 docs/source/python_api/neighbors_cagra.md     |  47 +
 docs/source/python_api/neighbors_cagra.rst    |  51 -
 docs/source/python_api/neighbors_hnsw.md      |  41 +
 docs/source/python_api/neighbors_hnsw.rst     |  45 -
 docs/source/python_api/neighbors_ivf_flat.md  |  45 +
 docs/source/python_api/neighbors_ivf_flat.rst |  49 -
 docs/source/python_api/neighbors_ivf_pq.md    |  45 +
 docs/source/python_api/neighbors_ivf_pq.rst   |  49 -
 docs/source/python_api/neighbors_mg_cagra.md  |  52 +
 docs/source/python_api/neighbors_mg_cagra.rst |  55 --
 .../python_api/neighbors_mg_ivf_flat.md       |  57 ++
 .../python_api/neighbors_mg_ivf_flat.rst      |  60 --
 docs/source/python_api/neighbors_mg_ivf_pq.md |  57 ++
 .../source/python_api/neighbors_mg_ivf_pq.rst |  60 --
 ...s_multi_gpu.rst => neighbors_multi_gpu.md} | 108 +--
 docs/source/python_api/neighbors_nn_decent.md |  19 +
 .../source/python_api/neighbors_nn_decent.rst |  24 -
 docs/source/python_api/preprocessing.md       |  63 ++
 docs/source/python_api/preprocessing.rst      |  55 --
 docs/source/rust_api/index.md                 |  14 +
 docs/source/rust_api/index.rst                |  15 -
 .../{tuning_guide.rst => tuning_guide.md}     |  45 +-
 ...t => vector_databases_vs_vector_search.md} |  23 +-
 docs/source/working_with_ann_indexes.md       |  12 +
 docs/source/working_with_ann_indexes.rst      |  11 -
 docs/source/working_with_ann_indexes_c.md     |  59 ++
 docs/source/working_with_ann_indexes_c.rst    |  62 --
 docs/source/working_with_ann_indexes_cpp.md   |  40 +
 docs/source/working_with_ann_indexes_cpp.rst  |  43 -
 .../source/working_with_ann_indexes_python.md |  30 +
 .../working_with_ann_indexes_python.rst       |  33 -
 docs/source/working_with_ann_indexes_rust.md  |  61 ++
 docs/source/working_with_ann_indexes_rust.rst |  62 --
 179 files changed, 5752 insertions(+), 6197 deletions(-)
 create mode 100644 docs/source/advanced_topics.md
 delete mode 100644 docs/source/advanced_topics.rst
 create mode 100644 docs/source/api_basics.md
 delete mode 100644 docs/source/api_basics.rst
 create mode 100644 docs/source/api_docs.md
 delete mode 100644 docs/source/api_docs.rst
 create mode 100644 docs/source/api_interoperability.md
 delete mode 100644 docs/source/api_interoperability.rst
 create mode 100644 docs/source/build.md
 delete mode 100644 docs/source/build.rst
 create mode 100644 docs/source/c_api.md
 delete mode 100644 docs/source/c_api.rst
 create mode 100644 docs/source/c_api/cluster.md
 delete mode 100644 docs/source/c_api/cluster.rst
 create mode 100644 docs/source/c_api/cluster_kmeans_c.md
 delete mode 100644 docs/source/c_api/cluster_kmeans_c.rst
 create mode 100644 docs/source/c_api/core_c_api.md
 delete mode 100644 docs/source/c_api/core_c_api.rst
 create mode 100644 docs/source/c_api/distance.md
 delete mode 100644 docs/source/c_api/distance.rst
 create mode 100644 docs/source/c_api/neighbors.md
 delete mode 100644 docs/source/c_api/neighbors.rst
 create mode 100644 docs/source/c_api/neighbors_all_neighbors_c.md
 delete mode 100644 docs/source/c_api/neighbors_all_neighbors_c.rst
 create mode 100644 docs/source/c_api/neighbors_bruteforce_c.md
 delete mode 100644 docs/source/c_api/neighbors_bruteforce_c.rst
 create mode 100644 docs/source/c_api/neighbors_cagra_c.md
 delete mode 100644 docs/source/c_api/neighbors_cagra_c.rst
 create mode 100644 docs/source/c_api/neighbors_hnsw_c.md
 delete mode 100644 docs/source/c_api/neighbors_hnsw_c.rst
 create mode 100644 docs/source/c_api/neighbors_ivf_flat_c.md
 delete mode 100644 docs/source/c_api/neighbors_ivf_flat_c.rst
 create mode 100644 docs/source/c_api/neighbors_ivf_pq_c.md
 delete mode 100644 docs/source/c_api/neighbors_ivf_pq_c.rst
 create mode 100644 docs/source/c_api/neighbors_mg.md
 delete mode 100644 docs/source/c_api/neighbors_mg.rst
 create mode 100644 docs/source/c_api/neighbors_vamana_c.md
 delete mode 100644 docs/source/c_api/neighbors_vamana_c.rst
 create mode 100644 docs/source/c_api/preprocessing.md
 delete mode 100644 docs/source/c_api/preprocessing.rst
 rename docs/source/{choosing_and_configuring_indexes.rst => choosing_and_configuring_indexes.md} (73%)
 rename docs/source/{comparing_indexes.rst => comparing_indexes.md} (86%)
 create mode 100644 docs/source/cpp_api.md
 delete mode 100644 docs/source/cpp_api.rst
 create mode 100644 docs/source/cpp_api/cluster.md
 delete mode 100644 docs/source/cpp_api/cluster.rst
 create mode 100644 docs/source/cpp_api/cluster_agglomerative.md
 delete mode 100644 docs/source/cpp_api/cluster_agglomerative.rst
 create mode 100644 docs/source/cpp_api/cluster_kmeans.md
 delete mode 100644 docs/source/cpp_api/cluster_kmeans.rst
 create mode 100644 docs/source/cpp_api/cluster_spectral.md
 delete mode 100644 docs/source/cpp_api/cluster_spectral.rst
 create mode 100644 docs/source/cpp_api/distance.md
 delete mode 100644 docs/source/cpp_api/distance.rst
 create mode 100644 docs/source/cpp_api/neighbors.md
 delete mode 100644 docs/source/cpp_api/neighbors.rst
 create mode 100644 docs/source/cpp_api/neighbors_all_neighbors.md
 delete mode 100644 docs/source/cpp_api/neighbors_all_neighbors.rst
 create mode 100644 docs/source/cpp_api/neighbors_bruteforce.md
 delete mode 100644 docs/source/cpp_api/neighbors_bruteforce.rst
 create mode 100644 docs/source/cpp_api/neighbors_cagra.md
 delete mode 100644 docs/source/cpp_api/neighbors_cagra.rst
 create mode 100644 docs/source/cpp_api/neighbors_dynamic_batching.md
 delete mode 100644 docs/source/cpp_api/neighbors_dynamic_batching.rst
 rename docs/source/cpp_api/{neighbors_epsilon_neighborhood.rst => neighbors_epsilon_neighborhood.md} (55%)
 create mode 100644 docs/source/cpp_api/neighbors_filter.md
 delete mode 100644 docs/source/cpp_api/neighbors_filter.rst
 create mode 100644 docs/source/cpp_api/neighbors_hnsw.md
 delete mode 100644 docs/source/cpp_api/neighbors_hnsw.rst
 create mode 100644 docs/source/cpp_api/neighbors_ivf_flat.md
 delete mode 100644 docs/source/cpp_api/neighbors_ivf_flat.rst
 create mode 100644 docs/source/cpp_api/neighbors_ivf_pq.md
 delete mode 100644 docs/source/cpp_api/neighbors_ivf_pq.rst
 create mode 100644 docs/source/cpp_api/neighbors_mg.md
 delete mode 100644 docs/source/cpp_api/neighbors_mg.rst
 create mode 100644 docs/source/cpp_api/neighbors_nn_descent.md
 delete mode 100644 docs/source/cpp_api/neighbors_nn_descent.rst
 create mode 100644 docs/source/cpp_api/neighbors_refine.md
 delete mode 100644 docs/source/cpp_api/neighbors_refine.rst
 create mode 100644 docs/source/cpp_api/neighbors_vamana.md
 delete mode 100644 docs/source/cpp_api/neighbors_vamana.rst
 create mode 100644 docs/source/cpp_api/preprocessing.md
 delete mode 100644 docs/source/cpp_api/preprocessing.rst
 create mode 100644 docs/source/cpp_api/preprocessing_pca.md
 delete mode 100644 docs/source/cpp_api/preprocessing_pca.rst
 create mode 100644 docs/source/cpp_api/preprocessing_quantize.md
 delete mode 100644 docs/source/cpp_api/preprocessing_quantize.rst
 create mode 100644 docs/source/cpp_api/preprocessing_spectral_embedding.md
 delete mode 100644 docs/source/cpp_api/preprocessing_spectral_embedding.rst
 create mode 100644 docs/source/cpp_api/selection.md
 delete mode 100644 docs/source/cpp_api/selection.rst
 create mode 100644 docs/source/cpp_api/stats.md
 delete mode 100644 docs/source/cpp_api/stats.rst
 rename docs/source/cuvs_bench/{build.rst => build.md} (72%)
 rename docs/source/cuvs_bench/{datasets.rst => datasets.md} (57%)
 create mode 100644 docs/source/cuvs_bench/index.md
 delete mode 100644 docs/source/cuvs_bench/index.rst
 create mode 100644 docs/source/cuvs_bench/param_tuning.md
 delete mode 100644 docs/source/cuvs_bench/param_tuning.rst
 create mode 100644 docs/source/cuvs_bench/pluggable_backend.md
 delete mode 100644 docs/source/cuvs_bench/pluggable_backend.rst
 rename docs/source/cuvs_bench/{wiki_all_dataset.rst => wiki_all_dataset.md} (57%)
 create mode 100644 docs/source/filtering.md
 delete mode 100644 docs/source/filtering.rst
 create mode 100644 docs/source/getting_started.md
 delete mode 100644 docs/source/getting_started.rst
 rename docs/source/{index.rst => index.md} (60%)
 create mode 100644 docs/source/integrations.md
 delete mode 100644 docs/source/integrations.rst
 rename docs/source/integrations/{faiss.rst => faiss.md} (73%)
 create mode 100644 docs/source/integrations/kinetica.md
 delete mode 100644 docs/source/integrations/kinetica.rst
 rename docs/source/integrations/{lucene.rst => lucene.md} (74%)
 rename docs/source/integrations/{milvus.rst => milvus.md} (59%)
 rename docs/source/neighbors/{all_neighbors.rst => all_neighbors.md} (91%)
 rename docs/source/neighbors/{bruteforce.rst => bruteforce.md} (69%)
 create mode 100644 docs/source/neighbors/cagra.md
 delete mode 100644 docs/source/neighbors/cagra.rst
 create mode 100644 docs/source/neighbors/ivfflat.md
 delete mode 100644 docs/source/neighbors/ivfflat.rst
 create mode 100644 docs/source/neighbors/ivfpq.md
 delete mode 100644 docs/source/neighbors/ivfpq.rst
 create mode 100644 docs/source/neighbors/neighbors.md
 delete mode 100644 docs/source/neighbors/neighbors.rst
 rename docs/source/neighbors/{vamana.rst => vamana.md} (55%)
 create mode 100644 docs/source/python_api.md
 delete mode 100644 docs/source/python_api.rst
 create mode 100644 docs/source/python_api/cluster.md
 delete mode 100644 docs/source/python_api/cluster.rst
 create mode 100644 docs/source/python_api/cluster_kmeans.md
 delete mode 100644 docs/source/python_api/cluster_kmeans.rst
 create mode 100644 docs/source/python_api/distance.md
 delete mode 100644 docs/source/python_api/distance.rst
 create mode 100644 docs/source/python_api/neighbors.md
 delete mode 100644 docs/source/python_api/neighbors.rst
 create mode 100644 docs/source/python_api/neighbors_all_neighbors.md
 delete mode 100644 docs/source/python_api/neighbors_all_neighbors.rst
 create mode 100644 docs/source/python_api/neighbors_brute_force.md
 delete mode 100644 docs/source/python_api/neighbors_brute_force.rst
 create mode 100644 docs/source/python_api/neighbors_cagra.md
 delete mode 100644 docs/source/python_api/neighbors_cagra.rst
 create mode 100644 docs/source/python_api/neighbors_hnsw.md
 delete mode 100644 docs/source/python_api/neighbors_hnsw.rst
 create mode 100644 docs/source/python_api/neighbors_ivf_flat.md
 delete mode 100644 docs/source/python_api/neighbors_ivf_flat.rst
 create mode 100644 docs/source/python_api/neighbors_ivf_pq.md
 delete mode 100644 docs/source/python_api/neighbors_ivf_pq.rst
 create mode 100644 docs/source/python_api/neighbors_mg_cagra.md
 delete mode 100644 docs/source/python_api/neighbors_mg_cagra.rst
 create mode 100644 docs/source/python_api/neighbors_mg_ivf_flat.md
 delete mode 100644 docs/source/python_api/neighbors_mg_ivf_flat.rst
 create mode 100644 docs/source/python_api/neighbors_mg_ivf_pq.md
 delete mode 100644 docs/source/python_api/neighbors_mg_ivf_pq.rst
 rename docs/source/python_api/{neighbors_multi_gpu.rst => neighbors_multi_gpu.md} (51%)
 create mode 100644 docs/source/python_api/neighbors_nn_decent.md
 delete mode 100644 docs/source/python_api/neighbors_nn_decent.rst
 create mode 100644 docs/source/python_api/preprocessing.md
 delete mode 100644 docs/source/python_api/preprocessing.rst
 create mode 100644 docs/source/rust_api/index.md
 delete mode 100644 docs/source/rust_api/index.rst
 rename docs/source/{tuning_guide.rst => tuning_guide.md} (78%)
 rename docs/source/{vector_databases_vs_vector_search.rst => vector_databases_vs_vector_search.md} (91%)
 create mode 100644 docs/source/working_with_ann_indexes.md
 delete mode 100644 docs/source/working_with_ann_indexes.rst
 create mode 100644 docs/source/working_with_ann_indexes_c.md
 delete mode 100644 docs/source/working_with_ann_indexes_c.rst
 create mode 100644 docs/source/working_with_ann_indexes_cpp.md
 delete mode 100644 docs/source/working_with_ann_indexes_cpp.rst
 create mode 100644 docs/source/working_with_ann_indexes_python.md
 delete mode 100644 docs/source/working_with_ann_indexes_python.rst
 create mode 100644 docs/source/working_with_ann_indexes_rust.md
 delete mode 100644 docs/source/working_with_ann_indexes_rust.rst

diff --git a/ci/release/update-version.sh b/ci/release/update-version.sh
index 49da9abe83..5ce8878395 100755
--- a/ci/release/update-version.sh
+++ b/ci/release/update-version.sh
@@ -146,7 +146,7 @@ elif [[ "${RUN_CONTEXT}" == "release" ]]; then
 fi
 
 # Update cuvs-bench Docker image references (version-only, not branch-related)
-sed_runner "s|rapidsai/cuvs-bench:[0-9][0-9].[0-9][0-9]|rapidsai/cuvs-bench:${NEXT_SHORT_TAG}|g" docs/source/cuvs_bench/index.rst
+sed_runner "s|rapidsai/cuvs-bench:[0-9][0-9].[0-9][0-9]|rapidsai/cuvs-bench:${NEXT_SHORT_TAG}|g" docs/source/cuvs_bench/index.md
 
 # Version references (not branch-related)
 sed_runner "s|=[0-9][0-9].[0-9][0-9]|=${NEXT_SHORT_TAG}|g" README.md
diff --git a/conda/environments/all_cuda-129_arch-aarch64.yaml b/conda/environments/all_cuda-129_arch-aarch64.yaml
index 0a473a210f..264ba73b8e 100644
--- a/conda/environments/all_cuda-129_arch-aarch64.yaml
+++ b/conda/environments/all_cuda-129_arch-aarch64.yaml
@@ -35,6 +35,7 @@ dependencies:
 - libopenblas<=0.3.30
 - librmm==26.6.*,>=0.0.0a0
 - make
+- myst-parser
 - nccl>=2.19
 - ninja
 - numpy>=1.23,<3.0
@@ -45,12 +46,10 @@ dependencies:
 - pytest
 - pytest-cov
 - rapids-build-backend>=0.4.0,<0.5.0
-- recommonmark
 - rust
 - scikit-build-core>=0.11.0
 - scikit-learn>=1.5
 - sphinx-copybutton
-- sphinx-markdown-tables
 - sphinx>=8.0.0
 - sysroot_linux-aarch64==2.28
 - pip:
diff --git a/conda/environments/all_cuda-129_arch-x86_64.yaml b/conda/environments/all_cuda-129_arch-x86_64.yaml
index 08e3c3f4e5..695df7793b 100644
--- a/conda/environments/all_cuda-129_arch-x86_64.yaml
+++ b/conda/environments/all_cuda-129_arch-x86_64.yaml
@@ -34,6 +34,7 @@ dependencies:
 - libnvjitlink-dev
 - librmm==26.6.*,>=0.0.0a0
 - make
+- myst-parser
 - nccl>=2.19
 - ninja
 - numpy>=1.23,<3.0
@@ -44,12 +45,10 @@ dependencies:
 - pytest
 - pytest-cov
 - rapids-build-backend>=0.4.0,<0.5.0
-- recommonmark
 - rust
 - scikit-build-core>=0.11.0
 - scikit-learn>=1.5
 - sphinx-copybutton
-- sphinx-markdown-tables
 - sphinx>=8.0.0
 - sysroot_linux-64==2.28
 - pip:
diff --git a/conda/environments/all_cuda-131_arch-aarch64.yaml b/conda/environments/all_cuda-131_arch-aarch64.yaml
index 9fb879b06f..315d142788 100644
--- a/conda/environments/all_cuda-131_arch-aarch64.yaml
+++ b/conda/environments/all_cuda-131_arch-aarch64.yaml
@@ -35,6 +35,7 @@ dependencies:
 - libopenblas<=0.3.30
 - librmm==26.6.*,>=0.0.0a0
 - make
+- myst-parser
 - nccl>=2.19
 - ninja
 - numpy>=1.23,<3.0
@@ -45,12 +46,10 @@ dependencies:
 - pytest
 - pytest-cov
 - rapids-build-backend>=0.4.0,<0.5.0
-- recommonmark
 - rust
 - scikit-build-core>=0.11.0
 - scikit-learn>=1.5
 - sphinx-copybutton
-- sphinx-markdown-tables
 - sphinx>=8.0.0
 - sysroot_linux-aarch64==2.28
 - pip:
diff --git a/conda/environments/all_cuda-131_arch-x86_64.yaml b/conda/environments/all_cuda-131_arch-x86_64.yaml
index 105e7a8d9c..cfbf03a543 100644
--- a/conda/environments/all_cuda-131_arch-x86_64.yaml
+++ b/conda/environments/all_cuda-131_arch-x86_64.yaml
@@ -34,6 +34,7 @@ dependencies:
 - libnvjitlink-dev
 - librmm==26.6.*,>=0.0.0a0
 - make
+- myst-parser
 - nccl>=2.19
 - ninja
 - numpy>=1.23,<3.0
@@ -44,12 +45,10 @@ dependencies:
 - pytest
 - pytest-cov
 - rapids-build-backend>=0.4.0,<0.5.0
-- recommonmark
 - rust
 - scikit-build-core>=0.11.0
 - scikit-learn>=1.5
 - sphinx-copybutton
-- sphinx-markdown-tables
 - sphinx>=8.0.0
 - sysroot_linux-64==2.28
 - pip:
diff --git a/dependencies.yaml b/dependencies.yaml
index 2aae054862..e7cbd0d315 100644
--- a/dependencies.yaml
+++ b/dependencies.yaml
@@ -450,11 +450,10 @@ dependencies:
           - doxygen>=1.8.20
           - graphviz
           - ipython
+          - myst-parser
           - numpydoc
-          - recommonmark
           - sphinx>=8.0.0
           - sphinx-copybutton
-          - sphinx-markdown-tables
           - pip:
               - nvidia-sphinx-theme
   rust:
diff --git a/docs/source/advanced_topics.md b/docs/source/advanced_topics.md
new file mode 100644
index 0000000000..bd7ab0b709
--- /dev/null
+++ b/docs/source/advanced_topics.md
@@ -0,0 +1,22 @@
+# Advanced Topics
+
+- [Just-in-Time Compilation](#just-in-time-compilation)
+
+## Just-in-Time Compilation
+cuVS uses Just-in-Time (JIT) [Link-Time Optimization (LTO)](https://developer.nvidia.com/blog/cuda-12-0-compiler-support-for-runtime-lto-using-nvjitlink-library/) compilation to compile certain kernels. When JIT compilation is triggered, cuVS compiles the kernel for your architecture and automatically caches it both in memory and on disk. The validity of the cache is as follows:
+
+1. In-memory cache is valid for the lifetime of the process.
+2. On-disk cache is valid until a CUDA driver upgrade is performed. The cache can be portably shared between machines through network or cloud storage, and we strongly recommend that you store it in a persistent location. For details on how to configure the on-disk cache, see the CUDA documentation on [JIT Compilation](https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/environment-variables.html#jit-compilation). The relevant environment variables are `CUDA_CACHE_PATH` and `CUDA_CACHE_MAX_SIZE`.
+
+
+Thus, JIT compilation is a one-time cost, and you can expect no performance loss after the first compilation. We recommend running a "warmup" to trigger JIT compilation before actual use.
+
+Currently, the following capabilities will trigger a JIT compilation:
+- IVF Flat search APIs: {doc}`cuvs::neighbors::ivf_flat::search() <cpp_api/neighbors_ivf_flat>`
+
+```{toctree}
+:maxdepth: 2
+
+jit_lto_guide
+```
+
diff --git a/docs/source/advanced_topics.rst b/docs/source/advanced_topics.rst
deleted file mode 100644
index 4171845af5..0000000000
--- a/docs/source/advanced_topics.rst
+++ /dev/null
@@ -1,22 +0,0 @@
-Advanced Topics
-===============
-
-- `Just-in-Time Compilation`_
-
-Just-in-Time Compilation
-------------------------
-cuVS uses the Just-in-Time (JIT) `Link-Time Optimization (LTO) <https://developer.nvidia.com/blog/cuda-12-0-compiler-support-for-runtime-lto-using-nvjitlink-library/>`_ compilation technology to compile certain kernels. When a JIT compilation is triggered, cuVS will compile the kernel for your architecture and automatically cache it in-memory and on-disk. The validity of the cache is as follows:
-
-1. In-memory cache is valid for the lifetime of the process.
-2. On-disk cache is valid until a CUDA driver upgrade is performed. The cache can be portably shared between machines in network or cloud storage and we strongly recommend that you store the cache in a persistent location. For more details on how to configure the on-disk cache, look at CUDA documentation on `JIT Compilation <https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/environment-variables.html#jit-compilation>`_. Specifically, the environment variables of interest are: `CUDA_CACHE_PATH` and `CUDA_CACHE_MAX_SIZE`.
-
-
-Thus, the JIT compilation is a one-time cost and you can expect no loss in real performance after the first compilation. We recommend that you run a "warmup" to trigger the JIT compilation before the actual usage.
-
-Currently, the following capabilities will trigger a JIT compilation:
-- IVF Flat search APIs: :doc:`cuvs::neighbors::ivf_flat::search() <cpp_api/neighbors_ivf_flat>`
-
-.. toctree::
-   :maxdepth: 2
-
-   jit_lto_guide
diff --git a/docs/source/api_basics.md b/docs/source/api_basics.md
new file mode 100644
index 0000000000..7612837003
--- /dev/null
+++ b/docs/source/api_basics.md
@@ -0,0 +1,81 @@
+# cuVS API Basics
+
+- [Memory management](#memory-management)
+- [Resource management](#resource-management)
+
+## Memory management
+
+Centralized memory management allows flexible configuration of allocation strategies, such as sharing the same CUDA memory pool across library boundaries. cuVS uses the [RMM](https://github.com/rapidsai/rmm) library, which eases the burden of configuring different allocation strategies globally across GPU-accelerated libraries.
+
+RMM currently has APIs for C++ and Python.
+
+### C++
+
+Here's an example of configuring RMM to use a pool allocator in C++ (derived from the RMM example [here](https://github.com/rapidsai/rmm?tab=readme-ov-file#example)):
+
+```c++
+rmm::mr::cuda_memory_resource cuda_mr;
+// Construct a resource that uses a coalescing best-fit pool allocator
+// With the pool initially half of available device memory
+auto initial_size = rmm::percent_of_free_device_memory(50);
+rmm::mr::pool_memory_resource pool_mr{cuda_mr, initial_size};
+rmm::mr::set_current_device_resource(pool_mr);
+auto mr = rmm::mr::get_current_device_resource_ref();
+```
+
+### Python
+
+And the corresponding code in Python (derived from the RMM example [here](https://github.com/rapidsai/rmm?tab=readme-ov-file#memoryresource-objects)):
+
+```python
+import rmm
+pool = rmm.mr.PoolMemoryResource(
+  rmm.mr.CudaMemoryResource(),
+  initial_pool_size=2**30,
+  maximum_pool_size=2**32)
+rmm.mr.set_current_device_resource(pool)
+```
+
+## Resource management
+
+cuVS uses an API from the [RAFT](https://github.com/rapidsai/raft) library of ML and data mining primitives to centralize and reuse expensive resources, such as memory management. The code examples below show how to create these resources for use throughout this guide.
+
+See RAFT's [resource API documentation](https://docs.rapids.ai/api/raft/nightly/cpp_api/core_resources/) for more information.
+
+### C
+
+
+```c
+#include <cuda_runtime.h>
+#include <cuvs/core/c_api.h>
+
+cuvsResources_t res;
+cuvsResourcesCreate(&res);
+
+// ... do some processing ...
+
+cuvsResourcesDestroy(res);
+```
+
+### C++
+
+```c++
+#include <raft/core/device_resources.hpp>
+
+raft::device_resources res;
+```
+
+### Python
+
+```python
+import pylibraft
+
+res = pylibraft.common.DeviceResources()
+```
+
+### Rust
+
+```rust
+let res = cuvs::Resources::new()?;
+```
+
diff --git a/docs/source/api_basics.rst b/docs/source/api_basics.rst
deleted file mode 100644
index 5ffb1da630..0000000000
--- a/docs/source/api_basics.rst
+++ /dev/null
@@ -1,90 +0,0 @@
-cuVS API Basics
-===============
-
-- `Memory management`_
-- `Resource management`_
-
-Memory management
------------------
-
-Centralized memory management allows flexible configuration of allocation strategies, such as sharing the same CUDA memory pool across library boundaries. cuVS uses the `RMM <https://github.com/rapidsai/rmm>`_ library, which eases the burden of configuring different allocation strategies globally across GPU-accelerated libraries.
-
-RMM currently has APIs for C++ and Python.
-
-C++
-^^^
-
-Here's an example of configuring RMM to use a pool allocator in C++ (derived from the RMM example `here <https://github.com/rapidsai/rmm?tab=readme-ov-file#example>`__):
-
-.. code-block:: c++
-
-    rmm::mr::cuda_memory_resource cuda_mr;
-    // Construct a resource that uses a coalescing best-fit pool allocator
-    // With the pool initially half of available device memory
-    auto initial_size = rmm::percent_of_free_device_memory(50);
-    rmm::mr::pool_memory_resource pool_mr{cuda_mr, initial_size};
-    rmm::mr::set_current_device_resource(pool_mr);
-    auto mr = rmm::mr::get_current_device_resource_ref();
-
-Python
-^^^^^^
-
-And the corresponding code in Python (derived from the RMM example `here <https://github.com/rapidsai/rmm?tab=readme-ov-file#memoryresource-objects>`__):
-
-.. code-block:: python
-
-    import rmm
-    pool = rmm.mr.PoolMemoryResource(
-      rmm.mr.CudaMemoryResource(),
-      initial_pool_size=2**30,
-      maximum_pool_size=2**32)
-    rmm.mr.set_current_device_resource(pool)
-
-
-Resource management
--------------------
-
-cuVS uses an API from the `RAFT <https://github.com/rapidsai/raft>`_ library of ML and data mining primitives to centralize and reuse expensive resources, such as memory management. The below code examples demonstrate how to create these resources for use throughout this guide.
-
-See RAFT's `resource API documentation <https://docs.rapids.ai/api/raft/nightly/cpp_api/core_resources/>`_ for more information.
-
-C
-^
-
-.. code-block:: c
-
-    #include <cuda_runtime.h>
-    #include <cuvs/core/c_api.h>
-
-    cuvsResources_t res;
-    cuvsResourcesCreate(&res);
-
-    // ... do some processing ...
-
-    cuvsResourcesDestroy(res);
-
-C++
-^^^
-
-.. code-block:: c++
-
-    #include <raft/core/device_resources.hpp>
-
-    raft::device_resources res;
-
-Python
-^^^^^^
-
-.. code-block:: python
-
-    import pylibraft
-
-    res = pylibraft.common.DeviceResources()
-
-
-Rust
-^^^^
-
-.. code-block:: rust
-
-    let res = cuvs::Resources::new()?;
diff --git a/docs/source/api_docs.md b/docs/source/api_docs.md
new file mode 100644
index 0000000000..5d91e6dbbb
--- /dev/null
+++ b/docs/source/api_docs.md
@@ -0,0 +1,13 @@
+# API Reference
+
+```{toctree}
+:maxdepth: 3
+
+c_api.md
+cpp_api.md
+python_api.md
+rust_api/index.md
+```
+
+* {ref}`genindex`
+* {ref}`search`
diff --git a/docs/source/api_docs.rst b/docs/source/api_docs.rst
deleted file mode 100644
index 68d184c72c..0000000000
--- a/docs/source/api_docs.rst
+++ /dev/null
@@ -1,13 +0,0 @@
-API Reference
-=============
-
-.. toctree::
-   :maxdepth: 3
-
-   c_api.rst
-   cpp_api.rst
-   python_api.rst
-   rust_api/index.rst
-
-* :ref:`genindex`
-* :ref:`search`
diff --git a/docs/source/api_interoperability.md b/docs/source/api_interoperability.md
new file mode 100644
index 0000000000..9c454c6a5e
--- /dev/null
+++ b/docs/source/api_interoperability.md
@@ -0,0 +1,102 @@
+# Interoperability
+
+## DLPack (C)
+
+Approximate nearest neighbor (ANN) indexes provide an interface to build and search an index via a C API. [DLPack v0.8](https://github.com/dmlc/dlpack/blob/main/README.md), a tensor interface framework, is used as the standard to interact with our C API.
+
+Representing a tensor with DLPack is simple, as it is a POD struct that stores information about the tensor at runtime. At the moment, `DLManagedTensor` from DLPack v0.8 is compatible with our C API; however, we will soon upgrade to `DLManagedTensorVersioned` from DLPack v1.0, as it will help us maintain ABI and API compatibility.
+
+Here's an example of how to represent device memory using `DLManagedTensor`:
+
+```c
+#include <dlpack/dlpack.h>
+
+// Create data representation in host memory
+float dataset[2][1] = {{0.2f}, {0.1f}};
+// copy data to device memory
+float *dataset_dev;
+cuvsRMMAlloc(&dataset_dev, sizeof(float) * 2 * 1);
+cudaMemcpy(dataset_dev, dataset, sizeof(float) * 2 * 1, cudaMemcpyDefault);
+
+// Use DLPack for representing the data as a tensor
+DLManagedTensor dataset_tensor;
+dataset_tensor.dl_tensor.data               = dataset_dev;
+dataset_tensor.dl_tensor.device.device_type = kDLCUDA;
+dataset_tensor.dl_tensor.ndim               = 2;
+dataset_tensor.dl_tensor.dtype.code         = kDLFloat;
+dataset_tensor.dl_tensor.dtype.bits         = 32;
+dataset_tensor.dl_tensor.dtype.lanes        = 1;
+int64_t dataset_shape[2]                    = {2, 1};
+dataset_tensor.dl_tensor.shape              = dataset_shape;
+dataset_tensor.dl_tensor.strides            = NULL;
+
+// free memory after use
+cuvsRMMFree(dataset_dev);
+```
+
+Please refer to the [cuVS C API documentation](c_api.md) to learn more.
+
+## Multi-dimensional span (C++)
+
+cuVS is built on top of the GPU-accelerated machine learning and data mining primitives in the [RAFT](https://github.com/rapidsai/raft) library. Most of the C++ APIs in cuVS accept an [mdspan](https://arxiv.org/abs/2010.06474) multi-dimensional array view for representing data in higher dimensions, similar to the `ndarray` in the NumPy Python library. RAFT also contains the corresponding owning `mdarray` structure, which simplifies the allocation and management of multi-dimensional data in both host and device (GPU) memory.
+
+The `mdarray` is an owning object that forms a convenience layer over RMM and can be constructed in RAFT using a number of different helper functions:
+
+```c++
+#include <raft/core/device_mdarray.hpp>
+
+int n_rows = 10;
+int n_cols = 10;
+
+auto scalar = raft::make_device_scalar<float>(handle, 1.0);
+auto vector = raft::make_device_vector<float>(handle, n_cols);
+auto matrix = raft::make_device_matrix<float>(handle, n_rows, n_cols);
+```
+
+The `mdspan` is a lightweight non-owning view that can wrap around any pointer, maintaining shape, layout, and indexing information for accessing elements.
+
+We can construct `mdspan` instances directly from the above `mdarray` instances:
+
+```c++
+// Scalar mdspan on device
+auto scalar_view = scalar.view();
+
+// Vector mdspan on device
+auto vector_view = vector.view();
+
+// Matrix mdspan on device
+auto matrix_view = matrix.view();
+```
+
+Since the `mdspan` is just a lightweight wrapper, we can also construct it from the underlying data handles in the `mdarray` instances above. We use the extent to get information about the `mdarray` or `mdspan`'s shape.
+
+```c++
+#include <raft/core/device_mdspan.hpp>
+
+auto scalar_view = raft::make_device_scalar_view(scalar.data_handle());
+auto vector_view = raft::make_device_vector_view(vector.data_handle(), vector.extent(0));
+auto matrix_view = raft::make_device_matrix_view(matrix.data_handle(), matrix.extent(0), matrix.extent(1));
+```
+
+Of course, RAFT's `mdspan`/`mdarray` APIs aren't just limited to the `device`. You can also create `host` variants:
+
+```c++
+#include <raft/core/host_mdarray.hpp>
+#include <raft/core/host_mdspan.hpp>
+
+int n_rows = 10;
+int n_cols = 10;
+
+auto scalar = raft::make_host_scalar<float>(handle, 1.0);
+auto vector = raft::make_host_vector<float>(handle, n_cols);
+auto matrix = raft::make_host_matrix<float>(handle, n_rows, n_cols);
+
+auto scalar_view = raft::make_host_scalar_view(scalar.data_handle());
+auto vector_view = raft::make_host_vector_view(vector.data_handle(), vector.extent(0));
+auto matrix_view = raft::make_host_matrix_view(matrix.data_handle(), matrix.extent(0), matrix.extent(1));
+```
+
+Please refer to RAFT's [mdspan documentation](https://docs.rapids.ai/api/raft/stable/cpp_api/mdspan/) to learn more.
+
+
+## CUDA array interface (Python)
diff --git a/docs/source/api_interoperability.rst b/docs/source/api_interoperability.rst
deleted file mode 100644
index 097025aee7..0000000000
--- a/docs/source/api_interoperability.rst
+++ /dev/null
@@ -1,106 +0,0 @@
-Interoperability
-================
-
-DLPack (C)
-^^^^^^^^^^
-
-Approximate nearest neighbor (ANN) indexes provide an interface to build and search an index via a C API. `DLPack v0.8 <https://github.com/dmlc/dlpack/blob/main/README.md>`_, a tensor interface framework, is used as the standard to interact with our C API.
-
-Representing a tensor with DLPack is simple, as it is a POD struct that stores information about the tensor at runtime. At the moment, `DLManagedTensor` from DLPack v0.8 is compatible with out C API however we will soon upgrade to `DLManagedTensorVersioned` from DLPack v1.0 as it will help us maintain ABI and API compatibility.
-
-Here's an example on how to represent device memory using `DLManagedTensor`:
-
-.. code-block:: c
-
-    #include <dlpack/dlpack.h>
-
-    // Create data representation in host memory
-    float dataset[2][1] = {{0.2, 0.1}};
-    // copy data to device memory
-    float *dataset_dev;
-    cuvsRMMAlloc(&dataset_dev, sizeof(float) * 2 * 1);
-    cudaMemcpy(dataset_dev, dataset, sizeof(float) * 2 * 1, cudaMemcpyDefault);
-
-    // Use DLPack for representing the data as a tensor
-    DLManagedTensor dataset_tensor;
-    dataset_tensor.dl_tensor.data               = dataset;
-    dataset_tensor.dl_tensor.device.device_type = kDLCUDA;
-    dataset_tensor.dl_tensor.ndim               = 2;
-    dataset_tensor.dl_tensor.dtype.code         = kDLFloat;
-    dataset_tensor.dl_tensor.dtype.bits         = 32;
-    dataset_tensor.dl_tensor.dtype.lanes        = 1;
-    int64_t dataset_shape[2]                    = {2, 1};
-    dataset_tensor.dl_tensor.shape              = dataset_shape;
-    dataset_tensor.dl_tensor.strides            = nullptr;
-
-    // free memory after use
-    cuvsRMMFree(dataset_dev);
-
-Please refer to `cuVS C API documentation <c_api.rst>`_ to learn more.
-
-Multi-dimensional span (C++)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-cuVS is built on top of the GPU-accelerated machine learning and data mining primitives in the `RAFT <https://github.com/rapidsai/raft>`_ library. Most of the C++ APIs in cuVS accept `mdspan <https://arxiv.org/abs/2010.06474>`_ multi-dimensional array view for representing data in higher dimensions similar to the `ndarray` in the Numpy Python library. RAFT also contains the corresponding owning `mdarray` structure, which simplifies the allocation and management of multi-dimensional data in both host and device (GPU) memory.
-
-The `mdarray` is an owning object that forms a convenience layer over RMM and can be constructed in RAFT using a number of different helper functions:
-
-.. code-block:: c++
-
-    #include <raft/core/device_mdarray.hpp>
-
-    int n_rows = 10;
-    int n_cols = 10;
-
-    auto scalar = raft::make_device_scalar<float>(handle, 1.0);
-    auto vector = raft::make_device_vector<float>(handle, n_cols);
-    auto matrix = raft::make_device_matrix<float>(handle, n_rows, n_cols);
-
-The `mdspan` is a lightweight non-owning view that can wrap around any pointer, maintaining shape, layout, and indexing information for accessing elements.
-
-We can construct `mdspan` instances directly from the above `mdarray` instances:
-
-.. code-block:: c++
-
-    // Scalar mdspan on device
-    auto scalar_view = scalar.view();
-
-    // Vector mdspan on device
-    auto vector_view = vector.view();
-
-    // Matrix mdspan on device
-    auto matrix_view = matrix.view();
-
-Since the `mdspan` is just a lightweight wrapper, we can also construct it from the underlying data handles in the `mdarray` instances above. We use the extent to get information about the `mdarray` or `mdspan`'s shape.
-
-.. code-block:: c++
-
-    #include <raft/core/device_mdspan.hpp>
-
-    auto scalar_view = raft::make_device_scalar_view(scalar.data_handle());
-    auto vector_view = raft::make_device_vector_view(vector.data_handle(), vector.extent(0));
-    auto matrix_view = raft::make_device_matrix_view(matrix.data_handle(), matrix.extent(0), matrix.extent(1));
-
-Of course, RAFT's `mdspan`/`mdarray` APIs aren't just limited to the `device`. You can also create `host` variants:
-
-.. code-block:: c++
-
-    #include <raft/core/host_mdarray.hpp>
-    #include <raft/core/host_mdspan.hpp>
-
-    int n_rows = 10;
-    int n_cols = 10;
-
-    auto scalar = raft::make_host_scalar<float>(handle, 1.0);
-    auto vector = raft::make_host_vector<float>(handle, n_cols);
-    auto matrix = raft::make_host_matrix<float>(handle, n_rows, n_cols);
-
-    auto scalar_view = raft::make_host_scalar_view(scalar.data_handle());
-    auto vector_view = raft::make_host_vector_view(vector.data_handle(), vector.extent(0));
-    auto matrix_view = raft::make_host_matrix_view(matrix.data_handle(), matrix.extent(0), matrix.extent(1));
-
-Please refer to RAFT's `mdspan documentation <https://docs.rapids.ai/api/raft/stable/cpp_api/mdspan/>`_ to learn more.
-
-
-CUDA array interface (Python)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/docs/source/build.md b/docs/source/build.md
new file mode 100644
index 0000000000..dc28fc3a4e
--- /dev/null
+++ b/docs/source/build.md
@@ -0,0 +1,261 @@
+# Installation
+
+The cuVS software development kit provides APIs for the C, C++, Python, and Rust languages. This guide outlines how to install the pre-compiled packages, build cuVS from source, and use it in downstream applications.
+
+- [Installing pre-compiled packages](#installing-pre-compiled-packages)
+
+  * [C, C++, and Python through Conda](#c-c-and-python-through-conda)
+
+  * [Python through Pip](#python-through-pip)
+
+  * [Tarball](#tarball)
+
+- [Build from source](#build-from-source)
+
+  * [Prerequisites](#prerequisites)
+
+  * [Create a build environment](#create-a-build-environment)
+
+  * [C and C++ Libraries](#c-and-c-libraries)
+
+    * [Building the Googletests](#building-the-googletests)
+
+  * [Python Library](#python-library)
+
+  * [Rust Library](#rust-library)
+
+  * [Using CMake Directly](#using-cmake-directly)
+
+- [Build Documentation](#build-documentation)
+
+
+## Installing Pre-compiled Packages
+
+**Note:** The cuVS pre-compiled packages are available for **Linux** only (x86_64 and aarch64 architectures). Native Windows support is not available at this time. On Windows, use **WSL2** with GPU passthrough. See the [RAPIDS WSL2 guide](https://rapids.ai/start.html#wsl2).
+
+### C, C++, and Python through Conda
+
+The easiest way to install the pre-compiled C, C++, and Python packages is through conda. You can get a minimal conda installation with [miniforge](https://github.com/conda-forge/miniforge).
+
+Use the following commands, depending on your CUDA version, to install cuVS packages (replace `rapidsai` with `rapidsai-nightly` to install more up-to-date but less stable nightly packages). `mamba` is preferred over the `conda` command and can be enabled using [this guide](https://conda.github.io/conda-libmamba-solver/user-guide/).
+
+#### C/C++ Package
+
+```bash
+# CUDA 13
+conda install -c rapidsai -c conda-forge libcuvs cuda-version=13.1
+
+# CUDA 12
+conda install -c rapidsai -c conda-forge libcuvs cuda-version=12.9
+```
+
+#### Python Package
+
+```bash
+# CUDA 13
+conda install -c rapidsai -c conda-forge cuvs cuda-version=13.1
+
+# CUDA 12
+conda install -c rapidsai -c conda-forge cuvs cuda-version=12.9
+```
+
+### Python through Pip
+
+The cuVS Python package can also be [installed through pip](https://docs.rapids.ai/install#pip).
+
+```bash
+# CUDA 13
+pip install cuvs-cu13 --extra-index-url=https://pypi.nvidia.com
+
+# CUDA 12
+pip install cuvs-cu12 --extra-index-url=https://pypi.nvidia.com
+```
+
+Note: these packages statically link the C and C++ libraries, so the `libcuvs` and `libcuvs_c` shared libraries won't be readily available to use in your code.
+
+### Tarball
+
+#### Install Dependencies
+
+1. [NCCL](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html)
+2. `libopenmp`
+3. CUDA Toolkit Runtime 12.2+
+4. Ampere architecture or better (compute capability >= 8.0)
+
+#### Download & Extract
+
+Download the pre-built tarball for your CPU architecture and CUDA version from
+[https://developer.nvidia.com/cuvs-downloads](https://developer.nvidia.com/cuvs-downloads)
+
+Untar the tarball into a directory.
+
+```bash
+tar -xvf libcuvs-linux-sbsa-26.02.00.189485_cuda12-archive.tar.xz -C /path/to/folder
+```
+
+Add cuVS to your system library load path. This should be done in the appropriate profile configuration (e.g. `.bashrc` or `.bash_profile`) to maintain the setting across sessions.
+
+```bash
+export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/folder
+```
+
+## Build from source
+
+The core cuVS source code is written in C++ and wrapped through a C API; the other supported languages are built on top of the C API.
+
+### Prerequisites
+
+- CMake 3.26.4+
+- GCC 9.3+ (11.4+ recommended)
+- CUDA Toolkit 12.2+
+- Ampere architecture or better (compute capability >= 8.0)
+
+### Create a build environment
+
+Conda environment scripts are provided for installing the necessary dependencies to build cuVS from source. It is preferred to use `mamba`, as it provides significant speedup over `conda`:
+
+```bash
+conda env create --name cuvs -f conda/environments/all_cuda-131_arch-$(uname -m).yaml
+conda activate cuvs
+```
+
+The recommended way to build and install cuVS from source is to use the `build.sh` script in the root of the repository. This script can build both the C++ and Python artifacts and provides CMake options for building and installing the headers, tests, benchmarks, and the pre-compiled shared library.
+
+
+### C and C++ libraries
+
+The C and C++ shared libraries are built together using the following arguments to `build.sh`:
+
+```bash
+./build.sh libcuvs
+```
+
+In the above example, the `libcuvs.so` and `libcuvs_c.so` shared libraries are installed by default into `$INSTALL_PREFIX/lib`. To disable this, pass the `-n` flag.
+
+Once installed, the shared libraries, headers (and any dependencies downloaded and installed via `rapids-cmake`) can be uninstalled using `build.sh`:
+
+```bash
+./build.sh libcuvs --uninstall
+```
+
+#### Multi-GPU features
+
+To disable the multi-GPU features, run:
+
+```bash
+./build.sh libcuvs --no-mg
+```
+
+#### Building the Googletests
+
+Compile the C and C++ Googletests using the `tests` target in `build.sh`.
+
+```bash
+./build.sh libcuvs tests
+```
+
+The tests will be written to the build directory, which is `cpp/build/` by default, and they will be named `*_TEST`.
+
+It can take some time to compile all of the tests. You can build individual tests by providing a semicolon-separated list to the `--limit-tests` option in `build.sh`. Make sure to pass the `-n` flag so the tests are not installed.
+
+```bash
+./build.sh libcuvs tests -n --limit-tests="NEIGHBORS_TEST;CAGRA_C_TEST"
+```
+
+### Python library
+
+The Python library should be built and installed using the `build.sh` script:
+
+```bash
+./build.sh python
+```
+
+The Python packages can also be uninstalled using the `build.sh` script:
+
+```bash
+./build.sh python --uninstall
+```
+
+### Go library
+
+After building the C and C++ libraries, the Go library can be built with the following commands:
+
+```bash
+export CUDA_HOME="/usr/local/cuda" # or wherever your CUDA installation is.
+export CGO_CFLAGS="-I${CONDA_PREFIX}/include -I${CUDA_HOME}/include"
+export CGO_LDFLAGS="-L${CONDA_PREFIX}/lib -lcuvs -lcuvs_c"
+export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"
+export CC=clang
+
+./build.sh go
+```
+
+### Rust library
+
+The Rust bindings can be built with:
+
+```bash
+./build.sh rust
+```
+
+### Using CMake directly
+
+When building cuVS from source, the `build.sh` script wraps the `cmake` commands so that the various CMake options do not need to be configured manually. When more fine-grained control over the CMake configuration is desired, the `cmake` command can be invoked directly, as the example below demonstrates.
+
+The `CMAKE_INSTALL_PREFIX` option sets the location where cuVS is installed. The example below installs cuVS into the current Conda environment:
+
+```bash
+cd cpp
+mkdir build
+cd build
+cmake -DBUILD_TESTS=ON -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX ../
+make -j<parallel_level> install
+```
+
+cuVS has the following configurable cmake flags available:
+
+```{list-table} CMake Flags
+* - Flag
+  - Possible Values
+  - Default Value
+  - Behavior
+
+* - BUILD_TESTS
+  - ON, OFF
+  - ON
+  - Compile Googletests
+
+* - CUDA_ENABLE_KERNELINFO
+  - ON, OFF
+  - OFF
+  - Enables `kernelinfo` in nvcc. This is useful for `compute-sanitizer`
+
+* - CUDA_ENABLE_LINEINFO
+  - ON, OFF
+  - OFF
+  - Enable the `-lineinfo` option for nvcc
+
+* - CUDA_STATIC_MATH_LIBRARIES
+  - ON, OFF
+  - OFF
+  - Statically link the CUDA math libraries
+
+* - DETECT_CONDA_ENV
+  - ON, OFF
+  - ON
+  - Enable detection of conda environment for dependencies
+
+* - CUVS_NVTX
+  - ON, OFF
+  - OFF
+  - Enable NVTX markers
+```
+
+### Build documentation
+
+The documentation requires that the C, C++ and Python libraries have been built and installed. The following will build the docs along with the necessary libraries:
+
+```bash
+./build.sh libcuvs python docs
+```
+
diff --git a/docs/source/build.rst b/docs/source/build.rst
deleted file mode 100644
index 5e863e40f4..0000000000
--- a/docs/source/build.rst
+++ /dev/null
@@ -1,285 +0,0 @@
-Installation
-============
-
-The cuVS software development kit provides APIs for C, C++, Python, and Rust languages. This guide outlines how to install the pre-compiled packages, build it from source, and use it in downstream applications.
-
-- `Installing pre-compiled packages`_
-
-  * `C, C++, and Python through Conda`_
-
-  * `Python through Pip`_
-
-  * `Tarball`_
-
-- `Build from source`_
-
-  * `Prerequisites`_
-
-  * `Create a build environment`_
-
-  * `C and C++ Libraries`_
-
-    * `Building the Googletests`_
-
-  * `Python Library`_
-
-  * `Rust Library`_
-
-  * `Using CMake Directly`_
-
-- `Build Documentation`_
-
-
-Installing Pre-compiled Packages
---------------------------------
-
-**Note:** The cuVS pre-compiled packages are available for **Linux** only (x86_64 and aarch64 architectures). Native Windows support is not available at this time. On Windows, use **WSL2** with GPU passthrough. See the `RAPIDS WSL2 guide <https://rapids.ai/start.html#wsl2>`_.
-
-C, C++, and Python through Conda
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The easiest way to install the pre-compiled C, C++, and Python packages is through conda. You can get a minimal conda installation with `miniforge <https://github.com/conda-forge/miniforge>`__.
-
-Use the following commands, depending on your CUDA version, to install cuVS packages (replace `rapidsai` with `rapidsai-nightly` to install more up-to-date but less stable nightly packages). `mamba` is preferred over the `conda` command and can be enabled using `this guide <https://conda.github.io/conda-libmamba-solver/user-guide/>`_.
-
-C/C++ Package
-~~~~~~~~~~~~~
-
-.. code-block:: bash
-
-   # CUDA 13
-   conda install -c rapidsai -c conda-forge libcuvs cuda-version=13.1
-
-   # CUDA 12
-   conda install -c rapidsai -c conda-forge libcuvs cuda-version=12.9
-
-Python Package
-~~~~~~~~~~~~~~
-
-.. code-block:: bash
-
-   # CUDA 13
-   conda install -c rapidsai -c conda-forge cuvs cuda-version=13.1
-
-   # CUDA 12
-   conda install -c rapidsai -c conda-forge cuvs cuda-version=12.9
-
-Python through Pip
-^^^^^^^^^^^^^^^^^^
-
-The cuVS Python package can also be `installed through pip <https://docs.rapids.ai/install#pip>`_.
-
-.. code-block:: bash
-
-    # CUDA 13
-    pip install cuvs-cu13 --extra-index-url=https://pypi.nvidia.com
-
-    # CUDA 12
-    pip install cuvs-cu12 --extra-index-url=https://pypi.nvidia.com
-
-Note: these packages statically link the C and C++ libraries so the `libcuvs` and `libcuvs_c` shared libraries won't be readily available to use in your code.
-
-Tarball
-^^^^^^^
-
-Install Dependencies
-~~~~~~~~~~~~~~~~~~~~
-
-1. `NCCL <https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html>`_
-2. `libopenmp`
-3. CUDA Toolkit Runtime 12.2+
-4. Ampere architecture or better (compute capability >= 8.0)
-
-Download & Extract
-~~~~~~~~~~~~~~~~~~
-
-Download the pre-built tarball for your CPU architecture and CUDA version from
-`https://developer.nvidia.com/cuvs-downloads <https://developer.nvidia.com/cuvs-downloads>`_
-
-Untar the tarball into a directory.
-
-.. code-block:: bash
-
-    tar -xzvf libcuvs-linux-sbsa-26.02.00.189485_cuda12-archive.tar.xz -C /path/to/folder
-
-
-Add cuVS to your system library load path. This should be done in the appropriate profile configuration (for e.g. `.bashrc`, `.bash_profile`) to maintain the setting across sessions.
-
-.. code-block:: bash
-
-    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/folder
-
-
-Build from source
------------------
-
-The core cuVS source code is written in C++ and wrapped through a C API. The C API is wrapped around the C++ APIs and the other supported languages are built around the C API.
-
-Prerequisites
-^^^^^^^^^^^^^
-
-- CMake 3.26.4+
-- GCC 9.3+ (11.4+ recommended)
-- CUDA Toolkit 12.2+
-- Ampere architecture or better (compute capability >= 8.0)
-
-Create a build environment
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Conda environment scripts are provided for installing the necessary dependencies to build cuVS from source. It is preferred to use `mamba`, as it provides significant speedup over `conda`:
-
-.. code-block:: bash
-
-    conda env create --name cuvs -f conda/environments/all_cuda-131_arch-$(uname -m).yaml
-    conda activate cuvs
-
-The recommended way to build and install cuVS from source is to use the `build.sh` script in the root of the repository. This script can build both the C++ and Python artifacts and provides CMake options for building and installing the headers, tests, benchmarks, and the pre-compiled shared library.
-
-
-C and C++ libraries
-^^^^^^^^^^^^^^^^^^^
-
-The C and C++ shared libraries are built together using the following arguments to `build.sh`:
-
-.. code-block:: bash
-
-    ./build.sh libcuvs
-
-In above example the `libcuvs.so` and `libcuvs_c.so` shared libraries are installed by default into `$INSTALL_PREFIX/lib`. To disable this, pass `-n` flag.
-
-Once installed, the shared libraries, headers (and any dependencies downloaded and installed via `rapids-cmake`) can be uninstalled using `build.sh`:
-
-.. code-block:: bash
-
-    ./build.sh libcuvs --uninstall
-
-
-Multi-GPU features
-^^^^^^^^^^^^^^^^^^
-
-To disable the multi-gpu features run :
-
-.. code-block:: bash
-
-    ./build.sh libcuvs --no-mg
-
-
-Building the Googletests
-~~~~~~~~~~~~~~~~~~~~~~~~
-
-Compile the C and C++ Googletests using the `tests` target in `build.sh`.
-
-.. code-block:: bash
-
-    ./build.sh libcuvs tests
-
-The tests will be written to the build directory, which is `cpp/build/` by default, and they will be named `*_TEST`.
-
-It can take some time to compile all of the tests. You can build individual tests by providing a semicolon-separated list to the `--limit-tests` option in `build.sh`. Make sure to pass the `-n` flag so the tests are not installed.
-
-.. code-block:: bash
-
-    ./build.sh libcuvs tests -n --limit-tests=NEIGHBORS_TEST;CAGRA_C_TEST
-
-Python library
-^^^^^^^^^^^^^^
-
-The Python library should be built and installed using the `build.sh` script:
-
-.. code-block:: bash
-
-    ./build.sh python
-
-The Python packages can also be uninstalled using the `build.sh` script:
-
-.. code-block:: bash
-
-    ./build.sh python --uninstall
-
-Go library
-^^^^^^^^^^
-
-After building the C and C++ libraries, the Golang library can be built with the following command:
-
-.. code-block:: bash
-
-    export CUDA_HOME="/usr/local/cuda" # or wherever your CUDA installation is.
-    export CGO_CFLAGS="-I${CONDA_PREFIX}/include -I${CUDA_HOME}/include"
-    export CGO_LDFLAGS="-L${CONDA_PREFIX}/lib -lcuvs -lcuvs_c"
-    export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"
-    export CC=clang
-
-    ./build.sh go
-
-Rust library
-^^^^^^^^^^^^
-
-The Rust bindings can be built with
-
-.. code-block:: bash
-
-    ./build.sh rust
-
-Using CMake directly
-^^^^^^^^^^^^^^^^^^^^
-
-When building cuVS from source, the `build.sh` script offers a nice wrapper around the `cmake` commands to ease the burdens of manually configuring the various available cmake options. When more fine-grained control over the CMake configuration is desired, the `cmake` command can be invoked directly as the below example demonstrates.
-
-The `CMAKE_INSTALL_PREFIX` installs cuVS into a specific location. The example below installs cuVS into the current Conda environment:
-
-.. code-block:: bash
-
-    cd cpp
-    mkdir build
-    cd build
-    cmake -D BUILD_TESTS=ON -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX ../
-    make -j<parallel_level> install
-
-cuVS has the following configurable cmake flags available:
-
-.. list-table:: CMake Flags
-
- * - Flag
-   - Possible Values
-   - Default Value
-   - Behavior
-
- * - BUILD_TESTS
-   - ON, OFF
-   - ON
-   - Compile Googletests
-
- * - CUDA_ENABLE_KERNELINFO
-   - ON, OFF
-   - OFF
-   - Enables `kernelinfo` in nvcc. This is useful for `compute-sanitizer`
-
- * - CUDA_ENABLE_LINEINFO
-   - ON, OFF
-   - OFF
-   - Enable the `-lineinfo` option for nvcc
-
- * - CUDA_STATIC_MATH_LIBRARIES
-   - ON, OFF
-   - OFF
-   - Statically link the CUDA math libraries
-
- * - DETECT_CONDA_ENV
-   - ON, OFF
-   - ON
-   - Enable detection of conda environment for dependencies
-
- * - CUVS_NVTX
-   - ON, OFF
-   - OFF
-   - Enable NVTX markers
-
-
-Build documentation
-^^^^^^^^^^^^^^^^^^^
-
-The documentation requires that the C, C++ and Python libraries have been built and installed. The following will build the docs along with the necessary libraries:
-
-.. code-block:: bash
-
-    ./build.sh libcuvs python docs
diff --git a/docs/source/c_api.md b/docs/source/c_api.md
new file mode 100644
index 0000000000..3f04f086d8
--- /dev/null
+++ b/docs/source/c_api.md
@@ -0,0 +1,14 @@
+# C API Documentation
+
+(api)=
+
+```{toctree}
+:maxdepth: 4
+
+c_api/core_c_api.md
+c_api/distance.md
+c_api/cluster.md
+c_api/neighbors.md
+c_api/preprocessing.md
+```
+
diff --git a/docs/source/c_api.rst b/docs/source/c_api.rst
deleted file mode 100644
index c65eee06ef..0000000000
--- a/docs/source/c_api.rst
+++ /dev/null
@@ -1,14 +0,0 @@
-~~~~~~~~~~~~~~~~~~~
-C API Documentation
-~~~~~~~~~~~~~~~~~~~
-
-.. _api:
-
-.. toctree::
-   :maxdepth: 4
-
-   c_api/core_c_api.rst
-   c_api/distance.rst
-   c_api/cluster.rst
-   c_api/neighbors.rst
-   c_api/preprocessing.rst
diff --git a/docs/source/c_api/cluster.md b/docs/source/c_api/cluster.md
new file mode 100644
index 0000000000..fa7589f143
--- /dev/null
+++ b/docs/source/c_api/cluster.md
@@ -0,0 +1,9 @@
+# Clustering
+
+```{toctree}
+:maxdepth: 2
+:caption: Contents:
+
+cluster_kmeans_c.md
+```
+
diff --git a/docs/source/c_api/cluster.rst b/docs/source/c_api/cluster.rst
deleted file mode 100644
index 34795e45bf..0000000000
--- a/docs/source/c_api/cluster.rst
+++ /dev/null
@@ -1,12 +0,0 @@
-Clustering
-==========
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Contents:
-
-   cluster_kmeans_c.rst
diff --git a/docs/source/c_api/cluster_kmeans_c.md b/docs/source/c_api/cluster_kmeans_c.md
new file mode 100644
index 0000000000..23cc8bde80
--- /dev/null
+++ b/docs/source/c_api/cluster_kmeans_c.md
@@ -0,0 +1,22 @@
+# K-Means
+
+## Parameters
+
+`#include <cuvs/cluster/kmeans.h>`
+
+```{doxygengroup} kmeans_c_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Functions
+
+`#include <cuvs/cluster/kmeans.h>`
+
+```{doxygengroup} kmeans_c
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/cluster_kmeans_c.rst b/docs/source/c_api/cluster_kmeans_c.rst
deleted file mode 100644
index b22003bc27..0000000000
--- a/docs/source/c_api/cluster_kmeans_c.rst
+++ /dev/null
@@ -1,27 +0,0 @@
-K-Means
-=======
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-Parameters
-----------
-
-``#include <cuvs/cluster/kmeans.h>``
-
-.. doxygengroup:: kmeans_c_params
-   :project: cuvs
-   :members:
-   :content-only:
-
-
-Functions
----------
-
-``#include <cuvs/cluster/kmeans.h>``
-
-.. doxygengroup:: kmeans_c
-   :project: cuvs
-   :members:
-   :content-only:
diff --git a/docs/source/c_api/core_c_api.md b/docs/source/c_api/core_c_api.md
new file mode 100644
index 0000000000..254f7b55be
--- /dev/null
+++ b/docs/source/c_api/core_c_api.md
@@ -0,0 +1,28 @@
+# Core Routines
+
+`#include <cuvs/core/c_api.h>`
+
+## Resources Handle
+
+```{doxygengroup} resources_c
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Error Handling
+
+```{doxygengroup} error_c
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Logging
+
+```{doxygengroup} log_c
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/core_c_api.rst b/docs/source/c_api/core_c_api.rst
deleted file mode 100644
index e228394733..0000000000
--- a/docs/source/c_api/core_c_api.rst
+++ /dev/null
@@ -1,32 +0,0 @@
-Core Routines
-=============
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-``#include <cuvs/core/c_api.h>``
-
-Resources Handle
-----------------
-
-.. doxygengroup:: resources_c
-    :project: cuvs
-    :members:
-    :content-only:
-
-Error Handling
---------------
-
-.. doxygengroup:: error_c
-    :project: cuvs
-    :members:
-    :content-only:
-
-Logging
--------
-
-.. doxygengroup:: log_c
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/c_api/distance.md b/docs/source/c_api/distance.md
new file mode 100644
index 0000000000..c7117e6343
--- /dev/null
+++ b/docs/source/c_api/distance.md
@@ -0,0 +1,20 @@
+# Distance
+
+## Distance types
+
+`#include <cuvs/distance/distance.h>`
+
+```{doxygenenum} cuvsDistanceType
+:project: cuvs
+```
+
+## Pairwise distance
+
+`#include <cuvs/distance/pairwise_distance.h>`
+
+```{doxygengroup} pairwise_distance_c
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/distance.rst b/docs/source/c_api/distance.rst
deleted file mode 100644
index 8635ddf8bc..0000000000
--- a/docs/source/c_api/distance.rst
+++ /dev/null
@@ -1,26 +0,0 @@
-Distance
-========
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-
-Distance types
---------------
-
-``#include <cuvs/distance/distance.h>``
-
-.. doxygenenum:: cuvsDistanceType
-   :project: cuvs
-
-
-Pairwise distance
------------------
-
-``#include <cuvs/distance/pairwise_distance.h>``
-
-.. doxygengroup:: pairwise_distance_c
-   :project: cuvs
-   :members:
-   :content-only:
diff --git a/docs/source/c_api/neighbors.md b/docs/source/c_api/neighbors.md
new file mode 100644
index 0000000000..a9b8883281
--- /dev/null
+++ b/docs/source/c_api/neighbors.md
@@ -0,0 +1,16 @@
+# Nearest Neighbors
+
+```{toctree}
+:maxdepth: 2
+:caption: Contents:
+
+neighbors_all_neighbors_c.md
+neighbors_bruteforce_c.md
+neighbors_cagra_c.md
+neighbors_hnsw_c.md
+neighbors_ivf_flat_c.md
+neighbors_ivf_pq_c.md
+neighbors_mg.md
+neighbors_vamana_c.md
+```
+
diff --git a/docs/source/c_api/neighbors.rst b/docs/source/c_api/neighbors.rst
deleted file mode 100644
index 305364bb2a..0000000000
--- a/docs/source/c_api/neighbors.rst
+++ /dev/null
@@ -1,19 +0,0 @@
-Nearest Neighbors
-=================
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Contents:
-
-   neighbors_all_neighbors_c.rst
-   neighbors_bruteforce_c.rst
-   neighbors_cagra_c.rst
-   neighbors_hnsw_c.rst
-   neighbors_ivf_flat_c.rst
-   neighbors_ivf_pq_c.rst
-   neighbors_mg.rst
-   neighbors_vamana_c.rst
diff --git a/docs/source/c_api/neighbors_all_neighbors_c.md b/docs/source/c_api/neighbors_all_neighbors_c.md
new file mode 100644
index 0000000000..ffee961db7
--- /dev/null
+++ b/docs/source/c_api/neighbors_all_neighbors_c.md
@@ -0,0 +1,22 @@
+# All-Neighbors
+
+The all-neighbors method constructs a k-NN graph for all vectors in a dataset. It supports multiple algorithms including brute force, IVF-PQ (approximate), and NN-Descent (approximate) for building local k-NN subgraphs. The API automatically detects whether the dataset is host-resident or device-resident and applies appropriate optimizations.
+
+`#include <cuvs/neighbors/all_neighbors.h>`
+
+## Build parameters
+
+```{doxygengroup} all_neighbors_c_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Build
+
+```{doxygengroup} all_neighbors_c_build
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/neighbors_all_neighbors_c.rst b/docs/source/c_api/neighbors_all_neighbors_c.rst
deleted file mode 100644
index 7c6559979e..0000000000
--- a/docs/source/c_api/neighbors_all_neighbors_c.rst
+++ /dev/null
@@ -1,26 +0,0 @@
-All-Neighbors
-=============
-
-The all-neighbors method constructs a k-NN graph for all vectors in a dataset. It supports multiple algorithms including brute force, IVF-PQ (approximate), and NN-Descent (approximate) for building local k-NN subgraphs. The API automatically detects whether the dataset is host-resident or device-resident and applies appropriate optimizations.
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-``#include <cuvs/neighbors/all_neighbors.h>``
-
-Build parameters
-----------------
-
-.. doxygengroup:: all_neighbors_c_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Build
------
-
-.. doxygengroup:: all_neighbors_c_build
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/c_api/neighbors_bruteforce_c.md b/docs/source/c_api/neighbors_bruteforce_c.md
new file mode 100644
index 0000000000..49610d9124
--- /dev/null
+++ b/docs/source/c_api/neighbors_bruteforce_c.md
@@ -0,0 +1,38 @@
+# Bruteforce
+
+The brute-force method runs an exact k-nearest neighbors (k-NN) search. It performs an exhaustive search and, in contrast to ANN methods, produces exact results.
+
+`#include <cuvs/neighbors/bruteforce.h>`
+
+## Index
+
+```{doxygengroup} bruteforce_c_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} bruteforce_c_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} bruteforce_c_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} bruteforce_c_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/neighbors_bruteforce_c.rst b/docs/source/c_api/neighbors_bruteforce_c.rst
deleted file mode 100644
index 36ba96f424..0000000000
--- a/docs/source/c_api/neighbors_bruteforce_c.rst
+++ /dev/null
@@ -1,42 +0,0 @@
-Bruteforce
-==========
-
-The bruteforce method is running the KNN algorithm. It performs an extensive search, and in contrast to ANN methods produces an exact result.
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-``#include <cuvs/neighbors/bruteforce.h>``
-
-Index
------
-
-.. doxygengroup:: bruteforce_c_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: bruteforce_c_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: bruteforce_c_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: bruteforce_c_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/c_api/neighbors_cagra_c.md b/docs/source/c_api/neighbors_cagra_c.md
new file mode 100644
index 0000000000..7cffb146b1
--- /dev/null
+++ b/docs/source/c_api/neighbors_cagra_c.md
@@ -0,0 +1,63 @@
+# CAGRA
+
+CAGRA is a graph-based nearest neighbors algorithm that was built from the ground up for GPU acceleration. CAGRA demonstrates state-of-the-art index build and query performance for both small- and large-batch searches.
+
+
+`#include <cuvs/neighbors/cagra.h>`
+
+## Index build parameters
+
+```{doxygengroup} cagra_c_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search parameters
+
+```{doxygengroup} cagra_c_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} cagra_c_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} cagra_c_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} cagra_c_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index merge
+
+```{doxygengroup} cagra_c_index_merge
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} cagra_c_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/neighbors_cagra_c.rst b/docs/source/c_api/neighbors_cagra_c.rst
deleted file mode 100644
index 9d9f1b7ea9..0000000000
--- a/docs/source/c_api/neighbors_cagra_c.rst
+++ /dev/null
@@ -1,67 +0,0 @@
-CAGRA
-=====
-
-CAGRA is a graph-based nearest neighbors algorithm that was built from the ground up for GPU acceleration. CAGRA demonstrates state-of-the art index build and query performance for both small- and large-batch sized search.
-
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-``#include <cuvs/neighbors/cagra.h>``
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: cagra_c_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search parameters
------------------------
-
-.. doxygengroup:: cagra_c_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: cagra_c_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: cagra_c_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: cagra_c_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index merge
------------
-
-.. doxygengroup:: cagra_c_index_merge
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: cagra_c_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/c_api/neighbors_hnsw_c.md b/docs/source/c_api/neighbors_hnsw_c.md
new file mode 100644
index 0000000000..7d1ca61428
--- /dev/null
+++ b/docs/source/c_api/neighbors_hnsw_c.md
@@ -0,0 +1,61 @@
+# HNSW
+
+This is a wrapper for hnswlib that loads a CAGRA index as an immutable HNSW index. The loaded HNSW index is compatible only with cuVS and can be searched using the wrapper functions.
+
+
+`#include <cuvs/neighbors/hnsw.h>`
+
+## Index search parameters
+
+```{doxygengroup} hnsw_c_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} hnsw_c_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index extend parameters
+
+```{doxygengroup} hnsw_c_extend_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index extend
+```{doxygengroup} hnsw_c_index_extend
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index load
+```{doxygengroup} hnsw_c_index_load
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} hnsw_c_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} hnsw_c_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/neighbors_hnsw_c.rst b/docs/source/c_api/neighbors_hnsw_c.rst
deleted file mode 100644
index 3f10eea33b..0000000000
--- a/docs/source/c_api/neighbors_hnsw_c.rst
+++ /dev/null
@@ -1,65 +0,0 @@
-HNSW
-====
-
-This is a wrapper for hnswlib, to load a CAGRA index as an immutable HNSW index. The loaded HNSW index is only compatible in cuVS, and can be searched using wrapper functions.
-
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-``#include <cuvs/neighbors/hnsw.h>``
-
-Index search parameters
------------------------
-
-.. doxygengroup:: hnsw_c_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: hnsw_c_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index extend parameters
------------------------
-
-.. doxygengroup:: hnsw_c_extend_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index extend
-------------
-.. doxygengroup:: hnsw_c_index_extend
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index load
-----------
-.. doxygengroup:: hnsw_c_index_load
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: hnsw_c_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: hnsw_c_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/c_api/neighbors_ivf_flat_c.md b/docs/source/c_api/neighbors_ivf_flat_c.md
new file mode 100644
index 0000000000..7928619ac6
--- /dev/null
+++ b/docs/source/c_api/neighbors_ivf_flat_c.md
@@ -0,0 +1,54 @@
+# IVF-Flat
+
+The IVF-Flat method is an ANN algorithm. It uses an inverted file index (IVF) with unmodified (that is, flat) vectors. This algorithm provides simple knobs to reduce the overall search space and to trade off accuracy for speed.
+
+`#include <cuvs/neighbors/ivf_flat.h>`
+
+## Index build parameters
+
+```{doxygengroup} ivf_flat_c_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search parameters
+
+```{doxygengroup} ivf_flat_c_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} ivf_flat_c_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} ivf_flat_c_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} ivf_flat_c_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} ivf_flat_c_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/neighbors_ivf_flat_c.rst b/docs/source/c_api/neighbors_ivf_flat_c.rst
deleted file mode 100644
index a37b153bed..0000000000
--- a/docs/source/c_api/neighbors_ivf_flat_c.rst
+++ /dev/null
@@ -1,58 +0,0 @@
-IVF-Flat
-========
-
-The IVF-Flat method is an ANN algorithm. It uses an inverted file index (IVF) with unmodified (that is, flat) vectors. This algorithm provides simple knobs to reduce the overall search space and to trade-off accuracy for speed.
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-``#include <cuvs/neighbors/ivf_flat.h>``
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: ivf_flat_c_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search parameters
------------------------
-
-.. doxygengroup:: ivf_flat_c_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: ivf_flat_c_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: ivf_flat_c_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: ivf_flat_c_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: ivf_flat_c_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/c_api/neighbors_ivf_pq_c.md b/docs/source/c_api/neighbors_ivf_pq_c.md
new file mode 100644
index 0000000000..1bd9be90d0
--- /dev/null
+++ b/docs/source/c_api/neighbors_ivf_pq_c.md
@@ -0,0 +1,54 @@
+# IVF-PQ
+
+The IVF-PQ method is an ANN algorithm. Like IVF-Flat, IVF-PQ splits the points into a number of clusters (specified by the `n_lists` parameter) and searches only the closest clusters (their number specified by the `n_probes` parameter) to compute the nearest neighbors, but it also shrinks the vectors using a technique called product quantization.
+
+`#include <cuvs/neighbors/ivf_pq.h>`
+
+## Index build parameters
+
+```{doxygengroup} ivf_pq_c_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search parameters
+
+```{doxygengroup} ivf_pq_c_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} ivf_pq_c_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} ivf_pq_c_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} ivf_pq_c_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} ivf_pq_c_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/neighbors_ivf_pq_c.rst b/docs/source/c_api/neighbors_ivf_pq_c.rst
deleted file mode 100644
index ae985870b4..0000000000
--- a/docs/source/c_api/neighbors_ivf_pq_c.rst
+++ /dev/null
@@ -1,58 +0,0 @@
-IVF-PQ
-======
-
-The IVF-PQ method is an ANN algorithm. Like IVF-Flat, IVF-PQ splits the points into a number of clusters (also specified by a parameter called n_lists) and searches the closest clusters to compute the nearest neighbors (also specified by a parameter called n_probes), but it shrinks the sizes of the vectors using a technique called product quantization.
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-``#include <cuvs/neighbors/ivf_pq.h>``
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: ivf_pq_c_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search parameters
------------------------
-
-.. doxygengroup:: ivf_pq_c_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: ivf_pq_c_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: ivf_pq_c_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: ivf_pq_c_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: ivf_pq_c_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/c_api/neighbors_mg.md b/docs/source/c_api/neighbors_mg.md
new file mode 100644
index 0000000000..07a2c304f1
--- /dev/null
+++ b/docs/source/c_api/neighbors_mg.md
@@ -0,0 +1,250 @@
+# Multi-GPU Nearest Neighbors
+
+The multi-GPU (SNMG, single-node multi-GPU) C API provides a set of functions to deploy ANN indexes across multiple GPUs for improved performance and scalability.
+
+# Common Types and Enums
+
+Common types and enums used across multi-GPU ANN algorithms.
+
+`#include <cuvs/neighbors/mg_common.h>`
+
+```{doxygengroup} mg_c_common_types
+:project: cuvs
+:members:
+:content-only:
+```
+
+# Multi-GPU IVF-Flat
+
+The Multi-GPU IVF-Flat method extends the IVF-Flat ANN algorithm to work across multiple GPUs. It provides two distribution modes: replicated (for higher throughput) and sharded (for handling larger datasets).
+
+`#include <cuvs/neighbors/mg_ivf_flat.h>`
+
+## IVF-Flat Index Build Parameters
+
+```{doxygengroup} mg_ivf_flat_c_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-Flat Index Search Parameters
+
+```{doxygengroup} mg_ivf_flat_c_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-Flat Index
+
+```{doxygengroup} mg_ivf_flat_c_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-Flat Index Build
+
+```{doxygengroup} mg_ivf_flat_c_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-Flat Index Search
+
+```{doxygengroup} mg_ivf_flat_c_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-Flat Index Extend
+
+```{doxygengroup} mg_ivf_flat_c_index_extend
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-Flat Index Serialize
+
+```{doxygengroup} mg_ivf_flat_c_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-Flat Index Deserialize
+
+```{doxygengroup} mg_ivf_flat_c_index_deserialize
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-Flat Index Distribute
+
+```{doxygengroup} mg_ivf_flat_c_index_distribute
+:project: cuvs
+:members:
+:content-only:
+```
+
+# Multi-GPU IVF-PQ
+
+The Multi-GPU IVF-PQ method extends the IVF-PQ ANN algorithm to work across multiple GPUs. It provides two distribution modes: replicated (for higher throughput) and sharded (for handling larger datasets).
+
+`#include <cuvs/neighbors/mg_ivf_pq.h>`
+
+## IVF-PQ Index Build Parameters
+
+```{doxygengroup} mg_ivf_pq_c_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-PQ Index Search Parameters
+
+```{doxygengroup} mg_ivf_pq_c_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-PQ Index
+
+```{doxygengroup} mg_ivf_pq_c_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-PQ Index Build
+
+```{doxygengroup} mg_ivf_pq_c_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-PQ Index Search
+
+```{doxygengroup} mg_ivf_pq_c_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-PQ Index Extend
+
+```{doxygengroup} mg_ivf_pq_c_index_extend
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-PQ Index Serialize
+
+```{doxygengroup} mg_ivf_pq_c_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-PQ Index Deserialize
+
+```{doxygengroup} mg_ivf_pq_c_index_deserialize
+:project: cuvs
+:members:
+:content-only:
+```
+
+## IVF-PQ Index Distribute
+
+```{doxygengroup} mg_ivf_pq_c_index_distribute
+:project: cuvs
+:members:
+:content-only:
+```
+
+# Multi-GPU CAGRA
+
+The Multi-GPU CAGRA method extends the CAGRA graph-based ANN algorithm to work across multiple GPUs. It provides two distribution modes: replicated (for higher throughput) and sharded (for handling larger datasets).
+
+`#include <cuvs/neighbors/mg_cagra.h>`
+
+## CAGRA Index Build Parameters
+
+```{doxygengroup} mg_cagra_c_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## CAGRA Index Search Parameters
+
+```{doxygengroup} mg_cagra_c_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## CAGRA Index
+
+```{doxygengroup} mg_cagra_c_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## CAGRA Index Build
+
+```{doxygengroup} mg_cagra_c_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## CAGRA Index Search
+
+```{doxygengroup} mg_cagra_c_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## CAGRA Index Extend
+
+```{doxygengroup} mg_cagra_c_index_extend
+:project: cuvs
+:members:
+:content-only:
+```
+
+## CAGRA Index Serialize
+
+```{doxygengroup} mg_cagra_c_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
+## CAGRA Index Deserialize
+
+```{doxygengroup} mg_cagra_c_index_deserialize
+:project: cuvs
+:members:
+:content-only:
+```
+
+## CAGRA Index Distribute
+
+```{doxygengroup} mg_cagra_c_index_distribute
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/neighbors_mg.rst b/docs/source/c_api/neighbors_mg.rst
deleted file mode 100644
index bffe3fc4c5..0000000000
--- a/docs/source/c_api/neighbors_mg.rst
+++ /dev/null
@@ -1,257 +0,0 @@
-Multi-GPU Nearest Neighbors
-===========================
-
-The Multi-GPU (SNMG - single-node multi-GPUs) C API provides a set of functions to deploy ANN indexes across multiple GPUs for improved performance and scalability.
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-Common Types and Enums
-======================
-
-Common types and enums used across multi-GPU ANN algorithms.
-
-``#include <cuvs/neighbors/mg_common.h>``
-
-.. doxygengroup:: mg_c_common_types
-    :project: cuvs
-    :members:
-    :content-only:
-
-Multi-GPU IVF-Flat
-==================
-
-The Multi-GPU IVF-Flat method extends the IVF-Flat ANN algorithm to work across multiple GPUs. It provides two distribution modes: replicated (for higher throughput) and sharded (for handling larger datasets).
-
-``#include <cuvs/neighbors/mg_ivf_flat.h>``
-
-IVF-Flat Index Build Parameters
--------------------------------
-
-.. doxygengroup:: mg_ivf_flat_c_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-Flat Index Search Parameters
---------------------------------
-
-.. doxygengroup:: mg_ivf_flat_c_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-Flat Index
---------------
-
-.. doxygengroup:: mg_ivf_flat_c_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-Flat Index Build
---------------------
-
-.. doxygengroup:: mg_ivf_flat_c_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-Flat Index Search
----------------------
-
-.. doxygengroup:: mg_ivf_flat_c_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-Flat Index Extend
----------------------
-
-.. doxygengroup:: mg_ivf_flat_c_index_extend
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-Flat Index Serialize
-------------------------
-
-.. doxygengroup:: mg_ivf_flat_c_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-Flat Index Deserialize
----------------------------
-
-.. doxygengroup:: mg_ivf_flat_c_index_deserialize
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-Flat Index Distribute
---------------------------
-
-.. doxygengroup:: mg_ivf_flat_c_index_distribute
-    :project: cuvs
-    :members:
-    :content-only:
-
-Multi-GPU IVF-PQ
-=================
-
-The Multi-GPU IVF-PQ method extends the IVF-PQ ANN algorithm to work across multiple GPUs. It provides two distribution modes: replicated (for higher throughput) and sharded (for handling larger datasets).
-
-``#include <cuvs/neighbors/mg_ivf_pq.h>``
-
-IVF-PQ Index Build Parameters
------------------------------
-
-.. doxygengroup:: mg_ivf_pq_c_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-PQ Index Search Parameters
-------------------------------
-
-.. doxygengroup:: mg_ivf_pq_c_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-PQ Index
-------------
-
-.. doxygengroup:: mg_ivf_pq_c_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-PQ Index Build
-------------------
-
-.. doxygengroup:: mg_ivf_pq_c_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-PQ Index Search
--------------------
-
-.. doxygengroup:: mg_ivf_pq_c_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-PQ Index Extend
--------------------
-
-.. doxygengroup:: mg_ivf_pq_c_index_extend
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-PQ Index Serialize
-----------------------
-
-.. doxygengroup:: mg_ivf_pq_c_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-PQ Index Deserialize
-------------------------
-
-.. doxygengroup:: mg_ivf_pq_c_index_deserialize
-    :project: cuvs
-    :members:
-    :content-only:
-
-IVF-PQ Index Distribute
------------------------
-
-.. doxygengroup:: mg_ivf_pq_c_index_distribute
-    :project: cuvs
-    :members:
-    :content-only:
-
-Multi-GPU CAGRA
-================
-
-The Multi-GPU CAGRA method extends the CAGRA graph-based ANN algorithm to work across multiple GPUs. It provides two distribution modes: replicated (for higher throughput) and sharded (for handling larger datasets).
-
-``#include <cuvs/neighbors/mg_cagra.h>``
-
-CAGRA Index Build Parameters
-----------------------------
-
-.. doxygengroup:: mg_cagra_c_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-CAGRA Index Search Parameters
------------------------------
-
-.. doxygengroup:: mg_cagra_c_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-CAGRA Index
------------
-
-.. doxygengroup:: mg_cagra_c_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-CAGRA Index Build
------------------
-
-.. doxygengroup:: mg_cagra_c_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-CAGRA Index Search
-------------------
-
-.. doxygengroup:: mg_cagra_c_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-CAGRA Index Extend
-------------------
-
-.. doxygengroup:: mg_cagra_c_index_extend
-    :project: cuvs
-    :members:
-    :content-only:
-
-CAGRA Index Serialize
----------------------
-
-.. doxygengroup:: mg_cagra_c_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
-
-CAGRA Index Deserialize
------------------------
-
-.. doxygengroup:: mg_cagra_c_index_deserialize
-    :project: cuvs
-    :members:
-    :content-only:
-
-CAGRA Index Distribute
-----------------------
-
-.. doxygengroup:: mg_cagra_c_index_distribute
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/c_api/neighbors_vamana_c.md b/docs/source/c_api/neighbors_vamana_c.md
new file mode 100644
index 0000000000..9f7e727dc0
--- /dev/null
+++ b/docs/source/c_api/neighbors_vamana_c.md
@@ -0,0 +1,39 @@
+# Vamana
+
+Vamana is the graph construction algorithm behind the well-known DiskANN vector search solution. The cuVS implementation of Vamana/DiskANN is a custom GPU-accelerated version of the algorithm that aims to reduce index construction time using NVIDIA GPUs.
+
+
+`#include <cuvs/neighbors/vamana.h>`
+
+## Index build parameters
+
+```{doxygengroup} vamana_c_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} vamana_c_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} vamana_c_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} vamana_c_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/neighbors_vamana_c.rst b/docs/source/c_api/neighbors_vamana_c.rst
deleted file mode 100644
index 90e47f1f6e..0000000000
--- a/docs/source/c_api/neighbors_vamana_c.rst
+++ /dev/null
@@ -1,43 +0,0 @@
-Vamana
-======
-
-Vamana is the graph construction algorithm behind the well-known DiskANN vector search solution. The cuVS implementation of Vamana/DiskANN is a custom GPU-acceleration version of the algorithm that aims to reduce index construction time using NVIDIA GPUs.
-
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-``#include <cuvs/neighbors/vamana.h>``
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: vamana_c_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: vamana_c_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: vamana_c_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: vamana_c_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/c_api/preprocessing.md b/docs/source/c_api/preprocessing.md
new file mode 100644
index 0000000000..eaf78f10ce
--- /dev/null
+++ b/docs/source/c_api/preprocessing.md
@@ -0,0 +1,34 @@
+# Preprocessing
+
+## Binary Quantizer
+
+```{doxygengroup} preprocessing_c_binary
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Product Quantizer
+
+```{doxygengroup} preprocessing_c_pq
+:project: cuvs
+:members:
+:content-only:
+```
+
+## PCA (Principal Component Analysis)
+
+```{doxygengroup} preprocessing_c_pca
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Scalar Quantizer
+
+```{doxygengroup} preprocessing_c_scalar
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/c_api/preprocessing.rst b/docs/source/c_api/preprocessing.rst
deleted file mode 100644
index 1c65455de0..0000000000
--- a/docs/source/c_api/preprocessing.rst
+++ /dev/null
@@ -1,38 +0,0 @@
-Preprocessing
-=============
-
-.. role:: py(code)
-   :language: c
-   :class: highlight
-
-Binary Quantizer
-----------------
-
-.. doxygengroup:: preprocessing_c_binary
-    :project: cuvs
-    :members:
-    :content-only:
-
-Product Quantizer
------------------
-
-.. doxygengroup:: preprocessing_c_pq
-    :project: cuvs
-    :members:
-    :content-only:
-
-PCA (Principal Component Analysis)
------------------------------------
-
-.. doxygengroup:: preprocessing_c_pca
-    :project: cuvs
-    :members:
-    :content-only:
-
-Scalar Quantizer
-----------------
-
-.. doxygengroup:: preprocessing_c_scalar
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/choosing_and_configuring_indexes.rst b/docs/source/choosing_and_configuring_indexes.md
similarity index 73%
rename from docs/source/choosing_and_configuring_indexes.rst
rename to docs/source/choosing_and_configuring_indexes.md
index b4c140f295..efb34a8b0d 100644
--- a/docs/source/choosing_and_configuring_indexes.rst
+++ b/docs/source/choosing_and_configuring_indexes.md
@@ -1,98 +1,89 @@
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Primer on vector search indexes
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Primer on vector search indexes
 
 Vector search indexes often use approximations to trade-off accuracy of the results for speed, either through lowering latency (end-to-end single query speed) or by increasing throughput (the number of query vectors that can be satisfied in a short period of time). Vector search indexes, especially ones that use approximations, are very closely related to machine learning models but they are optimized for fast search and accuracy of results.
 
 When the number of vectors is very small, such as less than 100 thousand vectors, it could be fast enough to use a brute-force (also known as a flat index), which returns exact results but at the expense of exhaustively searching all possible neighbors
 
-Objectives
-==========
+## Objectives
 
 This primer addresses the challenge of configuring vector search indexes, but its primary goal is to get a user up and running quickly with acceptable enough results for a good choice of index type and a small and manageable tuning knob, rather than providing a comprehensive guide to tuning each and every hyper-parameter.
 
 For this reason, we focus on 4 primary data sizes:
 
-#. Tiny datasets where GPU is likely not needed (< 100 thousand vectors)
-#. Small datasets where GPU might not be needed (< 1 million vectors)
-#. Large datasets (> 1 million vectors), goal is fast index creation at the expense of search quality
-#. Large datasets where high quality is preferred at the expense of fast index creation
+1. Tiny datasets where GPU is likely not needed (< 100 thousand vectors)
+1. Small datasets where GPU might not be needed (< 1 million vectors)
+1. Large datasets (> 1 million vectors), goal is fast index creation at the expense of search quality
+1. Large datasets where high quality is preferred at the expense of fast index creation
 
 Like other machine learning algorithms, vector search indexes generally have a training step – which means building the index – and an inference – or search step. The hyper-parameters also tend to be broken down into build and search parameters.
 
 While not always the case, a general trend is often observed where the search speed decreases as the quality increases. This also tends to be the case with the index build performance, though different algorithms have different relationships between build time, quality, and search time. It’s important to understand that there’s no free lunch so there will always be trade-offs for each index type.
 
-Definition of quality
-=====================
+## Definition of quality
 
 What do we mean when we say quality of an index? In machine learning terminology, we measure this using recall, which is sometimes used interchangeably to mean accuracy, even though the two are slightly different measures. Recall, when used in vector search, essentially means “out of all of my results, which results would have been included in the exact results?” In vector search, the objective is to find some number of vectors that are closest to a given query vector so recall tends to be more relaxed than accuracy, discriminating only on set inclusion, rather than on exact ordered list matching, which would be closer to an accuracy measure.
 
-Choosing vector search indexes
-==============================
+## Choosing vector search indexes
 
 Many vector search algorithms improve scalability while reducing the number of distances by partitioning the vector space into smaller pieces, often through the use of clustering, hashing, trees, and other techniques. Another popular technique is to reduce the width or dimensionality of the space in order to decrease the cost of computing each distance.
 
-Tiny datasets (< 100 thousand vectors)
---------------------------------------
+### Tiny datasets (< 100 thousand vectors)
 
 These datasets are very small and it’s questionable whether or not the GPU would provide any value at all. If the dimensionality is also relatively small (< 1024), you could just use brute-force or HNSW on the CPU and get great performance. If the dimensionality is relatively large (1536, 2048, 4096), you should consider using HNSW. If build time performance is critical, you should consider using CAGRA to build the graph and convert it to an HNSW graph for search (this capability exists today in the standalone cuVS/RAFT libraries and will soon be added to Milvus). An IVF flat index can also be a great candidate here, as it can improve the search performance over brute-force by partitioning the vector space and thus reducing the search space.
 
-Small datasets where GPU might not be needed (< 1 million vectors)
-------------------------------------------------------------------
+### Small datasets where GPU might not be needed (< 1 million vectors)
 
 For smaller dimensionality, such as 1024 or below, you could consider using a brute-force (aka flat) index on GPU and get very good search performance with exact results. You could also use a graph-based index like HNSW on the CPU or CAGRA on the GPU. If build time is critical, you could even build a CAGRA graph on the GPU and convert it to HNSW graph on the CPU.
 
 For larger dimensionality (1536, 2048, 4096), you will start to see lower build-time performance with HNSW for higher quality search settings, and so it becomes more clear that building a CAGRA graph can be useful instead.
 
-Large datasets (> 1 million vectors), goal is fast index creation at the expense of search quality
---------------------------------------------------------------------------------------------------
+### Large datasets (> 1 million vectors), goal is fast index creation at the expense of search quality
 
 For fast ingest where slightly lower search quality is acceptable (85% recall and above), the IVF (inverted file index) methods can be very useful, as they can be very fast to build and still have acceptable search performance. IVF-flat index will partition the vectors into some number of clusters (specified by the user as n_lists) and at search time, some number of closest clusters (defined by n_probes) will be searched with brute-force for each query vector.
 
 IVF-PQ is similar to IVF-flat with the major difference that the vectors are compressed using a lossy product quantized compression so the index can have a much smaller footprint on the GPU. In general, it’s advised to set n_lists = sqrt(n_vectors) and set n_probes to some percentage of n_lists (e.g. 1%, 2%, 4%, 8%, 16%). Because IVF-PQ is a lossy compression, a refinement step can be performed by initially increasing the number of neighbors (by some multiple factor) and using the raw vectors to compute the exact distances, ultimately reducing the neighborhoods down to size k. Even a refinement of 2x (which would query initially for k*2) can be quite effective in making up for recall lost by the PQ compression, but it does come at the expense of having to keep the raw vectors around (keeping in mind many databases store the raw vectors anyways).
 
-Large datasets (> 1 million vectors), goal is high quality search at the expense of fast index creation
--------------------------------------------------------------------------------------------------------
+### Large datasets (> 1 million vectors), goal is high quality search at the expense of fast index creation
 
 By trading off index creation performance, an extremely high quality search model can be built. Generally, all of the vector search index types have hyperparameters that have a direct correlation with the search accuracy and so they can be cranked up to yield better recall. Unfortunately, this can also significantly increase the index build time and reduce the search throughput. The trick here is to find the fastest build time that can achieve the best recall with the lowest latency or highest throughput possible.
 
 As for suggested index types, graph-based algorithms like HNSW and CAGRA tend to scale very well to larger datasets while having superior search performance with respect to quality. The challenge is that graph-based indexes require learning a graph and so, as the subtitle of this section suggests, have a tendency to be slower to build than other options. Using the CAGRA algorithm on the GPU can reduce the build time significantly over HNSW, while also having a superior throughput (and lower latency) than searching on the CPU. Currently, the downside to using CAGRA on the GPU is that it requires both the graph and the raw vectors to fit into GPU memory. A middle-ground can be reached by building a CAGRA graph on the GPU and converting it to an HNSW for high quality (and moderately fast) search on the CPU.
 
 
-Tuning and hyperparameter optimization
-======================================
+## Tuning and hyperparameter optimization
 
 Unfortunately, for large datasets, doing a hyper-parameter optimization on the whole dataset is not always feasible. It is possible, however, to perform a hyper-parameter optimization on the smaller subsets and find reasonably acceptable parameters that should generalize fairly well to the entire dataset. Generally this hyper-parameter optimization will require computing a ground truth on the subset with an exact method like brute-force and then using it to evaluate several searches on randomly sampled vectors.
 
 Full hyper-parameter optimization may also not always be necessary- for example, once you have built a ground truth dataset on a subset, many times you can start by building an index with the default build parameters and then playing around with different search parameters until you get the desired quality and search performance.  For massive indexes that might be multiple terabytes, you could also take this subsampling of, say, 10M vectors, train an index and then tune the search parameters from there. While there might be a small margin of error, the chosen build/search parameters should generalize fairly well for the databases that build locally partitioned indexes.
 
 
-Summary of vector search index types
-====================================
-
-.. list-table::
-   :widths: 25 25 50
-   :header-rows: 1
-
-   * - Name
-     - Trade-offs
-     - Best to use with...
-   * - Brute-force (aka flat)
-     - Exact search but requires exhaustive distance computations
-     - Tiny datasets (< 100k vectors)
-   * - IVF-Flat
-     - Partitions the vector space to reduce distance computations for brute-force search at the expense of recall
-     - Small datasets (<1M vectors) or larger datasets (>1M vectors) where fast index build time is prioritized over quality.
-   * - IVF-PQ
-     - Adds product quantization to IVF-Flat to achieve scale at the expense of recall
-     - Large datasets (>>1M vectors) where fast index build is prioritized over quality
-   * - HNSW
-     - Significantly reduces distance computations at the expense of longer build times
-     - Small datasets (<1M vectors) or large datasets (>1M vectors) where quality and speed of search are prioritized over index build times
-   * - CAGRA
-     - Significantly reduces distance computations at the expense of longer build times (though build times improve over HNSW)
-     - Large datasets (>>1M vectors) where quality and speed of search are prioritized over index build times but index build times are still important.
-   * - CAGRA build +HNSW search
-     - (coming soon to Milvus)
-     - Significantly reduces distance computations and improves build times at the expense of higher search latency / lower throughput.
-       Large datasets (>>1M vectors) where index build times and quality of search is important but GPU resources are limited and latency of search is not.
+## Summary of vector search index types
+
+```{list-table}
+:widths: 25 25 50
+:header-rows: 1
+
+* - Name
+  - Trade-offs
+  - Best to use with...
+* - Brute-force (aka flat)
+  - Exact search but requires exhaustive distance computations
+  - Tiny datasets (< 100k vectors)
+* - IVF-Flat
+  - Partitions the vector space to reduce distance computations for brute-force search at the expense of recall
+  - Small datasets (<1M vectors) or larger datasets (>1M vectors) where fast index build time is prioritized over quality.
+* - IVF-PQ
+  - Adds product quantization to IVF-Flat to achieve scale at the expense of recall
+  - Large datasets (>>1M vectors) where fast index build is prioritized over quality
+* - HNSW
+  - Significantly reduces distance computations at the expense of longer build times
+  - Small datasets (<1M vectors) or large datasets (>1M vectors) where quality and speed of search are prioritized over index build times
+* - CAGRA
+  - Significantly reduces distance computations at the expense of longer build times (though build times improve over HNSW)
+  - Large datasets (>>1M vectors) where quality and speed of search are prioritized over index build times but index build times are still important.
+* - CAGRA build + HNSW search
+    (coming soon to Milvus)
+  - Significantly reduces distance computations and improves build times at the expense of higher search latency / lower throughput.
+  - Large datasets (>>1M vectors) where index build times and quality of search are important but GPU resources are limited and search latency is not critical.
+```
+
diff --git a/docs/source/comparing_indexes.rst b/docs/source/comparing_indexes.md
similarity index 86%
rename from docs/source/comparing_indexes.rst
rename to docs/source/comparing_indexes.md
index 167aa2e072..3492fdc296 100644
--- a/docs/source/comparing_indexes.rst
+++ b/docs/source/comparing_indexes.md
@@ -1,28 +1,24 @@
-.. _comparing_indexes:
+(comparing_indexes)=
 
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Comparing performance of vector indexes
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Comparing performance of vector indexes
 
-This document provides a brief overview methodology for comparing vector search indexes and models. For guidance on how to choose and configure an index type, please refer to :doc:`this <vector_databases_vs_vector_search>` guide.
+This document provides a brief overview of a methodology for comparing vector search indexes and models. For guidance on how to choose and configure an index type, please refer to {doc}`this <vector_databases_vs_vector_search>` guide.
 
 Unlike traditional database indexes, which will generally return correct results even without performance tuning, vector search indexes are more closely related to ML models and they can return absolutely garbage results if they have not been tuned.
 
 For this reason, it’s important to consider the parameters that an index is built upon, both for its potential quality and throughput/latency, when comparing two trained indexes. While easier to build an index on its default parameters than having to tune them, a well tuned index can have a significantly better search quality AND perform within search perf constraints like maximal throughput and minimal latency.
 
 
-What is recall?
-===============
+## What is recall?
 
 Recall is a measure of model quality. Imagine for a particular vector, we know the exact nearest neighbors because we computed them already. The recall for a query result can be computed by taking the set intersection between the exact nearest neighbors and the actual nearest neighbors. The number of neighbors in that intersection list gets divided by k, the number of neighbors being requested. To really give a fair estimate of the recall of a model, we use several query vectors, all with ground truth computed, and we take the total neighbors across all intersected neighbor lists and divide by n_queries * k.
 
 Parameter settings dictate the quality of an index. The graph below shows eight indexes from the same data but with different tuning parameters. Generally speaking, the indexes with higher average recall took longer to build. Which index is fair to report?
 
-.. image:: images/index_recalls.png
+```{image} images/index_recalls.png
+```
 
-
-How do I compare models or indexing algorithms?
-===============================================
+## How do I compare models or indexing algorithms?
 
 In order to fairly compare the performance (e.g. latency and throughput) of an indexing algorithm or model against another, we always need to do so with respect to its potential recall. This is important and draws from the ML roots of vector search, but is often confusing to newcomers who might be more familiar with the database world.
 
@@ -32,29 +28,28 @@ Because recall levels can vary quite a bit across parameter settings, we tend to
 
 We suggest averaging performance within a range of recall. For general guidance, we tend to use the following buckets:
 
-#. 85% - 89%
-#. 90% - 94%
-#. 95% - 99%
-#. >99%
-
-.. image:: images/recall_buckets.png
+1. 85% - 89%
+1. 90% - 94%
+1. 95% - 99%
+1. >99%
 
+```{image} images/recall_buckets.png
+```
 
 This allows us to make observations such as “at 95% recall level, model A can be built 3x faster than model B, but model B has 2x lower latency than model A”
 
-.. image:: images/build_benchmarks.png
-
+```{image} images/build_benchmarks.png
+```
 
 Another important detail is that we compare these models against their best-case search performance within each recall window. This means that we aim to find models that not only have great recall quality but also have either the highest throughput or lowest latency within the window of interest. These best-cases are most often computed by doing a parameter sweep in a grid search (or other types of search optimizers) and looking at the best cases for each level of recall.
 
 The resulting data points will construct a curve known as a Pareto optimum. Please note that this process is specifically for showing best-case across recall and throughput/latency, but when we care about finding the parameters that yield the best recall and search performance, we are essentially performing a  hyperparameter optimization, which is common in machine learning.
 
 
-How do I do this on large vector databases?
-===========================================
+## How do I do this on large vector databases?
 
 It turns out that most vector databases, like Milvus for example, make many smaller vector search indexing models for a single “index”, and the distribution of the vectors across the smaller index models are assumed to be completely uniform. This means we can use subsampling to our benefit, and tune on smaller sub-samples of the overall dataset.
 
 Please note, however, that there are often caps on the size of each of these smaller indexes, and that needs to be taken into consideration when choosing the size of the sub sample to tune.
 
-Please see :doc:`this guide <tuning_guide>` for more information on the steps one would take to do this subsampling and tuning process.
+Please see {doc}`this guide <tuning_guide>` for more information on the steps one would take to do this subsampling and tuning process.
diff --git a/docs/source/conf.py b/docs/source/conf.py
index ffec63ded9..0bb0c62d7a 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -35,8 +35,7 @@
     "IPython.sphinxext.ipython_console_highlighting",
     "IPython.sphinxext.ipython_directive",
     "breathe",
-    "recommonmark",
-    "sphinx_markdown_tables",
+    "myst_parser",
     "sphinx_copybutton",
 ]
 
@@ -55,8 +54,10 @@
 # The suffix(es) of source filenames.
 # You can specify multiple suffix as a list of string:
 #
-# source_suffix = ['.rst', '.md']
-source_suffix = {".rst": "restructuredtext", ".md": "markdown"}
+# source_suffix = [".md"]
+source_suffix = {".md": "markdown"}
+myst_enable_extensions = ["dollarmath"]
+myst_heading_anchors = 6
 
 # The master toctree document.
 master_doc = "index"
diff --git a/docs/source/cpp_api.md b/docs/source/cpp_api.md
new file mode 100644
index 0000000000..9bb46f0779
--- /dev/null
+++ b/docs/source/cpp_api.md
@@ -0,0 +1,15 @@
+# C++ API Documentation
+
+(api)=
+
+```{toctree}
+:maxdepth: 4
+
+cpp_api/cluster.md
+cpp_api/distance.md
+cpp_api/neighbors.md
+cpp_api/preprocessing.md
+cpp_api/selection.md
+cpp_api/stats.md
+```
+
diff --git a/docs/source/cpp_api.rst b/docs/source/cpp_api.rst
deleted file mode 100644
index 34f48a88f6..0000000000
--- a/docs/source/cpp_api.rst
+++ /dev/null
@@ -1,15 +0,0 @@
-~~~~~~~~~~~~~~~~~~~~~
-C++ API Documentation
-~~~~~~~~~~~~~~~~~~~~~
-
-.. _api:
-
-.. toctree::
-   :maxdepth: 4
-
-   cpp_api/cluster.rst
-   cpp_api/distance.rst
-   cpp_api/neighbors.rst
-   cpp_api/preprocessing.rst
-   cpp_api/selection.rst
-   cpp_api/stats.rst
diff --git a/docs/source/cpp_api/cluster.md b/docs/source/cpp_api/cluster.md
new file mode 100644
index 0000000000..a4d23e4a81
--- /dev/null
+++ b/docs/source/cpp_api/cluster.md
@@ -0,0 +1,11 @@
+# Cluster
+
+```{toctree}
+:maxdepth: 2
+:caption: Contents:
+
+cluster_agglomerative.md
+cluster_kmeans.md
+cluster_spectral.md
+```
+
diff --git a/docs/source/cpp_api/cluster.rst b/docs/source/cpp_api/cluster.rst
deleted file mode 100644
index 8165a7d115..0000000000
--- a/docs/source/cpp_api/cluster.rst
+++ /dev/null
@@ -1,14 +0,0 @@
-Cluster
-=======
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Contents:
-
-   cluster_agglomerative.rst
-   cluster_kmeans.rst
-   cluster_spectral.rst
diff --git a/docs/source/cpp_api/cluster_agglomerative.md b/docs/source/cpp_api/cluster_agglomerative.md
new file mode 100644
index 0000000000..3946947d99
--- /dev/null
+++ b/docs/source/cpp_api/cluster_agglomerative.md
@@ -0,0 +1,26 @@
+# Agglomerative
+
+## Parameters
+
+`#include <cuvs/cluster/agglomerative.hpp>`
+
+namespace *cuvs::cluster::agglomerative*
+
+```{doxygengroup} agglomerative_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Agglomerative
+
+`#include <cuvs/cluster/agglomerative.hpp>`
+
+namespace *cuvs::cluster::agglomerative*
+
+```{doxygengroup} single_linkage
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/cluster_agglomerative.rst b/docs/source/cpp_api/cluster_agglomerative.rst
deleted file mode 100644
index 57a46504c4..0000000000
--- a/docs/source/cpp_api/cluster_agglomerative.rst
+++ /dev/null
@@ -1,31 +0,0 @@
-Agglomerative
-=============
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-Parameters
-----------
-
-``#include <cuvs/cluster/agglomerative.hpp>``
-
-namespace *cuvs::cluster::agglomerative*
-
-.. doxygengroup:: agglomerative_params
-   :project: cuvs
-   :members:
-   :content-only:
-
-
-Agglomerative
--------------
-
-``#include <cuvs/cluster/agglomerative.hpp>``
-
-namespace *cuvs::cluster::agglomerative*
-
-.. doxygengroup:: single_linkage
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/cluster_kmeans.md b/docs/source/cpp_api/cluster_kmeans.md
new file mode 100644
index 0000000000..2ae6e79426
--- /dev/null
+++ b/docs/source/cpp_api/cluster_kmeans.md
@@ -0,0 +1,38 @@
+# K-Means
+
+## Parameters
+
+`#include <cuvs/cluster/kmeans.hpp>`
+
+namespace *cuvs::cluster::kmeans*
+
+```{doxygengroup} kmeans_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## K-means
+
+`#include <cuvs/cluster/kmeans.hpp>`
+
+namespace *cuvs::cluster::kmeans*
+
+```{doxygengroup} kmeans
+:project: cuvs
+:members:
+:content-only:
+```
+
+## K-means Helpers
+
+`#include <cuvs/cluster/kmeans.hpp>`
+
+namespace *cuvs::cluster::kmeans::helpers*
+
+```{doxygengroup} kmeans_helpers
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/cluster_kmeans.rst b/docs/source/cpp_api/cluster_kmeans.rst
deleted file mode 100644
index 70ab57bcbd..0000000000
--- a/docs/source/cpp_api/cluster_kmeans.rst
+++ /dev/null
@@ -1,44 +0,0 @@
-K-Means
-=======
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-Parameters
-----------
-
-``#include <cuvs/cluster/kmeans.hpp>``
-
-namespace *cuvs::cluster::kmeans*
-
-.. doxygengroup:: kmeans_params
-   :project: cuvs
-   :members:
-   :content-only:
-
-
-K-means
--------
-
-``#include <cuvs/cluster/kmeans.hpp>``
-
-namespace *cuvs::cluster::kmeans*
-
-.. doxygengroup:: kmeans
-    :project: cuvs
-    :members:
-    :content-only:
-
-
-K-means Helpers
----------------
-
-``#include <cuvs/cluster/kmeans.hpp>``
-
-namespace *cuvs::cluster::kmeans::helpers*
-
-.. doxygengroup:: kmeans_helpers
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/cluster_spectral.md b/docs/source/cpp_api/cluster_spectral.md
new file mode 100644
index 0000000000..f38a44ab62
--- /dev/null
+++ b/docs/source/cpp_api/cluster_spectral.md
@@ -0,0 +1,24 @@
+# Spectral Clustering
+
+Spectral clustering is a graph-based clustering technique that uses the eigenvalues of similarity matrices to identify clusters with complex, non-convex shapes.
+
+`#include <cuvs/cluster/spectral.hpp>`
+
+namespace *cuvs::cluster::spectral*
+
+## Parameters
+
+```{doxygengroup} spectral_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Spectral Clustering
+
+```{doxygengroup} spectral
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/cluster_spectral.rst b/docs/source/cpp_api/cluster_spectral.rst
deleted file mode 100644
index 19dedeef19..0000000000
--- a/docs/source/cpp_api/cluster_spectral.rst
+++ /dev/null
@@ -1,28 +0,0 @@
-Spectral Clustering
-===================
-
-Spectral clustering is a graph-based clustering technique that uses the eigenvalues of similarity matrices to identify clusters with complex, non-convex shapes.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/cluster/spectral.hpp>``
-
-namespace *cuvs::cluster::spectral*
-
-Parameters
-----------
-
-.. doxygengroup:: spectral_params
-   :project: cuvs
-   :members:
-   :content-only:
-
-Spectral Clustering
--------------------
-
-.. doxygengroup:: spectral
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/distance.md b/docs/source/cpp_api/distance.md
new file mode 100644
index 0000000000..598e64469b
--- /dev/null
+++ b/docs/source/cpp_api/distance.md
@@ -0,0 +1,27 @@
+# Distance
+
+This page provides C++ class references for the publicly-exposed elements of the `cuvs/distance` package. cuVS's
+distances have been highly optimized and support a wide assortment of different distance measures.
+
+## Distance Types
+
+`#include <cuvs/distance/distance.h>`
+
+namespace *cuvs::distance*
+
+```{doxygenenum} cuvsDistanceType
+:project: cuvs
+```
+
+## Pairwise Distances
+
+`#include <cuvs/distance/distance.hpp>`
+
+namespace *cuvs::distance*
+
+```{doxygengroup} pairwise_distance
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/distance.rst b/docs/source/cpp_api/distance.rst
deleted file mode 100644
index 994fbdaff5..0000000000
--- a/docs/source/cpp_api/distance.rst
+++ /dev/null
@@ -1,32 +0,0 @@
-Distance
-========
-
-This page provides C++ class references for the publicly-exposed elements of the `cuvs/distance` package. cuVS's
-distances have been highly optimized and support a wide assortment of different distance measures.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-Distance Types
---------------
-
-``#include <cuvs/distance/distance.h>``
-
-namespace *cuvs::distance*
-
-.. doxygenenum:: cuvsDistanceType
-   :project: cuvs
-
-
-Pairwise Distances
-------------------
-
-``#include <cuvs/distance/distance.hpp>``
-
-namespace *cuvs::distance*
-
-.. doxygengroup:: pairwise_distance
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors.md b/docs/source/cpp_api/neighbors.md
new file mode 100644
index 0000000000..a457ca57e6
--- /dev/null
+++ b/docs/source/cpp_api/neighbors.md
@@ -0,0 +1,21 @@
+# Nearest Neighbors
+
+```{toctree}
+:maxdepth: 2
+:caption: Contents:
+
+neighbors_all_neighbors.md
+neighbors_bruteforce.md
+neighbors_cagra.md
+neighbors_dynamic_batching.md
+neighbors_epsilon_neighborhood.md
+neighbors_filter.md
+neighbors_hnsw.md
+neighbors_ivf_flat.md
+neighbors_ivf_pq.md
+neighbors_mg.md
+neighbors_nn_descent.md
+neighbors_refine.md
+neighbors_vamana.md
+```
+
diff --git a/docs/source/cpp_api/neighbors.rst b/docs/source/cpp_api/neighbors.rst
deleted file mode 100644
index 66b4e0c4aa..0000000000
--- a/docs/source/cpp_api/neighbors.rst
+++ /dev/null
@@ -1,24 +0,0 @@
-Nearest Neighbors
-=================
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Contents:
-
-   neighbors_all_neighbors.rst
-   neighbors_bruteforce.rst
-   neighbors_cagra.rst
-   neighbors_dynamic_batching.rst
-   neighbors_epsilon_neighborhood.rst
-   neighbors_filter.rst
-   neighbors_hnsw.rst
-   neighbors_ivf_flat.rst
-   neighbors_ivf_pq.rst
-   neighbors_mg.rst
-   neighbors_nn_descent.rst
-   neighbors_refine.rst
-   neighbors_vamana.rst
diff --git a/docs/source/cpp_api/neighbors_all_neighbors.md b/docs/source/cpp_api/neighbors_all_neighbors.md
new file mode 100644
index 0000000000..e6bbc9e183
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_all_neighbors.md
@@ -0,0 +1,24 @@
+# All-Neighbors
+
+All-Neighbors builds an approximate all-neighbors k-NN graph. Given a full dataset, it finds the nearest neighbors of every vector in that dataset.
+
+`#include <cuvs/neighbors/all_neighbors.hpp>`
+
+namespace *cuvs::neighbors::all_neighbors*
+
+## Build Parameters
+
+```{doxygengroup} all_neighbors_cpp_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Build
+
+```{doxygengroup} all_neighbors_cpp_build
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_all_neighbors.rst b/docs/source/cpp_api/neighbors_all_neighbors.rst
deleted file mode 100644
index 3a7eaee61f..0000000000
--- a/docs/source/cpp_api/neighbors_all_neighbors.rst
+++ /dev/null
@@ -1,29 +0,0 @@
-All-Neighbors
-=============
-
-All-Neighbors allows building an approximate all-neighbors knn graph. Given a full dataset, it finds nearest neighbors for all the training vectors in the dataset.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/all_neighbors.hpp>``
-
-namespace *cuvs::neighbors::all_neighbors*
-
-Build Parameters
-----------------
-
-.. doxygengroup:: all_neighbors_cpp_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-
-Build
------
-
-.. doxygengroup:: all_neighbors_cpp_build
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_bruteforce.md b/docs/source/cpp_api/neighbors_bruteforce.md
new file mode 100644
index 0000000000..20296dc75b
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_bruteforce.md
@@ -0,0 +1,40 @@
+# Bruteforce
+
+The brute-force method runs an exact k-nearest neighbors (KNN) search. It performs an exhaustive search and, in contrast to ANN methods, produces exact results.
+
+`#include <cuvs/neighbors/brute_force.hpp>`
+
+namespace *cuvs::neighbors::brute_force*
+
+## Index
+
+```{doxygengroup} bruteforce_cpp_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} bruteforce_cpp_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} bruteforce_cpp_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} bruteforce_cpp_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_bruteforce.rst b/docs/source/cpp_api/neighbors_bruteforce.rst
deleted file mode 100644
index 1a3f2f7154..0000000000
--- a/docs/source/cpp_api/neighbors_bruteforce.rst
+++ /dev/null
@@ -1,44 +0,0 @@
-Bruteforce
-==========
-
-The bruteforce method is running the KNN algorithm. It performs an extensive search, and in contrast to ANN methods produces an exact result.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/brute_force.hpp>``
-
-namespace *cuvs::neighbors::bruteforce*
-
-Index
------
-
-.. doxygengroup:: bruteforce_cpp_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: bruteforce_cpp_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: bruteforce_cpp_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: bruteforce_cpp_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_cagra.md b/docs/source/cpp_api/neighbors_cagra.md
new file mode 100644
index 0000000000..d8950a280f
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_cagra.md
@@ -0,0 +1,80 @@
+# CAGRA
+
+CAGRA is a graph-based nearest neighbors algorithm that was built from the ground up for GPU acceleration. CAGRA demonstrates state-of-the-art index build and query performance for both small- and large-batch searches.
+
+`#include <cuvs/neighbors/cagra.hpp>`
+
+namespace *cuvs::neighbors::cagra*
+
+## Index build parameters
+
+```{doxygengroup} cagra_cpp_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search parameters
+
+```{doxygengroup} cagra_cpp_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index extend parameters
+
+```{doxygengroup} cagra_cpp_extend_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} cagra_cpp_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} cagra_cpp_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} cagra_cpp_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index extend
+
+```{doxygengroup} cagra_cpp_index_extend
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index merge
+
+```{doxygengroup} cagra_cpp_index_merge
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} cagra_cpp_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_cagra.rst b/docs/source/cpp_api/neighbors_cagra.rst
deleted file mode 100644
index aa1e6ed117..0000000000
--- a/docs/source/cpp_api/neighbors_cagra.rst
+++ /dev/null
@@ -1,84 +0,0 @@
-CAGRA
-=====
-
-CAGRA is a graph-based nearest neighbors algorithm that was built from the ground up for GPU acceleration. CAGRA demonstrates state-of-the art index build and query performance for both small- and large-batch sized search.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/cagra.hpp>``
-
-namespace *cuvs::neighbors::cagra*
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: cagra_cpp_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search parameters
------------------------
-
-.. doxygengroup:: cagra_cpp_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index extend parameters
------------------------
-
-.. doxygengroup:: cagra_cpp_extend_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: cagra_cpp_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: cagra_cpp_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: cagra_cpp_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index extend
-------------
-
-.. doxygengroup:: cagra_cpp_index_extend
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index merge
------------
-
-.. doxygengroup:: cagra_cpp_index_merge
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: cagra_cpp_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_dynamic_batching.md b/docs/source/cpp_api/neighbors_dynamic_batching.md
new file mode 100644
index 0000000000..de9e657621
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_dynamic_batching.md
@@ -0,0 +1,40 @@
+# Dynamic Batching
+
+Dynamic Batching allows grouping small search requests into batches to increase the device occupancy and throughput while keeping the latency within limits.
+
+`#include <cuvs/neighbors/dynamic_batching.hpp>`
+
+namespace *cuvs::neighbors::dynamic_batching*
+
+## Index build parameters
+
+```{doxygengroup} dynamic_batching_cpp_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search parameters
+
+```{doxygengroup} dynamic_batching_cpp_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} dynamic_batching_cpp_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} dynamic_batching_cpp_search
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_dynamic_batching.rst b/docs/source/cpp_api/neighbors_dynamic_batching.rst
deleted file mode 100644
index adc5cb56aa..0000000000
--- a/docs/source/cpp_api/neighbors_dynamic_batching.rst
+++ /dev/null
@@ -1,45 +0,0 @@
-Dynamic Batching
-================
-
-Dynamic Batching allows grouping small search requests into batches to increase the device occupancy and throughput while keeping the latency within limits.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/dynamic_batching.hpp>``
-
-namespace *cuvs::neighbors::dynamic_batching*
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: dynamic_batching_cpp_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search parameters
------------------------
-
-.. doxygengroup:: dynamic_batching_cpp_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: dynamic_batching_cpp_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-
-Index search
-------------
-
-.. doxygengroup:: dynamic_batching_cpp_search
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_epsilon_neighborhood.rst b/docs/source/cpp_api/neighbors_epsilon_neighborhood.md
similarity index 55%
rename from docs/source/cpp_api/neighbors_epsilon_neighborhood.rst
rename to docs/source/cpp_api/neighbors_epsilon_neighborhood.md
index 1ca957bfed..ea62eea8f2 100644
--- a/docs/source/cpp_api/neighbors_epsilon_neighborhood.rst
+++ b/docs/source/cpp_api/neighbors_epsilon_neighborhood.md
@@ -1,20 +1,16 @@
-Epsilon Neighborhood
-====================
+# Epsilon Neighborhood
 
 Epsilon neighborhood finds all neighbors within a given radius (epsilon) for each point in a dataset. Unlike k-nearest neighbors which finds a fixed number of neighbors, epsilon neighborhood finds all points within a specified distance threshold, making it particularly useful for density-based algorithms and graph construction.
 
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/epsilon_neighborhood.hpp>``
+`#include <cuvs/neighbors/epsilon_neighborhood.hpp>`
 
 namespace *cuvs::neighbors::epsilon_neighborhood*
 
-L2-Squared Distance Operations
-------------------------------
+## L2-Squared Distance Operations
+
+```{doxygengroup} epsilon_neighborhood_cpp_l2
+:project: cuvs
+:members:
+:content-only:
+```
 
-.. doxygengroup:: epsilon_neighborhood_cpp_l2
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_filter.md b/docs/source/cpp_api/neighbors_filter.md
new file mode 100644
index 0000000000..97132ca13b
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_filter.md
@@ -0,0 +1,15 @@
+# Filtering
+
+All nearest neighbors search methods support filtering. Filtering is a method to reduce the number
+of candidates that are considered for the nearest neighbors search.
+
+`#include <cuvs/neighbors/common.hpp>`
+
+namespace *cuvs::neighbors*
+
+```{doxygengroup} neighbors_filtering
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_filter.rst b/docs/source/cpp_api/neighbors_filter.rst
deleted file mode 100644
index aba1d348fe..0000000000
--- a/docs/source/cpp_api/neighbors_filter.rst
+++ /dev/null
@@ -1,18 +0,0 @@
-Filtering
-==========
-
-All nearest neighbors search methods support filtering. Filtering is a method to reduce the number
-of candidates that are considered for the nearest neighbors search.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/common.hpp>``
-
-namespace *cuvs::neighbors*
-
-.. doxygengroup:: neighbors_filtering
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_hnsw.md b/docs/source/cpp_api/neighbors_hnsw.md
new file mode 100644
index 0000000000..e786b75253
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_hnsw.md
@@ -0,0 +1,63 @@
+# HNSW
+
+This is a wrapper around hnswlib that loads a CAGRA index as an immutable HNSW index. The loaded HNSW index is compatible only with cuVS and can be searched using the wrapper functions.
+
+`#include <cuvs/neighbors/hnsw.hpp>`
+
+namespace *cuvs::neighbors::hnsw*
+
+## Index search parameters
+
+```{doxygengroup} hnsw_cpp_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} hnsw_cpp_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index extend parameters
+
+```{doxygengroup} hnsw_cpp_extend_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index extend
+```{doxygengroup} hnsw_cpp_index_extend
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index load
+
+```{doxygengroup} hnsw_cpp_index_load
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} hnsw_cpp_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} hnsw_cpp_index_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_hnsw.rst b/docs/source/cpp_api/neighbors_hnsw.rst
deleted file mode 100644
index 00dd3a213c..0000000000
--- a/docs/source/cpp_api/neighbors_hnsw.rst
+++ /dev/null
@@ -1,67 +0,0 @@
-HNSW
-====
-
-This is a wrapper for hnswlib, to load a CAGRA index as an immutable HNSW index. The loaded HNSW index is only compatible in cuVS, and can be searched using wrapper functions.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/hnsw.hpp>``
-
-namespace *cuvs::neighbors::hnsw*
-
-Index search parameters
------------------------
-
-.. doxygengroup:: hnsw_cpp_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: hnsw_cpp_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index extend parameters
------------------------
-
-.. doxygengroup:: hnsw_cpp_extend_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index extend
-------------
-.. doxygengroup:: hnsw_cpp_index_extend
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index load
-----------
-
-.. doxygengroup:: hnsw_cpp_index_load
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: hnsw_cpp_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: hnsw_cpp_index_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_ivf_flat.md b/docs/source/cpp_api/neighbors_ivf_flat.md
new file mode 100644
index 0000000000..2ba034d3a0
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_ivf_flat.md
@@ -0,0 +1,64 @@
+# IVF-Flat
+
+The IVF-Flat method is an ANN algorithm. It uses an inverted file index (IVF) with unmodified (that is, flat) vectors. This algorithm provides simple knobs to reduce the overall search space and to trade off accuracy for speed.
+
+`#include <cuvs/neighbors/ivf_flat.hpp>`
+
+namespace *cuvs::neighbors::ivf_flat*
+
+## Index build parameters
+
+```{doxygengroup} ivf_flat_cpp_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search parameters
+
+```{doxygengroup} ivf_flat_cpp_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} ivf_flat_cpp_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} ivf_flat_cpp_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index extend
+
+```{doxygengroup} ivf_flat_cpp_index_extend
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} ivf_flat_cpp_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} ivf_flat_cpp_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_ivf_flat.rst b/docs/source/cpp_api/neighbors_ivf_flat.rst
deleted file mode 100644
index 3836223e10..0000000000
--- a/docs/source/cpp_api/neighbors_ivf_flat.rst
+++ /dev/null
@@ -1,68 +0,0 @@
-IVF-Flat
-========
-
-The IVF-Flat method is an ANN algorithm. It uses an inverted file index (IVF) with unmodified (that is, flat) vectors. This algorithm provides simple knobs to reduce the overall search space and to trade-off accuracy for speed.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/ivf_flat.hpp>``
-
-namespace *cuvs::neighbors::ivf_flat*
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: ivf_flat_cpp_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search parameters
------------------------
-
-.. doxygengroup:: ivf_flat_cpp_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: ivf_flat_cpp_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: ivf_flat_cpp_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index extend
-------------
-
-.. doxygengroup:: ivf_flat_cpp_index_extend
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: ivf_flat_cpp_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: ivf_flat_cpp_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_ivf_pq.md b/docs/source/cpp_api/neighbors_ivf_pq.md
new file mode 100644
index 0000000000..655e2bb602
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_ivf_pq.md
@@ -0,0 +1,76 @@
+# IVF-PQ
+
+The IVF-PQ method is an ANN algorithm. Like IVF-Flat, IVF-PQ splits the points into a number of clusters (set by the `n_lists` parameter) and searches only the closest clusters to compute the nearest neighbors (how many is set by the `n_probes` parameter). Unlike IVF-Flat, it shrinks the vectors using a technique called product quantization.
+
+`#include <cuvs/neighbors/ivf_pq.hpp>`
+
+namespace *cuvs::neighbors::ivf_pq*
+
+## Index build parameters
+
+```{doxygengroup} ivf_pq_cpp_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search parameters
+
+```{doxygengroup} ivf_pq_cpp_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} ivf_pq_cpp_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} ivf_pq_cpp_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index extend
+
+```{doxygengroup} ivf_pq_cpp_index_extend
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} ivf_pq_cpp_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} ivf_pq_cpp_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Helper Methods
+
+Additional helper functions for manipulating the underlying data of an IVF-PQ index, unpacking records, and writing PQ codes into an existing IVF list.
+
+namespace *cuvs::neighbors::ivf_pq::helpers*
+
+```{doxygengroup} ivf_pq_cpp_helpers
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_ivf_pq.rst b/docs/source/cpp_api/neighbors_ivf_pq.rst
deleted file mode 100644
index cc515682b9..0000000000
--- a/docs/source/cpp_api/neighbors_ivf_pq.rst
+++ /dev/null
@@ -1,80 +0,0 @@
-IVF-PQ
-======
-
-The IVF-PQ method is an ANN algorithm. Like IVF-Flat, IVF-PQ splits the points into a number of clusters (also specified by a parameter called n_lists) and searches the closest clusters to compute the nearest neighbors (also specified by a parameter called n_probes), but it shrinks the sizes of the vectors using a technique called product quantization.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/ivf_pq.hpp>``
-
-namespace *cuvs::neighbors::ivf_pq*
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: ivf_pq_cpp_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search parameters
------------------------
-
-.. doxygengroup:: ivf_pq_cpp_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: ivf_pq_cpp_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: ivf_pq_cpp_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index extend
-------------
-
-.. doxygengroup:: ivf_pq_cpp_index_extend
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: ivf_pq_cpp_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: ivf_pq_cpp_serialize
-    :project: cuvs
-    :members:
-    :content-only:
-
-Helper Methods
----------------
-
-Additional helper functions for manipulating the underlying data of an IVF-PQ index, unpacking records, and writing PQ codes into an existing IVF list.
-
-namespace *cuvs::neighbors::ivf_pq::helpers*
-
-.. doxygengroup:: ivf_pq_cpp_helpers
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_mg.md b/docs/source/cpp_api/neighbors_mg.md
new file mode 100644
index 0000000000..4eb0f7ccf5
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_mg.md
@@ -0,0 +1,72 @@
+# Multi-GPU Nearest Neighbors
+
+The multi-GPU (SNMG, single-node multi-GPU) nearest neighbors API provides a set of functions to deploy ANN indexes across multiple GPUs for improved performance and scalability.
+
+`#include <cuvs/neighbors/common.hpp>`
+
+namespace *cuvs::neighbors*
+
+## Index build parameters
+
+```{doxygengroup} mg_cpp_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Search parameters
+
+```{doxygengroup} mg_cpp_search_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} mg_cpp_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index extend
+
+```{doxygengroup} mg_cpp_index_extend
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index search
+
+```{doxygengroup} mg_cpp_index_search
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} mg_cpp_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index deserialize
+
+```{doxygengroup} mg_cpp_deserialize
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Distribute pre-built local index
+
+```{doxygengroup} mg_cpp_distribute
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_mg.rst b/docs/source/cpp_api/neighbors_mg.rst
deleted file mode 100644
index a03490a157..0000000000
--- a/docs/source/cpp_api/neighbors_mg.rst
+++ /dev/null
@@ -1,76 +0,0 @@
-Multi-GPU Nearest Neighbors
-===========================
-
-The Multi-GPU (SNMG - single-node multi-GPUs) nearest neighbors API provides a set of functions to deploy ANN indexes across multiple GPUs for improved performance and scalability.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/common.hpp>``
-
-namespace *cuvs::neighbors*
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: mg_cpp_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Search parameters
------------------
-
-.. doxygengroup:: mg_cpp_search_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: mg_cpp_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index extend
-------------
-
-.. doxygengroup:: mg_cpp_index_extend
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index search
-------------
-
-.. doxygengroup:: mg_cpp_index_search
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: mg_cpp_serialize
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index deserialize
------------------
-
-.. doxygengroup:: mg_cpp_deserialize
-    :project: cuvs
-    :members:
-    :content-only:
-
-Distribute pre-built local index
---------------------------------
-
-.. doxygengroup:: mg_cpp_distribute
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_nn_descent.md b/docs/source/cpp_api/neighbors_nn_descent.md
new file mode 100644
index 0000000000..e3d3582a71
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_nn_descent.md
@@ -0,0 +1,32 @@
+# NN-Descent
+
+The NN-Descent method is an ANN algorithm that directly approximates a k-nearest neighbors graph. It randomly samples points to compute initial distances, then uses the distances to neighbors of neighbors to reduce the number of distance computations.
+
+`#include <cuvs/neighbors/nn_descent.hpp>`
+
+namespace *cuvs::neighbors::nn_descent*
+
+## Index build parameters
+
+```{doxygengroup} nn_descent_cpp_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} nn_descent_cpp_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} nn_descent_cpp_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_nn_descent.rst b/docs/source/cpp_api/neighbors_nn_descent.rst
deleted file mode 100644
index c21a1003db..0000000000
--- a/docs/source/cpp_api/neighbors_nn_descent.rst
+++ /dev/null
@@ -1,37 +0,0 @@
-NN-Descent
-==========
-
-The NN-descent method is an ANN algorithm that directly approximates a k-nearest neighbors graph by randomly sampling points to compute distances and using neighbors of neighbors distances to reduce distance computations.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/nn_descent.hpp>``
-
-namespace *cuvs::neighbors::nn_descent*
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: nn_descent_cpp_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-
-Index
------
-
-.. doxygengroup:: nn_descent_cpp_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: nn_descent_cpp_index_build
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_refine.md b/docs/source/cpp_api/neighbors_refine.md
new file mode 100644
index 0000000000..14cee5c4bb
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_refine.md
@@ -0,0 +1,16 @@
+# Refinement
+
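+Candidate refinement methods for nearest neighbors search. Refinement recomputes exact distances for the candidates returned by an approximate search and keeps the closest ones.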
+
+`#include <cuvs/neighbors/refine.hpp>`
+
+namespace *cuvs::neighbors*
+
+## Index
+
+```{doxygengroup} ann_refine
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_refine.rst b/docs/source/cpp_api/neighbors_refine.rst
deleted file mode 100644
index 4a90ee9959..0000000000
--- a/docs/source/cpp_api/neighbors_refine.rst
+++ /dev/null
@@ -1,20 +0,0 @@
-Refinement
-==========
-
-Candidate refinement methods for nearest neighbors search
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/refine.hpp>``
-
-namespace *cuvs::neighbors*
-
-Index
------
-
-.. doxygengroup:: ann_refine
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/neighbors_vamana.md b/docs/source/cpp_api/neighbors_vamana.md
new file mode 100644
index 0000000000..9d05171ff1
--- /dev/null
+++ b/docs/source/cpp_api/neighbors_vamana.md
@@ -0,0 +1,40 @@
+# Vamana
+
+Vamana is the graph construction algorithm behind the well-known DiskANN vector search solution. The cuVS implementation of Vamana/DiskANN is a custom GPU-accelerated version of the algorithm that aims to reduce index construction time using NVIDIA GPUs.
+
+`#include <cuvs/neighbors/vamana.hpp>`
+
+namespace *cuvs::neighbors::vamana*
+
+## Index build parameters
+
+```{doxygengroup} vamana_cpp_index_params
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index
+
+```{doxygengroup} vamana_cpp_index
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index build
+
+```{doxygengroup} vamana_cpp_index_build
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Index serialize
+
+```{doxygengroup} vamana_cpp_serialize
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/neighbors_vamana.rst b/docs/source/cpp_api/neighbors_vamana.rst
deleted file mode 100644
index 25447efce1..0000000000
--- a/docs/source/cpp_api/neighbors_vamana.rst
+++ /dev/null
@@ -1,44 +0,0 @@
-Vamana
-======
-
-Vamana is the graph construction algorithm behind the well-known DiskANN vector search solution. The cuVS implementation of Vamana/DiskANN is a custom GPU-acceleration version of the algorithm that aims to reduce index construction time using NVIDIA GPUs.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/neighbors/vamana.hpp>``
-
-namespace *cuvs::neighbors::vamana*
-
-Index build parameters
-----------------------
-
-.. doxygengroup:: vamana_cpp_index_params
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index
------
-
-.. doxygengroup:: vamana_cpp_index
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index build
------------
-
-.. doxygengroup:: vamana_cpp_index_build
-    :project: cuvs
-    :members:
-    :content-only:
-
-Index serialize
----------------
-
-.. doxygengroup:: vamana_cpp_serialize
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/preprocessing.md b/docs/source/cpp_api/preprocessing.md
new file mode 100644
index 0000000000..1618288cad
--- /dev/null
+++ b/docs/source/cpp_api/preprocessing.md
@@ -0,0 +1,11 @@
+# Preprocessing
+
+```{toctree}
+:maxdepth: 2
+:caption: Contents:
+
+preprocessing_pca.md
+preprocessing_quantize.md
+preprocessing_spectral_embedding.md
+```
+
diff --git a/docs/source/cpp_api/preprocessing.rst b/docs/source/cpp_api/preprocessing.rst
deleted file mode 100644
index 417c8faf7e..0000000000
--- a/docs/source/cpp_api/preprocessing.rst
+++ /dev/null
@@ -1,14 +0,0 @@
-Preprocessing
-=============
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Contents:
-
-   preprocessing_pca.rst
-   preprocessing_quantize.rst
-   preprocessing_spectral_embedding.rst
diff --git a/docs/source/cpp_api/preprocessing_pca.md b/docs/source/cpp_api/preprocessing_pca.md
new file mode 100644
index 0000000000..65702ee3a0
--- /dev/null
+++ b/docs/source/cpp_api/preprocessing_pca.md
@@ -0,0 +1,23 @@
+# PCA
+
+Principal Component Analysis (PCA) is a linear dimensionality reduction technique that projects data onto orthogonal directions of maximum variance.
+
+`#include <cuvs/preprocessing/pca.hpp>`
+
+namespace *cuvs::preprocessing::pca*
+
+## Parameters
+
+```{doxygenstruct} cuvs::preprocessing::pca::params
+:project: cuvs
+:members:
+```
+
+## PCA
+
+```{doxygengroup} pca
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/preprocessing_pca.rst b/docs/source/cpp_api/preprocessing_pca.rst
deleted file mode 100644
index 3083f42daf..0000000000
--- a/docs/source/cpp_api/preprocessing_pca.rst
+++ /dev/null
@@ -1,27 +0,0 @@
-PCA
-===
-
-Principal Component Analysis (PCA) is a linear dimensionality reduction technique that projects data onto orthogonal directions of maximum variance.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-``#include <cuvs/preprocessing/pca.hpp>``
-
-namespace *cuvs::preprocessing::pca*
-
-Parameters
-----------
-
-.. doxygenstruct:: cuvs::preprocessing::pca::params
-    :project: cuvs
-    :members:
-
-PCA
----------
-
-.. doxygengroup:: pca
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cpp_api/preprocessing_quantize.md b/docs/source/cpp_api/preprocessing_quantize.md
new file mode 100644
index 0000000000..20f8dfd858
--- /dev/null
+++ b/docs/source/cpp_api/preprocessing_quantize.md
@@ -0,0 +1,41 @@
+# Quantize
+
+This page provides C++ class references for the publicly exposed elements of the
+`cuvs/preprocessing/quantize` package.
+
+## Binary Quantizer
+
+`#include <cuvs/preprocessing/quantize/binary.hpp>`
+
+namespace *cuvs::preprocessing::quantize::binary*
+
+```{doxygengroup} binary
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Product Quantizer
+
+`#include <cuvs/preprocessing/quantize/pq.hpp>`
+
+namespace *cuvs::preprocessing::quantize::pq*
+
+```{doxygengroup} pq
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Scalar Quantizer
+
+`#include <cuvs/preprocessing/quantize/scalar.hpp>`
+
+namespace *cuvs::preprocessing::quantize::scalar*
+
+```{doxygengroup} scalar
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/preprocessing_quantize.rst b/docs/source/cpp_api/preprocessing_quantize.rst
deleted file mode 100644
index fe8bf1ed8e..0000000000
--- a/docs/source/cpp_api/preprocessing_quantize.rst
+++ /dev/null
@@ -1,45 +0,0 @@
-Quantize
-========
-
-This page provides C++ class references for the publicly-exposed elements of the
-`cuvs/preprocessing/quantize` package.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-Binary Quantizer
-----------------
-
-``#include <cuvs/preprocessing/quantize/binary.hpp>``
-
-namespace *cuvs::preprocessing::quantize::binary*
-
-.. doxygengroup:: binary
-   :project: cuvs
-   :members:
-   :content-only:
-
-Product Quantizer
------------------
-
-``#include <cuvs/preprocessing/quantize/pq.hpp>``
-
-namespace *cuvs::preprocessing::quantize::pq*
-
-.. doxygengroup:: pq
-   :project: cuvs
-   :members:
-   :content-only:
-
-Scalar Quantizer
-----------------
-
-``#include <cuvs/preprocessing/quantize/scalar.hpp>``
-
-namespace *cuvs::preprocessing::quantize::scalar*
-
-.. doxygengroup:: scalar
-   :project: cuvs
-   :members:
-   :content-only:
diff --git a/docs/source/cpp_api/preprocessing_spectral_embedding.md b/docs/source/cpp_api/preprocessing_spectral_embedding.md
new file mode 100644
index 0000000000..75d2fd1ae4
--- /dev/null
+++ b/docs/source/cpp_api/preprocessing_spectral_embedding.md
@@ -0,0 +1,100 @@
+# Spectral Embedding
+
+Spectral embedding is a powerful dimensionality reduction technique that uses the eigenvectors
+of the graph Laplacian to embed high-dimensional data into a lower-dimensional space. This
+method is particularly effective for discovering non-linear manifold structures in data and
+is widely used in clustering, visualization, and feature extraction tasks.
+
+## Overview
+
+The spectral embedding algorithm works by:
+
+1. **Graph Construction**: Building a k-nearest neighbors graph from the input data
+2. **Laplacian Computation**: Computing the graph Laplacian matrix (normalized or unnormalized)
+3. **Eigendecomposition**: Finding the eigenvectors corresponding to the smallest eigenvalues
+4. **Embedding**: Using these eigenvectors as coordinates in the lower-dimensional space
+
+## Parameters
+
+`#include <cuvs/preprocessing/spectral_embedding.hpp>`
+
+namespace *cuvs::preprocessing::spectral_embedding*
+
+```{doxygenstruct} cuvs::preprocessing::spectral_embedding::params
+:project: cuvs
+:members:
+```
+
+## Spectral Embedding
+
+`#include <cuvs/preprocessing/spectral_embedding.hpp>`
+
+namespace *cuvs::preprocessing::spectral_embedding*
+
+```{doxygengroup} spectral_embedding
+:project: cuvs
+:content-only:
+```
+
+## Example Usage
+
+### Basic Usage with Dataset
+
+```cpp
+#include <raft/core/resources.hpp>
+#include <cuvs/preprocessing/spectral_embedding.hpp>
+
+// Initialize RAFT resources
+raft::resources handle;
+
+// Configure spectral embedding parameters
+cuvs::preprocessing::spectral_embedding::params params;
+params.n_components = 2;        // Reduce to 2D for visualization
+params.n_neighbors = 15;        // Local neighborhood size
+params.norm_laplacian = true;   // Use normalized Laplacian
+params.drop_first = true;       // Drop constant eigenvector
+params.seed = 42;               // For reproducibility
+
+// Create input dataset (n_samples x n_features)
+int n_samples = 1000;
+int n_features = 50;
+auto dataset = raft::make_device_matrix<float, int>(handle, n_samples, n_features);
+// ... populate dataset with your data ...
+
+// Allocate output embedding matrix (n_samples x n_components)
+auto embedding = raft::make_device_matrix<float, int, raft::col_major>(
+    handle, n_samples, params.n_components);
+
+// Perform spectral embedding
+cuvs::preprocessing::spectral_embedding::transform(
+    handle, params, dataset.view(), embedding.view());
+```
+
+### Using Precomputed Graph
+
+```cpp
+#include <raft/core/resources.hpp>
+#include <cuvs/preprocessing/spectral_embedding.hpp>
+
+raft::resources handle;
+
+// Configure parameters (n_neighbors is ignored with precomputed graph)
+cuvs::preprocessing::spectral_embedding::params params;
+params.n_components = 3;
+params.norm_laplacian = true;
+params.drop_first = true;
+params.seed = 42;
+
+// Assume we have a precomputed connectivity graph
+// This could be from custom similarity computation or k-NN search
+raft::device_coo_matrix<float, int, int, int> connectivity_graph(...);
+
+// Allocate output embedding (n_samples is the number of rows in the graph)
+auto embedding = raft::make_device_matrix<float, int, raft::col_major>(
+    handle, n_samples, params.n_components);
+
+// Perform spectral embedding with precomputed graph
+cuvs::preprocessing::spectral_embedding::transform(
+    handle, params, connectivity_graph.view(), embedding.view());
+```
+
diff --git a/docs/source/cpp_api/preprocessing_spectral_embedding.rst b/docs/source/cpp_api/preprocessing_spectral_embedding.rst
deleted file mode 100644
index bfae68f9de..0000000000
--- a/docs/source/cpp_api/preprocessing_spectral_embedding.rst
+++ /dev/null
@@ -1,108 +0,0 @@
-Spectral Embedding
-==================
-
-Spectral embedding is a powerful dimensionality reduction technique that uses the eigenvectors
-of the graph Laplacian to embed high-dimensional data into a lower-dimensional space. This
-method is particularly effective for discovering non-linear manifold structures in data and
-is widely used in clustering, visualization, and feature extraction tasks.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-Overview
---------
-
-The spectral embedding algorithm works by:
-
-1. **Graph Construction**: Building a k-nearest neighbors graph from the input data
-2. **Laplacian Computation**: Computing the graph Laplacian matrix (normalized or unnormalized)
-3. **Eigendecomposition**: Finding the eigenvectors corresponding to the smallest eigenvalues
-4. **Embedding**: Using these eigenvectors as coordinates in the lower-dimensional space
-
-Parameters
-----------
-
-``#include <cuvs/preprocessing/spectral_embedding.hpp>``
-
-namespace *cuvs::preprocessing::spectral_embedding*
-
-.. doxygenstruct:: cuvs::preprocessing::spectral_embedding::params
-   :project: cuvs
-   :members:
-
-Spectral Embedding
-------------------
-
-``#include <cuvs/preprocessing/spectral_embedding.hpp>``
-
-namespace *cuvs::preprocessing::spectral_embedding*
-
-.. doxygengroup:: spectral_embedding
-   :project: cuvs
-   :content-only:
-
-Example Usage
--------------
-
-Basic Usage with Dataset
-~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. code-block:: cpp
-
-   #include <raft/core/resources.hpp>
-   #include <cuvs/preprocessing/spectral_embedding.hpp>
-
-   // Initialize RAFT resources
-   raft::resources handle;
-
-   // Configure spectral embedding parameters
-   cuvs::preprocessing::spectral_embedding::params params;
-   params.n_components = 2;        // Reduce to 2D for visualization
-   params.n_neighbors = 15;        // Local neighborhood size
-   params.norm_laplacian = true;   // Use normalized Laplacian
-   params.drop_first = true;       // Drop constant eigenvector
-   params.seed = 42;               // For reproducibility
-
-   // Create input dataset (n_samples x n_features)
-   int n_samples = 1000;
-   int n_features = 50;
-   auto dataset = raft::make_device_matrix<float, int>(handle, n_samples, n_features);
-   // ... populate dataset with your data ...
-
-   // Allocate output embedding matrix (n_samples x n_components)
-   auto embedding = raft::make_device_matrix<float, int, raft::col_major>(
-       handle, n_samples, params.n_components);
-
-   // Perform spectral embedding
-   cuvs::preprocessing::spectral_embedding::transform(
-       handle, params, dataset.view(), embedding.view());
-
-Using Precomputed Graph
-~~~~~~~~~~~~~~~~~~~~~~~
-
-.. code-block:: cpp
-
-   #include <raft/core/resources.hpp>
-   #include <cuvs/preprocessing/spectral_embedding.hpp>
-
-   raft::resources handle;
-
-   // Configure parameters (n_neighbors is ignored with precomputed graph)
-   cuvs::preprocessing::spectral_embedding::params params;
-   params.n_components = 3;
-   params.norm_laplacian = true;
-   params.drop_first = true;
-   params.seed = 42;
-
-   // Assume we have a precomputed connectivity graph
-   // This could be from custom similarity computation or k-NN search
-   raft::device_coo_matrix<float, int, int, int> connectivity_graph(...);
-
-   // Allocate output embedding
-   auto embedding = raft::make_device_matrix<float, int, raft::col_major>(
-       handle, n_samples, params.n_components);
-
-   // Perform spectral embedding with precomputed graph
-   cuvs::preprocessing::spectral_embedding::transform(
-       handle, params, connectivity_graph.view(), embedding.view());
diff --git a/docs/source/cpp_api/selection.md b/docs/source/cpp_api/selection.md
new file mode 100644
index 0000000000..e474279b91
--- /dev/null
+++ b/docs/source/cpp_api/selection.md
@@ -0,0 +1,15 @@
+# Selection
+
+This page provides C++ class references for the publicly-exposed elements of the `cuvs/selection`
+package.
+
+## Select-K
+
+`#include <cuvs/selection/select_k.hpp>`
+
+namespace *cuvs::selection*
+
+```{doxygengroup} select_k
+:project: cuvs
+```
+
diff --git a/docs/source/cpp_api/selection.rst b/docs/source/cpp_api/selection.rst
deleted file mode 100644
index 5abe81662f..0000000000
--- a/docs/source/cpp_api/selection.rst
+++ /dev/null
@@ -1,19 +0,0 @@
-Selection
-=========
-
-This page provides C++ class references for the publicly-exposed elements of the `cuvs/selection`
-package.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-Select-K
---------
-
-``#include <cuvs/selection/select_k.hpp>``
-
-namespace *cuvs::selection*
-
-.. doxygengroup:: select_k
-   :project: cuvs
diff --git a/docs/source/cpp_api/stats.md b/docs/source/cpp_api/stats.md
new file mode 100644
index 0000000000..e8fe569d4e
--- /dev/null
+++ b/docs/source/cpp_api/stats.md
@@ -0,0 +1,30 @@
+# Stats
+
+
+This page provides C++ class references for the publicly-exposed elements of the `cuvs/stats`
+package.
+
+## Silhouette Score
+
+`#include <cuvs/stats/silhouette_score.hpp>`
+
+namespace *cuvs::stats*
+
+```{doxygengroup} stats_silhouette_score
+:project: cuvs
+:members:
+:content-only:
+```
+
+## Trustworthiness Score
+
+`#include <cuvs/stats/trustworthiness_score.hpp>`
+
+namespace *cuvs::stats*
+
+```{doxygengroup} stats_trustworthiness
+:project: cuvs
+:members:
+:content-only:
+```
+
diff --git a/docs/source/cpp_api/stats.rst b/docs/source/cpp_api/stats.rst
deleted file mode 100644
index 988ba05dfc..0000000000
--- a/docs/source/cpp_api/stats.rst
+++ /dev/null
@@ -1,34 +0,0 @@
-Stats
-=====
-
-
-This page provides C++ class references for the publicly-exposed elements of the `cuvs/stats`
-package.
-
-.. role:: py(code)
-   :language: c++
-   :class: highlight
-
-Silhouette Score
-----------------
-
-``#include <cuvs/stats/silhouette_score.hpp>``
-
-namespace *cuvs::stats*
-
-.. doxygengroup:: stats_silhouette_score
-    :project: cuvs
-    :members:
-    :content-only:
-
-Trustworthiness Score
----------------------
-
-``#include <cuvs/stats/trustworthiness_score.hpp>``
-
-namespace *cuvs::stats*
-
-.. doxygengroup:: stats_trustworthiness
-    :project: cuvs
-    :members:
-    :content-only:
diff --git a/docs/source/cuvs_bench/build.rst b/docs/source/cuvs_bench/build.md
similarity index 72%
rename from docs/source/cuvs_bench/build.rst
rename to docs/source/cuvs_bench/build.md
index d579a3424d..88f26c21bf 100644
--- a/docs/source/cuvs_bench/build.rst
+++ b/docs/source/cuvs_bench/build.md
@@ -1,13 +1,10 @@
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Build cuVS Bench From Source
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Build cuVS Bench From Source
 
-Dependencies
-============
+## Dependencies
 
 CUDA 12 and a GPU with Volta architecture or later are required to run the benchmarks.
 
-Please refer to the  :doc:`installation docs <../build>` for the base requirements to build cuVS.
+Please refer to the {doc}`installation docs <../build>` for the base requirements to build cuVS.
 
 In addition to the base requirements for building cuVS, additional dependencies needed to build the ANN benchmarks include:
 
@@ -18,31 +15,30 @@ In addition to the base requirements for building cuVS, additional dependencies
 5. nlohmann_json
 6. GGNN
 
-`rapids-cmake <https://github.com/rapidsai/rapids-cmake>`_ is used to build the ANN benchmarks so the code for dependencies not already supplied in the CUDA toolkit will be downloaded and built automatically.
+[rapids-cmake](https://github.com/rapidsai/rapids-cmake) is used to build the ANN benchmarks so the code for dependencies not already supplied in the CUDA toolkit will be downloaded and built automatically.
 
 The easiest (and most reproducible) way to install the dependencies needed to build the ANN benchmarks is to use the conda environment file located in the `conda/environments` directory of the cuVS repository. The following command will use `mamba` (which is preferred over `conda`) to build and activate a new environment for compiling the benchmarks:
 
-.. code-block:: bash
-
-    conda env create --name cuvs_benchmarks -f conda/environments/bench_ann_cuda-131_arch-$(uname -m).yaml
-    conda activate cuvs_benchmarks
+```bash
+conda env create --name cuvs_benchmarks -f conda/environments/bench_ann_cuda-131_arch-$(uname -m).yaml
+conda activate cuvs_benchmarks
+```
 
 The above conda environment will also reduce the compile times as dependencies like FAISS will already be installed and not need to be compiled with `rapids-cmake`.
 
-Compiling the Benchmarks
-========================
+## Compiling the Benchmarks
 
 After the needed dependencies are satisfied, the easiest way to compile ANN benchmarks is through the `build.sh` script in the root of the RAFT source code repository. The following will build the executables for all the support algorithms:
 
-.. code-block:: bash
-
-    ./build.sh bench-ann
+```bash
+./build.sh bench-ann
+```
 
 You can limit the algorithms that are built by providing a semicolon-delimited list of executable names (each algorithm is suffixed with `_ANN_BENCH`):
 
-.. code-block:: bash
-
-    ./build.sh bench-ann -n --limit-bench-ann=HNSWLIB_ANN_BENCH;CUVS_IVF_PQ_ANN_BENCH
+```bash
+./build.sh bench-ann -n --limit-bench-ann="HNSWLIB_ANN_BENCH;CUVS_IVF_PQ_ANN_BENCH"
+```
 
 Available targets to use with `--limit-bench-ann` are:
 
diff --git a/docs/source/cuvs_bench/datasets.rst b/docs/source/cuvs_bench/datasets.md
similarity index 57%
rename from docs/source/cuvs_bench/datasets.rst
rename to docs/source/cuvs_bench/datasets.md
index e6a53ca82b..66751087e7 100644
--- a/docs/source/cuvs_bench/datasets.rst
+++ b/docs/source/cuvs_bench/datasets.md
@@ -1,6 +1,4 @@
-~~~~~~~~~~~~~~~~~~~
-cuVS Bench Datasets
-~~~~~~~~~~~~~~~~~~~
+# cuVS Bench Datasets
 
 A dataset usually has 4 binary files containing database vectors, query vectors, ground truth neighbors and their corresponding distances. For example, Glove-100 dataset has files `base.fbin` (database vectors), `query.fbin` (query vectors), `groundtruth.neighbors.ibin` (ground truth neighbors), and `groundtruth.distances.fbin` (ground truth distances). The first two files are for index building and searching, while the other two are associated with a particular distance and are used for evaluation.
 
@@ -10,53 +8,53 @@ These binary files are little-endian and the format is: the first 8 bytes are `n
 Some implementation can take `float16` database and query vectors as inputs and will have better performance. Use `python/cuvs_bench/cuvs_bench/get_dataset/fbin_to_f16bin.py` to transform dataset from `float32` to `float16` type.
 
 Commonly used datasets can be downloaded from two websites:
-#. Million-scale datasets can be found at the `Data sets <https://github.com/erikbern/ann-benchmarks#data-sets>`_ section of `ann-benchmarks <https://github.com/erikbern/ann-benchmarks>`_.
+1. Million-scale datasets can be found at the [Data sets](https://github.com/erikbern/ann-benchmarks#data-sets) section of [ann-benchmarks](https://github.com/erikbern/ann-benchmarks).
 
     However, these datasets are in HDF5 format. Use `python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py` to transform the format. The usage of this script is:
 
-    .. code-block:: bash
-
-        $ python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py
-        usage: hdf5_to_fbin.py [-n] <input>.hdf5
-           -n: normalize base/query set
-         outputs: <input>.base.fbin
-                  <input>.query.fbin
-                  <input>.groundtruth.neighbors.ibin
-                  <input>.groundtruth.distances.fbin
+    ```bash
+    $ python/cuvs_bench/cuvs_bench/get_dataset/hdf5_to_fbin.py
+    usage: hdf5_to_fbin.py [-n] <input>.hdf5
+       -n: normalize base/query set
+     outputs: <input>.base.fbin
+              <input>.query.fbin
+              <input>.groundtruth.neighbors.ibin
+              <input>.groundtruth.distances.fbin
+    ```
 
     So for an input `.hdf5` file, four output binary files will be produced. See previous section for an example of prepossessing GloVe dataset.
 
     Most datasets provided by `ann-benchmarks` use `Angular` or `Euclidean` distance. `Angular` denotes cosine distance. However, computing cosine distance reduces to computing inner product by normalizing vectors beforehand. In practice, we can always do the normalization to decrease computation cost, so it's better to measure the performance of inner product rather than cosine distance. The `-n` option of `hdf5_to_fbin.py` can be used to normalize the dataset.
 
-#. Billion-scale datasets can be found at `big-ann-benchmarks <http://big-ann-benchmarks.com>`_. The ground truth file contains both neighbors and distances, thus should be split. A script is provided for this:
+1. Billion-scale datasets can be found at [big-ann-benchmarks](http://big-ann-benchmarks.com). The ground truth file contains both neighbors and distances, thus should be split. A script is provided for this:
 
     Take Deep-1B dataset as an example:
 
-    .. code-block:: bash
-
-        mkdir -p data/deep-1B && cd data/deep-1B
-
-        # download manually "Ground Truth" file of "Yandex DEEP"
-        # suppose the file name is deep_new_groundtruth.public.10K.bin
-        python -m cuvs_bench.split_groundtruth deep_new_groundtruth.public.10K.bin groundtruth
-
-        # two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced
+    ```bash
+    mkdir -p data/deep-1B && cd data/deep-1B
+    
+    # download manually "Ground Truth" file of "Yandex DEEP"
+    # suppose the file name is deep_new_groundtruth.public.10K.bin
+    python -m cuvs_bench.split_groundtruth deep_new_groundtruth.public.10K.bin groundtruth
+    
+    # two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced
+    ```
 
     Besides ground truth files for the whole billion-scale datasets, this site also provides ground truth files for the first 10M or 100M vectors of the base sets. This mean we can use these billion-scale datasets as million-scale datasets. To facilitate this, an optional parameter `subset_size` for dataset can be used. See the next step for further explanation.
 
-Generate ground truth
-=====================
+## Generate ground truth
 
 If you have a dataset, but no corresponding ground truth file, then you can generate ground trunth using the `generate_groundtruth` utility. Example usage:
 
-.. code-block:: bash
+```bash
+# With existing query file
+python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --output=groundtruth_dir --queries=/dataset/query.public.10K.fbin
 
-    # With existing query file
-    python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --output=groundtruth_dir --queries=/dataset/query.public.10K.fbin
+# With randomly generated queries
+python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --output=groundtruth_dir --queries=random --n_queries=10000
 
-    # With randomly generated queries
-    python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --output=groundtruth_dir --queries=random --n_queries=10000
+# Using only a subset of the dataset. Define queries by randomly
+# selecting vectors from the (subset of the) dataset.
+python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --nrows=2000000 --output=groundtruth_dir --queries=random-choice --n_queries=10000
+```
 
-    # Using only a subset of the dataset. Define queries by randomly
-    # selecting vectors from the (subset of the) dataset.
-    python -m cuvs_bench.generate_groundtruth --dataset /dataset/base.fbin --nrows=2000000 --output=groundtruth_dir --queries=random-choice --n_queries=10000
diff --git a/docs/source/cuvs_bench/index.md b/docs/source/cuvs_bench/index.md
new file mode 100644
index 0000000000..4ad74fbcc1
--- /dev/null
+++ b/docs/source/cuvs_bench/index.md
@@ -0,0 +1,639 @@
+# cuVS Bench
+
+cuVS bench provides a reproducible benchmarking tool for various ANN search implementations. It's especially suitable for comparing GPU implementations as well as comparing GPU against CPU. One of the primary goals of cuVS is to capture ideal index configurations for a variety of important usage patterns so the results can be reproduced easily on different hardware environments, such as on-prem and cloud.
+
+This tool offers several benefits, including
+
+1. Making fair comparisons of index build times
+
+1. Making fair comparisons of index search throughput and/or latency
+
+1. Finding the optimal parameter settings for a range of recall buckets
+
+1. Easily generating consistently styled plots for index build and search
+
+1. Profiling blind spots and potential for algorithm optimization
+
+1. Investigating the relationship between different parameter settings, index build times, and search performance.
+
+- [Installing the benchmarks](#installing-the-benchmarks)
+
+  * [Conda](#conda)
+
+  * [Docker](#docker)
+
+- [Running the benchmarks](#running-the-benchmarks)
+
+  * [End-to-end: smaller-scale benchmarks (<1M to 10M)](#end-to-end-smaller-scale-benchmarks-1m-to-10m)
+
+  * [End-to-end: large-scale benchmarks (>10M vectors)](#end-to-end-large-scale-benchmarks-10m-vectors)
+
+  * [Running with Docker containers](#running-with-docker-containers)
+
+    * [End-to-end run on GPU](#end-to-end-run-on-gpu)
+
+    * [Manually run the scripts inside the container](#manually-run-the-scripts-inside-the-container)
+
+  * [Evaluating the results](#evaluating-the-results)
+
+- [Creating and customizing dataset configurations](#creating-and-customizing-dataset-configurations)
+
+  * [Multi-GPU benchmarks](#multi-gpu-benchmarks)
+
+- [Adding a new index algorithm](#adding-a-new-index-algorithm)
+
+  * [Implementation and configuration](#implementation-and-configuration)
+
+  * [Adding a Cmake target](#adding-a-cmake-target)
+
+## Installing the benchmarks
+
+There are two main ways pre-compiled benchmarks are distributed:
+
+- [Conda](#conda) For users who are not using containers but want a Python package that is easy to install and use. Pip wheels are planned as an alternative for users who cannot use conda and prefer not to use containers.
+- [Docker](#docker) Only needs docker and [NVIDIA docker](https://github.com/NVIDIA/nvidia-docker) to use. Provides a single docker run command for basic dataset benchmarking, as well as all the functionality of the conda solution inside the containers.
+
+### Conda
+
+```bash
+conda create --name cuvs_benchmarks
+conda activate cuvs_benchmarks
+
+# to install GPU package:
+conda install -c rapidsai -c conda-forge cuvs-bench=<rapids_version> cuda-version=13.1*
+
+# to install CPU package for usage in CPU-only systems:
+conda install -c rapidsai -c conda-forge  cuvs-bench-cpu
+```
+
+The channel `rapidsai` can easily be substituted with `rapidsai-nightly` if nightly benchmarks are desired. The CPU package currently supports running the HNSW benchmarks.
+
+Please see the {doc}`build instructions <build>` to build the benchmarks from source.
+
+### Docker
+
+We provide images for GPU enabled systems, as well as systems without a GPU. The following images are available:
+
+- `cuvs-bench`: Contains GPU and CPU benchmarks, can run all algorithms supported. Will download million-scale datasets as required. Best suited for users that prefer a smaller container size for GPU based systems. Requires the NVIDIA Container Toolkit to run GPU algorithms, can run CPU algorithms without it.
+- `cuvs-bench-cpu`: Contains only CPU benchmarks with minimal size. Best suited for users that want the smallest containers to reproduce benchmarks on systems without a GPU.
+
+Nightly images are available on [Docker Hub](https://hub.docker.com/r/rapidsai/cuvs-bench/tags).
+
+The following command pulls the nightly container for Python version 3.13, CUDA version 12, and cuVS version 26.06:
+
+```bash
+docker pull rapidsai/cuvs-bench:26.06a-cuda12-py3.13 # substitute cuvs-bench for the exact desired container.
+```
+
+The CUDA and Python versions can be changed to any of the supported values:
+- Supported CUDA versions: 12, 13
+- Supported Python versions: 3.11, 3.12, 3.13, and 3.14
+
+You can also see the exact versions on Docker Hub:
+- [cuVS bench images](https://hub.docker.com/r/rapidsai/cuvs-bench/tags)
+- [cuVS bench CPU only images](https://hub.docker.com/r/rapidsai/cuvs-bench-cpu/tags)
+
+**Note:** GPU containers use the CUDA toolkit from inside the container; the only requirement is a driver installed on the host machine that supports that version. For example, CUDA 12 containers can run on systems with a CUDA 13.x capable driver. The NVIDIA Docker runtime from the [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker) is also required to use GPUs inside Docker containers.
+
+## Running the benchmarks
+
+### End-to-end: smaller-scale benchmarks (<1M to 10M)
+
+The steps below demonstrate how to download, install, and run benchmarks on a subset of 10M vectors from the Yandex Deep-1B dataset. By default, datasets are stored in and read from the folder given by the `RAPIDS_DATASET_ROOT_DIR` environment variable if it is defined; otherwise a `datasets` sub-folder of the directory the script is called from is used.
+
+```bash
+# (1) Prepare dataset.
+python -m cuvs_bench.get_dataset --dataset deep-image-96-angular --normalize
+```
+
+```python
+# (2) Build and search index.
+from cuvs_bench.orchestrator import BenchmarkOrchestrator
+
+orchestrator = BenchmarkOrchestrator(backend_type="cpp_gbench")
+results = orchestrator.run_benchmark(
+    dataset="deep-image-96-inner",
+    algorithms="cuvs_cagra",
+    count=10,
+    batch_size=10,
+    build=True,
+    search=True,
+)
+```
+
+```bash
+# (3) Export data.
+python -m cuvs_bench.run --data-export --dataset deep-image-96-inner
+
+# (4) Plot results.
+python -m cuvs_bench.plot --dataset deep-image-96-inner
+```
+
+```{list-table}
+* - Dataset name
+  - Train rows
+  - Columns
+  - Test rows
+  - Distance
+
+* - `deep-image-96-angular`
+  - 10M
+  - 96
+  - 10K
+  - Angular
+
+* - `fashion-mnist-784-euclidean`
+  - 60K
+  - 784
+  - 10K
+  - Euclidean
+
+* - `glove-50-angular`
+  - 1.1M
+  - 50
+  - 10K
+  - Angular
+
+* - `glove-100-angular`
+  - 1.1M
+  - 100
+  - 10K
+  - Angular
+
+* - `mnist-784-euclidean`
+  - 60K
+  - 784
+  - 10K
+  - Euclidean
+
+* - `nytimes-256-angular`
+  - 290K
+  - 256
+  - 10K
+  - Angular
+
+* - `sift-128-euclidean`
+  - 1M
+  - 128
+  - 10K
+  - Euclidean
+```
+
+All of the datasets above contain ground truth datasets with 100 neighbors. Thus `k` for these datasets must be less than or equal to 100.
+
+### End-to-end: large-scale benchmarks (>10M vectors)
+
+`cuvs_bench.get_dataset` cannot be used to download the billion-scale datasets due to their size. You should instead use our billion-scale datasets guide to download and prepare them.
+All other python commands mentioned below work as intended once the billion-scale dataset has been downloaded.
+
+To download billion-scale datasets, visit [big-ann-benchmarks](http://big-ann-benchmarks.com/neurips21.html)
+
+We also provide a new dataset called `wiki-all` containing 88 million 768-dimensional vectors. This dataset is meant for benchmarking a realistic retrieval-augmented generation (RAG)/LLM embedding size at scale. It also contains 1M and 10M vector subsets for smaller-scale experiments. See our {doc}`Wiki-all Dataset Guide <wiki_all_dataset>` for more information and to download the dataset.
+
+
+The steps below demonstrate how to download, install, and run benchmarks on a subset of 100M vectors from the Yandex Deep-1B dataset. Please note that datasets of this scale are recommended for GPUs with larger amounts of memory, such as the A100 or H100.
+
+```bash
+mkdir -p datasets/deep-1B
+# (1) Prepare dataset.
+# download manually "Ground Truth" file of "Yandex DEEP"
+# suppose the file name is deep_new_groundtruth.public.10K.bin
+python -m cuvs_bench.split_groundtruth --groundtruth datasets/deep-1B/deep_new_groundtruth.public.10K.bin
+# two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced
+```
+
+```python
+# (2) Build and search index.
+from cuvs_bench.orchestrator import BenchmarkOrchestrator
+
+orchestrator = BenchmarkOrchestrator(backend_type="cpp_gbench")
+results = orchestrator.run_benchmark(
+    dataset="deep-1B",
+    algorithms="cuvs_cagra",
+    count=10,
+    batch_size=10,
+    build=True,
+    search=True,
+)
+```
+
+```bash
+# (3) Export data.
+python -m cuvs_bench.run --data-export --dataset deep-1B
+
+# (4) Plot results.
+python -m cuvs_bench.plot --dataset deep-1B
+```
+
+The usage of `python -m cuvs_bench.split_groundtruth` is:
+
+```bash
+usage: split_groundtruth.py [-h] --groundtruth GROUNDTRUTH
+
+options:
+  -h, --help            show this help message and exit
+  --groundtruth GROUNDTRUTH
+                        Path to billion-scale dataset groundtruth file (default: None)
+```
+
+### Testing on new datasets
+
+To run a benchmark on a dataset, you need a descriptor that defines the file names and a few other properties of that dataset.
+Descriptors for several popular datasets are already available in [datasets.yaml](https://github.com/rapidsai/cuvs/blob/branch-25.04/python/cuvs_bench/cuvs_bench/config/datasets/datasets.yaml).
+
+Let's consider how to test on a new dataset. First we create a descriptor `mydataset.yaml`:
+
+```yaml
+- name: mydata-1M
+  base_file: mydata-1M/base.100M.u8bin
+  subset_size: 1000000
+  dims: 128
+  query_file: mydata-10M/queries.u8bin
+  groundtruth_neighbors_file: mydata-1M/groundtruth.neighbors.ibin
+  distance: euclidean
+```
+
+Here `name` can be chosen arbitrarily. We pass `name` as the `--dataset` argument for the benchmark. The file names are relative to the path given by `--dataset-path` argument.
+The `subset_size` field is optional. It defines how many vectors to use from the dataset file: the first `subset_size` vectors are used.
+This way you can define benchmarks on multiple subsets of the same dataset without duplicating the dataset vectors.
+Note that the ground truth vectors have to be generated for each subset separately.
+
+To run the benchmark on the newly defined `mydata-1M` dataset, you can use the following command line:
+
+```bash
+python -m cuvs_bench.run --dataset mydata-1M --dataset-path=/path/to/data/folder --dataset-configuration=mydataset.yaml --algorithms=cuvs_cagra
+```
+
+### Running with Docker containers
+
+Two methods are provided for running the benchmarks with the Docker containers.
+
+#### End-to-end run on GPU
+
+When no other entrypoint is provided, an end-to-end script will run through all the steps in [Running the benchmarks](#running-the-benchmarks) above.
+
+For GPU-enabled systems, the `DATA_FOLDER` variable should be a local folder where you want datasets stored in `$DATA_FOLDER/datasets` and results in `$DATA_FOLDER/result` (we highly recommend `$DATA_FOLDER` to be a dedicated folder for the datasets and results of the containers):
+
+```bash
+export DATA_FOLDER=path/to/store/datasets/and/results
+docker run --gpus all --rm -it -u $(id -u)                      \
+    -v $DATA_FOLDER:/data/benchmarks                            \
+    rapidsai/cuvs-bench:26.06a-cuda12-py3.13              \
+    "--dataset deep-image-96-angular"                           \
+    "--normalize"                                               \
+    "--algorithms cuvs_cagra,cuvs_ivf_pq --batch-size 10 -k 10" \
+    ""
+```
+
+Usage of the above command is as follows:
+
+```{list-table}
+* - Argument
+  - Description
+
+* - `rapidsai/cuvs-bench:26.06a-cuda12-py3.13`
+  - Image to use. See "Docker" section for links to lists of available tags.
+
+* - `"--dataset deep-image-96-angular"`
+  - Dataset name
+
+* - `"--normalize"`
+  - Whether to normalize the dataset
+
+* - `"--algorithms cuvs_cagra,cuvs_ivf_pq --batch-size 10 -k 10"`
+  - Arguments passed to the `run` script, such as the algorithms to benchmark, the batch size, and `k`
+
+* - `""`
+  - Additional (optional) arguments that will be passed to the `plot` script.
+```
+
+***Note about user and file permissions:*** The flag `-u $(id -u)` allows the user inside the container to match the `uid` of the user outside the container, allowing the container to read and write to the mounted volume indicated by the `$DATA_FOLDER` variable.
+
+#### End-to-end run on CPU
+
+The container arguments in the above section can also be used for the CPU-only container, which can be used on systems that don't have a GPU installed.
+
+***Note:*** the image changes to the `cuvs-bench-cpu` container and the `--gpus all` argument is no longer used:
+
+```bash
+export DATA_FOLDER=path/to/store/datasets/and/results
+docker run  --rm -it -u $(id -u)                  \
+    -v $DATA_FOLDER:/data/benchmarks              \
+    rapidsai/cuvs-bench-cpu:26.06a-py3.13     \
+     "--dataset deep-image-96-angular"            \
+     "--normalize"                                \
+     "--algorithms hnswlib --batch-size 10 -k 10" \
+     ""
+```
+
+#### Manually run the scripts inside the container
+
+All of the `cuvs-bench` images contain the Conda packages, so the packages can be used directly by logging into the container itself:
+
+```bash
+export DATA_FOLDER=path/to/store/datasets/and/results
+docker run --gpus all --rm -it -u $(id -u)          \
+    --entrypoint /bin/bash                          \
+    --workdir /data/benchmarks                      \
+    -v $DATA_FOLDER:/data/benchmarks                \
+    rapidsai/cuvs-bench:26.06a-cuda12-py3.13
+```
+
+This will drop you into a command line in the container, with the `cuvs-bench` Python package ready to use, as described in the [Running the benchmarks](#running-the-benchmarks) section above:
+
+```bash
+(base) root@00b068fbb862:/data/benchmarks# python -m cuvs_bench.get_dataset --dataset deep-image-96-angular --normalize
+```
+
+Additionally, the containers can be run in detached mode without any issue.
+
+### Evaluating the results
+
+The benchmarks capture several different measurements. The table below describes each of the measurements for index build benchmarks:
+
+```{list-table}
+* - Name
+  - Description
+
+* - Benchmark
+  - A name that uniquely identifies the benchmark instance
+
+* - Time
+  - Wall-time spent training the index
+
+* - CPU
+  - CPU time spent training the index
+
+* - Iterations
+  - Number of iterations (this is usually 1)
+
+* - GPU
+  - GPU time spent building the index
+
+* - index_size
+  - Number of vectors used to train the index
+```
+
+The table below describes each of the measurements for the index search benchmarks. The most important measurements are `Latency`, `items_per_second`, and `end_to_end`.
+
+```{list-table}
+* - Name
+  - Description
+
+* - Benchmark
+  - A name that uniquely identifies the benchmark instance
+
+* - Time
+  - The wall-clock time of a single iteration (batch) divided by the number of threads.
+
+* - CPU
+  - The average CPU time (user + sys time). This does not include idle time (which can also happen while waiting for GPU sync).
+
+* - Iterations
+  - Total number of batches. This is going to be `total_queries` / `n_queries`.
+
+* - GPU
+  - GPU latency of a single batch (seconds). In throughput mode this is averaged over multiple threads.
+
+* - Latency
+  - Latency of a single batch (seconds), calculated from wall-clock time. In throughput mode this is averaged over multiple threads.
+
+* - Recall
+  - Proportion of correct neighbors to ground truth neighbors. Note that this column is only present if a ground truth file is specified in the dataset configuration.
+
+* - items_per_second
+  - Total throughput, a.k.a. queries per second (QPS). This is approximately `total_queries` / `end_to_end`.
+
+* - k
+  - Number of neighbors being queried in each iteration
+
+* - end_to_end
+  - Total time taken to run all batches for all iterations
+
+* - n_queries
+  - Total number of query vectors in each batch
+
+* - total_queries
+  - Total number of query vectors across all iterations (= `Iterations` * `n_queries`)
+```
+
+Note the following:
+- A slightly different method is used to measure `Time` and `end_to_end`. That is why `end_to_end` = `Time` * `Iterations` holds only approximately.
+- The actual table displayed on the screen may differ slightly as the hyper-parameters will also be displayed for each different combination being benchmarked.
+- Recall calculation: the number of queries processed per test depends on the number of iterations. Because of this, recall can fluctuate slightly if fewer neighbors are processed than are available for the benchmark.
+
+## Creating and customizing dataset configurations
+
+A single configuration will often define a set of algorithms, with associated index and search parameters, that can be generalized across datasets. We use YAML to define dataset-specific and algorithm-specific configurations.
+
+A default `datasets.yaml` is provided by cuVS in `${CUVS_HOME}/python/cuvs_bench/cuvs_bench/config/datasets/datasets.yaml` with configurations available for several datasets. Here's a simple example entry for the `sift-128-euclidean` dataset:
+
+```yaml
+- name: sift-128-euclidean
+  base_file: sift-128-euclidean/base.fbin
+  query_file: sift-128-euclidean/query.fbin
+  groundtruth_neighbors_file: sift-128-euclidean/groundtruth.neighbors.ibin
+  dims: 128
+  distance: euclidean
+```
+
+Configuration files for ANN algorithms supported by `cuvs-bench` are provided in `${CUVS_HOME}/python/cuvs_bench/cuvs_bench/config/algos`. The `cuvs_cagra` algorithm configuration looks like:
+
+```yaml
+name: cuvs_cagra
+constraints:
+  build: cuvs_bench.config.algos.constraints.cuvs_cagra_build
+  search: cuvs_bench.config.algos.constraints.cuvs_cagra_search
+groups:
+  base:
+    build:
+      graph_degree: [32, 64]
+      intermediate_graph_degree: [64, 96]
+      graph_build_algo: ["NN_DESCENT"]
+    search:
+      itopk: [32, 64, 128]
+
+  large:
+    build:
+      graph_degree: [32, 64]
+    search:
+      itopk: [32, 64, 128]
+```
+
+The default parameters for which the benchmarks are run can be overridden by creating a custom YAML file for algorithms with a `base` group.
+
+The config above has 3 fields:
+
+1. `name` - The name of the algorithm for which the parameters are being specified.
+2. `constraints` - Optional. Python import paths to functions that validate build and search parameter combinations (e.g. `cuvs_bench.config.algos.constraints.cuvs_cagra_build`). Each function returns `True` if the parameters are valid, `False` otherwise; invalid combinations are skipped and not benchmarked.
+3. `groups` - Run groups, each with a set of parameters. Each group defines a cross-product of all hyper-parameter fields for `build` and `search`.
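+
+As a hedged illustration of overriding the defaults, a custom file for `cuvs_cagra` might look like the sketch below. It reuses only the field names from the default configuration shown above; the parameter values are arbitrary, and how the file is named and passed to the benchmark runner is not covered here:
+
+```yaml
+# Hypothetical custom configuration; values are illustrative only.
+name: cuvs_cagra
+groups:
+  base:
+    build:
+      graph_degree: [64]
+      intermediate_graph_degree: [128]
+      graph_build_algo: ["NN_DESCENT"]
+    search:
+      itopk: [64, 128, 256]
+```
+
+Because each group is a cross-product of its fields, this `base` group expands to one build configuration (1 × 1 × 1) and three search configurations. By the same rule, the default `base` group shown earlier expands to four build configurations (2 × 2 × 1) and three search configurations, before any `constraints` are applied.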
+
+The table below contains all algorithms supported by cuVS. Each unique algorithm will have its own set of `build` and `search` settings. The {doc}`ANN Algorithm Parameter Tuning Guide <param_tuning>` contains detailed instructions on choosing build and search parameters for each supported algorithm.
+
+```{list-table}
+* - Library
+  - Algorithms
+
+* - FAISS_GPU
+  - `faiss_gpu_flat`, `faiss_gpu_ivf_flat`, `faiss_gpu_ivf_pq`, `faiss_gpu_cagra`
+
+* - FAISS_CPU
+  - `faiss_cpu_flat`, `faiss_cpu_ivf_flat`, `faiss_cpu_ivf_pq`, `faiss_cpu_hnsw_flat`
+
+* - GGNN
+  - `ggnn`
+
+* - HNSWLIB
+  - `hnswlib`
+
+* - DiskANN
+  - `diskann_memory`, `diskann_ssd`
+
+* - cuVS
+  - `cuvs_brute_force`, `cuvs_cagra`, `cuvs_ivf_flat`, `cuvs_ivf_pq`, `cuvs_cagra_hnswlib`, `cuvs_vamana`
+```
+
+### Multi-GPU benchmarks
+
+cuVS implements single node multi-GPU versions of IVF-Flat, IVF-PQ and CAGRA indexes.
+
+```{list-table}
+* - Index type
+  - Multi-GPU algo name
+
+* - IVF-Flat
+  - `cuvs_mg_ivf_flat`
+
+* - IVF-PQ
+  - `cuvs_mg_ivf_pq`
+
+* - CAGRA
+  - `cuvs_mg_cagra`
+```
+
+## Adding a new index algorithm
+
+### Implementation and configuration
+
+Implementation of a new algorithm should be a C++ class that inherits `class ANN` (defined in `cpp/bench/ann/src/ann.h`) and implements all the pure virtual functions.
+
+In addition, it should define two `struct`s for building and searching parameters. The searching parameter class should inherit `struct ANN<T>::AnnSearchParam`. Taking `class HnswLib` as an example, its definition is:
+
+```c++
+template<typename T>
+class HnswLib : public ANN<T> {
+public:
+  struct BuildParam {
+    int M;
+    int ef_construction;
+    int num_threads;
+  };
+
+  using typename ANN<T>::AnnSearchParam;
+  struct SearchParam : public AnnSearchParam {
+    int ef;
+    int num_threads;
+  };
+
+  // ...
+};
+```
+
+The benchmark program natively uses a JSON configuration file to specify the indexes to build, along with the build and search parameters. However, the JSON config files are verbose and are not meant to be written directly. Instead, the Python scripts parse YAML and create these JSON files automatically. The JSON objects align with the YAML objects for `build_param`, whose value is a JSON object, and `search_params`, whose value is an array of JSON objects. Take the JSON configuration for `HnswLib` as an example of the JSON generated from the YAML:
+
+```json
+{
+  "name" : "hnswlib.M12.ef500.th32",
+  "algo" : "hnswlib",
+  "build_param": {"M":12, "efConstruction":500, "numThreads":32},
+  "file" : "/path/to/file",
+  "search_params" : [
+    {"ef":10, "numThreads":1},
+    {"ef":20, "numThreads":1},
+    {"ef":40, "numThreads":1}
+  ],
+  "search_result_file" : "/path/to/file"
+}
+```
+
+The build and search params are ultimately passed to the C++ layer as JSON objects for each param configuration to benchmark. The code below shows how to parse these params for `HnswLib`:
+
+1. First, add two functions that parse a JSON object into `struct BuildParam` and `struct SearchParam`, respectively:
+
+```c++
+template<typename T>
+void parse_build_param(const nlohmann::json& conf,
+                       typename cuann::HnswLib<T>::BuildParam& param) {
+  param.ef_construction = conf.at("efConstruction");
+  param.M = conf.at("M");
+  if (conf.contains("numThreads")) {
+    param.num_threads = conf.at("numThreads");
+  }
+}
+
+template<typename T>
+void parse_search_param(const nlohmann::json& conf,
+                        typename cuann::HnswLib<T>::SearchParam& param) {
+  param.ef = conf.at("ef");
+  if (conf.contains("numThreads")) {
+    param.num_threads = conf.at("numThreads");
+  }
+}
+```
+
+2. Next, add a corresponding `if` case to the functions `create_algo()` (in `cpp/bench/ann/`) and `create_search_param()` that calls the parsing functions. The string literal in the `if` condition must match the value of `algo` in the configuration file. For example:
+
+```c++
+// JSON configuration file contains a line like:  "algo" : "hnswlib"
+if (algo == "hnswlib") {
+   // ...
+}
+```
+
+### Adding a CMake target
+
+In `cuvs/cpp/bench/ann/CMakeLists.txt`, we provide a `CMake` function to configure a new Benchmark target with the following signature:
+
+
+```cmake
+ConfigureAnnBench(
+  NAME <algo_name>
+  PATH </path/to/algo/benchmark/source/file>
+  INCLUDES <additional_include_directories>
+  CXXFLAGS <additional_cxx_flags>
+  LINKS <additional_link_library_targets>
+)
+```
+
+To add a target for `HNSWLIB`, we would call the function as:
+
+```cmake
+ConfigureAnnBench(
+  NAME HNSWLIB PATH bench/ann/src/hnswlib/hnswlib_benchmark.cpp INCLUDES
+  ${CMAKE_CURRENT_BINARY_DIR}/_deps/hnswlib-src/hnswlib CXXFLAGS "${HNSW_CXX_FLAGS}"
+)
+```
+
+This will create an executable called `HNSWLIB_ANN_BENCH`, which can then be used to run `HNSWLIB` benchmarks.
+
+Add a new entry to `algos.yaml` to map the name of the algorithm to its binary executable and specify whether the algorithm requires GPU support.
+
+```yaml
+cuvs_ivf_pq:
+  executable: CUVS_IVF_PQ_ANN_BENCH
+  requires_gpu: true
+```
+
+- `executable`: specifies the name of the binary that will build/search the index. It is assumed to be available in `cuvs/cpp/build/`.
+- `requires_gpu`: denotes whether an algorithm requires a GPU to run.
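+
+For the `HNSWLIB` target created above, the corresponding entry would plausibly look like the following sketch. The key `hnswlib` is taken from the `algo` name used earlier and the executable name from the CMake target; `requires_gpu: false` is an assumption based on `hnswlib` being run in the CPU-only container:
+
+```yaml
+# Sketch of an algos.yaml entry for the HNSWLIB benchmark target.
+hnswlib:
+  executable: HNSWLIB_ANN_BENCH
+  requires_gpu: false  # assumption: hnswlib runs on the CPU
+```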
+
+
+```{toctree}
+:maxdepth: 4
+
+build.md
+datasets.md
+param_tuning.md
+pluggable_backend.md
+wiki_all_dataset.md
+```
diff --git a/docs/source/cuvs_bench/index.rst b/docs/source/cuvs_bench/index.rst
deleted file mode 100644
index 2efa9ff86b..0000000000
--- a/docs/source/cuvs_bench/index.rst
+++ /dev/null
@@ -1,661 +0,0 @@
-~~~~~~~~~~
-cuVS Bench
-~~~~~~~~~~
-
-cuVS bench provides a reproducible benchmarking tool for various ANN search implementations. It's especially suitable for comparing GPU implementations as well as comparing GPU against CPU. One of the primary goals of cuVS is to capture ideal index configurations for a variety of important usage patterns so the results can be reproduced easily on different hardware environments, such as on-prem and cloud.
-
-This tool offers several benefits, including
-
-#. Making fair comparisons of index build times
-
-#. Making fair comparisons of index search throughput and/or latency
-
-#. Finding the optimal parameter settings for a range of recall buckets
-
-#. Easily generating consistently styled plots for index build and search
-
-#. Profiling blind spots and potential for algorithm optimization
-
-#. Investigating the relationship between different parameter settings, index build times, and search performance.
-
-- `Installing the benchmarks`_
-
-  * `Conda`_
-
-  * `Docker`_
-
-- `Running the benchmarks`_
-
-  * `End-to-end: smaller-scale benchmarks (<1M to 10M)`_
-
-  * `End-to-end: large-scale benchmarks (>10M vectors)`_
-
-  * `Running with Docker containers`_
-
-    * `End-to-end run on GPU`_
-
-    * `Manually run the scripts inside the container`_
-
-  * `Evaluating the results`_
-
-- `Creating and customizing dataset configurations`_
-
-  * `Multi-GPU benchmarks`_
-
-- `Adding a new index algorithm`_
-
-  * `Implementation and configuration`_
-
-  * `Adding a Cmake target`_
-
-Installing the benchmarks
-=========================
-
-There are two main ways pre-compiled benchmarks are distributed:
-
-- `Conda`_ For users not using containers but want an easy to install and use Python package. Pip wheels are planned to be added as an alternative for users that cannot use conda and prefer to not use containers.
-- `Docker`_ Only needs docker and `NVIDIA docker <https://github.com/NVIDIA/nvidia-docker>`_ to use. Provides a single docker run command for basic dataset benchmarking, as well as all the functionality of the conda solution inside the containers.
-
-Conda
------
-
-.. code-block:: bash
-
-   conda create --name cuvs_benchmarks
-   conda activate cuvs_benchmarks
-
-   # to install GPU package:
-   conda install -c rapidsai -c conda-forge cuvs-bench=<rapids_version> cuda-version=13.1*
-
-   # to install CPU package for usage in CPU-only systems:
-   conda install -c rapidsai -c conda-forge  cuvs-bench-cpu
-
-The channel `rapidsai` can easily be substituted with `rapidsai-nightly` if nightly benchmarks are desired. The CPU package currently allows to run the HNSW benchmarks.
-
-Please see the :doc:`build instructions <build>` to build the benchmarks from source.
-
-Docker
-------
-
-We provide images for GPU enabled systems, as well as systems without a GPU. The following images are available:
-
-- `cuvs-bench`: Contains GPU and CPU benchmarks, can run all algorithms supported. Will download million-scale datasets as required. Best suited for users that prefer a smaller container size for GPU based systems. Requires the NVIDIA Container Toolkit to run GPU algorithms, can run CPU algorithms without it.
-- `cuvs-bench-cpu`: Contains only CPU benchmarks with minimal size. Best suited for users that want the smallest containers to reproduce benchmarks on systems without a GPU.
-
-Nightly images are located in `dockerhub <https://hub.docker.com/r/rapidsai/cuvs-bench/tags>`_.
-
-The following command pulls the nightly container for Python version 3.13, CUDA version 12.9, and cuVS version 26.06:
-
-.. code-block:: bash
-
-   docker pull rapidsai/cuvs-bench:26.06a-cuda12-py3.13 # substitute cuvs-bench for the exact desired container.
-
-The CUDA and python versions can be changed for the supported values:
-- Supported CUDA versions: 12, 13
-- Supported Python versions: 3.11, 3.11, 3.13, and 3.14
-
-You can see the exact versions as well in the dockerhub site:
-- `cuVS bench images <https://hub.docker.com/r/rapidsai/cuvs-bench/tags>`_
-- `cuVS bench CPU only images <https://hub.docker.com/r/rapidsai/cuvs-bench-cpu/tags>`_
-
-**Note:** GPU containers use the CUDA toolkit from inside the container, the only requirement is a driver installed on the host machine that supports that version. So, for example, CUDA 12 containers can run in systems with a CUDA 13.x capable driver. Please also note that the Nvidia-Docker runtime from the `Nvidia Container Toolkit <https://github.com/NVIDIA/nvidia-docker>`_ is required to use GPUs inside docker containers.
-
-Running the benchmarks
-======================
-
-End-to-end: smaller-scale benchmarks (<1M to 10M)
--------------------------------------------------
-
-The steps below demonstrate how to download, install, and run benchmarks on a subset of 10M vectors from the Yandex Deep-1B dataset. By default the datasets will be stored and used from the folder indicated by the `RAPIDS_DATASET_ROOT_DIR` environment variable if defined, otherwise a datasets sub-folder from where the script is being called.
-
-.. code-block:: bash
-
-    # (1) Prepare dataset.
-    python -m cuvs_bench.get_dataset --dataset deep-image-96-angular --normalize
-
-.. code-block:: python
-
-    # (2) Build and search index.
-    from cuvs_bench.orchestrator import BenchmarkOrchestrator
-
-    orchestrator = BenchmarkOrchestrator(backend_type="cpp_gbench")
-    results = orchestrator.run_benchmark(
-        dataset="deep-image-96-inner",
-        algorithms="cuvs_cagra",
-        count=10,
-        batch_size=10,
-        build=True,
-        search=True,
-    )
-
-.. code-block:: bash
-
-    # (3) Export data.
-    python -m cuvs_bench.run --data-export --dataset deep-image-96-inner
-
-    # (4) Plot results.
-    python -m cuvs_bench.plot --dataset deep-image-96-inner
-
-.. list-table::
-
- * - Dataset name
-   - Train rows
-   - Columns
-   - Test rows
-   - Distance
-
- * - `deep-image-96-angular`
-   - 10M
-   - 96
-   - 10K
-   - Angular
-
- * - `fashion-mnist-784-euclidean`
-   - 60K
-   - 784
-   - 10K
-   - Euclidean
-
- * - `glove-50-angular`
-   - 1.1M
-   - 50
-   - 10K
-   - Angular
-
- * - `glove-100-angular`
-   - 1.1M
-   - 100
-   - 10K
-   - Angular
-
- * - `mnist-784-euclidean`
-   - 60K
-   - 784
-   - 10K
-   - Euclidean
-
- * - `nytimes-256-angular`
-   - 290K
-   - 256
-   - 10K
-   - Angular
-
- * - `sift-128-euclidean`
-   - 1M
-   - 128
-   - 10K
-   - Euclidean
-
-All of the datasets above contain ground test datasets with 100 neighbors. Thus `k` for these datasets must be  less than or equal to 100.
-
-End-to-end: large-scale benchmarks (>10M vectors)
--------------------------------------------------
-
-`cuvs_bench.get_dataset` cannot be used to download the billion-scale datasets due to their size. You should instead use our billion-scale datasets guide to download and prepare them.
-All other python commands mentioned below work as intended once the billion-scale dataset has been downloaded.
-
-To download billion-scale datasets, visit `big-ann-benchmarks <http://big-ann-benchmarks.com/neurips21.html>`_
-
-We also provide a new dataset called `wiki-all` containing 88 million 768-dimensional vectors. This dataset is meant for benchmarking a realistic retrieval-augmented generation (RAG)/LLM embedding size at scale. It also contains 1M and 10M vector subsets for smaller-scale experiments. See our :doc:`Wiki-all Dataset Guide <wiki_all_dataset>` for more information and to download the dataset.
-
-
-The steps below demonstrate how to download, install, and run benchmarks on a subset of 100M vectors from the Yandex Deep-1B dataset. Please note that datasets of this scale are recommended for GPUs with larger amounts of memory, such as the A100 or H100.
-
-.. code-block:: bash
-
-    mkdir -p datasets/deep-1B
-    # (1) Prepare dataset.
-    # download manually "Ground Truth" file of "Yandex DEEP"
-    # suppose the file name is deep_new_groundtruth.public.10K.bin
-    python -m cuvs_bench.split_groundtruth --groundtruth datasets/deep-1B/deep_new_groundtruth.public.10K.bin
-    # two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced
-
-.. code-block:: python
-
-    # (2) Build and search index.
-    from cuvs_bench.orchestrator import BenchmarkOrchestrator
-
-    orchestrator = BenchmarkOrchestrator(backend_type="cpp_gbench")
-    results = orchestrator.run_benchmark(
-        dataset="deep-1B",
-        algorithms="cuvs_cagra",
-        count=10,
-        batch_size=10,
-        build=True,
-        search=True,
-    )
-
-.. code-block:: bash
-
-    # (3) Export data.
-    python -m cuvs_bench.run --data-export --dataset deep-1B
-
-    # (4) Plot results.
-    python -m cuvs_bench.plot --dataset deep-1B
-
-The usage of `python -m cuvs_bench.split_groundtruth` is:
-
-.. code-block:: bash
-
-    usage: split_groundtruth.py [-h] --groundtruth GROUNDTRUTH
-
-    options:
-      -h, --help            show this help message and exit
-      --groundtruth GROUNDTRUTH
-                            Path to billion-scale dataset groundtruth file (default: None)
-
-
-Testing on new datasets
------------------------
-
-To run benchmark on a dataset, it is required have a descriptor that defines the file names and a few other properties of that dataset.
-Descriptors for several popular datasets are already available in `datasets.yaml <https://github.com/rapidsai/cuvs/blob/branch-25.04/python/cuvs_bench/cuvs_bench/config/datasets/datasets.yaml>``.
-
-Let's consider how to test on a new dataset. First we create a descriptor `mydataset.yaml`
-
-.. code-block: yaml
-    - name: mydata-1M
-      base_file: mydata-1M/base.100M.u8bin
-      subset_size: 1000000
-      dims: 128
-      query_file: mydata-10M/queries.u8bin
-      groundtruth_neighbors_file: mydata-1M/groundtruth.neighbors.ibin
-      distance: euclidean
-
-Here `name` can be chosen arbitrarily. We pass `name` as the `--dataset` argument for the benchmark. The file names are relative to the path given by `--dataset-path` argument.
-The `subset_size`` field is optional. This argument defines how many vectors to use from the dataset file, the first `subset_size` vectors will be used.
-This way you can define benchmarks on multiple subsets of the same dataset without duplicating the dataset vectors.
-Note that the ground truth vectors have to be generated for each subset separately.
-
-To run the benchmark on the newly defined `mydata-1M` dataset, you can use the following command line:
-
-.. code-black: bash
-  python -m cuvs_bench.run --dataset mydata-1M --dataset-path=/path/to/data/folder --dataset-configuration=mydataset.yaml  --algorithms=cuvs_cagra
-
-Running with Docker containers
-------------------------------
-
-Two methods are provided for running the benchmarks with the Docker containers.
-
-End-to-end run on GPU
-~~~~~~~~~~~~~~~~~~~~~
-
-When no other entrypoint is provided, an end-to-end script will run through all the steps in `Running the benchmarks`_ above.
-
-For GPU-enabled systems, the `DATA_FOLDER` variable should be a local folder where you want datasets stored in `$DATA_FOLDER/datasets` and results in `$DATA_FOLDER/result` (we highly recommend `$DATA_FOLDER` to be a dedicated folder for the datasets and results of the containers):
-
-.. code-block:: bash
-
-    export DATA_FOLDER=path/to/store/datasets/and/results
-    docker run --gpus all --rm -it -u $(id -u)                      \
-        -v $DATA_FOLDER:/data/benchmarks                            \
-        rapidsai/cuvs-bench:26.06a-cuda12-py3.13              \
-        "--dataset deep-image-96-angular"                           \
-        "--normalize"                                               \
-        "--algorithms cuvs_cagra,cuvs_ivf_pq --batch-size 10 -k 10" \
-        ""
-
-Usage of the above command is as follows:
-
-.. list-table::
-
- * - Argument
-   - Description
-
- * - `rapidsai/cuvs-bench:26.06a-cuda12-py3.13`
-   - Image to use. See "Docker" section for links to lists of available tags.
-
- * - `"--dataset deep-image-96-angular"`
-   - Dataset name
-
- * - `"--normalize"`
-   - Whether to normalize the dataset
-
- * - `"--algorithms cuvs_cagra,hnswlib --batch-size 10 -k 10"`
-   - Arguments passed to the `run` script, such as the algorithms to benchmark, the batch size, and `k`
-
- * - `""`
-   - Additional (optional) arguments that will be passed to the `plot` script.
-
-***Note about user and file permissions:*** The flag `-u $(id -u)` allows the user inside the container to match the `uid` of the user outside the container, allowing the container to read and write to the mounted volume indicated by the `$DATA_FOLDER` variable.
-
-End-to-end run on CPU
-~~~~~~~~~~~~~~~~~~~~~
-
-The container arguments in the above section also be used for the CPU-only container, which can be used on systems that don't have a GPU installed.
-
-***Note:*** the image changes to `cuvs-bench-cpu` container and the `--gpus all` argument is no longer used:
-
-.. code-block:: bash
-
-    export DATA_FOLDER=path/to/store/datasets/and/results
-    docker run  --rm -it -u $(id -u)                  \
-        -v $DATA_FOLDER:/data/benchmarks              \
-        rapidsai/cuvs-bench-cpu:26.06a-py3.13     \
-         "--dataset deep-image-96-angular"            \
-         "--normalize"                                \
-         "--algorithms hnswlib --batch-size 10 -k 10" \
-         ""
-
-Manually run the scripts inside the container
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-All of the `cuvs-bench` images contain the Conda packages, so they can be used directly by logging directly into the container itself:
-
-.. code-block:: bash
-
-    export DATA_FOLDER=path/to/store/datasets/and/results
-    docker run --gpus all --rm -it -u $(id -u)          \
-        --entrypoint /bin/bash                          \
-        --workdir /data/benchmarks                      \
-        -v $DATA_FOLDER:/data/benchmarks                \
-        rapidsai/cuvs-bench:26.06a-cuda12-py3.13
-
-This will drop you into a command line in the container, with the `cuvs-bench` python package ready to use, as described in the `Running the benchmarks`_ section above:
-
-.. code-block:: bash
-
-    (base) root@00b068fbb862:/data/benchmarks# python -m cuvs_bench.get_dataset --dataset deep-image-96-angular --normalize
-
-Additionally, the containers can be run in detached mode without any issue.
-
-Evaluating the results
-----------------------
-
-The benchmarks capture several different measurements. The table below describes each of the measurements for index build benchmarks:
-
-.. list-table::
-
- * - Name
-   - Description
-
- * - Benchmark
-   - A name that uniquely identifies the benchmark instance
-
- * - Time
-   - Wall-time spent training the index
-
- * - CPU
-   - CPU time spent training the index
-
- * - Iterations
-   - Number of iterations (this is usually 1)
-
- * - GPU
-   - GU time spent building
-
- * - index_size
-   - Number of vectors used to train index
-
-The table below describes each of the measurements for the index search benchmarks. The most important measurements `Latency`, `items_per_second`, `end_to_end`.
-
-.. list-table::
-
- * - Name
-   - Description
-
- * - Benchmark
-   - A name that uniquely identifies the benchmark instance
-
- * - Time
-   - The wall-clock time of a single iteration (batch) divided by the number of threads.
-
- * - CPU
-   - The average CPU time (user + sys time). This does not include idle time (which can also happen while waiting for GPU sync).
-
- * - Iterations
-   - Total number of batches. This is going to be `total_queries` / `n_queries`.
-
- * - GPU
-   - GPU latency of a single batch (seconds). In throughput mode this is averaged over multiple threads.
-
- * - Latency
-   - Latency of a single batch (seconds), calculated from wall-clock time. In throughput mode this is averaged over multiple threads.
-
- * - Recall
-   - Proportion of correct neighbors to ground truth neighbors. Note this column is only present if groundtruth file is specified in dataset configuration.
-
- * - items_per_second
-   - Total throughput, a.k.a Queries per second (QPS). This is approximately `total_queries` / `end_to_end`.
-
- * - k
-   - Number of neighbors being queried in each iteration
-
- * - end_to_end
-   - Total time taken to run all batches for all iterations
-
- * - n_queries
-   - Total number of query vectors in each batch
-
- * - total_queries
-   - Total number of vectors queries across all iterations ( = `iterations` * `n_queries`)
-
-Note the following:
-- A slightly different method is used to measure `Time` and `end_to_end`. That is why `end_to_end` = `Time` * `Iterations` holds only approximately.
-- The actual table displayed on the screen may differ slightly as the hyper-parameters will also be displayed for each different combination being benchmarked.
-- Recall calculation: the number of queries processed per test depends on the number of iterations. Because of this, recall can show slight fluctuations if less neighbors are processed then it is available for the benchmark.
-
-Creating and customizing dataset configurations
-===============================================
-
-A single configuration will often define a set of algorithms, with associated index and search parameters, that can be generalize across datasets. We use YAML to define dataset specific and algorithm specific configurations.
-
-A default `datasets.yaml` is provided by CUVS in `${CUVS_HOME}/python/cuvs_bench/cuvs_bench/config/datasets/datasets.yaml` with configurations available for several datasets. Here's a simple example entry for the `sift-128-euclidean` dataset:
-
-.. code-block:: yaml
-
-    - name: sift-128-euclidean
-      base_file: sift-128-euclidean/base.fbin
-      query_file: sift-128-euclidean/query.fbin
-      groundtruth_neighbors_file: sift-128-euclidean/groundtruth.neighbors.ibin
-      dims: 128
-      distance: euclidean
-
-Configuration files for ANN algorithms supported by `cuvs-bench` are provided in `${CUVS_HOME}/python/cuvs_bench/cuvs_bench/config/algos`. `cuvs_cagra` algorithm configuration looks like:
-
-.. code-block:: yaml
-
-    name: cuvs_cagra
-    constraints:
-      build: cuvs_bench.config.algos.constraints.cuvs_cagra_build
-      search: cuvs_bench.config.algos.constraints.cuvs_cagra_search
-    groups:
-      base:
-        build:
-          graph_degree: [32, 64]
-          intermediate_graph_degree: [64, 96]
-          graph_build_algo: ["NN_DESCENT"]
-        search:
-          itopk: [32, 64, 128]
-
-      large:
-        build:
-          graph_degree: [32, 64]
-        search:
-          itopk: [32, 64, 128]
-
-The default parameters for which the benchmarks are run can be overridden by creating a custom YAML file for algorithms with a `base` group.
-
-The config above has 3 fields:
-
-1. `name` - The name of the algorithm for which the parameters are being specified.
-2. `constraints` - Optional. Python import paths to functions that validate build and search parameter combinations (e.g. ``cuvs_bench.config.algos.constraints.cuvs_cagra_build``). Each function returns ``True`` if the parameters are valid, ``False`` otherwise; invalid combinations are skipped and not benchmarked.
-3. `groups` - Run groups, each with a set of parameters. Each group defines a cross-product of all hyper-parameter fields for `build` and `search`.
-
-The table below contains all algorithms supported by cuVS. Each unique algorithm will have its own set of `build` and `search` settings. The :doc:`ANN Algorithm Parameter Tuning Guide <param_tuning>` contains detailed instructions on choosing build and search parameters for each supported algorithm.
-
-.. list-table::
-
- * - Library
-   - Algorithms
-
- * - FAISS_GPU
-   - `faiss_gpu_flat`, `faiss_gpu_ivf_flat`, `faiss_gpu_ivf_pq`, `faiss_gpu_cagra`
-
- * - FAISS_CPU
-   - `faiss_cpu_flat`, `faiss_cpu_ivf_flat`, `faiss_cpu_ivf_pq`, `faiss_cpu_hnsw_flat`
-
- * - GGNN
-   - `ggnn`
-
- * - HNSWLIB
-   - `hnswlib`
-
- * - DiskANN
-   - `diskann_memory`, `diskann_ssd`
-
- * - cuVS
-   - `cuvs_brute_force`, `cuvs_cagra`, `cuvs_ivf_flat`, `cuvs_ivf_pq`, `cuvs_cagra_hnswlib`, `cuvs_vamana`
-
-
-Multi-GPU benchmarks
---------------------
-
-cuVS implements single node multi-GPU versions of IVF-Flat, IVF-PQ and CAGRA indexes.
-
-.. list-table::
-
- * - Index type
-   - Multi-GPU algo name
-
- * - IVF-Flat
-   - `cuvs_mg_ivf_flat`
-
- * - IVF-PQ
-   - `cuvs_mg_ivf_pq`
-
- * - CAGRA
-   - `cuvs_mg_cagra`
-
-
-Adding a new index algorithm
-============================
-
-Implementation and configuration
---------------------------------
-
-Implementation of a new algorithm should be a C++ class that inherits `class ANN` (defined in `cpp/bench/ann/src/ann.h`) and implements all the pure virtual functions.
-
-In addition, it should define two `struct`s for building and searching parameters. The searching parameter class should inherit `struct ANN<T>::AnnSearchParam`. Take `class HnswLib` as an example, its definition is:
-
-.. code-block:: c++
-
-    template<typename T>
-    class HnswLib : public ANN<T> {
-    public:
-      struct BuildParam {
-        int M;
-        int ef_construction;
-        int num_threads;
-      };
-
-      using typename ANN<T>::AnnSearchParam;
-      struct SearchParam : public AnnSearchParam {
-        int ef;
-        int num_threads;
-      };
-
-      // ...
-    };
-
-
-The benchmark program uses JSON format natively in a configuration file to specify indexes to build, along with the build and search parameters. However the JSON config files are overly verbose and are not meant to be used directly. Instead, the Python scripts parse YAML and create these json files automatically. It's important to realize that these json objects align with the yaml objects for `build_param`, whose value is a JSON object, and `search_param`, whose value is an array of JSON objects. Take the json configuration for `HnswLib` as an example of the json after it's been parsed from yaml:
-
-.. code-block:: json
-
-    {
-      "name" : "hnswlib.M12.ef500.th32",
-      "algo" : "hnswlib",
-      "build_param": {"M":12, "efConstruction":500, "numThreads":32},
-      "file" : "/path/to/file",
-      "search_params" : [
-        {"ef":10, "numThreads":1},
-        {"ef":20, "numThreads":1},
-        {"ef":40, "numThreads":1},
-      ],
-      "search_result_file" : "/path/to/file"
-    },
-
-The build and search params are ultimately passed to the C++ layer as json objects for each param configuration to benchmark. The code below shows how to parse these params for `Hnswlib`:
-
-1. First, add two functions for parsing JSON object to `struct BuildParam` and `struct SearchParam`, respectively:
-
-.. code-block:: c++
-
-    template<typename T>
-    void parse_build_param(const nlohmann::json& conf,
-                           typename cuann::HnswLib<T>::BuildParam& param) {
-      param.ef_construction = conf.at("efConstruction");
-      param.M = conf.at("M");
-      if (conf.contains("numThreads")) {
-        param.num_threads = conf.at("numThreads");
-      }
-    }
-
-    template<typename T>
-    void parse_search_param(const nlohmann::json& conf,
-                            typename cuann::HnswLib<T>::SearchParam& param) {
-      param.ef = conf.at("ef");
-      if (conf.contains("numThreads")) {
-        param.num_threads = conf.at("numThreads");
-      }
-    }
-
-
-
-2. Next, add corresponding `if` case to functions `create_algo()` (in `cpp/bench/ann/) and `create_search_param()` by calling parsing functions. The string literal in `if` condition statement must be the same as the value of `algo` in configuration file. For example,
-
-.. code-block:: c++
-
-      // JSON configuration file contains a line like:  "algo" : "hnswlib"
-      if (algo == "hnswlib") {
-         // ...
-      }
-
-Adding a Cmake target
----------------------
-
-In `cuvs/cpp/bench/ann/CMakeLists.txt`, we provide a `CMake` function to configure a new Benchmark target with the following signature:
-
-
-.. code-block:: cmake
-
-    ConfigureAnnBench(
-      NAME <algo_name>
-      PATH </path/to/algo/benchmark/source/file>
-      INCLUDES <additional_include_directories>
-      CXXFLAGS <additional_cxx_flags>
-      LINKS <additional_link_library_targets>
-    )
-
-To add a target for `HNSWLIB`, we would call the function as:
-
-.. code-block:: cmake
-
-    ConfigureAnnBench(
-      NAME HNSWLIB PATH bench/ann/src/hnswlib/hnswlib_benchmark.cpp INCLUDES
-      ${CMAKE_CURRENT_BINARY_DIR}/_deps/hnswlib-src/hnswlib CXXFLAGS "${HNSW_CXX_FLAGS}"
-    )
-
-This will create an executable called `HNSWLIB_ANN_BENCH`, which can then be used to run `HNSWLIB` benchmarks.
-
-Add a new entry to `algos.yaml` to map the name of the algorithm to its binary executable and specify whether the algorithm requires GPU support.
-
-.. code-block:: yaml
-
-    cuvs_ivf_pq:
-      executable: CUVS_IVF_PQ_ANN_BENCH
-      requires_gpu: true
-
-`executable` : specifies the name of the binary that will build/search the index. It is assumed to be available in `cuvs/cpp/build/`.
-`requires_gpu` : denotes whether an algorithm requires GPU to run.
-
-
-.. toctree::
-   :maxdepth: 4
-
-   build.rst
-   datasets.rst
-   param_tuning.rst
-   pluggable_backend.rst
-   wiki_all_dataset.rst
diff --git a/docs/source/cuvs_bench/param_tuning.md b/docs/source/cuvs_bench/param_tuning.md
new file mode 100644
index 0000000000..1464bc83b3
--- /dev/null
+++ b/docs/source/cuvs_bench/param_tuning.md
@@ -0,0 +1,894 @@
+# cuVS Bench Parameter Tuning Guide
+
+This guide describes the parameters that can be set in {doc}`cuVS Benchmarks <index>` YAML configuration files and explains how each one affects the corresponding algorithm, to help you choose settings when benchmarking at a desired level of recall.
+
+## Benchmark modes
+
+When you run benchmarks with `BenchmarkOrchestrator.run_benchmark()`, you can choose how parameters are explored:
+
+**Sweep mode (default)**
+
+Pass `mode="sweep"` or omit `mode`. The orchestrator builds the full Cartesian product of all build and search parameter lists defined in the algorithm YAML (see {doc}`Creating and customizing dataset configurations <index>`). Every valid combination (after constraint filtering) is run. Use this for exhaustive comparison across the configured parameter grid.
+
+**Tune mode**
+
+Pass `mode="tune"` to perform hyperparameter optimization using Optuna instead of running every combination. Tune mode requires `constraints` and accepts an optional `n_trials`:
+
+- **constraints** (dict): The optimization target and optional bounds. One metric must be `"maximize"` or `"minimize"` (the goal). Others can set hard limits with `{"min": X}` or `{"max": X}`. Examples: `{"recall": "maximize", "latency": {"max": 10}}` or `{"latency": "minimize", "recall": {"min": 0.95}}`.
+- **n_trials** (int, optional): Maximum number of Optuna trials (default 100). Ignored in sweep mode.
+
+Example:
+
+```python
+results = orchestrator.run_benchmark(
+    mode="tune",
+    dataset="deep-image-96-inner",
+    algorithms="cuvs_cagra",
+    constraints={"recall": "maximize", "latency": {"max": 5.0}},
+    n_trials=50,
+    count=10,
+    batch_size=10,
+)
+```
+
+The parameter tables below describe the build and search knobs that sweep mode varies and that tune mode can optimize.
+
+## cuVS Indexes
+
+### cuvs_brute_force
+
+Use cuVS brute-force index for exact search. Brute-force has no further build or search parameters.
+
+### cuvs_ivf_flat
+
+IVF-flat uses an inverted-file index, which partitions the vectors into a series of clusters, or lists, and stores them in an interleaved format optimized for fast distance computation. Searching an IVF-flat index restricts the candidate vectors to those in a user-specified number of nearest clusters, called probes.
+
+IVF-flat is a simple algorithm which won't save any space, but it provides competitive search times even at higher levels of recall.
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `nlist`
+  - `build`
+  - Y
+  - Positive integer >0
+  - 1024
+  - Number of clusters to partition the vectors into. Larger values put fewer points into each cluster but increase index build time, because more clusters need to be trained.
+
+* - `niter`
+  - `build`
+  - N
+  - Positive integer >0
+  - 20
+  - Number of k-means iterations to use when training the IVF clusters.
+
+* - `ratio`
+  - `build`
+  - N
+  - Positive integer >0
+  - 2
+  - `1/ratio` is the fraction of the dataset vectors used to train the clusters.
+
+* - `dataset_memory_type`
+  - `build`
+  - N
+  - [`device`, `host`, `mmap`]
+  - `mmap`
+  - Where should the dataset reside?
+
+* - `query_memory_type`
+  - `search`
+  - N
+  - [`device`, `host`, `mmap`]
+  - `device`
+  - Where should the queries reside?
+
+* - `nprobe`
+  - `search`
+  - Y
+  - Positive integer >0
+  -
+  - Number of closest clusters to search for each query vector. Larger values improve recall but search more points in the index.
+```
+
+### cuvs_ivf_pq
+
+IVF-PQ is an inverted-file index, which partitions the vectors into a series of clusters, or lists, in a similar way to IVF-flat above. The difference is that IVF-PQ also compresses the vectors using product quantization, giving the index a smaller memory footprint. Higher levels of compression can reduce recall, which a refinement step can recover when the original vectors are still available.
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `nlist`
+  - `build`
+  - Y
+  - Positive integer >0
+  - 1024
+  - Number of clusters to partition the vectors into. Larger values put fewer points into each cluster but increase index build time, because more clusters need to be trained.
+
+* - `niter`
+  - `build`
+  - N
+  - Positive integer >0
+  - 20
+  - Number of k-means iterations to use when training the IVF clusters.
+
+* - `ratio`
+  - `build`
+  - N
+  - Positive integer >0
+  - 2
+  - `1/ratio` is the fraction of the dataset vectors used to train the clusters.
+
+* - `pq_dim`
+  - `build`
+  - N
+  - Positive integer. Multiple of 8.
+  - 0
+  - Dimensionality of the vector after product quantization. When 0, a heuristic is used to select this value.
+
+* - `pq_bits`
+  - `build`
+  - N
+  - Positive integer [4-8]
+  - 8
+  - Bit length of the vector element after quantization.
+
+* - `codebook_kind`
+  - `build`
+  - N
+  - [`cluster`, `subspace`]
+  - `subspace`
+  - Type of codebook. See {doc}`IVF-PQ index overview <../neighbors/ivfpq>` for more detail
+
+* - `dataset_memory_type`
+  - `build`
+  - N
+  - [`device`, `host`, `mmap`]
+  - `mmap`
+  - Where should the dataset reside?
+
+* - `query_memory_type`
+  - `search`
+  - N
+  - [`device`, `host`, `mmap`]
+  - `device`
+  - Where should the queries reside?
+
+* - `nprobe`
+  - `search`
+  - Y
+  - Positive integer >0
+  - 20
+  - Number of closest clusters to search for each query vector. Larger values improve recall but search more points in the index.
+
+* - `internalDistanceDtype`
+  - `search`
+  - N
+  - [`float`, `half`]
+  - `half`
+  - The precision to use for the distance computations. Lower precision can increase performance at the cost of accuracy.
+
+* - `smemLutDtype`
+  - `search`
+  - N
+  - [`float`, `half`, `fp8`]
+  - `half`
+  - The precision to use for the lookup table in shared memory. Lower precision can increase performance at the cost of accuracy.
+
+* - `refine_ratio`
+  - `search`
+  - N
+  - Positive integer >0
+  - 1
+  - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors.
+```
+
+### cuvs_cagra
+
+CAGRA uses a graph-based index. It first creates an intermediate, approximate kNN graph (using IVF-PQ by default), then refines and optimizes it into a final kNN graph, which CAGRA uses as the index for search.
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `graph_degree`
+  - `build`
+  - N
+  - Positive integer >0
+  - 64
+  - Degree of the final kNN graph index.
+
+* - `intermediate_graph_degree`
+  - `build`
+  - N
+  - Positive integer >0
+  - 128
+  - Degree of the intermediate kNN graph before the CAGRA graph is optimized
+
+* - `graph_build_algo`
+  - `build`
+  - N
+  - [`IVF_PQ`, `NN_DESCENT`, `ACE`]
+  - `IVF_PQ`
+  - Algorithm used to build the initial kNN graph, which CAGRA then optimizes into the navigable CAGRA graph.
+
+* - `dataset_memory_type`
+  - `build`
+  - N
+  - [`device`, `host`, `mmap`]
+  - `mmap`
+  - Where should the dataset reside?
+
+* - `npartitions`
+  - `build`
+  - N
+  - Positive integer >0
+  - 1
+  - The number of partitions to use for the ACE build. Small values might improve recall but potentially degrade performance and increase memory usage. Partitions should not be too small, to prevent issues in kNN graph construction. The average partition size is `2 * (n_rows / npartitions) * dim * sizeof(T)`; the factor of 2 accounts for the core and augmented vectors. Account for imbalance in the partition sizes (up to 3x in our tests).
+
+* - `build_dir`
+  - `build`
+  - N
+  - String
+  - "/tmp/ace_build"
+  - The directory to use for the ACE build. Must be specified when using ACE build. This should be the fastest disk in the system and hold enough space for twice the dataset, final graph, and label mapping.
+
+* - `ef_construction`
+  - `build`
+  - Y
+  - Positive integer >0
+  - 120
+  - Controls index time and accuracy when using ACE build. Bigger values increase the index quality. At some point, increasing this will no longer improve the quality.
+
+* - `use_disk`
+  - `build`
+  - N
+  - Boolean
+  - `false`
+  - Whether to use disk-based storage for ACE build. When true, forces ACE to use disk-based storage even if the graph fits in host and GPU memory. When false, ACE will use in-memory storage if the graph fits in host and GPU memory and disk-based storage otherwise.
+
+* - `query_memory_type`
+  - `search`
+  - N
+  - [`device`, `host`, `mmap`]
+  - `device`
+  - Where should the queries reside?
+
+* - `itopk`
+  - `search`
+  - N
+  - Positive integer >0
+  - 64
+  - Number of intermediate search results retained during the search. Higher values improve search accuracy at the cost of speed
+
+* - `search_width`
+  - `search`
+  - N
+  - Positive integer >0
+  - 1
+  - Number of graph nodes to select as starting points for the search in each iteration.
+
+* - `max_iterations`
+  - `search`
+  - N
+  - Positive integer >=0
+  - 0
+  - Upper limit on search iterations. Selected automatically when 0.
+
+* - `algo`
+  - `search`
+  - N
+  - [`auto`, `single_cta`, `multi_cta`, `multi_kernel`]
+  - `auto`
+  - Algorithm to use for search. It's usually best to leave this to `auto`.
+
+* - `graph_memory_type`
+  - `search`
+  - N
+  - [`device`, `host_pinned`, `host_huge_page`]
+  - `device`
+  - Memory type used to store the graph.
+
+* - `internal_dataset_memory_type`
+  - `search`
+  - N
+  - [`device`, `host_pinned`, `host_huge_page`]
+  - `device`
+  - Memory type used to store the dataset.
+```
+
+The `graph_memory_type` and `internal_dataset_memory_type` options can be useful for large datasets that do not fit in device memory. Setting `internal_dataset_memory_type` to anything other than `device` has a negative impact on search speed. The `host_huge_page` option is only supported on systems with Heterogeneous Memory Management or on platforms that natively support GPU access to system-allocated memory, for example Grace Hopper.
+
+To fine-tune CAGRA index building, the IVF-PQ index builder options can be customized with the following settings. These take effect only if `graph_build_algo == "IVF_PQ"`. It is recommended to experiment with a separate IVF-PQ index to find the configuration that gives the highest QPS for large batches. Recall does not need to be very high, since CAGRA further optimizes the kNN graph. Some of the default values are derived from the dataset size, which is assumed to be `[n_vecs, dim]`.
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `ivf_pq_build_nlist`
+  - `build`
+  - N
+  - Positive integer >0
+  - sqrt(n_vecs)
+  - Number of clusters to partition the vectors into. Larger values put fewer points into each cluster but increase index build time, because more clusters need to be trained.
+
+* - `ivf_pq_build_niter`
+  - `build`
+  - N
+  - Positive integer >0
+  - 25
+  - Number of k-means iterations to use when training the clusters.
+
+* - `ivf_pq_build_ratio`
+  - `build`
+  - N
+  - Positive integer >0
+  - 10
+  - `1/ratio` is the fraction of the dataset vectors used to train the clusters.
+
+* - `ivf_pq_pq_dim`
+  - `build`
+  - N
+  - Positive integer. Multiple of 8
+  - dim/2 rounded up to 8
+  - Dimensionality of the vector after product quantization. When 0, a heuristic is used to select this value. `pq_dim` * `pq_bits` must be a multiple of 8.
+
+* - `ivf_pq_build_pq_bits`
+  - `build`
+  - N
+  - Positive integer [4-8]
+  - 8
+  - Bit length of the vector element after quantization.
+
+* - `ivf_pq_build_codebook_kind`
+  - `build`
+  - N
+  - [`cluster`, `subspace`]
+  - `subspace`
+  - Type of codebook. See {doc}`IVF-PQ index overview <../neighbors/ivfpq>` for more detail
+
+* - `ivf_pq_build_nprobe`
+  - `search`
+  - N
+  - Positive integer >0
+  - min(2*dim, nlist)
+  - Number of closest clusters to search for each query vector. Larger values improve recall but search more points in the index.
+
+* - `ivf_pq_build_internalDistanceDtype`
+  - `search`
+  - N
+  - [`float`, `half`]
+  - `half`
+  - The precision to use for the distance computations. Lower precision can increase performance at the cost of accuracy.
+
+* - `ivf_pq_build_smemLutDtype`
+  - `search`
+  - N
+  - [`float`, `half`, `fp8`]
+  - `fp8`
+  - The precision to use for the lookup table in shared memory. Lower precision can increase performance at the cost of accuracy.
+
+* - `ivf_pq_build_refine_ratio`
+  - `search`
+  - N
+  - Positive integer >0
+  - 2
+  - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors.
+```
+
+Alternatively, if `graph_build_algo == "NN_DESCENT"`, the following parameters can be customized:
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `nn_descent_niter`
+  - `build`
+  - N
+  - Positive integer >0
+  - 20
+  - Number of nn-descent iterations
+
+* - `nn_descent_intermediate_graph_degree`
+  - `build`
+  - N
+  - Positive integer >0
+  - `cagra.intermediate_graph_degree` * 1.5
+  - Intermediate graph degree during nn-descent iterations
+
+* - `nn_descent_termination_threshold`
+  - `build`
+  - N
+  - Positive float >0
+  - 1e-4
+  - Early stopping threshold for nn-descent convergence
+```
+
+### cuvs_cagra_hnswlib
+
+This benchmark demonstrates interoperability between a `CAGRA`-built graph and `HNSW` search. It uses the `CAGRA`-built graph as the base layer of an `hnswlib` index and searches queries only within that base layer (enabled by a simple patch to `hnswlib`).
+
+`build` : Same as `build` of CAGRA
+
+`search` : Same as `search` of Hnswlib
+
+### cuvs_vamana
+
+Benchmark for building an in-memory Vamana graph-based index on the GPU, with DiskANN interoperability for search.
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `graph_degree`
+  - `build`
+  - N
+  - Positive integer >0
+  - 32
+  - Maximum degree of the graph index
+
+* - `visited_size`
+  - `build`
+  - N
+  - Positive integer >0
+  - 64
+  - Maximum number of visited nodes per search; corresponds to the L parameter in the Vamana literature.
+
+* - `alpha`
+  - `build`
+  - N
+  - Positive float >0
+  - 1.2
+  - Alpha parameter that controls pruning during graph construction.
+
+* - `L_search`
+  - `search`
+  - Y
+  - Positive integer >0
+  -
+  - Maximum number of visited nodes per search; corresponds to the L parameter in the Vamana literature. Larger values improve recall at the cost of search time.
+```
+
+## FAISS Indexes
+
+### faiss_gpu_flat
+
+Use FAISS flat index on the GPU, which performs an exact search using brute-force and doesn't have any further build or search parameters.
+
+### faiss_gpu_ivf_flat
+
+IVF-flat uses an inverted-file index, which partitions the vectors into a series of clusters, or lists, and stores them in an interleaved format optimized for fast distance computation. Searching an IVF-flat index restricts the candidate vectors to those in a user-specified number of nearest clusters, called probes.
+
+IVF-flat is a simple algorithm which won't save any space, but it provides competitive search times even at higher levels of recall.
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `nlists`
+  - `build`
+  - Y
+  - Positive integer >0
+  -
+  - Number of clusters to partition the vectors into. Larger values put fewer points into each cluster but increase index build time, because more clusters need to be trained.
+
+* - `ratio`
+  - `build`
+  - N
+  - Positive integer >0
+  - 2
+  - `1/ratio` is the fraction of the dataset vectors used to train the clusters.
+
+* - `nprobe`
+  - `search`
+  - Y
+  - Positive integer >0
+  - 20
+  - Number of closest clusters to search for each query vector. Larger values improve recall but search more points in the index.
+```
+
+### faiss_gpu_ivf_pq
+
+IVF-PQ is an inverted-file index, which partitions the vectors into a series of clusters, or lists, in a similar way to IVF-flat above. The difference is that IVF-PQ also compresses the vectors using product quantization, giving the index a smaller memory footprint. Higher levels of compression can reduce recall, which a refinement step can recover when the original vectors are still available.
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `nlist`
+  - `build`
+  - Y
+  - Positive integer >0
+  -
+  - Number of clusters to partition the vectors into. Larger values put fewer points into each cluster but increase index build time, because more clusters need to be trained.
+
+* - `ratio`
+  - `build`
+  - N
+  - Positive integer >0
+  - 2
+  - `1/ratio` is the fraction of the dataset vectors used to train the clusters.
+
+* - `M_ratio`
+  - `build`
+  - Y
+  - Positive integer. Power of 2 [8-64]
+  -
+  - Ratio used to derive the number of chunks (subquantizers) for each vector, computed as `dims` / `M_ratio`.
+
+* - `usePrecomputed`
+  - `build`
+  - N
+  - Boolean
+  - `false`
+  - Use pre-computed lookup tables to speed up search at the cost of increased memory usage.
+
+* - `useFloat16`
+  - `build`
+  - N
+  - Boolean
+  - `false`
+  - Use half-precision floats for clustering step.
+
+* - `nprobe`
+  - `search`
+  - Y
+  - Positive integer >0
+  -
+  - Number of closest clusters to search for each query vector. Larger values improve recall but search more points in the index.
+
+* - `refine_ratio`
+  - `search`
+  - N
+  - Positive number >=1
+  - 1
+  - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors.
+```
+
+### faiss_cpu_flat
+
+Use FAISS flat index on the CPU, which performs an exact search using brute-force and doesn't have any further build or search parameters.
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `numThreads`
+  - `search`
+  - N
+  - Positive integer >0
+  - 1
+  - Number of threads to use for queries.
+```
+
+### faiss_cpu_ivf_flat
+
+Use FAISS IVF-Flat index on CPU
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `nlists`
+  - `build`
+  - Y
+  - Positive integer >0
+  -
+  - Number of clusters to partition the vectors into. Larger values put fewer points into each cluster but increase index build time, because more clusters need to be trained.
+
+* - `ratio`
+  - `build`
+  - N
+  - Positive integer >0
+  - 2
+  - `1/ratio` is the fraction of the dataset vectors used to train the clusters.
+
+* - `nprobe`
+  - `search`
+  - Y
+  - Positive integer >0
+  -
+  - Number of closest clusters to search for each query vector. Larger values improve recall but search more points in the index.
+
+* - `numThreads`
+  - `search`
+  - N
+  - Positive integer >0
+  - 1
+  - Number of threads to use for queries.
+```
+
+### faiss_cpu_ivf_pq
+
+Use FAISS IVF-PQ index on CPU
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `nlist`
+  - `build`
+  - Y
+  - Positive integer >0
+  -
+  - Number of clusters to partition the vectors into. Larger values put fewer points into each cluster but increase index build time, because more clusters need to be trained.
+
+* - `ratio`
+  - `build`
+  - N
+  - Positive integer >0
+  - 2
+  - `1/ratio` is the fraction of the dataset vectors used to train the clusters.
+
+* - `M`
+  - `build`
+  - Y
+  - Positive integer. Power of 2 [8-64]
+  -
+  - Ratio of number of chunks or subquantizers for each vector. Computed by `dims` / `M_ratio`
+
+* - `usePrecomputed`
+  - `build`
+  - N
+  - Boolean
+  - `false`
+  - Use pre-computed lookup tables to speed up search at the cost of increased memory usage.
+
+* - `bitsPerCode`
+  - `build`
+  - N
+  - Positive integer [4-8]
+  - 8
+  - Number of bits for representing each quantized code.
+
+* - `nprobe`
+  - `search`
+  - Y
+  - Positive integer >0
+  -
+  - Number of closest clusters to search for each query vector. Larger values improve recall but search more points in the index.
+
+* - `refine_ratio`
+  - `search`
+  - N
+  - Positive number >=1
+  - 1
+  - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors.
+
+* - `numThreads`
+  - `search`
+  - N
+  - Positive integer >0
+  - 1
+  - Number of threads to use for queries.
+```
+
+## HNSW
+
+### cuvs_hnsw
+
+cuVS HNSW builds an HNSW index using the ACE (Augmented Core Extraction) algorithm, which enables GPU-accelerated HNSW index construction for datasets too large to fit in GPU memory.
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `hierarchy`
+  - `build`
+  - N
+  - [`NONE`, `CPU`, `GPU`]
+  - `NONE`
+  - Type of HNSW hierarchy to build. `NONE` creates a base-layer-only index, `CPU` builds full hierarchy on CPU, `GPU` builds full hierarchy on GPU.
+
+* - `efConstruction`
+  - `build`
+  - Y
+  - Positive integer >0
+  -
+  - Controls index time and accuracy. Bigger values increase the index quality. At some point, increasing this will no longer improve the quality.
+
+* - `M`
+  - `build`
+  - Y
+  - Positive integer. Often between 2-100
+  -
+  - Number of bi-directional links created for every new element during construction. Higher values work better for datasets with high intrinsic dimensionality and/or high recall targets; lower values can work for datasets with low intrinsic dimensionality and/or low recall targets. Also affects the algorithm's memory consumption.
+
+* - `numThreads`
+  - `build`
+  - N
+  - Positive integer >0
+  - 1
+  - Number of threads to use to build the index.
+
+* - `npartitions`
+  - `build`
+  - N
+  - Positive integer >0
+  - 1
+  - Number of partitions to use for the ACE build. Small values might improve recall but potentially degrade performance and increase memory usage. The average partition size is `2 * (n_rows / npartitions) * dim * sizeof(T)`; the factor of 2 accounts for the core and augmented vectors. Account for imbalance in the partition sizes (up to 3x in our tests).
+
+* - `ef_construction`
+  - `build`
+  - N
+  - Positive integer >0
+  - 120
+  - Controls index time and accuracy when using ACE build. Bigger values increase the index quality. At some point, increasing this will no longer improve the quality.
+
+* - `build_dir`
+  - `build`
+  - N
+  - String
+  - "/tmp/ace_build"
+  - The directory to use for the ACE build. This should be the fastest disk in the system and hold enough space for twice the dataset, final graph, and label mapping.
+
+* - `use_disk`
+  - `build`
+  - N
+  - Boolean
+  - `false`
+  - Whether to use disk-based storage for ACE build. When true, forces ACE to use disk-based storage even if the graph fits in host and GPU memory. When false, ACE will use in-memory storage if the graph fits in host and GPU memory and disk-based storage otherwise.
+
+* - `ef`
+  - `search`
+  - Y
+  - Positive integer >0
+  -
+  - Size of the dynamic list for the nearest neighbors used for search. Higher value leads to more accurate but slower search. Cannot be lower than `k`.
+
+* - `numThreads`
+  - `search`
+  - N
+  - Positive integer >0
+  - 1
+  - Number of threads to use for queries.
+```
+
+### hnswlib
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `efConstruction`
+  - `build`
+  - Y
+  - Positive integer >0
+  -
+  - Controls index time and accuracy. Bigger values increase the index quality. At some point, increasing this will no longer improve the quality.
+
+* - `M`
+  - `build`
+  - Y
+  - Positive integer. Often between 2-100
+  -
+  - Number of bi-directional links created for every new element during construction. Higher values work better for datasets with high intrinsic dimensionality and/or high recall targets; lower values can work for datasets with low intrinsic dimensionality and/or low recall targets. Also affects the algorithm's memory consumption.
+
+* - `numThreads`
+  - `build`
+  - N
+  - Positive integer >0
+  - 1
+  - Number of threads to use to build the index.
+
+* - `ef`
+  - `search`
+  - Y
+  - Positive integer >0
+  -
+  - Size of the dynamic list for the nearest neighbors used for search. Higher value leads to more accurate but slower search. Cannot be lower than `k`.
+
+* - `numThreads`
+  - `search`
+  - N
+  - Positive integer >0
+  - 1
+  - Number of threads to use for queries.
+```
+
+Please refer to [HNSW algorithm parameters guide](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) from `hnswlib` to learn more about these arguments.
+
+## DiskANN
+
+### diskann_memory
+
+Use DiskANN in-memory index for approximate search.
+
+```{list-table}
+* - Parameter
+  - Type
+  - Required
+  - Data Type
+  - Default
+  - Description
+
+* - `R`
+  - `build`
+  - Y
+  - Positive integer >0
+  -
+  - Maximum degree of the graph index
+
+* - `L_build`
+  - `build`
+  - Y
+  - Positive integer >0
+  -
+  - Number of visited nodes per greedy search during graph construction.
+
+* - `alpha`
+  - `build`
+  - N
+  - Positive number >=1
+  - 1.2
+  - Controls the pruning parameter of the graph construction.
+
+* - `num_threads`
+  - `build`
+  - N
+  - Positive integer >0
+  - omp_get_max_threads()
+  - Number of CPU threads to use to build the index.
+
+* - `L_search`
+  - `search`
+  - Y
+  - Positive integer >0
+  -
+  - Visited list size during search.
+```
+
diff --git a/docs/source/cuvs_bench/param_tuning.rst b/docs/source/cuvs_bench/param_tuning.rst
deleted file mode 100644
index 692fd7eb6a..0000000000
--- a/docs/source/cuvs_bench/param_tuning.rst
+++ /dev/null
@@ -1,918 +0,0 @@
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-cuVS Bench Parameter Tuning Guide
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This guide outlines the various parameter settings that can be specified in :doc:`cuVS Benchmarks <index>` yaml configuration files and explains the impact they have on corresponding algorithms to help inform their settings for benchmarking across desired levels of recall.
-
-Benchmark modes
-===============
-
-When you run benchmarks with ``BenchmarkOrchestrator.run_benchmark()``, you can choose how parameters are explored:
-
-**Sweep mode (default)**
-
-Pass ``mode="sweep"`` or omit ``mode``. The orchestrator builds the full Cartesian product of all build and search parameter lists defined in the algorithm YAML (see :doc:`Creating and customizing dataset configurations <index>`). Every valid combination (after constraint filtering) is run. Use this for exhaustive comparison across the configured parameter grid.
-
-**Tune mode**
-
-Pass ``mode="tune"`` to perform hyperparameter optimization using Optuna instead of running every combination. You must pass:
-
-- **constraints** (dict): The optimization target and optional bounds. One metric must be ``"maximize"`` or ``"minimize"`` (the goal). Others can set hard limits with ``{"min": X}`` or ``{"max": X}``. Examples: ``{"recall": "maximize", "latency": {"max": 10}}`` or ``{"latency": "minimize", "recall": {"min": 0.95}}``.
-- **n_trials** (int, optional): Maximum number of Optuna trials (default 100). Ignored in sweep mode.
-
-Example:
-
-.. code-block:: python
-
-    results = orchestrator.run_benchmark(
-        mode="tune",
-        dataset="deep-image-96-inner",
-        algorithms="cuvs_cagra",
-        constraints={"recall": "maximize", "latency": {"max": 5.0}},
-        n_trials=50,
-        count=10,
-        batch_size=10,
-    )
-
-The parameter tables below describe the build and search knobs that sweep mode varies and that tune mode can optimize.
-
-cuVS Indexes
-============
-
-cuvs_brute_force
-----------------
-
-Use cuVS brute-force index for exact search. Brute-force has no further build or search parameters.
-
-cuvs_ivf_flat
--------------
-
-IVF-flat uses an inverted-file index, which partitions the vectors into a series of clusters, or lists, storing them in an interleaved format which is optimized for fast distance computation. The searching of an IVF-flat index reduces the total vectors in the index to those within some user-specified nearest clusters called probes.
-
-IVF-flat is a simple algorithm which won't save any space, but it provides competitive search times even at higher levels of recall.
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `nlist`
-   - `build`
-   - Y
-   - Positive integer >0
-   - 1024
-   - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained.
-
- * - `niter`
-   - `build`
-   - N
-   - Positive integer >0
-   - 20
-   - Number of kmeans iterations to use when training the ivf clusters
-
- * - `ratio`
-   - `build`
-   - N
-   - Positive integer >0
-   - 2
-   - `1/ratio` is the number of training points which should be used to train the clusters.
-
- * - `dataset_memory_type`
-   - `build`
-   - N
-   - [`device`, `host`, `mmap`]
-   - `mmap`
-   - Where should the dataset reside?
-
- * - `query_memory_type`
-   - `search`
-   - N
-   - [`device`, `host`, `mmap`]
-   - `device`
-   - Where should the queries reside?
-
- * - `nprobe`
-   - `search`
-   - Y
-   - Positive integer >0
-   -
-   - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index.
-
-
-cuvs_ivf_pq
------------
-
-IVF-pq is an inverted-file index, which partitions the vectors into a series of clusters, or lists, in a similar way to IVF-flat above. The difference is that IVF-PQ uses product quantization to also compress the vectors, giving the index a smaller memory footprint. Unfortunately, higher levels of compression can also shrink recall, which a refinement step can improve when the original vectors are still available.
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `nlist`
-   - `build`
-   - Y
-   - Positive integer >0
-   - 1024
-   - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained.
-
- * - `niter`
-   - `build`
-   - N
-   - Positive integer >0
-   - 20
-   - Number of kmeans iterations to use when training the ivf clusters
-
- * - `ratio`
-   - `build`
-   - N
-   - Positive integer >0
-   - 2
-   - `1/ratio` is the number of training points which should be used to train the clusters.
-
- * - `pq_dim`
-   - `build`
-   - N
-   - Positive integer. Multiple of 8.
-   - 0
-   - Dimensionality of the vector after product quantization. When 0, a heuristic is used to select this value.
-
- * - `pq_bits`
-   - `build`
-   - N
-   - Positive integer [4-8]
-   - 8
-   - Bit length of the vector element after quantization.
-
- * - `codebook_kind`
-   - `build`
-   - N
-   - [`cluster`, `subspace`]
-   - `subspace`
-   - Type of codebook. See :doc:`IVF-PQ index overview <../neighbors/ivfpq>` for more detail
-
- * - `dataset_memory_type`
-   - `build`
-   - N
-   - [`device`, `host`, `mmap`]
-   - `mmap`
-   - Where should the dataset reside?
-
- * - `query_memory_type`
-   - `search`
-   - N
-   - [`device`, `host`, `mmap`]
-   - `device`
-   - Where should the queries reside?
-
- * - `nprobe`
-   - `search`
-   - Y
-   - Positive integer >0
-   - 20
-   - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index.
-
- * - `internalDistanceDtype`
-   - `search`
-   - N
-   - [`float`, `half`]
-   - `half`
-   - The precision to use for the distance computations. Lower precision can increase performance at the cost of accuracy.
-
- * - `smemLutDtype`
-   - `search`
-   - N
-   - [`float`, `half`, `fp8`]
-   - `half`
-   - The precision to use for the lookup table in shared memory. Lower precision can increase performance at the cost of accuracy.
-
- * - `refine_ratio`
-   - `search`
-   - N
-   - Positive integer >0
-   - 1
-   - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors.
-
-
-cuvs_cagra
-----------
-
-CAGRA uses a graph-based index, which creates an intermediate, approximate kNN graph using IVF-PQ and then further refining and optimizing to create a final kNN graph. This kNN graph is used by CAGRA as an index for search.
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `graph_degree`
-   - `build`
-   - N
-   - Positive integer >0
-   - 64
-   - Degree of the final kNN graph index.
-
- * - `intermediate_graph_degree`
-   - `build`
-   - N
-   - Positive integer >0
-   - 128
-   - Degree of the intermediate kNN graph before the CAGRA graph is optimized
-
- * - `graph_build_algo`
-   - `build`
-   - `N`
-   - [`IVF_PQ`, `NN_DESCENT`, `ACE`]
-   - `IVF_PQ`
-   - Algorithm to use for building the initial kNN graph, from which CAGRA will optimize into the navigable CAGRA graph
-
- * - `dataset_memory_type`
-   - `build`
-   - N
-   - [`device`, `host`, `mmap`]
-   - `mmap`
-   - Where should the dataset reside?
-
- * - `npartitions`
-   - `build`
-   - N
-   - Positive integer >0
-   - 1
-   - The number of partitions to use for the ACE build. Small values might improve recall but potentially degrade performance and increase memory usage. Partitions should not be too small to prevent issues in KNN graph construction. The partition size is on average 2 * (n_rows / npartitions) * dim * sizeof(T). 2 is because of the core and augmented vectors. Please account for imbalance in the partition sizes (up to 3x in our tests).
-
- * - `build_dir`
-   - `build`
-   - N
-   - String
-   - "/tmp/ace_build"
-   - The directory to use for the ACE build. Must be specified when using ACE build. This should be the fastest disk in the system and hold enough space for twice the dataset, final graph, and label mapping.
-
- * - `ef_construction`
-   - `build`
-   - Y
-   - Positive integer >0
-   - 120
-   - Controls index time and accuracy when using ACE build. Bigger values increase the index quality. At some point, increasing this will no longer improve the quality.
-
- * - `use_disk`
-   - `build`
-   - N
-   - Boolean
-   - `false`
-   - Whether to use disk-based storage for ACE build. When true, forces ACE to use disk-based storage even if the graph fits in host and GPU memory. When false, ACE will use in-memory storage if the graph fits in host and GPU memory and disk-based storage otherwise.
-
- * - `query_memory_type`
-   - `search`
-   - N
-   - [`device`, `host`, `mmap`]
-   - `device`
-   - Where should the queries reside?
-
- * - `itopk`
-   - `search`
-   - N
-   - Positive integer >0
-   - 64
-   - Number of intermediate search results retained during the search. Higher values improve search accuracy at the cost of speed
-
- * - `search_width`
-   - `search`
-   - N
-   - Positive integer >0
-   - 1
-   - Number of graph nodes to select as the starting point for the search in each iteration.
-
- * - `max_iterations`
-   - `search`
-   - N
-   - Positive integer >=0
-   - 0
-   - Upper limit of search iterations. Auto select when 0
-
- * - `algo`
-   - `search`
-   - N
-   - [`auto`, `single_cta`, `multi_cta`, `multi_kernel`]
-   - `auto`
-   - Algorithm to use for search. It's usually best to leave this to `auto`.
-
- * - `graph_memory_type`
-   - `search`
-   - N
-   - [`device`, `host_pinned`, `host_huge_page`]
-   - `device`
-   - Memory type to store graph
-
- * - `internal_dataset_memory_type`
-   - `search`
-   - N
-   - [`device`, `host_pinned`, `host_huge_page`]
-   - `device`
-   - Memory type to store dataset
-
-The `graph_memory_type` or `internal_dataset_memory_type` options can be useful for large datasets that do not fit the device memory. Setting `internal_dataset_memory_type` other than `device` has negative impact on search speed. Using `host_huge_page` option is only supported on systems with Heterogeneous Memory Management or on platforms that natively support GPU access to system allocated memory, for example Grace Hopper.
-
-To fine tune CAGRA index building we can customize IVF-PQ index builder options using the following settings. These take effect only if `graph_build_algo == "IVF_PQ"`. It is recommended to experiment using a separate IVF-PQ index to find the config that gives the largest QPS for large batch. Recall does not need to be very high, since CAGRA further optimizes the kNN neighbor graph. Some of the default values are derived from the dataset size which is assumed to be [n_vecs, dim].
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `ivf_pq_build_nlist`
-   - `build`
-   - N
-   - Positive integer >0
-   - sqrt(n_vecs)
-   - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained.
-
- * - `ivf_pq_build_niter`
-   - `build`
-   - N
-   - Positive integer >0
-   - 25
-   - Number of k-means iterations to use when training the clusters.
-
- * - `ivf_pq_build_ratio`
-   - `build`
-   - N
-   - Positive integer >0
-   - 10
-   - `1/ratio` is the number of training points which should be used to train the clusters.
-
- * - `ivf_pq_pq_dim`
-   - `build`
-   - N
-   - Positive integer. Multiple of 8
-   - dim/2 rounded up to 8
-   - Dimensionality of the vector after product quantization. When 0, a heuristic is used to select this value. `pq_dim` * `pq_bits` must be a multiple of 8.
-
- * - `ivf_pq_build_pq_bits`
-   - `build`
-   - N
-   - Positive integer [4-8]
-   - 8
-   - Bit length of the vector element after quantization.
-
- * - `ivf_pq_build_codebook_kind`
-   - `build`
-   - N
-   - [`cluster`, `subspace`]
-   - `subspace`
-   - Type of codebook. See :doc:`IVF-PQ index overview <../neighbors/ivfpq>` for more detail
-
- * - `ivf_pq_build_nprobe`
-   - `search`
-   - N
-   - Positive integer >0
-   - min(2*dim, nlist)
-   - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index.
-
- * - `ivf_pq_build_internalDistanceDtype`
-   - `search`
-   - N
-   - [`float`, `half`]
-   - `half`
-   - The precision to use for the distance computations. Lower precision can increase performance at the cost of accuracy.
-
- * - `ivf_pq_build_smemLutDtype`
-   - `search`
-   - N
-   - [`float`, `half`, `fp8`]
-   - `fp8`
-   - The precision to use for the lookup table in shared memory. Lower precision can increase performance at the cost of accuracy.
-
- * - `ivf_pq_build_refine_ratio`
-   - `search`
-   - N
-   - Positive integer >0
-   - 2
-   - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors.
-
-Alternatively, if `graph_build_algo == "NN_DESCENT"`, then we can customize the following parameters
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `nn_descent_niter`
-   - `build`
-   - N
-   - Positive integer >0
-   - 20
-   - Number of nn-descent iterations
-
- * - `nn_descent_intermediate_graph_degree`
-   - `build`
-   - N
-   - Positive integer >0
-   - `cagra.intermediate_graph_degree` * 1.5
-   - Intermadiate graph degree during nn-descent iterations
-
- * - nn_descent_termination_threshold
-   - `build`
-   - N
-   - Positive float >0
-   - 1e-4
-   - Early stopping threshold for nn-descent convergence
-
-cuvs_cagra_hnswlib
-------------------
-
-This is a benchmark that enables interoperability between `CAGRA` built `HNSW` search. It uses the `CAGRA` built graph as the base layer of an `hnswlib` index to search queries only within the base layer (this is enabled with a simple patch to `hnswlib`).
-
-`build` : Same as `build` of CAGRA
-
-`search` : Same as `search` of Hnswlib
-
-cuvs_vamana
------------
-
-Benchmark for building an in-memory Vamana graph based index on the GPU and interoperability with DiskANN for search.
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `graph_degree`
-   - `build`
-   - N
-   - Positive integer >0
-   - 32
-   - Maximum degree of the graph index
-
- * - `visited_size`
-   - `build`
-   - N
-   - Positive integer >0
-   - 64
-   - Maximum number of visited nodes per search corresponds to the L parameter in the Vamana literature
-
- * - `alpha`
-   - `build`
-   -  N
-   - Positive float >0
-   - 1.2
-   - Alpha for pruning parameter
-
- * - `L_search`
-   - `search`
-   - Y
-   - Positive integer >0
-   -
-   - Maximum number of visited nodes per search corresponds to the L parameter in the Vamana literature. Larger values improve recall at the cost of search time.
-
-FAISS Indexes
-=============
-
-faiss_gpu_flat
---------------
-
-Use FAISS flat index on the GPU, which performs an exact search using brute-force and doesn't have any further build or search parameters.
-
-faiss_gpu_ivf_flat
-------------------
-
-IVF-flat uses an inverted-file index, which partitions the vectors into a series of clusters, or lists, storing them in an interleaved format which is optimized for fast distance computation. The searching of an IVF-flat index reduces the total vectors in the index to those within some user-specified nearest clusters called probes.
-
-IVF-flat is a simple algorithm which won't save any space, but it provides competitive search times even at higher levels of recall.
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `nlists`
-   - `build`
-   - Y
-   - Positive integer >0
-   -
-   - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained
-
- * - `ratio`
-   - `build`
-   - N
-   - Positive integer >0
-   - 2
-   - `1/ratio` is the number of training points which should be used to train the clusters.
-
- * - `nprobe`
-   - `search`
-   - Y
-   - Positive integer >0
-   - 20
-   - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index.
-
-faiss_gpu_ivf_pq
-----------------
-
-IVF-pq is an inverted-file index, which partitions the vectors into a series of clusters, or lists, in a similar way to IVF-flat above. The difference is that IVF-PQ uses product quantization to also compress the vectors, giving the index a smaller memory footprint. Unfortunately, higher levels of compression can also shrink recall, which a refinement step can improve when the original vectors are still available.
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `nlist`
-   - `build`
-   - Y
-   - Positive integer >0
-   -
-   - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained.
-
- * - `ratio`
-   - `build`
-   - N
-   - Positive integer >0
-   - 2
-   - `1/ratio` is the number of training points which should be used to train the clusters.
-
- * - `M_ratio`
-   - `build`
-   - Y
-   - Positive integer. Power of 2 [8-64]
-   -
-   - Ratio of number of chunks or subquantizers for each vector. Computed by `dims` / `M_ratio`
-
- * - `usePrecomputed`
-   - `build`
-   - N
-   - Boolean
-   - `false`
-   - Use pre-computed lookup tables to speed up search at the cost of increased memory usage.
-
- * - `useFloat16`
-   - `build`
-   - N
-   - Boolean
-   - `false`
-   - Use half-precision floats for clustering step.
-
- * - `nprobe`
-   - `search`
-   - Y
-   - Positive integer >0
-   -
-   - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index.
-
- * - `refine_ratio`
-   - `search`
-   - N
-   - Positive number >=1
-   - 1
-   - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors.
-
-
-faiss_cpu_flat
---------------
-
-Use FAISS flat index on the CPU, which performs an exact search using brute-force and doesn't have any further build or search parameters.
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `numThreads`
-   - `search`
-   - N
-   - Positive integer >0
-   - 1
-   - Number of threads to use for queries.
-
-faiss_cpu_ivf_flat
-------------------
-
-Use FAISS IVF-Flat index on CPU
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `nlists`
-   - `build`
-   - Y
-   - Positive integer >0
-   -
-   - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained
-
- * - `ratio`
-   - `build`
-   - N
-   - Positive integer >0
-   - 2
-   - `1/ratio` is the number of training points which should be used to train the clusters.
-
- * - `nprobe`
-   - `search`
-   - Y
-   - Positive integer >0
-   -
-   - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index.
-
- * - `numThreads`
-   - `search`
-   - N
-   - Positive integer >0
-   - 1
-   - Number of threads to use for queries.
-
-faiss_cpu_ivf_pq
-----------------
-
-Use FAISS IVF-PQ index on CPU
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `nlist`
-   - `build`
-   - Y
-   - Positive integer >0
-   -
-   - Number of clusters to partition the vectors into. Larger values will put less points into each cluster but this will impact index build time as more clusters need to be trained.
-
- * - `ratio`
-   - `build`
-   - N
-   - Positive integer >0
-   - 2
-   - `1/ratio` is the number of training points which should be used to train the clusters.
-
- * - `M`
-   - `build`
-   - Y
-   - Positive integer. Power of 2 [8-64]
-   -
-   - Ratio of number of chunks or subquantizers for each vector. Computed by `dims` / `M_ratio`
-
- * - `usePrecomputed`
-   - `build`
-   - N
-   - Boolean
-   - `false`
-   - Use pre-computed lookup tables to speed up search at the cost of increased memory usage.
-
- * - `bitsPerCode`
-   - `build`
-   - N
-   - Positive integer [4-8]
-   - 8
-   - Number of bits for representing each quantized code.
-
- * - `nprobe`
-   - `search`
-   - Y
-   - Positive integer >0
-   -
-   - The closest number of clusters to search for each query vector. Larger values will improve recall but will search more points in the index.
-
- * - `refine_ratio`
-   - `search`
-   - N
-   - Positive number >=1
-   - 1
-   - `refine_ratio * k` nearest neighbors are queried from the index initially and an additional refinement step improves recall by selecting only the best `k` neighbors.
-
- * - `numThreads`
-   - `search`
-   - N
-   - Positive integer >0
-   - 1
-   - Number of threads to use for queries.
-
-HNSW
-====
-
-cuvs_hnsw
----------
-
-cuVS HNSW builds an HNSW index using the ACE (Augmented Core Extraction) algorithm, which enables GPU-accelerated HNSW index construction for datasets too large to fit in GPU memory.
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `hierarchy`
-   - `build`
-   - N
-   - [`NONE`, `CPU`, `GPU`]
-   - `NONE`
-   - Type of HNSW hierarchy to build. `NONE` creates a base-layer-only index, `CPU` builds full hierarchy on CPU, `GPU` builds full hierarchy on GPU.
-
- * - `efConstruction`
-   - `build`
-   - Y
-   - Positive integer >0
-   -
-   - Controls index time and accuracy. Bigger values increase the index quality. At some point, increasing this will no longer improve the quality.
-
- * - `M`
-   - `build`
-   - Y
-   - Positive integer. Often between 2-100
-   -
-   - Number of bi-directional links create for every new element during construction. Higher values work for higher intrinsic dimensionality and/or high recall, low values can work for datasets with low intrinsic dimensionality and/or low recalls. Also affects the algorithm's memory consumption.
-
- * - `numThreads`
-   - `build`
-   - N
-   - Positive integer >0
-   - 1
-   - Number of threads to use to build the index.
-
- * - `npartitions`
-   - `build`
-   - N
-   - Positive integer >0
-   - 1
-   - Number of partitions to use for the ACE build. Small values might improve recall but potentially degrade performance and increase memory usage. The partition size is on average 2 * (n_rows / npartitions) * dim * sizeof(T). 2 is because of the core and augmented vectors. Please account for imbalance in the partition sizes (up to 3x in our tests).
-
- * - `ef_construction`
-   - `build`
-   - N
-   - Positive integer >0
-   - 120
-   - Controls index time and accuracy when using ACE build. Bigger values increase the index quality. At some point, increasing this will no longer improve the quality.
-
- * - `build_dir`
-   - `build`
-   - N
-   - String
-   - "/tmp/ace_build"
-   - The directory to use for the ACE build. This should be the fastest disk in the system and hold enough space for twice the dataset, final graph, and label mapping.
-
- * - `use_disk`
-   - `build`
-   - N
-   - Boolean
-   - `false`
-   - Whether to use disk-based storage for ACE build. When true, forces ACE to use disk-based storage even if the graph fits in host and GPU memory. When false, ACE will use in-memory storage if the graph fits in host and GPU memory and disk-based storage otherwise.
-
- * - `ef`
-   - `search`
-   - Y
-   - Positive integer >0
-   -
-   - Size of the dynamic list for the nearest neighbors used for search. Higher value leads to more accurate but slower search. Cannot be lower than `k`.
-
- * - `numThreads`
-   - `search`
-   - N
-   - Positive integer >0
-   - 1
-   - Number of threads to use for queries.
-
-hnswlib
--------
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `efConstruction`
-   - `build`
-   - Y
-   - Positive integer >0
-   -
-   - Controls index time and accuracy. Bigger values increase the index quality. At some point, increasing this will no longer improve the quality.
-
- * - `M`
-   - `build`
-   - Y
-   - Positive integer. Often between 2-100
-   -
-   - Number of bi-directional links create for every new element during construction. Higher values work for higher intrinsic dimensionality and/or high recall, low values can work for datasets with low intrinsic dimensionality and/or low recalls. Also affects the algorithm's memory consumption.
-
- * - `numThreads`
-   - `build`
-   - N
-   - Positive integer >0
-   - 1
-   - Number of threads to use to build the index.
-
- * - `ef`
-   - `search`
-   - Y
-   - Positive integer >0
-   -
-   - Size of the dynamic list for the nearest neighbors used for search. Higher value leads to more accurate but slower search. Cannot be lower than `k`.
-
- * - `numThreads`
-   - `search`
-   - N
-   - Positive integer >0
-   - 1
-   - Number of threads to use for queries.
-
-Please refer to `HNSW algorithm parameters guide <https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md>`_ from `hnswlib` to learn more about these arguments.
-
-DiskANN
-=======
-
-diskann_memory
---------------
-
-Use DiskANN in-memory index for approximate search.
-
-.. list-table::
-
- * - Parameter
-   - Type
-   - Required
-   - Data Type
-   - Default
-   - Description
-
- * - `R`
-   - `build`
-   - Y
-   - Positive integer >0
-   -
-   - Maximum degree of the graph index
-
- * - `L_build`
-   - `build`
-   - Y
-   - Positive integer >0
-   -
-   - number of visited nodes per greedy search during graph construction
-
- * - `alpha`
-   - `build`
-   - N
-   - Positive number >=1
-   - 1.2
-   - controls the pruning parameter of the graph construction
-
- * - `num_threads`
-   - `build`
-   - N
-   - Positive integer >0
-   - omp_get_max_threads()
-   - Number of CPU threads to use to build the index.
-
- * - `L_search`
-   - `search`
-   - Y
-   - Positive integer >0
-   -
-   - visited list size during search
diff --git a/docs/source/cuvs_bench/pluggable_backend.md b/docs/source/cuvs_bench/pluggable_backend.md
new file mode 100644
index 0000000000..c53031e2ea
--- /dev/null
+++ b/docs/source/cuvs_bench/pluggable_backend.md
@@ -0,0 +1,236 @@
+# Pluggable Backend
+
+cuVS Bench uses a pluggable API so that benchmarks can be run through different execution paths. The default path runs C++ benchmark executables; other backends (e.g. Elasticsearch, Milvus) can be added by implementing the same interface and registering them. Two pieces work together: a **config loader** turns the user's arguments (dataset, algorithms, k, batch_size, and the like) into a structured configuration; a **backend** takes that configuration and runs build and search. Both are registered under a backend type name (e.g. `cpp_gbench`). When `BenchmarkOrchestrator(backend_type="cpp_gbench").run_benchmark(...)` is called, the orchestrator uses the config loader for that type to produce the configuration, then passes it to the backend for that type.
+
+The following shows how the default backend is used:
+
+```python
+from cuvs_bench.orchestrator import BenchmarkOrchestrator
+
+orchestrator = BenchmarkOrchestrator(backend_type="cpp_gbench")
+results = orchestrator.run_benchmark(
+    dataset="deep-image-96-inner",
+    algorithms="cuvs_cagra",
+    count=10,
+    batch_size=10,
+    build=True,
+    search=True,
+)
+```
+
+## How a run flows
+
+1. The user calls `BenchmarkOrchestrator(backend_type="...").run_benchmark(dataset=..., algorithms=..., count=..., **kwargs)`.
+
+2. The orchestrator looks up the **config loader** for that `backend_type` and calls its **load()** method. The loader reads YAML (or other sources), expands parameter combinations, applies constraints, and returns a **DatasetConfig** and a list of **BenchmarkConfig** (each describing one or more index configs: algorithm, build params, search params).
+
+3. The orchestrator obtains the **backend** for that `backend_type` from the **BackendRegistry** (instantiating it with the config it needs, e.g. executable path, host/port).
+
+4. The orchestrator calls the backend's **build(dataset, indexes, ...)** then **search(dataset, indexes, k, batch_size, ...)**. The backend uses the same config shape that its loader produced.
+
+5. The backend returns **BuildResult** and **SearchResult**; the orchestrator aggregates and returns them.
+
+The config loader and the backend are thus a pair: the loader defines what to run (which algorithms and parameters); the backend defines how it runs (C++ subprocess, HTTP to a service, and so on).
+
+## What the config loader produces
+
+The orchestrator calls the config loader's **load()** method with the same arguments passed to `run_benchmark()` (e.g. `dataset`, `dataset_path`, `algorithms`, `count`, `batch_size`, `groups`, `algo_groups`, and backend-specific options). The loader must return two things:
+
+- **DatasetConfig** – Dataset metadata: `name`, `base_file`, `query_file`, `groundtruth_neighbors_file`, `distance` (e.g. `"euclidean"`), `dims`, and optional `subset_size`. These are used by the orchestrator to build the in-memory `Dataset` and by the backend if it needs file paths.
+
+- **List[BenchmarkConfig]** – Each **BenchmarkConfig** has:
+  - **indexes**: a list of **IndexConfig**. Each **IndexConfig** has `name` (e.g. `"my_algo.param1value"`), `algo` (algorithm name), `build_param` (dict of build parameters), `search_params` (list of dicts, one per search parameter combination to benchmark), and `file` (path or identifier where the index is stored).
+  - **backend_config**: a dict passed to the backend constructor (e.g. `executable_path` for C++, or `host`, `port`, `index_name` for a network backend). The backend receives this as its `config` in `__init__`.
+
+The following shows how to construct a minimal `DatasetConfig` and one `BenchmarkConfig` (one index, one search param set) so the backend runs a single build and search configuration:
+
+```python
+from cuvs_bench.orchestrator.config_loaders import (
+    ConfigLoader,
+    DatasetConfig,
+    BenchmarkConfig,
+    IndexConfig,
+)
+
+class MyConfigLoader(ConfigLoader):
+    @property
+    def backend_type(self) -> str:
+        return "my_backend"
+
+    def load(self, dataset, dataset_path, algorithms, count=10, batch_size=10000, **kwargs):
+        path_to_base = ...  # path to base vectors file
+        path_to_queries = ...  # path to query file
+        path_to_groundtruth = ...  # path to groundtruth neighbors file
+        path_to_index = ...  # path or id where the index is stored
+        dataset_config = DatasetConfig(
+            name=dataset,
+            base_file=path_to_base,
+            query_file=path_to_queries,
+            groundtruth_neighbors_file=path_to_groundtruth,
+            distance="euclidean",
+            dims=128,
+        )
+        index = IndexConfig(
+            name=f"{algorithms}.default",
+            algo=algorithms,
+            build_param={"nlist": 1024},
+            search_params=[{"nprobe": 10}],
+            file=path_to_index,
+        )
+        benchmark_config = BenchmarkConfig(
+            indexes=[index],
+            backend_config={
+                "host": ...,  # backend host
+                "port": ...,  # backend port
+                "index_name": ...,  # name of the index on the backend
+            },
+        )
+        return dataset_config, [benchmark_config]
+```
+
+## Adding a new backend
+
+To add a new execution path (e.g. Elasticsearch):
+
+1. Implement a **config loader**. Subclass **ConfigLoader** (from `cuvs_bench.orchestrator.config_loaders`). Implement **load()** to accept the kwargs the orchestrator passes (dataset, dataset_path, algorithms, count, batch_size, and the like) and return `(DatasetConfig, List[BenchmarkConfig])`. Populate **DatasetConfig** with dataset paths and metadata; for each run you want, add an **IndexConfig** (name, algo, build_param, search_params, file) and a **BenchmarkConfig** (indexes, backend_config). The **backend_config** dict is passed to your backend's constructor. Register the loader with **register_config_loader("my_backend", MyConfigLoader)**.
+
+2. Implement the **backend**. Subclass **BenchmarkBackend** (from `cuvs_bench.backends.base`). In **__init__(self, config)**, store the config (this is the **backend_config** produced by the loader). Implement **build(dataset, indexes, force=False, dry_run=False)** to return a **BuildResult** (index_path, build_time_seconds, index_size_bytes, algorithm, build_params, metadata, success). Implement **search(dataset, indexes, k, batch_size, mode=..., ...)** to return a **SearchResult** (neighbors, distances, search_time_ms, queries_per_second, recall, algorithm, search_params, success). Implement the **algo** property (e.g. from `self.config["algo"]`). Set **requires_gpu** or **requires_network** in config if the backend needs them. Register the class with **get_registry().register("my_backend", MyBackend)**.
+
+3. Use the new backend by calling `BenchmarkOrchestrator(backend_type="my_backend").run_benchmark(dataset=..., dataset_path=..., algorithms=..., **kwargs)`. The orchestrator will use your loader to build the configuration and your backend to run build and search.
+
+After implementing your loader and backend, register them as follows:
+
+```python
+from cuvs_bench.orchestrator import register_config_loader
+from cuvs_bench.backends import get_registry
+
+register_config_loader("my_backend", MyConfigLoader)
+get_registry().register("my_backend", MyBackend)
+```
+
+## Example: adding an Elasticsearch backend
+
+The following example shows a minimal Elasticsearch-style backend. The config loader builds one dataset config and one benchmark config with a single index; the backend stubs build and search and returns the result types the orchestrator expects. In practice you would replace the stub logic with real Elasticsearch API calls.
+
+Config loader: the **load()** method receives `dataset`, `dataset_path`, `algorithms`, `count`, `batch_size`, and optional kwargs. It returns a **DatasetConfig** (filled from dataset path and name) and a list of one **BenchmarkConfig** containing one **IndexConfig** and a **backend_config** with `host`, `port`, `index_name`, and `algo` for the backend to use.
+
+```python
+from cuvs_bench.orchestrator.config_loaders import (
+    ConfigLoader,
+    DatasetConfig,
+    BenchmarkConfig,
+    IndexConfig,
+)
+
+class ElasticsearchConfigLoader(ConfigLoader):
+    @property
+    def backend_type(self) -> str:
+        return "elasticsearch"
+
+    def load(self, dataset, dataset_path, algorithms, count=10, batch_size=10000, **kwargs):
+        path_to_base = ...  # path to base vectors (e.g. from dataset_path/dataset)
+        path_to_queries = ...  # path to query vectors
+        path_to_groundtruth = ...  # path to groundtruth file
+        path_to_index = ...  # path or id for the index
+        dataset_config = DatasetConfig(
+            name=dataset,
+            base_file=path_to_base,
+            query_file=path_to_queries,
+            groundtruth_neighbors_file=path_to_groundtruth,
+            distance="euclidean",
+            dims=kwargs.get("dims", 128),
+        )
+        index = IndexConfig(
+            name=f"{algorithms}.es",
+            algo=algorithms,
+            build_param={},
+            search_params=[{"ef_search": 100}],
+            file=path_to_index,
+        )
+        benchmark_config = BenchmarkConfig(
+            indexes=[index],
+            backend_config={
+                "host": ...,  # Elasticsearch host
+                "port": ...,  # Elasticsearch port
+                "index_name": ...,  # name of the vector index
+                "algo": algorithms,
+            },
+        )
+        return dataset_config, [benchmark_config]
+```
+
+Backend: the backend is constructed with **backend_config** (host, port, index_name, algo). **build()** and **search()** return **BuildResult** and **SearchResult** with the required fields; here they are stubbed with minimal values. Replace the stub body with actual Elasticsearch index creation and search calls.
+
+```python
+import numpy as np
+from cuvs_bench.backends.base import (
+    BenchmarkBackend,
+    Dataset,
+    BuildResult,
+    SearchResult,
+)
+from cuvs_bench.orchestrator.config_loaders import IndexConfig
+
+class ElasticsearchBackend(BenchmarkBackend):
+    @property
+    def algo(self) -> str:
+        return self.config.get("algo", "elasticsearch")
+
+    def build(self, dataset, indexes, force=False, dry_run=False):
+        # Stub: in practice, create ES index and bulk-index dataset.base_vectors
+        return BuildResult(
+            index_path=indexes[0].file if indexes else "",
+            build_time_seconds=0.0,
+            index_size_bytes=0,
+            algorithm=self.algo,
+            build_params=indexes[0].build_param if indexes else {},
+            metadata={},
+            success=True,
+        )
+
+    def search(self, dataset, indexes, k, batch_size=10000, mode="latency", force=False, search_threads=None, dry_run=False):
+        # Stub: in practice, run ES kNN search and compute recall
+        n_queries = dataset.n_queries
+        return SearchResult(
+            neighbors=np.zeros((n_queries, k), dtype=np.int64),
+            distances=np.zeros((n_queries, k), dtype=np.float32),
+            search_time_ms=0.0,
+            queries_per_second=0.0,
+            recall=0.0,
+            algorithm=self.algo,
+            search_params=indexes[0].search_params if indexes else [],
+            success=True,
+        )
+```
+
+Registration:
+
+```python
+from cuvs_bench.orchestrator import register_config_loader
+from cuvs_bench.backends import get_registry
+
+register_config_loader("elasticsearch", ElasticsearchConfigLoader)
+get_registry().register("elasticsearch", ElasticsearchBackend)
+```
+
+The built-in **CppGoogleBenchmarkBackend** (`backend_type="cpp_gbench"`) is one such pair: **CppGBenchConfigLoader** reads the YAML under `config/datasets` and `config/algos`, expands the Cartesian product, and validates with the constraint functions; the backend runs the C++ benchmark executables and merges results. Adding a new C++ algorithm (see {doc}`index`) only adds another executable and config for this backend; it does not add a new backend.
+
+## Components at a glance
+
+```{list-table}
+  :header-rows: 1
+  :widths: 20 80
+
+* - Component
+  - Description
+
+* - ConfigLoader
+  - Abstract. **load(\*\*kwargs)** returns `(DatasetConfig, List[BenchmarkConfig])`. Register with **register_config_loader(backend_type, loader_class)**.
+
+* - BenchmarkBackend
+  - Abstract. **build(dataset, indexes, force, dry_run)** returns `BuildResult`; **search(dataset, indexes, k, batch_size, mode, ...)** returns `SearchResult`. Optional **initialize()** and **cleanup()**. Properties: **algo**, **requires_gpu**, **requires_network** (from config). Register with **BackendRegistry.register(name, backend_class)**; get an instance with **get_backend(name, config)**.
+
+* - BackendRegistry
+  - **get_registry()** returns the singleton. **register(name, backend_class)** and **get_backend(name, config)** tie a backend type name to the class and to instances.
+```
+
diff --git a/docs/source/cuvs_bench/pluggable_backend.rst b/docs/source/cuvs_bench/pluggable_backend.rst
deleted file mode 100644
index 3655c1b0b6..0000000000
--- a/docs/source/cuvs_bench/pluggable_backend.rst
+++ /dev/null
@@ -1,241 +0,0 @@
-~~~~~~~~~~~~~~~~~~~~~~~~~
-Pluggable Backend
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-cuVS Bench uses a pluggable API so that benchmarks can be run through different execution paths. The default path runs C++ benchmark executables; other backends (e.g. Elasticsearch, Milvus) can be added by implementing the same interface and registering them. Two pieces work together: a **config loader** turns the user's arguments (dataset, algorithms, k, batch_size, and the like) into a structured configuration; a **backend** takes that configuration and runs build and search. Both are registered under a backend type name (e.g. ``cpp_gbench``). When ``BenchmarkOrchestrator(backend_type="cpp_gbench").run_benchmark(...)`` is called, the orchestrator uses the config loader for that type to produce the configuration, then passes it to the backend for that type.
-
-The following shows how the default backend is used:
-
-.. code-block:: python
-
-    from cuvs_bench.orchestrator import BenchmarkOrchestrator
-
-    orchestrator = BenchmarkOrchestrator(backend_type="cpp_gbench")
-    results = orchestrator.run_benchmark(
-        dataset="deep-image-96-inner",
-        algorithms="cuvs_cagra",
-        count=10,
-        batch_size=10,
-        build=True,
-        search=True,
-    )
-
-How a run flows
----------------
-
-1. The user calls ``orchestrator.run_benchmark(backend_type="...", dataset=..., algorithms=..., count=..., **kwargs)``.
-
-2. The orchestrator looks up the **config loader** for that ``backend_type`` and calls its **load()** method. The loader reads YAML (or other sources), expands parameter combinations, applies constraints, and returns a **DatasetConfig** and a list of **BenchmarkConfig** (each describing one or more index configs: algorithm, build params, search params).
-
-3. The orchestrator obtains the **backend** for that ``backend_type`` from the **BackendRegistry** (instantiating it with the config it needs, e.g. executable path, host/port).
-
-4. The orchestrator calls the backend's **build(dataset, indexes, ...)** then **search(dataset, indexes, k, batch_size, ...)**. The backend uses the same config shape that its loader produced.
-
-5. The backend returns **BuildResult** and **SearchResult**; the orchestrator aggregates and returns them.
-
-The config loader and the backend are thus a pair: the loader defines what to run (which algorithms and parameters); the backend defines how it runs (C++ subprocess, HTTP to a service, and so on).
-
-What the config loader produces
--------------------------------
-
-The orchestrator calls the config loader's **load()** method with the same arguments passed to ``run_benchmark()`` (e.g. ``dataset``, ``dataset_path``, ``algorithms``, ``count``, ``batch_size``, ``groups``, ``algo_groups``, and backend-specific options). The loader must return two things:
-
-- **DatasetConfig** – Dataset metadata: ``name``, ``base_file``, ``query_file``, ``groundtruth_neighbors_file``, ``distance`` (e.g. ``"euclidean"``), ``dims``, and optional ``subset_size``. These are used by the orchestrator to build the in-memory ``Dataset`` and by the backend if it needs file paths.
-
-- **List[BenchmarkConfig]** – Each **BenchmarkConfig** has:
-  - **indexes**: a list of **IndexConfig**. Each **IndexConfig** has ``name`` (e.g. ``"my_algo.param1value"``), ``algo`` (algorithm name), ``build_param`` (dict of build parameters), ``search_params`` (list of dicts, one per search parameter combination to benchmark), and ``file`` (path or identifier where the index is stored).
-  - **backend_config**: a dict passed to the backend constructor (e.g. ``executable_path`` for C++, or ``host``, ``port``, ``index_name`` for a network backend). The backend receives this as its ``config`` in ``__init__``.
-
-The following shows how to construct a minimal ``DatasetConfig`` and one ``BenchmarkConfig`` (one index, one search param set) so the backend runs a single build and search configuration:
-
-.. code-block:: python
-
-    from cuvs_bench.orchestrator.config_loaders import (
-        ConfigLoader,
-        DatasetConfig,
-        BenchmarkConfig,
-        IndexConfig,
-    )
-
-    class MyConfigLoader(ConfigLoader):
-        @property
-        def backend_type(self) -> str:
-            return "my_backend"
-
-        def load(self, dataset, dataset_path, algorithms, count=10, batch_size=10000, **kwargs):
-            path_to_base = ...  # path to base vectors file
-            path_to_queries = ...  # path to query file
-            path_to_groundtruth = ...  # path to groundtruth neighbors file
-            path_to_index = ...  # path or id where the index is stored
-            dataset_config = DatasetConfig(
-                name=dataset,
-                base_file=path_to_base,
-                query_file=path_to_queries,
-                groundtruth_neighbors_file=path_to_groundtruth,
-                distance="euclidean",
-                dims=128,
-            )
-            index = IndexConfig(
-                name=f"{algorithms}.default",
-                algo=algorithms,
-                build_param={"nlist": 1024},
-                search_params=[{"nprobe": 10}],
-                file=path_to_index,
-            )
-            benchmark_config = BenchmarkConfig(
-                indexes=[index],
-                backend_config={
-                    "host": ...,  # backend host
-                    "port": ...,  # backend port
-                    "index_name": ...,  # name of the index on the backend
-                },
-            )
-            return dataset_config, [benchmark_config]
-
-Adding a new backend
---------------------
-
-To add a new execution path (e.g. Elasticsearch):
-
-1. Implement a **config loader**. Subclass **ConfigLoader** (from ``cuvs_bench.orchestrator.config_loaders``). Implement **load()** to accept the kwargs the orchestrator passes (dataset, dataset_path, algorithms, count, batch_size, and the like) and return ``(DatasetConfig, List[BenchmarkConfig])``. Populate **DatasetConfig** with dataset paths and metadata; for each run you want, add an **IndexConfig** (name, algo, build_param, search_params, file) and a **BenchmarkConfig** (indexes, backend_config). The **backend_config** dict is passed to your backend's constructor. Register the loader with **register_config_loader("my_backend", MyConfigLoader)**.
-
-2. Implement the **backend**. Subclass **BenchmarkBackend** (from ``cuvs_bench.backends.base``). In **__init__(self, config)**, store the config (this is the **backend_config** produced by the loader). Implement **build(dataset, indexes, force=False, dry_run=False)** to return a **BuildResult** (index_path, build_time_seconds, index_size_bytes, algorithm, build_params, metadata, success). Implement **search(dataset, indexes, k, batch_size, mode=..., ...)** to return a **SearchResult** (neighbors, distances, search_time_ms, queries_per_second, recall, algorithm, search_params, success). Implement the **algo** property (e.g. from ``self.config["algo"]``). Set **requires_gpu** or **requires_network** in config if the backend needs them. Register the class with **get_registry().register("my_backend", MyBackend)**.
-
-3. Use the new backend by calling ``BenchmarkOrchestrator(backend_type="my_backend").run_benchmark(dataset=..., dataset_path=..., algorithms=..., **kwargs)``. The orchestrator will use your loader to build the configuration and your backend to run build and search.
-
-After implementing your loader and backend, register them as follows:
-
-.. code-block:: python
-
-    from cuvs_bench.orchestrator import register_config_loader
-    from cuvs_bench.backends import get_registry
-
-    register_config_loader("my_backend", MyConfigLoader)
-    get_registry().register("my_backend", MyBackend)
-
-Example: adding an Elasticsearch backend
------------------------------------------
-
-The following example shows a minimal Elasticsearch-style backend. The config loader builds one dataset config and one benchmark config with a single index; the backend stubs build and search and returns the result types the orchestrator expects. In practice you would replace the stub logic with real Elasticsearch API calls.
-
-Config loader: the **load()** method receives ``dataset``, ``dataset_path``, ``algorithms``, ``count``, ``batch_size``, and optional kwargs. It returns a **DatasetConfig** (filled from dataset path and name) and a list of one **BenchmarkConfig** containing one **IndexConfig** and a **backend_config** with ``host``, ``port``, and ``index_name`` for the backend to use.
-
-.. code-block:: python
-
-    from cuvs_bench.orchestrator.config_loaders import (
-        ConfigLoader,
-        DatasetConfig,
-        BenchmarkConfig,
-        IndexConfig,
-    )
-
-    class ElasticsearchConfigLoader(ConfigLoader):
-        @property
-        def backend_type(self) -> str:
-            return "elasticsearch"
-
-        def load(self, dataset, dataset_path, algorithms, count=10, batch_size=10000, **kwargs):
-            path_to_base = ...  # path to base vectors (e.g. from dataset_path/dataset)
-            path_to_queries = ...  # path to query vectors
-            path_to_groundtruth = ...  # path to groundtruth file
-            path_to_index = ...  # path or id for the index
-            dataset_config = DatasetConfig(
-                name=dataset,
-                base_file=path_to_base,
-                query_file=path_to_queries,
-                groundtruth_neighbors_file=path_to_groundtruth,
-                distance="euclidean",
-                dims=kwargs.get("dims", 128),
-            )
-            index = IndexConfig(
-                name=f"{algorithms}.es",
-                algo=algorithms,
-                build_param={},
-                search_params=[{"ef_search": 100}],
-                file=path_to_index,
-            )
-            benchmark_config = BenchmarkConfig(
-                indexes=[index],
-                backend_config={
-                    "host": ...,  # Elasticsearch host
-                    "port": ...,  # Elasticsearch port
-                    "index_name": ...,  # name of the vector index
-                    "algo": algorithms,
-                },
-            )
-            return dataset_config, [benchmark_config]
-
-Backend: the backend is constructed with **backend_config** (host, port, index_name, algo). **build()** and **search()** return **BuildResult** and **SearchResult** with the required fields; here they are stubbed with minimal values. Replace the stub body with actual Elasticsearch index creation and search calls.
-
-.. code-block:: python
-
-    import numpy as np
-    from cuvs_bench.backends.base import (
-        BenchmarkBackend,
-        Dataset,
-        BuildResult,
-        SearchResult,
-    )
-    from cuvs_bench.orchestrator.config_loaders import IndexConfig
-
-    class ElasticsearchBackend(BenchmarkBackend):
-        @property
-        def algo(self) -> str:
-            return self.config.get("algo", "elasticsearch")
-
-        def build(self, dataset, indexes, force=False, dry_run=False):
-            # Stub: in practice, create ES index and bulk-index dataset.base_vectors
-            return BuildResult(
-                index_path=indexes[0].file if indexes else "",
-                build_time_seconds=0.0,
-                index_size_bytes=0,
-                algorithm=self.algo,
-                build_params=indexes[0].build_param if indexes else {},
-                metadata={},
-                success=True,
-            )
-
-        def search(self, dataset, indexes, k, batch_size=10000, mode="latency", force=False, search_threads=None, dry_run=False):
-            # Stub: in practice, run ES kNN search and compute recall
-            n_queries = dataset.n_queries
-            return SearchResult(
-                neighbors=np.zeros((n_queries, k), dtype=np.int64),
-                distances=np.zeros((n_queries, k), dtype=np.float32),
-                search_time_ms=0.0,
-                queries_per_second=0.0,
-                recall=0.0,
-                algorithm=self.algo,
-                search_params=indexes[0].search_params if indexes else [],
-                success=True,
-            )
-
-Registration:
-
-.. code-block:: python
-
-    from cuvs_bench.orchestrator import register_config_loader
-    from cuvs_bench.backends import get_registry
-
-    register_config_loader("elasticsearch", ElasticsearchConfigLoader)
-    get_registry().register("elasticsearch", ElasticsearchBackend)
-
-The built-in **CppGoogleBenchmarkBackend** (``backend_type="cpp_gbench"``) is one such pair: **CppGBenchConfigLoader** reads the YAML under ``config/datasets`` and ``config/algos``, expands the Cartesian product, and validates with the constraint functions; the backend runs the C++ benchmark executables and merges results. Adding a new C++ algorithm (see :doc:`index`) only adds another executable and config for this backend; it does not add a new backend.
-
-Components at a glance
-----------------------
-
-.. list-table::
-   :header-rows: 1
-   :widths: 20 80
-
- * - Component
-   - Description
-
- * - ConfigLoader
-   - Abstract. **load(**kwargs)** returns ``(DatasetConfig, List[BenchmarkConfig])``. Register with **register_config_loader(backend_type, loader_class)**.
-
- * - BenchmarkBackend
-   - Abstract. **build(dataset, indexes, force, dry_run)** returns ``BuildResult``; **search(dataset, indexes, k, batch_size, mode, ...)** returns ``SearchResult``. Optional **initialize()** and **cleanup()**. Properties: **algo**, **requires_gpu**, **requires_network** (from config). Register with **BackendRegistry.register(name, backend_class)**; get an instance with **get_backend(name, config)**.
-
- * - BackendRegistry
-   - **get_registry()** returns the singleton. **register(name, backend_class)** and **get_backend(name, config)** tie a backend type name to the class and to instances.
diff --git a/docs/source/cuvs_bench/wiki_all_dataset.rst b/docs/source/cuvs_bench/wiki_all_dataset.md
similarity index 57%
rename from docs/source/cuvs_bench/wiki_all_dataset.rst
rename to docs/source/cuvs_bench/wiki_all_dataset.md
index 38b72ae3f5..3e26ca0d9e 100644
--- a/docs/source/cuvs_bench/wiki_all_dataset.rst
+++ b/docs/source/cuvs_bench/wiki_all_dataset.md
@@ -1,56 +1,49 @@
-~~~~~~~~~~~~~~~~
-Wiki-all Dataset
-~~~~~~~~~~~~~~~~
+# Wiki-all Dataset
 
 
 The `wiki-all` dataset was created to stress vector search algorithms at scale with both a large number of vectors and dimensions. The entire dataset contains 88M vectors with 768 dimensions and is meant for testing the types of vectors one would typically encounter in retrieval augmented generation (RAG) workloads. The full dataset is ~251GB in size, which is intentionally larger than the typical memory of GPUs. The massive scale is intended to promote the use of compression and efficient out-of-core methods for both indexing and search.
 
-The dataset is composed of English wiki texts from `Kaggle <https://www.kaggle.com/datasets/jjinho/wikipedia-20230701>`_ and multi-lingual wiki texts from `Cohere Wikipedia <https://huggingface.co/datasets/Cohere/wikipedia-22-12>`_.
+The dataset is composed of English wiki texts from [Kaggle](https://www.kaggle.com/datasets/jjinho/wikipedia-20230701) and multi-lingual wiki texts from [Cohere Wikipedia](https://huggingface.co/datasets/Cohere/wikipedia-22-12).
 
 Cohere's English Texts are older (2022) and smaller than the Kaggle English Wiki texts (2023) so the English texts have been removed from Cohere completely. The final Wiki texts include English Wiki from Kaggle and the other languages from Cohere. The English texts constitute 50% of the total text size.
 
-To form the final dataset, the Wiki texts were chunked into 85 million 128-token pieces. For reference, Cohere chunks Wiki texts into 104-token pieces. Finally, the embeddings of each chunk were computed using the `paraphrase-multilingual-mpnet-base-v2 <https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2>`_ embedding model. The resulting dataset is an embedding matrix of size 88 million by 768. Also included with the dataset is a query file containing 10k query vectors and a groundtruth file to evaluate nearest neighbors algorithms.
+To form the final dataset, the Wiki texts were chunked into 85 million 128-token pieces. For reference, Cohere chunks Wiki texts into 104-token pieces. Finally, the embeddings of each chunk were computed using the [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) embedding model. The resulting dataset is an embedding matrix of size 88 million by 768. Also included with the dataset is a query file containing 10k query vectors and a groundtruth file to evaluate nearest neighbors algorithms.
 
-Getting the dataset
-===================
+## Getting the dataset
 
-Full dataset
-------------
+### Full dataset
 
-A version of the dataset is made available in the binary format that can be used directly by the :doc:`cuvs-bench <index>` tool. The full 88M dataset is ~251GB and the download link below contains tarballs that have been split into multiple parts.
+A version of the dataset is made available in the binary format that can be used directly by the {doc}`cuvs-bench <index>` tool. The full 88M dataset is ~251GB and the download link below contains tarballs that have been split into multiple parts.
 
 The following will download all 10 the parts and untar them to a `wiki_all_88M` directory:
 
-.. code-block:: bash
-
-    curl -s https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.{00..9} | tar -xf - -C wiki_all_88M/
+```bash
+curl -s https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.{00..9} | tar -xf - -C wiki_all_88M/
+```
 
 The above has the unfortunate drawback that if the command should fail for any reason, all the parts need to be re-downloaded. The files can also be downloaded individually and then untarred to the directory. Each file is ~27GB and there are 10 of them.
 
-.. code-block:: bash
-
-    curl -s https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.00
-    ...
-    curl -s https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.09
+```bash
+curl -s -O https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.00
+...
+curl -s -O https://data.rapids.ai/raft/datasets/wiki_all/wiki_all.tar.09
 
-    cat wiki_all.tar.* | tar -xf - -C wiki_all_88M/
+cat wiki_all.tar.* | tar -xf - -C wiki_all_88M/
+```
 
-1M and 10M subsets
-------------------
+### 1M and 10M subsets
 
 Also available are 1M and 10M subsets of the full dataset which are 2.9GB and 29GB, respectively. These subsets also include query sets of 10k vectors and corresponding groundtruth files.
 
-.. code-block:: bash
-
-    curl -s https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar
-    curl -s https://data.rapids.ai/raft/datasets/wiki_all_10M/wiki_all_10M.tar
+```bash
+curl -s -O https://data.rapids.ai/raft/datasets/wiki_all_1M/wiki_all_1M.tar
+curl -s -O https://data.rapids.ai/raft/datasets/wiki_all_10M/wiki_all_10M.tar
+```
 
-Using the dataset
-=================
+## Using the dataset
 
 After the dataset is downloaded and extracted to the `wiki_all_88M` directory (or `wiki_all_1M`/`wiki_all_10M` depending on whether the subsets are used), the files can be used in the benchmarking tool. The dataset name is `wiki_all` (or `wiki_all_1M`/`wiki_all_10M`), and the benchmarking tool can be used by specifying the appropriate name `--dataset wiki_all_88M` in the scripts.
 
-License info
-============
+## License info
 
-The English wiki texts available on Kaggle come with the `CC BY-NCSA 4.0 <https://creativecommons.org/licenses/by-nc-sa/4.0/>`_ license and the Cohere wikipedia data set comes with the `Apache 2.0 <https://choosealicense.com/licenses/apache-2.0/>`_ license.
+The English wiki texts available on Kaggle come with the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license and the Cohere Wikipedia dataset comes with the [Apache 2.0](https://choosealicense.com/licenses/apache-2.0/) license.
diff --git a/docs/source/filtering.md b/docs/source/filtering.md
new file mode 100644
index 0000000000..4cd902f623
--- /dev/null
+++ b/docs/source/filtering.md
@@ -0,0 +1,109 @@
+(filtering)=
+
+# Filtering vector indexes
+
+cuVS supports different types of filtering depending on the vector index being used. The main method used across all of the vector indexes
+is pre-filtering, a technique that applies the filter before computing a query's nearest neighbors, which avoids
+computing distances to vectors that have been filtered out.
+
+## Bitset
+
+A bitset is an array of bits where each bit can take one of two values, `0` or `1`, which indicate in the context of filtering whether
+a sample should be filtered or not. `0` means that the corresponding vector will be filtered, and will therefore not be present in the results of the search.
+This mechanism is optimized to take as little memory space as possible, and is available through the RAFT library
+(check out RAFT's [bitset API documentation](https://docs.rapids.ai/api/raft/stable/cpp_api/core_bitset/)). When calling a search function of an ANN index, the
+bitset length should match the number of vectors present in the database.
+
+## Bitmap
+
+A bitmap is based on the same principle as a bitset, but in two dimensions. This allows users to provide a different bitset for each query
+being searched. Check out RAFT's [bitmap API documentation](https://docs.rapids.ai/api/raft/stable/cpp_api/core_bitmap/).
+
+## Examples
+
+### Using a Bitset filter on a CAGRA index
+
+```c++
+#include <cuvs/neighbors/cagra.hpp>
+#include <cuvs/core/bitset.hpp>
+
+using namespace cuvs::neighbors;
+cagra::index index;
+
+// ... build index ...
+
+cagra::search_params search_params;
+raft::device_resources res;
+raft::device_matrix_view<float> queries = load_queries();
+auto neighbors = raft::make_device_matrix<uint32_t, int64_t>(res, n_queries, k);
+auto distances = raft::make_device_matrix<float, int64_t>(res, n_queries, k);
+
+// Load a list of all the samples that will get filtered
+std::vector<uint32_t> removed_indices_host = get_invalid_indices();
+auto removed_indices_device =
+      raft::make_device_vector<uint32_t, uint32_t>(res, removed_indices_host.size());
+// Copy this list to device
+raft::copy(removed_indices_device.data_handle(), removed_indices_host.data(),
+           removed_indices_host.size(), raft::resource::get_cuda_stream(res));
+
+// Create a bitset with the list of samples to filter.
+cuvs::core::bitset<uint32_t, uint32_t> removed_indices_bitset(
+    res, removed_indices_device.view(), index.size());
+// Use a `bitset_filter` in the `cagra::search` function call.
+auto bitset_filter =
+      cuvs::neighbors::filtering::bitset_filter(removed_indices_bitset.view());
+cagra::search(res,
+              search_params,
+              index,
+              queries,
+              neighbors.view(),
+              distances.view(),
+              bitset_filter);
+```
+
+### Using a Bitmap filter on a Brute-force index
+
+```c++
+#include <cuvs/neighbors/brute_force.hpp>
+#include <cuvs/core/bitmap.hpp>
+
+using namespace cuvs::neighbors;
+using indexing_dtype = int64_t;
+
+// ... build index ...
+brute_force::index_params index_params;
+brute_force::search_params search_params;
+raft::device_resources res;
+raft::device_matrix_view<float, indexing_dtype> dataset = load_dataset(n_vectors, dim);
+raft::device_matrix_view<float, indexing_dtype> queries = load_queries(n_queries, dim);
+auto index = brute_force::build(res, index_params, raft::make_const_mdspan(dataset));
+
+// Load a list of all the samples that will get filtered
+std::vector<uint32_t> removed_indices_host = get_invalid_indices();
+auto removed_indices_device =
+      raft::make_device_vector<uint32_t, uint32_t>(res, removed_indices_host.size());
+// Copy this list to device
+raft::copy(removed_indices_device.data_handle(), removed_indices_host.data(),
+           removed_indices_host.size(), raft::resource::get_cuda_stream(res));
+
+// Create a bitmap with the list of samples to filter.
+cuvs::core::bitset<uint32_t, indexing_dtype> removed_indices_bitset(
+  res, removed_indices_device.view(), n_queries * n_vectors);
+cuvs::core::bitmap_view<const uint32_t, indexing_dtype> removed_indices_bitmap(
+    removed_indices_bitset.data(), n_queries, n_vectors);
+
+// Use a `bitmap_filter` in the `brute_force::search` function call.
+auto bitmap_filter =
+      cuvs::neighbors::filtering::bitmap_filter(removed_indices_bitmap);
+
+auto neighbors = raft::make_device_matrix<uint32_t, indexing_dtype>(res, n_queries, k);
+auto distances = raft::make_device_matrix<float, indexing_dtype>(res, n_queries, k);
+brute_force::search(res,
+                    search_params,
+                    index,
+                    raft::make_const_mdspan(queries),
+                    neighbors.view(),
+                    distances.view(),
+                    bitmap_filter);
+```
+
diff --git a/docs/source/filtering.rst b/docs/source/filtering.rst
deleted file mode 100644
index cb168f94c8..0000000000
--- a/docs/source/filtering.rst
+++ /dev/null
@@ -1,116 +0,0 @@
-.. _filtering:
-
-~~~~~~~~~~~~~~~~~~~~~~~~
-Filtering vector indexes
-~~~~~~~~~~~~~~~~~~~~~~~~
-
-cuVS supports different type of filtering depending on the vector index being used. The main method used in all of the vector indexes
-is pre-filtering, which is a technique that will take into account the filtering of the vectors before computing its closest neighbors, saving
-some computation from calculating distances.
-
-Bitset
-======
-
-A bitset is an array of bits where each bit can have two possible values: `0` and `1`, which signify in the context of filtering whether
-a sample should be filtered or not. `0` means that the corresponding vector will be filtered, and will therefore not be present in the results of the search.
-This mechanism is optimized to take as little memory space as possible, and is available through the RAFT library
-(check out RAFT's `bitset API documentation <https://docs.rapids.ai/api/raft/stable/cpp_api/core_bitset/>`). When calling a search function of an ANN index, the
-bitset length should match the number of vectors present in the database.
-
-Bitmap
-======
-
-A bitmap is based on the same principle as a bitset, but in two dimensions. This allows users to provide a different bitset for each query
-being searched. Check out RAFT's `bitmap API documentation <https://docs.rapids.ai/api/raft/stable/cpp_api/core_bitmap/>`.
-
-Examples
-========
-
-Using a Bitset filter on a CAGRA index
---------------------------------------
-
-.. code-block:: c++
-
-    #include <cuvs/neighbors/cagra.hpp>
-    #include <cuvs/core/bitset.hpp>
-
-    using namespace cuvs::neighbors;
-    cagra::index index;
-
-    // ... build index ...
-
-    cagra::search_params search_params;
-    raft::device_resources res;
-    raft::device_matrix_view<float> queries = load_queries();
-    raft::device_matrix_view<uint32_t> neighbors = make_device_matrix_view<uint32_t>(n_queries, k);
-    raft::device_matrix_view<float> distances = make_device_matrix_view<float>(n_queries, k);
-
-    // Load a list of all the samples that will get filtered
-    std::vector<uint32_t> removed_indices_host = get_invalid_indices();
-    auto removed_indices_device =
-          raft::make_device_vector<uint32_t, uint32_t>(res, removed_indices_host.size());
-    // Copy this list to device
-    raft::copy(removed_indices_device.data_handle(), removed_indices_host.data(),
-               removed_indices_host.size(), raft::resource::get_cuda_stream(res));
-
-    // Create a bitset with the list of samples to filter.
-    cuvs::core::bitset<uint32_t, uint32_t> removed_indices_bitset(
-        res, removed_indices_device.view(), index.size());
-    // Use a `bitset_filter` in the `cagra::search` function call.
-    auto bitset_filter =
-          cuvs::neighbors::filtering::bitset_filter(removed_indices_bitset.view());
-    cagra::search(res,
-                  search_params,
-                  index,
-                  queries,
-                  neighbors,
-                  distances,
-                  bitset_filter);
-
-
-Using a Bitmap filter on a Brute-force index
---------------------------------------------
-
-.. code-block:: c++
-
-    #include <cuvs/neighbors/brute_force.hpp>
-    #include <cuvs/core/bitmap.hpp>
-
-    using namespace cuvs::neighbors;
-    using indexing_dtype = int64_t;
-
-    // ... build index ...
-    brute_force::index_params index_params;
-    brute_force::search_params search_params;
-    raft::device_resources res;
-    raft::device_matrix_view<float, indexing_dtype> dataset = load_dataset(n_vectors, dim);
-    raft::device_matrix_view<float, indexing_dtype> queries = load_queries(n_queries, dim);
-    auto index = brute_force::build(res, index_params, raft::make_const_mdspan(dataset.view()));
-
-    // Load a list of all the samples that will get filtered
-    std::vector<uint32_t> removed_indices_host = get_invalid_indices();
-    auto removed_indices_device =
-          raft::make_device_vector<uint32_t, uint32_t>(res, removed_indices_host.size());
-    // Copy this list to device
-    raft::copy(removed_indices_device.data_handle(), removed_indices_host.data(),
-               removed_indices_host.size(), raft::resource::get_cuda_stream(res));
-
-    // Create a bitmap with the list of samples to filter.
-    cuvs::core::bitset<uint32_t, indexing_dtype> removed_indices_bitset(
-      res, removed_indices_device.view(), n_queries * n_vectors);
-    cuvs::core::bitmap_view<const uint32_t, indexing_dtype> removed_indices_bitmap(
-        removed_indices_bitset.data(), n_queries, n_vectors);
-
-    // Use a `bitmap_filter` in the `brute_force::search` function call.
-    auto bitmap_filter =
-          cuvs::neighbors::filtering::bitmap_filter(removed_indices_bitmap);
-
-    auto neighbors = raft::make_device_matrix_view<uint32_t, indexing_dtype>(n_queries, k);
-    auto distances = raft::make_device_matrix_view<float, indexing_dtype>(n_queries, k);
-    brute_force::search(res,
-                        search_params,
-                        index,
-                        raft::make_const_mdspan(queries.view()),
-                        neighbors.view(),
-                        distances.view(),
-                        bitmap_filter);
diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md
new file mode 100644
index 0000000000..d108652653
--- /dev/null
+++ b/docs/source/getting_started.md
@@ -0,0 +1,115 @@
+# Getting Started
+
+- [New to vector search?](#new-to-vector-search)
+
+  * {doc}`Primer on vector search <choosing_and_configuring_indexes>`
+
+  * {doc}`Vector search indexes vs vector databases <vector_databases_vs_vector_search>`
+
+  * {doc}`Index tuning guide <tuning_guide>`
+
+  * {doc}`Comparing vector search index performance <comparing_indexes>`
+
+- [Supported indexes](#supported-indexes)
+
+  * {doc}`Vector search index guide <neighbors/neighbors>`
+
+- [Using cuVS APIs](#using-cuvs-apis)
+
+  * {doc}`C API Docs <c_api>`
+
+  * {doc}`C++ API Docs <cpp_api>`
+
+  * {doc}`Python API Docs <python_api>`
+
+  * {doc}`Rust API Docs <rust_api/index>`
+
+  * {doc}`API basics <api_basics>`
+
+  * {doc}`API interoperability <api_interoperability>`
+
+- [Where to next?](#where-to-next)
+
+  * [Social media](#social-media)
+
+  * [Blogs](#blogs)
+
+  * [Research](#research)
+
+  * [Get involved](#get-involved)
+
+## New to vector search?
+
+If you are unfamiliar with the basics of vector search or how vector search differs from vector databases, then {doc}`this primer on vector search <choosing_and_configuring_indexes>` should provide some good insight. Another good resource for newcomers is our {doc}`vector databases vs vector search <vector_databases_vs_vector_search>` guide. As outlined in the primer, vector search as used in vector databases is often closer to machine learning than to traditional databases. This means that while traditional databases can often be slow without any performance tuning, they will usually still yield the correct results. Unfortunately, vector search indexes, like other machine learning models, can yield garbage results if not tuned correctly.
+
+Fortunately, this opens up the whole world of hyperparameter optimization to improve vector search performance and quality. Please see our {doc}`index tuning guide <tuning_guide>` for more information.
+
+When comparing the performance of vector search indexes, it is important to consider three main dimensions:
+
+1. Build time
+1. Search quality
+1. Search performance
+
+Please see the {doc}`primer on comparing vector search index performance <comparing_indexes>` for more information on methodologies and how to make a fair apples-to-apples comparison during your evaluations.
+
+## Supported indexes
+
+cuVS supports many of the standard index types, and the list continues to grow to stay current with the state of the art. Please refer to our {doc}`vector search index guide <neighbors/neighbors>` to learn more about each individual index type, when it can be useful on the GPU, and the tuning knobs it offers to trade off performance and quality.
+
+The primary goal of cuVS is to enable speed, scale, and flexibility (in that order). One of its important value propositions is to enhance existing software deployments with extensible GPU capabilities, improving pain points without interrupting the parts of the system that already work well on the CPU.
+
+
+## Using cuVS APIs
+
+cuVS is a C++ library at its core, which is wrapped with a C library and exposed further through various languages. cuVS currently provides APIs and documentation for {doc}`C <c_api>`, {doc}`C++ <cpp_api>`, {doc}`Python <python_api>`, and {doc}`Rust <rust_api/index>`, with more languages in the works. Our {doc}`API basics <api_basics>` guide provides some background and context about the important paradigms and vocabulary types you'll encounter when working with cuVS.
+
+Please refer to the {doc}`guide on API interoperability <api_interoperability>` for more information on how cuVS can work seamlessly with other libraries like NumPy, CuPy, TensorFlow, and PyTorch, even without having to copy device memory.
+
+
+## Where to next?
+
+cuVS is free and open source software, licensed under Apache 2.0. Once you are familiar with and/or have used cuVS, you can access the developer community most easily through [GitHub](https://github.com/rapidsai/cuvs). Please open GitHub issues for any bugs, questions, or feature requests.
+
+### Social media
+
+You can access the RAPIDS community through [Slack](https://rapids.ai/slack-invite), [Stack Overflow](https://stackoverflow.com/tags/rapids), and [X](https://twitter.com/rapidsai).
+
+### Blogs
+
+We frequently publish blog posts on GPU-enabled vector search, which provide deep dives into important topics and breakthroughs:
+
+1. [See all cuVS blogs](https://developer.nvidia.com/blog/recent-posts/?products=cuVS)
+1. [Accelerated Vector Search: Approximating with cuVS IVF-Flat](https://developer.nvidia.com/blog/accelerated-vector-search-approximating-with-rapids-raft-ivf-flat/)
+1. Accelerating Vector Search with cuVS IVF-PQ ([Part 1](https://developer.nvidia.com/blog/accelerating-vector-search-rapids-cuvs-ivf-pq-deep-dive-part-1/), [Part 2](https://developer.nvidia.com/blog/accelerating-vector-search-nvidia-cuvs-ivf-pq-performance-tuning-part-2/))
+
+### Research
+
+Many of the accelerated implementations in cuVS are based on research papers, which provide much more background for the interested reader. If you use these algorithms in your own research, please cite the corresponding papers.
+
+1. [CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search](https://arxiv.org/abs/2308.15136)
+1. [Top-K Algorithms on GPU: A Comprehensive Study and New Methods](https://dl.acm.org/doi/10.1145/3581784.3607062)
+1. [Fast K-NN Graph Construction by GPU Based NN-Descent](https://dl.acm.org/doi/abs/10.1145/3459637.3482344?casa_token=O_nan1B1F5cAAAAA:QHWDEhh0wmd6UUTLY9_Gv6c3XI-5DXM9mXVaUXOYeStlpxTPmV3nKvABRfoivZAaQ3n8FWyrkWw)
+1. [cuSLINK: Single-linkage Agglomerative Clustering on the GPU](https://arxiv.org/abs/2306.16354)
+1. [GPU Semiring Primitives for Sparse Neighborhood Methods](https://arxiv.org/abs/2104.06357)
+1. [VecFlow: A High-Performance Vector Data Management System for Filtered-Search on GPUs](https://arxiv.org/abs/2506.00812)
+
+
+### Get involved
+
+We always welcome patches for new features and bug fixes. Please read our [contributing guide](contributing.md) for more information on contributing patches to cuVS.
+
+
+```{toctree}
+:hidden:
+
+choosing_and_configuring_indexes.md
+vector_databases_vs_vector_search.md
+tuning_guide.md
+comparing_indexes.md
+neighbors/neighbors.md
+api_basics.md
+api_interoperability.md
+working_with_ann_indexes.md
+filtering.md
+```
+
diff --git a/docs/source/getting_started.rst b/docs/source/getting_started.rst
deleted file mode 100644
index 656bdf32e4..0000000000
--- a/docs/source/getting_started.rst
+++ /dev/null
@@ -1,124 +0,0 @@
-~~~~~~~~~~~~~~~
-Getting Started
-~~~~~~~~~~~~~~~
-
-- `New to vector search?`_
-
-  * :doc:`Primer on vector search <choosing_and_configuring_indexes>`
-
-  * :doc:`Vector search indexes vs vector databases <vector_databases_vs_vector_search>`
-
-  * :doc:`Index tuning guide <tuning_guide>`
-
-  * :doc:`Comparing vector search index performance <comparing_indexes>`
-
-- `Supported indexes`_
-
-  * :doc:`Vector search index guide <neighbors/neighbors>`
-
-- `Using cuVS APIs`_
-
-  * :doc:`C API Docs <c_api>`
-
-  * :doc:`C++ API Docs <cpp_api>`
-
-  * :doc:`Python API Docs <python_api>`
-
-  * :doc:`Rust API Docs <rust_api/index>`
-
-  * :doc:`API basics <api_basics>`
-
-  * :doc:`API interoperability <api_interoperability>`
-
-- `Where to next?`_
-
-  * `Social media`_
-
-  * `Blogs`_
-
-  * `Research`_
-
-  * `Get involved`_
-
-New to vector search?
-=====================
-
-If you are unfamiliar with the basics of vector search or how vector search differs from vector databases, then :doc:`this primer on vector search guide <choosing_and_configuring_indexes>` should provide some good insight. Another good resource for the uninitiated is our :doc:`vector databases vs vector search <vector_databases_vs_vector_search>` guide. As outlined in the primer, vector search as used in vector databases is often closer to machine learning than to traditional databases. This means that while traditional databases can often be slow without any performance tuning, they will usually still yield the correct results. Unfortunately, vector search indexes, like other machine learning models, can yield garbage results if not tuned correctly.
-
-Fortunately, this opens up the whole world of hyperparameter optimization to improve vector search performance and quality. Please see our :doc:`index tuning guide <tuning_guide>` for more information.
-
-When comparing the performance of vector search indexes, it is important that considerations are made with respect to three main dimensions:
-
-#. Build time
-#. Search quality
-#. Search performance
-
-Please see the :doc:`primer on comparing vector search index performance <comparing_indexes>` for more information on methodologies and how to make a fair apples-to-apples comparison during your evaluations.
-
-Supported indexes
-=================
-
-cuVS supports many of the standard index types with the list continuing to grow and stay current with the state-of-the-art. Please refer to our :doc:`vector search index guide <neighbors/neighbors>` to learn more about each individual index type, when they can be useful on the GPU, the tuning knobs they offer to trade off performance and quality.
-
-The primary goal of cuVS is to enable speed, scale, and flexibility (in that order)- and one of the important value propositions is to enhance existing software deployments with extensible GPU capabilities to improve pain points while not interrupting parts of the system that work well today with CPU.
-
-
-Using cuVS APIs
-===============
-
-cuVS is a C++ library at its core, which is wrapped with a C library and exposed further through various different languages. cuVS currently provides APIs and documentation for :doc:`C <c_api>`, :doc:`C++ <cpp_api>`, :doc:`Python <python_api>`, and :doc:`Rust <rust_api/index>` with more languages in the works. our :doc:`API basics <api_basics>` provides some background and context about the important paradigms and vocabulary types you'll encounter when working with cuVS types.
-
-Please refer to the :doc:`guide on API interoperability <api_interoperability>` for more information on how cuVS can work seamlessly with other libraries like numpy, cupy, tensorflow, and pytorch, even without having to copy device memory.
-
-
-Where to next?
-==============
-
-cuVS is free and open source software, licensed under Apache 2.0 Once you are familiar with and/or have used cuVS, you can access the developer community most easily through `Github <https://github.com/rapidsai/cuvs>`_. Please open Github issues for any bugs, questions or feature requests.
-
-Social media
-------------
-
-You can access the RAPIDS community through `Slack <https://rapids.ai/slack-invite>`_ , `Stack Overflow <https://stackoverflow.com/tags/rapids>`_ and `X <https://twitter.com/rapidsai>`_
-
-Blogs
------
-
-We frequently publish blogs on GPU-enabled vector search, which can provide great deep dives into various important topics and breakthroughs:
-
-#. `See all cuVS blogs <https://developer.nvidia.com/blog/recent-posts/?products=cuVS>`_
-#. `Accelerated Vector Search: Approximating with cuVS IVF-Flat <https://developer.nvidia.com/blog/accelerated-vector-search-approximating-with-rapids-raft-ivf-flat/>`_
-#. Accelerating Vector Search with cuVS IVF-PQ (`Part 1 <https://developer.nvidia.com/blog/accelerating-vector-search-rapids-cuvs-ivf-pq-deep-dive-part-1/>`_, `Part 2 <https://developer.nvidia.com/blog/accelerating-vector-search-nvidia-cuvs-ivf-pq-performance-tuning-part-2/>`_)
-
-Research
---------
-
-For the interested reader, many of the accelerated implementations in cuVS are also based on research papers which can provide a lot more background. We also ask you to please cite the corresponding algorithms by referencing them in your own research.
-
-#. `CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search <https://arxiv.org/abs/2308.15136>`_
-#. `Top-K Algorithms on GPU: A Comprehensive Study and New Methods <https://dl.acm.org/doi/10.1145/3581784.3607062>`_
-#. `Fast K-NN Graph Construction by GPU Based NN-Descent <https://dl.acm.org/doi/abs/10.1145/3459637.3482344?casa_token=O_nan1B1F5cAAAAA:QHWDEhh0wmd6UUTLY9_Gv6c3XI-5DXM9mXVaUXOYeStlpxTPmV3nKvABRfoivZAaQ3n8FWyrkWw>`_
-#. `cuSLINK: Single-linkage Agglomerative Clustering on the GPU <https://arxiv.org/abs/2306.16354>`_
-#. `GPU Semiring Primitives for Sparse Neighborhood Methods <https://arxiv.org/abs/2104.06357>`_
-#. `VecFlow: A High-Performance Vector Data Management System for Filtered-Search on GPUs <https://arxiv.org/abs/2506.00812>`_
-
-
-Get involved
-------------
-
-We always welcome patches for new features and bug fixes. Please read our `contributing guide <contributing.md>`_ for more information on contributing patches to cuVS.
-
-
-
-.. toctree::
-   :hidden:
-
-   choosing_and_configuring_indexes.rst
-   vector_databases_vs_vector_search.rst
-   tuning_guide.rst
-   comparing_indexes.rst
-   neighbors/neighbors.rst
-   api_basics.rst
-   api_interoperability.rst
-   working_with_ann_indexes.rst
-   filtering.rst
diff --git a/docs/source/index.rst b/docs/source/index.md
similarity index 60%
rename from docs/source/index.rst
rename to docs/source/index.md
index ecf92ffa8e..ed4daad7fd 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.md
@@ -1,36 +1,31 @@
-cuVS: Vector Search and Clustering on the GPU
-=============================================
+# cuVS: Vector Search and Clustering on the GPU
 
-Welcome to cuVS, the premier library for GPU-accelerated vector search and clustering! cuVS provides several core building blocks for constructing new algorithms, as well as end-to-end vector search and clustering algorithms for use either standalone or through a growing list of :doc:`integrations <integrations>`.
+Welcome to cuVS, the premier library for GPU-accelerated vector search and clustering! cuVS provides several core building blocks for constructing new algorithms, as well as end-to-end vector search and clustering algorithms for use either standalone or through a growing list of {doc}`integrations <integrations>`.
 
-Useful Resources
-################
+## Useful Resources
 
-.. _cuvs_reference: https://docs.rapids.ai/api/cuvs/stable/
+[cuvs_reference]: https://docs.rapids.ai/api/cuvs/stable/
 
-- `Example Notebooks <https://github.com/rapidsai/cuvs/tree/HEAD/notebooks>`_: Example notebooks
-- `Code Examples <https://github.com/rapidsai/cuvs/tree/HEAD/examples>`_: Self-contained code examples
-- `RAPIDS Community <https://rapids.ai/community.html>`_: Get help, contribute, and collaborate.
-- `GitHub repository <https://github.com/rapidsai/cuvs>`_: Download the cuVS source code.
-- `Issue tracker <https://github.com/rapidsai/cuvs/issues>`_: Report issues or request features.
+- [Example Notebooks](https://github.com/rapidsai/cuvs/tree/HEAD/notebooks): Example notebooks
+- [Code Examples](https://github.com/rapidsai/cuvs/tree/HEAD/examples): Self-contained code examples
+- [RAPIDS Community](https://rapids.ai/community.html): Get help, contribute, and collaborate.
+- [GitHub repository](https://github.com/rapidsai/cuvs): Download the cuVS source code.
+- [Issue tracker](https://github.com/rapidsai/cuvs/issues): Report issues or request features.
 
 
-
-What is cuVS?
-#############
+## What is cuVS?
 
 cuVS contains state-of-the-art implementations of several algorithms for running approximate and exact nearest neighbors and clustering on the GPU. It can be used directly or through the various databases and other libraries that have integrated it. The primary goal of cuVS is to simplify the use of GPUs for vector similarity search and clustering.
 
 Vector search is an information retrieval method that has been growing in popularity over the past few  years, partly because of the rising importance of multimedia embeddings created from unstructured data and the need to perform semantic search on the embeddings to find items which are semantically similar to each other.
 
-Vector search is also used in *data mining and machine learning* tasks and comprises an important step in many *clustering* and *visualization* algorithms like `UMAP <https://arxiv.org/abs/2008.00325>`_, `t-SNE <https://lvdmaaten.github.io/tsne/>`_, K-means, and `HDBSCAN <https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html>`_.
+Vector search is also used in *data mining and machine learning* tasks and comprises an important step in many *clustering* and *visualization* algorithms like [UMAP](https://arxiv.org/abs/2008.00325), [t-SNE](https://lvdmaaten.github.io/tsne/), K-means, and [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html).
 
-Finally, faster vector search enables interactions between dense vectors and graphs. Converting a pile of dense vectors into nearest neighbors graphs unlocks the entire world of graph analysis algorithms, such as those found in `GraphBLAS <https://graphblas.org/>`_ and `cuGraph <https://github.com/rapidsai/cugraph>`_.
+Finally, faster vector search enables interactions between dense vectors and graphs. Converting a pile of dense vectors into nearest neighbors graphs unlocks the entire world of graph analysis algorithms, such as those found in [GraphBLAS](https://graphblas.org/) and [cuGraph](https://github.com/rapidsai/cugraph).
 
 Below are some common use-cases for vector search
 
-Semantic search
-~~~~~~~~~~~~~~~
+### Semantic search
 - Generative AI & Retrieval augmented generation (RAG)
 - Recommender systems
 - Computer vision
@@ -41,8 +36,7 @@ Semantic search
 - Model training
 
 
-Data mining
-~~~~~~~~~~~
+### Data mining
 - Clustering algorithms
 - Visualization algorithms
 - Sampling algorithms
@@ -50,8 +44,7 @@ Data mining
 - Ensemble methods
 - k-NN graph construction
 
-Why cuVS?
-#########
+## Why cuVS?
 
 There are several benefits to using cuVS and GPUs for vector search, including
 
@@ -65,28 +58,27 @@ There are several benefits to using cuVS and GPUs for vector search, including
 
 In addition to the items above, cuVS shoulders the responsibility of keeping non-trivial accelerated code up to date as new NVIDIA architectures and CUDA versions are released. This provides a delightful development experience, guaranteeing that any libraries, databases, or applications built on top of it will always be receiving the best performance and scale.
 
-cuVS Technology Stack
-#####################
+## cuVS Technology Stack
 
 cuVS is built on top of the RAPIDS RAFT library of high performance machine learning primitives and provides all the necessary routines for vector search and clustering on the GPU.
 
-.. image:: ../../img/tech_stack.png
-  :width: 600
-  :alt: cuVS is built on top of low-level CUDA libraries and provides many important routines that enable vector search and clustering on the GPU
-
+```{image} ../../img/tech_stack.png
+:width: 600
+:alt: cuVS is built on top of low-level CUDA libraries and provides many important routines that enable vector search and clustering on the GPU
+```
 
+## Contents
 
-Contents
-########
+```{toctree}
+:maxdepth: 4
 
-.. toctree::
-   :maxdepth: 4
+build.md
+getting_started.md
+integrations.md
+cuvs_bench/index.md
+api_docs.md
+advanced_topics.md
+contributing.md
+developer_guide.md
+```
 
-   build.rst
-   getting_started.rst
-   integrations.rst
-   cuvs_bench/index.rst
-   api_docs.rst
-   advanced_topics.rst
-   contributing.md
-   developer_guide.md
diff --git a/docs/source/integrations.md b/docs/source/integrations.md
new file mode 100644
index 0000000000..dcbb3f61df
--- /dev/null
+++ b/docs/source/integrations.md
@@ -0,0 +1,13 @@
+# Integrations
+
+In addition to standalone use, cuVS can be consumed through a number of SDK and vector database integrations.
+
+```{toctree}
+:maxdepth: 4
+
+integrations/faiss.md
+integrations/milvus.md
+integrations/lucene.md
+integrations/kinetica.md
+```
+
diff --git a/docs/source/integrations.rst b/docs/source/integrations.rst
deleted file mode 100644
index 760892a98a..0000000000
--- a/docs/source/integrations.rst
+++ /dev/null
@@ -1,13 +0,0 @@
-============
-Integrations
-============
-
-Aside from using cuVS standalone, it can be consumed through a number of sdk and vector database integrations.
-
-.. toctree::
-   :maxdepth: 4
-
-   integrations/faiss.rst
-   integrations/milvus.rst
-   integrations/lucene.rst
-   integrations/kinetica.rst
diff --git a/docs/source/integrations/faiss.rst b/docs/source/integrations/faiss.md
similarity index 73%
rename from docs/source/integrations/faiss.rst
rename to docs/source/integrations/faiss.md
index 1fc88d921c..6f6aee53e0 100644
--- a/docs/source/integrations/faiss.rst
+++ b/docs/source/integrations/faiss.md
@@ -1,6 +1,5 @@
-Faiss
------
+# Faiss
 
 Faiss v1.10.0 and beyond provides a special conda package that enables a cuVS backend for the Flat, IVF-Flat, IVF-PQ and CAGRA indexes on the GPU. Like the classical Faiss GPU indexes, the cuVS backend also enables interoperability between Faiss CPU indexes, allowing an index to be trained on GPU, searched on CPU, and vice versa.
 
-The cuVS backend can be enabled by setting the appropriate cmake flag while building Faiss from source. A pre-compiled conda package can also be installed. Refer to `Faiss installation guidelines <https://github.com/facebookresearch/faiss/blob/main/INSTALL.md>`_ for more information.
+The cuVS backend can be enabled by setting the appropriate CMake flag while building Faiss from source. A pre-compiled conda package can also be installed. Refer to the [Faiss installation guidelines](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md) for more information.
diff --git a/docs/source/integrations/kinetica.md b/docs/source/integrations/kinetica.md
new file mode 100644
index 0000000000..e690e6738a
--- /dev/null
+++ b/docs/source/integrations/kinetica.md
@@ -0,0 +1,5 @@
+# Kinetica
+
+Starting with release 7.2, Kinetica supports the graph-based CAGRA algorithm from RAFT. Kinetica will continue to improve its support over coming versions, while also migrating to cuVS as we work to move the vector search algorithms out of RAFT and into cuVS.
+
+Kinetica currently offers the ability to create a CAGRA index in a SQL `CREATE_TABLE` statement, as outlined in their [vector search indexing docs](https://docs.kinetica.com/7.2/concepts/indexes/#cagra-index). Kinetica is not open source, but the RAFT indexes can be enabled in the developer edition, which can be installed [here](https://www.kinetica.com/try/#download_instructions).
diff --git a/docs/source/integrations/kinetica.rst b/docs/source/integrations/kinetica.rst
deleted file mode 100644
index e74cfe82fd..0000000000
--- a/docs/source/integrations/kinetica.rst
+++ /dev/null
@@ -1,6 +0,0 @@
-Kinetica
---------
-
-Starting with release 7.2, Kinetica supports the graph-based the CAGRA algorithm from RAFT. Kinetica will continue to improve its support over coming versions, while also migrating to cuVS as we work to move the vector search algorithms out of RAFT and into cuVS.
-
-Kinetica currently offers the ability to create a CAGRA index in a SQL `CREATE_TABLE` statement, as outlined in their `vector search indexing docs <https://docs.kinetica.com/7.2/concepts/indexes/#cagra-index>`_. Kinetica is not open source, but the RAFT indexes can be enabled in the developer edition, which can be installed `here <https://www.kinetica.com/try/#download_instructions>`_.
diff --git a/docs/source/integrations/lucene.rst b/docs/source/integrations/lucene.md
similarity index 74%
rename from docs/source/integrations/lucene.rst
rename to docs/source/integrations/lucene.md
index d20052545b..bc60123c3f 100644
--- a/docs/source/integrations/lucene.rst
+++ b/docs/source/integrations/lucene.md
@@ -1,6 +1,5 @@
-Lucene
-------
+# Lucene
 
 An experimental Lucene connector for cuVS enables GPU-accelerated vector search indexes through Lucene. Initial benchmarks are showing that this connector can drastically improve the performance of both indexing and search in Lucene. This connector will continue to be improved over time and any interested developers are encouraged to contribute.
 
-Install and evaluate the `lucene-cuvs` connector on `Github <https://github.com/SearchScale/lucene-cuvs>`_.
+Install and evaluate the `lucene-cuvs` connector on [GitHub](https://github.com/SearchScale/lucene-cuvs).
diff --git a/docs/source/integrations/milvus.rst b/docs/source/integrations/milvus.md
similarity index 59%
rename from docs/source/integrations/milvus.rst
rename to docs/source/integrations/milvus.md
index 4139cca526..e33ab43b59 100644
--- a/docs/source/integrations/milvus.rst
+++ b/docs/source/integrations/milvus.md
@@ -1,8 +1,7 @@
-Milvus
-------
+# Milvus
 
-In version 2.3, Milvus released support for IVF-Flat and IVF-PQ indexes on the GPU through RAFT. Version 2.4 adds support for brute-force and the graph-based CAGRA index on the GPU. Please refer to the `Milvus documentation <https://milvus.io/docs/install_standalone-docker-compose-gpu.md>`_ to install Milvus with GPU support.
+In version 2.3, Milvus released support for IVF-Flat and IVF-PQ indexes on the GPU through RAFT. Version 2.4 adds support for brute-force and the graph-based CAGRA index on the GPU. Please refer to the [Milvus documentation](https://milvus.io/docs/install_standalone-docker-compose-gpu.md) to install Milvus with GPU support.
 
-The GPU indexes can be enabled by using the index types prefixed with `GPU_`, as outlined in the `Milvus index build guide <https://milvus.io/docs/build_index.md#Prepare-index-parameter>`_.
+The GPU indexes can be enabled by using the index types prefixed with `GPU_`, as outlined in the [Milvus index build guide](https://milvus.io/docs/build_index.md#Prepare-index-parameter).
 
 Milvus will be migrating their GPU support from RAFT to cuVS as we continue to move the vector search algorithms out of RAFT and into cuVS.
diff --git a/docs/source/neighbors/all_neighbors.rst b/docs/source/neighbors/all_neighbors.md
similarity index 91%
rename from docs/source/neighbors/all_neighbors.rst
rename to docs/source/neighbors/all_neighbors.md
index a70414fe06..c1368aafd5 100644
--- a/docs/source/neighbors/all_neighbors.rst
+++ b/docs/source/neighbors/all_neighbors.md
@@ -1,5 +1,4 @@
-All-neighbors
-=============
+# All-neighbors
 
 All-neighbors is a specialized algorithm for building approximate all-neighbors k-NN graphs. Unlike traditional nearest neighbor indexes that are designed for searching, all-neighbors focuses on constructing complete k-NN graphs for entire datasets.
 
@@ -16,10 +15,9 @@ All-neighbors supports multiple underlying algorithms:
 
 The algorithm partitions the dataset into clusters and distributes the work across multiple GPUs when possible, making it suitable for large-scale graph construction tasks.
 
-[ :doc:`C API <../c_api/neighbors_all_neighbors_c>` | :doc:`C++ API <../cpp_api/neighbors_all_neighbors>` | :doc:`Python API <../python_api/neighbors_all_neighbors>` ]
+[ {doc}`C API <../c_api/neighbors_all_neighbors_c>` | {doc}`C++ API <../cpp_api/neighbors_all_neighbors>` | {doc}`Python API <../python_api/neighbors_all_neighbors>` ]
 
-Algorithm Overview
-------------------
+## Algorithm Overview
 
 All-neighbors works by:
 
@@ -33,8 +31,7 @@ This approach enables:
 - **Memory Efficiency**: Processing large datasets that don't fit in single GPU memory
 - **Flexibility**: Choice of underlying algorithm based on accuracy vs. speed requirements
 
-Use Cases
----------
+## Use Cases
 
 **Data Mining and Machine Learning**
 - Clustering algorithms (K-means, HDBSCAN)
@@ -51,8 +48,7 @@ Use Cases
 - Batch processing for distributed computing environments
 - Building graphs for graph databases and analytics
 
-Parameters
-----------
+## Parameters
 
 - **algo**: Underlying algorithm (brute_force, ivf_pq, nn_descent)
 - **overlap_factor**: Number of clusters each point is assigned to
@@ -60,8 +56,7 @@ Parameters
 - **metric**: Distance metric for graph construction
 - **algorithm-specific parameters**: IVF-PQ or NN-Descent specific settings
 
-Performance Characteristics
----------------------------
+## Performance Characteristics
 
 - **Build Time**: Scales with dataset size and chosen algorithm
 - **Memory Usage**: Depends on cluster size and overlap factor
diff --git a/docs/source/neighbors/bruteforce.rst b/docs/source/neighbors/bruteforce.md
similarity index 69%
rename from docs/source/neighbors/bruteforce.rst
rename to docs/source/neighbors/bruteforce.md
index 3dc1155073..230e5bb3c6 100644
--- a/docs/source/neighbors/bruteforce.rst
+++ b/docs/source/neighbors/bruteforce.md
@@ -1,9 +1,8 @@
-Brute-force
-===========
+# Brute-force
 
 Brute-force, or flat index, is the most simple index type, as it ultimately boils down to an exhaustive matrix multiplication.
 
-While it scales with :math:`O(N^2*D)`, brute-force can be a great choice when
+While it scales with $O(N^2*D)$, brute-force can be a great choice when
 
 1. exact nearest neighbors are required, and
 2. when the number of vectors is relatively small (a few thousand to a few million)
@@ -12,10 +11,9 @@ Brute-force can also be a good choice for heavily filtered queries where other a
 when filtering out 90%-95% of the vectors from a search, the IVF methods could struggle to return anything at all with smaller number of probes and
 graph-based algorithms with limited hash table memory could end up skipping over important unfiltered entries.
 
-[ :doc:`C API <../c_api/neighbors_bruteforce_c>` | :doc:`C++ API <../cpp_api/neighbors_bruteforce>` | :doc:`Python API <../python_api/neighbors_brute_force>` | :doc:`Rust API <../rust_api/index>` ]
+[ {doc}`C API <../c_api/neighbors_bruteforce_c>` | {doc}`C++ API <../cpp_api/neighbors_bruteforce>` | {doc}`Python API <../python_api/neighbors_brute_force>` | {doc}`Rust API <../rust_api/index>` ]
 
-Filtering considerations
-------------------------
+## Filtering considerations
 
 Because it is exhaustive, brute-force can quickly become the slowest, albeit most accurate form of search. However, even
 when the number of vectors in an index are very large, brute-force can still be used to search vectors efficiently with a filter.
@@ -25,22 +23,18 @@ inherent in other approximate algorithms would simply not include expected vecto
 brute-force, the computation is inverted so distances are only computed between vectors that pass the filter, significantly reducing
 the amount of computation required.
 
-Configuration parameters
-------------------------
+## Configuration parameters
 
-Build parameters
-~~~~~~~~~~~~~~~~
+### Build parameters
 
 None
 
-Search Parameters
-~~~~~~~~~~~~~~~~~
+### Search Parameters
 
 None
 
 
-Tuning Considerations
----------------------
+## Tuning Considerations
 
 Brute-force is exact but that doesn't always mean it's deterministic. For example, when there are many nearest neighbors with
 the same distances it's possible they might be ordered differently across different runs. This especially becomes apparent in
@@ -48,15 +42,13 @@ cases where there are points with the same distance right near the cutoff of `k`
 to differ from ground truth. This is not often a problem in practice and can usually be mitigated by increasing `k`.
 
 
-Memory footprint
-----------------
+## Memory footprint
 
-:math:`precision` is the number of bytes in each element of each vector (e.g. 32-bit = 4-bytes)
+$precision$ is the number of bytes in each element of each vector (e.g. 32-bit = 4 bytes)
 
 
-Index footprint
-~~~~~~~~~~~~~~~
+### Index footprint
 
-Raw vectors: :math:`n\_vectors * n\_dimensions * precision`
+Raw vectors: $n\_vectors * n\_dimensions * precision$
 
-Vector norms (for distances which require them): :math:`n\_vectors * precision`
+Vector norms (for distances which require them): $n\_vectors * precision$
diff --git a/docs/source/neighbors/cagra.md b/docs/source/neighbors/cagra.md
new file mode 100644
index 0000000000..48c4d0b289
--- /dev/null
+++ b/docs/source/neighbors/cagra.md
@@ -0,0 +1,263 @@
+# CAGRA
+
+CAGRA, or (C)UDA (A)NN (GRA)ph-based, is a graph-based index that is based loosely on the popular navigable small-world graph (NSG) algorithm, but which has been
+built from the ground-up specifically for the GPU. CAGRA constructs a flat graph representation by first building a kNN graph
+of the training points and then removing redundant paths between neighbors.
+
+The CAGRA algorithm has two basic steps:
+1. Construct a kNN graph.
+2. Prune redundant routes from the kNN graph.
+
+Brute-force could be used to construct the initial kNN graph. This would yield the most accurate graph but would be very slow, and
+we find that in practice the kNN graph does not need to be very accurate since the pruning step helps to boost the overall recall of
+the index. cuVS provides IVF-PQ and NN-Descent strategies for building the initial kNN graph, and these can be selected in the index params object during index construction.
+
+[ {doc}`C API <../c_api/neighbors_cagra_c>` | {doc}`C++ API <../cpp_api/neighbors_cagra>` | {doc}`Python API <../python_api/neighbors_cagra>` | {doc}`Rust API <../rust_api/index>` ]
+
+## Interoperability with HNSW
+
+cuVS provides the capability to convert a CAGRA graph to an HNSW graph, which enables the GPU to be used only for building the index
+while the CPU can be leveraged for search.
+
+## Filtering considerations
+
+CAGRA supports filtered search. Its multi-CTA algorithm was improved in branch-25.02 to provide reasonable recall and performance for filtering rates of 90% or more.
+
+Obtaining appropriate recall in filtered search requires setting search parameters according to the filtering rate. Because this is difficult for users to do, CAGRA automatically adjusts `itopk_size` internally, on a heuristic basis, according to the filtering rate. To disable this automatic adjustment, set the `filtering_rate` search parameter to `0.0`.
+
+## Configuration parameters
+
+### Build parameters
+
+```{list-table}
+:widths: 25 25 50
+:header-rows: 1
+
+* - Name
+  - Default
+  - Description
+* - compression
+  - None
+  - For large datasets, the raw vectors can be compressed using product quantization so they can be placed on device. This comes at the cost of lowering recall, though a refinement reranking step can be used to make up the lost recall after search.
+* - graph_build_algo
+  - 'IVF_PQ'
+  - The algorithm used to build the initial knn graph
+* - graph_build_params
+  - None
+  - Specify explicit build parameters for the corresponding graph build algorithms
+* - graph_degree
+  - 32
+  - The degree of the final CAGRA graph. All vertices in the graph will have this degree. During search, a larger graph degree allows for more exploration of the search space and improves recall but at the expense of searching more vertices.
+* - intermediate_graph_degree
+  - 64
+  - The degree of the initial knn graph before it is optimized into the final CAGRA graph. A larger value increases connectivity of the initial graph so that it performs better once pruned. Larger values come at the cost of increased device memory usage and a longer initial knn graph construction time.
+* - guarantee_connectivity
+  - False
+  - Uses a degree-constrained minimum spanning tree to guarantee the initial knn graph is connected. This can improve recall on some datasets.
+* - attach_data_on_build
+  - True
+  - Should the dataset be attached to the index after the index is built? Setting this to `False` can improve memory usage and performance, for example if the graph is being serialized to disk or converted to HNSW right after building it.
+```
+
+### Search parameters
+
+```{list-table}
+:widths: 25 25 50
+:header-rows: 1
+
+* - Name
+  - Default
+  - Description
+* - itopk_size
+  - 64
+  - Number of intermediate search results retained during search. This value needs to be >=k. This is the main knob to tweak search performance.
+* - max_iterations
+  - 0
+  - The maximum number of iterations during search. Default is to auto-select.
+* - max_queries
+  - 0
+  - Max number of search queries to perform concurrently (batch size). Default is to auto-select.
+* - team_size
+  - 0
+  - Number of CUDA threads for calculating each distance. Can be 4, 8, 16, or 32. Default is to auto-select.
+* - search_width
+  - 1
+  - Number of vertices to select as the starting point for the search in each iteration.
+* - min_iterations
+  - 0
+  - Minimum number of search iterations to perform
+```
+
+## Tuning Considerations
+
+The 3 hyper-parameters that are most often tuned are `graph_degree`, `intermediate_graph_degree`, and `itopk_size`.
+
+# Memory footprint
+
+CAGRA builds a nearest-neighbor graph (stored on host) while keeping the original dataset vectors around. During index build, the dataset must reside in device (GPU) memory. After building, the dataset can optionally be detached from the index — for example, when immediately converting the CAGRA graph to a CPU-based format like HNSW for search.
+
+## Baseline Memory Footprint
+
+The baseline memory footprint after index construction:
+
+$$
+\text{dataset_size (device)}
+\;=\;
+\text{number_vectors} \times \text{vector_dimension} \times \text{bytes_per_dimension}
+$$
+
+$$
+\text{graph_size (host)}
+\;=\;
+\text{number_vectors} \times \text{graph_degree} \times \operatorname{sizeof}\!\big(\mathrm{IdxT}\big)
+$$
+
+Note: The dataset must be in GPU memory during index build, but can be detached afterward if not needed for search.
+
+**Example** (1,000,000 vectors, dim = 1024, fp32, graph\_degree = 64, IdxT = int32):
+
+- dataset\_size = 4,096,000,000 B = 3906.25 MB
+- graph\_size   = 256,000,000 B = 244.14 MB
+
+## Build peak memory usage
+
+Index build has two phases: (1) construct a knn graph, then (2) optimize it to remove redundant and unnecessary paths.
+The initial knn graph can be built with IVF-PQ or NN-Descent. IVF-PQ has the additional benefit that it supports out-of-core construction, allowing CAGRA to be trained on datasets larger than available GPU memory.
+The steps below are sequential with distinct peak memory consumption. The overall peak memory utilization depends on the configured RMM memory resource.
+
+### knn graph build phase using IVF-PQ
+
+The knn graph can be constructed using the IVF-PQ algorithm, which works in two stages: first, an IVF-PQ index is trained on a subset of vectors to learn cluster centroids; then, the full dataset is queried against this index in batches to find approximate nearest neighbors for each vector.
+
+**IVF-PQ Build (centroid training)** — uses a training subset to compute cluster centroids and PQ codebooks.
+
+$$
+\text{IVFPQ_build_peak}
+\;=\;
+\frac{n_{\text{vectors}}}{\text{train_set_ratio}} \times \text{dim} \times 4
+\;+\;
+n_{\text{clusters}} \times \text{dim} \times 4
+\;+\;
+\frac{n_{\text{vectors}}}{\text{train_set_ratio}} \times \operatorname{sizeof}(\mathrm{uint32\_t})
+$$
+
+**Example** (n = 1e6; dim = 1024; n\_clusters = 1024; train\_set\_ratio = 10): 395.01 MB
+
+**IVF-PQ Search (forms the intermediate graph)** — Constructs the knn graph in batches by querying the IVF-PQ index for the nearest neighbors of all training points.
+
+$$
+\text{IVFPQ_search_peak}
+\;=\;
+\text{batch_size} \times \text{dim} \times 4
+\;+\;
+\text{batch_size} \times \text{intermediate_degree} \times \operatorname{sizeof}(\mathrm{uint32\_t})
+\;+\;
+\text{batch_size} \times \text{intermediate_degree} \times 4
+$$
+
+**Example** (batch = 1024, dim = 1024, intermediate\_degree = 128): 5.00 MB
+
+### knn graph build phase using NN-Descent
+
+**Peak device memory:**
+
+$$
+\text{NND_device_peak}
+\;=\;
+n_\text{vectors} \times (n_\text{dims} \times 2 + 276)
+$$
+
+- Data vectors (transferred to device and stored as fp16): $n_\text{dims} \times 2$ bytes per vector
+- Small working graph, locks, edge counters: 276 bytes per vector (fixed)
+- Additional $4$ bytes per vector when using L2 metric (for precomputed norms)
+
+**Peak host memory:**
+
+$$
+\text{NND_host_peak}
+\;=\;
+n_\text{vectors} \times (13 \times \text{intermediate_graph_degree} + 912)
+$$
+
+- Full graph with distances (~1.3x overallocation): $1.3 \times 8 \times \text{intermediate_graph_degree}$ bytes per vector
+- Bloom filter for sampling: $1.3 \times 2 \times \text{intermediate_graph_degree}$ bytes per vector
+- 5 sample buffers (degree 32 each): 640 bytes per vector
+- Graph update buffer (degree 32): 256 bytes per vector
+- Edge counters: 16 bytes per vector
+
+
+### Optimize phase
+
+This phase prunes and reorders the intermediate graph; its peak memory scales linearly with the intermediate degree.
+
+$$
+\text{optimize_peak}
+\;=\;
+n_{\text{vectors}} \times
+\Big( 4 + \big(\operatorname{sizeof}(\mathrm{IdxT}) + 1\big)\times \text{intermediate_degree} \Big)
+$$
+
+**Example** (n = 1e6, intermediate\_degree = 128, IdxT = int32): 614.17 MB
+Out-of-core CAGRA build consists of IVF-PQ build, IVF-PQ search, and CAGRA optimization. These steps are performed sequentially, so their peak memory requirements are not additive.
+
+### Overall Build Peak Memory Usage
+
+The overall peak memory footprint on the device is the maximum allocation across each sequential step, since RMM's `device_memory_resource` releases memory between steps.
+
+**Using IVF-PQ:**
+
+$$
+\text{build_peak}
+\;=\;
+\text{dataset_size}
+\;+\;
+\max\!\big(\text{IVFPQ_build_peak},\ \text{IVFPQ_search_peak},\ \text{optimize_peak}\big)
+$$
+
+**Example:** 3906.25 + max(395.01, 5.00, 614.17) = 4520.42 MB
+
+**Using NN-Descent:**
+
+$$
+\text{build_peak}
+\;=\;
+\text{dataset_size}^{*}
+\;+\;
+\max\!\big(\text{NND_device_peak},\ \text{optimize_peak}\big)
+$$
+
+$\text{dataset_size}^{*}$ applies only when the user passes data residing in device memory; NN-Descent internally copies the dataset to the device as fp16, so host-memory inputs do not add this term.
+
+## Search peak memory usage
+
+CAGRA search requires the dataset and graph to already be resident in GPU memory. When using CAGRA-Q (compressed/quantized), the original dataset can reside in host memory instead. Additionally, temporary workspace memory is needed to store the search results for a batch of queries.
+If multiple batches are to be launched concurrently or overlapped, separate results buffers will be needed for each.
+The memory estimate below assumes that a single batch of queries runs at a time and that the buffers are reused.
+
+$$
+\text{search_memory}
+\;=\;
+\text{dataset_size} + \text{graph_size} + \text{workspace_size}
+$$
+
+Where `workspace_size` is the temporary memory used for query vectors and result storage:
+
+$$
+\text{query_size}
+\;=\;
+\text{batch_size} \times \text{dim} \times \operatorname{sizeof}(\mathrm{float})
+$$
+
+$$
+\text{result_size}
+\;=\;
+\text{batch_size} \times \text{topk} \times
+\big(\operatorname{sizeof}(\mathrm{IdxT}) + \operatorname{sizeof}(\mathrm{float})\big)
+$$
+
+**Example** (dim = 1024, batch\_size = 100, topk = 10, IdxT = int32):
+
+- query\_size  = 409,600 B = 0.39 MB
+- result\_size = 8,000 B = 0.0076 MB
+- workspace\_size = query\_size + result\_size = 0.40 MB
+- Total search memory ≈ 3906.25 + 244.14 + 0.40 = 4150.79 MB
diff --git a/docs/source/neighbors/cagra.rst b/docs/source/neighbors/cagra.rst
deleted file mode 100644
index 471f3a915a..0000000000
--- a/docs/source/neighbors/cagra.rst
+++ /dev/null
@@ -1,276 +0,0 @@
-CAGRA
-=====
-
-CAGRA, or (C)UDA (A)NN (GRA)ph-based, is a graph-based index that is based loosely on the popular navigable small-world graph (NSG) algorithm, but which has been
-built from the ground-up specifically for the GPU. CAGRA constructs a flat graph representation by first building a kNN graph
-of the training points and then removing redundant paths between neighbors.
-
-The CAGRA algorithm has two basic steps-
-* 1. Construct a kNN graph
-* 2. Prune redundant routes from the kNN graph.
-
-I-force could be used to construct the initial kNN graph. This would yield the most accurate graph but would be very slow and
-we find that in practice the kNN graph does not need to be very accurate since the pruning step helps to boost the overall recall of
-the index. cuVS provides IVF-PQ and NN-Descent strategies for building the initial kNN graph and these can be selected in index params object during index construction.
-
-[ :doc:`C API <../c_api/neighbors_cagra_c>` | :doc:`C++ API <../cpp_api/neighbors_cagra>` | :doc:`Python API <../python_api/neighbors_cagra>` | :doc:`Rust API <../rust_api/index>` ]
-
-Interoperability with HNSW
---------------------------
-
-cuVS provides the capability to convert a CAGRA graph to an HNSW graph, which enables the GPU to be used only for building the index
-while the CPU can be leveraged for search.
-
-Filtering considerations
-------------------------
-
-CAGRA supports filtered search and has improved multi-CTA algorithm in branch-25.02 to provide reasonable recall and performance for filtering rate as high as 90% or more.
-
-To obtain an appropriate recall in filtered search, it is necessary to set search parameters according to the filtering rate, but since it is difficult for users to do this, CAGRA automatically adjusts `itopk_size` internally according to the filtering rate on a heuristic basis. If you want to disable this automatic adjustment, set `filtering_rate`, one of the search parameters, to `0.0`, and `itopk_size` will not be adjusted automatically.
-
-Configuration parameters
-------------------------
-
-Build parameters
-~~~~~~~~~~~~~~~~
-
-.. list-table::
-   :widths: 25 25 50
-   :header-rows: 1
-
-   * - Name
-     - Default
-     - Description
-   * - compression
-     - None
-     - For large datasets, the raw vectors can be compressed using product quantization so they can be placed on device. This comes at the cost of lowering recall, though a refinement reranking step can be used to make up the lost recall after search.
-   * - graph_build_algo
-     - 'IVF_PQ'
-     - The graph build algorithm to use for building
-   * - graph_build_params
-     - None
-     - Specify explicit build parameters for the corresponding graph build algorithms
-   * - graph_degree
-     - 32
-     - The degree of the final CAGRA graph. All vertices in the graph will have this degree. During search, a larger graph degree allows for more exploration of the search space and improves recall but at the expense of searching more vertices.
-   * - intermediate_graph_degree
-     - 64
-     - The degree of the initial knn graph before it is optimized into the final CAGRA graph. A larger value increases connectivity of the initial graph so that it performs better once pruned. Larger values come at the cost of increased device memory usage and increases the time of initial knn graph construction.
-   * - guarantee_connectivity
-     - False
-     - Uses a degree-constrained minimum spanning tree to guarantee the initial knn graph is connected. This can improve recall on some datasets.
-   * - attach_data_on_build
-     - True
-     - Should the dataset be attached to the index after the index is built? Setting this to `False` can improve memory usage and performance, for example if the graph is being serialized to disk or converted to HNSW right after building it.
-
-Search parameters
-~~~~~~~~~~~~~~~~~
-
-.. list-table::
-   :widths: 25 25 50
-   :header-rows: 1
-
-   * - Name
-     - Default
-     - Description
-   * - itopk_size
-     - 64
-     - Number of intermediate search results retained during search. This value needs to be >=k. This is the main knob to tweak search performance.
-   * - max_iterations
-     - 0
-     - The maximum number of iterations during search. Default is to auto-select.
-   * - max_queries
-     - 0
-     - Max number of search queries to perform concurrently (batch size). Default is to auto-select.
-   * - team_size
-     - 0
-     - Number of CUDA threads for calculating each distance. Can be 4, 8, 16, or 32. Default is to auto-select.
-   * - search_width
-     - 1
-     - Number of vertices to select as the starting point for the search in each iteration.
-   * - min_iterations
-     - 0
-     - Minimum number of search iterations to perform
-
-Tuning Considerations
----------------------
-
-The 3 hyper-parameters that are most often tuned are `graph_degree`, `intermediate_graph_degree`, and `itopk_size`.
-
-Memory footprint
-================
-
-CAGRA builds a nearest-neighbor graph (stored on host) while keeping the original dataset vectors around. During index build, the dataset must reside in device (GPU) memory. After building, the dataset can optionally be detached from the index — for example, when immediately converting the CAGRA graph to a CPU-based format like HNSW for search.
-
-Baseline Memory Footprint
--------------------------
-
-The baseline memory footprint after index construction:
-
-.. math::
-
-   \text{dataset_size (device)}
-   \;=\;
-   \text{number_vectors} \times \text{vector_dimension} \times \text{bytes_per_dimension}
-
-.. math::
-
-   \text{graph_size (host)}
-   \;=\;
-   \text{number_vectors} \times \text{graph_degree} \times \operatorname{sizeof}\!\big(\mathrm{IdxT}\big)
-
-Note: The dataset must be in GPU memory during index build, but can be detached afterward if not needed for search.
-
-**Example** (1,000,000 vectors, dim = 1024, fp32, graph\_degree = 64, IdxT = int32):
-
-- dataset\_size = 4,096,000,000 B = 3906.25 MB
-- graph\_size   = 256,000,000 B = 244.14 MB
-
-Build peak memory usage
------------------------
-
-Index build has two phases: (1) construct a knn graph, then (2) optimize it to remove redundant and unnecessary paths.
-The initial knn graph can be built with IVF-PQ or nn-descent. IVF-PQ has the additional benefit that it supports out-of-core construction, allowing CAGRA to be trained on datasets larger than available GPU memory.
-The steps below are sequential with distinct peak memory consumption. The overall peak memory utilization depends on the configured RMM memory resource.
-
-knn graph build phase using IVF-PQ
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The knn graph can be constructed using the IVF-PQ algorithm, which works in two stages: first, an IVF-PQ index is trained on a subset of vectors to learn cluster centroids; then, the full dataset is queried against this index in batches to find approximate nearest neighbors for each vector.
-
-**IVF-PQ Build (centroid training)** — uses a training subset to compute cluster centroids and PQ codebooks.
-
-.. math::
-
-   \text{IVFPQ_build_peak}
-   \;=\;
-   \frac{n_{\text{vectors}}}{\text{train_set_ratio}} \times \text{dim} \times 4
-   \;+\;
-   n_{\text{clusters}} \times \text{dim} \times 4
-   \;+\;
-   \frac{n_{\text{vectors}}}{\text{train_set_ratio}} \times \operatorname{sizeof}(\mathrm{uint32\_t})
-
-**Example** (n = 1e6; dim = 1024; n\_clusters = 1024; train\_set\_ratio = 10): 395.01 MB
-
-**IVF-PQ Search (forms the intermediate graph)** — Constructs the knn graph in batches by querying the IVF-PQ index for the nearest neighbors of all training points.
-
-.. math::
-
-   \text{IVFPQ_search_peak}
-   \;=\;
-   \text{batch_size} \times \text{dim} \times 4
-   \;+\;
-   \text{batch_size} \times \text{intermediate_degree} \times \operatorname{sizeof}(\mathrm{uint32\_t})
-   \;+\;
-   \text{batch_size} \times \text{intermediate_degree} \times 4
-
-**Example** (batch = 1024, dim = 1024, intermediate\_degree = 128): 5.00 MB
-
-knn graph build phase using NN-DESCENT
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-**Peak device memory:**
-
-.. math::
-
-   \text{NND_device_peak}
-   \;=\;
-   n_\text{vectors} \times (n_\text{dims} \times 2 + 276)
-
-- Data vectors (transferred to device and stored as fp16): :math:`n_\text{dims} \times 2` bytes per vector
-- Small working graph, locks, edge counters: 276 bytes per vector (fixed)
-- Additional :math:`4` bytes per vector when using L2 metric (for precomputed norms)
-
-**Peak host memory:**
-
-.. math::
-
-   \text{NND_host_peak}
-   \;=\;
-   n_\text{vectors} \times (13 \times \text{intermediate_graph_degree} + 912)
-
-- Full graph with distances (~1.3x overallocation): :math:`1.3 \times 8 \times \text{intermediate_graph_degree}` bytes per vector
-- Bloom filter for sampling: :math:`1.3 \times 2 \times \text{intermediate_graph_degree}` bytes per vector
-- 5 sample buffers (degree 32 each): 640 bytes per vector
-- Graph update buffer (degree 32): 256 bytes per vector
-- Edge counters: 16 bytes per vector
-
-
-Optimize phase
-~~~~~~~~~~~~~~
-
-Pruning/reordering the intermediate graph; peak scales linearly with intermediate degree.
-
-.. math::
-
-   \text{optimize_peak}
-   \;=\;
-   n_{\text{vectors}} \times
-   \Big( 4 + \big(\operatorname{sizeof}(\mathrm{IdxT}) + 1\big)\times \text{intermediate_degree} \Big)
-
-**Example** (n = 1e6, intermediate\_degree = 128, IdxT = int32): 614.17 MB
-Out-of-core CAGRA build consists of IVF-PQ build, IVF-PQ search, CAGRA optimization. Note that these steps are performed sequentially, so they are not additive.
-
-Overall Build Peak Memory Usage
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The overall peak memory footprint on the device is the maximum allocation across each sequential step, since RMM's ``device_memory_resource`` releases memory between steps.
-
-**Using IVF-PQ:**
-
-.. math::
-
-   \text{build_peak}
-   \;=\;
-   \text{dataset_size}
-   \;+\;
-   \max\!\big(\text{IVFPQ_build_peak},\ \text{IVFPQ_search_peak},\ \text{optimize_peak}\big)
-
-**Example:** 3906.25 + max(395.01, 5.00, 614.17) = 4520.42 MB
-
-**Using NN-Descent:**
-
-.. math::
-
-   \text{build_peak}
-   \;=\;
-   \text{dataset_size}^{*}
-   \;+\;
-   \max\!\big(\text{NND_device_peak},\ \text{optimize_peak}\big)
-
-:math:`\text{dataset_size}^{*}` applies only when the user passes data residing in device memory; NN-Descent internally copies the dataset to the device as fp16, so host-memory inputs do not add this term.
-
-Search peak memory usage
-------------------------
-
-CAGRA search requires the dataset and graph to already be resident in GPU memory. When using CAGRA-Q (compressed/quantized), the original dataset can reside in host memory instead. Additionally, temporary workspace memory is needed to store the search results for a batch of queries.
-If multiple batches are to be launched concurrently or overlapped, separate results buffers will be needed for each.
-The below memory estimate assumes just one batch of queries being run at a time and reusing the buffers.
-
-.. math::
-
-   \text{search_memory}
-   \;=\;
-   \text{dataset_size} + \text{graph_size} + \text{workspace_size}
-
-Where ``workspace_size`` is the temporary memory used for query vectors and result storage:
-
-.. math::
-
-   \text{query_size}
-   \;=\;
-   \text{batch_size} \times \text{dim} \times \operatorname{sizeof}(\mathrm{float})
-
-.. math::
-
-   \text{result_size}
-   \;=\;
-   \text{batch_size} \times \text{topk} \times
-   \big(\operatorname{sizeof}(\mathrm{IdxT}) + \operatorname{sizeof}(\mathrm{float})\big)
-
-**Example** (dim = 1024, batch\_size = 100, topk = 10, IdxT = int32):
-
-- query\_size  = 409,600 B = 0.39 MB
-- result\_size = 8,000 B = 0.0076 MB
-- workspace\_size = query\_size + result\_size = 0.40 MB
-- Total search memory ≈ 3906.25 + 244.14 + 0.40 = 4150.79 MB
diff --git a/docs/source/neighbors/ivfflat.md b/docs/source/neighbors/ivfflat.md
new file mode 100644
index 0000000000..04febe28dd
--- /dev/null
+++ b/docs/source/neighbors/ivfflat.md
@@ -0,0 +1,106 @@
+# IVF-Flat
+
+IVF-Flat is an inverted file index (IVF) algorithm, which in the context of nearest neighbors means that data points are
+partitioned into clusters. At search time, brute-force is performed only in a (user-defined) subset of the closest clusters.
+In practice, this algorithm can search the index much faster than brute-force and often still maintain an acceptable
+recall, though this comes with the drawback that the index itself copies the original training vectors into a memory layout
+that is optimized for fast memory reads and adds some additional memory storage overheads. Once the index is trained,
+this algorithm no longer requires the original raw training vectors.
+
+IVF-Flat tends to be a great choice when
+
+1. like brute-force, there is enough device memory available to fit all of the vectors
+in the index, and
+2. exact recall is not needed. As with the other index types, the tuning parameters are used to trade off recall for search latency and throughput.
+
+[ {doc}`C API <../c_api/neighbors_ivf_flat_c>` | {doc}`C++ API <../cpp_api/neighbors_ivf_flat>` | {doc}`Python API <../python_api/neighbors_ivf_flat>` | {doc}`Rust API <../rust_api/index>` ]
+
+## Filtering considerations
+
+IVF methods only apply filters to the lists which are probed for each query point. As a result, the results of a filtered query will likely differ significantly from the results of filtering applied to an exact method like brute-force. For example, imagine you have 3 IVF lists each containing 2 vectors and you perform a query against only the closest 2 lists but you filter out all but 1 element. If that remaining element happens to be in the list which was not probed, it will not be considered at all in the search results. It's important to consider this when using any of the IVF methods in your applications.
+
+
+## Configuration parameters
+
+### Build parameters
+
+```{list-table}
+:widths: 25 25 50
+:header-rows: 1
+
+* - Name
+  - Default
+  - Description
+* - n_lists
+  - sqrt(n)
+  - Number of coarse clusters used to partition the index. A good heuristic for this value is sqrt(n_vectors_in_index)
+* - add_data_on_build
+  - True
+  - Should the training points be added to the index after the index is built?
+* - kmeans_train_iters
+  - 20
+  - Max number of iterations for k-means training before convergence is assumed. Note that convergence could happen before this number of iterations.
+* - kmeans_trainset_fraction
+  - 0.5
+  - Fraction of points that should be subsampled from the original dataset to train the k-means clusters. Default is 1/2 the training dataset. This can often be reduced for very large datasets to improve both cluster quality and the build time.
+* - adaptive_centers
+  - false
+  - Should the existing trained centroids adapt to new points that are added to the index? This provides a trade-off between improving recall at the expense of having to compute new centroids for clusters when new points are added. When points are added in large batches, the performance cost may not be noticeable.
+* - conservative_memory_allocation
+  - false
+  - To support dynamic indexes, where points are expected to be added later, the individual IVF lists can be intentionally overallocated up front to reduce the frequency and impact of growing a list, which requires allocating more memory and copying the old list to the new, larger list.
+```
+
+### Search parameters
+
+```{list-table}
+:widths: 25 25 50
+:header-rows: 1
+
+* - Name
+  - Default
+  - Description
+* - n_probes
+  - 20
+  - Number of closest IVF lists to scan for each query point.
+```
+
+## Tuning Considerations
+
+Since IVF methods use clustering to establish spatial locality and partition data points into individual lists, there's an inherent
+assumption that the number of lists, and thus the max size of the data in the index, is known up front. For some use-cases, this
+might not matter. For example, most vector databases build many smaller physical approximate nearest neighbors indexes, each from
+fixed-size or maximum-sized immutable segments, and so the number of lists can be tuned based on the number of vectors in the indexes.
+
+Empirically, we've found $\sqrt{n\_index\_vectors}$ to be a good starting point for the $n\_lists$ hyper-parameter. Remember, having more
+lists means fewer points to search within each list, but it could also mean more $n\_probes$ are needed at search time to reach an acceptable
+recall.
+
+
+## Memory footprint
+
+Each cluster is padded to at least 32 vectors (but potentially up to 1024). Assuming uniform random distribution of vectors/list, we would have
+$cluster\_overhead = (conservative\_memory\_allocation ? 16 : 512 ) * dim * sizeof_{float}$
+
+Note that each cluster is allocated as a separate allocation. If we use a `cuda_memory_resource`, that would grab memory in 1 MiB chunks, so on average we might have 0.5 MiB overhead per cluster. If we use tens of thousands of clusters, it becomes essential to use a pool allocator to avoid this overhead.
+
+$cluster\_overhead = 0.5\ MiB$ (if we do not use a pool allocator)
+
+
+### Index (device memory):
+
+$$
+n\_vectors * n\_dimensions * sizeof(T) +
+
+n\_vectors * sizeof(int\_type) +
+
+n\_clusters * n\_dimensions * sizeof(T) +
+
+n\_clusters * cluster\_overhead
+$$
+
+### Peak device memory usage for index build:
+
+$workspace = min(1GB, n\_queries * [(n\_lists + 1 + n\_probes * (k + 1)) * sizeof_{float} + n\_probes * k * sizeof_{idx}])$
+
+$index\_size + workspace$
diff --git a/docs/source/neighbors/ivfflat.rst b/docs/source/neighbors/ivfflat.rst
deleted file mode 100644
index d4c8f03c18..0000000000
--- a/docs/source/neighbors/ivfflat.rst
+++ /dev/null
@@ -1,115 +0,0 @@
-IVF-Flat
-========
-
-IVF-Flat is an inverted file index (IVF) algorithm, which in the context of nearest neighbors means that data points are
-partitioned into clusters. At search time, brute-force is performed only in a (user-defined) subset of the closest clusters.
-In practice, this algorithm can search the index much faster than brute-force and often still maintain an acceptable
-recall, though this comes with the drawback that the index itself copies the original training vectors into a memory layout
-that is optimized for fast memory reads and adds some additional memory storage overheads. Once the index is trained,
-this algorithm no longer requires the original raw training vectors.
-
-IVF-Flat tends to be a great choice when
-
-1. like brute-force, there is enough device memory available to fit all of the vectors
-in the index, and
-2. exact recall is not needed. as with the other index types, the tuning parameters are used to trade-off recall for search latency / throughput.
-
-[ :doc:`C API <../c_api/neighbors_ivf_flat_c>` | :doc:`C++ API <../cpp_api/neighbors_ivf_flat>` | :doc:`Python API <../python_api/neighbors_ivf_flat>` | :doc:`Rust API <../rust_api/index>` ]
-
-Filtering considerations
-------------------------
-
-IVF methods only apply filters to the lists which are probed for each query point. As a result, the results of a filtered query will likely differ significantly from the results of a filtering applid to an exact method like brute-force. For example. imagine you have 3 IVF lists each containing 2 vectors and you perform a query against only the closest 2 lists but you filter out all but 1 element. If that remaining element happens to be in one of the lists which was not proved, it will not be considered at all in the search results. It's important to consider this when using any of the IVF methods in your applications.
-
-
-Configuration parameters
-------------------------
-
-Build parameters
-~~~~~~~~~~~~~~~~
-
-.. list-table::
-   :widths: 25 25 50
-   :header-rows: 1
-
-   * - Name
-     - Default
-     - Description
-   * - n_lists
-     - sqrt(n)
-     - Number of coarse clusters used to partition the index. A good heuristic for this value is sqrt(n_vectors_in_index)
-   * - add_data_on_build
-     - True
-     - Should the training points be added to the index after the index is built?
-   * - kmeans_train_iters
-     - 20
-     - Max number of iterations for k-means training before convergence is assumed. Note that convergence could happen before this number of iterations.
-   * - kmeans_trainset_fraction
-     - 0.5
-     - Fraction of points that should be subsampled from the original dataset to train the k-means clusters. Default is 1/2 the training dataset. This can often be reduced for very large datasets to improve both cluster quality and the build time.
-   * - adaptive_centers
-     - false
-     - Should the existing trained centroids adapt to new points that are added to the index? This provides a trade-off between improving recall at the expense of having to compute new centroids for clusters when new points are added. When points are added in large batches, the performance cost may not be noticeable.
-   * - conservative_memory_allocation
-     - false
-     - To support dynamic indexes, where points are expected to be added later, the individual IVF lists can be imtentionally overallocated up front to reduce the amount and impact of increasing list sizes, which requires allocating more memory and copying the old list to the new, larger, list.
-
-
-Search parameters
-~~~~~~~~~~~~~~~~~
-
-.. list-table::
-   :widths: 25 25 50
-   :header-rows: 1
-
-   * - Name
-     - Default
-     - Description
-   * - n_probes
-     - 20
-     - Number of closest IVF lists to scan for each query point.
-
-Tuning Considerations
----------------------
-
-Since IVF methods use clustering to establish spatial locality and partition data points into individual lists, there's an inherent
-assumption that the number of lists, and thus the max size of the data in the index is known up front. For some use-cases, this
-might not matter. For example, most vector databases build many smaller physical approximate nearest neighbors indexes, each from
-fixed-size or maximum-sized immutable segments and so the number of lists can be tuned based on the number of vectors in the indexes.
-
-Empirically, we've found :math:`\sqrt{n\_index\_vectors}` to be a good starting point for the :math:`n\_lists` hyper-parameter. Remember, having more
-lists means less points to search within each list, but it could also mean more :math:`n\_probes` are needed at search time to reach an acceptable
-recall.
-
-
-Memory footprint
-----------------
-
-Each cluster is padded to at least 32 vectors (but potentially up to 1024). Assuming uniform random distribution of vectors/list, we would have
-:math:`cluster\_overhead = (conservative\_memory\_allocation ? 16 : 512 ) * dim * sizeof_{float}`
-
-Note that each cluster is allocated as a separate allocation. If we use a `cuda_memory_resource`, that would grab memory in 1 MiB chunks, so on average we might have 0.5 MiB overhead per cluster. If we us 10s of thousands of clusters, it becomes essential to use pool allocator to avoid this overhead.
-
-:math:`cluster\_overhead =  0.5 MiB` // if we do not use pool allocator
-
-
-Index (device memory):
-~~~~~~~~~~~~~~~~~~~~~~
-
-.. math::
-
-   n\_vectors * n\_dimensions * sizeof(T) +
-
-   n\_vectors  * sizeof(int_type) +
-
-   n\_clusters * n\_dimensions * sizeof(T) +
-
-   n\_clusters * cluster_overhead`
-
-
-Peak device memory usage for index build:
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-:math:`workspace = min(1GB, n\_queries * [(n\_lists + 1 + n\_probes * (k + 1)) * sizeof_{float} + n\_probes * k * sizeof_{idx}])`
-
-:math:`index\_size + workspace`
diff --git a/docs/source/neighbors/ivfpq.md b/docs/source/neighbors/ivfpq.md
new file mode 100644
index 0000000000..893dd53a23
--- /dev/null
+++ b/docs/source/neighbors/ivfpq.md
@@ -0,0 +1,126 @@
+# IVF-PQ
+
+IVF-PQ is an inverted file index (IVF) algorithm that extends the IVF-Flat algorithm (i.e. data points are first
+partitioned into clusters) by performing product quantization within each cluster in order to shrink the memory footprint
+of the index. Product quantization is a lossy compression method that makes it possible to store a larger number of vectors
+on the GPU by offloading the original vectors to main memory. However, higher compression levels often lead to reduced recall.
+Often a strategy called refinement reranking is employed to make up for the lost recall by querying the IVF-PQ index for a larger
+`k` than desired and performing a reordering and reduction to `k` based on the distances from the unquantized vectors. Unfortunately,
+this does mean that the unquantized raw vectors need to be available, though the reranking can often be done efficiently using multiple CPU threads.
+
+[ {doc}`C API <../c_api/neighbors_ivf_pq_c>` | {doc}`C++ API <../cpp_api/neighbors_ivf_pq>` | {doc}`Python API <../python_api/neighbors_ivf_pq>` | {doc}`Rust API <../rust_api/index>` ]
+
+
+## Configuration parameters
+
+### Build parameters
+
+```{list-table}
+:widths: 25 25 50
+:header-rows: 1
+
+* - Name
+  - Default
+  - Description
+* - n_lists
+  - sqrt(n)
+  - Number of coarse clusters used to partition the index. A good heuristic for this value is sqrt(n_vectors_in_index)
+* - kmeans_n_iters
+  - 20
+  - The number of iterations when searching for k-means centers
+* - kmeans_trainset_fraction
+  - 0.5
+  - The fraction of training data to use for iterative k-means building
+* - pq_bits
+  - 8
+  - The bit length of each vector element after compressing with PQ. Possible values are any integer between 4 and 8.
+* - pq_dim
+  - 0
+  - The dimensionality of each vector after compressing with PQ. When 0, the dim is set heuristically.
+* - codebook_kind
+  - per_subspace
+  - How codebooks are created. `per_subspace` trains kmeans on some number of sub-dimensions while `per_cluster` trains a separate codebook for each IVF cluster.
+* - force_random_rotation
+  - false
+  - Apply a random rotation matrix on the input data and queries even if `dim % pq_dim == 0`
+* - conservative_memory_allocation
+  - false
+  - To support dynamic indexes, where points are expected to be added later, the individual IVF lists can be intentionally overallocated up front to reduce the amount and impact of increasing list sizes, which requires allocating more memory and copying the old list to the new, larger, list.
+* - add_data_on_build
+  - True
+  - Should the training points be added to the index after the index is built?
+* - max_train_points_per_pq_code
+  - 256
+  - The max number of data points to use per PQ code during PQ codebook training.
+```
+
+### Search parameters
+
+```{list-table}
+:widths: 25 25 50
+:header-rows: 1
+
+* - Name
+  - Default
+  - Description
+* - n_probes
+  - 20
+  - Number of closest IVF lists to scan for each query point.
+* - lut_dtype
+  - cuda_r_32f
+  - Datatype to store the pq lookup tables. Can also use cuda_r_16f for half-precision and cuda_r_8u for 8-bit precision. Smaller lookup tables can fit into shared memory and significantly improve search times.
+* - internal_distance_dtype
+  - cuda_r_32f
+  - Storage data type for distance/similarity computed at search time. Can also use cuda_r_16f for half-precision.
+* - preferred_smem_carveout
+  - 1.0
+  - Preferred fraction of SM's unified memory / L1 cache to be used as shared memory. Default is 100%
+```
+
+## Tuning Considerations
+
+IVF-PQ has similar tuning considerations to IVF-Flat, though the PQ compression ratio adds an additional variable to trade off index size for search quality.
+
+It's important to note that IVF-PQ becomes very lossy very quickly, and so refinement reranking is often needed to get a reasonable recall. This step usually consists of searching initially for more k-neighbors than needed and then reducing the resulting neighborhoods down to k by computing exact distances. This step can be performed efficiently on CPU or GPU and generally has only a marginal impact on search latency.
+
+## Memory footprint
+
+### Index (device memory):
+
+Simple approximate formula: $n\_vectors * (pq\_dim * \frac{pq\_bits}{8} + sizeof_{idx}) + n\_clusters$
+
+The IVF lists end up being represented by a sparse data structure that stores the pointers to each list, an indices array that contains the indexes of each vector in each list, and an array with the encoded (and interleaved) data for each list.
+
+IVF list pointers: $n\_clusters * sizeof_{uint32\_t}$
+
+Indices: $n\_vectors * sizeof_{idx}$
+
+Encoded data (interleaved): $n\_vectors * pq\_dim * \frac{pq\_bits}{8}$
+
+Codebooks (per subspace method): $4 * pq\_dim * pq\_len * 2^{pq\_bits}$
+
+Codebooks (per cluster method): $4 * n\_clusters * pq\_len * 2^{pq\_bits}$
+
+Extras: $n\_clusters * (20 + 8 * dim)$
+
+### Index (host memory):
+
+When refinement is used with the dataset on host, the original raw vectors are needed: $n\_vectors * dims * sizeof_{float}$
+
+### Search peak memory usage (device):
+
+Total usage: $index + queries + output\_indices + output\_distances + workspace$
+
+The workspace size is not trivial to compute; a heuristic controls the batch size to make sure the workspace fits within `raft::resource::get_workspace_free_bytes(res)`.
+
+### Build peak memory usage (device):
+
+$$
+\frac{n\_vectors}{trainset\_ratio} * dim * sizeof_{float}
+
++ \frac{n\_vectors}{trainset\_ratio} * sizeof_{uint32\_t}
+
++ n\_clusters * dim * sizeof_{float}
+$$
+
+Note that if there’s not enough space left in the workspace memory resource, the IVF-PQ build automatically switches to managed memory for the training set and labels.
diff --git a/docs/source/neighbors/ivfpq.rst b/docs/source/neighbors/ivfpq.rst
deleted file mode 100644
index e243a5b562..0000000000
--- a/docs/source/neighbors/ivfpq.rst
+++ /dev/null
@@ -1,135 +0,0 @@
-IVF-PQ
-======
-
-IVF-PQ is an inverted file index (IVF) algorithm, which is an extension to the IVF-Flat algorithm (e.g. data points are first
-partitioned into clusters) where product quantization is performed within each cluster in order to shrink the memory footprint
-of the index. Product quantization is a lossy compression method and it is capable of storing larger number of vectors
-on the GPU by offloading the original vectors to main memory, however higher compression levels often lead to reduced recall.
-Often a strategy called refinement reranking is employed to make up for the lost recall by querying the IVF-PQ index for a larger
-`k` than desired and performing a reordering and reduction to `k` based on the distances from the unquantized vectors. Unfortunately,
-this does mean that the unquantized raw vectors need to be available and often this can be done efficiently using multiple CPU threads.
-
-[ :doc:`C API <../c_api/neighbors_ivf_pq_c>` | :doc:`C++ API <../cpp_api/neighbors_ivf_pq>` | :doc:`Python API <../python_api/neighbors_ivf_pq>` | :doc:`Rust API <../rust_api/index>` ]
-
-
-Configuration parameters
-------------------------
-
-Build parameters
-~~~~~~~~~~~~~~~~
-
-.. list-table::
-   :widths: 25 25 50
-   :header-rows: 1
-
-   * - Name
-     - Default
-     - Description
-   * - n_lists
-     - sqrt(n)
-     - Number of coarse clusters used to partition the index. A good heuristic for this value is sqrt(n_vectors_in_index)
-   * - kmeans_n_iters
-     - 20
-     - The number of iterations when searching for k-means centers
-   * - kmeans_trainset_fraction
-     - 0.5
-     - The fraction of training data to use for iterative k-means building
-   * - pq_bits
-     - 8
-     - The bit length of each vector element after compressing with PQ. Possible values are any integer between 4 and 8.
-   * - pq_dim
-     - 0
-     - The dimensionality of each vector after compressing with PQ. When 0, the dim is set heuristically.
-   * - codebook_kind
-     - per_subspace
-     - How codebooks are created. `per_subspace` trains kmeans on some number of sub-dimensions while `per_cluster`
-   * - force_random_rotation
-     - false
-     - Apply a random rotation matrix on the input data and queries even if `dim % pq_dim == 0`
-   * - conservative_memory_allocation
-     - false
-     - To support dynamic indexes, where points are expected to be added later, the individual IVF lists can be imtentionally overallocated up front to reduce the amount and impact of increasing list sizes, which requires allocating more memory and copying the old list to the new, larger, list.
-   * - add_data_on_build
-     - True
-     - Should the training points be added to the index after the index is built?
-   * - max_train_points_per_pq_code
-     - 256
-     - The max number of data points to use per PQ code during PQ codebook training.
-
-
-Search parameters
-~~~~~~~~~~~~~~~~~
-
-.. list-table::
-   :widths: 25 25 50
-   :header-rows: 1
-
-   * - Name
-     - Default
-     - Description
-   * - n_probes
-     - 20
-     - Number of closest IVF lists to scan for each query point.
-   * - lut_dtype
-     - cuda_r_32f
-     - Datatype to store the pq lookup tables. Can also use cuda_r_16f for half-precision and cuda_r_8u for 8-bit precision. Smaller lookup tables can fit into shared memory and significantly improve search times.
-   * - internal_distance_dtype
-     - cuda_r_32f
-     - Storage data type for distance/similarity computed at search time. Can also use cuda_r_16f for half-precision.
-   * - preferred_smem_carveout
-     - 1.0
-     - Preferred fraction of SM's unified memory / L1 cache to be used as shared memory. Default is 100%
-
-Tuning Considerations
----------------------
-
-IVF-PQ has similar tuning considerations to IVF-flat, though the PQ compression ratio adds an additional variable to trade-off index size for search quality.
-
-It's important to note that IVF-PQ becomes very lossy very quickly, and so refinement reranking is often needed to get a reasonable recall. This step usually consists of searching initially for more k-neighbors than needed and then reducing the resulting neighborhoods down to k by computing exact distances. This step can be performed efficiently on CPU or GPU and generally has only a marginal impact on search latency.
-
-Memory footprint
-----------------
-
-Index (device memory):
-~~~~~~~~~~~~~~~~~~~~~~
-
-Simple approximate formula: :math:`n\_vectors * (pq\_dim * \frac{pq\_bits}{8} + sizeof_{idx}) + n\_clusters`
-
-The IVF lists end up being represented by a sparse data structure that stores the pointers to each list, an indices array that contains the indexes of each vector in each list, and an array with the encoded (and interleaved) data for each list.
-
-IVF list pointers: :math:`n\_clusters * sizeof_{uint32\_t}`
-
-Indices: :math:`n\_vectors * sizeof_{idx}`
-
-Encoded data (interleaved): :math:`n\_vectors * pq\_dim * \frac{pq\_bits}{8}`
-
-Per subspace method: :math:`4 * pq\_dim * pq\_len * 2^{pq\_bits}`
-
-Per cluster method: :math:`4 * n\_clusters * pq\_len * 2^{pq\_bits}`
-
-Extras: :math:`n\_clusters * (20 + 8 * dim)`
-
-Index (host memory):
-~~~~~~~~~~~~~~~~~~~~
-
-When refinement is used with the dataset on host, the original raw vectors are needed: :math:`n\_vectors * dims * sizeof_{float}`
-
-Search peak memory usage (device);
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Total usage: :math:`index + queries + output\_indices + output\_distances + workspace`
-
-Workspace size is not trivial, a heuristic controls the batch size to make sure the workspace fits the `raft::resource::get_workspace_free_bytes(res)``.
-
-Build peak memory usage (device):
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. math::
-
-   \frac{n\_vectors}{trainset\_ratio * dims * sizeof_{float}}
-
-   + \frac{n\_vectors}{trainset\_ratio * sizeof_{uint32\_t}}
-
-   + n\_clusters * dim * sizeof_{float}
-
-Note, if there’s not enough space left in the workspace memory resource, IVF-PQ build automatically switches to the managed memory for the training set and labels.
diff --git a/docs/source/neighbors/neighbors.md b/docs/source/neighbors/neighbors.md
new file mode 100644
index 0000000000..a1c436caa7
--- /dev/null
+++ b/docs/source/neighbors/neighbors.md
@@ -0,0 +1,19 @@
+# Nearest Neighbor
+
+```{toctree}
+:maxdepth: 3
+:caption: Contents:
+
+bruteforce.md
+cagra.md
+ivfflat.md
+ivfpq.md
+vamana.md
+all_neighbors.md
+```
+
+# Indices and tables
+
+* {ref}`genindex`
+* {ref}`modindex`
+* {ref}`search`
diff --git a/docs/source/neighbors/neighbors.rst b/docs/source/neighbors/neighbors.rst
deleted file mode 100644
index f66b73f867..0000000000
--- a/docs/source/neighbors/neighbors.rst
+++ /dev/null
@@ -1,21 +0,0 @@
-Nearest Neighbor
-================
-
-.. toctree::
-   :maxdepth: 3
-   :caption: Contents:
-
-   bruteforce.rst
-   cagra.rst
-   ivfflat.rst
-   ivfpq.rst
-   vamana.rst
-   all_neighbors.rst
-
-
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
diff --git a/docs/source/neighbors/vamana.rst b/docs/source/neighbors/vamana.md
similarity index 55%
rename from docs/source/neighbors/vamana.rst
rename to docs/source/neighbors/vamana.md
index 4f5c2eb5f0..7761f57654 100644
--- a/docs/source/neighbors/vamana.rst
+++ b/docs/source/neighbors/vamana.md
@@ -1,5 +1,4 @@
-Vamana
-======
+# Vamana
 
 VAMANA is the underlying graph construction algorithm used to construct indexes for the DiskANN vector search solution. DiskANN and the Vamana algorithm are described in detail in the `published paper <https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf>`, and a highly optimized `open-source repository <https://github.com/microsoft/DiskANN>`  includes many features for index construction and search. In cuVS, we provide a version of the Vamana algorithm optimized for GPU architectures to accelreate graph construction to build DiskANN idnexes. At a high level, the Vamana algorithm operates as follows:
 
@@ -11,65 +10,60 @@ There are many algorithmic details that are outlined in the `paper <https://pape
 
 The current implementation of DiskANN in cuVS only includes the 'in-memory' graph construction and a serialization step that writes the index to a file. This index file can be then used by the `open-source DiskANN <https://github.com/microsoft/DiskANN>` library to perform efficient search. Additional DiskANN functionality, including GPU-accelerated search and 'ssd' index build are planned for future cuVS releases.
 
-[ :doc:`C++ API <../cpp_api/neighbors_vamana>` ]
+[ {doc}`C++ API <../cpp_api/neighbors_vamana>` ]
 
-Interoperability with CPU DiskANN
----------------------------------
+## Interoperability with CPU DiskANN
 
 The 'vamana::serialize' API calls writes the index to a file with a format that is compatible with the `open-source DiskANN repositoriy <https://github.com/microsoft/DiskANN>`. This allows cuVS to be used to accelerate index construction while leveraging the efficient CPU-based search currently available.
 
-Configuration parameters
-------------------------
-
-Build parameters
-~~~~~~~~~~~~~~~~
-
-.. list-table::
-   :widths: 25 25 50
-   :header-rows: 1
-
-   * - Name
-     - Default
-     - Description
-   * - graph_degree
-     - 32
-     - The maximum degre of the final Vamana graph. The internal representation of the graph includes this many edges for every node, but serialize will compress the graph into a 'CSR' format with, potentially, fewer edges.
-   * - visited_size
-     - 64
-     - Maximum number of visited nodes saved during each traversal to insert a new node. This corresponds to the 'L' parameter in the paper.
-   * - vamana_iters
-     - 1
-     - Number of iterations ran to improve the graph. Each iteration involves inserting every vector in the dataset.
-   * - alpha
-     - 1.2
-     - Alpha parameter that defines how aggressively to prune edges.
-   * - max_fraction
-     - 0.06
-     - Maximum fraction of the dataset that will be inserted as a single batch. Larger max batch size decreases graph quality but improves speed.
-   * - batch_base
-     - 2
-     - Base of growth rate of batch sizes. Insertion batch sizes increase exponentially based on this parameter until max_fraction is reached.
-   * - queue_size
-     - 127
-     - Size of the candidate queue structure used during graph traversal. Must be (2^x)-1 for some x, and must be > visited_size.
-
-Tuning Considerations
----------------------
+## Configuration parameters
+
+### Build parameters
+
+```{list-table}
+:widths: 25 25 50
+:header-rows: 1
+
+* - Name
+  - Default
+  - Description
+* - graph_degree
+  - 32
+  - The maximum degree of the final Vamana graph. The internal representation of the graph includes this many edges for every node, but serialize will compress the graph into a 'CSR' format with, potentially, fewer edges.
+* - visited_size
+  - 64
+  - Maximum number of visited nodes saved during each traversal to insert a new node. This corresponds to the 'L' parameter in the paper.
+* - vamana_iters
+  - 1
+  - Number of iterations run to improve the graph. Each iteration involves inserting every vector in the dataset.
+* - alpha
+  - 1.2
+  - Alpha parameter that defines how aggressively to prune edges.
+* - max_fraction
+  - 0.06
+  - Maximum fraction of the dataset that will be inserted as a single batch. Larger max batch size decreases graph quality but improves speed.
+* - batch_base
+  - 2
+  - Base of growth rate of batch sizes. Insertion batch sizes increase exponentially based on this parameter until max_fraction is reached.
+* - queue_size
+  - 127
+  - Size of the candidate queue structure used during graph traversal. Must be (2^x)-1 for some x, and must be > visited_size.
+```
+
+## Tuning Considerations
 
 The 2 hyper-parameters that are most often tuned are `graph_degree` and `visited_size`. The time needed to create a graph increases dramatically when increasing `graph_degree`, in particular. However, larger graphs may be needed to achieve very high recall search, especially for large datasets.
 
-Memory footprint
-----------------
+## Memory footprint
 
 Vamana builds a graph that is stored in device memory. However, in order to serialize the index and write it to a file for later use, it must be moved into host memory. If the `include_dataset` parameter is also set, then the dataset must be resident in host memory when calling serialize as well.
 
-Device memory usage
-~~~~~~~~~~~~~~~~~~~
+### Device memory usage
 
-The built index represents the graph as fixed degree, storing a total of :math:`graph\_degree * n\_index\_vectors` edges. Graph construction also requires the dataset be in device memory (or it copies it to device during build). In addition, device memory is used during construction to sort and create the reverse edges. Thus, the amount of device memory needed depends on the dataset itself, but it is bounded by a maximum sum of:
+The built index represents the graph with a fixed degree, storing a total of $graph\_degree * n\_index\_vectors$ edges. Graph construction also requires the dataset to be in device memory (or it copies it to device during build). In addition, device memory is used during construction to sort and create the reverse edges. Thus, the amount of device memory needed depends on the dataset itself, but it is bounded by a maximum sum of:
 
-- vector dataset: :math:`n\_index\_vectors * n\_dims * sizeof(T)`
-- output graph: :math:`graph\_degree * n\_index\_vectors * sizeof(IdxT)`
-- scratch memory: :math:`n\_index\_vectors * max\_fraction * (2 + graph\_degree) * sizeof(IdxT)`
+- vector dataset: $n\_index\_vectors * n\_dims * sizeof(T)$
+- output graph: $graph\_degree * n\_index\_vectors * sizeof(IdxT)$
+- scratch memory: $n\_index\_vectors * max\_fraction * (2 + graph\_degree) * sizeof(IdxT)$
 
 Reduction in scratch device memory requirements are planned for upcoming releases of cuVS.
diff --git a/docs/source/python_api.md b/docs/source/python_api.md
new file mode 100644
index 0000000000..dcc3da0607
--- /dev/null
+++ b/docs/source/python_api.md
@@ -0,0 +1,13 @@
+# Python API Documentation
+
+(api)=
+
+```{toctree}
+:maxdepth: 4
+
+python_api/cluster.md
+python_api/distance.md
+python_api/neighbors.md
+python_api/preprocessing.md
+```
+
diff --git a/docs/source/python_api.rst b/docs/source/python_api.rst
deleted file mode 100644
index 4c8fc47820..0000000000
--- a/docs/source/python_api.rst
+++ /dev/null
@@ -1,13 +0,0 @@
-~~~~~~~~~~~~~~~~~~~~~~~~
-Python API Documentation
-~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. _api:
-
-.. toctree::
-   :maxdepth: 4
-
-   python_api/cluster.rst
-   python_api/distance.rst
-   python_api/neighbors.rst
-   python_api/preprocessing.rst
diff --git a/docs/source/python_api/cluster.md b/docs/source/python_api/cluster.md
new file mode 100644
index 0000000000..0ba3911def
--- /dev/null
+++ b/docs/source/python_api/cluster.md
@@ -0,0 +1,9 @@
+# Cluster
+
+```{toctree}
+:maxdepth: 1
+:caption: Contents:
+
+cluster_kmeans.md
+```
+
diff --git a/docs/source/python_api/cluster.rst b/docs/source/python_api/cluster.rst
deleted file mode 100644
index b5c0ab957c..0000000000
--- a/docs/source/python_api/cluster.rst
+++ /dev/null
@@ -1,12 +0,0 @@
-Cluster
-========
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-.. toctree::
-   :maxdepth: 1
-   :caption: Contents:
-
-   cluster_kmeans.rst
diff --git a/docs/source/python_api/cluster_kmeans.md b/docs/source/python_api/cluster_kmeans.md
new file mode 100644
index 0000000000..e17a1b1553
--- /dev/null
+++ b/docs/source/python_api/cluster_kmeans.md
@@ -0,0 +1,23 @@
+# K-Means
+
+## K-Means Parameters
+
+```{autoclass} cuvs.cluster.kmeans.KMeansParams
+:members:
+```
+
+## K-Means Fit
+
+```{autofunction} cuvs.cluster.kmeans.fit
+```
+
+## K-Means Predict
+
+```{autofunction} cuvs.cluster.kmeans.predict
+```
+
+## K-Means Cluster Cost
+
+```{autofunction} cuvs.cluster.kmeans.cluster_cost
+```
+
diff --git a/docs/source/python_api/cluster_kmeans.rst b/docs/source/python_api/cluster_kmeans.rst
deleted file mode 100644
index 8fda17f80d..0000000000
--- a/docs/source/python_api/cluster_kmeans.rst
+++ /dev/null
@@ -1,27 +0,0 @@
-K-Means
-=======
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-K-Means Parameters
-##################
-
-.. autoclass:: cuvs.cluster.kmeans.KMeansParams
-    :members:
-
-K-Means Fit
-###########
-
-.. autofunction:: cuvs.cluster.kmeans.fit
-
-K-Means Predict
-###############
-
-.. autofunction:: cuvs.cluster.kmeans.predict
-
-K-Means Cluster Cost
-####################
-
-.. autofunction:: cuvs.cluster.kmeans.cluster_cost
diff --git a/docs/source/python_api/distance.md b/docs/source/python_api/distance.md
new file mode 100644
index 0000000000..feb7c2fcc4
--- /dev/null
+++ b/docs/source/python_api/distance.md
@@ -0,0 +1,7 @@
+# Distance
+
+## Pairwise Distance
+
+```{autofunction} cuvs.distance.pairwise_distance
+```
+
diff --git a/docs/source/python_api/distance.rst b/docs/source/python_api/distance.rst
deleted file mode 100644
index debd82953c..0000000000
--- a/docs/source/python_api/distance.rst
+++ /dev/null
@@ -1,12 +0,0 @@
-Distance
-========
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-
-Pairwise Distance
-#################
-
-.. autofunction:: cuvs.distance.pairwise_distance
diff --git a/docs/source/python_api/neighbors.md b/docs/source/python_api/neighbors.md
new file mode 100644
index 0000000000..ab528d90c5
--- /dev/null
+++ b/docs/source/python_api/neighbors.md
@@ -0,0 +1,16 @@
+# Nearest Neighbors
+
+```{toctree}
+:maxdepth: 2
+:caption: Contents:
+
+neighbors_all_neighbors.md
+neighbors_brute_force.md
+neighbors_cagra.md
+neighbors_hnsw.md
+neighbors_ivf_flat.md
+neighbors_ivf_pq.md
+neighbors_multi_gpu.md
+neighbors_nn_decent.md
+```
+
diff --git a/docs/source/python_api/neighbors.rst b/docs/source/python_api/neighbors.rst
deleted file mode 100644
index a9914fa44f..0000000000
--- a/docs/source/python_api/neighbors.rst
+++ /dev/null
@@ -1,19 +0,0 @@
-Nearest Neighbors
-=================
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Contents:
-
-   neighbors_all_neighbors.rst
-   neighbors_brute_force.rst
-   neighbors_cagra.rst
-   neighbors_hnsw.rst
-   neighbors_ivf_flat.rst
-   neighbors_ivf_pq.rst
-   neighbors_multi_gpu.rst
-   neighbors_nn_decent.rst
diff --git a/docs/source/python_api/neighbors_all_neighbors.md b/docs/source/python_api/neighbors_all_neighbors.md
new file mode 100644
index 0000000000..d13aabfe64
--- /dev/null
+++ b/docs/source/python_api/neighbors_all_neighbors.md
@@ -0,0 +1,15 @@
+# All-Neighbors
+
+All-Neighbors builds an approximate all-neighbors k-NN graph. Given a full dataset, it finds the nearest neighbors of every training vector in that dataset.
+
+## Build Parameters
+
+```{autoclass} cuvs.neighbors.all_neighbors.AllNeighborsParams
+:members:
+```
+
+## Build
+
+```{autofunction} cuvs.neighbors.all_neighbors.build
+```
+
diff --git a/docs/source/python_api/neighbors_all_neighbors.rst b/docs/source/python_api/neighbors_all_neighbors.rst
deleted file mode 100644
index 89ba0f8020..0000000000
--- a/docs/source/python_api/neighbors_all_neighbors.rst
+++ /dev/null
@@ -1,19 +0,0 @@
-All-Neighbors
-=============
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-All-Neighbors allows building an approximate all-neighbors knn graph. Given a full dataset, it finds nearest neighbors for all the training vectors in the dataset.
-
-Build Parameters
-################
-
-.. autoclass:: cuvs.neighbors.all_neighbors.AllNeighborsParams
-    :members:
-
-Build
-#####
-
-.. autofunction:: cuvs.neighbors.all_neighbors.build
diff --git a/docs/source/python_api/neighbors_brute_force.md b/docs/source/python_api/neighbors_brute_force.md
new file mode 100644
index 0000000000..db5cb87b27
--- /dev/null
+++ b/docs/source/python_api/neighbors_brute_force.md
@@ -0,0 +1,28 @@
+# Brute Force KNN
+
+## Index
+
+```{autoclass} cuvs.neighbors.brute_force.Index
+:members:
+```
+
+## Index build
+
+```{autofunction} cuvs.neighbors.brute_force.build
+```
+
+## Index search
+
+```{autofunction} cuvs.neighbors.brute_force.search
+```
+
+## Index save
+
+```{autofunction} cuvs.neighbors.brute_force.save
+```
+
+## Index load
+
+```{autofunction} cuvs.neighbors.brute_force.load
+```
+
diff --git a/docs/source/python_api/neighbors_brute_force.rst b/docs/source/python_api/neighbors_brute_force.rst
deleted file mode 100644
index d756a6c802..0000000000
--- a/docs/source/python_api/neighbors_brute_force.rst
+++ /dev/null
@@ -1,32 +0,0 @@
-Brute Force KNN
-===============
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-Index
-#####
-
-.. autoclass:: cuvs.neighbors.brute_force.Index
-    :members:
-
-Index build
-###########
-
-.. autofunction:: cuvs.neighbors.brute_force.build
-
-Index search
-############
-
-.. autofunction:: cuvs.neighbors.brute_force.search
-
-Index save
-##########
-
-.. autofunction:: cuvs.neighbors.brute_force.save
-
-Index load
-##########
-
-.. autofunction:: cuvs.neighbors.brute_force.load
diff --git a/docs/source/python_api/neighbors_cagra.md b/docs/source/python_api/neighbors_cagra.md
new file mode 100644
index 0000000000..e947f5ae10
--- /dev/null
+++ b/docs/source/python_api/neighbors_cagra.md
@@ -0,0 +1,47 @@
+# CAGRA
+
+CAGRA is a graph-based nearest neighbors algorithm that was built from the ground up for GPU acceleration. CAGRA demonstrates state-of-the-art index build and query performance for both small-batch and large-batch search.
+
+## Index build parameters
+
+```{autoclass} cuvs.neighbors.cagra.IndexParams
+:members:
+```
+
+## Index search parameters
+
+```{autoclass} cuvs.neighbors.cagra.SearchParams
+:members:
+```
+
+## Index
+
+```{autoclass} cuvs.neighbors.cagra.Index
+:members:
+```
+
+## Index build
+
+```{autofunction} cuvs.neighbors.cagra.build
+```
+
+## Index search
+
+```{autofunction} cuvs.neighbors.cagra.search
+```
+
+## Index save
+
+```{autofunction} cuvs.neighbors.cagra.save
+```
+
+## Index load
+
+```{autofunction} cuvs.neighbors.cagra.load
+```
+
+## Index extend
+
+```{autofunction} cuvs.neighbors.cagra.extend
+```
+
diff --git a/docs/source/python_api/neighbors_cagra.rst b/docs/source/python_api/neighbors_cagra.rst
deleted file mode 100644
index 42647914f2..0000000000
--- a/docs/source/python_api/neighbors_cagra.rst
+++ /dev/null
@@ -1,51 +0,0 @@
-CAGRA
-=====
-
-CAGRA is a graph-based nearest neighbors algorithm that was built from the ground up for GPU acceleration. CAGRA demonstrates state-of-the art index build and query performance for both small- and large-batch sized search.
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-Index build parameters
-######################
-
-.. autoclass:: cuvs.neighbors.cagra.IndexParams
-    :members:
-
-Index search parameters
-#######################
-
-.. autoclass:: cuvs.neighbors.cagra.SearchParams
-    :members:
-
-Index
-#####
-
-.. autoclass:: cuvs.neighbors.cagra.Index
-    :members:
-
-Index build
-###########
-
-.. autofunction:: cuvs.neighbors.cagra.build
-
-Index search
-############
-
-.. autofunction:: cuvs.neighbors.cagra.search
-
-Index save
-##########
-
-.. autofunction:: cuvs.neighbors.cagra.save
-
-Index load
-##########
-
-.. autofunction:: cuvs.neighbors.cagra.load
-
-Index extend
-############
-
-.. autofunction:: cuvs.neighbors.cagra.extend
diff --git a/docs/source/python_api/neighbors_hnsw.md b/docs/source/python_api/neighbors_hnsw.md
new file mode 100644
index 0000000000..fce52090d8
--- /dev/null
+++ b/docs/source/python_api/neighbors_hnsw.md
@@ -0,0 +1,41 @@
+# HNSW
+
+This is a wrapper for hnswlib that loads a CAGRA index as an immutable HNSW index. The loaded HNSW index is only compatible with cuVS and can be searched using the wrapper functions.
+
+## Index search parameters
+
+```{autoclass} cuvs.neighbors.hnsw.SearchParams
+:members:
+```
+
+## Index
+
+```{autoclass} cuvs.neighbors.hnsw.Index
+:members:
+```
+
+## Index Conversion
+
+```{autofunction} cuvs.neighbors.hnsw.from_cagra
+```
+
+## Index search
+
+```{autofunction} cuvs.neighbors.hnsw.search
+```
+
+## Index save
+
+```{autofunction} cuvs.neighbors.hnsw.save
+```
+
+## Index load
+
+```{autofunction} cuvs.neighbors.hnsw.load
+```
+
+## Index extend
+
+```{autofunction} cuvs.neighbors.hnsw.extend
+```
+
diff --git a/docs/source/python_api/neighbors_hnsw.rst b/docs/source/python_api/neighbors_hnsw.rst
deleted file mode 100644
index 40f3a1de7e..0000000000
--- a/docs/source/python_api/neighbors_hnsw.rst
+++ /dev/null
@@ -1,45 +0,0 @@
-HNSW
-====
-
-This is a wrapper for hnswlib, to load a CAGRA index as an immutable HNSW index. The loaded HNSW index is only compatible in cuVS, and can be searched using wrapper functions.
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-Index search parameters
-#######################
-
-.. autoclass:: cuvs.neighbors.hnsw.SearchParams
-    :members:
-
-Index
-#####
-
-.. autoclass:: cuvs.neighbors.hnsw.Index
-    :members:
-
-Index Conversion
-################
-
-.. autofunction:: cuvs.neighbors.hnsw.from_cagra
-
-Index search
-############
-
-.. autofunction:: cuvs.neighbors.hnsw.search
-
-Index save
-##########
-
-.. autofunction:: cuvs.neighbors.hnsw.save
-
-Index load
-##########
-
-.. autofunction:: cuvs.neighbors.hnsw.load
-
-Index extend
-############
-
-.. autofunction:: cuvs.neighbors.hnsw.extend
diff --git a/docs/source/python_api/neighbors_ivf_flat.md b/docs/source/python_api/neighbors_ivf_flat.md
new file mode 100644
index 0000000000..317aff17ed
--- /dev/null
+++ b/docs/source/python_api/neighbors_ivf_flat.md
@@ -0,0 +1,45 @@
+# IVF-Flat
+
+## Index build parameters
+
+```{autoclass} cuvs.neighbors.ivf_flat.IndexParams
+:members:
+```
+
+## Index search parameters
+
+```{autoclass} cuvs.neighbors.ivf_flat.SearchParams
+:members:
+```
+
+## Index
+
+```{autoclass} cuvs.neighbors.ivf_flat.Index
+:members:
+```
+
+## Index build
+
+```{autofunction} cuvs.neighbors.ivf_flat.build
+```
+
+## Index search
+
+```{autofunction} cuvs.neighbors.ivf_flat.search
+```
+
+## Index save
+
+```{autofunction} cuvs.neighbors.ivf_flat.save
+```
+
+## Index load
+
+```{autofunction} cuvs.neighbors.ivf_flat.load
+```
+
+## Index extend
+
+```{autofunction} cuvs.neighbors.ivf_flat.extend
+```
+
diff --git a/docs/source/python_api/neighbors_ivf_flat.rst b/docs/source/python_api/neighbors_ivf_flat.rst
deleted file mode 100644
index d0846b0d67..0000000000
--- a/docs/source/python_api/neighbors_ivf_flat.rst
+++ /dev/null
@@ -1,49 +0,0 @@
-IVF-Flat
-========
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-Index build parameters
-######################
-
-.. autoclass:: cuvs.neighbors.ivf_flat.IndexParams
-    :members:
-
-Index search parameters
-#######################
-
-.. autoclass:: cuvs.neighbors.ivf_flat.SearchParams
-    :members:
-
-Index
-#####
-
-.. autoclass:: cuvs.neighbors.ivf_flat.Index
-    :members:
-
-Index build
-###########
-
-.. autofunction:: cuvs.neighbors.ivf_flat.build
-
-Index search
-############
-
-.. autofunction:: cuvs.neighbors.ivf_flat.search
-
-Index save
-##########
-
-.. autofunction:: cuvs.neighbors.ivf_flat.save
-
-Index load
-##########
-
-.. autofunction:: cuvs.neighbors.ivf_flat.load
-
-Index extend
-############
-
-.. autofunction:: cuvs.neighbors.ivf_flat.extend
diff --git a/docs/source/python_api/neighbors_ivf_pq.md b/docs/source/python_api/neighbors_ivf_pq.md
new file mode 100644
index 0000000000..a3ee7c4ffe
--- /dev/null
+++ b/docs/source/python_api/neighbors_ivf_pq.md
@@ -0,0 +1,45 @@
+# IVF-PQ
+
+## Index build parameters
+
+```{autoclass} cuvs.neighbors.ivf_pq.IndexParams
+:members:
+```
+
+## Index search parameters
+
+```{autoclass} cuvs.neighbors.ivf_pq.SearchParams
+:members:
+```
+
+## Index
+
+```{autoclass} cuvs.neighbors.ivf_pq.Index
+:members:
+```
+
+## Index build
+
+```{autofunction} cuvs.neighbors.ivf_pq.build
+```
+
+## Index search
+
+```{autofunction} cuvs.neighbors.ivf_pq.search
+```
+
+## Index save
+
+```{autofunction} cuvs.neighbors.ivf_pq.save
+```
+
+## Index load
+
+```{autofunction} cuvs.neighbors.ivf_pq.load
+```
+
+## Index extend
+
+```{autofunction} cuvs.neighbors.ivf_pq.extend
+```
+
diff --git a/docs/source/python_api/neighbors_ivf_pq.rst b/docs/source/python_api/neighbors_ivf_pq.rst
deleted file mode 100644
index ec4cfdff6a..0000000000
--- a/docs/source/python_api/neighbors_ivf_pq.rst
+++ /dev/null
@@ -1,49 +0,0 @@
-IVF-PQ
-======
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-Index build parameters
-######################
-
-.. autoclass:: cuvs.neighbors.ivf_pq.IndexParams
-    :members:
-
-Index search parameters
-#######################
-
-.. autoclass:: cuvs.neighbors.ivf_pq.SearchParams
-    :members:
-
-Index
-#####
-
-.. autoclass:: cuvs.neighbors.ivf_pq.Index
-    :members:
-
-Index build
-###########
-
-.. autofunction:: cuvs.neighbors.ivf_pq.build
-
-Index search
-############
-
-.. autofunction:: cuvs.neighbors.ivf_pq.search
-
-Index save
-##########
-
-.. autofunction:: cuvs.neighbors.ivf_pq.save
-
-Index load
-##########
-
-.. autofunction:: cuvs.neighbors.ivf_pq.load
-
-Index extend
-############
-
-.. autofunction:: cuvs.neighbors.ivf_pq.extend
diff --git a/docs/source/python_api/neighbors_mg_cagra.md b/docs/source/python_api/neighbors_mg_cagra.md
new file mode 100644
index 0000000000..cd3f409985
--- /dev/null
+++ b/docs/source/python_api/neighbors_mg_cagra.md
@@ -0,0 +1,52 @@
+# Multi-GPU CAGRA
+
+Multi-GPU CAGRA extends the graph-based CAGRA algorithm to work across multiple GPUs, providing improved scalability and performance for large-scale vector search. It supports both replicated and sharded distribution modes.
+
+```{note}
+**IMPORTANT**: Multi-GPU CAGRA requires all data (datasets, queries, output arrays) to be in host memory (CPU).
+If using CuPy/device arrays, transfer to host with `array.get()` or `cp.asnumpy(array)` before use.
+```
+
+## Index build parameters
+
+```{autoclass} cuvs.neighbors.mg.cagra.IndexParams
+:members:
+```
+
+## Index search parameters
+
+```{autoclass} cuvs.neighbors.mg.cagra.SearchParams
+:members:
+```
+
+## Index
+
+```{autoclass} cuvs.neighbors.mg.cagra.Index
+:members:
+```
+
+## Index build
+
+```{autofunction} cuvs.neighbors.mg.cagra.build
+```
+
+## Index search
+
+```{autofunction} cuvs.neighbors.mg.cagra.search
+```
+
+## Index save
+
+```{autofunction} cuvs.neighbors.mg.cagra.save
+```
+
+## Index load
+
+```{autofunction} cuvs.neighbors.mg.cagra.load
+```
+
+## Index distribute
+
+```{autofunction} cuvs.neighbors.mg.cagra.distribute
+```
+
diff --git a/docs/source/python_api/neighbors_mg_cagra.rst b/docs/source/python_api/neighbors_mg_cagra.rst
deleted file mode 100644
index 763e0e2157..0000000000
--- a/docs/source/python_api/neighbors_mg_cagra.rst
+++ /dev/null
@@ -1,55 +0,0 @@
-Multi-GPU CAGRA
-===============
-
-Multi-GPU CAGRA extends the graph-based CAGRA algorithm to work across multiple GPUs, providing improved scalability and performance for large-scale vector search. It supports both replicated and sharded distribution modes.
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-.. note::
-   **IMPORTANT**: Multi-GPU CAGRA requires all data (datasets, queries, output arrays) to be in host memory (CPU).
-   If using CuPy/device arrays, transfer to host with ``array.get()`` or ``cp.asnumpy(array)`` before use.
-
-Index build parameters
-######################
-
-.. autoclass:: cuvs.neighbors.mg.cagra.IndexParams
-    :members:
-
-Index search parameters
-#######################
-
-.. autoclass:: cuvs.neighbors.mg.cagra.SearchParams
-    :members:
-
-Index
-#####
-
-.. autoclass:: cuvs.neighbors.mg.cagra.Index
-    :members:
-
-Index build
-###########
-
-.. autofunction:: cuvs.neighbors.mg.cagra.build
-
-Index search
-############
-
-.. autofunction:: cuvs.neighbors.mg.cagra.search
-
-Index save
-##########
-
-.. autofunction:: cuvs.neighbors.mg.cagra.save
-
-Index load
-##########
-
-.. autofunction:: cuvs.neighbors.mg.cagra.load
-
-Index distribute
-################
-
-.. autofunction:: cuvs.neighbors.mg.cagra.distribute
diff --git a/docs/source/python_api/neighbors_mg_ivf_flat.md b/docs/source/python_api/neighbors_mg_ivf_flat.md
new file mode 100644
index 0000000000..e1909c1e73
--- /dev/null
+++ b/docs/source/python_api/neighbors_mg_ivf_flat.md
@@ -0,0 +1,57 @@
+# Multi-GPU IVF-Flat
+
+Multi-GPU IVF-Flat extends the IVF-Flat algorithm to work across multiple GPUs, providing improved scalability and performance for large-scale vector search. It supports both replicated and sharded distribution modes.
+
+```{note}
+**IMPORTANT**: Multi-GPU IVF-Flat requires all data (datasets, queries, output arrays) to be in host memory (CPU).
+If using CuPy/device arrays, transfer to host with `array.get()` or `cp.asnumpy(array)` before use.
+```
+
+## Index build parameters
+
+```{autoclass} cuvs.neighbors.mg.ivf_flat.IndexParams
+:members:
+```
+
+## Index search parameters
+
+```{autoclass} cuvs.neighbors.mg.ivf_flat.SearchParams
+:members:
+```
+
+## Index
+
+```{autoclass} cuvs.neighbors.mg.ivf_flat.Index
+:members:
+```
+
+## Index build
+
+```{autofunction} cuvs.neighbors.mg.ivf_flat.build
+```
+
+## Index search
+
+```{autofunction} cuvs.neighbors.mg.ivf_flat.search
+```
+
+## Index extend
+
+```{autofunction} cuvs.neighbors.mg.ivf_flat.extend
+```
+
+## Index save
+
+```{autofunction} cuvs.neighbors.mg.ivf_flat.save
+```
+
+## Index load
+
+```{autofunction} cuvs.neighbors.mg.ivf_flat.load
+```
+
+## Index distribute
+
+```{autofunction} cuvs.neighbors.mg.ivf_flat.distribute
+```
+
diff --git a/docs/source/python_api/neighbors_mg_ivf_flat.rst b/docs/source/python_api/neighbors_mg_ivf_flat.rst
deleted file mode 100644
index 68eea86fec..0000000000
--- a/docs/source/python_api/neighbors_mg_ivf_flat.rst
+++ /dev/null
@@ -1,60 +0,0 @@
-Multi-GPU IVF-Flat
-==================
-
-Multi-GPU IVF-Flat extends the IVF-Flat algorithm to work across multiple GPUs, providing improved scalability and performance for large-scale vector search. It supports both replicated and sharded distribution modes.
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-.. note::
-   **IMPORTANT**: Multi-GPU IVF-Flat requires all data (datasets, queries, output arrays) to be in host memory (CPU).
-   If using CuPy/device arrays, transfer to host with ``array.get()`` or ``cp.asnumpy(array)`` before use.
-
-Index build parameters
-######################
-
-.. autoclass:: cuvs.neighbors.mg.ivf_flat.IndexParams
-    :members:
-
-Index search parameters
-#######################
-
-.. autoclass:: cuvs.neighbors.mg.ivf_flat.SearchParams
-    :members:
-
-Index
-#####
-
-.. autoclass:: cuvs.neighbors.mg.ivf_flat.Index
-    :members:
-
-Index build
-###########
-
-.. autofunction:: cuvs.neighbors.mg.ivf_flat.build
-
-Index search
-############
-
-.. autofunction:: cuvs.neighbors.mg.ivf_flat.search
-
-Index extend
-############
-
-.. autofunction:: cuvs.neighbors.mg.ivf_flat.extend
-
-Index save
-##########
-
-.. autofunction:: cuvs.neighbors.mg.ivf_flat.save
-
-Index load
-##########
-
-.. autofunction:: cuvs.neighbors.mg.ivf_flat.load
-
-Index distribute
-################
-
-.. autofunction:: cuvs.neighbors.mg.ivf_flat.distribute
diff --git a/docs/source/python_api/neighbors_mg_ivf_pq.md b/docs/source/python_api/neighbors_mg_ivf_pq.md
new file mode 100644
index 0000000000..368a77f8f9
--- /dev/null
+++ b/docs/source/python_api/neighbors_mg_ivf_pq.md
@@ -0,0 +1,57 @@
+# Multi-GPU IVF-PQ
+
+Multi-GPU IVF-PQ extends the IVF-PQ (Inverted File with Product Quantization) algorithm to work across multiple GPUs, providing improved scalability and performance for large-scale vector search. It supports both replicated and sharded distribution modes.
+
+```{note}
+**IMPORTANT**: Multi-GPU IVF-PQ requires all data (datasets, queries, output arrays) to be in host memory (CPU).
+If using CuPy/device arrays, transfer to host with `array.get()` or `cp.asnumpy(array)` before use.
+```
+
+## Index build parameters
+
+```{autoclass} cuvs.neighbors.mg.ivf_pq.IndexParams
+:members:
+```
+
+## Index search parameters
+
+```{autoclass} cuvs.neighbors.mg.ivf_pq.SearchParams
+:members:
+```
+
+## Index
+
+```{autoclass} cuvs.neighbors.mg.ivf_pq.Index
+:members:
+```
+
+## Index build
+
+```{autofunction} cuvs.neighbors.mg.ivf_pq.build
+```
+
+## Index search
+
+```{autofunction} cuvs.neighbors.mg.ivf_pq.search
+```
+
+## Index extend
+
+```{autofunction} cuvs.neighbors.mg.ivf_pq.extend
+```
+
+## Index save
+
+```{autofunction} cuvs.neighbors.mg.ivf_pq.save
+```
+
+## Index load
+
+```{autofunction} cuvs.neighbors.mg.ivf_pq.load
+```
+
+## Index distribute
+
+```{autofunction} cuvs.neighbors.mg.ivf_pq.distribute
+```
+
diff --git a/docs/source/python_api/neighbors_mg_ivf_pq.rst b/docs/source/python_api/neighbors_mg_ivf_pq.rst
deleted file mode 100644
index 8343d59753..0000000000
--- a/docs/source/python_api/neighbors_mg_ivf_pq.rst
+++ /dev/null
@@ -1,60 +0,0 @@
-Multi-GPU IVF-PQ
-================
-
-Multi-GPU IVF-PQ extends the IVF-PQ (Inverted File with Product Quantization) algorithm to work across multiple GPUs, providing improved scalability and performance for large-scale vector search. It supports both replicated and sharded distribution modes.
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-.. note::
-   **IMPORTANT**: Multi-GPU IVF-PQ requires all data (datasets, queries, output arrays) to be in host memory (CPU).
-   If using CuPy/device arrays, transfer to host with ``array.get()`` or ``cp.asnumpy(array)`` before use.
-
-Index build parameters
-######################
-
-.. autoclass:: cuvs.neighbors.mg.ivf_pq.IndexParams
-    :members:
-
-Index search parameters
-#######################
-
-.. autoclass:: cuvs.neighbors.mg.ivf_pq.SearchParams
-    :members:
-
-Index
-#####
-
-.. autoclass:: cuvs.neighbors.mg.ivf_pq.Index
-    :members:
-
-Index build
-###########
-
-.. autofunction:: cuvs.neighbors.mg.ivf_pq.build
-
-Index search
-############
-
-.. autofunction:: cuvs.neighbors.mg.ivf_pq.search
-
-Index extend
-############
-
-.. autofunction:: cuvs.neighbors.mg.ivf_pq.extend
-
-Index save
-##########
-
-.. autofunction:: cuvs.neighbors.mg.ivf_pq.save
-
-Index load
-##########
-
-.. autofunction:: cuvs.neighbors.mg.ivf_pq.load
-
-Index distribute
-################
-
-.. autofunction:: cuvs.neighbors.mg.ivf_pq.distribute
diff --git a/docs/source/python_api/neighbors_multi_gpu.rst b/docs/source/python_api/neighbors_multi_gpu.md
similarity index 51%
rename from docs/source/python_api/neighbors_multi_gpu.rst
rename to docs/source/python_api/neighbors_multi_gpu.md
index bb3a5a07ed..04cea5bc7b 100644
--- a/docs/source/python_api/neighbors_multi_gpu.rst
+++ b/docs/source/python_api/neighbors_multi_gpu.md
@@ -1,14 +1,8 @@
-Multi-GPU Nearest Neighbors
-===========================
+# Multi-GPU Nearest Neighbors
 
 Multi-GPU support in cuVS enables scaling ANN (Approximate Nearest Neighbors) algorithms across multiple GPUs on a single node, providing improved performance and the ability to handle larger datasets.
 
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-Overview
---------
+## Overview
 
 The multi-GPU implementations extend the single-GPU algorithms to work across multiple GPUs using two main distribution strategies:
 
@@ -16,25 +10,24 @@ The multi-GPU implementations extend the single-GPU algorithms to work across mu
 
 - **Sharded Mode**: The index is partitioned (sharded) across GPUs. This mode allows handling larger datasets that don't fit on a single GPU by distributing the data across multiple GPUs.
 
-Important Notes
----------------
+## Important Notes
 
-.. warning::
-   **Memory Requirements**: Multi-GPU algorithms require all data to be in host memory (CPU). This is different from single-GPU algorithms that typically work with device memory.
+```{warning}
+**Memory Requirements**: Multi-GPU algorithms require all data to be in host memory (CPU). This is different from single-GPU algorithms that typically work with device memory.
+```
 
-.. note::
-   **Supported Algorithms**: Currently, multi-GPU support is available for:
+```{note}
+**Supported Algorithms**: Currently, multi-GPU support is available for:
 
-   - CAGRA (Graph-based ANN)
-   - IVF-Flat (Inverted File with Flat storage)
-   - IVF-PQ (Inverted File with Product Quantization)
-   - All-neighbors (multi-GPU is built into its unified API via ``MultiGpuResources``)
+- CAGRA (Graph-based ANN)
+- IVF-Flat (Inverted File with Flat storage)
+- IVF-PQ (Inverted File with Product Quantization)
+- All-neighbors (multi-GPU is built into its unified API via `MultiGpuResources`)
+```
 
-Configuration Options
----------------------
+## Configuration Options
 
-Distribution Modes
-^^^^^^^^^^^^^^^^^^
+### Distribution Modes
 
 - **Replicated Mode**
 
@@ -52,8 +45,7 @@ Distribution Modes
   - Requires coordination between GPUs during search operations
   - Is ideal for scenarios where the dataset is too large for a single GPU
 
-Search Modes
-^^^^^^^^^^^^
+### Search Modes
 
 - **Load Balancer**
 
@@ -63,8 +55,7 @@ Search Modes
 
   Distributes queries evenly across GPUs in a rotating sequence, ensuring balanced workload allocation. This mode is best suited for frequent, small-scale search operations.
 
-Merge Modes
-^^^^^^^^^^^
+### Merge Modes
 
 - **Merge on Root Rank**
 
@@ -74,45 +65,44 @@ Merge Modes
 
   Results are merged in a tree-like fashion across GPUs to reduce communication overhead.
 
-Usage Examples
---------------
+## Usage Examples
 
-Basic Multi-GPU Usage
-^^^^^^^^^^^^^^^^^^^^^^
+### Basic Multi-GPU Usage
 
-.. code-block:: python
+```python
+import numpy as np
+from cuvs.neighbors.mg import cagra as mg_cagra  # module path as documented: cuvs.neighbors.mg.cagra
 
-   import numpy as np
-   from cuvs.neighbors import mg_cagra
+# Create dataset in host memory
+n_samples = 100000
+n_features = 128
+dataset = np.random.random_sample((n_samples, n_features)).astype(np.float32)
 
-   # Create dataset in host memory
-   n_samples = 100000
-   n_features = 128
-   dataset = np.random.random_sample((n_samples, n_features), dtype=np.float32)
+# Build multi-GPU index
+build_params = mg_cagra.IndexParams(
+    distribution_mode="sharded",
+    metric="sqeuclidean"
+)
+index = mg_cagra.build(build_params, dataset)
 
-   # Build multi-GPU index
-   build_params = mg_cagra.IndexParams(
-       distribution_mode="sharded",
-       metric="sqeuclidean"
-   )
-   index = mg_cagra.build(build_params, dataset)
+# Search with multi-GPU
+queries = np.random.random_sample((1000, n_features)).astype(np.float32)
+search_params = mg_cagra.SearchParams(
+    search_mode="load_balancer",
+    merge_mode="merge_on_root_rank"
+)
+distances, neighbors = mg_cagra.search(search_params, index, queries, k=10)
+```
 
-   # Search with multi-GPU
-   queries = np.random.random_sample((1000, n_features), dtype=np.float32)
-   search_params = mg_cagra.SearchParams(
-       search_mode="load_balancer",
-       merge_mode="merge_on_root_rank"
-   )
-   distances, neighbors = mg_cagra.search(search_params, index, queries, k=10)
+## Algorithm-Specific Documentation
 
-Algorithm-Specific Documentation
---------------------------------
+```{toctree}
+:maxdepth: 2
+:caption: Multi-GPU Algorithms:
 
-.. toctree::
-   :maxdepth: 2
-   :caption: Multi-GPU Algorithms:
+neighbors_all_neighbors.md
+neighbors_mg_cagra.md
+neighbors_mg_ivf_flat.md
+neighbors_mg_ivf_pq.md
+```
 
-   neighbors_all_neighbors.rst
-   neighbors_mg_cagra.rst
-   neighbors_mg_ivf_flat.rst
-   neighbors_mg_ivf_pq.rst
diff --git a/docs/source/python_api/neighbors_nn_decent.md b/docs/source/python_api/neighbors_nn_decent.md
new file mode 100644
index 0000000000..8aa09b7242
--- /dev/null
+++ b/docs/source/python_api/neighbors_nn_decent.md
@@ -0,0 +1,19 @@
+# NN-Descent
+
+## Index build parameters
+
+```{autoclass} cuvs.neighbors.nn_descent.IndexParams
+:members:
+```
+
+## Index
+
+```{autoclass} cuvs.neighbors.nn_descent.Index
+:members:
+```
+
+## Index build
+
+```{autofunction} cuvs.neighbors.nn_descent.build
+```
+
diff --git a/docs/source/python_api/neighbors_nn_decent.rst b/docs/source/python_api/neighbors_nn_decent.rst
deleted file mode 100644
index 01e9e196c9..0000000000
--- a/docs/source/python_api/neighbors_nn_decent.rst
+++ /dev/null
@@ -1,24 +0,0 @@
-NN-Descent
-==========
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-Index build parameters
-######################
-
-.. autoclass:: cuvs.neighbors.nn_descent.IndexParams
-    :members:
-
-
-Index
-#####
-
-.. autoclass:: cuvs.neighbors.nn_descent.Index
-    :members:
-
-Index build
-###########
-
-.. autofunction:: cuvs.neighbors.nn_descent.build
diff --git a/docs/source/python_api/preprocessing.md b/docs/source/python_api/preprocessing.md
new file mode 100644
index 0000000000..323752ae79
--- /dev/null
+++ b/docs/source/python_api/preprocessing.md
@@ -0,0 +1,63 @@
+# Preprocessing
+
+## PCA (Principal Component Analysis)
+
+```{autoclass} cuvs.preprocessing.pca.Params
+:members:
+```
+
+```{autofunction} cuvs.preprocessing.pca.fit
+```
+
+```{autofunction} cuvs.preprocessing.pca.fit_transform
+```
+
+```{autofunction} cuvs.preprocessing.pca.transform
+```
+
+```{autofunction} cuvs.preprocessing.pca.inverse_transform
+```
+
+## Binary Quantizer
+
+```{autofunction} cuvs.preprocessing.quantize.binary.transform
+```
+
+## Product Quantizer
+
+```{autoclass} cuvs.preprocessing.quantize.pq.Quantizer
+:members:
+```
+
+```{autoclass} cuvs.preprocessing.quantize.pq.QuantizerParams
+:members:
+```
+
+```{autofunction} cuvs.preprocessing.quantize.pq.build
+```
+
+```{autofunction} cuvs.preprocessing.quantize.pq.transform
+```
+
+```{autofunction} cuvs.preprocessing.quantize.pq.inverse_transform
+```
+
+## Scalar Quantizer
+
+```{autoclass} cuvs.preprocessing.quantize.scalar.Quantizer
+:members:
+```
+
+```{autoclass} cuvs.preprocessing.quantize.scalar.QuantizerParams
+:members:
+```
+
+```{autofunction} cuvs.preprocessing.quantize.scalar.train
+```
+
+```{autofunction} cuvs.preprocessing.quantize.scalar.transform
+```
+
+```{autofunction} cuvs.preprocessing.quantize.scalar.inverse_transform
+```
+
diff --git a/docs/source/python_api/preprocessing.rst b/docs/source/python_api/preprocessing.rst
deleted file mode 100644
index bbf1337710..0000000000
--- a/docs/source/python_api/preprocessing.rst
+++ /dev/null
@@ -1,55 +0,0 @@
-Preprocessing
-=============
-
-.. role:: py(code)
-   :language: python
-   :class: highlight
-
-PCA (Principal Component Analysis)
-###################################
-
-.. autoclass:: cuvs.preprocessing.pca.Params
-    :members:
-
-.. autofunction:: cuvs.preprocessing.pca.fit
-
-.. autofunction:: cuvs.preprocessing.pca.fit_transform
-
-.. autofunction:: cuvs.preprocessing.pca.transform
-
-.. autofunction:: cuvs.preprocessing.pca.inverse_transform
-
-Binary Quantizer
-################
-
-.. autofunction:: cuvs.preprocessing.quantize.binary.transform
-
-Product Quantizer
-#################
-
-.. autoclass:: cuvs.preprocessing.quantize.pq.Quantizer
-    :members:
-
-.. autoclass:: cuvs.preprocessing.quantize.pq.QuantizerParams
-    :members:
-
-.. autofunction:: cuvs.preprocessing.quantize.pq.build
-
-.. autofunction:: cuvs.preprocessing.quantize.pq.transform
-
-.. autofunction:: cuvs.preprocessing.quantize.pq.inverse_transform
-
-Scalar Quantizer
-################
-
-.. autoclass:: cuvs.preprocessing.quantize.scalar.Quantizer
-    :members:
-
-.. autoclass:: cuvs.preprocessing.quantize.scalar.QuantizerParams
-    :members:
-
-.. autofunction:: cuvs.preprocessing.quantize.scalar.train
-
-.. autofunction:: cuvs.preprocessing.quantize.scalar.transform
-
-.. autofunction:: cuvs.preprocessing.quantize.scalar.inverse_transform
diff --git a/docs/source/rust_api/index.md b/docs/source/rust_api/index.md
new file mode 100644
index 0000000000..7f728e5d40
--- /dev/null
+++ b/docs/source/rust_api/index.md
@@ -0,0 +1,14 @@
+# Rust API Documentation
+
+```{raw} html
+<iframe src="../_static/rust/cuvs/index.html" height="720px" width="100%"></iframe>
+
+<!-- hide the 'view source' section here, since it doesn't work with the iframe
+and we want the iframe to use the space -->
+<style type="text/css">
+    .bd-sidebar-secondary {
+        display: none;
+    }
+</style>
+```
+
diff --git a/docs/source/rust_api/index.rst b/docs/source/rust_api/index.rst
deleted file mode 100644
index f79d04fdf8..0000000000
--- a/docs/source/rust_api/index.rst
+++ /dev/null
@@ -1,15 +0,0 @@
-~~~~~~~~~~~~~~~~~~~~~~
-Rust API Documentation
-~~~~~~~~~~~~~~~~~~~~~~
-
-.. raw:: html
-
-    <iframe src="../_static/rust/cuvs/index.html" height="720px" width="100%"></iframe>
-
-    <!-- hide the 'view source' section here, since it doesn't work with the iframe
-    and we want the iframe to use the space -->
-    <style type="text/css">
-        .bd-sidebar-secondary {
-            display: none;
-        }
-    </style>
diff --git a/docs/source/tuning_guide.rst b/docs/source/tuning_guide.md
similarity index 78%
rename from docs/source/tuning_guide.rst
rename to docs/source/tuning_guide.md
index fd54fc42ae..d9cfb4f187 100644
--- a/docs/source/tuning_guide.rst
+++ b/docs/source/tuning_guide.md
@@ -1,23 +1,18 @@
-~~~~~~~~~~~~~~~~~~~~~~
-Automated tuning Guide
-~~~~~~~~~~~~~~~~~~~~~~
+# Automated Tuning Guide
 
-Introduction
-============
+## Introduction
 
-A Method for tuning and evaluating Vector Search Indexes At Scale in Locally Indexed Vector Databases. For more information on the differences between locally and globally indexed vector databases, please see :doc:`this guide <vector_databases_vs_vector_search>`. The goal of this guide is to give users a scalable and effective approach for tuning a vector search index, no matter how large.  Evaluation of a vector search index “model” that measures recall in proportion to build time so that it penalizes the recall when the build time is really high (should ultimately optimize for finding a lower build time and higher recall).
+A Method for tuning and evaluating Vector Search Indexes At Scale in Locally Indexed Vector Databases. For more information on the differences between locally and globally indexed vector databases, please see {doc}`this guide <vector_databases_vs_vector_search>`. The goal of this guide is to give users a scalable and effective approach for tuning a vector search index, no matter how large.  Evaluation of a vector search index “model” that measures recall in proportion to build time so that it penalizes the recall when the build time is really high (should ultimately optimize for finding a lower build time and higher recall).
 
-For more information on the various different types of vector search indexes, please see our :doc:`guide to choosing vector search indexes <choosing_and_configuring_indexes>`
+For more information on the various different types of vector search indexes, please see our {doc}`guide to choosing vector search indexes <choosing_and_configuring_indexes>`
 
-Why automated tuning?
-=====================
+## Why automated tuning?
 
-As much as 75% of users have told us they will not be able to tune a vector database beyond one or two simple knobs and we suggest that an ideal “knob” would be to balance training time and search time with search quality. The more time, the higher the quality, and the more needed to find an acceptable search performance. Even the 25% of users that want to tune are still asking for simple tools for doing so. These users also ask for some simple guidelines for setting tuning parameters, like :doc:`this guide <neighbors/neighbors>`.
+As much as 75% of users have told us they will not be able to tune a vector database beyond one or two simple knobs and we suggest that an ideal “knob” would be to balance training time and search time with search quality. The more time, the higher the quality, and the more needed to find an acceptable search performance. Even the 25% of users that want to tune are still asking for simple tools for doing so. These users also ask for some simple guidelines for setting tuning parameters, like {doc}`this guide <neighbors/neighbors>`.
 
-Since vector search indexes are more closely related to machine learning models than traditional databases indexes, one option for easing the parameter tuning burden is to use hyper-parameter optimization tools like `Ray Tune <https://medium.com/rapids-ai/30x-faster-hyperparameter-search-with-raytune-and-rapids-403013fbefc5>`_ and `Optuna <https://docs.rapids.ai/deployment/stable/examples/rapids-optuna-hpo/notebook/>`_. to verify this.
+Since vector search indexes are more closely related to machine learning models than traditional databases indexes, one option for easing the parameter tuning burden is to use hyper-parameter optimization tools like [Ray Tune](https://medium.com/rapids-ai/30x-faster-hyperparameter-search-with-raytune-and-rapids-403013fbefc5) and [Optuna](https://docs.rapids.ai/deployment/stable/examples/rapids-optuna-hpo/notebook/). to verify this.
 
-How to tune?
-============
+## How to tune?
 
 But how would this work when we have an index that's massively large- like 1TB?
 
@@ -27,30 +22,28 @@ Because many databases use this sub-sampling trick, it's possible to perform an
 
 GPUs are naturally great at performing massively parallel tasks, especially when they are largely independent tasks, such as training and evaluating models with different hyper-parameter settings in parallel. Hyper-parameter optimization also lends itself well to distributed processing, such as multi-node multi-GPU operation.
 
-Steps to achieve automated tuning
-=================================
+## Steps to achieve automated tuning
 
 More formally, an automated parameter tuning workflow with monte-carlo cross-validation looks something like this:
 
-#. Ingest a large dataset into the vector database of your choice
+1. Ingest a large dataset into the vector database of your choice
 
-#. Choose an index size based on number of vectors. This should usually align with the average number of vectors the database will end up putting in a single ANN sub-index model.
+1. Choose an index size based on number of vectors. This should usually align with the average number of vectors the database will end up putting in a single ANN sub-index model.
 
-#. Uniformly random sample the number of vectors specified above from the database for a training set. This is often accomplished by generating some number of random (unique) numbers up to the dataset size.
+1. Uniformly random sample the number of vectors specified above from the database for a training set. This is often accomplished by generating some number of random (unique) numbers up to the dataset size.
 
-#. Uniformly sample some number of vectors for a test set and do this again for an evaluation set. 1-10% of the vectors in the training set.
+1. Uniformly sample some number of vectors for a test set and do this again for an evaluation set. A typical size for each is 1-10% of the number of vectors in the training set.
 
-#. Use the test set to compute ground truth on the vectors from prior step against all vectors in the training set.
+1. Use the test set to compute ground truth on the vectors from prior step against all vectors in the training set.
 
-#. Start the HPO tuning process for the training set, using the test vectors for the query set. It's important to make sure your HPO is multi-objective and optimizes for: a) low build time, b) high throughput or low latency search (depending on needs), and c) acceptable recall.
+1. Start the HPO tuning process for the training set, using the test vectors for the query set. It's important to make sure your HPO is multi-objective and optimizes for: a) low build time, b) high throughput or low latency search (depending on needs), and c) acceptable recall.
 
-#. Use the evaluation dataset to test that the optimal hyper-parameters generalize to unseen points that were not used in the optimization process.
+1. Use the evaluation dataset to test that the optimal hyper-parameters generalize to unseen points that were not used in the optimization process.
 
-#. Optionally, the above steps multiple times on different uniform sub-samplings. Optimal parameters can then be combined over the multiple monte-carlo optimization iterations. For example, many hyper-parameters can simply be averaged but care might need to be taken for other parameters.
+1. Optionally, repeat the above steps multiple times on different uniform sub-samplings. Optimal parameters can then be combined over the multiple monte-carlo optimization iterations. For example, many hyper-parameters can simply be averaged but care might need to be taken for other parameters.
 
-#. Create a new index in the database using the ideal params from above that meet the target constraints (e.g. build vs search vs quality)
+1. Create a new index in the database using the ideal params from above that meet the target constraints (e.g. build vs search vs quality)
 
-Conclusion
-==========
+## Conclusion
 
 By the end of this process, you should have a set of parameters that meet your target constraints while demonstrating how well the optimal hyper-parameters generalize across the dataset. The major benefit to this approach is that it breaks a potentially unbounded dataset size down into manageable chunks and accelerates tuning on those chunks. We see this process as a major value add for vector search on the GPU.
diff --git a/docs/source/vector_databases_vs_vector_search.rst b/docs/source/vector_databases_vs_vector_search.md
similarity index 91%
rename from docs/source/vector_databases_vs_vector_search.rst
rename to docs/source/vector_databases_vs_vector_search.md
index 5c43ee5508..d3c1f76e3f 100644
--- a/docs/source/vector_databases_vs_vector_search.rst
+++ b/docs/source/vector_databases_vs_vector_search.md
@@ -1,13 +1,10 @@
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Vector search indexes vs vector databases
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# Vector search indexes vs vector databases
 
-This guide provides information on the differences between vector search indexes and fully-fledged vector databases. For more information on selecting and configuring vector search indexes, please refer to our :doc:`guide on choosing and configuring indexes <choosing_and_configuring_indexes>`
+This guide provides information on the differences between vector search indexes and fully-fledged vector databases. For more information on selecting and configuring vector search indexes, please refer to our {doc}`guide on choosing and configuring indexes <choosing_and_configuring_indexes>`
 
 One of the primary differences between vector database indexes and traditional database indexes is that vector search often uses approximations to trade-off accuracy of the results for speed. Because of this, while many mature databases offer mechanisms to tune their indexes and achieve better performance, vector database indexes can return completely garbage results if they aren’t tuned for a reasonable level of search quality in addition to performance tuning. This is because vector database indexes are more closely related to machine learning models than they are to traditional database indexes.
 
-What are the differences between vector databases and vector search indexes?
-============================================================================
+## What are the differences between vector databases and vector search indexes?
 
 Vector search in and of itself refers to the objective of finding the closest vectors in an index around a given set of query vectors. At the lowest level, vector search indexes are just machine learning models, which have a build, search, and recall performance that can be traded off, depending on the algorithm and various hyper-parameters.
 
@@ -19,8 +16,7 @@ So what does all this mean to you? Sometimes a simple standalone vector search i
 
 FAISS and cuVS are examples of standalone vector search libraries, which again are more closely related to machine learning libraries than to fully-fledged databases. Milvus is an example of a special-purpose vector database and Elastic, MongoDB, and OpenSearch are examples of general-purpose databases that have added vector search capabilities.
 
-How is vector search used by vector databases?
-==============================================
+## How is vector search used by vector databases?
 
 Within the context of vector databases, there are two primary ways in which vector search indexes are used and it’s important to understand which you are working with because it can have an effect on the behavior of the parameters with respect to the data.
 
@@ -28,16 +24,14 @@ Many vector search algorithms improve scalability while reducing the number of d
 
 This leads us to two core architectural designs that we encounter in vector databases:
 
-Locally partitioned vector search indexes
------------------------------------------
+### Locally partitioned vector search indexes
 
 Most databases follow this design, and vectors are often first written to a write-ahead log for durability. After some number of vectors are written, the write-ahead logs become immutable and may be merged with other write-ahead logs before eventually being converted to a new vector search index.
 
 The search is generally done over each locally partitioned index and the results combined. When setting hyperparameters, only the local vector search indexes need to be considered, though the same hyperparameters are going to be used across all of the local partitions. So, for example, if you’ve ingested 100M vectors but each partition only contains about 10M vectors, the size of the index only needs to consider its local 10M vectors. Details like number of vectors in the index are important, for example, when setting the number of clusters in an IVF-based (inverted file index) method, as I’ll cover below.
 
 
-Globally partitioned vector search indexes
-------------------------------------------
+### Globally partitioned vector search indexes
 
 Some special-purpose vector databases follow this design, such as Yahoo’s Vespa and Google’s Spanner. A global index is trained to partition the entire database’s vectors up front as soon as there are enough vectors to do so (usually these databases are at a large enough scale that a significant number of vectors are bootstrapped initially and so it avoids the cold start problem). Ingested vectors are first run through the global index (clustering, for example, but tree- and graph-based methods have also been used) to determine which partition they belong to and the vectors are then (sent to, and) written  directly to that partition. The individual partitions can contain a graph, tree, or a simple IVF list. These types of indexes have been able to scale to hundreds of billions to trillions of vectors, and since the partitions are themselves often implicitly based on neighborhoods, rather than being based on uniformly random distributed vectors like the locally partitioned architectures, the partitions can be grouped together or intentionally separated to support localized searches or load balancing, depending upon the needs of the system.
 
@@ -47,11 +41,10 @@ Of course, the two approaches outlined above can also be used together (e.g. tra
 
 A challenge with GPUs in vector databases today is that the resulting vector indexes are expected to fit into the memory of available GPUs for fast search. That is to say, there doesn’t exist today an efficient mechanism for offloading or swapping GPU indexes so they can be cached from disk or host memory, for example. We are working on mechanisms to do this, and to also utilize technologies like GPUDirect Storage and GPUDirect RDMA to improve the IO performance further.
 
-Tuning and hyperparameter optimization
-======================================
+## Tuning and hyperparameter optimization
 
 Unfortunately, for large datasets, doing a hyper-parameter optimization on the whole dataset is not always feasible and this is actually where the locally partitioned vector search indexes have an advantage because you can think of each smaller segment of the larger index as a uniform random sample of the total vectors in the dataset. This means that it is possible to perform a hyperparameter optimization on the smaller subsets and find reasonably acceptable parameters that should generalize fairly well to the entire dataset. Generally this hyperparameter optimization will require computing a ground truth on the subset with an exact method like brute-force and then using it to evaluate several searches on randomly sampled vectors.
 
 Full hyper-parameter optimization may also not always be necessary- for example, once you have built a ground truth dataset on a subset, many times you can start by building an index with the default build parameters and then playing around with different search parameters until you get the desired quality and search performance.  For massive indexes that might be multiple terabytes, you could also take this subsampling of, say, 10M vectors, train an index and then tune the search parameters from there. While there might be a small margin of error, the chosen build/search parameters should generalize fairly well for the databases that build locally partitioned indexes.
 
-Refer to our :doc:`tuning guide <tuning_guide>` for more information and examples on how to efficiently and automatically tune your vector search indexes based on your needs.
+Refer to our {doc}`tuning guide <tuning_guide>` for more information and examples on how to efficiently and automatically tune your vector search indexes based on your needs.
diff --git a/docs/source/working_with_ann_indexes.md b/docs/source/working_with_ann_indexes.md
new file mode 100644
index 0000000000..d5ee2568ad
--- /dev/null
+++ b/docs/source/working_with_ann_indexes.md
@@ -0,0 +1,12 @@
+# Working with ANN Indexes
+
+```{toctree}
+:maxdepth: 1
+:caption: Contents:
+
+working_with_ann_indexes_c.md
+working_with_ann_indexes_cpp.md
+working_with_ann_indexes_python.md
+working_with_ann_indexes_rust.md
+```
+
diff --git a/docs/source/working_with_ann_indexes.rst b/docs/source/working_with_ann_indexes.rst
deleted file mode 100644
index 8e91fb4acd..0000000000
--- a/docs/source/working_with_ann_indexes.rst
+++ /dev/null
@@ -1,11 +0,0 @@
-Working with ANN Indexes
-========================
-
-.. toctree::
-   :maxdepth: 1
-   :caption: Contents:
-
-   working_with_ann_indexes_c.rst
-   working_with_ann_indexes_cpp.rst
-   working_with_ann_indexes_python.rst
-   working_with_ann_indexes_rust.rst
diff --git a/docs/source/working_with_ann_indexes_c.md b/docs/source/working_with_ann_indexes_c.md
new file mode 100644
index 0000000000..a65c414234
--- /dev/null
+++ b/docs/source/working_with_ann_indexes_c.md
@@ -0,0 +1,59 @@
+# Working with ANN Indexes in C
+
+- [Building an index](#building-an-index)
+- [Searching an index](#searching-an-index)
+
+## Building an index
+
+```c
+#include <cuvs/neighbors/cagra.h>
+
+cuvsResources_t res;
+cuvsCagraIndexParams_t index_params;
+cuvsCagraIndex_t index;
+
+DLManagedTensor *dataset;
+
+// populate tensor with data
+load_dataset(dataset);
+
+cuvsResourcesCreate(&res);
+cuvsCagraIndexParamsCreate(&index_params);
+cuvsCagraIndexCreate(&index);
+
+cuvsCagraBuild(res, index_params, dataset, index);
+
+cuvsCagraIndexDestroy(index);
+cuvsCagraIndexParamsDestroy(index_params);
+cuvsResourcesDestroy(res);
+```
+
+## Searching an index
+
+```c
+#include <cuvs/neighbors/cagra.h>
+
+cuvsResources_t res;
+cuvsCagraSearchParams_t search_params;
+cuvsCagraIndex_t index;
+
+// ... build index ...
+
+DLManagedTensor *queries;
+
+DLManagedTensor *neighbors;
+DLManagedTensor *distances;
+
+// populate tensor with data
+load_queries(queries);
+
+cuvsResourcesCreate(&res);
+cuvsCagraSearchParamsCreate(&search_params);
+
+cuvsCagraSearch(res, search_params, index, queries, neighbors, distances);
+
+cuvsCagraIndexDestroy(index);
+cuvsCagraSearchParamsDestroy(search_params);
+cuvsResourcesDestroy(res);
+```
+
diff --git a/docs/source/working_with_ann_indexes_c.rst b/docs/source/working_with_ann_indexes_c.rst
deleted file mode 100644
index 1e84141a86..0000000000
--- a/docs/source/working_with_ann_indexes_c.rst
+++ /dev/null
@@ -1,62 +0,0 @@
-Working with ANN Indexes in C
-=============================
-
-- `Building an index`_
-- `Searching an index`_
-
-Building an index
------------------
-
-.. code-block:: c
-
-    #include <cuvs/neighbors/cagra.h>
-
-    cuvsResources_t res;
-    cuvsCagraIndexParams_t index_params;
-    cuvsCagraIndex_t index;
-
-    DLManagedTensor *dataset;
-
-    // populate tensor with data
-    load_dataset(dataset);
-
-    cuvsResourcesCreate(&res);
-    cuvsCagraIndexParamsCreate(&index_params);
-    cuvsCagraIndexCreate(&index);
-
-    cuvsCagraBuild(res, index_params, dataset, index);
-
-    cuvsCagraIndexDestroy(index);
-    cuvsCagraIndexParamsDestroy(index_params);
-    cuvsResourcesDestroy(res);
-
-
-Searching an index
-------------------
-
-.. code-block:: c
-
-    #include <cuvs/neighbors/cagra.h>
-
-    cuvsResources_t res;
-    cuvsCagraSearchParams_t search_params;
-    cuvsCagraIndex_t index;
-
-    // ... build index ...
-
-    DLManagedTensor *queries;
-
-    DLManagedTensor *neighbors;
-    DLManagedTensor *distances;
-
-    // populate tensor with data
-    load_queries(queries);
-
-    cuvsResourcesCreate(&res);
-    cuvsCagraSearchParamsCreate(&index_params);
-
-    cuvsCagraSearch(res, search_params, index, queries, neighbors, distances);
-
-    cuvsCagraIndexDestroy(index);
-    cuvsCagraIndexParamsDestroy(index_params);
-    cuvsResourcesDestroy(res);
diff --git a/docs/source/working_with_ann_indexes_cpp.md b/docs/source/working_with_ann_indexes_cpp.md
new file mode 100644
index 0000000000..6bf9f381a7
--- /dev/null
+++ b/docs/source/working_with_ann_indexes_cpp.md
@@ -0,0 +1,40 @@
+# Working with ANN Indexes in C++
+
+- [Building an index](#building-an-index)
+- [Searching an index](#searching-an-index)
+
+## Building an index
+
+```c++
+#include <cuvs/neighbors/cagra.hpp>
+
+using namespace cuvs::neighbors;
+
+raft::device_matrix_view<float> dataset = load_dataset();
+raft::device_resources res;
+
+cagra::index_params index_params;
+
+auto index = cagra::build(res, index_params, dataset);
+```
+
+## Searching an index
+
+```c++
+#include <cuvs/neighbors/cagra.hpp>
+
+using namespace cuvs::neighbors;
+cagra::index index;
+
+// ... build index ...
+
+raft::device_matrix_view<float> queries = load_queries();
+raft::device_matrix_view<uint32_t> neighbors = make_device_matrix_view<uint32_t>(n_queries, k);
+raft::device_matrix_view<float> distances = make_device_matrix_view<float>(n_queries, k);
+raft::device_resources res;
+
+cagra::search_params search_params;
+
+cagra::search(res, search_params, index, queries, neighbors, distances);
+```
+
diff --git a/docs/source/working_with_ann_indexes_cpp.rst b/docs/source/working_with_ann_indexes_cpp.rst
deleted file mode 100644
index 68578bf848..0000000000
--- a/docs/source/working_with_ann_indexes_cpp.rst
+++ /dev/null
@@ -1,43 +0,0 @@
-Working with ANN Indexes in C++
-===============================
-
-- `Building an index`_
-- `Searching an index`_
-
-Building an index
------------------
-
-.. code-block:: c++
-
-    #include <cuvs/neighbors/cagra.hpp>
-
-    using namespace cuvs::neighbors;
-
-    raft::device_matrix_view<float> dataset = load_dataset();
-    raft::device_resources res;
-
-    cagra::index_params index_params;
-
-    auto index = cagra::build(res, index_params, dataset);
-
-
-Searching an index
-------------------
-
-.. code-block:: c++
-
-    #include <cuvs/neighbors/cagra.hpp>
-
-    using namespace cuvs::neighbors;
-    cagra::index index;
-
-    // ... build index ...
-
-    raft::device_matrix_view<float> queries = load_queries();
-    raft::device_matrix_view<uint32_t> neighbors = make_device_matrix_view<uint32_t>(n_queries, k);
-    raft::device_matrix_view<float> distances = make_device_matrix_view<float>(n_queries, k);
-    raft::device_resources res;
-
-    cagra::search_params search_params;
-
-    cagra::search(res, search_params, index, queries, neighbors, distances);
diff --git a/docs/source/working_with_ann_indexes_python.md b/docs/source/working_with_ann_indexes_python.md
new file mode 100644
index 0000000000..8b7b143f1e
--- /dev/null
+++ b/docs/source/working_with_ann_indexes_python.md
@@ -0,0 +1,30 @@
+# Working with ANN Indexes in Python
+
+- [Building an index](#building-an-index)
+- [Searching an index](#searching-an-index)
+
+## Building an index
+
+```python
+from cuvs.neighbors import cagra
+
+dataset = load_data()
+index_params = cagra.IndexParams()
+
+index = cagra.build(index_params, dataset)
+```
+
+## Searching an index
+
+```python
+from cuvs.neighbors import cagra
+
+queries = load_queries()
+
+search_params = cagra.SearchParams()
+
+index = ...  # build index as shown in "Building an index"
+
+neighbors, distances = cagra.search(search_params, index, queries, k)
+```
+
diff --git a/docs/source/working_with_ann_indexes_python.rst b/docs/source/working_with_ann_indexes_python.rst
deleted file mode 100644
index 0419c47beb..0000000000
--- a/docs/source/working_with_ann_indexes_python.rst
+++ /dev/null
@@ -1,33 +0,0 @@
-Working with ANN Indexes in Python
-==================================
-
-- `Building an index`_
-- `Searching an index`_
-
-Building an index
------------------
-
-.. code-block:: python
-
-    from cuvs.neighbors import cagra
-
-    dataset = load_data()
-    index_params = cagra.IndexParams()
-
-    index = cagra.build(build_params, dataset)
-
-
-Searching an index
-------------------
-
-.. code-block:: python
-
-    from cuvs.neighbors import cagra
-
-    queries = load_queries()
-
-    search_params = cagra.SearchParams()
-
-    index = // ... build index ...
-
-    neighbors, distances = cagra.search(search_params, index, queries, k)
diff --git a/docs/source/working_with_ann_indexes_rust.md b/docs/source/working_with_ann_indexes_rust.md
new file mode 100644
index 0000000000..c102e8e2d5
--- /dev/null
+++ b/docs/source/working_with_ann_indexes_rust.md
@@ -0,0 +1,61 @@
+# Working with ANN Indexes in Rust
+
+- [Building and Searching an index](#building-and-searching-an-index)
+
+## Building and Searching an index
+
+```rust
+use cuvs::cagra::{Index, IndexParams, SearchParams};
+use cuvs::{ManagedTensor, Resources, Result};
+
+use ndarray_rand::rand_distr::Uniform;
+use ndarray_rand::RandomExt;
+
+/// Example showing how to index and search data with CAGRA
+fn cagra_example() -> Result<()> {
+    let res = Resources::new()?;
+
+    // Create a new random dataset to index
+    let n_datapoints = 65536;
+    let n_features = 512;
+    let dataset =
+        ndarray::Array::<f32, _>::random((n_datapoints, n_features), Uniform::new(0., 1.0));
+
+    // build the cagra index
+    let build_params = IndexParams::new()?;
+    let index = Index::build(&res, &build_params, &dataset)?;
+
+    // use the first 4 points from the dataset as queries : will test that we get them back
+    // as their own nearest neighbor
+    let n_queries = 4;
+    let queries = dataset.slice(ndarray::s![0..n_queries, ..]);
+
+    let k = 10;
+
+    // CAGRA search API requires queries and outputs to be on device memory
+    // copy query data over, and allocate new device memory for the distances/ neighbors
+    // outputs
+    let queries = ManagedTensor::from(&queries).to_device(&res)?;
+    let mut neighbors_host = ndarray::Array::<u32, _>::zeros((n_queries, k));
+    let neighbors = ManagedTensor::from(&neighbors_host).to_device(&res)?;
+
+    let mut distances_host = ndarray::Array::<f32, _>::zeros((n_queries, k));
+    let distances = ManagedTensor::from(&distances_host).to_device(&res)?;
+
+    let search_params = SearchParams::new()?;
+
+    index.search(&res, &search_params, &queries, &neighbors, &distances)?;
+
+    // Copy back to host memory
+    distances.to_host(&res, &mut distances_host)?;
+    neighbors.to_host(&res, &mut neighbors_host)?;
+
+    // nearest neighbors should be themselves, since queries are from the
+    // dataset
+    println!("Neighbors {:?}", neighbors_host);
+    println!("Distances {:?}", distances_host);
+
+    Ok(())
+}
+```
+
diff --git a/docs/source/working_with_ann_indexes_rust.rst b/docs/source/working_with_ann_indexes_rust.rst
deleted file mode 100644
index 487ad0964b..0000000000
--- a/docs/source/working_with_ann_indexes_rust.rst
+++ /dev/null
@@ -1,62 +0,0 @@
-Working with ANN Indexes in Rust
-================================
-
-- `Building and Searching an index`_
-
-Building and Searching an index
--------------------------------
-
-.. code-block:: rust
-
-    use cuvs::cagra::{Index, IndexParams};
-    use cuvs::{Resources, Result};
-
-    use ndarray_rand::rand_distr::Uniform;
-    use ndarray_rand::RandomExt;
-
-    /// Example showing how to index and search data with CAGRA
-    fn cagra_example() -> Result<()> {
-        let res = Resources::new()?;
-
-        // Create a new random dataset to index
-        let n_datapoints = 65536;
-        let n_features = 512;
-        let dataset =
-            ndarray::Array::<f32, _>::random((n_datapoints, n_features), Uniform::new(0., 1.0));
-
-        // build the cagra index
-        let build_params = IndexParams::new()?;
-        let index = Index::build(&res, &build_params, &dataset)?;
-
-        // use the first 4 points from the dataset as queries : will test that we get them back
-        // as their own nearest neighbor
-        let n_queries = 4;
-        let queries = dataset.slice(s![0..n_queries, ..]);
-
-        let k = 10;
-
-        // CAGRA search API requires queries and outputs to be on device memory
-        // copy query data over, and allocate new device memory for the distances/ neighbors
-        // outputs
-        let queries = ManagedTensor::from(&queries).to_device(&res)?;
-        let mut neighbors_host = ndarray::Array::<u32, _>::zeros((n_queries, k));
-        let neighbors = ManagedTensor::from(&neighbors_host).to_device(&res)?;
-
-        let mut distances_host = ndarray::Array::<f32, _>::zeros((n_queries, k));
-        let distances = ManagedTensor::from(&distances_host).to_device(&res)?;
-
-        let search_params = SearchParams::new()?;
-
-        index.search(&res, &search_params, &queries, &neighbors, &distances)?;
-
-        // Copy back to host memory
-        distances.to_host(&res, &mut distances_host)?;
-        neighbors.to_host(&res, &mut neighbors_host)?;
-
-        // nearest neighbors should be themselves, since queries are from the
-        // dataset
-        println!("Neighbors {:?}", neighbors_host);
-        println!("Distances {:?}", distances_host);
-
-        Ok(())
-    }

From e04035014ead91a984dd08d57aa0c75d8184bbe1 Mon Sep 17 00:00:00 2001
From: "Corey J. Nolet" <cjnolet@gmail.com>
Date: Tue, 5 May 2026 12:58:52 -0400
Subject: [PATCH 2/2] Fixing links

---
 CHANGELOG.md                                  |  2 +-
 docs/source/advanced_topics.md                |  2 +-
 docs/source/api_docs.md                       |  4 +--
 docs/source/comparing_indexes.md              |  4 +--
 docs/source/cuvs_bench/build.md               |  2 +-
 docs/source/cuvs_bench/index.md               | 10 +++---
 docs/source/cuvs_bench/param_tuning.md        |  8 ++---
 docs/source/cuvs_bench/pluggable_backend.md   |  4 +--
 docs/source/cuvs_bench/wiki_all_dataset.md    |  2 +-
 docs/source/developer_guide.md                |  4 +--
 docs/source/filtering.md                      |  5 ++-
 docs/source/getting_started.md                | 34 +++++++++----------
 docs/source/index.md                          |  2 +-
 docs/source/neighbors/all_neighbors.md        |  2 +-
 docs/source/neighbors/bruteforce.md           |  2 +-
 docs/source/neighbors/cagra.md                |  2 +-
 docs/source/neighbors/ivfflat.md              |  2 +-
 docs/source/neighbors/ivfpq.md                |  2 +-
 docs/source/neighbors/neighbors.md            |  6 ++--
 docs/source/neighbors/vamana.md               | 10 +++---
 docs/source/tuning_guide.md                   |  6 ++--
 .../vector_databases_vs_vector_search.md      |  4 +--
 22 files changed, 59 insertions(+), 60 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7d090c0f97..6d07b7fe0f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -576,7 +576,7 @@
 - Vendor RAPIDS.cmake ([#816](https://github.com/rapidsai/cuvs/pull/816)) [@bdice](https://github.com/bdice)
 - Update libcuvs libraft ver to 25.06 in conda env ([#808](https://github.com/rapidsai/cuvs/pull/808)) [@jinsolp](https://github.com/jinsolp)
 - Moving NN Descent class and struct declarations to `nn_descent_gnnd.hpp` ([#803](https://github.com/rapidsai/cuvs/pull/803)) [@jinsolp](https://github.com/jinsolp)
-- Remove `[@rapidsai/cuvs-build-codeowners` ([#783](https://github.com/rapidsai/cuvs/pull/783)) @KyleFromNVIDIA](https://github.com/rapidsai/cuvs-build-codeowners` ([#783](https://github.com/rapidsai/cuvs/pull/783)) @KyleFromNVIDIA)
+- Remove @rapidsai/cuvs-build-codeowners ([#783](https://github.com/rapidsai/cuvs/pull/783)) [@KyleFromNVIDIA](https://github.com/KyleFromNVIDIA)
 - Moving wheel builds to specified location and uploading build artifacts to Github ([#777](https://github.com/rapidsai/cuvs/pull/777)) [@VenkateshJaya](https://github.com/VenkateshJaya)
 - Remove unused raft cagra header in add_nodes.cuh ([#741](https://github.com/rapidsai/cuvs/pull/741)) [@jiangyinzuo](https://github.com/jiangyinzuo)
 - Expose kmeans to python ([#729](https://github.com/rapidsai/cuvs/pull/729)) [@benfred](https://github.com/benfred)
diff --git a/docs/source/advanced_topics.md b/docs/source/advanced_topics.md
index bd7ab0b709..80565f31c5 100644
--- a/docs/source/advanced_topics.md
+++ b/docs/source/advanced_topics.md
@@ -12,7 +12,7 @@ cuVS uses the Just-in-Time (JIT) [Link-Time Optimization (LTO)](https://develope
 Thus, the JIT compilation is a one-time cost and you can expect no loss in real performance after the first compilation. We recommend that you run a "warmup" to trigger the JIT compilation before the actual usage.
 
 Currently, the following capabilities will trigger a JIT compilation:
-- IVF Flat search APIs: {doc}`cuvs::neighbors::ivf_flat::search() <cpp_api/neighbors_ivf_flat>`
+- IVF Flat search APIs: [cuvs::neighbors::ivf_flat::search()](cpp_api/neighbors_ivf_flat.md)
 
 ```{toctree}
 :maxdepth: 2
diff --git a/docs/source/api_docs.md b/docs/source/api_docs.md
index 5d91e6dbbb..81c2c1b658 100644
--- a/docs/source/api_docs.md
+++ b/docs/source/api_docs.md
@@ -9,5 +9,5 @@ python_api.md
 rust_api/index.md
 ```
 
-* {ref}`genindex`
-* {ref}`search`
+* [Index](genindex.html)
+* [Search](search.html)
diff --git a/docs/source/comparing_indexes.md b/docs/source/comparing_indexes.md
index 3492fdc296..cac0844371 100644
--- a/docs/source/comparing_indexes.md
+++ b/docs/source/comparing_indexes.md
@@ -2,7 +2,7 @@
 
 # Comparing performance of vector indexes
 
-This document provides a brief overview methodology for comparing vector search indexes and models. For guidance on how to choose and configure an index type, please refer to {doc}`this <vector_databases_vs_vector_search>` guide.
+This document provides a brief overview of a methodology for comparing vector search indexes and models. For guidance on how to choose and configure an index type, please refer to [this](vector_databases_vs_vector_search.md) guide.
 
 Unlike traditional database indexes, which will generally return correct results even without performance tuning, vector search indexes are more closely related to ML models and they can return absolutely garbage results if they have not been tuned.
 
@@ -52,4 +52,4 @@ It turns out that most vector databases, like Milvus for example, make many smal
 
 Please note, however, that there are often caps on the size of each of these smaller indexes, and that needs to be taken into consideration when choosing the size of the sub sample to tune.
 
-Please see {doc}`this guide <tuning_guide>` for more information on the steps one would take to do this subsampling and tuning process.
+Please see [this guide](tuning_guide.md) for more information on the steps one would take to do this subsampling and tuning process.
diff --git a/docs/source/cuvs_bench/build.md b/docs/source/cuvs_bench/build.md
index 88f26c21bf..ba2f7f622e 100644
--- a/docs/source/cuvs_bench/build.md
+++ b/docs/source/cuvs_bench/build.md
@@ -4,7 +4,7 @@
 
 CUDA 12 and a GPU with Volta architecture or later are required to run the benchmarks.
 
-Please refer to the  {doc}`installation docs <../build>` for the base requirements to build cuVS.
+Please refer to the [installation docs](../build.md) for the base requirements to build cuVS.
 
 In addition to the base requirements for building cuVS, additional dependencies needed to build the ANN benchmarks include:
 
diff --git a/docs/source/cuvs_bench/index.md b/docs/source/cuvs_bench/index.md
index 4ad74fbcc1..91ebc77d18 100644
--- a/docs/source/cuvs_bench/index.md
+++ b/docs/source/cuvs_bench/index.md
@@ -24,9 +24,9 @@ This tool offers several benefits, including
 
 - [Running the benchmarks](#running-the-benchmarks)
 
-  * `End-to-end: smaller-scale benchmarks (<1M to 10M)`_
+  * [End-to-end: smaller-scale benchmarks (<1M to 10M)](#end-to-end-smaller-scale-benchmarks-1m-to-10m)
 
-  * `End-to-end: large-scale benchmarks (>10M vectors)`_
+  * [End-to-end: large-scale benchmarks (>10M vectors)](#end-to-end-large-scale-benchmarks-10m-vectors)
 
   * [Running with Docker containers](#running-with-docker-containers)
 
@@ -68,7 +68,7 @@ conda install -c rapidsai -c conda-forge  cuvs-bench-cpu
 
 The channel `rapidsai` can easily be substituted with `rapidsai-nightly` if nightly benchmarks are desired. The CPU package currently allows to run the HNSW benchmarks.
 
-Please see the {doc}`build instructions <build>` to build the benchmarks from source.
+Please see the [build instructions](build.md) to build the benchmarks from source.
 
 ### Docker
 
@@ -188,7 +188,7 @@ All other python commands mentioned below work as intended once the billion-scal
 
 To download billion-scale datasets, visit [big-ann-benchmarks](http://big-ann-benchmarks.com/neurips21.html)
 
-We also provide a new dataset called `wiki-all` containing 88 million 768-dimensional vectors. This dataset is meant for benchmarking a realistic retrieval-augmented generation (RAG)/LLM embedding size at scale. It also contains 1M and 10M vector subsets for smaller-scale experiments. See our {doc}`Wiki-all Dataset Guide <wiki_all_dataset>` for more information and to download the dataset.
+We also provide a new dataset called `wiki-all` containing 88 million 768-dimensional vectors. This dataset is meant for benchmarking a realistic retrieval-augmented generation (RAG)/LLM embedding size at scale. It also contains 1M and 10M vector subsets for smaller-scale experiments. See our [Wiki-all Dataset Guide](wiki_all_dataset.md) for more information and to download the dataset.
 
 
 The steps below demonstrate how to download, install, and run benchmarks on a subset of 100M vectors from the Yandex Deep-1B dataset. Please note that datasets of this scale are recommended for GPUs with larger amounts of memory, such as the A100 or H100.
@@ -468,7 +468,7 @@ The config above has 3 fields:
 2. `constraints` - Optional. Python import paths to functions that validate build and search parameter combinations (e.g. `cuvs_bench.config.algos.constraints.cuvs_cagra_build`). Each function returns `True` if the parameters are valid, `False` otherwise; invalid combinations are skipped and not benchmarked.
 3. `groups` - Run groups, each with a set of parameters. Each group defines a cross-product of all hyper-parameter fields for `build` and `search`.
 
-The table below contains all algorithms supported by cuVS. Each unique algorithm will have its own set of `build` and `search` settings. The {doc}`ANN Algorithm Parameter Tuning Guide <param_tuning>` contains detailed instructions on choosing build and search parameters for each supported algorithm.
+The table below contains all algorithms supported by cuVS. Each unique algorithm will have its own set of `build` and `search` settings. The [ANN Algorithm Parameter Tuning Guide](param_tuning.md) contains detailed instructions on choosing build and search parameters for each supported algorithm.
 
 ```{list-table}
 * - Library
diff --git a/docs/source/cuvs_bench/param_tuning.md b/docs/source/cuvs_bench/param_tuning.md
index 1464bc83b3..65f2af6bd9 100644
--- a/docs/source/cuvs_bench/param_tuning.md
+++ b/docs/source/cuvs_bench/param_tuning.md
@@ -1,6 +1,6 @@
 # cuVS Bench Parameter Tuning Guide
 
-This guide outlines the various parameter settings that can be specified in {doc}`cuVS Benchmarks <index>` yaml configuration files and explains the impact they have on corresponding algorithms to help inform their settings for benchmarking across desired levels of recall.
+This guide outlines the various parameter settings that can be specified in [cuVS Benchmarks](index.md) yaml configuration files and explains the impact they have on corresponding algorithms to help inform their settings for benchmarking across desired levels of recall.
 
 ## Benchmark modes
 
@@ -8,7 +8,7 @@ When you run benchmarks with `BenchmarkOrchestrator.run_benchmark()`, you can ch
 
 **Sweep mode (default)**
 
-Pass `mode="sweep"` or omit `mode`. The orchestrator builds the full Cartesian product of all build and search parameter lists defined in the algorithm YAML (see {doc}`Creating and customizing dataset configurations <index>`). Every valid combination (after constraint filtering) is run. Use this for exhaustive comparison across the configured parameter grid.
+Pass `mode="sweep"` or omit `mode`. The orchestrator builds the full Cartesian product of all build and search parameter lists defined in the algorithm YAML (see [Creating and customizing dataset configurations](index.md)). Every valid combination (after constraint filtering) is run. Use this for exhaustive comparison across the configured parameter grid.
 
 **Tune mode**
 
@@ -148,7 +148,7 @@ IVF-pq is an inverted-file index, which partitions the vectors into a series of
   - N
   - [`cluster`, `subspace`]
   - `subspace`
-  - Type of codebook. See {doc}`IVF-PQ index overview <../neighbors/ivfpq>` for more detail
+  - Type of codebook. See [IVF-PQ index overview](../neighbors/ivfpq.md) for more detail
 
 * - `dataset_memory_type`
   - `build`
@@ -363,7 +363,7 @@ To fine tune CAGRA index building we can customize IVF-PQ index builder options
   - N
   - [`cluster`, `subspace`]
   - `subspace`
-  - Type of codebook. See {doc}`IVF-PQ index overview <../neighbors/ivfpq>` for more detail
+  - Type of codebook. See [IVF-PQ index overview](../neighbors/ivfpq.md) for more detail
 
 * - `ivf_pq_build_nprobe`
   - `search`
diff --git a/docs/source/cuvs_bench/pluggable_backend.md b/docs/source/cuvs_bench/pluggable_backend.md
index c53031e2ea..15390292ff 100644
--- a/docs/source/cuvs_bench/pluggable_backend.md
+++ b/docs/source/cuvs_bench/pluggable_backend.md
@@ -40,7 +40,7 @@ The orchestrator calls the config loader's **load()** method with the same argum
 
 - **List[BenchmarkConfig]** – Each **BenchmarkConfig** has:
   - **indexes**: a list of **IndexConfig**. Each **IndexConfig** has `name` (e.g. `"my_algo.param1value"`), `algo` (algorithm name), `build_param` (dict of build parameters), `search_params` (list of dicts, one per search parameter combination to benchmark), and `file` (path or identifier where the index is stored).
-  - **backend_config**: a dict passed to the backend constructor (e.g. `executable_path` for C++, or `host`, `port`, `index_name` for a network backend). The backend receives this as its `config` in `__init__`.
+  - **backend_config**: a dict passed to the backend constructor (e.g. `executable_path` for C++, or `host`, `port`, `index_name` for a network backend). The backend receives this as its `config` in `__init__`.
 
 The following shows how to construct a minimal `DatasetConfig` and one `BenchmarkConfig` (one index, one search param set) so the backend runs a single build and search configuration:
 
@@ -213,7 +213,7 @@ register_config_loader("elasticsearch", ElasticsearchConfigLoader)
 get_registry().register("elasticsearch", ElasticsearchBackend)
 ```
 
-The built-in **CppGoogleBenchmarkBackend** (`backend_type="cpp_gbench"`) is one such pair: **CppGBenchConfigLoader** reads the YAML under `config/datasets` and `config/algos`, expands the Cartesian product, and validates with the constraint functions; the backend runs the C++ benchmark executables and merges results. Adding a new C++ algorithm (see {doc}`index`) only adds another executable and config for this backend; it does not add a new backend.
+The built-in **CppGoogleBenchmarkBackend** (`backend_type="cpp_gbench"`) is one such pair: **CppGBenchConfigLoader** reads the YAML under `config/datasets` and `config/algos`, expands the Cartesian product, and validates with the constraint functions; the backend runs the C++ benchmark executables and merges results. Adding a new C++ algorithm (see [index](index.md)) only adds another executable and config for this backend; it does not add a new backend.
 
 ## Components at a glance
 
diff --git a/docs/source/cuvs_bench/wiki_all_dataset.md b/docs/source/cuvs_bench/wiki_all_dataset.md
index 3e26ca0d9e..fa19eb6fb6 100644
--- a/docs/source/cuvs_bench/wiki_all_dataset.md
+++ b/docs/source/cuvs_bench/wiki_all_dataset.md
@@ -13,7 +13,7 @@ To form the final dataset, the Wiki texts were chunked into 85 million 128-token
 
 ### Full dataset
 
-A version of the dataset is made available in the binary format that can be used directly by the {doc}`cuvs-bench <index>` tool. The full 88M dataset is ~251GB and the download link below contains tarballs that have been split into multiple parts.
+A version of the dataset is made available in the binary format that can be used directly by the [cuvs-bench](index.md) tool. The full 88M dataset is ~251GB and the download link below contains tarballs that have been split into multiple parts.
 
 The following will download all 10 the parts and untar them to a `wiki_all_88M` directory:
 
diff --git a/docs/source/developer_guide.md b/docs/source/developer_guide.md
index c323de0286..5fc14c4317 100644
--- a/docs/source/developer_guide.md
+++ b/docs/source/developer_guide.md
@@ -181,7 +181,7 @@ You can skip these checks with `git commit --no-verify` or with the short versio
 The following section describes some of the core pre-commit hooks used by the repository.
 See `.pre-commit-config.yaml` for a full list.
 
-C++/CUDA is formatted with [`clang-format`](https://clang.llvm.org/docs/ClangFormat.html).
+C++/CUDA is formatted with [clang-format](https://clang.llvm.org/docs/ClangFormat.html).
 
 RAFT relies on `clang-format` to enforce code style across all C++ and CUDA source code. The coding style is based on the [Google style guide](https://google.github.io/styleguide/cppguide.html#Formatting). The only digressions from this style are the following.
 1. Do not split empty functions/records/namespaces.
@@ -189,7 +189,7 @@ RAFT relies on `clang-format` to enforce code style across all C++ and CUDA sour
 3. Disable reflowing of comments.
    The reasons behind these deviations from the Google style guide are given in comments [here](https://github.com/rapidsai/cuvs/blob/main/cpp/.clang-format).
 
-[`doxygen`](https://doxygen.nl/) is used as documentation generator and also as a documentation linter.
+[doxygen](https://doxygen.nl/) is used as documentation generator and also as a documentation linter.
 In order to run doxygen as a linter on C++/CUDA code, run
 
 ```bash
diff --git a/docs/source/filtering.md b/docs/source/filtering.md
index 4cd902f623..36a537b0bb 100644
--- a/docs/source/filtering.md
+++ b/docs/source/filtering.md
@@ -11,13 +11,13 @@ some computation from calculating distances.
 A bitset is an array of bits where each bit can have two possible values: `0` and `1`, which signify in the context of filtering whether
 a sample should be filtered or not. `0` means that the corresponding vector will be filtered, and will therefore not be present in the results of the search.
 This mechanism is optimized to take as little memory space as possible, and is available through the RAFT library
-(check out RAFT's `bitset API documentation <https://docs.rapids.ai/api/raft/stable/cpp_api/core_bitset/>`). When calling a search function of an ANN index, the
+(check out RAFT's [bitset API documentation](https://docs.rapids.ai/api/raft/stable/cpp_api/core_bitset/)). When calling a search function of an ANN index, the
 bitset length should match the number of vectors present in the database.
 
 ## Bitmap
 
 A bitmap is based on the same principle as a bitset, but in two dimensions. This allows users to provide a different bitset for each query
-being searched. Check out RAFT's `bitmap API documentation <https://docs.rapids.ai/api/raft/stable/cpp_api/core_bitmap/>`.
+being searched. Check out RAFT's [bitmap API documentation](https://docs.rapids.ai/api/raft/stable/cpp_api/core_bitmap/).
 
 ## Examples
 
@@ -106,4 +106,3 @@ brute_force::search(res,
                     distances.view(),
                     bitmap_filter);
 ```
-
diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md
index d108652653..acaf068016 100644
--- a/docs/source/getting_started.md
+++ b/docs/source/getting_started.md
@@ -2,31 +2,31 @@
 
 - [New to vector search?](#new-to-vector-search)
 
-  * {doc}`Primer on vector search <choosing_and_configuring_indexes>`
+  * [Primer on vector search](choosing_and_configuring_indexes.md)
 
-  * {doc}`Vector search indexes vs vector databases <vector_databases_vs_vector_search>`
+  * [Vector search indexes vs vector databases](vector_databases_vs_vector_search.md)
 
-  * {doc}`Index tuning guide <tuning_guide>`
+  * [Index tuning guide](tuning_guide.md)
 
-  * {doc}`Comparing vector search index performance <comparing_indexes>`
+  * [Comparing vector search index performance](comparing_indexes.md)
 
 - [Supported indexes](#supported-indexes)
 
-  * {doc}`Vector search index guide <neighbors/neighbors>`
+  * [Vector search index guide](neighbors/neighbors.md)
 
 - [Using cuVS APIs](#using-cuvs-apis)
 
-  * {doc}`C API Docs <c_api>`
+  * [C API Docs](c_api.md)
 
-  * {doc}`C++ API Docs <cpp_api>`
+  * [C++ API Docs](cpp_api.md)
 
-  * {doc}`Python API Docs <python_api>`
+  * [Python API Docs](python_api.md)
 
-  * {doc}`Rust API Docs <rust_api/index>`
+  * [Rust API Docs](rust_api/index.md)
 
-  * {doc}`API basics <api_basics>`
+  * [API basics](api_basics.md)
 
-  * {doc}`API interoperability <api_interoperability>`
+  * [API interoperability](api_interoperability.md)
 
 - [Where to next?](#where-to-next)
 
@@ -40,9 +40,9 @@
 
 ## New to vector search?
 
-If you are unfamiliar with the basics of vector search or how vector search differs from vector databases, then {doc}`this primer on vector search guide <choosing_and_configuring_indexes>` should provide some good insight. Another good resource for the uninitiated is our {doc}`vector databases vs vector search <vector_databases_vs_vector_search>` guide. As outlined in the primer, vector search as used in vector databases is often closer to machine learning than to traditional databases. This means that while traditional databases can often be slow without any performance tuning, they will usually still yield the correct results. Unfortunately, vector search indexes, like other machine learning models, can yield garbage results if not tuned correctly.
+If you are unfamiliar with the basics of vector search or how vector search differs from vector databases, then [this primer on vector search guide](choosing_and_configuring_indexes.md) should provide some good insight. Another good resource for the uninitiated is our [vector databases vs vector search](vector_databases_vs_vector_search.md) guide. As outlined in the primer, vector search as used in vector databases is often closer to machine learning than to traditional databases. This means that while traditional databases can often be slow without any performance tuning, they will usually still yield the correct results. Unfortunately, vector search indexes, like other machine learning models, can yield garbage results if not tuned correctly.
 
-Fortunately, this opens up the whole world of hyperparameter optimization to improve vector search performance and quality. Please see our {doc}`index tuning guide <tuning_guide>` for more information.
+Fortunately, this opens up the whole world of hyperparameter optimization to improve vector search performance and quality. Please see our [index tuning guide](tuning_guide.md) for more information.
 
 When comparing the performance of vector search indexes, it is important that considerations are made with respect to three main dimensions:
 
@@ -50,20 +50,20 @@ When comparing the performance of vector search indexes, it is important that co
 1. Search quality
 1. Search performance
 
-Please see the {doc}`primer on comparing vector search index performance <comparing_indexes>` for more information on methodologies and how to make a fair apples-to-apples comparison during your evaluations.
+Please see the [primer on comparing vector search index performance](comparing_indexes.md) for more information on methodologies and how to make a fair apples-to-apples comparison during your evaluations.
 
 ## Supported indexes
 
-cuVS supports many of the standard index types with the list continuing to grow and stay current with the state-of-the-art. Please refer to our {doc}`vector search index guide <neighbors/neighbors>` to learn more about each individual index type, when they can be useful on the GPU, the tuning knobs they offer to trade off performance and quality.
+cuVS supports many of the standard index types with the list continuing to grow and stay current with the state-of-the-art. Please refer to our [vector search index guide](neighbors/neighbors.md) to learn more about each individual index type, when they can be useful on the GPU, and the tuning knobs they offer to trade off performance and quality.
 
 The primary goal of cuVS is to enable speed, scale, and flexibility (in that order)- and one of the important value propositions is to enhance existing software deployments with extensible GPU capabilities to improve pain points while not interrupting parts of the system that work well today with CPU.
 
 
 ## Using cuVS APIs
 
-cuVS is a C++ library at its core, which is wrapped with a C library and exposed further through various different languages. cuVS currently provides APIs and documentation for {doc}`C <c_api>`, {doc}`C++ <cpp_api>`, {doc}`Python <python_api>`, and {doc}`Rust <rust_api/index>` with more languages in the works. our {doc}`API basics <api_basics>` provides some background and context about the important paradigms and vocabulary types you'll encounter when working with cuVS types.
+cuVS is a C++ library at its core, which is wrapped with a C library and exposed further through various different languages. cuVS currently provides APIs and documentation for [C](c_api.md), [C++](cpp_api.md), [Python](python_api.md), and [Rust](rust_api/index.md) with more languages in the works. Our [API basics](api_basics.md) provides some background and context about the important paradigms and vocabulary types you'll encounter when working with cuVS types.
 
-Please refer to the {doc}`guide on API interoperability <api_interoperability>` for more information on how cuVS can work seamlessly with other libraries like numpy, cupy, tensorflow, and pytorch, even without having to copy device memory.
+Please refer to the [guide on API interoperability](api_interoperability.md) for more information on how cuVS can work seamlessly with other libraries like numpy, cupy, tensorflow, and pytorch, even without having to copy device memory.
 
 
 ## Where to next?
diff --git a/docs/source/index.md b/docs/source/index.md
index ed4daad7fd..1349291c18 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -1,6 +1,6 @@
 # cuVS: Vector Search and Clustering on the GPU
 
-Welcome to cuVS, the premier library for GPU-accelerated vector search and clustering! cuVS provides several core building blocks for constructing new algorithms, as well as end-to-end vector search and clustering algorithms for use either standalone or through a growing list of {doc}`integrations <integrations>`.
+Welcome to cuVS, the premier library for GPU-accelerated vector search and clustering! cuVS provides several core building blocks for constructing new algorithms, as well as end-to-end vector search and clustering algorithms for use either standalone or through a growing list of [integrations](integrations.md).
 
 ## Useful Resources
 
diff --git a/docs/source/neighbors/all_neighbors.md b/docs/source/neighbors/all_neighbors.md
index c1368aafd5..69feafa4e6 100644
--- a/docs/source/neighbors/all_neighbors.md
+++ b/docs/source/neighbors/all_neighbors.md
@@ -15,7 +15,7 @@ All-neighbors supports multiple underlying algorithms:
 
 The algorithm partitions the dataset into clusters and distributes the work across multiple GPUs when possible, making it suitable for large-scale graph construction tasks.
 
-[ {doc}`C API <../c_api/neighbors_all_neighbors_c>` | {doc}`C++ API <../cpp_api/neighbors_all_neighbors>` | {doc}`Python API <../python_api/neighbors_all_neighbors>` ]
+[C API](../c_api/neighbors_all_neighbors_c.md) | [C++ API](../cpp_api/neighbors_all_neighbors.md) | [Python API](../python_api/neighbors_all_neighbors.md)
 
 ## Algorithm Overview
 
diff --git a/docs/source/neighbors/bruteforce.md b/docs/source/neighbors/bruteforce.md
index 230e5bb3c6..0a098cf561 100644
--- a/docs/source/neighbors/bruteforce.md
+++ b/docs/source/neighbors/bruteforce.md
@@ -11,7 +11,7 @@ Brute-force can also be a good choice for heavily filtered queries where other a
 when filtering out 90%-95% of the vectors from a search, the IVF methods could struggle to return anything at all with smaller number of probes and
 graph-based algorithms with limited hash table memory could end up skipping over important unfiltered entries.
 
-[ {doc}`C API <../c_api/neighbors_bruteforce_c>` | {doc}`C++ API <../cpp_api/neighbors_bruteforce>` | {doc}`Python API <../python_api/neighbors_brute_force>` | {doc}`Rust API <../rust_api/index>` ]
+[C API](../c_api/neighbors_bruteforce_c.md) | [C++ API](../cpp_api/neighbors_bruteforce.md) | [Python API](../python_api/neighbors_brute_force.md) | [Rust API](../rust_api/index.md)
 
 ## Filtering considerations
 
diff --git a/docs/source/neighbors/cagra.md b/docs/source/neighbors/cagra.md
index 48c4d0b289..dd600148f8 100644
--- a/docs/source/neighbors/cagra.md
+++ b/docs/source/neighbors/cagra.md
@@ -12,7 +12,7 @@ I-force could be used to construct the initial kNN graph. This would yield the m
 we find that in practice the kNN graph does not need to be very accurate since the pruning step helps to boost the overall recall of
 the index. cuVS provides IVF-PQ and NN-Descent strategies for building the initial kNN graph and these can be selected in index params object during index construction.
 
-[ {doc}`C API <../c_api/neighbors_cagra_c>` | {doc}`C++ API <../cpp_api/neighbors_cagra>` | {doc}`Python API <../python_api/neighbors_cagra>` | {doc}`Rust API <../rust_api/index>` ]
+[C API](../c_api/neighbors_cagra_c.md) | [C++ API](../cpp_api/neighbors_cagra.md) | [Python API](../python_api/neighbors_cagra.md) | [Rust API](../rust_api/index.md)
 
 ## Interoperability with HNSW
 
diff --git a/docs/source/neighbors/ivfflat.md b/docs/source/neighbors/ivfflat.md
index 04febe28dd..e873c59891 100644
--- a/docs/source/neighbors/ivfflat.md
+++ b/docs/source/neighbors/ivfflat.md
@@ -13,7 +13,7 @@ IVF-Flat tends to be a great choice when
 in the index, and
 2. exact recall is not needed. as with the other index types, the tuning parameters are used to trade-off recall for search latency / throughput.
 
-[ {doc}`C API <../c_api/neighbors_ivf_flat_c>` | {doc}`C++ API <../cpp_api/neighbors_ivf_flat>` | {doc}`Python API <../python_api/neighbors_ivf_flat>` | {doc}`Rust API <../rust_api/index>` ]
+[C API](../c_api/neighbors_ivf_flat_c.md) | [C++ API](../cpp_api/neighbors_ivf_flat.md) | [Python API](../python_api/neighbors_ivf_flat.md) | [Rust API](../rust_api/index.md)
 
 ## Filtering considerations
 
diff --git a/docs/source/neighbors/ivfpq.md b/docs/source/neighbors/ivfpq.md
index 893dd53a23..3116bd4d9a 100644
--- a/docs/source/neighbors/ivfpq.md
+++ b/docs/source/neighbors/ivfpq.md
@@ -8,7 +8,7 @@ Often a strategy called refinement reranking is employed to make up for the lost
 `k` than desired and performing a reordering and reduction to `k` based on the distances from the unquantized vectors. Unfortunately,
 this does mean that the unquantized raw vectors need to be available and often this can be done efficiently using multiple CPU threads.
 
-[ {doc}`C API <../c_api/neighbors_ivf_pq_c>` | {doc}`C++ API <../cpp_api/neighbors_ivf_pq>` | {doc}`Python API <../python_api/neighbors_ivf_pq>` | {doc}`Rust API <../rust_api/index>` ]
+[C API](../c_api/neighbors_ivf_pq_c.md) | [C++ API](../cpp_api/neighbors_ivf_pq.md) | [Python API](../python_api/neighbors_ivf_pq.md) | [Rust API](../rust_api/index.md)
 
 
 ## Configuration parameters
diff --git a/docs/source/neighbors/neighbors.md b/docs/source/neighbors/neighbors.md
index a1c436caa7..1aae68bc57 100644
--- a/docs/source/neighbors/neighbors.md
+++ b/docs/source/neighbors/neighbors.md
@@ -14,6 +14,6 @@ all_neighbors.md
 
 # Indices and tables
 
-* {ref}`genindex`
-* {ref}`modindex`
-* {ref}`search`
+* [Index](genindex.html)
+* [Module Index](py-modindex.html)
+* [Search](search.html)
diff --git a/docs/source/neighbors/vamana.md b/docs/source/neighbors/vamana.md
index 7761f57654..b8004fef12 100644
--- a/docs/source/neighbors/vamana.md
+++ b/docs/source/neighbors/vamana.md
@@ -1,20 +1,20 @@
 # Vamana
 
-VAMANA is the underlying graph construction algorithm used to construct indexes for the DiskANN vector search solution. DiskANN and the Vamana algorithm are described in detail in the `published paper <https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf>`, and a highly optimized `open-source repository <https://github.com/microsoft/DiskANN>`  includes many features for index construction and search. In cuVS, we provide a version of the Vamana algorithm optimized for GPU architectures to accelreate graph construction to build DiskANN idnexes. At a high level, the Vamana algorithm operates as follows:
+VAMANA is the underlying graph construction algorithm used to construct indexes for the DiskANN vector search solution. DiskANN and the Vamana algorithm are described in detail in the [published paper](https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf), and a highly optimized [open-source repository](https://github.com/microsoft/DiskANN) includes many features for index construction and search. In cuVS, we provide a version of the Vamana algorithm optimized for GPU architectures to accelerate graph construction to build DiskANN indexes. At a high level, the Vamana algorithm operates as follows:
 
 * 1. Starting with an empty graph, select a medoid vector from the D-dimension vector dataset and insert it into the graph.
 * 2. Iteratively insert batches of dataset vectors into the graph, connecting each inserted vector to neighbors based on a graph traversal.
 * 3. For each batch, create reverse edges and prune unnecessary edges.
 
-There are many algorithmic details that are outlined in the `paper <https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf>`, and many GPU-specific optimizations are included in this implementation.
+There are many algorithmic details that are outlined in the [paper](https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf), and many GPU-specific optimizations are included in this implementation.
 
-The current implementation of DiskANN in cuVS only includes the 'in-memory' graph construction and a serialization step that writes the index to a file. This index file can be then used by the `open-source DiskANN <https://github.com/microsoft/DiskANN>` library to perform efficient search. Additional DiskANN functionality, including GPU-accelerated search and 'ssd' index build are planned for future cuVS releases.
+The current implementation of DiskANN in cuVS only includes the 'in-memory' graph construction and a serialization step that writes the index to a file. This index file can then be used by the [open-source DiskANN](https://github.com/microsoft/DiskANN) library to perform efficient search. Additional DiskANN functionality, including GPU-accelerated search and 'ssd' index build, is planned for future cuVS releases.
 
-[ {doc}`C++ API <../cpp_api/neighbors_vamana>` ]
+[C++ API](../cpp_api/neighbors_vamana.md)
 
 ## Interoperability with CPU DiskANN
 
-The 'vamana::serialize' API calls writes the index to a file with a format that is compatible with the `open-source DiskANN repositoriy <https://github.com/microsoft/DiskANN>`. This allows cuVS to be used to accelerate index construction while leveraging the efficient CPU-based search currently available.
+The 'vamana::serialize' API call writes the index to a file in a format that is compatible with the [open-source DiskANN repository](https://github.com/microsoft/DiskANN). This allows cuVS to be used to accelerate index construction while leveraging the efficient CPU-based search currently available.
 
 ## Configuration parameters
 
diff --git a/docs/source/tuning_guide.md b/docs/source/tuning_guide.md
index d9cfb4f187..905e96180f 100644
--- a/docs/source/tuning_guide.md
+++ b/docs/source/tuning_guide.md
@@ -2,13 +2,13 @@
 
 ## Introduction
 
-A Method for tuning and evaluating Vector Search Indexes At Scale in Locally Indexed Vector Databases. For more information on the differences between locally and globally indexed vector databases, please see {doc}`this guide <vector_databases_vs_vector_search>`. The goal of this guide is to give users a scalable and effective approach for tuning a vector search index, no matter how large.  Evaluation of a vector search index “model” that measures recall in proportion to build time so that it penalizes the recall when the build time is really high (should ultimately optimize for finding a lower build time and higher recall).
+This guide describes a method for tuning and evaluating vector search indexes at scale in locally indexed vector databases. For more information on the differences between locally and globally indexed vector databases, please see [this guide](vector_databases_vs_vector_search.md). The goal of this guide is to give users a scalable and effective approach for tuning a vector search index, no matter how large. The evaluation treats a vector search index as a “model” and measures recall in proportion to build time, so that recall is penalized when the build time is very high; the tuning should ultimately optimize for a lower build time and higher recall.
 
-For more information on the various different types of vector search indexes, please see our {doc}`guide to choosing vector search indexes <choosing_and_configuring_indexes>`
+For more information on the different types of vector search indexes, please see our [guide to choosing vector search indexes](choosing_and_configuring_indexes.md).
 
 ## Why automated tuning?
 
-As much as 75% of users have told us they will not be able to tune a vector database beyond one or two simple knobs and we suggest that an ideal “knob” would be to balance training time and search time with search quality. The more time, the higher the quality, and the more needed to find an acceptable search performance. Even the 25% of users that want to tune are still asking for simple tools for doing so. These users also ask for some simple guidelines for setting tuning parameters, like {doc}`this guide <neighbors/neighbors>`.
+As many as 75% of users have told us they will not be able to tune a vector database beyond one or two simple knobs, and we suggest that an ideal “knob” would balance training time and search time against search quality. The more time spent, the higher the quality, and the more time is needed to find acceptable search performance. Even the 25% of users who want to tune are still asking for simple tools for doing so. These users also ask for some simple guidelines for setting tuning parameters, like [this guide](neighbors/neighbors.md).
 
 Since vector search indexes are more closely related to machine learning models than traditional databases indexes, one option for easing the parameter tuning burden is to use hyper-parameter optimization tools like [Ray Tune](https://medium.com/rapids-ai/30x-faster-hyperparameter-search-with-raytune-and-rapids-403013fbefc5) and [Optuna](https://docs.rapids.ai/deployment/stable/examples/rapids-optuna-hpo/notebook/). to verify this.
 
diff --git a/docs/source/vector_databases_vs_vector_search.md b/docs/source/vector_databases_vs_vector_search.md
index d3c1f76e3f..f0317a567e 100644
--- a/docs/source/vector_databases_vs_vector_search.md
+++ b/docs/source/vector_databases_vs_vector_search.md
@@ -1,6 +1,6 @@
 # Vector search indexes vs vector databases
 
-This guide provides information on the differences between vector search indexes and fully-fledged vector databases. For more information on selecting and configuring vector search indexes, please refer to our {doc}`guide on choosing and configuring indexes <choosing_and_configuring_indexes>`
+This guide provides information on the differences between vector search indexes and fully-fledged vector databases. For more information on selecting and configuring vector search indexes, please refer to our [guide on choosing and configuring indexes](choosing_and_configuring_indexes.md).
 
 One of the primary differences between vector database indexes and traditional database indexes is that vector search often uses approximations to trade-off accuracy of the results for speed. Because of this, while many mature databases offer mechanisms to tune their indexes and achieve better performance, vector database indexes can return completely garbage results if they aren’t tuned for a reasonable level of search quality in addition to performance tuning. This is because vector database indexes are more closely related to machine learning models than they are to traditional database indexes.
 
@@ -47,4 +47,4 @@ Unfortunately, for large datasets, doing a hyper-parameter optimization on the w
 
 Full hyper-parameter optimization may also not always be necessary- for example, once you have built a ground truth dataset on a subset, many times you can start by building an index with the default build parameters and then playing around with different search parameters until you get the desired quality and search performance.  For massive indexes that might be multiple terabytes, you could also take this subsampling of, say, 10M vectors, train an index and then tune the search parameters from there. While there might be a small margin of error, the chosen build/search parameters should generalize fairly well for the databases that build locally partitioned indexes.
 
-Refer to our {doc}`tuning guide <tuning_guide>` for more information and examples on how to efficiently and automatically tune your vector search indexes based on your needs.
+Refer to our [tuning guide](tuning_guide.md) for more information and examples on how to efficiently and automatically tune your vector search indexes based on your needs.