Skip to content
Merged
Show file tree
Hide file tree
Changes from 29 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
c067d52
Add shared load_vectors utility to centralize binary vector file load…
jnke2016 Apr 25, 2026
6e91f0e
Promote shared config-loading methods (load_yaml_file, get_dataset_co…
jnke2016 Apr 25, 2026
99fb908
Add input validation: fail on unsupported vector file extensions, val…
jnke2016 Apr 25, 2026
1e1ef66
update docstrings
jnke2016 Apr 25, 2026
21d8621
Refactor Dataset from dataclass to regular class with transparent vec…
jnke2016 Apr 26, 2026
60f3aeb
Add dtype_from_filename from generate_groundtruth/utils.py and refact…
jnke2016 Apr 26, 2026
39867d9
Fix outdated DummyBackend/AnotherDummyBackend signatures in test_regi…
jnke2016 Apr 26, 2026
1e2de0f
Add test_utils.py with tests for shared backend utilities, Dataset tr…
jnke2016 Apr 26, 2026
9480257
Refactor ConfigLoader.load() to handle shared steps (YAML loading, da…
jnke2016 Apr 26, 2026
3399692
Add shared expand_param_grid() and compute_recall() utilities with te…
jnke2016 Apr 27, 2026
896323d
Add executable_dir support to CppGBenchConfigLoader to match run.py's…
jnke2016 Apr 27, 2026
93b78db
Route CLI through BenchmarkOrchestrator, add --mode, --constraints, -…
jnke2016 Apr 27, 2026
c99cf88
update docs
jnke2016 Apr 27, 2026
4f01af3
Fix executable_dir not being passed through to _prepare_benchmark_con…
jnke2016 Apr 27, 2026
aa913dc
Add CLI tests for --mode, --backend-config, --n-trials, and --constra…
jnke2016 Apr 27, 2026
dddf1d2
Move recall computation from backends to orchestrator, computed trans…
jnke2016 Apr 28, 2026
948fb8a
Switch gather_algorithm_configs filter from exclusion to .yaml/.yml i…
jnke2016 May 4, 2026
4781e32
Centralize parameter expansion through shared expand_param_grid() so …
jnke2016 May 4, 2026
592afc3
Merge remote-tracking branch 'upstream/main' into consolidate-shared-…
jnke2016 May 4, 2026
3e561c3
fix style
jnke2016 May 4, 2026
b31ae14
add input validation, property setters, portable test paths, and fix …
jnke2016 May 4, 2026
46b3cae
fix style
jnke2016 May 5, 2026
4cdd998
Rename base_vectors to training_vectors for clarity.
jnke2016 May 9, 2026
499c9fe
Rename base_vectors to training_vectors for clarity.
jnke2016 May 9, 2026
76d8c80
Add groundtruth_distances to Dataset with same property-based loading…
jnke2016 May 9, 2026
3e15ee4
Rename utils.py to _utils.py to signal internal-only usage by the fra…
jnke2016 May 9, 2026
9030d9e
Merge remote-tracking branch 'upstream/main' into consolidate-shared-…
jnke2016 May 9, 2026
99cceb1
fix style
jnke2016 May 9, 2026
798f44a
Merge remote-tracking branch 'upstream/main' into consolidate-shared-…
jnke2016 May 9, 2026
0dc78a2
Merge branch 'main' into consolidate-shared-utils
cjnolet May 13, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 3 additions & 29 deletions docs/source/cuvs_bench/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -113,22 +113,8 @@ The steps below demonstrate how to download, install, and run benchmarks on a su
# (1) Prepare dataset.
python -m cuvs_bench.get_dataset --dataset deep-image-96-angular --normalize

.. code-block:: python

# (2) Build and search index.
from cuvs_bench.orchestrator import BenchmarkOrchestrator

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's wonderful that we have these ways of running the orchestrator through Python. Is it possible for us to use both approaches? Are there downsides to having one over the other? Can we keep both in the docs?

For example, especially in the "tuning" mode, I think users are going to want to run that inside existing Python code more frequently than from the command line.


orchestrator = BenchmarkOrchestrator(backend_type="cpp_gbench")
results = orchestrator.run_benchmark(
dataset="deep-image-96-inner",
algorithms="cuvs_cagra",
count=10,
batch_size=10,
build=True,
search=True,
)

.. code-block:: bash
python -m cuvs_bench.run --dataset deep-image-96-inner --algorithms cuvs_cagra --batch-size 10 -k 10 --build --search

# (3) Export data.
python -m cuvs_bench.run --data-export --dataset deep-image-96-inner
Expand Down Expand Up @@ -210,22 +196,10 @@ The steps below demonstrate how to download, install, and run benchmarks on a su
python -m cuvs_bench.split_groundtruth --groundtruth datasets/deep-1B/deep_new_groundtruth.public.10K.bin
# two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced

.. code-block:: python
.. code-block:: bash

# (2) Build and search index.
from cuvs_bench.orchestrator import BenchmarkOrchestrator

orchestrator = BenchmarkOrchestrator(backend_type="cpp_gbench")
results = orchestrator.run_benchmark(
dataset="deep-1B",
algorithms="cuvs_cagra",
count=10,
batch_size=10,
build=True,
search=True,
)

.. code-block:: bash
python -m cuvs_bench.run --dataset deep-1B --algorithms cuvs_cagra --batch-size 10 -k 10 --build --search

# (3) Export data.
python -m cuvs_bench.run --data-export --dataset deep-1B
Expand Down
202 changes: 202 additions & 0 deletions python/cuvs_bench/cuvs_bench/backends/_utils.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,202 @@
#
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION.
# SPDX-License-Identifier: Apache-2.0
#

"""
Shared utilities for cuvs-bench backends.

Provides common functions for Python-native backends (e.g., OpenSearch,
Elasticsearch):
- Vector file I/O: used internally by the Dataset class for transparent
vector loading. The C++ backend does not use these since it passes file
paths directly to the subprocess.
- Parameter expansion: converts YAML param specs into lists of param dicts.
- Recall computation: computes recall@k from neighbors and ground truth.

The dtype_from_filename function originates from generate_groundtruth/utils.py.
Note: generate_groundtruth/utils.py uses .hbin for float16 while the rest of
the codebase (get_dataset/fbin_to_f16bin.py, OpenSearch backend) uses .f16bin.
We standardize on .f16bin here to match the naming convention of the other
formats (.fbin, .i8bin, .u8bin).
"""

import itertools
import os
from typing import Any, Dict, List, Optional

import numpy as np


def dtype_from_filename(filename):
"""Map file extension to numpy dtype.

Parameters
----------
filename : str
Path or filename with a supported extension.

Returns
-------
numpy.dtype
The corresponding numpy dtype.

Raises
------
RuntimeError
If the file extension is not supported.
"""
ext = os.path.splitext(filename)[1]
if ext == ".fbin":
return np.float32
if ext == ".f16bin":
return np.float16
elif ext == ".ibin":
return np.int32
elif ext == ".u8bin":
return np.ubyte
elif ext == ".i8bin":
return np.byte
else:
raise RuntimeError(f"Unsupported file extension: {ext}")


def load_vectors(path: str, subset_size: Optional[int] = None) -> np.ndarray:
"""
Read a binary vector file into a numpy array.

Supports the standard big-ann-bench binary format used by cuvs-bench
datasets: a 4-byte uint32 ``n_rows``, a 4-byte uint32 ``n_cols``,
followed by ``n_rows * n_cols`` elements of the dtype inferred from
the file extension via ``dtype_from_filename``.

Parameters
----------
path : str
Path to the binary file. The dtype is inferred from the extension:
``.fbin`` (float32), ``.f16bin`` (float16), ``.u8bin`` (uint8),
``.i8bin`` (int8), ``.ibin`` (int32).
subset_size : Optional[int]
If provided, only the first ``subset_size`` rows are loaded.

Returns
-------
np.ndarray
Array of shape ``(n_rows, n_cols)`` with the inferred dtype.

Raises
------
FileNotFoundError
If the file does not exist.
ValueError
If the file extension is unsupported, ``subset_size`` is not positive,
or the file is truncated.
"""
dtype = dtype_from_filename(path)
if subset_size is not None and subset_size < 1:
raise ValueError(
f"subset_size must be a positive integer, got {subset_size}"
)
with open(path, "rb") as f:
header = f.read(8)
if len(header) < 8:
raise ValueError(
f"File too small to contain a valid header (expected 8 bytes, "
f"got {len(header)}): {path}"
)
n_rows = int(np.frombuffer(header[:4], dtype=np.uint32)[0])
n_cols = int(np.frombuffer(header[4:], dtype=np.uint32)[0])
if subset_size is not None:
n_rows = min(n_rows, subset_size)
Comment on lines +96 to +110

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate subset_size type explicitly before arithmetic/comparison.

Lines 96-110 only enforce a lower bound. Non-integer values (e.g., "1000" or 1.5) can fail later with non-actionable errors instead of a clear config validation error.

Proposed fix
 def load_vectors(path: str, subset_size: Optional[int] = None) -> np.ndarray:
@@
-    if subset_size is not None and subset_size < 1:
-        raise ValueError(
-            f"subset_size must be a positive integer, got {subset_size}"
-        )
+    if subset_size is not None:
+        if not isinstance(subset_size, (int, np.integer)):
+            raise ValueError(
+                f"subset_size must be an integer, got {type(subset_size).__name__}"
+            )
+        if subset_size < 1:
+            raise ValueError(
+                f"subset_size must be a positive integer, got {subset_size}"
+            )

As per coding guidelines, "Ensure missing validation does not cause crashes on invalid input through proper size/type checks."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuvs_bench/cuvs_bench/backends/utils.py` around lines 96 - 110, The
code currently only checks subset_size < 1, which allows non-integer values to
slip through and error later; add an explicit type validation before the numeric
comparison: ensure subset_size is an integer (e.g., isinstance(subset_size,
(int, np.integer))) and reject bools (e.g., isinstance(subset_size, bool) ->
disallow) and raise a clear TypeError/ValueError if not an integer, then
continue to check subset_size >= 1 and use that validated value when computing
n_rows; reference the subset_size variable in utils.py to locate the change.

expected_bytes = n_rows * n_cols * np.dtype(dtype).itemsize
raw = f.read(expected_bytes)
if len(raw) < expected_bytes:
raise ValueError(
f"File is truncated: expected {expected_bytes} bytes of data "
f"({n_rows} rows x {n_cols} cols x {np.dtype(dtype).itemsize} bytes), "
f"got {len(raw)}: {path}"
)
data = np.frombuffer(raw, dtype=dtype)
return data.reshape(n_rows, n_cols)


def expand_param_grid(

@cjnolet cjnolet May 6, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions seem like they should be core to the orchestrator / some other abstraction and shouldn't be merely utilities. Are these intended to be used directly by the backends? I think some sort of abstraction is needed to help force the backends to use them in some standard way. Otherwise, we're going to end up with users creating custom backends that don't follow our methodology and claiming they are getting some crazy different perf profile as a result (when in reality it's because they aren't using our tool in the intended way).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, I think there should be a class that centralizes all of the dataset stuff (liek what you have above) but also provides a standardized way to get these params (maybe this new abstraction uses the utilities here, but I think these utilities should be private (_utils.py) and we should favor that standard abstraction layer. Maybe in addition to Dateset class, we have a "BenchmarkRunner" class? Or a "ParamRunner" class that generates the params and evaluates the recall? What we should aim for is for the documentation for adding new backends to have a clean, cohesive, concise, and consistent few lines of code that each backend can execute in order to properly run benchmarks. This will keep everyone doing the same thing as much as possible so we're always comparing apples to apples.

@jnke2016 jnke2016 May 9, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions seem like they should be core to the orchestrator / some other abstraction and shouldn't be merely utilities. Are these intended to be used directly by the backends? I think some sort of abstraction is needed to help force the backends to use them in some standard way. Otherwise, we're going to end up with users creating custom backends that don't follow our methodology and claiming they are getting some crazy different perf profile as a result (when in reality it's because they aren't using our tool in the intended way).

These functions are not intended to be called directly by backends. They are called transparently by the framework: expand_param_grid is called by the config loader, compute_recall is called by the orchestrator, and load_vectors is called by the Dataset class. Backends only implement build() and search(). The utils.py file is just a centralized place where these functions live, but the framework is what calls them, not the backends.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, I think there should be a class that centralizes all of the dataset stuff (liek what you have above) but also provides a standardized way to get these params (maybe this new abstraction uses the utilities here ... Maybe in addition to Dateset class, we have a "BenchmarkRunner" class? Or a "ParamRunner" class that generates the params and evaluates the recall?

This is largely what the current architecture provides. The Dataset class centralizes all dataset handling with transparent vector loading. The base ConfigLoader standardizes parameter expansion via expand_param_grid internally. The BenchmarkOrchestrator covers the role of both a BenchmarkRunner and ParamRunner:

  • Generates params: calls config_loader.load() which internally uses expand_param_grid
  • Evaluates recall: calls compute_recall after backend.search() returns

Backend authors only implement build() and search() for a single configuration. Agreed on making utils.py private (_utils.py) to signal these are internal to the framework and not meant to be called by backends directly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes sound great, @jnke2016. Thank you for this! I think it's normal that we're seeing the design and scope evolve as we continue to add new backends.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working to rebase the opensearch PR on top of the changes here

The base ConfigLoader standardizes parameter expansion via expand_param_grid internally.

@jnke2016 I just want to make sure I'm not missing something. From what I can tell it looks like the CppGBenchConfigLoader is using expand_param_grid, not the ConfigLoader base class. Am I understanding correctly that the opensearch config loader should also use expand_param_grid similar to the C++ config loader?

param_spec: Dict[str, List[Any]],
) -> List[Dict[str, Any]]:
"""
Expand a parameter specification into all combinations via Cartesian product.

Takes a dict where each key maps to a list of values (as defined in
algorithm YAML configs) and produces a list of dicts, one per combination.

Parameters
----------
param_spec : Dict[str, List[Any]]
Parameter specification, e.g., {"m": [16, 32], "ef_construction": [100, 200]}

Returns
-------
List[Dict[str, Any]]
List of parameter dicts, e.g.,
[{"m": 16, "ef_construction": 100}, {"m": 16, "ef_construction": 200},
{"m": 32, "ef_construction": 100}, {"m": 32, "ef_construction": 200}]
Returns [{}] if param_spec is empty.
"""
if not param_spec:
return [{}]
keys = list(param_spec.keys())
return [
dict(zip(keys, vals))
for vals in itertools.product(*param_spec.values())
]


def compute_recall(
neighbors: np.ndarray, groundtruth: np.ndarray, k: int
) -> float:
"""
Compute recall@k by comparing returned neighbors against ground truth.

For each query, counts how many of the k returned neighbors appear
in the ground truth set, then averages across all queries.

Parameters
----------
neighbors : np.ndarray
Returned neighbor IDs, shape (n_queries, k)
groundtruth : np.ndarray
Ground truth neighbor IDs, shape (n_queries, gt_k)
k : int
Number of neighbors to evaluate

Returns
-------
float
Recall@k in range [0.0, 1.0]
"""
if not isinstance(neighbors, np.ndarray) or not isinstance(
groundtruth, np.ndarray
):
raise ValueError("neighbors and groundtruth must be numpy ndarrays")
if neighbors.ndim != 2 or groundtruth.ndim != 2:
raise ValueError(
f"neighbors and groundtruth must be 2-D arrays, got "
f"neighbors.shape={neighbors.shape}, groundtruth.shape={groundtruth.shape}"
)
n_queries = neighbors.shape[0]
if groundtruth.shape[0] != n_queries:
raise ValueError(
f"Row count mismatch: neighbors has {n_queries} rows, "
f"groundtruth has {groundtruth.shape[0]} rows"
)
gt_k = min(k, groundtruth.shape[1])
if gt_k == 0 or n_queries == 0:
return 0.0
n_correct = sum(
len(
set(neighbors[i, :k].tolist())
& set(groundtruth[i, :gt_k].tolist())
)
for i in range(n_queries)
)
return n_correct / (n_queries * gt_k)
Comment thread
coderabbitai[bot] marked this conversation as resolved.
Comment on lines +192 to +202

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject negative k in compute_recall to avoid invalid metrics.

Lines 185-192 compute gt_k = min(k, groundtruth.shape[1]) without validating k. A negative k can yield a negative denominator and invalid recall output.

Proposed fix
 def compute_recall(
     neighbors: np.ndarray, groundtruth: np.ndarray, k: int
 ) -> float:
@@
+    if not isinstance(k, (int, np.integer)):
+        raise ValueError(f"k must be an integer, got {type(k).__name__}")
+    if k < 0:
+        raise ValueError(f"k must be >= 0, got {k}")
+
     n_queries = neighbors.shape[0]
@@
     gt_k = min(k, groundtruth.shape[1])

As per coding guidelines, "Provide clear and actionable error messages that include expected vs actual values where helpful."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuvs_bench/cuvs_bench/backends/utils.py` around lines 185 - 192, The
compute_recall function currently allows negative k (it computes gt_k = min(k,
groundtruth.shape[1])) which can produce invalid recall values; add an input
validation at the start of compute_recall to reject non-positive or negative k
(e.g., if k < 0) by raising a ValueError with a clear message that includes the
expected range and the actual k, e.g. "k must be non-negative integer, got {k}";
ensure you reference the compute_recall function and variables k and groundtruth
in the check so callers immediately see the failure instead of getting a
negative gt_k or wrong denominator.

Comment on lines +186 to +202

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when neighbors has fewer columns than the evaluated k.

gt_k is derived from k and groundtruth.shape[1], but the numerator uses whatever width neighbors[:, :k] happens to have. If a backend returns too few neighbors, this silently depresses recall instead of surfacing the contract violation.

Suggested fix
     n_queries = neighbors.shape[0]
     if groundtruth.shape[0] != n_queries:
         raise ValueError(
             f"Row count mismatch: neighbors has {n_queries} rows, "
             f"groundtruth has {groundtruth.shape[0]} rows"
         )
-    gt_k = min(k, groundtruth.shape[1])
+    gt_k = min(k, groundtruth.shape[1])
+    if neighbors.shape[1] < gt_k:
+        raise ValueError(
+            f"neighbors must have at least {gt_k} columns for recall@{k}, "
+            f"got {neighbors.shape[1]}"
+        )
     if gt_k == 0 or n_queries == 0:
         return 0.0

As per coding guidelines, Ensure missing validation does not cause crashes on invalid input through proper size/type checks.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuvs_bench/cuvs_bench/backends/_utils.py` around lines 186 - 202, The
code silently slices neighbors[:, :k] even when neighbors has fewer than k
columns; add an explicit validation in the recall computation to check
neighbors.shape[1] >= k and raise a clear ValueError (or similar) if not,
referencing the variables neighbors, k, and groundtruth (and existing gt_k) so
the function fails fast on contract violations rather than silently depressing
recall; ensure the error message mentions the expected k and the actual
neighbors.shape[1] to aid debugging.

Loading
Loading