-
Notifications
You must be signed in to change notification settings - Fork 197
Centralize shared utilities across benchmark backends #2040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
c067d52
6e91f0e
99fb908
1e1ef66
21d8621
60f3aeb
39867d9
1e2de0f
9480257
3399692
896323d
93b78db
c99cf88
4f01af3
aa913dc
dddf1d2
948fb8a
4781e32
592afc3
3e561c3
b31ae14
46b3cae
4cdd998
499c9fe
76d8c80
3e15ee4
9030d9e
99cceb1
798f44a
0dc78a2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,202 @@ | ||
| # | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
|
|
||
| """ | ||
| Shared utilities for cuvs-bench backends. | ||
|
|
||
| Provides common functions for Python-native backends (e.g., OpenSearch, | ||
| Elasticsearch): | ||
| - Vector file I/O: used internally by the Dataset class for transparent | ||
| vector loading. The C++ backend does not use these since it passes file | ||
| paths directly to the subprocess. | ||
| - Parameter expansion: converts YAML param specs into lists of param dicts. | ||
| - Recall computation: computes recall@k from neighbors and ground truth. | ||
|
|
||
| The dtype_from_filename function originates from generate_groundtruth/utils.py. | ||
| Note: generate_groundtruth/utils.py uses .hbin for float16 while the rest of | ||
| the codebase (get_dataset/fbin_to_f16bin.py, OpenSearch backend) uses .f16bin. | ||
| We standardize on .f16bin here to match the naming convention of the other | ||
| formats (.fbin, .i8bin, .u8bin). | ||
| """ | ||
|
|
||
| import itertools | ||
| import os | ||
| from typing import Any, Dict, List, Optional | ||
|
|
||
| import numpy as np | ||
|
|
||
|
|
||
| def dtype_from_filename(filename): | ||
| """Map file extension to numpy dtype. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| filename : str | ||
| Path or filename with a supported extension. | ||
|
|
||
| Returns | ||
| ------- | ||
| numpy.dtype | ||
| The corresponding numpy dtype. | ||
|
|
||
| Raises | ||
| ------ | ||
| RuntimeError | ||
| If the file extension is not supported. | ||
| """ | ||
| ext = os.path.splitext(filename)[1] | ||
| if ext == ".fbin": | ||
| return np.float32 | ||
| if ext == ".f16bin": | ||
| return np.float16 | ||
| elif ext == ".ibin": | ||
| return np.int32 | ||
| elif ext == ".u8bin": | ||
| return np.ubyte | ||
| elif ext == ".i8bin": | ||
| return np.byte | ||
| else: | ||
| raise RuntimeError(f"Unsupported file extension: {ext}") | ||
|
|
||
|
|
||
| def load_vectors(path: str, subset_size: Optional[int] = None) -> np.ndarray: | ||
| """ | ||
| Read a binary vector file into a numpy array. | ||
|
|
||
| Supports the standard big-ann-bench binary format used by cuvs-bench | ||
| datasets: a 4-byte uint32 ``n_rows``, a 4-byte uint32 ``n_cols``, | ||
| followed by ``n_rows * n_cols`` elements of the dtype inferred from | ||
| the file extension via ``dtype_from_filename``. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| path : str | ||
| Path to the binary file. The dtype is inferred from the extension: | ||
| ``.fbin`` (float32), ``.f16bin`` (float16), ``.u8bin`` (uint8), | ||
| ``.i8bin`` (int8), ``.ibin`` (int32). | ||
| subset_size : Optional[int] | ||
| If provided, only the first ``subset_size`` rows are loaded. | ||
|
|
||
| Returns | ||
| ------- | ||
| np.ndarray | ||
| Array of shape ``(n_rows, n_cols)`` with the inferred dtype. | ||
|
|
||
| Raises | ||
| ------ | ||
| FileNotFoundError | ||
| If the file does not exist. | ||
| ValueError | ||
| If the file extension is unsupported, ``subset_size`` is not positive, | ||
| or the file is truncated. | ||
| """ | ||
| dtype = dtype_from_filename(path) | ||
| if subset_size is not None and subset_size < 1: | ||
| raise ValueError( | ||
| f"subset_size must be a positive integer, got {subset_size}" | ||
| ) | ||
| with open(path, "rb") as f: | ||
| header = f.read(8) | ||
| if len(header) < 8: | ||
| raise ValueError( | ||
| f"File too small to contain a valid header (expected 8 bytes, " | ||
| f"got {len(header)}): {path}" | ||
| ) | ||
| n_rows = int(np.frombuffer(header[:4], dtype=np.uint32)[0]) | ||
| n_cols = int(np.frombuffer(header[4:], dtype=np.uint32)[0]) | ||
| if subset_size is not None: | ||
| n_rows = min(n_rows, subset_size) | ||
|
Comment on lines
+96
to
+110
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Validate Lines 96-110 only enforce a lower bound. Non-integer values (e.g., Proposed fix def load_vectors(path: str, subset_size: Optional[int] = None) -> np.ndarray:
@@
- if subset_size is not None and subset_size < 1:
- raise ValueError(
- f"subset_size must be a positive integer, got {subset_size}"
- )
+ if subset_size is not None:
+ if not isinstance(subset_size, (int, np.integer)):
+ raise ValueError(
+ f"subset_size must be an integer, got {type(subset_size).__name__}"
+ )
+ if subset_size < 1:
+ raise ValueError(
+ f"subset_size must be a positive integer, got {subset_size}"
+ )As per coding guidelines, "Ensure missing validation does not cause crashes on invalid input through proper size/type checks." 🤖 Prompt for AI Agents |
||
| expected_bytes = n_rows * n_cols * np.dtype(dtype).itemsize | ||
| raw = f.read(expected_bytes) | ||
| if len(raw) < expected_bytes: | ||
| raise ValueError( | ||
| f"File is truncated: expected {expected_bytes} bytes of data " | ||
| f"({n_rows} rows x {n_cols} cols x {np.dtype(dtype).itemsize} bytes), " | ||
| f"got {len(raw)}: {path}" | ||
| ) | ||
| data = np.frombuffer(raw, dtype=dtype) | ||
| return data.reshape(n_rows, n_cols) | ||
|
|
||
|
|
||
| def expand_param_grid( | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These functions seem like they should be core to the orchestrator / some other abstraction and shouldn't be merely utilities. Are these intended to be used directly by the backends? I think some sort of abstraction is needed to help force the backends to use them in some standard way. Otherwise, we're going to end up with users creating custom backends that don't follow our methodology and claiming they are getting some crazy different perf profile as a result (when in reality it's because they aren't using our tool in the intended way).
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For example, I think there should be a class that centralizes all of the dataset stuff (liek what you have above) but also provides a standardized way to get these params (maybe this new abstraction uses the utilities here, but I think these utilities should be private (_utils.py) and we should favor that standard abstraction layer. Maybe in addition to Dateset class, we have a "BenchmarkRunner" class? Or a "ParamRunner" class that generates the params and evaluates the recall? What we should aim for is for the documentation for adding new backends to have a clean, cohesive, concise, and consistent few lines of code that each backend can execute in order to properly run benchmarks. This will keep everyone doing the same thing as much as possible so we're always comparing apples to apples.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
These functions are not intended to be called directly by backends. They are called transparently by the framework:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This is largely what the current architecture provides. The Dataset class centralizes all dataset handling with transparent vector loading. The base
Backend authors only implement
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These changes sound great, @jnke2016. Thank you for this! I think it's normal that we're seeing the design and scope evolve as we continue to add new backends.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm working to rebase the opensearch PR on top of the changes here
@jnke2016 I just want to make sure I'm not missing something. From what I can tell it looks like the |
||
| param_spec: Dict[str, List[Any]], | ||
| ) -> List[Dict[str, Any]]: | ||
| """ | ||
| Expand a parameter specification into all combinations via Cartesian product. | ||
|
|
||
| Takes a dict where each key maps to a list of values (as defined in | ||
| algorithm YAML configs) and produces a list of dicts, one per combination. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| param_spec : Dict[str, List[Any]] | ||
| Parameter specification, e.g., {"m": [16, 32], "ef_construction": [100, 200]} | ||
|
|
||
| Returns | ||
| ------- | ||
| List[Dict[str, Any]] | ||
| List of parameter dicts, e.g., | ||
| [{"m": 16, "ef_construction": 100}, {"m": 16, "ef_construction": 200}, | ||
| {"m": 32, "ef_construction": 100}, {"m": 32, "ef_construction": 200}] | ||
| Returns [{}] if param_spec is empty. | ||
| """ | ||
| if not param_spec: | ||
| return [{}] | ||
| keys = list(param_spec.keys()) | ||
| return [ | ||
| dict(zip(keys, vals)) | ||
| for vals in itertools.product(*param_spec.values()) | ||
| ] | ||
|
|
||
|
|
||
| def compute_recall( | ||
| neighbors: np.ndarray, groundtruth: np.ndarray, k: int | ||
| ) -> float: | ||
| """ | ||
| Compute recall@k by comparing returned neighbors against ground truth. | ||
|
|
||
| For each query, counts how many of the k returned neighbors appear | ||
| in the ground truth set, then averages across all queries. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| neighbors : np.ndarray | ||
| Returned neighbor IDs, shape (n_queries, k) | ||
| groundtruth : np.ndarray | ||
| Ground truth neighbor IDs, shape (n_queries, gt_k) | ||
| k : int | ||
| Number of neighbors to evaluate | ||
|
|
||
| Returns | ||
| ------- | ||
| float | ||
| Recall@k in range [0.0, 1.0] | ||
| """ | ||
| if not isinstance(neighbors, np.ndarray) or not isinstance( | ||
| groundtruth, np.ndarray | ||
| ): | ||
| raise ValueError("neighbors and groundtruth must be numpy ndarrays") | ||
| if neighbors.ndim != 2 or groundtruth.ndim != 2: | ||
| raise ValueError( | ||
| f"neighbors and groundtruth must be 2-D arrays, got " | ||
| f"neighbors.shape={neighbors.shape}, groundtruth.shape={groundtruth.shape}" | ||
| ) | ||
| n_queries = neighbors.shape[0] | ||
| if groundtruth.shape[0] != n_queries: | ||
| raise ValueError( | ||
| f"Row count mismatch: neighbors has {n_queries} rows, " | ||
| f"groundtruth has {groundtruth.shape[0]} rows" | ||
| ) | ||
| gt_k = min(k, groundtruth.shape[1]) | ||
| if gt_k == 0 or n_queries == 0: | ||
| return 0.0 | ||
| n_correct = sum( | ||
| len( | ||
| set(neighbors[i, :k].tolist()) | ||
| & set(groundtruth[i, :gt_k].tolist()) | ||
| ) | ||
| for i in range(n_queries) | ||
| ) | ||
| return n_correct / (n_queries * gt_k) | ||
|
coderabbitai[bot] marked this conversation as resolved.
Comment on lines
+192
to
+202
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reject negative Lines 185-192 compute Proposed fix def compute_recall(
neighbors: np.ndarray, groundtruth: np.ndarray, k: int
) -> float:
@@
+ if not isinstance(k, (int, np.integer)):
+ raise ValueError(f"k must be an integer, got {type(k).__name__}")
+ if k < 0:
+ raise ValueError(f"k must be >= 0, got {k}")
+
n_queries = neighbors.shape[0]
@@
gt_k = min(k, groundtruth.shape[1])As per coding guidelines, "Provide clear and actionable error messages that include expected vs actual values where helpful." 🤖 Prompt for AI Agents
Comment on lines
+186
to
+202
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fail fast when
Suggested fix n_queries = neighbors.shape[0]
if groundtruth.shape[0] != n_queries:
raise ValueError(
f"Row count mismatch: neighbors has {n_queries} rows, "
f"groundtruth has {groundtruth.shape[0]} rows"
)
- gt_k = min(k, groundtruth.shape[1])
+ gt_k = min(k, groundtruth.shape[1])
+ if neighbors.shape[1] < gt_k:
+ raise ValueError(
+ f"neighbors must have at least {gt_k} columns for recall@{k}, "
+ f"got {neighbors.shape[1]}"
+ )
if gt_k == 0 or n_queries == 0:
return 0.0As per coding guidelines, Ensure missing validation does not cause crashes on invalid input through proper size/type checks. 🤖 Prompt for AI Agents |
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's wonderful that we have these ways of running the orchestrator through Python. Is it possible for us to use both approaches? Are there downsides to having one over the other? Can we keep both in the docs?
For example, especially in the "tuning" mode, I think users are going to want to run that inside existing Python code more frequently than from the command line.