-
Notifications
You must be signed in to change notification settings - Fork 199
Centralize shared utilities across benchmark backends #2040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 22 commits
c067d52
6e91f0e
99fb908
1e1ef66
21d8621
60f3aeb
39867d9
1e2de0f
9480257
3399692
896323d
93b78db
c99cf88
4f01af3
aa913dc
dddf1d2
948fb8a
4781e32
592afc3
3e561c3
b31ae14
46b3cae
4cdd998
499c9fe
76d8c80
3e15ee4
9030d9e
99cceb1
798f44a
0dc78a2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,18 +18,28 @@ | |
| from ..orchestrator.config_loaders import IndexConfig | ||
|
|
||
|
|
||
| @dataclass | ||
| class Dataset: | ||
| """ | ||
| Dataset representation for benchmarking. | ||
|
|
||
| Attributes | ||
| Supports two usage patterns: | ||
|
|
||
| 1. File-path based (C++ backend): The orchestrator passes file paths | ||
| and the backend reads them directly via ``base_file``, ``query_file``, | ||
| etc. Vectors are never loaded into Python. | ||
|
|
||
| 2. Array based (Python-native backends like OpenSearch, Elasticsearch): | ||
| Backends access ``base_vectors``, ``query_vectors``, etc. If vectors | ||
| were not provided directly but a file path exists, they are loaded | ||
| lazily on first access. This keeps file I/O invisible to backends. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| name : str | ||
| Dataset name (e.g., "glove-100-inner") | ||
| base_vectors : np.ndarray | ||
| base_vectors : Optional[np.ndarray] | ||
| Base vectors for index building, shape (n_vectors, dims) | ||
| query_vectors : np.ndarray | ||
| query_vectors : Optional[np.ndarray] | ||
| Query vectors for search, shape (n_queries, dims) | ||
| groundtruth_neighbors : Optional[np.ndarray] | ||
| Ground truth neighbor IDs, shape (n_queries, k_gt) | ||
|
|
@@ -38,25 +48,108 @@ class Dataset: | |
| distance_metric : str | ||
| Distance metric ("euclidean", "inner_product", "cosine") | ||
| base_file : Optional[str] | ||
| Path to base vectors file (for C++ backend compatibility) | ||
| Path to base vectors file | ||
| query_file : Optional[str] | ||
| Path to query vectors file (for C++ backend compatibility) | ||
| Path to query vectors file | ||
| groundtruth_neighbors_file : Optional[str] | ||
| Path to ground truth neighbors file (for C++ backend compatibility) | ||
| metadata : Dict[str, Any] | ||
| Additional dataset metadata like {"source": "ann-benchmarks"} | ||
| Path to ground truth neighbors file | ||
| metadata : Optional[Dict[str, Any]] | ||
| Additional dataset metadata like {"subset_size": 10000} | ||
| """ | ||
|
|
||
| name: str | ||
| base_vectors: np.ndarray | ||
| query_vectors: np.ndarray | ||
| groundtruth_neighbors: Optional[np.ndarray] = None | ||
| groundtruth_distances: Optional[np.ndarray] = None | ||
| distance_metric: str = "euclidean" | ||
| base_file: Optional[str] = None | ||
| query_file: Optional[str] = None | ||
| groundtruth_neighbors_file: Optional[str] = None | ||
| metadata: Dict[str, Any] = field(default_factory=dict) | ||
| def __init__( | ||
| self, | ||
| name: str, | ||
| base_vectors: Optional[np.ndarray] = None, | ||
| query_vectors: Optional[np.ndarray] = None, | ||
| groundtruth_neighbors: Optional[np.ndarray] = None, | ||
| groundtruth_distances: Optional[np.ndarray] = None, | ||
| distance_metric: str = "euclidean", | ||
| base_file: Optional[str] = None, | ||
| query_file: Optional[str] = None, | ||
| groundtruth_neighbors_file: Optional[str] = None, | ||
| metadata: Optional[Dict[str, Any]] = None, | ||
| ): | ||
| self.name = name | ||
| # Vectors are stored privately to support lazy loading. | ||
| # If the caller provides vectors directly, they are used as-is. | ||
| # If empty and a file path is provided, the corresponding property | ||
| # loads vectors from the file on first access. This keeps the | ||
| # loading logic invisible to backends: they just access | ||
| # dataset.base_vectors and get a numpy array regardless of whether | ||
| # it was passed in or loaded from disk. | ||
| self._base_vectors = ( | ||
| base_vectors if base_vectors is not None else np.empty((0, 0)) | ||
| ) | ||
| self._query_vectors = ( | ||
| query_vectors if query_vectors is not None else np.empty((0, 0)) | ||
| ) | ||
| self._groundtruth_neighbors = groundtruth_neighbors | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. However, I can refactor both to match |
||
| self.groundtruth_distances = groundtruth_distances | ||
| self.distance_metric = distance_metric | ||
| self.base_file = base_file | ||
| self.query_file = query_file | ||
| self.groundtruth_neighbors_file = groundtruth_neighbors_file | ||
| self.metadata = metadata or {} | ||
|
|
||
| @property | ||
| def base_vectors(self) -> np.ndarray: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe instead of base vectors, can we call this training_vectors for clarity? It's a good contrast to
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it. I addressed this and it will be included in my next commit |
||
| """Base vectors for index building. | ||
|
|
||
| Loaded from base_file on first access if not provided directly. | ||
| """ | ||
| if self._base_vectors.size == 0 and self.base_file: | ||
| from .utils import load_vectors | ||
|
|
||
| self._base_vectors = load_vectors( | ||
| self.base_file, self.metadata.get("subset_size") | ||
| ) | ||
| return self._base_vectors | ||
|
|
||
| @base_vectors.setter | ||
| def base_vectors(self, value: Optional[np.ndarray]) -> None: | ||
| """Set base vectors directly.""" | ||
| self._base_vectors = value if value is not None else np.empty((0, 0)) | ||
|
|
||
| @property | ||
| def query_vectors(self) -> np.ndarray: | ||
| """Query vectors for search. | ||
|
|
||
| Loaded from query_file on first access if not provided directly. | ||
| """ | ||
| if self._query_vectors.size == 0 and self.query_file: | ||
| from .utils import load_vectors | ||
|
|
||
| self._query_vectors = load_vectors(self.query_file) | ||
| return self._query_vectors | ||
|
|
||
| @query_vectors.setter | ||
| def query_vectors(self, value: Optional[np.ndarray]) -> None: | ||
| """Set query vectors directly.""" | ||
| self._query_vectors = value if value is not None else np.empty((0, 0)) | ||
|
|
||
| @property | ||
| def groundtruth_neighbors(self) -> Optional[np.ndarray]: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should definitely store the neighbors AND the distances. At the very least for debugging purposes, we should have access to the distances so that users can print them out if needed.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (I agree, we shouldn't necessarily make them required, but if a dataset has them, we should use them, I think). |
||
| """Ground truth neighbor IDs. | ||
|
|
||
| Loaded from groundtruth_neighbors_file on first access if not | ||
| provided directly. | ||
| """ | ||
| if ( | ||
| self._groundtruth_neighbors is None | ||
| and self.groundtruth_neighbors_file | ||
| ): | ||
| from .utils import load_vectors | ||
|
|
||
| self._groundtruth_neighbors = load_vectors( | ||
| self.groundtruth_neighbors_file | ||
| ) | ||
| return self._groundtruth_neighbors | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
|
|
||
| @groundtruth_neighbors.setter | ||
| def groundtruth_neighbors(self, value: Optional[np.ndarray]) -> None: | ||
| """Set ground truth neighbors directly.""" | ||
| self._groundtruth_neighbors = value | ||
|
|
||
| @property | ||
| def dims(self) -> int: | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's wonderful that we have these ways of running the orchestrator through Python. Is it possible for us to use both approaches? Are there downsides to having one over the other? Can we keep both in the docs?
For example, especially in the "tuning" mode, I think users are going to want to run that inside existing Python code more frequently than from the command line.