-
Notifications
You must be signed in to change notification settings - Fork 1
Add type-safe index parameter system with dynamic IVF calculation based on join window vector count #55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main-dev
Are you sure you want to change the base?
Conversation
…ed on join window vector count (#4) * Initial plan * Add IVF index parameter passing and window-based calculation - Added getWindowSize() method to ThreadSafeSlidingWindow and JoinFunction - Added new create_index overload accepting IVF parameters (nlist, rebuild_threshold, nprobes) - Added new createIndexPair overload in JoinOperator accepting IVF parameters - Updated JoinOperator constructor to calculate nlist as 4*sqrt(window_size) - Set default rebuild_threshold to 1.5 and nprobes to 10 for IVF indexes - All existing tests pass successfully Co-authored-by: ZeroJustMe <[email protected]> * Refactor index parameter passing with type-safe variant system - Replace index-specific parameters with IndexParameters variant - Add IVFParameters, HNSWParameters, and NoParameters structs - Update create_index to accept IndexParameters instead of individual params - Type-safe parameter passing prevents incorrect usage - Extensible design supports all current and future index types - Backward compatible with default parameters - All tests pass Co-authored-by: ZeroJustMe <[email protected]> * Update IVF nlist calculation to use window_size/step_size - Changed nlist calculation from sqrt(window_size) to sqrt(window_size/step_size) - Now calculates based on actual vector count in window - Added getStepSize() method to ThreadSafeSlidingWindow and JoinFunction - More accurate clustering based on data volume - All tests pass Co-authored-by: ZeroJustMe <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: ZeroJustMe <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a type-safe parameter system for vector indexes, enabling dynamic configuration of index-specific parameters (particularly for IVF and HNSW indexes) based on runtime conditions. The main enhancement is the dynamic calculation of IVF parameters based on the join window's vector count, along with improved thread-safety for accessing window properties.
Key Changes:
- Introduced
IndexParametersvariant type with index-specific parameter structures (IVFParameters,HNSWParameters,NoParameters) for type-safe index configuration - Added dynamic IVF parameter calculation in
JoinOperatorthat computesnlistas4 * sqrt(vector_count)wherevector_count = window_size / step_size - Added thread-safe getter methods for window size and step size in join functions to support parameter calculations
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
include/index/index.h |
Defines parameter structures and IndexParameters variant type for type-safe index configuration |
include/concurrency/concurrency_manager.h |
Adds overloaded create_index method accepting IndexParameters |
src/concurrency/concurrency_manager.cpp |
Implements parameterized index creation with type checking via std::get_if |
include/function/join_function.h |
Adds thread-safe getters for window size and step size in ThreadSafeSlidingWindow and JoinFunction |
src/function/join_function.cpp |
Implements window parameter getter methods |
include/operator/join_operator.h |
Adds overloaded createIndexPair accepting IndexParameters |
src/operator/join_operator.cpp |
Implements dynamic IVF parameter calculation and uses parameterized index creation |
src/utils/monitoring.cpp |
Replaces std::cerr with structured logging macros |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| int64_t window_size = join_func_->getWindowSize(); | ||
| int64_t step_size = join_func_->getStepSize(); | ||
| // Calculate actual vector count in window | ||
| int64_t vector_count = (step_size > 0) ? (window_size / step_size) : window_size; |
Copilot
AI
Oct 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fallback calculation vector_count = window_size when step_size <= 0 is incorrect. When step_size is 0 or negative, the window configuration is invalid, and this should either return an error or use a more appropriate default. Using window_size directly as vector count doesn't align with the formula's intent of calculating the number of vectors based on the step interval.
src/operator/join_operator.cpp
Outdated
| index_kind_ = InternalIndexKind::IVF; | ||
| if (createIndexPair(IndexType::IVF, "join_ivf")) { | ||
| // Calculate IVF parameters based on window size | ||
| // nlist = 4 * sqrt(window_size/step_size), rebuild_threshold = 1.5, nprobes = 10 |
Copilot
AI
Oct 14, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states the formula uses window_size/step_size, but the actual implementation on line 82 uses vector_count which is calculated differently. The comment should accurately reflect that vector_count represents the actual number of vectors in the window, not just the raw division result.
| // nlist = 4 * sqrt(window_size/step_size), rebuild_threshold = 1.5, nprobes = 10 | |
| // nlist = 4 * sqrt(vector_count), where vector_count = (window_size / step_size) if step_size > 0, else window_size |
… scenarios (#5) * Checkpoint from VS Code for coding agent session * Initial analysis of parallel join algorithm issues Co-authored-by: ZeroJustMe <[email protected]> * Fix parallel join algorithm issues for IVF method - Fix IVF index race condition by copying candidate lists under lock - Fix expected_matches UID calculation by using consistent modulo base - Fix eager join wait logic to not require window draining - All parallelism levels (1,2,4,8,16,32) now pass with recall >= 0.93 Co-authored-by: ZeroJustMe <[email protected]> * Fix IVF nprobes calculation for better recall at higher parallelism - Increase nprobes to search 30% of clusters (was 8%) for better recall - Addresses recall degradation issue with large datasets and high parallelism - All parallelism levels (1-32) now achieve recall >= 0.926 with 1000 records Co-authored-by: ZeroJustMe <[email protected]> * Fix single-instance concurrency in join operator - Acquire both window locks before getting candidates and executing joins - Prevents race condition where one slot expires records while other queries - Add helper methods that work with locks already held - Significant recall improvement at high parallelism (p8: 0.983, p16: 0.989, p32: 0.985) Co-authored-by: ZeroJustMe <[email protected]> * 优化数据源模式性能测试中的预期匹配计算和等待时间 --------- Co-authored-by: ZeroJustMe <[email protected]> Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: ZeroJustMe <[email protected]>
|
This pull request introduces several enhancements and fixes focused on index parameterization, performance metrics, and thread safety in windowed join operations. The most significant changes are the addition of index-specific parameter structures, dynamic parameter calculation for IVF indexes, expanded performance metrics, and improvements to thread safety and concurrency handling. Index parameterization and creation:
Performance metrics enhancements:
Thread safety and concurrency improvements:
Configuration and testing:
API and internal refactoring:
These changes collectively improve the flexibility, observability, and robustness of the join operator and index management subsystems. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
|
||
| // Calculate nprobes to search at least 30% of clusters for better recall in high-concurrency scenarios | ||
| // Cap at 60% to maintain reasonable performance | ||
| int nprobes = std::max(15, std::min(nlist * 60 / 100, nlist * 30 / 100)); |
Copilot
AI
Oct 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for calculating nprobes is incorrect. The expression std::min(nlist * 60 / 100, nlist * 30 / 100) will always evaluate to nlist * 30 / 100 since 30% is always less than 60%. This should likely be std::max(nlist * 30 / 100, std::min(nlist * 60 / 100, 15)) to cap between 30-60% with a minimum of 15, or std::max(15, std::min(nlist * 60 / 100, nlist)) to ensure we don't exceed nlist.
| int nprobes = std::max(15, std::min(nlist * 60 / 100, nlist * 30 / 100)); | |
| int nprobes = std::max(15, std::min(nlist * 60 / 100, std::max(nlist * 30 / 100, 15))); |
|
加个测试,保证线程数量不影响recall。 |
好,我刚刚测IVF每个并行度下都能跑到99的recall,之后调一下测试配置放到CI里 |
* Checkpoint from VS Code for coding agent session * Initial analysis of parallel join algorithm issues Co-authored-by: ZeroJustMe <[email protected]> * Fix parallel join algorithm issues for IVF method - Fix IVF index race condition by copying candidate lists under lock - Fix expected_matches UID calculation by using consistent modulo base - Fix eager join wait logic to not require window draining - All parallelism levels (1,2,4,8,16,32) now pass with recall >= 0.93 Co-authored-by: ZeroJustMe <[email protected]> * Fix IVF nprobes calculation for better recall at higher parallelism - Increase nprobes to search 30% of clusters (was 8%) for better recall - Addresses recall degradation issue with large datasets and high parallelism - All parallelism levels (1-32) now achieve recall >= 0.926 with 1000 records Co-authored-by: ZeroJustMe <[email protected]> * Fix single-instance concurrency in join operator - Acquire both window locks before getting candidates and executing joins - Prevents race condition where one slot expires records while other queries - Add helper methods that work with locks already held - Significant recall improvement at high parallelism (p8: 0.983, p16: 0.989, p32: 0.985) Co-authored-by: ZeroJustMe <[email protected]> * 优化数据源模式性能测试中的预期匹配计算和等待时间 * [WIP] Add new performance test file for ci (#6) * Checkpoint from VS Code for coding agent session * Add CI performance testing workflow and configuration - Created perf-ci-tests.yml workflow for parallel join performance testing - Created perf_join_datasource_modes_ci.toml with smaller dataset (1000 records) - Tests parallelism levels 1, 2, 4, 8, 16, 32 for both bruteforce_eager and ivf_eager - Verified locally: tests pass with recall > 0.98 (above 0.85 threshold) Co-authored-by: ZeroJustMe <[email protected]> --------- Co-authored-by: ZeroJustMe <[email protected]> Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: ZeroJustMe <[email protected]> --------- Co-authored-by: copilot-swe-agent[bot] <[email protected]> Co-authored-by: ZeroJustMe <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This pull request introduces support for index-specific parameterization, especially for IVF and HNSW index types, and enables dynamic calculation and passing of these parameters during index creation. It also adds thread-safe accessors for window parameters in join functions, and improves logging in the performance monitoring utilities.
Index Parameterization and Dynamic Index Creation:
IVFParameters,HNSWParameters,NoParameters) and a variant typeIndexParametersto encapsulate index-specific configuration. This allows flexible and type-safe passing of parameters to index constructors.create_indexmethods inConcurrencyManagerandJoinOperatorto acceptIndexParameters, enabling the creation of indexes with custom parameters. [1] [2] [3] [4]nlist) based on the window size and step size, and to use the new parameterized index creation methods.Join Function Enhancements:
JoinFunctionandThreadSafeSlidingWindow, allowing other components to query these properties safely. [1] [2] [3]Utility and Logging Improvements:
std::cerrcalls with the project's structured logging macros in the performance monitoring code, ensuring consistent and centralized logging. [1] [2]Additional Minor Improvements:
<variant>,<optional>,<cmath>) to support new features and calculations. [1] [2]