Skip to content

Conversation

@ZeroJustMe
Copy link
Contributor

This pull request introduces support for index-specific parameterization, especially for IVF and HNSW index types, and enables dynamic calculation and passing of these parameters during index creation. It also adds thread-safe accessors for window parameters in join functions, and improves logging in the performance monitoring utilities.

Index Parameterization and Dynamic Index Creation:

  • Introduced parameter structures (IVFParameters, HNSWParameters, NoParameters) and a variant type IndexParameters to encapsulate index-specific configuration. This allows flexible and type-safe passing of parameters to index constructors.
  • Added overloaded create_index methods in ConcurrencyManager and JoinOperator to accept IndexParameters, enabling the creation of indexes with custom parameters. [1] [2] [3] [4]
  • Updated the join operator to dynamically calculate IVF index parameters (such as nlist) based on the window size and step size, and to use the new parameterized index creation methods.

Join Function Enhancements:

  • Added thread-safe getter methods for window size and step size in JoinFunction and ThreadSafeSlidingWindow, allowing other components to query these properties safely. [1] [2] [3]

Utility and Logging Improvements:

  • Replaced direct std::cerr calls with the project's structured logging macros in the performance monitoring code, ensuring consistent and centralized logging. [1] [2]

Additional Minor Improvements:

  • Included necessary headers (<variant>, <optional>, <cmath>) to support new features and calculations. [1] [2]

ZeroJustMe and others added 2 commits October 14, 2025 11:37
…ed on join window vector count (#4)

* Initial plan

* Add IVF index parameter passing and window-based calculation

- Added getWindowSize() method to ThreadSafeSlidingWindow and JoinFunction
- Added new create_index overload accepting IVF parameters (nlist, rebuild_threshold, nprobes)
- Added new createIndexPair overload in JoinOperator accepting IVF parameters
- Updated JoinOperator constructor to calculate nlist as 4*sqrt(window_size)
- Set default rebuild_threshold to 1.5 and nprobes to 10 for IVF indexes
- All existing tests pass successfully

Co-authored-by: ZeroJustMe <[email protected]>

* Refactor index parameter passing with type-safe variant system

- Replace index-specific parameters with IndexParameters variant
- Add IVFParameters, HNSWParameters, and NoParameters structs
- Update create_index to accept IndexParameters instead of individual params
- Type-safe parameter passing prevents incorrect usage
- Extensible design supports all current and future index types
- Backward compatible with default parameters
- All tests pass

Co-authored-by: ZeroJustMe <[email protected]>

* Update IVF nlist calculation to use window_size/step_size

- Changed nlist calculation from sqrt(window_size) to sqrt(window_size/step_size)
- Now calculates based on actual vector count in window
- Added getStepSize() method to ThreadSafeSlidingWindow and JoinFunction
- More accurate clustering based on data volume
- All tests pass

Co-authored-by: ZeroJustMe <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: ZeroJustMe <[email protected]>
@ZeroJustMe ZeroJustMe requested a review from Copilot October 14, 2025 08:00
@ZeroJustMe ZeroJustMe self-assigned this Oct 14, 2025
@ZeroJustMe ZeroJustMe added the enhancement New feature or request label Oct 14, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a type-safe parameter system for vector indexes, enabling dynamic configuration of index-specific parameters (particularly for IVF and HNSW indexes) based on runtime conditions. The main enhancement is the dynamic calculation of IVF parameters based on the join window's vector count, along with improved thread-safety for accessing window properties.

Key Changes:

  • Introduced IndexParameters variant type with index-specific parameter structures (IVFParameters, HNSWParameters, NoParameters) for type-safe index configuration
  • Added dynamic IVF parameter calculation in JoinOperator that computes nlist as 4 * sqrt(vector_count) where vector_count = window_size / step_size
  • Added thread-safe getter methods for window size and step size in join functions to support parameter calculations

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
include/index/index.h Defines parameter structures and IndexParameters variant type for type-safe index configuration
include/concurrency/concurrency_manager.h Adds overloaded create_index method accepting IndexParameters
src/concurrency/concurrency_manager.cpp Implements parameterized index creation with type checking via std::get_if
include/function/join_function.h Adds thread-safe getters for window size and step size in ThreadSafeSlidingWindow and JoinFunction
src/function/join_function.cpp Implements window parameter getter methods
include/operator/join_operator.h Adds overloaded createIndexPair accepting IndexParameters
src/operator/join_operator.cpp Implements dynamic IVF parameter calculation and uses parameterized index creation
src/utils/monitoring.cpp Replaces std::cerr with structured logging macros

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

int64_t window_size = join_func_->getWindowSize();
int64_t step_size = join_func_->getStepSize();
// Calculate actual vector count in window
int64_t vector_count = (step_size > 0) ? (window_size / step_size) : window_size;
Copy link

Copilot AI Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback calculation vector_count = window_size when step_size <= 0 is incorrect. When step_size is 0 or negative, the window configuration is invalid, and this should either return an error or use a more appropriate default. Using window_size directly as vector count doesn't align with the formula's intent of calculating the number of vectors based on the step interval.

Copilot uses AI. Check for mistakes.
index_kind_ = InternalIndexKind::IVF;
if (createIndexPair(IndexType::IVF, "join_ivf")) {
// Calculate IVF parameters based on window size
// nlist = 4 * sqrt(window_size/step_size), rebuild_threshold = 1.5, nprobes = 10
Copy link

Copilot AI Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states the formula uses window_size/step_size, but the actual implementation on line 82 uses vector_count which is calculated differently. The comment should accurately reflect that vector_count represents the actual number of vectors in the window, not just the raw division result.

Suggested change
// nlist = 4 * sqrt(window_size/step_size), rebuild_threshold = 1.5, nprobes = 10
// nlist = 4 * sqrt(vector_count), where vector_count = (window_size / step_size) if step_size > 0, else window_size

Copilot uses AI. Check for mistakes.
… scenarios (#5)

* Checkpoint from VS Code for coding agent session

* Initial analysis of parallel join algorithm issues

Co-authored-by: ZeroJustMe <[email protected]>

* Fix parallel join algorithm issues for IVF method

- Fix IVF index race condition by copying candidate lists under lock
- Fix expected_matches UID calculation by using consistent modulo base
- Fix eager join wait logic to not require window draining
- All parallelism levels (1,2,4,8,16,32) now pass with recall >= 0.93

Co-authored-by: ZeroJustMe <[email protected]>

* Fix IVF nprobes calculation for better recall at higher parallelism

- Increase nprobes to search 30% of clusters (was 8%) for better recall
- Addresses recall degradation issue with large datasets and high parallelism
- All parallelism levels (1-32) now achieve recall >= 0.926 with 1000 records

Co-authored-by: ZeroJustMe <[email protected]>

* Fix single-instance concurrency in join operator

- Acquire both window locks before getting candidates and executing joins
- Prevents race condition where one slot expires records while other queries
- Add helper methods that work with locks already held
- Significant recall improvement at high parallelism (p8: 0.983, p16: 0.989, p32: 0.985)

Co-authored-by: ZeroJustMe <[email protected]>

* 优化数据源模式性能测试中的预期匹配计算和等待时间

---------

Co-authored-by: ZeroJustMe <[email protected]>
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: ZeroJustMe <[email protected]>
@ZeroJustMe
Copy link
Contributor Author

#56

@ZeroJustMe ZeroJustMe requested a review from Copilot October 20, 2025 12:54
@ZeroJustMe
Copy link
Contributor Author

This pull request introduces several enhancements and fixes focused on index parameterization, performance metrics, and thread safety in windowed join operations. The most significant changes are the addition of index-specific parameter structures, dynamic parameter calculation for IVF indexes, expanded performance metrics, and improvements to thread safety and concurrency handling.

Index parameterization and creation:

  • Introduced index-specific parameter structures (IVFParameters, HNSWParameters, NoParameters) and a variant type IndexParameters to allow passing custom parameters to index constructors. The ConcurrencyManager and JoinOperator now support creating indexes with these parameters, enabling dynamic configuration. [1] [2] [3] [4] [5]
  • In JoinOperator, IVF index parameters (nlist, rebuild_threshold, nprobes) are now dynamically calculated based on window size and step size, improving adaptability to workload and concurrency.

Performance metrics enhancements:

  • Added new atomic counters to JoinMetrics for tracking the number of records expired from the left and right windows (window_records_left_completed, window_records_right_completed), and included them in metric resets and TSV dumps. [1] [2] [3]
  • Updated the join operator logic to increment these counters when records are expired from the window. [1] [2] [3]

Thread safety and concurrency improvements:

  • Improved thread safety in the IVF index by copying candidate lists under lock before processing, preventing race conditions. [1] [2]
  • Added getter methods for window size and step size with proper locking in JoinFunction and ThreadSafeSlidingWindow, ensuring safe concurrent access. [1] [2] [3]

Configuration and testing:

  • Updated the performance test configuration to focus on the ivf_eager method, reduced test sizes, and expanded parallelism options. Commented out unused test modes for clarity. [1] [2]

API and internal refactoring:

  • Added several helper methods to JoinOperator for candidate retrieval and join execution under lock, improving code organization and future extensibility. [1] [2]

These changes collectively improve the flexibility, observability, and robustness of the join operator and index management subsystems.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.


// Calculate nprobes to search at least 30% of clusters for better recall in high-concurrency scenarios
// Cap at 60% to maintain reasonable performance
int nprobes = std::max(15, std::min(nlist * 60 / 100, nlist * 30 / 100));
Copy link

Copilot AI Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for calculating nprobes is incorrect. The expression std::min(nlist * 60 / 100, nlist * 30 / 100) will always evaluate to nlist * 30 / 100 since 30% is always less than 60%. This should likely be std::max(nlist * 30 / 100, std::min(nlist * 60 / 100, 15)) to cap between 30-60% with a minimum of 15, or std::max(15, std::min(nlist * 60 / 100, nlist)) to ensure we don't exceed nlist.

Suggested change
int nprobes = std::max(15, std::min(nlist * 60 / 100, nlist * 30 / 100));
int nprobes = std::max(15, std::min(nlist * 60 / 100, std::max(nlist * 30 / 100, 15)));

Copilot uses AI. Check for mistakes.
@ZeroJustMe ZeroJustMe moved this from Todo to In Progress in SAGE-Middleware Oct 20, 2025
@ShuhaoZhangTony
Copy link
Member

加个测试,保证线程数量不影响recall。

@ZeroJustMe
Copy link
Contributor Author

加个测试,保证线程数量不影响recall。

好,我刚刚测IVF每个并行度下都能跑到99的recall,之后调一下测试配置放到CI里

ZeroJustMe and others added 3 commits October 20, 2025 23:14
* Checkpoint from VS Code for coding agent session

* Initial analysis of parallel join algorithm issues

Co-authored-by: ZeroJustMe <[email protected]>

* Fix parallel join algorithm issues for IVF method

- Fix IVF index race condition by copying candidate lists under lock
- Fix expected_matches UID calculation by using consistent modulo base
- Fix eager join wait logic to not require window draining
- All parallelism levels (1,2,4,8,16,32) now pass with recall >= 0.93

Co-authored-by: ZeroJustMe <[email protected]>

* Fix IVF nprobes calculation for better recall at higher parallelism

- Increase nprobes to search 30% of clusters (was 8%) for better recall
- Addresses recall degradation issue with large datasets and high parallelism
- All parallelism levels (1-32) now achieve recall >= 0.926 with 1000 records

Co-authored-by: ZeroJustMe <[email protected]>

* Fix single-instance concurrency in join operator

- Acquire both window locks before getting candidates and executing joins
- Prevents race condition where one slot expires records while other queries
- Add helper methods that work with locks already held
- Significant recall improvement at high parallelism (p8: 0.983, p16: 0.989, p32: 0.985)

Co-authored-by: ZeroJustMe <[email protected]>

* 优化数据源模式性能测试中的预期匹配计算和等待时间

* [WIP] Add new performance test file for ci (#6)

* Checkpoint from VS Code for coding agent session

* Add CI performance testing workflow and configuration

- Created perf-ci-tests.yml workflow for parallel join performance testing
- Created perf_join_datasource_modes_ci.toml with smaller dataset (1000 records)
- Tests parallelism levels 1, 2, 4, 8, 16, 32 for both bruteforce_eager and ivf_eager
- Verified locally: tests pass with recall > 0.98 (above 0.85 threshold)

Co-authored-by: ZeroJustMe <[email protected]>

---------

Co-authored-by: ZeroJustMe <[email protected]>
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: ZeroJustMe <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: ZeroJustMe <[email protected]>
@ZeroJustMe ZeroJustMe requested a review from Copilot October 27, 2025 03:20
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants