Add duplicate filtering by document ID in HNSWlib search #623

Open
wants to merge 1 commit into
base: master
grussdorian

This commit modifies HNSWlib to filter duplicate document IDs during KNN search, ensuring only one embedding per unique document ID is returned. Key changes include:

  • Added internal_id_to_doc_id_ vector to HierarchicalNSW to map internal IDs to document IDs, populated in addPoint.
  • Introduced getMetadata method to retrieve document IDs.
  • Extended VisitedList with seen_doc_ids set to track seen document IDs thread-locally, avoiding mutex contention.
  • Updated searchBaseLayerST to skip candidates with already-seen document IDs using vl->is_doc_seen(doc_id).
  • Removed unused visited_metadata_ and visited_metadata_lock_ as filtering is now handled by VisitedList.

The duplicate filtering works as intended, though knnQuery may raise a RuntimeError if k exceeds the number of unique document IDs due to result array shape constraints. Tests for basic filtering, single ID, and large datasets pass, while empty index and insufficient IDs cases require further handling.
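For readers who want to see the idea in isolation, here is a minimal, self-contained sketch of the filtering described in the list above. It is not the code from this PR: it applies the check as a post-filter over an already sorted candidate list rather than inside searchBaseLayerST, and the names `DocVisitedList`, `mark_doc_seen`, `reset`, `filter_unique_docs`, and the `DocId` type are assumptions made for illustration. Only `seen_doc_ids`, `is_doc_seen`, and the internal-ID-to-document-ID mapping are taken from the description.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

using InternalId = std::uint32_t;  // stand-in for the index's internal id type
using DocId = std::uint64_t;       // hypothetical document id type

// Hypothetical per-search scratch state: each search owns one instance,
// so no mutex is needed to track which document IDs were already returned.
struct DocVisitedList {
    std::unordered_set<DocId> seen_doc_ids;

    bool is_doc_seen(DocId id) const { return seen_doc_ids.count(id) != 0; }
    void mark_doc_seen(DocId id) { seen_doc_ids.insert(id); }  // assumed helper
    void reset() { seen_doc_ids.clear(); }                     // assumed helper
};

// Keep at most k candidates, one per document ID.
// `candidates` holds (distance, internal id) pairs sorted by ascending
// distance, so the closest embedding of each document is the one kept.
std::vector<std::pair<float, InternalId>> filter_unique_docs(
    const std::vector<std::pair<float, InternalId>>& candidates,
    const std::vector<DocId>& internal_id_to_doc_id,
    std::size_t k,
    DocVisitedList& vl) {
    std::vector<std::pair<float, InternalId>> result;
    for (const auto& candidate : candidates) {
        if (result.size() == k) break;
        const DocId doc_id = internal_id_to_doc_id[candidate.second];
        if (vl.is_doc_seen(doc_id)) continue;  // duplicate document, skip
        vl.mark_doc_seen(doc_id);
        result.push_back(candidate);
    }
    return result;  // may hold fewer than k entries
}
```

Note that the result can be shorter than k when the index holds fewer than k unique document IDs. A caller that allocates a fixed k-wide output array has to handle that shortfall, which is the situation the RuntimeError above points to.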

Files modified:

  • hnswalg.h: Added duplicate filtering logic and mappings.
  • visited_list_pool.h: Enhanced VisitedList for document ID tracking.
