Add duplicate filtering by document ID in HNSWlib search #623

grussdorian · 2025-04-18T14:48:32Z

This commit modifies HNSWlib to filter duplicate document IDs during KNN search, ensuring only one embedding per unique document ID is returned. Key changes include:

- Added `internal_id_to_doc_id_` vector to `HierarchicalNSW` to map internal IDs to document IDs, populated in `addPoint`.
- Introduced `getMetadata` method to retrieve document IDs.
- Extended `VisitedList` with `seen_doc_ids` set to track seen document IDs thread-locally, avoiding mutex contention.
- Updated `searchBaseLayerST` to skip candidates with already-seen document IDs using `vl->is_doc_seen(doc_id)`.
- Removed unused `visited_metadata_` and `visited_metadata_lock_` as filtering is now handled by `VisitedList`.

The duplicate filtering works as intended, though `knnQuery` may raise a `RuntimeError` if `k` exceeds the number of unique document IDs due to result array shape constraints. Tests for basic filtering, single ID, and large datasets pass, while empty index and insufficient IDs cases require further handling.
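
For reviewers, the filtering idea can be sketched outside the graph traversal as a flat pass over already-scored candidates. This is a simplified illustration, not the code in this PR: `VisitedListSketch`, `mark_doc_seen`, `reset`, `dedupe_by_doc_id`, and the `DocId`/`InternalId` aliases are hypothetical names, and the real change applies the check inside `searchBaseLayerST` rather than as a post-filter. Only `seen_doc_ids`, `is_doc_seen`, and the internal-ID-to-document-ID mapping come from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

using DocId = std::uint64_t;       // assumption: document IDs are integers
using InternalId = std::uint32_t;  // assumption: internal IDs are 32-bit

// Hypothetical stand-in for the per-thread VisitedList extension.
// Each search thread owns one instance, so no mutex is needed.
struct VisitedListSketch {
    std::unordered_set<DocId> seen_doc_ids;

    bool is_doc_seen(DocId id) const { return seen_doc_ids.count(id) != 0; }
    void mark_doc_seen(DocId id) { seen_doc_ids.insert(id); }  // hypothetical helper
    void reset() { seen_doc_ids.clear(); }                     // hypothetical helper
};

// Keeps at most k candidates, closest first, with one entry per document ID.
// candidates holds (distance, internal ID) pairs in any order.
std::vector<std::pair<float, InternalId>> dedupe_by_doc_id(
        std::vector<std::pair<float, InternalId>> candidates,
        const std::vector<DocId>& internal_id_to_doc_id,
        std::size_t k,
        VisitedListSketch& vl) {
    // Sort by distance so the closest embedding of each document wins.
    std::sort(candidates.begin(), candidates.end());

    std::vector<std::pair<float, InternalId>> result;
    for (const auto& candidate : candidates) {
        if (result.size() == k) break;
        assert(candidate.second < internal_id_to_doc_id.size());
        const DocId doc_id = internal_id_to_doc_id[candidate.second];
        if (vl.is_doc_seen(doc_id)) continue;  // duplicate document: skip
        vl.mark_doc_seen(doc_id);
        result.push_back(candidate);
    }
    return result;  // may hold fewer than k entries
}
```

Note that the result can be shorter than `k` when there are fewer than `k` unique document IDs. A caller that allocates a fixed `k`-wide result array would need to handle that shortfall, which appears to be the situation behind the `RuntimeError` mentioned above.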

Files modified:

- hnswalg.h: Added duplicate filtering logic and mappings.
- visited_list_pool.h: Enhanced `VisitedList` for document ID tracking.

