improve blocking with missing payload keys #230

ericbuckley · 2025-02-27T22:54:53Z

Summary

Update the blocking query to optionally skip over missing blocking keys.

Acceptance Criteria

A reimplementation of get_block_data
Add documentation to site/design.md on how blocking uses log odds
Add a new algorithm configuration variable, "defaults/compare_minimum_percentage", with a default of 0.7.

Details / Tasks

Modify the get_block_data function to optionally skip over a blocking key if its value is missing from the payload. If we total the log odds scores for all the blocking keys in a pass, the keys that are still available to block on must account for at least default/compare_minimum_percentage of that total. Otherwise we should abort the query and return the empty dataset, as the query would be too large.
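A minimal sketch of the threshold check described above, to make the intent concrete. This is not the actual get_block_data implementation: the function name `keys_to_block_on`, its parameters, and the log odds values in the usage note are hypothetical, and it assumes the check compares the summed log odds of the available keys against the summed log odds of all keys in the pass.

```python
from __future__ import annotations

from typing import Mapping, Optional


def keys_to_block_on(
    log_odds: Mapping[str, float],
    payload_values: Mapping[str, Optional[str]],
    minimum_percentage: float = 0.7,
) -> Optional[list[str]]:
    """Pick the blocking keys for one pass (hypothetical helper).

    Returns the keys that have a value in the payload, or None when the
    available keys carry too little of the pass's total log odds, in which
    case the caller should skip the pass and return the empty dataset.
    """
    total = sum(log_odds.values())
    # A key is "available" when the payload has a non-empty value for it.
    available = [key for key in log_odds if payload_values.get(key)]
    available_total = sum(log_odds[key] for key in available)
    if total <= 0 or available_total / total < minimum_percentage:
        return None
    return available
```

With illustrative log odds of `{"birthdate": 10.0, "zip": 5.0, "last_name": 5.0}`, a payload missing only `zip` keeps 15 of 20 points (0.75) and blocks on the two remaining keys, while a payload missing `birthdate` keeps only 10 of 20 (0.5) and the pass is skipped.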

Background / Context

When an incoming payload has missing data, specifically data elements missing in blocking keys, we only return records that are also missing data in those fields. This rewards incomplete records and penalizes records that have values in those fields. In some instances, it would be advantageous to also get back records with data in those fields. On the flip side, we need to balance this: if we skip too many blocking keys because of missing data, we run the risk of returning too many records to analyze.
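The difference between the two behaviors can be illustrated with a small in-memory stand-in for the blocking query. This is only a sketch: the real blocking runs as a database query, and the `block` function, its `skip_missing` flag, and the sample records are hypothetical.

```python
from __future__ import annotations

from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def block(
    records: List[Record],
    payload: Record,
    keys: List[str],
    skip_missing: bool,
) -> List[Record]:
    """Return the records that agree with the payload on every blocking key.

    With skip_missing=False, a missing payload value only matches records
    that are also missing that value. With skip_missing=True, a key with a
    missing payload value is ignored entirely.
    """
    matches = []
    for record in records:
        for key in keys:
            value = payload.get(key)
            if value is None and skip_missing:
                continue  # ignore this key; the payload has nothing to block on
            if record.get(key) != value:
                break  # disagreement on a blocking key, so reject the record
        else:
            matches.append(record)
    return matches


records = [
    {"id": "a", "birthdate": "1980-01-01", "zip": None},
    {"id": "b", "birthdate": "1980-01-01", "zip": "90210"},
    {"id": "c", "birthdate": "1975-05-05", "zip": None},
]
payload = {"birthdate": "1980-01-01", "zip": None}

strict = block(records, payload, ["birthdate", "zip"], skip_missing=False)
relaxed = block(records, payload, ["birthdate", "zip"], skip_missing=True)
```

Here `strict` contains only record `a`, the one that is also missing `zip`, while `relaxed` contains both `a` and `b`. Skipping a key widens the result set, which is why the number of skipped keys needs to be bounded.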

Our algorithms already store log odds values, which give us some indication as to how valuable a match is in record linkage. We can reuse those values in blocking to make decisions on whether blocking keys can be skipped. If too many values are skipped, we should just return the empty set and skip the pass altogether.

Related Issues/PRs

#223


ericbuckley changed the title ~~improve blocking with missing keys~~ improve blocking with missing payload keys Feb 28, 2025

ericbuckley added the feature New feature or request label Feb 28, 2025

ericbuckley mentioned this issue Feb 28, 2025

improve blocking by exluding incorrect matches #231

Closed

2 tasks

ericbuckley added this to the v25.5.0 milestone Feb 28, 2025

ericbuckley mentioned this issue Mar 6, 2025

Improve evaluation of missing data fields #235

Open

4 tasks

ericbuckley modified the milestones: v25.5.0, v25.4.0 Mar 12, 2025

ericbuckley self-assigned this Mar 18, 2025

ericbuckley linked a pull request Mar 18, 2025 that will close this issue

missing blocking keys #254

Open

9 tasks
