improve blocking with missing payload keys #230

Open
3 tasks
ericbuckley opened this issue Feb 27, 2025 · 0 comments · May be fixed by #254
ericbuckley commented Feb 27, 2025

Summary

Update the blocking query to optionally skip over missing blocking keys.

Acceptance Criteria

  • A reimplementation of get_block_data
  • Add documentation to the site/design.md about how blocking uses log odds
  • Add a new algorithm configuration variable, "defaults/compare_minimum_percentage", with a default of 0.7.

Details / Tasks

Modify the get_block_data function to optionally skip a blocking key when its value is missing from the incoming payload. Total the log odds scores for all blocking keys in a pass; the keys that are still available to block on must account for at least defaults/compare_minimum_percentage of that total. Otherwise, abort the query and return an empty dataset, because the query would be too large.
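
One possible reading of this check, as a minimal sketch rather than the actual get_block_data implementation. The function name `keys_to_block_on`, the key names, and the log odds values are all hypothetical, and the sketch assumes the threshold is compared against the share of the pass's total log odds carried by the keys that have values.

```python
def keys_to_block_on(payload_values, log_odds, minimum_percentage=0.7):
    """Return the blocking keys to use for a pass, or None to skip the pass.

    payload_values: blocking key -> value from the incoming payload
                    (None or "" when the value is missing).
    log_odds:       blocking key -> log odds score (hypothetical values).
    """
    # Total log odds across every blocking key configured for the pass.
    total = sum(log_odds[key] for key in payload_values)
    if total <= 0:
        return None

    # Keys that actually have a value in the payload.
    available = [key for key, value in payload_values.items() if value]
    available_score = sum(log_odds[key] for key in available)

    # Too much weight is missing: the query would be too broad, so skip it.
    if available_score / total < minimum_percentage:
        return None
    return available


# Hypothetical scores, chosen only for illustration.
scores = {"BIRTHDATE": 10.0, "LAST_NAME": 6.0, "ZIP": 4.0}

# ZIP is missing: 16 of 20 points remain (0.8), so block on the other two keys.
partial = keys_to_block_on(
    {"BIRTHDATE": "1980-01-01", "LAST_NAME": "Smith", "ZIP": None}, scores
)

# BIRTHDATE is missing: only 10 of 20 points remain (0.5), so skip the pass.
skipped = keys_to_block_on(
    {"BIRTHDATE": None, "LAST_NAME": "Smith", "ZIP": "12345"}, scores
)
```

In this sketch a `None` result stands in for "abort the query and return the empty dataset"; the caller would decide how to represent that.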

Background / Context

When an incoming payload has missing data, specifically data elements missing in blocking keys, we only return records that also have missing data in those fields. This rewards incomplete records and penalizes records that have values in those fields. In some instances, it would be advantageous to also get back records with data in those fields. On the flip side, we need to balance this: if we skip too many blocking keys because of missing data, we risk returning too many records to analyze.

Our algorithms already store log odds values, which give us some indication as to how valuable a match is in record linkage. We can reuse those values in blocking to make decisions on whether blocking keys can be skipped. If too many values are skipped, we should just return the empty set and skip the pass altogether.

Related Issues/PRs

#223

@ericbuckley ericbuckley changed the title improve blocking with missing keys improve blocking with missing payload keys Feb 28, 2025
@ericbuckley ericbuckley added the feature New feature or request label Feb 28, 2025
@ericbuckley ericbuckley added this to the v25.5.0 milestone Feb 28, 2025
@ericbuckley ericbuckley modified the milestones: v25.5.0, v25.4.0 Mar 12, 2025
@ericbuckley ericbuckley self-assigned this Mar 18, 2025
@ericbuckley ericbuckley linked a pull request Mar 18, 2025 that will close this issue
9 tasks