Update the blocking query to optionally skip over missing blocking keys.
Acceptance Criteria
A reimplementation of get_block_data
Add documentation to site/design.md describing how blocking uses log odds
Add a new algorithm configuration variable, "defaults/compare_minimum_percentage", with a default of 0.7.
Details / Tasks
Modify the get_block_data function to optionally skip a blocking key when its value is missing from the incoming payload. Total the log odds scores for all blocking keys in a pass; the keys that still have values must account for at least default/compare_minimum_percentage of that total. Otherwise, abort the query and return an empty dataset, because the query would return too many records.
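The threshold check described above could be sketched roughly as follows. This is only an illustration, not the actual get_block_data implementation: the helper name `keys_to_block_on`, its signature, and the log odds values are all hypothetical, and it assumes the threshold is compared against the share of total log odds carried by the keys that have values.

```python
from typing import Mapping, Optional


def keys_to_block_on(
    payload_values: Mapping[str, Optional[str]],
    log_odds: Mapping[str, float],
    compare_minimum_percentage: float = 0.7,
) -> Optional[list[str]]:
    """Hypothetical helper: return the blocking keys to use, or None to skip the pass."""
    # Total log odds across every blocking key configured for this pass.
    total = sum(log_odds[key] for key in payload_values)
    # Keys whose value is present (not None or empty) in the incoming payload.
    present = [key for key, value in payload_values.items() if value]
    available = sum(log_odds[key] for key in present)
    # Too little evidence left to block on: signal the caller to return an empty set.
    if total <= 0 or available / total < compare_minimum_percentage:
        return None
    return present


# Made-up log odds values, for illustration only.
example_log_odds = {"birthdate": 10.0, "last_name": 6.0, "zip": 4.0}

# zip is missing: 16 of 20 log odds points remain (0.8), so the pass proceeds without zip.
without_zip = keys_to_block_on(
    {"birthdate": "1980-01-01", "last_name": "Smith", "zip": None}, example_log_odds
)

# birthdate is missing: only 10 of 20 remain (0.5), so the pass is skipped.
without_birthdate = keys_to_block_on(
    {"birthdate": None, "last_name": "Smith", "zip": "90210"}, example_log_odds
)
```

Returning None rather than an empty list lets a caller distinguish "skip this pass" from "block on nothing", which would otherwise match every record.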
Background / Context
When an incoming payload has missing data, specifically in the fields used as blocking keys, we only return records that are also missing data in those fields. This rewards incomplete records and penalizes records that have values in those fields. In some instances, it would be advantageous to also get back records with data in those fields. On the flip side, we need to balance this: if we skip too many blocking keys because of missing data, we run the risk of returning too many records to analyze.
Our algorithms already store log odds values, which give us some indication as to how valuable a match is in record linkage. We can reuse those values in blocking to make decisions on whether blocking keys can be skipped. If too many values are skipped, we should just return the empty set and skip the pass altogether.
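To make the difference in behavior concrete, here is a toy in-memory comparison. It is a simplification under stated assumptions: the real blocking step is a database query, and the function names `block_strict` and `block_skipping_missing`, along with the sample records, are hypothetical.

```python
def block_strict(payload, records, keys):
    """Behavior described above: a missing payload value only matches records
    that are also missing that value."""
    return [r for r in records if all(r.get(k) == payload.get(k) for k in keys)]


def block_skipping_missing(payload, records, keys):
    """Proposed behavior: ignore blocking keys whose payload value is missing."""
    usable = [k for k in keys if payload.get(k) is not None]
    return [r for r in records if all(r.get(k) == payload[k] for k in usable)]


# Made-up sample data, for illustration only.
sample_records = [
    {"id": 1, "last_name": "Smith", "zip": None},
    {"id": 2, "last_name": "Smith", "zip": "90210"},
    {"id": 3, "last_name": "Jones", "zip": None},
]
sample_payload = {"last_name": "Smith", "zip": None}
blocking_keys = ["last_name", "zip"]

# Only record 1 (also missing zip) is returned.
strict_ids = [r["id"] for r in block_strict(sample_payload, sample_records, blocking_keys)]

# Records 1 and 2 are returned, since zip is no longer part of the filter.
relaxed_ids = [
    r["id"] for r in block_skipping_missing(sample_payload, sample_records, blocking_keys)
]
```

The relaxed version returns a superset of the strict results, which is why a log odds threshold is needed to keep the candidate set from growing too large.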
Related Issues/PRs
#223