improve blocking by excluding incorrect matches #231

Closed
2 tasks
ericbuckley opened this issue Feb 28, 2025 · 1 comment

Comments

@ericbuckley
Collaborator

Summary

The blocking algorithm is too liberal: it includes every patient from any overlapping person cluster. We should keep patients that fail to match only because a key value is missing, but exclude patients whose values are present and different.

Acceptance Criteria

  • re-implement the get_block_data method to exclude patients whose present values conflict with the blocking criteria
  • add new test cases showing that patients with incorrect matches are excluded

Details / Tasks

There are likely two ways to solve this: a) modify the query, or b) keep the query as is and filter the dataset in memory. It's not obvious which strategy is better from a performance standpoint. Please consider both when implementing, and include some documentation / comments on why the chosen strategy was taken.
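
As a rough illustration of option (a), the exclusion can be pushed into the query itself. This is only a sketch: the `patient` and `blocking_value` tables, their columns, and the `blocked_patient_ids` helper are hypothetical simplifications, not the project's actual schema or `get_block_data` implementation, and SQLite is used only so the example is self-contained.

```python
import sqlite3

# Hypothetical, simplified schema (an assumption for illustration only).
SCHEMA = """
CREATE TABLE patient (id INTEGER PRIMARY KEY, person_id INTEGER);
CREATE TABLE blocking_value (patient_id INTEGER, blockingkey TEXT, value TEXT);
"""

QUERY = """
SELECT p.id
FROM patient AS p
WHERE p.person_id IN (
    -- person clusters with at least one patient matching the key
    SELECT p2.person_id
    FROM patient AS p2
    JOIN blocking_value AS bv ON bv.patient_id = p2.id
    WHERE bv.blockingkey = :key AND bv.value = :value
)
AND (
    -- keep the patient if it matches the key itself ...
    EXISTS (
        SELECT 1 FROM blocking_value AS bv
        WHERE bv.patient_id = p.id
          AND bv.blockingkey = :key AND bv.value = :value
    )
    -- ... or if it has no value at all for the key
    OR NOT EXISTS (
        SELECT 1 FROM blocking_value AS bv
        WHERE bv.patient_id = p.id AND bv.blockingkey = :key
    )
)
ORDER BY p.id
"""


def blocked_patient_ids(conn, key, value):
    """Return ids of patients in matching clusters, minus conflicting ones."""
    rows = conn.execute(QUERY, {"key": key, "value": value}).fetchall()
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Patients 1-3 share person cluster 10; patient 4 is in cluster 20.
    conn.executemany(
        "INSERT INTO patient VALUES (?, ?)",
        [(1, 10), (2, 10), (3, 10), (4, 20)],
    )
    # Patient 2 has no BIRTHDATE row, i.e. the key is missing.
    conn.executemany(
        "INSERT INTO blocking_value VALUES (?, ?, ?)",
        [
            (1, "BIRTHDATE", "1999-01-01"),
            (3, "BIRTHDATE", "2001-05-05"),
            (4, "BIRTHDATE", "2001-05-05"),
        ],
    )
    return blocked_patient_ids(conn, "BIRTHDATE", "1999-01-01")
```

In this toy data set, `demo()` keeps patients 1 (exact match) and 2 (missing value) and drops patient 3 (conflicting value in a matching cluster) and patient 4 (non-matching cluster). Whether the extra subqueries cost more than filtering in memory would need to be measured against the real tables.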

Background / Context

Take a simple blocking example: finding all patients with a matching BIRTHDATE of '1999-01-01'. Currently, we find all patient records in the database with that BIRTHDATE in the blocking values table, then reduce them to a unique set of person clusters (based on the clusters those records belong to). Finally, we return all patients in those matching person clusters, which creates 3 groups of data in the result set:

  • Patients that exactly match on '1999-01-01'
  • Patients that have no BIRTHDATE value
  • Patients that have a BIRTHDATE value that does not match '1999-01-01'

The last group in the result set is problematic: it clearly doesn't block, and therefore shouldn't be present in our evaluation. Removing this group from the result set should provide more accurate calculations of cluster membership.
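
The three groups above, and the in-memory filtering of option (b), can be sketched as follows. The dictionary-based patient records and the `BlockStatus`, `classify`, and `filter_block` names are hypothetical, chosen only for illustration; the real patient model and `get_block_data` code will differ.

```python
from enum import Enum


class BlockStatus(Enum):
    MATCH = "match"        # value present and equal to the blocking value
    MISSING = "missing"    # no value recorded for the blocking key
    MISMATCH = "mismatch"  # value present but different


def classify(patient, key, value):
    """Place one patient record into one of the three groups."""
    found = patient.get(key)
    if found is None or found == "":
        return BlockStatus.MISSING
    return BlockStatus.MATCH if found == value else BlockStatus.MISMATCH


def filter_block(patients, key, value):
    """Keep matching and missing-value patients; drop conflicting ones."""
    return [
        p for p in patients
        if classify(p, key, value) is not BlockStatus.MISMATCH
    ]


# Toy person cluster (hypothetical data): one patient per group.
cluster = [
    {"id": 1, "BIRTHDATE": "1999-01-01"},
    {"id": 2, "BIRTHDATE": None},
    {"id": 3, "BIRTHDATE": "2001-05-05"},
]
kept_ids = [p["id"] for p in filter_block(cluster, "BIRTHDATE", "1999-01-01")]
```

Here `kept_ids` contains patients 1 and 2, and patient 3 is removed because its BIRTHDATE is present but different.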

Related Issues/PRs

#230

@ericbuckley ericbuckley added the feature New feature or request label Feb 28, 2025
@ericbuckley ericbuckley changed the title from "improve blocking with missing MPI keys" to "improve blocking by excluding incorrect matches" Feb 28, 2025
@ericbuckley
Collaborator Author

ericbuckley commented Feb 28, 2025

@bamader @m-goggins two questions for you both

  1. Does the description above match your expectations from our discussion this week?
  2. Does this change benefit BOTH the belongingness ratio algorithm and the relative match score one we're working on? If it's only the latter, I only want to implement this work after our weighted confidence implementation.

@ericbuckley ericbuckley added this to the v25.5.0 milestone Feb 28, 2025
@bamader bamader self-assigned this Mar 10, 2025
@ericbuckley ericbuckley modified the milestones: v25.5.0, v25.4.0 Mar 12, 2025
bamader added a commit that referenced this issue Mar 14, 2025
## Description
This PR adds a filter function to the `get_block_data` function in the
`mpi_service`. The filter removes MPI candidates that belong to a person
cluster containing a patient who satisfied the blocking criteria, but
whose own values are present (i.e. not missing) and disagree with the
incoming blocking keys. This makes matching more precise and reinforces
the validity of the blocking step.

## Related Issues
#231 

## Additional Notes
N/A
bamader added a commit that referenced this issue Mar 19, 2025