Strange Error On Some CSV Inputs #58
Comments
The problem described in the issue as filed is that the provided data does not contain the fields the tool is configured to look for (SSN specifically). That matters because of how the tool works on the back-end: it combines deterministic and probabilistic rules to find duplicates, and it cannot do that if none of the data satisfies any of the deterministic rules. To fix this, the user must change the deterministic rules so that they match at least 5 data points in order to be accepted. This is done through editing …
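For illustration only: the training log in the next comment looks like output from the Splink record-linkage library, so "editing the deterministic rules" presumably means pointing the rule expressions at columns that actually exist in the input. The settings dictionary and column names below are assumptions, not the tool's documented configuration:

# Hypothetical sketch of a Splink-style settings dictionary; the key names
# follow Splink 3 conventions and the column names are guesses, not the
# actual ecqm_dedupe configuration.
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        # A rule such as "l.ssn = r.ssn" can never match if the CSV has no
        # ssn column, so the rules must reference fields that are present.
        "l.last_name = r.last_name and l.birth_date = r.birth_date",
        "l.street_address1 = r.street_address1",
    ],
}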
I think I've hit the same bug. My file has roughly 15,000 lines. When I run it, I get an out of range error (this is the entire output):
Estimated u probabilities using random sampling
Your model is not yet fully trained. Missing estimates for:
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
Parameter estimates will be made for the following comparison(s):
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
WARNING: (a run of WARNING: lines whose message text was lost in the paste)
Iteration 1: Largest change in params was 1 in probability_two_random_records_match
EM converged after 2 iterations
Your model is not yet fully trained. Missing estimates for:
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
Parameter estimates will be made for the following comparison(s):
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
WARNING: (another run of WARNING: lines)
Iteration 1: Largest change in params was -0.922 in the m_probability of street_address1, level
EM converged after 5 iterations
Your model is not yet fully trained. Missing estimates for:
----- Starting EM training session -----
Estimating the m probabilities of the model by blocking on:
Parameter estimates will be made for the following comparison(s):
Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
WARNING: (another run of WARNING: lines)
Iteration 1: Largest change in params was 0.294 in probability_two_random_records_match
EM converged after 2 iterations
Your model is not yet fully trained. Missing estimates for:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
Error was: Out of Range Error: cannot take logarithm of zero
---------------- End dump data ------
When I run each part of this file separately, I get no error - the script generates the result with no errors. It looks like there might be some limit on the number of records that can be checked for duplication?
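For what it's worth, "cannot take logarithm of zero" is the kind of failure a Fellegi-Sunter model hits when EM drives an m or u probability to exactly zero (for example, a comparison level that never occurs among candidate matches), so the match weight log2(m/u) is undefined. A minimal sketch of the arithmetic, not the tool's actual code:

import math

# Fellegi-Sunter style match weight: log2 of the ratio of the m and u
# probabilities for a comparison level.
def match_weight(m: float, u: float) -> float:
    return math.log2(m / u)

# If EM estimates m = 0 for a level (it never occurred among candidate
# matches), the logarithm is undefined, analogous to the DuckDB
# "Out of Range Error: cannot take logarithm of zero" above.
try:
    match_weight(0.0, 0.05)
except ValueError as exc:
    print("log of zero:", exc)

If that is what is happening here, it would also explain why fixing the deterministic rules so they match real data points (as suggested in the first comment) makes the error go away: the estimated probabilities stop collapsing to zero.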
Describe the bug
Ran into an out of range error when using the tool with this CSV input:
This input has no SSN or truth value, which may have something to do with the error.
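As a quick pre-flight check (the required column names below are guesses based on this issue, not the tool's documented schema), you can confirm which expected fields are missing from the CSV before running the CLI:

import pandas as pd

# Hypothetical sanity check: "ssn" and "street_address1" are assumed
# field names, not the actual ecqm_dedupe requirements.
df = pd.read_csv("/tmp/x.csv")
expected = {"ssn", "street_address1"}
missing = expected - set(df.columns)
if missing:
    print(f"CSV is missing columns the dedupe rules expect: {sorted(missing)}")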
However, when I run the script:
python3.11 cli/ecqm_dedupe.py dedupe-data --fmt CSV /tmp/x.csv /tmp/out.csv
things go very wrong (note that running with the same data I initially sent works fine):
To Reproduce
Use the above input with this CLI call:
python3.11 cli/ecqm_dedupe.py dedupe-data --fmt CSV /tmp/x.csv /tmp/out.csv
Expected behavior
The CLI should output the deduplicated data normally.
Actual behavior
The tool throws an out of range error.