Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange Error On Some CSV Inputs #58

Open
IsaacMilarky opened this issue May 22, 2024 · 2 comments
Open

Strange Error On Some CSV Inputs #58

IsaacMilarky opened this issue May 22, 2024 · 2 comments
Assignees
Labels
dedupliFHIR DedupliFHIR repo tickets

Comments

@IsaacMilarky
Copy link
Collaborator

Describe the bug
Ran into out of range error when using tool with this CSV input:

"id","truth_value","family_name","given_name","gender","birth_date","phone","street_address","city","state","postal_code","SSN"
"4_Arlene_Oliver.xml","","oliver","arlene","female","05/26/1972","434-228-8487","845 Collier Pike Center","North Katheryn","WA","99033",""
"21_Raul_Waters.xml","","waters","raul","male","10/31/1972","(878)360-4765","61035 Adell Ranch Wall","Kerlukeburgh","NC","28885",""
"22_Raul_Waters.xml","","waters","raul","male","10/31/1972","(878)360-4765","61035 Adell Ranch Wall","Kerlukeburgh","NC","28885",""
"29_Terrance_Weber.xml","","weber","terrance","male","09/19/1960","903-814-2082","88790 Lemke Trail Road","South Quinton","NE","68843",""
"30_Terrance_Weber.xml","","weber","terrance","male","09/12/1960","903-814-2082","88790 Lemke Trail Road","South Quinton","NE","68843",""

This input has no SSN or truth value which may have something to do with the error.

However when i run the script:
python3.11 cli/ecqm_dedupe.py dedupe-data --fmt CSV /tmp/x.csv /tmp/out.csv

Things get very bad (please note that running with the same data i initially sent is working fine) :

Format is CSV
Data dir is /tmp/x.csv
----- Estimating u probabilities using random sampling -----
u probability not trained for postal_code - Exact match Postcode District (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code - Exact match Postcode Area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - given_name (no m values are trained).
    - family_name (no m values are trained).
    - birth_date (no m values are trained).
    - postal_code (some u values are not trained, no m values are trained).
    - street_address (no m values are trained).
    - phone (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."ssn" = r."ssn"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - family_name
    - birth_date
    - postal_code
    - street_address
    - phone

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 

WARNING:
Level Within 1 month on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Within 10 years on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Exact match Postcode District on comparison postal_code not observed in dataset, unable to train m value

WARNING:
Level Exact match Postcode Area on comparison postal_code not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.904 in the m_probability of street_address, level `All other comparisons`
Iteration 2: Largest change in params was 0.197 in probability_two_random_records_match
Iteration 3: Largest change in params was 1.06e-09 in probability_two_random_records_match

EM converged after 3 iterations
m probability not trained for birth_date - Within 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Within 10 years (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code - Exact match Postcode District (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code - Exact match Postcode Area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - birth_date (some m values are not trained).
    - postal_code (some u values are not trained, some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."birth_date" = r."birth_date"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - family_name
    - postal_code
    - street_address
    - phone

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - birth_date

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 691, in _log_and_run_sql_execution
    return self._run_sql_execution(final_sql, templated_name, physical_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/splink/duckdb/linker.py", line 225, in _run_sql_execution
    self._con.sql(final_sql)
duckdb.duckdb.OutOfRangeException: Out of Range Error: cannot take logarithm of zero

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cage/public_html/mdinteractive/extraApps/dedupliFHIR-main/cli/ecqm_dedupe.py", line 190, in <module>
    cli()
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cage/public_html/mdinteractive-03/extraApps/dedupliFHIR-main/cli/deduplifhirLib/utils.py", line 150, in wrapper
    return func(*args,**kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/cage/public_html/mdinteractive/extraApps/dedupliFHIR-main/cli/ecqm_dedupe.py", line 35, in dedupe_data
    linker.estimate_parameters_using_expectation_maximisation(
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 1682, in estimate_parameters_using_expectation_maximisation
    em_training_session._train()
  File "/usr/local/lib/python3.11/site-packages/splink/em_training_session.py", line 197, in _train
    expectation_maximisation(self, cvv)
  File "/usr/local/lib/python3.11/site-packages/splink/expectation_maximisation.py", line 256, in expectation_maximisation
    df_params = linker._execute_sql_pipeline([df_comparison_vector_values])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 629, in _execute_sql_pipeline
    raise e
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 623, in _execute_sql_pipeline
    dataframe = self._sql_to_splink_dataframe_checking_cache(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 889, in _sql_to_splink_dataframe_checking_cache
    splink_dataframe = self._execute_sql_against_backend(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/splink/duckdb/linker.py", line 220, in _execute_sql_against_backend
    self._log_and_run_sql_execution(sql, templated_name, physical_name)
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 703, in _log_and_run_sql_execution
    raise SplinkException(
splink.exceptions.SplinkException: Error executing the following sql for table `__splink__m_u_counts`(__splink__m_u_counts_26a1779fb):
CREATE TABLE __splink__m_u_counts_26a1779fb AS
(
  WITH __splink__df_comparison_vectors AS (
    SELECT
      *
    FROM __splink__df_comparison_vectors_a7cd2b9d8
  ), __splink__df_match_weight_parts AS (
    SELECT
      "unique_id_l",
      "unique_id_r",
      gamma_given_name,
      CASE
        WHEN gamma_given_name = -1
        THEN CAST(1.0 AS DOUBLE)
        WHEN gamma_given_name = 4
        THEN CAST(111.89189187998129 AS DOUBLE)
        WHEN gamma_given_name = 3
        THEN CAST(81.88130481588395 AS DOUBLE)
        WHEN gamma_given_name = 2
        THEN CAST(92.09886357253461 AS DOUBLE)
        WHEN gamma_given_name = 1
        THEN CAST(48.06629076272102 AS DOUBLE)
        WHEN gamma_given_name = 0
        THEN CAST(4.009431336940681e-229 AS DOUBLE)
      END AS bf_given_name,
      CASE
        WHEN gamma_given_name = -1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_given_name = 4
        THEN (
          CASE
            WHEN NOT COALESCE("tf_given_name_l", "tf_given_name_r") IS NULL
            THEN POWER(
              CAST(0.0014492753623188406 AS DOUBLE) / NULLIF(
                (
                  CASE
                    WHEN COALESCE("tf_given_name_l", "tf_given_name_r") >= COALESCE("tf_given_name_r", "tf_given_name_l")
                    THEN COALESCE("tf_given_name_l", "tf_given_name_r")
                    ELSE COALESCE("tf_given_name_r", "tf_given_name_l")
                  END
                ),
                0
              ),
              CAST(1.0 AS DOUBLE)
            )
            ELSE CAST(1 AS DOUBLE)
          END
        )
        WHEN gamma_given_name = 3
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_given_name = 2
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_given_name = 1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_given_name = 0
        THEN CAST(1 AS DOUBLE)
      END AS bf_tf_adj_given_name,
      gamma_family_name,
      CASE
        WHEN gamma_family_name = -1
        THEN CAST(1.0 AS DOUBLE)
        WHEN gamma_family_name = 4
        THEN CAST(139.19707503507163 AS DOUBLE)
        WHEN gamma_family_name = 3
        THEN CAST(115.4071885370217 AS DOUBLE)
        WHEN gamma_family_name = 2
        THEN CAST(165.4130726624298 AS DOUBLE)
        WHEN gamma_family_name = 1
        THEN CAST(57.22386641775578 AS DOUBLE)
        WHEN gamma_family_name = 0
        THEN CAST(3.992587823932763e-229 AS DOUBLE)
      END AS bf_family_name,
      CASE
        WHEN gamma_family_name = -1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_family_name = 4
        THEN (
          CASE
            WHEN NOT COALESCE("tf_family_name_l", "tf_family_name_r") IS NULL
            THEN POWER(
              CAST(0.0011649825407125862 AS DOUBLE) / NULLIF(
                (
                  CASE
                    WHEN COALESCE("tf_family_name_l", "tf_family_name_r") >= COALESCE("tf_family_name_r", "tf_family_name_l")
                    THEN COALESCE("tf_family_name_l", "tf_family_name_r")
                    ELSE COALESCE("tf_family_name_r", "tf_family_name_l")
                  END
                ),
                0
              ),
              CAST(1.0 AS DOUBLE)
            )
            ELSE CAST(1 AS DOUBLE)
          END
        )
        WHEN gamma_family_name = 3
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_family_name = 2
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_family_name = 1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_family_name = 0
        THEN CAST(1 AS DOUBLE)
      END AS bf_tf_adj_family_name,
      gamma_postal_code,
      CASE
        WHEN gamma_postal_code = -1
        THEN CAST(1.0 AS DOUBLE)
        WHEN gamma_postal_code = 4
        THEN CAST(266.5650741350906 AS DOUBLE)
        WHEN gamma_postal_code = 3
        THEN CAST(3.970841325158504e-229 AS DOUBLE)
        WHEN gamma_postal_code = 2
        THEN CAST(1.259921049894873 AS DOUBLE)
        WHEN gamma_postal_code = 1
        THEN CAST(0.19842513149602492 AS DOUBLE)
        WHEN gamma_postal_code = 0
        THEN CAST(0.03125 AS DOUBLE)
      END AS bf_postal_code,
      gamma_street_address,
      CASE
        WHEN gamma_street_address = -1
        THEN CAST(1.0 AS DOUBLE)
        WHEN gamma_street_address = 1
        THEN CAST(259.14714720358563 AS DOUBLE)
        WHEN gamma_street_address = 0
        THEN CAST(0.9640980278331669 AS DOUBLE)
      END AS bf_street_address,
      CASE
        WHEN gamma_street_address = -1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_street_address = 1
        THEN (
          CASE
            WHEN NOT COALESCE("tf_street_address_l", "tf_street_address_r") IS NULL
            THEN POWER(
              CAST(0.00013905627143784185 AS DOUBLE) / NULLIF(
                (
                  CASE
                    WHEN COALESCE("tf_street_address_l", "tf_street_address_r") >= COALESCE("tf_street_address_r", "tf_street_address_l")
                    THEN COALESCE("tf_street_address_l", "tf_street_address_r")
                    ELSE COALESCE("tf_street_address_r", "tf_street_address_l")
                  END
                ),
                0
              ),
              CAST(1.0 AS DOUBLE)
            )
            ELSE CAST(1 AS DOUBLE)
          END
        )
        WHEN gamma_street_address = 0
        THEN CAST(1 AS DOUBLE)
      END AS bf_tf_adj_street_address,
      gamma_phone,
      CASE
        WHEN gamma_phone = -1
        THEN CAST(1.0 AS DOUBLE)
        WHEN gamma_phone = 1
        THEN CAST(1.7515202196470399 AS DOUBLE)
        WHEN gamma_phone = 0
        THEN CAST(0.5579969710188165 AS DOUBLE)
      END AS bf_phone,
      CASE
        WHEN gamma_phone = -1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_phone = 1
        THEN (
          CASE
            WHEN NOT COALESCE("tf_phone_l", "tf_phone_r") IS NULL
            THEN POWER(
              CAST(0.3703346620932604 AS DOUBLE) / NULLIF(
                (
                  CASE
                    WHEN COALESCE("tf_phone_l", "tf_phone_r") >= COALESCE("tf_phone_r", "tf_phone_l")
                    THEN COALESCE("tf_phone_l", "tf_phone_r")
                    ELSE COALESCE("tf_phone_r", "tf_phone_l")
                  END
                ),
                0
              ),
              CAST(1.0 AS DOUBLE)
            )
            ELSE CAST(1 AS DOUBLE)
          END
        )
        WHEN gamma_phone = 0
        THEN CAST(1 AS DOUBLE)
      END AS bf_tf_adj_phone,
      "ssn_l",
      "ssn_r",
      match_key
    FROM __splink__df_comparison_vectors
  ), __splink__df_predict AS (
    SELECT
      LOG(
        2,
        CAST(0.02611780678018568 AS DOUBLE) * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_postal_code * bf_street_address * bf_tf_adj_street_address * bf_phone * bf_tf_adj_phone
      ) AS match_weight,
      CASE
        WHEN bf_given_name = CAST('infinity' AS DOUBLE)
        OR bf_tf_adj_given_name = CAST('infinity' AS DOUBLE)
        OR bf_family_name = CAST('infinity' AS DOUBLE)
        OR bf_tf_adj_family_name = CAST('infinity' AS DOUBLE)
        OR bf_postal_code = CAST('infinity' AS DOUBLE)
        OR bf_street_address = CAST('infinity' AS DOUBLE)
        OR bf_tf_adj_street_address = CAST('infinity' AS DOUBLE)
        OR bf_phone = CAST('infinity' AS DOUBLE)
        OR bf_tf_adj_phone = CAST('infinity' AS DOUBLE)
        THEN 1.0
        ELSE (
          CAST(0.02611780678018568 AS DOUBLE) * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_postal_code * bf_street_address * bf_tf_adj_street_address * bf_phone * bf_tf_adj_phone
        ) / NULLIF(
          (
            1 + (
              CAST(0.02611780678018568 AS DOUBLE) * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_postal_code * bf_street_address * bf_tf_adj_street_address * bf_phone * bf_tf_adj_phone
            )
          ),
          0
        )
      END AS match_probability,
      "unique_id_l",
      "unique_id_r",
      gamma_given_name,
      gamma_family_name,
      gamma_postal_code,
      gamma_street_address,
      gamma_phone,
      "ssn_l",
      "ssn_r",
      match_key
    FROM __splink__df_match_weight_parts
    ORDER BY
      1 NULLS LAST
  )
  SELECT
    gamma_given_name AS comparison_vector_value,
    SUM(match_probability * 1) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) AS u_count,
    'given_name' AS output_column_name
  FROM __splink__df_predict
  GROUP BY
    gamma_given_name
  UNION ALL
  SELECT
    gamma_family_name AS comparison_vector_value,
    SUM(match_probability * 1) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) AS u_count,
    'family_name' AS output_column_name
  FROM __splink__df_predict
  GROUP BY
    gamma_family_name
  UNION ALL
  SELECT
    gamma_postal_code AS comparison_vector_value,
    SUM(match_probability * 1) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) AS u_count,
    'postal_code' AS output_column_name
  FROM __splink__df_predict
  GROUP BY
    gamma_postal_code
  UNION ALL
  SELECT
    gamma_street_address AS comparison_vector_value,
    SUM(match_probability * 1) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) AS u_count,
    'street_address' AS output_column_name
  FROM __splink__df_predict
  GROUP BY
    gamma_street_address
  UNION ALL
  SELECT
    gamma_phone AS comparison_vector_value,
    SUM(match_probability * 1) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) AS u_count,
    'phone' AS output_column_name
  FROM __splink__df_predict
  GROUP BY
    gamma_phone
  UNION ALL
  SELECT
    0 AS comparison_vector_value,
    SUM(match_probability * 1) / NULLIF(SUM(1), 0) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) / NULLIF(SUM(1), 0) AS u_count,
    '_probability_two_random_records_match' AS output_column_name
  FROM __splink__df_predict
)

Error was: Out of Range Error: cannot take logarithm of zero

To Reproduce
Use the above input with this CLI call:
python3.11 cli/ecqm_dedupe.py dedupe-data --fmt CSV /tmp/x.csv /tmp/out.csv

Expected behavior
The CLI should output the deduplicated data normally.

Actual behavior
The tool throws an out of range error

@IsaacMilarky
Copy link
Collaborator Author

The problem that was described in the issue that was filed was that the data that was provided did not have the proper fields that the tool was set to look for (SSN specifically). This is a problem for how the tool works on the back-end because the tool uses a strategy of using both deterministic and probabilistic rules to find duplicates and it can’t do that if none of the data satisfies any of the deterministic rules.

To solve this, the user must change the deterministic rules so that they match at least 5 data points in order to be accepted. This is done through editing splink_settings.json and changing the fields to ones that are present in the data.

@cage77
Copy link

cage77 commented Dec 6, 2024

I think i've hit the same bug.
For me i have a file:
cage@yeti cli]$ wc -l /tmp/02.csv
14949 /tmp/02.csv
The file contains following fields (can't share patient data):
"unique_id","given_name","family_name","gender","birth_date"

So has roughly 15000 lines. When i run it , i get an out of range error (this is the entire output):
-------------------- Begin output ----
cage@yeti cli]$ python ecqm_dedupe.py dedupe-data --fmt CSV /tmp/02.csv /tmp/result_02.csv
Stats for nerds:
blocking_rule row_count cumulative_rows cartesian match_key start
0 l."birth_date" = r."birth_date" 6366 6366 119312628 0 0
1 (l."ssn" = r."ssn") AND (l."birth_date" = r."birth_date") 0 6366 119312628 1 6366
2 l."phone" = r."phone" 48079 54445 119312628 2 6366
----- Estimating u probabilities using random sampling -----
u probability not trained for street_address0 - Exact match on street_address0 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
u probability not trained for street_address1 - Exact match on street_address1 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
- street_address0 (some u values are not trained, no m values are trained).
- postal_code0 (some u values are not trained, no m values are trained).
- street_address1 (some u values are not trained, no m values are trained).
- postal_code1 (some u values are not trained, no m values are trained).
- phone (no m values are trained).
- given_name (no m values are trained).
- family_name (no m values are trained).
- birth_date (no m values are trained).
/home/cage/WORK/paulo/dedupli2/cli

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."ssn" = r."ssn"

Parameter estimates will be made for the following comparison(s):
- street_address0
- postal_code0
- street_address1
- postal_code1
- phone
- given_name
- family_name
- birth_date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:

WARNING:
Level Exact match on street_address0 on comparison street_address0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on street_address1 on comparison street_address1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Jaro-Winkler distance of given_name >= 0.7 on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level DamerauLevenshtein distance <= 1 on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 month on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 10 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison birth_date not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 1 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.000362 in probability_two_random_records_match

EM converged after 2 iterations
m probability not trained for street_address0 - Exact match on street_address0 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for street_address1 - Exact match on street_address1 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Jaro-Winkler distance of given_name >= 0.7 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - DamerauLevenshtein distance <= 1 (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 year (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 10 year (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
- street_address0 (some u values are not trained, some m values are not trained).
- postal_code0 (some u values are not trained, some m values are not trained).
- street_address1 (some u values are not trained, some m values are not trained).
- postal_code1 (some u values are not trained, some m values are not trained).
- given_name (some m values are not trained).
- family_name (some m values are not trained).
- birth_date (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."birth_date" = r."birth_date"

Parameter estimates will be made for the following comparison(s):
- street_address0
- postal_code0
- street_address1
- postal_code1
- phone
- given_name
- family_name

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- birth_date

WARNING:
Level Exact match on sector on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value

Iteration 1: Largest change in params was -0.922 in the m_probability of street_address1, level Exact match on street_address1
Iteration 2: Largest change in params was 0.0657 in the m_probability of given_name, level Jaro-Winkler distance of given_name >= 0.92
Iteration 3: Largest change in params was -0.0324 in the m_probability of given_name, level All other comparisons
Iteration 4: Largest change in params was -0.0175 in the m_probability of given_name, level All other comparisons
Iteration 5: Largest change in params was -0.00984 in the m_probability of given_name, level All other comparisons

EM converged after 5 iterations
m probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
- street_address0 (some u values are not trained).
- postal_code0 (some u values are not trained, some m values are not trained).
- street_address1 (some u values are not trained).
- postal_code1 (some u values are not trained, some m values are not trained).
- birth_date (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
(l."street_address0" = r."street_address0") AND (l."postal_code0" = r."postal_code0")

Parameter estimates will be made for the following comparison(s):
- street_address1
- postal_code1
- phone
- given_name
- family_name
- birth_date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- street_address0
- postal_code0

WARNING:
Level Exact match on street_address1 on comparison street_address1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on given_name on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level Jaro-Winkler distance of given_name >= 0.88 on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level Exact match on family_name on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level Jaro-Winkler distance of family_name >= 0.88 on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level DamerauLevenshtein distance <= 1 on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 month on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 10 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison birth_date not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.294 in probability_two_random_records_match
Iteration 2: Largest change in params was 1.32e-09 in probability_two_random_records_match

EM converged after 2 iterations
m probability not trained for street_address1 - Exact match on street_address1 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Exact match on given_name (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Jaro-Winkler distance of given_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - Exact match on family_name (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - Jaro-Winkler distance of family_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - DamerauLevenshtein distance <= 1 (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 year (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 10 year (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
- street_address0 (some u values are not trained).
- postal_code0 (some u values are not trained, some m values are not trained).
- street_address1 (some u values are not trained).
- postal_code1 (some u values are not trained, some m values are not trained).
- birth_date (some m values are not trained).
Blocking time: 0.07 seconds
Traceback (most recent call last):
File "/home/cage/.local/lib/python3.12/site-packages/splink/internals/database_api.py", line 63, in _log_and_run_sql_execution
return self._execute_sql_against_backend(final_sql)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cage/.local/lib/python3.12/site-packages/splink/internals/duckdb/database_api.py", line 104, in _execute_sql_against_backend
return self._con.sql(final_sql)
^^^^^^^^^^^^^^^^^^^^^^^^
duckdb.duckdb.OutOfRangeException: Out of Range Error: cannot take logarithm of zero

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/cage/WORK/paulo/dedupli2/cli/ecqm_dedupe.py", line 124, in
cli()
File "/usr/lib/python3.12/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cage/WORK/paulo/dedupli2/cli/deduplifhirLib/utils.py", line 241, in wrapper
return func(*args,**kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/home/cage/WORK/paulo/dedupli2/cli/ecqm_dedupe.py", line 44, in dedupe_data
pairwise_predictions = linker.inference.predict()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cage/.local/lib/python3.12/site-packages/splink/internals/linker_components/inference.py", line 281, in predict
predictions = self._linker._db_api.sql_pipeline_to_splink_dataframe(pipeline)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cage/.local/lib/python3.12/site-packages/splink/internals/database_api.py", line 200, in sql_pipeline_to_splink_dataframe
splink_dataframe = self.sql_to_splink_dataframe_checking_cache(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cage/.local/lib/python3.12/site-packages/splink/internals/database_api.py", line 171, in sql_to_splink_dataframe_checking_cache
splink_dataframe = self._sql_to_splink_dataframe(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cage/.local/lib/python3.12/site-packages/splink/internals/database_api.py", line 93, in _sql_to_splink_dataframe
spark_df = self._log_and_run_sql_execution(sql, templated_name, physical_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cage/.local/lib/python3.12/site-packages/splink/internals/database_api.py", line 75, in _log_and_run_sql_execution
raise SplinkException(
splink.internals.exceptions.SplinkException: Error executing the following sql for table __splink__df_predict(__splink__df_predict_bf926a097):
CREATE TABLE __splink__df_predict_bf926a097 AS
WITH __splink__blocked_id_pairs AS (
SELECT
*
FROM __splink__blocked_id_pairs_95cc4436d
), __splink__df_concat_with_tf AS (
SELECT
*
FROM __splink__df_concat_with_tf_a4716ce0f
), blocked_with_cols AS (
SELECT
"l"."unique_id" AS "unique_id_l",
"r"."unique_id" AS "unique_id_r",
"l"."street_address0" AS "street_address0_l",
"r"."street_address0" AS "street_address0_r",
"l"."postal_code0" AS "postal_code0_l",
"r"."postal_code0" AS "postal_code0_r",
"l"."street_address1" AS "street_address1_l",
"r"."street_address1" AS "street_address1_r",
"l"."postal_code1" AS "postal_code1_l",
"r"."postal_code1" AS "postal_code1_r",
"l"."phone" AS "phone_l",
"r"."phone" AS "phone_r",
"l"."tf_phone" AS "tf_phone_l",
"r"."tf_phone" AS "tf_phone_r",
"l"."given_name" AS "given_name_l",
"r"."given_name" AS "given_name_r",
"l"."tf_given_name" AS "tf_given_name_l",
"r"."tf_given_name" AS "tf_given_name_r",
"l"."family_name" AS "family_name_l",
"r"."family_name" AS "family_name_r",
"l"."tf_family_name" AS "tf_family_name_l",
"r"."tf_family_name" AS "tf_family_name_r",
"l"."birth_date" AS "birth_date_l",
"r"."birth_date" AS "birth_date_r",
"l"."ssn" AS "ssn_l",
"r"."ssn" AS "ssn_r",
b.match_key
FROM __splink__df_concat_with_tf AS l
INNER JOIN __splink__blocked_id_pairs AS b
ON l."unique_id" = b.join_key_l
INNER JOIN __splink__df_concat_with_tf AS r
ON r."unique_id" = b.join_key_r
), __splink__df_comparison_vectors AS (
SELECT
"unique_id_l",
"unique_id_r",
"street_address0_l",
"street_address0_r",
CASE
WHEN "street_address0_l" IS NULL OR "street_address0_r" IS NULL
THEN -1
WHEN "street_address0_l" = "street_address0_r"
THEN 1
ELSE 0
END AS gamma_street_address0,
"postal_code0_l",
"postal_code0_r",
CASE
WHEN "postal_code0_l" IS NULL OR "postal_code0_r" IS NULL
THEN -1
WHEN "postal_code0_l" = "postal_code0_r"
THEN 4
WHEN NULLIF(REGEXP_EXTRACT("postal_code0_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(REGEXP_EXTRACT("postal_code0_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
THEN 3
WHEN NULLIF(REGEXP_EXTRACT("postal_code0_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(REGEXP_EXTRACT("postal_code0_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')
THEN 2
WHEN NULLIF(REGEXP_EXTRACT("postal_code0_l", '^[A-Za-z]{1,2}', 0), '') = NULLIF(REGEXP_EXTRACT("postal_code0_r", '^[A-Za-z]{1,2}', 0), '')
THEN 1
ELSE 0
END AS gamma_postal_code0,
"street_address1_l",
"street_address1_r",
CASE
WHEN "street_address1_l" IS NULL OR "street_address1_r" IS NULL
THEN -1
WHEN "street_address1_l" = "street_address1_r"
THEN 1
ELSE 0
END AS gamma_street_address1,
"postal_code1_l",
"postal_code1_r",
CASE
WHEN "postal_code1_l" IS NULL OR "postal_code1_r" IS NULL
THEN -1
WHEN "postal_code1_l" = "postal_code1_r"
THEN 4
WHEN NULLIF(REGEXP_EXTRACT("postal_code1_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(REGEXP_EXTRACT("postal_code1_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')
THEN 3
WHEN NULLIF(REGEXP_EXTRACT("postal_code1_l", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(REGEXP_EXTRACT("postal_code1_r", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')
THEN 2
WHEN NULLIF(REGEXP_EXTRACT("postal_code1_l", '^[A-Za-z]{1,2}', 0), '') = NULLIF(REGEXP_EXTRACT("postal_code1_r", '^[A-Za-z]{1,2}', 0), '')
THEN 1
ELSE 0
END AS gamma_postal_code1,
"phone_l",
"phone_r",
CASE
WHEN "phone_l" IS NULL OR "phone_r" IS NULL
THEN -1
WHEN "phone_l" = "phone_r"
THEN 1
ELSE 0
END AS gamma_phone,
"tf_phone_l",
"tf_phone_r",
"given_name_l",
"given_name_r",
CASE
WHEN "given_name_l" IS NULL OR "given_name_r" IS NULL
THEN -1
WHEN "given_name_l" = "given_name_r"
THEN 4
WHEN JARO_WINKLER_SIMILARITY("given_name_l", "given_name_r") >= 0.92
THEN 3
WHEN JARO_WINKLER_SIMILARITY("given_name_l", "given_name_r") >= 0.88
THEN 2
WHEN JARO_WINKLER_SIMILARITY("given_name_l", "given_name_r") >= 0.7
THEN 1
ELSE 0
END AS gamma_given_name,
"tf_given_name_l",
"tf_given_name_r",
"family_name_l",
"family_name_r",
CASE
WHEN "family_name_l" IS NULL OR "family_name_r" IS NULL
THEN -1
WHEN "family_name_l" = "family_name_r"
THEN 4
WHEN JARO_WINKLER_SIMILARITY("family_name_l", "family_name_r") >= 0.92
THEN 3
WHEN JARO_WINKLER_SIMILARITY("family_name_l", "family_name_r") >= 0.88
THEN 2
WHEN JARO_WINKLER_SIMILARITY("family_name_l", "family_name_r") >= 0.7
THEN 1
ELSE 0
END AS gamma_family_name,
"tf_family_name_l",
"tf_family_name_r",
"birth_date_l",
"birth_date_r",
CASE
WHEN TRY_STRPTIME("birth_date_l", '%Y-%m-%d') IS NULL
OR TRY_STRPTIME("birth_date_r", '%Y-%m-%d') IS NULL
THEN -1
WHEN "birth_date_l" = "birth_date_r"
THEN 5
WHEN DAMERAU_LEVENSHTEIN("birth_date_l", "birth_date_r") <= 1
THEN 4
WHEN ABS(
TIME_TO_UNIX(TRY_STRPTIME("birth_date_l", '%Y-%m-%d')) - TIME_TO_UNIX(TRY_STRPTIME("birth_date_r", '%Y-%m-%d'))
) <= 2629800.0
THEN 3
WHEN ABS(
TIME_TO_UNIX(TRY_STRPTIME("birth_date_l", '%Y-%m-%d')) - TIME_TO_UNIX(TRY_STRPTIME("birth_date_r", '%Y-%m-%d'))
) <= 31557600.0
THEN 2
WHEN ABS(
TIME_TO_UNIX(TRY_STRPTIME("birth_date_l", '%Y-%m-%d')) - TIME_TO_UNIX(TRY_STRPTIME("birth_date_r", '%Y-%m-%d'))
) <= 315576000.0
THEN 1
ELSE 0
END AS gamma_birth_date,
"ssn_l",
"ssn_r",
match_key
FROM blocked_with_cols
), __splink__df_match_weight_parts AS (
SELECT
"unique_id_l",
"unique_id_r",
"street_address0_l",
"street_address0_r",
gamma_street_address0,
CASE
WHEN gamma_street_address0 = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_street_address0 = 1
THEN CAST(42.15231680942197 AS DOUBLE)
WHEN gamma_street_address0 = 0
THEN CAST(0.9804469233550045 AS DOUBLE)
END AS bf_street_address0,
"postal_code0_l",
"postal_code0_r",
gamma_postal_code0,
CASE
WHEN gamma_postal_code0 = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_postal_code0 = 4
THEN CAST(570.0 AS DOUBLE)
WHEN gamma_postal_code0 = 3
THEN CAST(8.0 AS DOUBLE)
WHEN gamma_postal_code0 = 2
THEN CAST(1.259921049894873 AS DOUBLE)
WHEN gamma_postal_code0 = 1
THEN CAST(0.19842513149602492 AS DOUBLE)
WHEN gamma_postal_code0 = 0
THEN CAST(4.512634359622608e-169 AS DOUBLE)
END AS bf_postal_code0,
"street_address1_l",
"street_address1_r",
gamma_street_address1,
CASE
WHEN gamma_street_address1 = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_street_address1 = 1
THEN CAST(30.108800613284384 AS DOUBLE)
WHEN gamma_street_address1 = 0
THEN CAST(1.0 AS DOUBLE)
END AS bf_street_address1,
"postal_code1_l",
"postal_code1_r",
gamma_postal_code1,
CASE
WHEN gamma_postal_code1 = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_postal_code1 = 4
THEN CAST(570.0 AS DOUBLE)
WHEN gamma_postal_code1 = 3
THEN CAST(8.0 AS DOUBLE)
WHEN gamma_postal_code1 = 2
THEN CAST(1.259921049894873 AS DOUBLE)
WHEN gamma_postal_code1 = 1
THEN CAST(0.19842513149602492 AS DOUBLE)
WHEN gamma_postal_code1 = 0
THEN CAST(4.512634359622608e-169 AS DOUBLE)
END AS bf_postal_code1,
"phone_l",
"phone_r",
gamma_phone,
CASE
WHEN gamma_phone = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_phone = 1
THEN CAST(1.8759978269512303 AS DOUBLE)
WHEN gamma_phone = 0
THEN CAST(0.47390934295768505 AS DOUBLE)
END AS bf_phone,
CASE
WHEN gamma_phone = -1
THEN CAST(1 AS DOUBLE)
WHEN gamma_phone = 1
THEN (
CASE
WHEN NOT COALESCE("tf_phone_l", "tf_phone_r") IS NULL
THEN POWER(
CAST(0.375219298245614 AS DOUBLE) / NULLIF(
(
CASE
WHEN COALESCE("tf_phone_l", "tf_phone_r") >= COALESCE("tf_phone_r", "tf_phone_l")
THEN COALESCE("tf_phone_l", "tf_phone_r")
ELSE COALESCE("tf_phone_r", "tf_phone_l")
END
),
0
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_phone = 0
THEN CAST(1 AS DOUBLE)
END AS bf_tf_adj_phone,
"given_name_l",
"given_name_r",
gamma_given_name,
CASE
WHEN gamma_given_name = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_given_name = 4
THEN CAST(120.74264446644217 AS DOUBLE)
WHEN gamma_given_name = 3
THEN CAST(1055.0528541226215 AS DOUBLE)
WHEN gamma_given_name = 2
THEN CAST(112.3840298851113 AS DOUBLE)
WHEN gamma_given_name = 1
THEN CAST(4.332150405172895 AS DOUBLE)
WHEN gamma_given_name = 0
THEN CAST(0.05281003937169819 AS DOUBLE)
END AS bf_given_name,
CASE
WHEN gamma_given_name = -1
THEN CAST(1 AS DOUBLE)
WHEN gamma_given_name = 4
THEN (
CASE
WHEN NOT COALESCE("tf_given_name_l", "tf_given_name_r") IS NULL
THEN POWER(
CAST(0.0009826045494528983 AS DOUBLE) / NULLIF(
(
CASE
WHEN COALESCE("tf_given_name_l", "tf_given_name_r") >= COALESCE("tf_given_name_r", "tf_given_name_l")
THEN COALESCE("tf_given_name_l", "tf_given_name_r")
ELSE COALESCE("tf_given_name_r", "tf_given_name_l")
END
),
0
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_given_name = 3
THEN CAST(1 AS DOUBLE)
WHEN gamma_given_name = 2
THEN CAST(1 AS DOUBLE)
WHEN gamma_given_name = 1
THEN CAST(1 AS DOUBLE)
WHEN gamma_given_name = 0
THEN CAST(1 AS DOUBLE)
END AS bf_tf_adj_given_name,
"family_name_l",
"family_name_r",
gamma_family_name,
CASE
WHEN gamma_family_name = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_family_name = 4
THEN CAST(19.803312588130648 AS DOUBLE)
WHEN gamma_family_name = 3
THEN CAST(264.14964416057666 AS DOUBLE)
WHEN gamma_family_name = 2
THEN CAST(32.72271588798214 AS DOUBLE)
WHEN gamma_family_name = 1
THEN CAST(1.8975119539228475 AS DOUBLE)
WHEN gamma_family_name = 0
THEN CAST(0.04265123962992186 AS DOUBLE)
END AS bf_family_name,
CASE
WHEN gamma_family_name = -1
THEN CAST(1 AS DOUBLE)
WHEN gamma_family_name = 4
THEN (
CASE
WHEN NOT COALESCE("tf_family_name_l", "tf_family_name_r") IS NULL
THEN POWER(
CAST(0.006243577412221209 AS DOUBLE) / NULLIF(
(
CASE
WHEN COALESCE("tf_family_name_l", "tf_family_name_r") >= COALESCE("tf_family_name_r", "tf_family_name_l")
THEN COALESCE("tf_family_name_l", "tf_family_name_r")
ELSE COALESCE("tf_family_name_r", "tf_family_name_l")
END
),
0
),
CAST(1.0 AS DOUBLE)
)
ELSE CAST(1 AS DOUBLE)
END
)
WHEN gamma_family_name = 3
THEN CAST(1 AS DOUBLE)
WHEN gamma_family_name = 2
THEN CAST(1 AS DOUBLE)
WHEN gamma_family_name = 1
THEN CAST(1 AS DOUBLE)
WHEN gamma_family_name = 0
THEN CAST(1 AS DOUBLE)
END AS bf_tf_adj_family_name,
"birth_date_l",
"birth_date_r",
gamma_birth_date,
CASE
WHEN gamma_birth_date = -1
THEN CAST(1.0 AS DOUBLE)
WHEN gamma_birth_date = 5
THEN CAST(17629.130434782608 AS DOUBLE)
WHEN gamma_birth_date = 4
THEN CAST(6.216481410502114 AS DOUBLE)
WHEN gamma_birth_date = 3
THEN CAST(4.137097185613472 AS DOUBLE)
WHEN gamma_birth_date = 2
THEN CAST(0.3010059017853831 AS DOUBLE)
WHEN gamma_birth_date = 1
THEN CAST(0.03201687689304088 AS DOUBLE)
WHEN gamma_birth_date = 0
THEN CAST(0.015376092736138937 AS DOUBLE)
END AS bf_birth_date,
"ssn_l",
"ssn_r",
match_key
FROM __splink__df_comparison_vectors
)
SELECT
LOG(
2,
CAST(0.00010001000100010001 AS DOUBLE) * bf_street_address0 * bf_postal_code0 * bf_street_address1 * bf_postal_code1 * bf_phone * bf_tf_adj_phone * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_birth_date
) AS match_weight,
CASE
WHEN bf_street_address0 = CAST('infinity' AS DOUBLE)
OR bf_postal_code0 = CAST('infinity' AS DOUBLE)
OR bf_street_address1 = CAST('infinity' AS DOUBLE)
OR bf_postal_code1 = CAST('infinity' AS DOUBLE)
OR bf_phone = CAST('infinity' AS DOUBLE)
OR bf_tf_adj_phone = CAST('infinity' AS DOUBLE)
OR bf_given_name = CAST('infinity' AS DOUBLE)
OR bf_tf_adj_given_name = CAST('infinity' AS DOUBLE)
OR bf_family_name = CAST('infinity' AS DOUBLE)
OR bf_tf_adj_family_name = CAST('infinity' AS DOUBLE)
OR bf_birth_date = CAST('infinity' AS DOUBLE)
THEN 1.0
ELSE (
CAST(0.00010001000100010001 AS DOUBLE) * bf_street_address0 * bf_postal_code0 * bf_street_address1 * bf_postal_code1 * bf_phone * bf_tf_adj_phone * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_birth_date
) / NULLIF(
(
1 + (
CAST(0.00010001000100010001 AS DOUBLE) * bf_street_address0 * bf_postal_code0 * bf_street_address1 * bf_postal_code1 * bf_phone * bf_tf_adj_phone * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_birth_date
)
),
0
)
END AS match_probability,
"unique_id_l",
"unique_id_r",
"street_address0_l",
"street_address0_r",
gamma_street_address0,
"postal_code0_l",
"postal_code0_r",
gamma_postal_code0,
"street_address1_l",
"street_address1_r",
gamma_street_address1,
"postal_code1_l",
"postal_code1_r",
gamma_postal_code1,
"phone_l",
"phone_r",
gamma_phone,
"given_name_l",
"given_name_r",
gamma_given_name,
"family_name_l",
"family_name_r",
gamma_family_name,
"birth_date_l",
"birth_date_r",
gamma_birth_date,
"ssn_l",
"ssn_r",
match_key
FROM __splink__df_match_weight_parts

Error was: Out of Range Error: cannot take logarithm of zero

---------------- End dump data ------
Because initially i was thinking there is some data that is wrong, i tried to split the file in 2, to see which record is creating this error:
cage@yeti cli]$ wc -l /tmp/0201.csv
9998 /tmp/0201.csv
cage@yeti cli]$ wc -l /tmp/0202.csv
4952 /tmp/0202.csv

When i run each of this file separately, i get no error - the script is generating the result with no error. It looks like there might be some limit of records that can be checked for duplication?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dedupliFHIR DedupliFHIR repo tickets
Projects
None yet
Development

No branches or pull requests

3 participants