
Conversation

@RobinL RobinL commented Dec 17, 2025

Summary:

We now register input dataframes one at a time (turning them into SplinkDataFrames) before passing them into other Splink functions

Example:

in_1 = db_api.register("df_1")  # source_dataset_name="ab1" is optional
in_2 = db_api.register("df_2")

count_comparisons_from_blocking_rule(
    [in_1, in_2],
    blocking_rule=block_on("first_name"),
    link_type="link_and_dedupe",
)

linker = Linker([in_1, in_2], settings, db_api)

Key files changed

In bold are the files with significant changes to the logic. The diff in the core logic is actually very small; it just results in lots of tiny updates, mostly to tests

  • blocking_analysis.py
  • completeness.py
  • database_api.py
  • linker.py
  • profile_data.py
  • splink_dataframe.py
  • splinkdataframe_utils.py
  • vertically_concatenate.py

Everything else is updating tests

Table registration logic

The main complexity in this PR is getting the table registration logic right and consistent with Splink 4.

Note that when autonaming, the string in the source_dataset column should be deterministic, because it controls which rows appear on the left and right hand sides of comparisons, according to logic like:

where l."source_dataset" || '-__-' || l."unique_id" < r."source_dataset" || '-__-' || r."unique_id"

Splink 4

  • When passing tables to the linker, the linker gave them sequential names for the source_dataset column here
  • When passing tables to other functions (such as blocking analysis), the source_dataset was set using ascii_uid (see here)

This PR changes this slightly to make it always sequential/deterministic. This is easier now that the DB API 'remembers' what tables have been registered with it.
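
A minimal sketch of the idea (hypothetical code, not the actual database_api.py implementation): because the DB API tracks registrations, the nth registration can always receive the same name:

# Hypothetical sketch of deterministic sequential autonaming; the real
# logic lives in database_api.py / splink_dataframe.py
class DBAPISketch:
    def __init__(self):
        self._registered = {}  # autoname -> table

    def register(self, table, source_dataset_name=None):
        if source_dataset_name is None:
            # Deterministic: depends only on how many tables came before,
            # unlike the random ascii_uid naming used in Splink 4
            source_dataset_name = f"__splink__input_table_{len(self._registered)}"
        self._registered[source_dataset_name] = table
        return source_dataset_name

db = DBAPISketch()
assert db.register("df_tbl") == "__splink__input_table_0"
assert db.register("df_tbl") == "__splink__input_table_1"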

Splink 4: table names provided to Splink as strings (i.e. referring to a table that already exists in the db)

If we pass:
linker = Linker(["df_tbl", "df_tbl"], settings, db_api)

then from linker functions we get:

__splink__df_concat as (
            select '__splink__input_table_0' as source_dataset, ...
            from df_tbl
             UNION ALL 
            select '__splink__input_table_1' as source_dataset, ...
            from df_tbl
            ), 

This is because the linker, when handling table registrations, assigns the aliases __splink__input_table_0, __splink__input_table_1. That is, it's the linker that does this, not the DatabaseAPI (here).

But from blocking analysis functions we get:
count_comparisons_from_blocking_rule(table_or_tables=["df_tbl", "df_tbl"], ...)

__splink__df_concat as (
            select
            '__splink__luz4riww' as source_dataset, ...
            from df_tbl
             UNION ALL 
            select
            '__splink__7q2uiw3y' as source_dataset, ...
            from df_tbl
            ), 

Splink 4: tables provided to Splink as pandas dataframes or similar

If we pass:
linker = Linker([df, df], settings, db_api)

Then from linker functions we get:

__splink__df_concat as (
            select
            '__splink__input_table_0' as source_dataset, ...
            from __splink__input_table_0
             UNION ALL 
            select
            '__splink__input_table_1' as source_dataset, ... 
            from __splink__input_table_1
            ), 

and from blocking analysis functions we get:

__splink__df_concat as (
            select
            '__splink__ymc14n5r' as source_dataset, ...
            from __splink__ymc14n5r
             UNION ALL 
            select
            '__splink__2dwkxy2e' as source_dataset, ...
            from __splink__2dwkxy2e
            ), 
Testing scripts

Before:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
import duckdb
from splink.blocking_analysis import count_comparisons_from_blocking_rule

con = duckdb.connect(":memory:")
db_api = DuckDBAPI(connection=con)

df = splink_datasets.fake_1000

df_1 = df[df.index % 2 == 0].copy()
df_2 = df[df.index % 2 == 1].copy()
con.register("df_1_tbl", df_1)
con.register("df_2_tbl", df_2)
con.register("df_tbl", df)
# con.sql("show tables").show(max_width=1000)
# con.table("df_1_tbl").show()


settings = SettingsCreator(
    link_type="link_and_dedupe",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
    ],
)

# linker = Linker(["df_tbl", "df_tbl"], settings, db_api)
linker = Linker([df, df], settings, db_api)
print(linker._input_tables_dict)


import logging

logging.basicConfig(format="%(message)s")
logging.getLogger("splink").setLevel(1)
# linker.inference.predict()

count_comparisons_from_blocking_rule(
    table_or_tables=[df, df],
    blocking_rule=block_on("first_name"),
    db_api=db_api,
    link_type="link_and_dedupe",
)


# count_comparisons_from_blocking_rule(
#     table_or_tables=["df_tbl", "df_tbl"],
#     blocking_rule=block_on("first_name"),
#     db_api=db_api,
#     link_type="link_and_dedupe",
# )

After:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
import duckdb
from splink.blocking_analysis import count_comparisons_from_blocking_rule

con = duckdb.connect(":memory:")
db_api = DuckDBAPI(connection=con)

df = splink_datasets.fake_1000

df_1 = df[df.index % 2 == 0].copy()
df_2 = df[df.index % 2 == 1].copy()
con.register("df_1_tbl", df_1)
con.register("df_2_tbl", df_2)
con.register("df_tbl", df)
# con.sql("show tables").show(max_width=1000)
# con.table("df_1_tbl").show()


settings = SettingsCreator(
    link_type="link_and_dedupe",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
    ],
)

# in_1 = db_api.register("df_tbl", source_dataset_name="ab1")
# in_2 = db_api.register("df_tbl", source_dataset_name="de2")

# in_1 = db_api.register(df)
# in_2 = db_api.register(df)

in_1 = db_api.register("df_tbl")
in_2 = db_api.register("df_tbl")

linker = Linker([in_1, in_2], settings, db_api)
print(linker._input_tables_dict)


import logging

logging.basicConfig(format="%(message)s")
logging.getLogger("splink").setLevel(1)
linker.inference.predict()

count_comparisons_from_blocking_rule(
    [in_1, in_2],
    blocking_rule=block_on("first_name"),
    link_type="link_and_dedupe",
)

Why can't we set source_dataset_name to the table name when the input is a physical table string?

Because the user can do this:

in_1 = db_api.register("df_tbl")
in_2 = db_api.register("df_tbl")

linker = Linker([in_1, in_2], settings, db_api)

and if they do, we want the source_dataset_name values to be different; otherwise you get duplicated unique_ids across rows.
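
To make the problem concrete, a small illustration (plain Python, not Splink code): if both registrations got the same source_dataset_name, the composite row key (source_dataset, unique_id) would collide:

# Illustration only: rows are identified by (source_dataset, unique_id)
same_name = [("df_tbl", 1), ("df_tbl", 2)] + [("df_tbl", 1), ("df_tbl", 2)]
assert len(set(same_name)) == 2  # 4 rows but only 2 distinct keys

# Distinct autonames keep the keys unique
distinct = [("__splink__input_table_0", 1), ("__splink__input_table_0", 2)] + [
    ("__splink__input_table_1", 1),
    ("__splink__input_table_1", 2),
]
assert len(set(distinct)) == 4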

Splink 4: What if we pass the same source_dataset_name to two input dataframes?

linker = Linker([df, df], settings, db_api, input_table_aliases=["ab1", "ab1"])
linker.inference.predict()

This causes the error:
Binder Error: Values list "l" does not have a column named "source_dataset"

This is because linker._input_tables_dict is a dict keyed by input alias (see register_multiple_tables and where we create the dict).

But in the above, Splink 'thinks' there is only one input table because the alias is reused.
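
The collision is just ordinary dict-key overwriting, e.g.:

# Illustration only: a repeated alias collapses two inputs into one dict entry
aliases = ["ab1", "ab1"]
tables = ["df (first copy)", "df (second copy)"]
input_tables_dict = dict(zip(aliases, tables))
print(input_tables_dict)  # {'ab1': 'df (second copy)'} -- one input table survives

and with what Splink then believes is a single input table, the source_dataset column is presumably never added to the concatenated input, hence the Binder Error above.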

@RobinL RobinL changed the title (WIP) Splinkdataframes everywhere Register SplinkDataFrames using db_api before passing to Splink functions Dec 20, 2025
@RobinL RobinL changed the title Register SplinkDataFrames using db_api before passing to Splink functions Splink 5: Register SplinkDataFrames using db_api before passing to Splink functions Dec 20, 2025
@ADBond ADBond left a comment

Great, I think the core of how this works looks good, and makes things a lot simpler.

Haven't looked in detail at the tests or notebooks, but they look fine from a skim.

@RobinL RobinL merged commit 8f4a5c0 into splink_5_dev Jan 12, 2026
31 checks passed
@RobinL RobinL deleted the splinkdataframes_everywhere branch January 12, 2026 15:15