
Conversation

@RobinL RobinL commented Dec 17, 2025

Summary:

We now register input dataframes one at a time (turning them into SplinkDataFrames) before passing them into other Splink functions

Example:

in_1 = db_api.register("df_1")  # source_dataset_name="ab1" is optional
in_2 = db_api.register("df_2")

count_comparisons_from_blocking_rule(
    [in_1, in_2],
    blocking_rule=block_on("first_name"),
    link_type="link_and_dedupe",
)

linker = Linker([in_1, in_2], settings, db_api)

Key files changed

In bold are the files with significant changes to the logic. The diff in the core logic is actually very small; it just results in lots of tiny updates, mostly to tests

  • blocking_analysis.py
  • completeness.py
  • database_api.py
  • linker.py
  • profile_data.py
  • splink_dataframe.py
  • splinkdataframe_utils.py
  • vertically_concatenate.py

Everything else is updating tests

Table registration logic

The main complexity in this PR is getting the table registration logic right and consistent with Splink 4.

Note that when autonaming, the string in the source_dataset column should be deterministic, because it controls which rows appear on the left and right hand sides of comparisons, according to logic like:

where l."source_dataset" || '-__-' || l."unique_id" < r."source_dataset" || '-__-' || r."unique_id"

Splink 4

  • When passing tables to the linker, the linker gave them sequential names for the source_dataset column here
  • When passing tables to other functions (such as blocking analysis), the source_dataset was set using ascii_uid (see here)

This PR changes this slightly to make it always sequential/deterministic. This is easier now that the DB API 'remembers' what tables have been registered with it.
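
A minimal sketch of the idea (hypothetical code, not the actual database_api.py implementation): because the DB API tracks registrations, the nth registration can always receive the same name:

# Hypothetical sketch of deterministic sequential autonaming; the real
# logic lives in database_api.py / splink_dataframe.py
class DBAPISketch:
    def __init__(self):
        self._registered = {}  # autoname -> table

    def register(self, table, source_dataset_name=None):
        if source_dataset_name is None:
            # Deterministic: depends only on how many tables came before,
            # unlike the random ascii_uid naming used in Splink 4
            source_dataset_name = f"__splink__input_table_{len(self._registered)}"
        self._registered[source_dataset_name] = table
        return source_dataset_name

db = DBAPISketch()
assert db.register("df_tbl") == "__splink__input_table_0"
assert db.register("df_tbl") == "__splink__input_table_1"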

Splink 4: table names provided to Splink as strings (i.e. referring to a table that already exists in the db)

If we pass:
linker = Linker(["df_tbl", "df_tbl"], settings, db_api)

then from linker functions we get:

__splink__df_concat as (
            select '__splink__input_table_0' as source_dataset, ...
            from df_tbl
             UNION ALL 
            select '__splink__input_table_1' as source_dataset, ...
            from df_tbl
            ), 

This is because the linker, when handling table registrations, assigns the aliases __splink__input_table_0, __splink__input_table_1. That is, it's the linker that does this, not the DatabaseAPI (here).

But from blocking analysis functions we get:
count_comparisons_from_blocking_rule(table_or_tables=["df_tbl", "df_tbl"], ...)

__splink__df_concat as (
            select
            '__splink__luz4riww' as source_dataset, ...
            from df_tbl
             UNION ALL 
            select
            '__splink__7q2uiw3y' as source_dataset, ...
            from df_tbl
            ), 

Splink 4: tables provided to Splink as pandas dataframes or similar

If we pass:
linker = Linker([df, df], settings, db_api)

Then from linker functions we get:

__splink__df_concat as (
            select
            '__splink__input_table_0' as source_dataset, ...
            from __splink__input_table_0
             UNION ALL 
            select
            '__splink__input_table_1' as source_dataset, ... 
            from __splink__input_table_1
            ), 

and from blocking analysis functions we get:

__splink__df_concat as (
            select
            '__splink__ymc14n5r' as source_dataset, ...
            from __splink__ymc14n5r
             UNION ALL 
            select
            '__splink__2dwkxy2e' as source_dataset, ...
            from __splink__2dwkxy2e
            ), 
Testing scripts

Before:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
import duckdb
from splink.blocking_analysis import count_comparisons_from_blocking_rule

con = duckdb.connect(":memory:")
db_api = DuckDBAPI(connection=con)

df = splink_datasets.fake_1000

df_1 = df[df.index % 2 == 0].copy()
df_2 = df[df.index % 2 == 1].copy()
con.register("df_1_tbl", df_1)
con.register("df_2_tbl", df_2)
con.register("df_tbl", df)
# con.sql("show tables").show(max_width=1000)
# con.table("df_1_tbl").show()


settings = SettingsCreator(
    link_type="link_and_dedupe",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
    ],
)

# linker = Linker(["df_tbl", "df_tbl"], settings, db_api)
linker = Linker([df, df], settings, db_api)
print(linker._input_tables_dict)


import logging

logging.basicConfig(format="%(message)s")
logging.getLogger("splink").setLevel(1)
# linker.inference.predict()

count_comparisons_from_blocking_rule(
    table_or_tables=[df, df],
    blocking_rule=block_on("first_name"),
    db_api=db_api,
    link_type="link_and_dedupe",
)


# count_comparisons_from_blocking_rule(
#     table_or_tables=["df_tbl", "df_tbl"],
#     blocking_rule=block_on("first_name"),
#     db_api=db_api,
#     link_type="link_and_dedupe",
# )

After:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
import duckdb
from splink.blocking_analysis import count_comparisons_from_blocking_rule

con = duckdb.connect(":memory:")
db_api = DuckDBAPI(connection=con)

df = splink_datasets.fake_1000

df_1 = df[df.index % 2 == 0].copy()
df_2 = df[df.index % 2 == 1].copy()
con.register("df_1_tbl", df_1)
con.register("df_2_tbl", df_2)
con.register("df_tbl", df)
# con.sql("show tables").show(max_width=1000)
# con.table("df_1_tbl").show()


settings = SettingsCreator(
    link_type="link_and_dedupe",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
    ],
)

# in_1 = db_api.register("df_tbl", source_dataset_name="ab1")
# in_2 = db_api.register("df_tbl", source_dataset_name="de2")

# in_1 = db_api.register(df)
# in_2 = db_api.register(df)

in_1 = db_api.register("df_tbl")
in_2 = db_api.register("df_tbl")

linker = Linker([in_1, in_2], settings, db_api)
print(linker._input_tables_dict)


import logging

logging.basicConfig(format="%(message)s")
logging.getLogger("splink").setLevel(1)
linker.inference.predict()

count_comparisons_from_blocking_rule(
    [in_1, in_2],
    blocking_rule=block_on("first_name"),
    link_type="link_and_dedupe",
)

Why can't we set source_dataset_name to the table name when the input is a physical table string?

Because the user can do this:

in_1 = db_api.register("df_tbl")
in_2 = db_api.register("df_tbl")

linker = Linker([in_1, in_2], settings, db_api)

and if they do, we want the source_dataset_name values to be different; otherwise you get duplicated unique_ids across rows.
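
To make the problem concrete, a small illustration (plain Python, not Splink code): if both registrations got the same source_dataset_name, the composite row key (source_dataset, unique_id) would collide:

# Illustration only: rows are identified by (source_dataset, unique_id)
same_name = [("df_tbl", 1), ("df_tbl", 2)] + [("df_tbl", 1), ("df_tbl", 2)]
assert len(set(same_name)) == 2  # 4 rows but only 2 distinct keys

# Distinct autonames keep the keys unique
distinct = [("__splink__input_table_0", 1), ("__splink__input_table_0", 2)] + [
    ("__splink__input_table_1", 1),
    ("__splink__input_table_1", 2),
]
assert len(set(distinct)) == 4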

Splink 4: What if we pass the same source_dataset_name to two input dataframes?

linker = Linker([df, df], settings, db_api, input_table_aliases=["ab1", "ab1"])
linker.inference.predict()

This causes the error:
Binder Error: Values list "l" does not have a column named "source_dataset"

This is because linker._input_tables_dict is a dict keyed by input alias (see register_multiple_tables and where we create the dict).

But in the above, Splink 'thinks' there is only one input table because the alias is reused.
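
The collision is just ordinary dict-key overwriting, e.g.:

# Illustration only: a repeated alias collapses two inputs into one dict entry
aliases = ["ab1", "ab1"]
tables = ["df (first copy)", "df (second copy)"]
input_tables_dict = dict(zip(aliases, tables))
print(input_tables_dict)  # {'ab1': 'df (second copy)'} -- one input table survives

and with what Splink then believes is a single input table, the source_dataset column is presumably never added to the concatenated input, hence the Binder Error above.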

@RobinL RobinL changed the title (WIP) Splinkdataframes everywhere Register SplinkDataFrames using db_api before passing to Splink functions Dec 20, 2025
@RobinL RobinL changed the title Register SplinkDataFrames using db_api before passing to Splink functions Splink 5: Register SplinkDataFrames using db_api before passing to Splink functions Dec 20, 2025
@ADBond ADBond left a comment

Great, I think the core of how this works looks good, and makes things a lot simpler.

Haven't looked in detail at the tests or notebooks, but they look fine from a skim.

@RobinL RobinL merged commit 8f4a5c0 into splink_5_dev Jan 12, 2026
31 checks passed
@RobinL RobinL deleted the splinkdataframes_everywhere branch January 12, 2026 15:15