Splink 5: Register SplinkDataFrames using db_api before passing to Splink functions #2863
Merged
Conversation
Commits:
- …dataframes' for clarity
- …ered DataFrames in chart tests
- …n chunking and cluster studio tests
- …ered DataFrames in tests
- …Frame registration in tests
- …on with DuckDBAPI
- …new input table naming
- …registered DataFrames in Linker
- …registered DataFrames in Linker
- …registration with DuckDBAPI
- …te_tests Update tests for splinkdataframes_everywhere

Files changed:
- docs/demos/examples/duckdb/febrl3.ipynb
- docs/demos/examples/duckdb/febrl4.ipynb
- docs/demos/examples/duckdb/link_only.ipynb
- docs/demos/examples/duckdb/quick_and_dirty_persons.ipynb
- docs/demos/examples/duckdb/real_time_record_linkage.ipynb
- docs/demos/examples/duckdb/transactions.ipynb
ADBond (Contributor) approved these changes on Jan 12, 2026 and left a comment:
Great, I think the core of how this works looks good, and makes things a lot simpler. I haven't looked in detail at the tests or notebooks, but they look fine from a skim.
Summary:
We now register input dataframes one at a time (turning them into SplinkDataFrames) prior to passing them into other Splink functions.
Example:
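The example itself did not survive extraction. As a minimal sketch of the intended usage (the table name and data are illustrative, and `db_api.register_table` returning a `SplinkDataFrame` is an assumption based on this PR's description):

```python
import pandas as pd
from splink import DuckDBAPI, Linker, SettingsCreator

db_api = DuckDBAPI()

# Illustrative input data and table name (not from the PR)
df = pd.DataFrame({"unique_id": [1, 2], "first_name": ["alice", "alicia"]})

settings = SettingsCreator(link_type="dedupe_only")  # comparisons etc. elided

# Assumption: register_table turns the raw dataframe into a SplinkDataFrame
sdf = db_api.register_table(df, "my_input_table")

# The SplinkDataFrame, not the raw pandas frame, is what gets passed on
linker = Linker(sdf, settings, db_api)
```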
Key files changed
In bold are the files with significant changes to the logic. The diff in the core logic is actually very small; it just results in lots of tiny updates. Everything else is updating tests.
Table registration logic
The main complexity in this PR is getting the table registration logic right and consistent with Splink 4.
Note that when autonaming, the string in the `source_dataset` column should be deterministic, because it controls which rows appear on the left and right hand side of comparisons according to logic like:

```sql
where l."source_dataset" || '-__-' || l."unique_id" < r."source_dataset" || '-__-' || r."unique_id"
```

In Splink 4, the `source_dataset` column (here) was set using `ascii_uid` (see here). This PR changes this slightly to make it always sequential/deterministic. This is easier now that the DB API 'remembers' what tables have been registered on it.
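To make the determinism requirement concrete, here is an illustrative sketch (not Splink's actual code) of how the concatenated key decides which record lands on the left:

```python
# Illustrative only: if autonaming produced a different source_dataset
# string on each run (as with a random ascii_uid), the same record pair
# could swap left/right sides between runs, changing which rows appear
# on each side of the comparison.
def comparison_key(source_dataset: str, unique_id: str) -> str:
    return f"{source_dataset}-__-{unique_id}"

l = ("__splink__input_table_0", "5")
r = ("__splink__input_table_1", "3")

# Mirrors the SQL predicate above: keep (l, r) only if l's key sorts first
if comparison_key(*l) < comparison_key(*r):
    pair = (l, r)
else:
    pair = (r, l)

print(pair)  # l stays on the left because "…table_0…" sorts before "…table_1…"
```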
Splink 4: table names provided to Splink as strings (i.e. referring to a table that already exists in the db)
If we pass:

```python
linker = Linker(["df_tbl", "df_tbl"], settings, db_api)
```

then from linker functions we get:
This is because the linker, when handling table registrations, assigns the aliases `__splink__input_table_0`, `__splink__input_table_1`. I.e. it's the linker that does this, not the `DatabaseAPI` (here). But from blocking analysis functions we get:
```python
count_comparisons_from_blocking_rule(
    table_or_tables=["df_tbl", "df_tbl"],
    ...
)
```

Splink 4: tables provided to Splink as pandas dataframes or similar
If we pass:

```python
linker = Linker([df, df], settings, db_api)
```

then from linker functions we get:
and from blocking analysis functions we get:
Testing scripts
Before:
After:
Why can we not set `source_dataset_name` = string table name when the input is a physical table string?

Because the user can do this:
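The example here was lost in extraction; consistent with the earlier snippet, it is presumably something like passing the same physical table name in twice:

```python
# Hypothetical reconstruction of the missing example: the same
# physical table passed in twice as two logical inputs
linker = Linker(["df_tbl", "df_tbl"], settings, db_api)
```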
and if they do, we want the `source_dataset_name` to be different, otherwise you get duplicated row `unique_id`s.

Splink 4: What if we pass the same `source_dataset_name` to two input dataframes?

This causes the error:
```
Binder Error: Values list "l" does not have a column named "source_dataset"
```
This is because `linker._input_tables_dict` is a dict keyed with `input_aliases` (see `register_multiple_tables` and where we create the dict). But in the above, Splink 'thinks' there is only one input table because the alias is reused.
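As an illustrative sketch (not Splink's actual code) of why the reused alias collapses the inputs:

```python
# A dict keyed on input aliases can only hold one entry per alias,
# so the second table silently overwrites the first
input_aliases = ["my_alias", "my_alias"]
tables = ["table_a", "table_b"]

input_tables_dict = dict(zip(input_aliases, tables))
print(input_tables_dict)  # {'my_alias': 'table_b'} -- one input 'disappears'
```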