Merged

Commits (69)
a39656e
source dataset name
RobinL Dec 17, 2025
944be27
add utils
RobinL Dec 17, 2025
812fdc0
vertical concat and blocking analysis
RobinL Dec 17, 2025
c573c24
linker
RobinL Dec 17, 2025
78af7f9
refactor: rename parameter 'table_or_tables' to 'splink_dataframe_or_…
RobinL Dec 17, 2025
fe3f948
remove dbapi from other public non linker fns
RobinL Dec 17, 2025
c1ace52
harden
RobinL Dec 17, 2025
cd25028
refactor: improve handling of source_dataset_name in register method
RobinL Dec 17, 2025
7798192
update first test
RobinL Dec 18, 2025
905e807
update analyse blocking test
RobinL Dec 18, 2025
d404187
fix bit i missed
RobinL Dec 19, 2025
6435bf1
fixes
RobinL Dec 19, 2025
de2faff
update test_accuracy
RobinL Dec 19, 2025
d843d03
refactor: update db_api retrieval in array comparison tests
RobinL Dec 19, 2025
34cebb7
update spark
RobinL Dec 19, 2025
8231154
refactor: update Linker initialization to use registered DataFrame
RobinL Dec 19, 2025
6479bd6
refactor: update Linker initialization to use registered DataFrames i…
RobinL Dec 19, 2025
88bc475
refactor: replace Linker initialization with helper method for regist…
RobinL Dec 19, 2025
f3f6a23
refactor: update Linker initialization to use registered DataFrames i…
RobinL Dec 19, 2025
c437d28
refactor: replace Linker initialization with helper method for regist…
RobinL Dec 19, 2025
dca0234
refactor: update Linker initialization to use helper methods for Data…
RobinL Dec 19, 2025
26f1ac6
refactor: update tests to use db_api method for database API retrieval
RobinL Dec 19, 2025
3197a60
refactor: update tests to use registered DataFrames with DuckDBAPI
RobinL Dec 19, 2025
bb62d83
refactor: update tests to use db_api method for DataFrame registration
RobinL Dec 19, 2025
cafae1e
refactor: update tests to use helper methods for DataFrame registrati…
RobinL Dec 19, 2025
826ec12
refactor: update tests to register DataFrames with DuckDBAPI and use …
RobinL Dec 19, 2025
ed734f7
refactor: update tests to register DataFrames with DuckDBAPI and use …
RobinL Dec 19, 2025
dcb1e3f
refactor: update tests to register DataFrames with DuckDBAPI and use …
RobinL Dec 19, 2025
8f1d6a3
refactor: update tests to use linker_with_registration for DataFrame …
RobinL Dec 19, 2025
3352463
refactor: update tests to use registered DataFrames with DuckDBAPI
RobinL Dec 19, 2025
d97c61c
refactor: update tests to use db_api method for database access and l…
RobinL Dec 19, 2025
ec7e5f8
refactor: update tests to use linker_with_registration for creating L…
RobinL Dec 19, 2025
c7f196c
refactor: update tests to use helper.linker_with_registration for Lin…
RobinL Dec 19, 2025
a156fa2
refactor: update tests to use db_api method for registering data with…
RobinL Dec 19, 2025
ee18959
refactor: update tests to use linker_with_registration for Linker ins…
RobinL Dec 19, 2025
451bc0b
refactor: standardize type hint for splink_dataframe_or_dataframes pa…
RobinL Dec 19, 2025
14abf82
refactor: update linker_with_registration method and adjust test case…
RobinL Dec 19, 2025
e0ca0e4
fix postgres tests
RobinL Dec 19, 2025
92b3d07
refactor: streamline dataframe conversion in test_score_missing_edges…
RobinL Dec 19, 2025
143651e
refactor: enhance register method to support overwriting of internal …
RobinL Dec 19, 2025
0e3f42e
refactor: update Linker initialization to use registered DataFrames i…
RobinL Dec 20, 2025
67bf85f
refactor: update SQLite tests to use registered DataFrames instead of…
RobinL Dec 20, 2025
7644ef4
exploratory
RobinL Dec 20, 2025
a7af437
exploratory
RobinL Dec 20, 2025
7dc3b9f
Implement feature X to enhance user experience and optimize performance
RobinL Dec 20, 2025
b3e3859
update tutorials
RobinL Dec 20, 2025
37d58a2
accuracy analysis
RobinL Dec 20, 2025
8c507eb
add notes
RobinL Dec 20, 2025
30e0c3d
update docs/demos/examples/duckdb/deduplicate_50k_synthetic.ipynb
RobinL Dec 20, 2025
8974ed3
update docs/demos/examples/duckdb/deterministic_dedupe.ipynb
RobinL Dec 20, 2025
95e546d
docs/demos/examples/duckdb/pairwise_labels.ipynb
RobinL Dec 20, 2025
95ff490
bias eval
RobinL Dec 20, 2025
9c999b7
business rates
RobinL Dec 20, 2025
12adb0c
cookbook progress
RobinL Dec 20, 2025
1d6b319
cookbook
RobinL Dec 20, 2025
f234836
pseudopeople
RobinL Dec 20, 2025
22dd13a
git commit -m "final two"
RobinL Dec 20, 2025
41f5ec5
lint and mypy
RobinL Dec 20, 2025
6e3445c
format: improve docstring formatting for profile_columns function
RobinL Dec 20, 2025
5153e72
mypy
RobinL Dec 20, 2025
4537d2d
Merge pull request #2866 from moj-analytical-services/helper_and_upda…
RobinL Dec 20, 2025
a025690
refactor method names in database api
RobinL Dec 20, 2025
3398b01
fix tests
RobinL Dec 20, 2025
276efef
harden
RobinL Dec 20, 2025
06b17cf
harden
RobinL Dec 20, 2025
76fb924
remove redundant file
RobinL Jan 12, 2026
2629dae
address first comment
RobinL Jan 12, 2026
f9f494b
address comment about error message
RobinL Jan 12, 2026
b94bd20
add final decorator
RobinL Jan 12, 2026
2,331 changes: 1,166 additions & 1,165 deletions docs/demos/examples/duckdb/accuracy_analysis_from_labels_column.ipynb

Large diffs are not rendered by default.

3,365 changes: 1,683 additions & 1,682 deletions docs/demos/examples/duckdb/deduplicate_50k_synthetic.ipynb

Large diffs are not rendered by default.

1,606 changes: 803 additions & 803 deletions docs/demos/examples/duckdb/deterministic_dedupe.ipynb

Large diffs are not rendered by default.

3,161 changes: 1,582 additions & 1,579 deletions docs/demos/examples/duckdb/febrl3.ipynb

Large diffs are not rendered by default.

6,800 changes: 3,405 additions & 3,395 deletions docs/demos/examples/duckdb/febrl4.ipynb

Large diffs are not rendered by default.

1,462 changes: 731 additions & 731 deletions docs/demos/examples/duckdb/link_only.ipynb

Large diffs are not rendered by default.

1,524 changes: 763 additions & 761 deletions docs/demos/examples/duckdb/pairwise_labels.ipynb

Large diffs are not rendered by default.

1,398 changes: 700 additions & 698 deletions docs/demos/examples/duckdb/quick_and_dirty_persons.ipynb

Large diffs are not rendered by default.

6,070 changes: 3,036 additions & 3,034 deletions docs/demos/examples/duckdb/real_time_record_linkage.ipynb

Large diffs are not rendered by default.

2,742 changes: 1,371 additions & 1,371 deletions docs/demos/examples/duckdb/transactions.ipynb

Large diffs are not rendered by default.

6 changes: 4 additions & 2 deletions docs/demos/examples/duckdb_no_test/bias_eval.ipynb
@@ -290,11 +290,13 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"linker = Linker(production_df, settings='../../demo_settings/model_h50k.json', db_api=db_api)"
"db_api = DuckDBAPI()\n",
"production_df_sdf = db_api.register(production_df)\n",
"linker = Linker(production_df_sdf, settings='../../demo_settings/model_h50k.json')"
]
},
{
15 changes: 8 additions & 7 deletions docs/demos/examples/duckdb_no_test/business_rates_match.ipynb
@@ -435,7 +435,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -477,8 +477,7 @@
" company_name,\n",
" company_number,\n",
" COALESCE(\n",
" REGEXP_EXTRACT(address_concat, '(\\\\d+[A-Z]?)'),\n",
" REGEXP_EXTRACT(address_concat, '(\\\\S+)(?=\\\\s+HOUSE)')\n",
" REGEXP_EXTRACT(address_concat, '(\\\\d+[A-Z]?)')\n",
" ) AS first_num_in_address,\n",
" postcode,\n",
" name_tokens_with_freq,\n",
@@ -540,7 +539,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -612,7 +611,9 @@
" retain_matching_columns=True,\n",
")\n",
"\n",
"linker = Linker([df_stockport, df_all_companies], settings, db_api)"
"df_stockport_sdf = db_api.register(df_stockport)\n",
"df_all_companies_sdf = db_api.register(df_all_companies)\n",
"linker = Linker([df_stockport_sdf, df_all_companies_sdf], settings)"
]
},
{
@@ -960,7 +961,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "splink (3.11.11)",
"language": "python",
"name": "python3"
},
@@ -974,7 +975,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
"version": "3.11.11"
}
},
"nbformat": 4,
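The business_rates_match diff above shows the two-table form of the same migration: for link_only jobs, each input is registered separately and the Linker receives a list of SplinkDataFrames. A sketch, assuming df_stockport and df_all_companies are the pandas DataFrames prepared earlier in that notebook, with an illustrative comparison:

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

settings = SettingsCreator(
    link_type="link_only",
    blocking_rules_to_generate_predictions=[block_on("postcode")],
    comparisons=[cl.ExactMatch("company_name")],
    retain_matching_columns=True,
)

# Each source is registered on the same db_api; the Linker then takes the
# list of registered frames and no longer takes a db_api argument
db_api = DuckDBAPI()
df_stockport_sdf = db_api.register(df_stockport)
df_all_companies_sdf = db_api.register(df_all_companies)
linker = Linker([df_stockport_sdf, df_all_companies_sdf], settings)
```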
70 changes: 44 additions & 26 deletions docs/demos/examples/duckdb_no_test/cookbook.ipynb
@@ -51,7 +51,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -282,7 +282,9 @@
")\n",
"\n",
"\n",
"linker = Linker(df, settings, DuckDBAPI(), set_up_basic_logging=False)\n",
"db_api = DuckDBAPI()\n",
"df_sdf = db_api.register(df)\n",
"linker = Linker(df_sdf, settings, set_up_basic_logging=False)\n",
"\n",
"linker.inference.predict().as_pandas_dataframe()"
]
@@ -298,7 +300,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -393,9 +395,11 @@
")\n",
"\n",
"\n",
"linker = Linker(df, settings, DuckDBAPI(), set_up_basic_logging=False)\n",
"db_api = DuckDBAPI()\n",
"df_sdf = db_api.register(df)\n",
"linker = Linker(df_sdf, settings, set_up_basic_logging=False)\n",
"\n",
"linker.inference.predict().as_pandas_dataframe()\n"
"linker.inference.predict().as_pandas_dataframe()"
]
},
{
@@ -416,7 +420,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -477,6 +481,7 @@
"duckdb_df = duckdb.read_parquet(temp_file_path)\n",
"\n",
"db_api = DuckDBAPI(\":default:\")\n",
"df_sdf = db_api.register(df)\n",
"settings = SettingsCreator(\n",
" link_type=\"dedupe_only\",\n",
" comparisons=[\n",
@@ -489,7 +494,7 @@
" ],\n",
")\n",
"\n",
"linker = Linker(df, settings, db_api, set_up_basic_logging=False)\n",
"linker = Linker(df_sdf, settings, set_up_basic_logging=False)\n",
"\n",
"result = linker.inference.predict().as_duckdbpyrelation()\n",
"\n",
@@ -498,7 +503,7 @@
"\n",
"# For example, we can use the `sort` function to sort the results,\n",
"# or could use result.to_parquet() to write to a parquet file.\n",
"result.sort(\"match_weight\")\n"
"result.sort(\"match_weight\")"
]
},
{
@@ -510,7 +515,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -628,7 +633,8 @@
")\n",
"\n",
"df = splink_datasets.fake_1000\n",
"linker = Linker(df, settings, db_api, set_up_basic_logging=False)\n",
"df_sdf = db_api.register(df)\n",
"linker = Linker(df_sdf, settings, set_up_basic_logging=False)\n",
"\n",
"linker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n",
"linker.training.estimate_parameters_using_expectation_maximisation(block_on(\"dob\"))\n",
@@ -647,7 +653,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -754,7 +760,8 @@
" ],\n",
")\n",
"df = splink_datasets.fake_1000\n",
"linker = Linker(df, settings, db_api, set_up_basic_logging=False)\n",
"df_sdf = db_api.register(df)\n",
"linker = Linker(df_sdf, settings, set_up_basic_logging=False)\n",
"\n",
"linker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n",
"linker.training.estimate_parameters_using_expectation_maximisation(block_on(\"dob\"))\n",
@@ -781,7 +788,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -808,7 +815,8 @@
" max_iterations=2,\n",
")\n",
"\n",
"linker = Linker(df, settings, db_api, set_up_basic_logging=False)\n",
"df_sdf = db_api.register(df)\n",
"linker = Linker(df_sdf, settings, set_up_basic_logging=False)\n",
"\n",
"linker.training.estimate_probability_two_random_records_match(\n",
" [block_on(\"first_name\", \"surname\")], recall=0.7\n",
@@ -859,7 +867,8 @@
" ]\n",
")\n",
"\n",
"linker = Linker(df, settings, db_api)\n",
"df_sdf = db_api.register(df)\n",
"linker = Linker(df_sdf, settings)\n",
"\n",
"\n",
"linker.misc.save_model_to_json(\"mod.json\", overwrite=True)\n",
@@ -869,8 +878,10 @@
"new_settings.retain_intermediate_calculation_columns = True\n",
"new_settings.blocking_rules_to_generate_predictions = [\"1=1\"]\n",
"new_settings.additional_columns_to_retain = [\"cluster\"]\n",
"db_api_new = DuckDBAPI()\n",
"df_sdf_new = db_api_new.register(df)\n",
"linker = Linker(df_sdf_new, new_settings)\n",
"\n",
"linker = Linker(df, new_settings, DuckDBAPI())\n",
"\n",
"linker.inference.predict().as_duckdbpyrelation().show()"
]
@@ -891,6 +902,7 @@
"import difflib\n",
"\n",
"import duckdb\n",
"from duckdb.sqltypes import VARCHAR, DOUBLE\n",
"\n",
"import splink.comparison_level_library as cll\n",
"import splink.comparison_library as cl\n",
@@ -910,8 +922,8 @@
"con.create_function(\n",
" \"custom_partial_ratio\",\n",
" custom_partial_ratio,\n",
" [duckdb.typing.VARCHAR, duckdb.typing.VARCHAR],\n",
" duckdb.typing.DOUBLE,\n",
" [VARCHAR, VARCHAR],\n",
" DOUBLE,\n",
")\n",
"db_api = DuckDBAPI(connection=con)\n",
"\n",
@@ -945,7 +957,8 @@
" max_iterations=2,\n",
")\n",
"\n",
"linker = Linker(df, settings, db_api)\n",
"df_sdf = db_api.register(df)\n",
"linker = Linker(df_sdf, settings)\n",
"\n",
"linker.training.estimate_probability_two_random_records_match(\n",
" [block_on(\"first_name\", \"surname\")], recall=0.7\n",
@@ -1092,7 +1105,8 @@
")\n",
"\n",
"db_api = DuckDBAPI(connection=con)\n",
"company_linker = Linker(\"company_person_records\", company_settings, db_api)\n",
"company_records_sdf = db_api.register(\"company_person_records\")\n",
"company_linker = Linker(company_records_sdf, company_settings)\n",
"company_predictions = company_linker.inference.predict(threshold_match_probability=0.5)\n",
"\n",
"print(\"\\nCompany pairwise matches:\")\n",
@@ -1176,8 +1190,8 @@
" retain_matching_columns=True,\n",
")\n",
"\n",
"# Link persons within company clusters\n",
"person_linker = Linker(\"records_with_company_cluster\", person_settings, db_api2)\n",
"person_records_sdf = db_api2.register(\"records_with_company_cluster\")\n",
"person_linker = Linker(person_records_sdf, person_settings)\n",
"person_predictions = person_linker.inference.predict(threshold_match_probability=0.5)\n",
"\n",
"print(\"\\nPerson pairwise matches (within company clusters):\")\n",
@@ -1187,7 +1201,8 @@
" person_predictions, threshold_match_probability=0.5\n",
")\n",
"\n",
"person_clusters.as_duckdbpyrelation().sort(\"cluster_id\").show(max_width=1000)\n"
"person_clusters.as_duckdbpyrelation().sort(\"cluster_id\").show(max_width=1000)\n",
"\n"
]
},
{
@@ -1296,16 +1311,19 @@
" retain_intermediate_calculation_columns=True,\n",
" retain_matching_columns=True,\n",
")\n",
"db_api_linker = DuckDBAPI(con)\n",
"df_left_sdf = db_api_linker.register(\"df_left\")\n",
"df_right_sdf = db_api_linker.register(\"df_right\")\n",
"linker = Linker(\n",
" [\"df_left\", \"df_right\"],\n",
" [df_left_sdf, df_right_sdf],\n",
" settings,\n",
" db_api=DuckDBAPI(con),\n",
")\n",
"\n",
"# Skip training for demo purposes, just demonstrate that predict() works\n",
"\n",
"df_predict = linker.inference.predict()\n",
"df_predict.as_duckdbpyrelation()\n"
"\n",
"df_predict.as_duckdbpyrelation()"
]
}
],
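A further detail from the cookbook hunks above: register() also accepts the name of a table that already exists in the backing connection (for example db_api.register("company_person_records")), not only an in-memory dataframe. A self-contained sketch; the table, columns and settings here are stand-ins, not taken from the notebook:

```python
import duckdb

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

con = duckdb.connect()
# Stand-in table; any table already present in the connection would do
con.execute("""
    CREATE TABLE company_person_records AS
    SELECT * FROM (VALUES
        (1, 'ACME LTD',     'Alice'),
        (2, 'ACME LIMITED', 'Alice'),
        (3, 'ZENITH PLC',   'Bob')
    ) AS t(unique_id, company_name, person_name)
""")

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[block_on("person_name")],
    comparisons=[cl.ExactMatch("company_name")],
)

# Registering by table name returns a SplinkDataFrame, exactly as it does
# for a pandas DataFrame, so the rest of the workflow is unchanged
db_api = DuckDBAPI(connection=con)
records_sdf = db_api.register("company_person_records")
linker = Linker(records_sdf, settings)

predictions = linker.inference.predict(threshold_match_probability=0.5)
```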