Support tf adjustments in Clickhouse server #46

Merged 9 commits on Dec 5, 2024
14 changes: 11 additions & 3 deletions CHANGELOG.md
@@ -7,13 +7,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

+### Added
+
+- Term frequency adjustments are now not limited in Clickhouse server (or `chdb` when `debug_mode` is switched on) [#46](https://github.com/ADBond/splinkclickhouse/pull/46).
+
+### Changed
+
+- Dropped support for Splink <= `4.0.5` [#46](https://github.com/ADBond/splinkclickhouse/pull/46).
+
## [0.3.2] - 2024-10-23

### Added

-- SQL UDF `days_since_epoch` to parse a date representing a string to the number of days since `1970-01-01` [#39](https://github.com/ADBond/splinkclickhouse/pull/39)
-- Custom Clickhouse `ColumnExpression` with additional transform `parse_date_to_int` to parse string to days since epoch [#39](https://github.com/ADBond/splinkclickhouse/pull/39)
-- Custom date comparison and comparison levels working with integer type representing days since epoch [#39](https://github.com/ADBond/splinkclickhouse/pull/39)
+- SQL UDF `days_since_epoch` to parse a date representing a string to the number of days since `1970-01-01` [#39](https://github.com/ADBond/splinkclickhouse/pull/39).
+- Custom Clickhouse `ColumnExpression` with additional transform `parse_date_to_int` to parse string to days since epoch [#39](https://github.com/ADBond/splinkclickhouse/pull/39).
+- Custom date comparison and comparison levels working with integer type representing days since epoch [#39](https://github.com/ADBond/splinkclickhouse/pull/39).

## [0.3.1] - 2024-10-14
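
Note on the `days_since_epoch` entry under 0.3.2: the conversion it describes can be sketched in plain Python. This is only an illustration, not the SQL UDF itself. The function below borrows the UDF's name for readability and assumes ISO `YYYY-MM-DD` input; the real UDF's accepted formats and error handling are not shown in this diff.

```python
from datetime import date

# Reference point named in the changelog entry.
EPOCH = date(1970, 1, 1)


def days_since_epoch(date_string: str) -> int:
    """Return whole days between 1970-01-01 and an ISO 'YYYY-MM-DD' date string.

    Assumption: input is a valid ISO date; dates before the epoch give
    negative results in this sketch.
    """
    return (date.fromisoformat(date_string) - EPOCH).days
```

Storing dates as integers like this is what lets the custom date comparison levels in the same release work with simple integer arithmetic.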

6 changes: 0 additions & 6 deletions README.md
@@ -238,12 +238,6 @@ import splink.comparison_level as cl
first_name_comparison = cl.DamerauLevenshteinAtThresholds("NULLIF(first_name, '')")
```

-### Term-frequency adjustments
-
-Currently at most one term frequency adjustment can be used with `ClickhouseAPI`.
-
-This also applies to `ChDBAPI` but _only in `debug_mode`_. With `debug_mode` off there is no limit on term frequency adjustments.
-
### `ClickhouseAPI` pandas registration

`ClickhouseAPI` will allow registration of pandas dataframes, by inferring the types of columns. It currently only does this for string, integer, and float columns, and will always make them `Nullable`.
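
The registration behaviour described in this README section (string, integer, and float columns, always `Nullable`) could look roughly like the sketch below. It is a hypothetical stand-in, not the `ClickhouseAPI` implementation: the helper name `infer_clickhouse_type`, the dtype-name table, and the choice of `Int64`/`Float64` widths are all assumptions made for illustration.

```python
# Hypothetical mapping from pandas dtype names to ClickHouse base types.
# The real ClickhouseAPI may inspect dtypes differently.
_BASE_TYPES = {
    "object": "String",
    "string": "String",
    "int64": "Int64",
    "float64": "Float64",
}


def infer_clickhouse_type(pandas_dtype_name: str) -> str:
    """Map a pandas dtype name to a Nullable ClickHouse column type.

    Raises TypeError for dtypes outside the string/integer/float set,
    mirroring the limitation stated in the README text.
    """
    try:
        base_type = _BASE_TYPES[pandas_dtype_name]
    except KeyError:
        raise TypeError(f"Unsupported dtype: {pandas_dtype_name}") from None
    # Every inferred column is wrapped in Nullable, as the README describes.
    return f"Nullable({base_type})"
```

With pandas available, the dtype name for a column can be obtained with `str(df[column].dtype)` and passed to the helper.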
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -32,7 +32,7 @@ classifiers = [
]
requires-python = ">=3.9"
dependencies = [
-"splink >= 4.0.2",
+"splink >= 4.0.6",
"clickhouse_connect >= 0.7.0",
]
[project.urls]
5 changes: 2 additions & 3 deletions scripts/getting_started_clickhouse.py
@@ -24,11 +24,10 @@

db_api = ClickhouseAPI(client)

-# TODO: tf adjustments need deep work (can have _one_ but not more)
settings = SettingsCreator(
link_type="dedupe_only",
comparisons=[
-cl.JaroWinklerAtThresholds("first_name"),
+cl.NameComparison("first_name"),
cl.JaroAtThresholds("surname"),
cl.DateOfBirthComparison(
"dob",
@@ -37,7 +36,7 @@
cl.DamerauLevenshteinAtThresholds("city").configure(
term_frequency_adjustments=True
),
-cl.JaccardAtThresholds("email"),
+cl.EmailComparison("email"),
],
blocking_rules_to_generate_predictions=[
block_on("first_name", "dob"),
14 changes: 10 additions & 4 deletions tests/conftest.py
@@ -88,8 +88,12 @@ def fake_1000_settings(version):
return SettingsCreator(
link_type="dedupe_only",
comparisons=[
-cl.JaroWinklerAtThresholds("first_name"),
-cl.JaroAtThresholds("surname"),
+cl.JaroWinklerAtThresholds("first_name").configure(
+term_frequency_adjustments=True
+),
+cl.JaroAtThresholds("surname").configure(
+term_frequency_adjustments=True
+),
cl.DateOfBirthComparison(
"dob",
input_is_string=True,
@@ -112,8 +116,10 @@ def fake_1000_settings(version):
comparisons=[
cl.JaroWinklerAtThresholds(
ColumnExpression("first_name").regex_extract(".*")
-),
-cl.JaroAtThresholds(ColumnExpression("surname").regex_extract(".*")),
+).configure(term_frequency_adjustments=True),
+cl.JaroAtThresholds(
+ColumnExpression("surname").regex_extract(".*")
+).configure(term_frequency_adjustments=True),
cl.DateOfBirthComparison(
ColumnExpression("dob").regex_extract(".*"),
input_is_string=True,
8 changes: 4 additions & 4 deletions uv.lock
