feat(destinations): add Databricks Delta Lake destination (insert/merge/mirror) (closes #167) by masukai · Pull Request #629 · drt-hub/drt

masukai · 2026-06-09T22:26:37Z

Summary

Databricks Delta Lake destination — the third DWH destination alongside Snowflake (#353) and BigQuery (#584 in flight). Completes the major-DWH lineup. Same three-mode shape as the Snowflake leg.

| Mode | Behaviour |
| --- | --- |
| `config.mode: insert` | Append-only `INSERT INTO` per record |
| `config.mode: merge` | Delta Lake `MERGE INTO` via staging scratch table (requires `upsert_key`) |
| `sync.mode: mirror` | Forces MERGE write path + end-of-sync `DELETE WHERE upsert_key NOT IN (observed)` |

Implementation

Mirrors the Snowflake destination's structure (#599 / mirror leg of #340 family) but with Databricks-specific shape:

Auth: Databricks SQL Connector — host_env (workspace hostname) + http_path_env (SQL warehouse) + token_env (PAT, dapi*)
Naming: Unity Catalog three-part catalog.schema.table (or catalog: hive_metastore for legacy workspaces)
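Merge staging: Databricks Delta has no session-local temp tables (no CREATE TEMP TABLE syntax), so the merge path creates a uniquely-named scratch Delta table catalog.schema.__drt_staging_<table> cloned from the target's schema, stages rows into it with per-row INSERTs, runs MERGE INTO against the target, then drops the scratch table. The token-bearing principal therefore needs CREATE on the schema in addition to MODIFY on the target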
Mirror semantics: same shape as Snowflake — composite keys use WHERE (c1, c2) NOT IN ((v1a, v1b), ...); safety guard skips DELETE entirely when no batch produced records
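
The mirror DELETE described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `build_mirror_delete` is a hypothetical helper name, and the `?` markers are a generic DB-API placeholder style chosen for the sketch (the PR text does not say how values are bound). Identifier quoting is omitted.

```python
from typing import List, Optional, Sequence, Tuple


def build_mirror_delete(
    fqn: str,
    key_columns: Sequence[str],
    observed: Sequence[Tuple[object, ...]],
) -> Optional[Tuple[str, List[object]]]:
    """Return (sql, params) for the end-of-sync DELETE, or None to skip it.

    Hypothetical helper: name, signature, and placeholder style are assumptions.
    """
    # Safety guard: if no batch produced records, skip the DELETE entirely
    # rather than emit a statement that would remove every row.
    if not observed:
        return None

    if len(key_columns) == 1:
        # Single-column key: WHERE c1 NOT IN (?, ?, ...)
        lhs = key_columns[0]
        row_placeholder = "?"
    else:
        # Composite key: WHERE (c1, c2) NOT IN ((?, ?), (?, ?), ...)
        lhs = "(" + ", ".join(key_columns) + ")"
        row_placeholder = "(" + ", ".join("?" for _ in key_columns) + ")"

    placeholders = ", ".join(row_placeholder for _ in observed)
    sql = f"DELETE FROM {fqn} WHERE {lhs} NOT IN ({placeholders})"
    # Flatten the observed key tuples in the same order as the placeholders.
    params = [value for key in observed for value in key]
    return sql, params
```

Returning `None` for an empty observed set keeps the guard at the call site simple: the caller only executes a statement when one is returned.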

Example

destination:
  type: databricks
  host_env: DATABRICKS_HOST
  http_path_env: DATABRICKS_HTTP_PATH
  token_env: DATABRICKS_TOKEN
  catalog: main
  schema: default
  table: user_scores
  mode: merge
  upsert_key: [user_id]
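
As a rough sketch of how the three `*_env` fields in this config could be turned into connector arguments: the keyword names `server_hostname`, `http_path`, and `access_token` are the ones this PR's tests check on `databricks.sql.connect()`, while `resolve_connect_kwargs` and its error message are hypothetical and only illustrate the "missing creds raises ValueError" behaviour.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_connect_kwargs(
    host_env: str,
    http_path_env: str,
    token_env: str,
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Map env-var names from the destination config to connect() kwargs.

    Hypothetical helper for illustration; not the function used in the PR.
    """
    env = os.environ if environ is None else environ
    # connect() keyword -> name of the environment variable that holds its value
    names = {
        "server_hostname": host_env,
        "http_path": http_path_env,
        "access_token": token_env,
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise ValueError("Missing Databricks credentials: " + ", ".join(missing))
    return {kwarg: env[var] for kwarg, var in names.items()}
```

With the connector installed, the result would be passed along as `databricks.sql.connect(**kwargs)`. Taking `environ` as a parameter keeps the helper testable without touching the real process environment.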

Mirror mode (Census "Full Sync with Deletion" equivalent):

destination:
  type: databricks
  host_env: DATABRICKS_HOST
  http_path_env: DATABRICKS_HTTP_PATH
  token_env: DATABRICKS_TOKEN
  catalog: main
  schema: analytics
  table: active_users
  upsert_key: [user_id]
sync:
  mode: mirror

Tests

22 unit tests in tests/unit/test_databricks_destination.py:

| Category | Tests |
| --- | --- |
| Config | `schema:` YAML alias for the mypy-strict `schema_` field, three-part FQN in `describe()`, Hive Metastore catalog |
| Empty-batch | Short-circuit before any `databricks.sql` import (#595 contract) |
| Auth | Missing creds raises ValueError, missing extras raises ImportError, `databricks.sql.connect()` kwargs shape (`server_hostname` / `http_path` / `access_token`), which protects against silent template-copy drift from the Snowflake destination |
| INSERT | Happy path with correct FQN, on_error=skip vs on_error=fail |
| MERGE | Happy path (staging CREATE + INSERTs + MERGE + DROP), `upsert_key` required, composite-key `ON` clause, all-columns-are-key (skips UPDATE clause) |
| Mirror | `upsert_key` validation, MERGE-write-path forcing, single-column DELETE, composite-key DELETE tuple form, skip-when-no-records safety guard, no-op `finalize_sync` for non-mirror modes |
| Connection | `test_connection` round-trip |

databricks.sql is mocked via sys.modules injection — no real Databricks workspace or databricks-sql-connector install required.
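
For readers unfamiliar with the technique, a minimal sketch of sys.modules injection is shown below. It is not copied from the test file: `install_fake_databricks_sql` is a hypothetical helper, and the real tests may structure their fixtures differently. It uses only the standard library.

```python
import sys
import types
from unittest import mock


def install_fake_databricks_sql():
    """Register stand-in `databricks` and `databricks.sql` modules.

    Hypothetical helper. Returns the fake connect() and the active patcher;
    call patcher.stop() to restore sys.modules.
    """
    fake_connect = mock.MagicMock(name="databricks.sql.connect")
    sql_module = types.ModuleType("databricks.sql")
    sql_module.connect = fake_connect
    pkg = types.ModuleType("databricks")
    pkg.sql = sql_module
    # patch.dict restores the original sys.modules entries on stop().
    patcher = mock.patch.dict(
        sys.modules, {"databricks": pkg, "databricks.sql": sql_module}
    )
    patcher.start()
    return fake_connect, patcher


fake_connect, patcher = install_fake_databricks_sql()
try:
    from databricks import sql  # resolves to the fake module, not the real package

    conn = sql.connect(server_hostname="h", http_path="p", access_token="t")
    # The kwargs-shape assertion mirrors what the Auth tests are described as checking.
    fake_connect.assert_called_once_with(
        server_hostname="h", http_path="p", access_token="t"
    )
finally:
    patcher.stop()
```

Because the import statement consults `sys.modules` first, code under test that does `from databricks import sql` inside a function picks up the fake without the real package being installed.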

Why a Delta scratch table instead of CREATE TEMP TABLE?

Databricks Delta Lake doesn't have session-local temp tables — the standard CREATE TEMP TABLE syntax isn't supported by Delta. A uniquely-named Delta scratch table in the same catalog.schema is the idiomatic shape. The __drt_staging_* prefix makes it identifiable in audit logs, and a CREATE OR REPLACE TABLE on the next run cleanly overwrites any interrupted-mid-sync remnant. Documented in docs/connectors/databricks.md.
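
The staging sequence can be sketched as a list of SQL statements. This is an assumption-laden illustration rather than the PR's implementation: `build_merge_statements` is a hypothetical name, the empty `CREATE OR REPLACE TABLE ... AS SELECT ... WHERE 1 = 0` is one plausible way to clone the target's schema (the PR does not show the exact statement), the `?` markers are a generic placeholder style, and identifier quoting is omitted.

```python
from typing import List, Sequence


def build_merge_statements(
    catalog: str,
    schema: str,
    table: str,
    columns: Sequence[str],
    upsert_key: Sequence[str],
) -> List[str]:
    """Return the CREATE / INSERT / MERGE / DROP statements for one merge run.

    Hypothetical helper; statement shapes are assumptions for illustration.
    """
    if not upsert_key:
        raise ValueError("merge mode requires upsert_key")

    target = f"{catalog}.{schema}.{table}"
    staging = f"{catalog}.{schema}.__drt_staging_{table}"

    col_list = ", ".join(columns)
    value_marks = ", ".join("?" for _ in columns)
    src_list = ", ".join(f"s.{c}" for c in columns)
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in upsert_key)
    non_key = [c for c in columns if c not in upsert_key]

    merge = f"MERGE INTO {target} AS t USING {staging} AS s ON {on_clause}"
    if non_key:
        # When every column is part of the key there is nothing to update,
        # so the WHEN MATCHED clause is skipped.
        updates = ", ".join(f"t.{c} = s.{c}" for c in non_key)
        merge += f" WHEN MATCHED THEN UPDATE SET {updates}"
    merge += f" WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({src_list})"

    return [
        # OR REPLACE overwrites a remnant left by an interrupted earlier run.
        f"CREATE OR REPLACE TABLE {staging} AS SELECT * FROM {target} WHERE 1 = 0",
        # Executed once per record in the batch.
        f"INSERT INTO {staging} ({col_list}) VALUES ({value_marks})",
        merge,
        f"DROP TABLE IF EXISTS {staging}",
    ]
```

For the merge example config above, `build_merge_statements("main", "default", "user_scores", ["user_id", "score"], ["user_id"])` would stage into `main.default.__drt_staging_user_scores`.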

Test plan

pytest tests/unit/test_databricks_destination.py — 22 passed
pytest tests/unit/test_databricks_destination.py tests/unit/test_snowflake_destination.py tests/unit/test_connector_registry.py tests/contracts/ — 125 passed (no regression on sibling destinations or contracts)
ruff check drt tests — clean
CI green on Python 3.10–3.13 + CodeQL

Docs

docs/connectors/databricks.md — full reference (all three modes, auth with PAT generation, Unity Catalog vs Hive Metastore, merge-path staging design, sync-mode compatibility table)
README.md + README.ja.md destination tables updated (v0.7.9 row, after Snowflake)

i18n marker bump for README.ja.md follows the established post-merge housekeeping pattern (same as #618 for #613 S3 etc.).

CHANGELOG

[Unreleased] → Added entry above the S3 entry.

Out of scope

replace_strategy: swap zero-downtime replace (Snowflake also lacks this — separate follow-up issue)
OAuth M2M / service principal auth flows (track separately; PR uses PAT-only for v1)
Status of the BigQuery contributor PR #584 (feat: add BigQuery destination): unrelated, it continues on its own independent contributor cycle

Related

Closes #167 (feat: add Databricks destination (Delta Lake upsert))
Joins the #340 SQL mirror family (feat: sync.mode: mirror, differential delete for stale rows): Postgres #596 (Step 1) / MySQL #597 (Step 2) / ClickHouse #598 (Step 3) / Snowflake #599 (Step 4) / Databricks (this PR)
Completes the major-DWH lineup: Snowflake ✅ / BigQuery (in flight via #584) / Databricks (this PR)

🤖 Generated with Claude Code

…ge/mirror) (closes #167)

Third DWH destination alongside Snowflake (#353) and BigQuery (#584 in flight); completes the major-DWH lineup.

Supports the same three modes as Snowflake's leg:
- INSERT (append, `config.mode: insert`)
- MERGE (upsert via Delta Lake's native MERGE INTO, `config.mode: merge`)
- sync.mode: mirror (#340 family, Databricks leg): MERGE upsert + end-of-sync DELETE-missing

Auth via Databricks SQL Connector:
- `host_env`: workspace hostname (dbc-*.cloud.databricks.com)
- `http_path_env`: SQL warehouse HTTP path (/sql/1.0/warehouses/*)
- `token_env`: personal access token (PAT, dapi*)

Unity Catalog three-part names (catalog.schema.table) are the default; legacy workspaces use `catalog: hive_metastore`.

Merge implementation note: Databricks Delta Lake doesn't have session-local temp tables (no `CREATE TEMP TABLE` syntax), so the merge path creates a uniquely-named scratch Delta table `catalog.schema.__drt_staging_<table>` cloned from the target's schema, stages rows via per-row INSERT, executes MERGE INTO, and DROP TABLEs the staging at the end. The `__drt_staging_*` prefix makes it identifiable in audit logs. The token-bearing principal needs CREATE on the schema in addition to MODIFY on the target.

Mirror semantics match the Snowflake leg of #340:
- `sync.mode: mirror` forces the MERGE write path regardless of `config.mode`
- End-of-sync issues `DELETE FROM <table> WHERE upsert_key NOT IN (observed)`
- Composite keys use `WHERE (c1, c2) NOT IN ((v1a, v1b), ...)` form
- Safety guard: skips DELETE entirely when no batch produced records

22 unit tests in tests/unit/test_databricks_destination.py cover:
- Config validation (schema: YAML alias, three-part FQN in describe(), Hive Metastore catalog)
- Empty-batch short-circuit (#595 contract)
- databricks.sql.connect() kwargs shape, which protects against silent template-copy drift from the Snowflake destination
- INSERT happy path + on_error=skip / on_error=fail
- MERGE happy path + upsert_key required + composite key ON clause + all-columns-are-key (no UPDATE clause)
- Mirror invariants: upsert_key validation, MERGE-write-path forcing, single-column DELETE, composite-key DELETE tuple form, skip-when-no-records safety guard, no-op finalize_sync for non-mirror modes
- test_connection round-trip

databricks.sql is mocked via sys.modules injection, so no real Databricks workspace or databricks-sql-connector install is required.

Requires `pip install drt-core[databricks]` (depends on databricks-sql-connector>=3.0, already in pyproject extras).

New `docs/connectors/databricks.md` covers all three modes, auth flow with PAT generation steps, Unity Catalog vs Hive Metastore, the merge-path staging design (why Delta scratch table and not CREATE TEMP TABLE), and a sync-mode compatibility table.

README destination table updated on both English and Japanese sides (Databricks Delta Lake row added after Snowflake, v0.7.9). i18n marker bump for README.ja.md follows the established post-merge housekeeping pattern (#618-style).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

codecov · 2026-06-09T22:29:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


…e to 100%

codecov/patch on the prior commit hit 94.73% (target 86.72%): the gate passed but not at 100%. Uncovered slice was the 3 branches not exercised by the happy-path tests:

- **Lines 152-163**: MERGE-path staging INSERT failure handler (the per-row try/except inside the staging-table INSERT loop)
- **Line 196**: mirror's `failed_indices` skip path inside the `_mirror_keys` accumulator (skip rows that didn't make it into the destination so they don't count as "observed in source" for the end-of-sync DELETE)
- **Line 202**: the `Unsupported mode` defensive fallthrough ValueError (unreachable in normal flow because Pydantic Literal validates at config-load time, but tracked by coverage)

4 new tests:

1. `test_merge_staging_insert_failure_on_error_skip`: first staging INSERT fails, second succeeds; verifies result.failed=1 + row_errors recorded + MERGE still runs against whatever made it into staging.
2. `test_merge_staging_insert_failure_on_error_fail_raises`: same failure scenario but with on_error=fail; verifies the exception re-raises and the connection is still closed via try/finally.
3. `test_unsupported_mode_raises`: manually corrupts `config.mode` to "garbage" after Pydantic construction (bypasses Literal validation via `object.__setattr__`) and verifies the defensive ValueError fires.
4. `test_mirror_skips_failed_keys_from_delete_observed_set`: mirror load with a staging failure on row 1, then finalize_sync; verifies the DELETE's NOT-IN list contains only the survivor's key (id=2), not the failed row's key (id=1). This catches the semantic bug where a row that failed to load would be deleted from the destination on next mirror run.

drt/destinations/databricks.py file coverage: 94% → 100% (119/119 stmts). Coverage now matches the S3 / GCS / Azure Blob destinations from #613 / #623 / #624.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

masukai requested a review from yodakanohoshi June 9, 2026 22:26

masukai merged commit 3e66e92 into main Jun 9, 2026
8 checks passed

masukai deleted the feat/databricks-destination branch June 9, 2026 23:52

github-actions Bot locked and limited conversation to collaborators Jun 9, 2026

masukai merged 2 commits into main from feat/databricks-destination
