feat(destinations): add Databricks Delta Lake destination (insert/merge/mirror) (closes #167) #629

Merged
masukai merged 2 commits into main from feat/databricks-destination on Jun 9, 2026
@masukai masukai commented Jun 9, 2026

Contributor

Summary

Databricks Delta Lake destination — the third DWH destination alongside Snowflake (#353) and BigQuery (#584 in flight). Completes the major-DWH lineup. Same three-mode shape as the Snowflake leg.

| Mode | Behaviour |
|---|---|
| config.mode: insert | Append-only INSERT INTO per record |
| config.mode: merge | Delta Lake MERGE INTO via staging scratch table (requires upsert_key) |
| sync.mode: mirror | Forces MERGE write path + end-of-sync DELETE WHERE upsert_key NOT IN (observed) |
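
The mode rules in the table above can be sketched as a small resolver. This is a minimal illustration, not the code in drt/destinations/databricks.py: the function name `resolve_write_path` and its signature are hypothetical, and only the rules themselves (mirror forces MERGE, MERGE requires upsert_key, unknown modes raise ValueError) come from this PR.

```python
from typing import Optional, Sequence


def resolve_write_path(
    config_mode: str,
    sync_mode: Optional[str],
    upsert_key: Sequence[str],
) -> str:
    """Return the effective write path, "insert" or "merge" (hypothetical helper)."""
    # sync.mode: mirror forces the MERGE write path regardless of config.mode.
    effective = "merge" if sync_mode == "mirror" else config_mode
    if effective not in ("insert", "merge"):
        raise ValueError(f"Unsupported mode: {config_mode!r}")
    # Both merge and mirror need a key to match rows on.
    if effective == "merge" and not upsert_key:
        raise ValueError("upsert_key is required for merge and mirror")
    return effective


print(resolve_write_path("insert", "mirror", ["user_id"]))
```

Keeping the decision in one place means the INSERT and MERGE code paths never need to know about sync.mode themselves.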

Implementation

Mirrors the structure of the Snowflake destination (#599 / mirror leg of the #340 family), with these Databricks-specific differences:

  • Auth: Databricks SQL Connector — host_env (workspace hostname) + http_path_env (SQL warehouse) + token_env (PAT, dapi*)
  • Naming: Unity Catalog three-part catalog.schema.table (or catalog: hive_metastore for legacy workspaces)
  • Merge staging: Databricks Delta has no session-local temp tables (no CREATE TEMP TABLE syntax), so the merge path creates a uniquely-named scratch Delta table catalog.schema.__drt_staging_<table> cloned from the target's schema, stages rows, MERGEs, then DROP TABLEs the scratch
  • Mirror semantics: same shape as Snowflake — composite keys use WHERE (c1, c2) NOT IN ((v1a, v1b), ...); safety guard skips DELETE entirely when no batch produced records
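
To make the mirror DELETE shapes concrete, here is a rough sketch of a statement builder. The names `render_literal` and `build_mirror_delete` are hypothetical, and the literal rendering is deliberately naive (integers and quote-free strings only); the real destination may bind or escape values differently. Only the single-column form, the composite tuple form, and the skip-when-empty guard are taken from the description above.

```python
from typing import Optional, Sequence, Tuple, Union

Scalar = Union[int, str]


def render_literal(value: Scalar) -> str:
    """Naive SQL literal rendering, for illustration only."""
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        raise TypeError(f"unsupported key type: {type(value).__name__}")
    if isinstance(value, int):
        return str(value)
    if "'" in value or "\\" in value:
        raise ValueError("refusing to inline a string that needs escaping")
    return f"'{value}'"


def build_mirror_delete(
    fqn: str,
    upsert_key: Sequence[str],
    observed: Sequence[Tuple[Scalar, ...]],
) -> Optional[str]:
    """Build the end-of-sync DELETE, or return None when nothing was observed."""
    # Safety guard: with no observed keys, NOT IN () would wipe the table.
    if not observed:
        return None
    if len(upsert_key) == 1:
        values = ", ".join(render_literal(key[0]) for key in observed)
        return f"DELETE FROM {fqn} WHERE {upsert_key[0]} NOT IN ({values})"
    # Composite keys compare whole tuples: (c1, c2) NOT IN ((v1a, v1b), ...).
    cols = ", ".join(upsert_key)
    tuples = ", ".join(
        "(" + ", ".join(render_literal(v) for v in key) + ")" for key in observed
    )
    return f"DELETE FROM {fqn} WHERE ({cols}) NOT IN ({tuples})"


print(build_mirror_delete("main.analytics.active_users", ["user_id"], [(1,), (2,)]))
```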

Example

destination:
  type: databricks
  host_env: DATABRICKS_HOST
  http_path_env: DATABRICKS_HTTP_PATH
  token_env: DATABRICKS_TOKEN
  catalog: main
  schema: default
  table: user_scores
  mode: merge
  upsert_key: [user_id]
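
The three `*_env` fields in the config above name environment variables rather than holding secrets. A minimal sketch of how they might be resolved into `databricks.sql.connect()` keyword arguments follows. The helper name `build_connect_kwargs` is hypothetical; the kwarg names (server_hostname / http_path / access_token) and the ValueError on missing credentials are the ones this PR's tests check.

```python
import os
from typing import Dict, Mapping


def build_connect_kwargs(
    host_env: str,
    http_path_env: str,
    token_env: str,
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Map env-var names from the config to connect() keyword arguments."""
    names = {
        "server_hostname": host_env,
        "http_path": http_path_env,
        "access_token": token_env,
    }
    # Report every missing variable at once instead of failing one at a time.
    missing = [var for var in names.values() if not environ.get(var)]
    if missing:
        raise ValueError(f"Missing Databricks credentials: {', '.join(missing)}")
    return {kwarg: environ[var] for kwarg, var in names.items()}


example_env = {
    "DATABRICKS_HOST": "dbc-example.cloud.databricks.com",
    "DATABRICKS_HTTP_PATH": "/sql/1.0/warehouses/example",
    "DATABRICKS_TOKEN": "dapi-example",
}
kwargs = build_connect_kwargs(
    "DATABRICKS_HOST", "DATABRICKS_HTTP_PATH", "DATABRICKS_TOKEN", environ=example_env
)
print(sorted(kwargs))
```

With the `databricks` extra installed, the result would be passed as `databricks.sql.connect(**kwargs)`.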

Mirror mode (Census "Full Sync with Deletion" equivalent):

destination:
  type: databricks
  host_env: DATABRICKS_HOST
  http_path_env: DATABRICKS_HTTP_PATH
  token_env: DATABRICKS_TOKEN
  catalog: main
  schema: analytics
  table: active_users
  upsert_key: [user_id]
sync:
  mode: mirror

Tests

22 unit tests in tests/unit/test_databricks_destination.py:

| Category | Tests |
|---|---|
| Config | schema: YAML alias for the mypy-strict schema_ field, three-part FQN in describe(), Hive Metastore catalog |
| Empty-batch | Short-circuit before any databricks.sql import (#595 contract) |
| Auth | Missing creds raises ValueError, missing extras raises ImportError, databricks.sql.connect() kwargs shape (server_hostname / http_path / access_token); protects against silent template-copy drift from the Snowflake destination |
| INSERT | Happy path with correct FQN, on_error=skip vs on_error=fail |
| MERGE | Happy path (staging CREATE + INSERTs + MERGE + DROP), upsert_key required, composite-key ON clause, all-columns-are-key (skips UPDATE clause) |
| Mirror | upsert_key validation, MERGE-write-path forcing, single-column DELETE, composite-key DELETE tuple form, skip-when-no-records safety guard, no-op finalize_sync for non-mirror modes |
| Connection | test_connection round-trip |

databricks.sql is mocked via sys.modules injection — no real Databricks workspace or databricks-sql-connector install required.
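
The sys.modules technique can be sketched as follows. This is an illustration under assumptions, not the actual test file: `load_rows` is a hypothetical stand-in for the destination's load method, and the INSERT statement and its `:user_id` parameter style are placeholders. The fake module replaces `databricks.sql` for the duration of the `with` block, so the lazy import inside `load_rows` never touches a real install.

```python
import sys
import types
from unittest import mock


def load_rows(rows):
    """Hypothetical loader: imports databricks.sql lazily, after the empty check."""
    if not rows:
        return 0  # empty-batch short-circuit, before any databricks.sql import
    from databricks import sql  # resolved from sys.modules when patched

    conn = sql.connect(server_hostname="h", http_path="p", access_token="t")
    try:
        cursor = conn.cursor()
        for row in rows:
            # Placeholder statement; the parameter style is an assumption.
            cursor.execute("INSERT INTO t (user_id) VALUES (:user_id)", row)
    finally:
        conn.close()
    return len(rows)


# Build a fake "databricks" package whose "sql" attribute is a MagicMock.
fake_sql = mock.MagicMock(name="databricks.sql")
fake_pkg = types.ModuleType("databricks")
fake_pkg.sql = fake_sql

with mock.patch.dict(sys.modules, {"databricks": fake_pkg, "databricks.sql": fake_sql}):
    loaded = load_rows([{"user_id": 1}])

print(loaded, fake_sql.connect.call_args.kwargs)
```

Because `mock.patch.dict` restores `sys.modules` on exit, the fake does not leak into other tests.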

Why a Delta scratch table instead of CREATE TEMP TABLE?

Databricks Delta Lake doesn't have session-local temp tables — the standard CREATE TEMP TABLE syntax isn't supported by Delta. A uniquely-named Delta scratch table in the same catalog.schema is the idiomatic shape. The __drt_staging_* prefix makes it identifiable in audit logs, and a CREATE OR REPLACE TABLE on the next run cleanly overwrites any interrupted-mid-sync remnant. Documented in docs/connectors/databricks.md.
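
A rough sketch of the statement sequence this design implies is shown below. The builder name `build_merge_statements` is hypothetical, and the exact SQL is an assumption: in particular, the `CREATE OR REPLACE TABLE ... AS SELECT ... WHERE 1 = 0` form for cloning the schema and the `IF EXISTS` on the DROP are illustrative choices, not confirmed by this PR. The per-row staging INSERTs are omitted, and identifier quoting is left out for brevity.

```python
from typing import List, Sequence


def build_merge_statements(
    catalog: str,
    schema: str,
    table: str,
    columns: Sequence[str],
    upsert_key: Sequence[str],
) -> List[str]:
    """Return [create staging, merge, drop staging] for a scratch-table MERGE."""
    if not upsert_key:
        raise ValueError("upsert_key is required for merge")
    target = f"{catalog}.{schema}.{table}"
    staging = f"{catalog}.{schema}.__drt_staging_{table}"
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in upsert_key)
    non_key = [c for c in columns if c not in upsert_key]

    merge = f"MERGE INTO {target} AS t USING {staging} AS s ON {on_clause}"
    # When every column is part of the key there is nothing to update,
    # so the WHEN MATCHED clause is skipped entirely.
    if non_key:
        assignments = ", ".join(f"t.{c} = s.{c}" for c in non_key)
        merge += f" WHEN MATCHED THEN UPDATE SET {assignments}"
    col_list = ", ".join(columns)
    src_list = ", ".join(f"s.{c}" for c in columns)
    merge += f" WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({src_list})"

    return [
        # Assumed clone form: an empty copy of the target's columns.
        f"CREATE OR REPLACE TABLE {staging} AS SELECT * FROM {target} WHERE 1 = 0",
        merge,
        f"DROP TABLE IF EXISTS {staging}",
    ]


for statement in build_merge_statements(
    "main", "default", "user_scores", ["user_id", "score"], ["user_id"]
):
    print(statement)
```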

Test plan

  • pytest tests/unit/test_databricks_destination.py — 22 passed
  • pytest tests/unit/test_databricks_destination.py tests/unit/test_snowflake_destination.py tests/unit/test_connector_registry.py tests/contracts/ — 125 passed (no regression on sibling destinations or contracts)
  • ruff check drt tests — clean
  • CI green on 3.10–3.13 + CodeQL

Docs

  • docs/connectors/databricks.md — full reference (all three modes, auth with PAT generation, Unity Catalog vs Hive Metastore, merge-path staging design, sync-mode compatibility table)
  • README.md + README.ja.md destination tables updated (v0.7.9 row, after Snowflake)

i18n marker bump for README.ja.md follows the established post-merge housekeeping pattern (same as #618 for #613 S3 etc.).

CHANGELOG

[Unreleased] → Added entry above the S3 entry.

Out of scope

  • replace_strategy: swap zero-downtime replace (Snowflake also lacks this — separate follow-up issue)
  • OAuth M2M / service principal auth flows (track separately; PR uses PAT-only for v1)
  • Status of the BigQuery contributor PR #584, "feat: add BigQuery destination" (unrelated; it continues on its own contributor cycle)

Related

🤖 Generated with Claude Code

…ge/mirror) (closes #167)

Third DWH destination alongside Snowflake (#353) and BigQuery
(#584 in flight) — completes the major-DWH lineup. Supports the
same three modes as Snowflake's leg:

- INSERT (append, `config.mode: insert`)
- MERGE (upsert via Delta Lake's native MERGE INTO, `config.mode: merge`)
- sync.mode: mirror (#340 family — Databricks leg) — MERGE upsert
  + end-of-sync DELETE-missing

Auth via Databricks SQL Connector:
- `host_env` — workspace hostname (dbc-*.cloud.databricks.com)
- `http_path_env` — SQL warehouse HTTP path (/sql/1.0/warehouses/*)
- `token_env` — personal access token (PAT, dapi*)

Unity Catalog three-part names (catalog.schema.table) are the
default; legacy workspaces use `catalog: hive_metastore`.

Merge implementation note: Databricks Delta Lake doesn't have
session-local temp tables (no `CREATE TEMP TABLE` syntax), so the
merge path creates a uniquely-named scratch Delta table
`catalog.schema.__drt_staging_<table>` cloned from the target's
schema, stages rows via per-row INSERT, executes MERGE INTO, and
DROP TABLEs the staging at the end. The `__drt_staging_*` prefix
makes it identifiable in audit logs. The token-bearing principal
needs CREATE on the schema in addition to MODIFY on the target.

Mirror semantics match the Snowflake leg of #340:
- `sync.mode: mirror` forces the MERGE write path regardless of
  `config.mode`
- End-of-sync issues `DELETE FROM <table> WHERE upsert_key NOT IN
  (observed)`
- Composite keys use `WHERE (c1, c2) NOT IN ((v1a, v1b), ...)` form
- Safety guard: skips DELETE entirely when no batch produced records

22 unit tests in tests/unit/test_databricks_destination.py cover:
- Config validation (schema: YAML alias, three-part FQN in
  describe(), Hive Metastore catalog)
- Empty-batch short-circuit (#595 contract)
- databricks.sql.connect() kwargs shape — protects against silent
  template-copy drift from the Snowflake destination
- INSERT happy path + on_error=skip / on_error=fail
- MERGE happy path + upsert_key required + composite key ON clause
  + all-columns-are-key (no UPDATE clause)
- Mirror invariants: upsert_key validation, MERGE-write-path
  forcing, single-column DELETE, composite-key DELETE tuple form,
  skip-when-no-records safety guard, no-op finalize_sync for
  non-mirror modes
- test_connection round-trip

databricks.sql is mocked via sys.modules injection — no real
Databricks workspace or databricks-sql-connector install required.

Requires `pip install drt-core[databricks]` (depends on
databricks-sql-connector>=3.0, already in pyproject extras).

New `docs/connectors/databricks.md` covers all three modes, auth
flow with PAT generation steps, Unity Catalog vs Hive Metastore,
the merge-path staging design (why Delta scratch table and not
CREATE TEMP TABLE), and a sync-mode compatibility table.

README destination table updated on both English and Japanese
sides (Databricks Delta Lake row added after Snowflake, v0.7.9).

i18n marker bump for README.ja.md follows the established post-merge
housekeeping pattern (#618-style).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@masukai masukai requested a review from yodakanohoshi June 9, 2026 22:26
@codecov

codecov Bot commented Jun 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


…e to 100%

codecov/patch on the prior commit hit 94.73% (target 86.72%) — the
gate passed but not at 100%. Uncovered slice was the 3 branches not
exercised by the happy-path tests:

- **Lines 152-163**: MERGE-path staging INSERT failure handler
  (the per-row try/except inside the staging-table INSERT loop)
- **Line 196**: mirror's `failed_indices` skip path inside the
  `_mirror_keys` accumulator (skip rows that didn't make it into
  the destination so they don't count as "observed in source" for
  the end-of-sync DELETE)
- **Line 202**: the `Unsupported mode` defensive fallthrough
  ValueError (unreachable in normal flow because Pydantic Literal
  validates at config-load time, but tracked by coverage)

4 new tests:

1. `test_merge_staging_insert_failure_on_error_skip`: first staging
   INSERT fails, second succeeds; verifies result.failed=1 +
   row_errors recorded + MERGE still runs against whatever made it
   into staging.

2. `test_merge_staging_insert_failure_on_error_fail_raises`: same
   failure scenario but with on_error=fail; verifies the exception
   re-raises and the connection is still closed via try/finally.

3. `test_unsupported_mode_raises`: manually corrupts `config.mode`
   to "garbage" after Pydantic construction (bypasses Literal
   validation via `object.__setattr__`) and verifies the
   defensive ValueError fires.

4. `test_mirror_skips_failed_keys_from_delete_observed_set`: mirror
   load with a staging failure on row 1, then finalize_sync;
   verifies the DELETE's NOT-IN list contains only the survivor's
   key (id=2), not the failed row's key (id=1). This catches the
   semantic bug where a row that failed to load would be deleted
   from the destination on next mirror run.
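
The observed-set rule exercised by test 4 can be sketched like this. It is a simplified stand-in for the `_mirror_keys` accumulator, with a hypothetical function name and signature; only the rule that rows listed in `failed_indices` are left out of the observed keys comes from the commit message.

```python
from typing import Any, Dict, List, Sequence, Set, Tuple


def collect_observed_keys(
    rows: Sequence[Dict[str, Any]],
    upsert_key: Sequence[str],
    failed_indices: Set[int],
) -> List[Tuple[Any, ...]]:
    """Collect key tuples for rows that were staged successfully, in order."""
    seen: Set[Tuple[Any, ...]] = set()
    observed: List[Tuple[Any, ...]] = []
    for index, row in enumerate(rows):
        # Rows whose staging INSERT failed do not count as observed.
        if index in failed_indices:
            continue
        key = tuple(row[column] for column in upsert_key)
        if key not in seen:
            seen.add(key)
            observed.append(key)
    return observed


batch = [{"id": 1, "score": 10}, {"id": 2, "score": 20}]
print(collect_observed_keys(batch, ["id"], failed_indices={0}))
```

Under this rule, a key whose staging INSERT failed is absent from the NOT IN list, so any pre-existing destination row with that key would match the end-of-sync DELETE predicate.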

drt/destinations/databricks.py file coverage: 94% → 100% (119/119
stmts). Coverage now matches the S3 / GCS / Azure Blob destinations
from #613 / #623 / #624.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@masukai masukai merged commit 3e66e92 into main Jun 9, 2026
8 checks passed
@masukai masukai deleted the feat/databricks-destination branch June 9, 2026 23:52
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 9, 2026

Development

Successfully merging this pull request may close these issues.

feat: add Databricks destination (Delta Lake upsert)
