@TimothyW553 commented Dec 3, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

  • Introduce a protocol-shaped, Spark-free catalog contract: CatalogWithManagedCommits (GET/POST
    commits, extractTableId) lives in kernel-defaults and is discovered via ServiceLoader. This removes
    UC/Spark dependencies from the core and lets UC/Glue/Polaris plug in without kernel-spark changes.
  • Centralize snapshot/commit-range construction in a generic manager: CatalogManagedSnapshotManager
    now paginates getCommits, enforces version contiguity, maps ratified commits to _delta_log files,
    and builds snapshots with .withLogData/.withMaxCatalogVersion (plus .atVersion for time travel).
    Catalog implementations only fetch ratified commit metadata and propose commits, which reduces
    duplication and coupling.
  • Minimal connector wiring: SparkTable only routes through
    DeltaSnapshotManagerFactory.fromCatalogTable/fromPath; path tables remain unchanged. Managed tables
    auto-resolve via ServiceLoader if a table ID is present; otherwise they fall back to the path-based
    manager.
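
The discovery-and-fallback rule in the last bullet can be sketched as follows. This is a minimal illustration, not the code in this PR: `ManagedCommitCatalog`, `SnapshotManagerChoice`, `SnapshotManagerResolver`, and the `example.table.id` property key are hypothetical stand-ins, and the real CatalogWithManagedCommits contract has more methods. Only `java.util.ServiceLoader` is a real API here.

```java
import java.util.Map;
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical, trimmed-down stand-in for the catalog contract: it only
// knows how to pull a table ID out of the table properties.
interface ManagedCommitCatalog {
    Optional<String> extractTableId(Map<String, String> tableProperties);
}

// Hypothetical result type recording which kind of manager was chosen.
final class SnapshotManagerChoice {
    final String kind;               // "catalog-managed" or "path-based"
    final Optional<String> tableId;  // present only for catalog-managed tables

    SnapshotManagerChoice(String kind, Optional<String> tableId) {
        this.kind = kind;
        this.tableId = tableId;
    }
}

final class SnapshotManagerResolver {
    // Use the first catalog that yields a table ID; otherwise fall back to path-based.
    static SnapshotManagerChoice resolve(
            Iterable<ManagedCommitCatalog> catalogs, Map<String, String> tableProperties) {
        for (ManagedCommitCatalog catalog : catalogs) {
            Optional<String> tableId = catalog.extractTableId(tableProperties);
            if (tableId.isPresent()) {
                return new SnapshotManagerChoice("catalog-managed", tableId);
            }
        }
        return new SnapshotManagerChoice("path-based", Optional.empty());
    }

    // ServiceLoader is Iterable, so classpath discovery reuses the same rule.
    static SnapshotManagerChoice resolveFromClasspath(Map<String, String> tableProperties) {
        return resolve(ServiceLoader.load(ManagedCommitCatalog.class), tableProperties);
    }

    public static void main(String[] args) {
        // With no provider registered under META-INF/services, nothing is
        // discovered and the path-based manager is chosen.
        SnapshotManagerChoice choice = resolveFromClasspath(Map.of("example.table.id", "t-1"));
        System.out.println(choice.kind);
    }
}
```

Passing the catalogs as an `Iterable` keeps the rule testable without registering a provider file, which is one possible design choice rather than a description of the factory in this PR.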

The following flow shows how a snapshot of a catalog-managed Delta table is loaded when CCv2 is in use:

  • The goal: Given a Delta table backed by a catalog coordinator (UC/Glue/Polaris), fetch the
    catalog’s ratified commits and build a Delta Snapshot using only those commits, falling back to
    path-based loading when no catalog information is present.
  • How it works: SparkTable asks the factory for a snapshot manager. If a catalog implementation
    is discoverable (via ServiceLoader) and yields a table ID, the catalog-managed manager is used;
    otherwise, the path-based manager is used. The catalog-managed manager calls the catalog’s
    getCommits, paginates the results, enforces contiguous versions, and maps the commit entries to
    the exact _delta_log files. It then builds the Snapshot with withLogData and
    withMaxCatalogVersion (and atVersion for optional time travel). The final Snapshot is returned
    to SparkTable for reads.
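
The paginate-and-validate step described above might look roughly like this. It is a sketch under stated assumptions: `CommitPage`, `CommitSource`, and `RatifiedCommits` are hypothetical names, a page is reduced to a list of version numbers, and the loop assumes each call returns ratified versions at or after the requested start together with the catalog's latest ratified version. The actual getCommits signature in this PR also takes a table ID, a table path, and an optional end version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical page of results: ratified versions in ascending order plus
// the latest ratified version the catalog knows about.
final class CommitPage {
    final List<Long> versions;
    final long latestTableVersion;

    CommitPage(List<Long> versions, long latestTableVersion) {
        this.versions = versions;
        this.latestTableVersion = latestTableVersion;
    }
}

// Hypothetical stand-in for the catalog's GET commits call.
interface CommitSource {
    CommitPage getCommits(long startVersion);
}

final class RatifiedCommits {
    // Fetch pages until the latest ratified version is reached, rejecting
    // any gap or repeat between consecutive versions.
    static List<Long> fetchAll(CommitSource source, long startVersion) {
        List<Long> all = new ArrayList<>();
        long next = startVersion;
        while (true) {
            CommitPage page = source.getCommits(next);
            for (long version : page.versions) {
                if (!all.isEmpty()) {
                    long previous = all.get(all.size() - 1);
                    if (version != previous + 1) {
                        throw new IllegalStateException(
                            "Non-contiguous commits: " + previous + " followed by " + version);
                    }
                }
                all.add(version);
            }
            // Stop when the catalog has nothing more to return for this start.
            if (page.versions.isEmpty()) {
                break;
            }
            long last = all.get(all.size() - 1);
            if (last >= page.latestTableVersion) {
                break;
            }
            next = last + 1;
        }
        return all;
    }
}
```

The sketch does not require the first returned version to equal the requested start, on the assumption that a catalog may only retain commits that have not yet been published.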

Diagram:

flowchart LR
    A["SparkTable"] -->|1| B["DeltaSnapshotManagerFactory.fromCatalogTable/fromPath"]
    B -->|2| C{"ServiceLoader finds CatalogWithManagedCommits<br/>and a table ID is present?"}
    C -- "yes" --> D["CatalogManagedSnapshotManager"]
    C -- "no" --> P["PathBasedSnapshotManager"]

    D -->|3| E["getCommits(tableId, path, start=0, end?)"]
    E -->|4| F["Paginate & validate contiguity"]
    F -->|5| G["Map to ParsedCatalogCommitData<br/>(_delta_log/&lt;file&gt;)"]
    G -->|6| H["TableManager.loadSnapshot(tablePath)<br/>.withLogData(logData)<br/>.withMaxCatalogVersion(latest)<br/>(.atVersion if set)"]
    H -->|7| S["Snapshot"]

    subgraph Catalog_Impl["Catalog Impl"]
      E --> UC["CatalogWithManagedCommits impl<br/>(e.g., UC client)"]
    end
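
Step 5 of the flowchart maps each commit to a file under _delta_log. As a rough illustration, the helper below derives the path of a published commit from its version, using the Delta log convention of a 20-digit zero-padded version followed by `.json`. `DeltaLogPaths` is a hypothetical helper, not part of this PR, and a catalog may instead return its own file name for a ratified commit that has not been published yet; in that case the returned name would be used rather than derived.

```java
import java.util.Locale;

final class DeltaLogPaths {
    // Path of the published commit file for a version:
    // <tablePath>/_delta_log/<20-digit zero-padded version>.json
    static String publishedCommitFile(String tablePath, long version) {
        if (version < 0) {
            throw new IllegalArgumentException("version must be non-negative: " + version);
        }
        // Drop one trailing slash so the result never contains "//_delta_log".
        String base = tablePath.endsWith("/")
            ? tablePath.substring(0, tablePath.length() - 1)
            : tablePath;
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%s/_delta_log/%020d.json", base, version);
    }

    public static void main(String[] args) {
        System.out.println(publishedCommitFile("/data/events", 7));
    }
}
```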

Sequence workflow:

sequenceDiagram
    participant "ST" as "SparkTable (catalog/SparkTable)"
    participant "FACT" as "DeltaSnapshotManagerFactory"
    participant "MGR" as "CatalogManagedSnapshotManager"
    participant "CAT" as "CatalogWithManagedCommits (UC client via ServiceLoader)"
    participant "TM" as "TableManager (kernel)"
    participant "FS" as "_delta_log storage"

    "ST"->>"FACT": "fromCatalogTable(catalogTable, spark, hadoopConf)"
    "FACT"->>"CAT": "ServiceLoader.load(CatalogWithManagedCommits)"
    "CAT"-->>"FACT": "extractTableId(props) → Some(tableId)"
    "FACT"-->>"ST": "new CatalogManagedSnapshotManager(...)"

    "ST"->>"MGR": "loadLatestSnapshot()"
    "MGR"->>"CAT": "getCommits(tableId, tablePath, start=0, end=None)"
    "CAT"-->>"MGR": "GetCommitsResult(commits[], latestTableVersion)"
    "MGR"->>"MGR": "paginate if needed & validateContiguous"
    "MGR"->>"MGR": "toParsedLogData(commits) → ParsedCatalogCommitData"
    "MGR"->>"TM": "loadSnapshot(tablePath)<br/>.withLogData(logData)<br/>.withMaxCatalogVersion(latestTableVersion)<br/>(.atVersion(v) if provided)"
    "TM"->>"FS": "read commit/checkpoint files"
    "FS"-->>"TM": "file contents"
    "TM"-->>"MGR": "Snapshot"
    "MGR"-->>"ST": "Snapshot"

How was this patch tested?

Does this PR introduce any user-facing changes?

Signed-off-by: TimothyW553 <[email protected]>