[kernel-spark] Implement initialOffset() for dsv2 streaming #5498
Conversation
Hello @huan233usc @gengliangwang @tdas @jerrypeng, could you please take a look at this PR? Thanks!
kernel-spark/src/main/java/io/delta/kernel/spark/read/SparkMicroBatchStream.java
huan233usc
left a comment
LGTM
kernel-spark/src/main/java/io/delta/kernel/spark/read/SparkMicroBatchStream.java
kernel-spark/src/main/java/io/delta/kernel/spark/read/SparkScan.java
kernel-spark/src/main/java/io/delta/kernel/spark/read/SparkMicroBatchStream.java
Force-pushed from 195a309 to 3763efa
public Offset latestOffset(Offset startOffset, ReadLimit limit) {
  // For the first batch, initialOffset() should be called before latestOffset().
  // if startOffset is null: no data is available to read.
  if (startOffset == null) {
This condition should never happen for DSv2 right?
We should just assert as a sanity check
It could happen:
- we need a way for initialOffset() to indicate "there's no data to read".
- latestOffset() will be called even if initialOffset() returns null: https://github.com/apache/spark/blob/27a4849834406a5bbfb0a0b11ea8b725936baef6/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/runtime/MicroBatchExecution.scala#L773
> we need a way for initialOffset() to indicate "there's no data to read".

Why would initialOffset() need to indicate that?

> latestOffset() will be called even if initialOffset() returns null

Why does initialOffset() need to return null? I don't understand the use case. Please reference what the Kafka v2 source does:
You are right. Please take a look at the modified code.
- `latestOffset()` might return null if there's nothing to read (e.g. the table is empty).
- DSv2 APIs don't explicitly mention whether `initialOffset()` can return null, but I'm now in agreement with you that we should not return null.
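For reference, a minimal sketch of that convention, assuming a simple version-based offset; `SketchOffset` and `resolveStartingVersion()` are illustrative names, not the PR's actual classes:

```java
import org.apache.spark.sql.connector.read.streaming.Offset;

// Illustrative offset type; the real PR uses the kernel-spark offset class.
class SketchOffset extends Offset {
  private final long version;

  SketchOffset(long version) {
    this.version = version;
  }

  long version() {
    return version;
  }

  @Override
  public String json() {
    return "{\"version\":" + version + "}";
  }
}

class InitialOffsetSketch {
  // Assumption: resolves the start version from startingVersion/startingTimestamp
  // options, or from the latest snapshot when neither is set.
  private long resolveStartingVersion() {
    return 0L;
  }

  // initialOffset() always returns a concrete offset, never null; "nothing to read"
  // is signaled later by latestOffset(startOffset, limit) returning null.
  public Offset initialOffset() {
    return new SketchOffset(resolveStartingVersion());
  }
}
```

In DSv1, "no data yet" is expressed by `Source.getOffset` returning `None`; in DSv2 the equivalent signal moves to `latestOffset`, which is the reasoning reflected in the thread above.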
spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaSource.scala
kernel-spark/src/test/java/io/delta/kernel/spark/read/SparkMicroBatchStreamTest.java
Force-pushed from 05dc041 to b777299
Force-pushed from f288551 to 3c2826a
}

if (version < 0) {
  // This shouldn't happen; defensively return null.
Is this valid? The starting version shouldn't be less than 0; that is not a valid starting version for Delta.
Yes, this shouldn't happen -- we validate startingVersion in DeltaOptions.scala before we reach this code. I'm mirroring the logic here in DSv1:
if (version < 0) {
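A hedged sketch of what the mirrored DSv2-side check can look like; `offsetForVersion` is a hypothetical helper and `SketchOffset` is the illustrative offset type from the earlier sketch:

```java
import org.apache.spark.sql.connector.read.streaming.Offset;

final class StartingVersionCheckSketch {
  // startingVersion is expected to be validated in DeltaOptions before this point,
  // so the negative branch is purely defensive, mirroring the DSv1 code quoted above.
  static Offset offsetForVersion(long version) {
    if (version < 0) {
      // Not a valid Delta starting version; defensively return null.
      return null;
    }
    return new SketchOffset(version); // illustrative offset type from the earlier sketch
  }
}
```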
Force-pushed from 3c2826a to 9833eaf
…treaming (#5409)

## 🥞 Stacked PR

Use this [link](https://github.com/delta-io/delta/pull/5409/files) to review incremental changes.

- [**stack/latestsnapshot2**](#5409) [[Files changed](https://github.com/delta-io/delta/pull/5409/files)]
- [stack/initialoffset2](#5498) [[Files changed](https://github.com/delta-io/delta/pull/5498/files/1718356813a6b39c80585d36e7aac6c8abc3a6a0..9833eaf816ee2f1dcf94d5d9a47136e69fd26336)]
- [stack/plan1](#5499) [[Files changed](https://github.com/delta-io/delta/pull/5499/files/9833eaf816ee2f1dcf94d5d9a47136e69fd26336..90345c732c6bd182c51648a4b875fdce2c14fc63)]
- [stack/integration](#5572) [[Files changed](https://github.com/delta-io/delta/pull/5572/files/90345c732c6bd182c51648a4b875fdce2c14fc63..813a49a41719ef4b773caf5438c975c8f77c646b)]
- stack/reader

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

We add an implementation of `latestOffset(startOffset, limit)` and `getDefaultReadLimit()` for a complete `SupportsAdmissionControl` implementation. We also refactor a few `DeltaSource.scala` methods, making them static so they can be called from `SparkMicroBatchStream.java`.

## How was this patch tested?

Parameterized tests verifying parity between DSv1 (DeltaSource) and DSv2 (SparkMicroBatchStream).

## Does this PR introduce _any_ user-facing changes?

No

Signed-off-by: TimothyW553 <[email protected]>
Signed-off-by: Timothy Wang <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: Timothy Wang <[email protected]>
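For context on the `SupportsAdmissionControl` piece in the commit message above, a hedged sketch of a `getDefaultReadLimit()` implementation; the `maxFilesPerTrigger` field is an assumed stand-in for however the rate-limit options are actually threaded through, not code from the PR:

```java
import org.apache.spark.sql.connector.read.streaming.ReadLimit;

class AdmissionControlSketch {
  // Assumed stand-in for a parsed maxFilesPerTrigger option; null means "not set".
  private final Integer maxFilesPerTrigger;

  AdmissionControlSketch(Integer maxFilesPerTrigger) {
    this.maxFilesPerTrigger = maxFilesPerTrigger;
  }

  // Default admission policy used when the engine does not pass an explicit ReadLimit.
  public ReadLimit getDefaultReadLimit() {
    if (maxFilesPerTrigger == null) {
      // No rate-limit option set: admit everything that is available.
      return ReadLimit.allAvailable();
    }
    return ReadLimit.maxFiles(maxFilesPerTrigger);
  }
}
```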
Force-pushed from 9833eaf to 8087d6a
@zikangh could you resolve the conflicts?
Force-pushed from 8087d6a to 63662ce
Force-pushed from 63662ce to e1adc7c
private final boolean shouldValidateOffsets;
private final SparkSession spark;

// Tracks whether this is the first batch for this stream (no checkpointed offset).
Can you write down the assumptions around this? Namely, that this boolean is used with the assumption that the methods will be called in the sequence initialOffset() -> latestOffset() -> ..., and is then set to false.
Done.
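A sketch of how that assumption could be written down on the field; the wording is illustrative rather than the exact comment that was merged:

```java
class FirstBatchTrackingSketch {
  /**
   * Tracks whether this is the first batch for this stream (no checkpointed offset).
   *
   * Assumption: for a fresh stream the engine calls initialOffset() before
   * latestOffset(startOffset, limit). The flag is set to true in initialOffset()
   * and flipped back to false once the first latestOffset() call has been handled.
   */
  private boolean isFirstBatch = false;
}
```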
@ParameterizedTest
@MethodSource("initialOffsetParameters")
public void testInitialOffset_FirstBatchParity(
is this the right way to test this stuff?
Basically... there are other existing tests that already provide coverage for this, right? Is it that we are unable to run those tests because the streaming source v2 is incomplete?
Yes, we have a lot of end-to-end tests that we can enable once we have all the requisite pieces. This is a future PR that enables some of these tests: https://github.com/delta-io/delta/pull/5572/files/154897c75c21697300bd31e851b04147339ce466..f6980981137c5943fc590f0b46c70557adb4d161#diff-5b8b5b3f181cbc43ecdeffe4e814641b3e78801fb46d6aaf74a6b2928ba64791
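A hedged sketch of the DSv1/DSv2 parity-test shape; the parameter values and the two helper methods are hypothetical placeholders, not the actual test fixtures:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.stream.Stream;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

class InitialOffsetParitySketchTest {

  // Hypothetical parameter space, e.g. different startingVersion options.
  static Stream<Arguments> initialOffsetParameters() {
    return Stream.of(Arguments.of("0"), Arguments.of("latest"));
  }

  @ParameterizedTest
  @MethodSource("initialOffsetParameters")
  void testInitialOffset_FirstBatchParity(String startingVersion) {
    // Compute the first offset through both code paths and compare serialized forms.
    String dsv1 = dsv1InitialOffsetJson(startingVersion);
    String dsv2 = dsv2InitialOffsetJson(startingVersion);
    assertEquals(dsv1, dsv2);
  }

  // Hypothetical helper: would drive DeltaSource (DSv1) against a test table.
  private String dsv1InitialOffsetJson(String startingVersion) {
    return "{}";
  }

  // Hypothetical helper: would drive SparkMicroBatchStream (DSv2) against the same table.
  private String dsv2InitialOffsetJson(String startingVersion) {
    return "{}";
  }
}
```

The real tests presumably run both sources against the same Delta table; the placeholders here only show the parameterized structure.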
tdas
left a comment
this makes sense. minor questions but LGTM
Force-pushed from 89b26e6 to 4e9c219
Force-pushed from 4e9c219 to 90e1d9b
🥞 Stacked PR

Use this [link](https://github.com/delta-io/delta/pull/5498/files) to review incremental changes.

- [**stack/initialoffset2**](#5498) [[Files changed](https://github.com/delta-io/delta/pull/5498/files)]
- [stack/plan1](#5499) [[Files changed](https://github.com/delta-io/delta/pull/5499/files/90e1d9ba4b26d039bfa1b870e693e73204201750..35731eb6ffcb10f85ed97b04058e3bf49de771d8)]
- [stack/integration](#5572) [[Files changed](https://github.com/delta-io/delta/pull/5572/files/35731eb6ffcb10f85ed97b04058e3bf49de771d8..9c2e743cff0c1fcb8cf6ddf8efa3a1b98fddba3c)]
- stack/snapshot1
- [stack/reader](#5638) [[Files changed](https://github.com/delta-io/delta/pull/5638/files/35731eb6ffcb10f85ed97b04058e3bf49de771d8..35731eb6ffcb10f85ed97b04058e3bf49de771d8)]

Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)
Description
We finish implementing `initialOffset()` in `SparkMicroBatchStream.java`.

The `initialOffset()` method determines where a streaming query should start reading when there's no checkpointed offset. This is a DSv2-only API.

Details:

- Added an `isFirstBatch` tracking field - a boolean flag to track whether we're processing the first batch (set to true in `initialOffset()`).
- Updated `latestOffset(startOffset, limit)` - it now handles the first batch differently by returning null (not `previousOffset`) when no data is available, matching DSv1's `getStartingOffsetFromSpecificDeltaVersion` behavior (sketched below).
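A hedged sketch of the first-batch handling described above, reusing the illustrative `SketchOffset` from the earlier sketch; `fetchLatestVersion()` and the no-new-data condition are hypothetical simplifications:

```java
import org.apache.spark.sql.connector.read.streaming.Offset;
import org.apache.spark.sql.connector.read.streaming.ReadLimit;

class FirstBatchLatestOffsetSketch {
  // Tracks whether this is the first batch (no checkpointed offset).
  private boolean isFirstBatch = false;

  public Offset initialOffset() {
    isFirstBatch = true;
    return new SketchOffset(0L); // illustrative starting offset
  }

  public Offset latestOffset(Offset startOffset, ReadLimit limit) {
    long start = ((SketchOffset) startOffset).version();
    long latest = fetchLatestVersion();
    boolean noNewData = latest <= start; // simplified stand-in for the real check
    Offset result;
    if (noNewData) {
      // First batch: return null (not the previous offset) when no data is available.
      // Later batches: returning the start offset means "no new batch".
      result = isFirstBatch ? null : startOffset;
    } else {
      result = new SketchOffset(latest);
    }
    isFirstBatch = false;
    return result;
  }

  // Hypothetical helper: would read the latest table version from the snapshot/log.
  private long fetchLatestVersion() {
    return 0L;
  }
}
```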
How was this patch tested?

Parameterized tests verifying parity between DSv1 (DeltaSource) and DSv2 (SparkMicroBatchStream).
Does this PR introduce any user-facing changes?