SC-211135: Implement analyzer rule for V2 streaming reads #5475
base: master
Conversation
This implements Option A from the design doc: using an analyzer rule to replace V1 (DeltaTableV2) with V2 (SparkTable) for streaming reads only.

Key changes:
- Add UseKernelForStreamingRule analyzer rule in the spark-unified module
- The rule pattern matches on StreamingRelationV2 to isolate streaming reads
- Add DELTA_KERNEL_STREAMING_ENABLED config flag (default: false)
- Register the rule in DeltaSparkSessionExtension

Behavior:
- Streaming reads (readStream) → V2 (Kernel-based, MicroBatchStream)
- Streaming writes (writeStream) → V1 (DeltaLog-based)
- Batch reads/writes → V1 (DeltaLog-based)
- MERGE/UPDATE/DELETE → V1 (DeltaLog-based)

This approach:
- Requires zero user code changes
- Works with existing V1 and V2 implementations unchanged
- Enables gradual rollout via a configuration flag
- Provides graceful fallback on errors
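For readers skimming the diff, here is a minimal sketch of what such a resolution rule can look like. The class and table names (UseKernelForStreamingRule, DeltaTableV2, SparkTable) come from this PR; the pattern-match details, the full config key, and the `toV2SparkTable` helper are illustrative assumptions, not the PR's actual code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.streaming.StreamingRelationV2
import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.delta.catalog.DeltaTableV2

class UseKernelForStreamingRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // Leave the plan untouched unless the rollout flag is on (assumed key).
    val enabled = session.conf
      .getOption("spark.databricks.delta.kernel.streaming.enabled")
      .exists(_.toBoolean)
    if (!enabled) return plan

    plan.transformDown {
      // Matching StreamingRelationV2 isolates streaming reads; batch plans and
      // DML never contain this node, so they stay on the V1 (DeltaLog) path.
      case rel @ StreamingRelationV2(_, _, table: DeltaTableV2, _, _, _, _, _)
          if table.catalogTable.isDefined =>
        rel.copy(table = toV2SparkTable(table))
    }
  }

  // Placeholder: the real PR constructs a kernel-spark SparkTable from the
  // catalog identifier; that conversion is omitted here.
  private def toV2SparkTable(v1: DeltaTableV2): Table = ???
}
```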
```scala
// Register the analyzer rule for kernel-based streaming.
// This rule replaces V1 (DeltaTableV2) with V2 (SparkTable) for streaming queries.
extensions.injectResolutionRule { session =>
  new UseKernelForStreamingRule(session)
}
```
The kernel-spark module will be named sparkV2, so let's rename the rule to UseV2ForStreaming.
done
```scala
///////////////////

val DELTA_KERNEL_STREAMING_ENABLED =
  buildConf("kernel.streaming.enabled")
```
How about sparkV2.streaming.enabled?
Renamed to v2.streaming.enabled.
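For context, enabling the renamed flag from a user session would look roughly like this. The `spark.databricks.delta.` prefix is an assumption based on how Delta's buildConf typically namespaces keys, and the table name is made up:

```scala
// Assumed full key after the rename (prefix is an assumption, not confirmed
// by this PR).
spark.conf.set("spark.databricks.delta.v2.streaming.enabled", "true")

// readStream resolves through the new rule to the V2 (Kernel) path...
val stream = spark.readStream.table("main.default.events") // hypothetical table

// ...while batch reads of the same table keep the V1 (DeltaLog) path.
val batch = spark.read.table("main.default.events")
```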
```scala
test("catalog table logical plan uses V2 when enabled") {
  withTable("test_table") {
    sql("CREATE TABLE test_table (id INT, value STRING) USING delta")
```
nit: let's have a beforeAll and afterAll action for the test table creation and drop
I think withTable is cleaner?
I mean we don't need to create the table and insert into it in every test case. Anyway, this is a nit.
Can we add this table property? `delta.feature.catalogOwned-preview=supported`
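Something like the following could address both points while keeping withTable; this is a sketch, not the PR's actual test:

```scala
test("catalog table logical plan uses V2 when enabled") {
  withTable("test_table") {
    // Adds the requested catalogOwned feature property at table creation.
    sql(
      """CREATE TABLE test_table (id INT, value STRING) USING delta
        |TBLPROPERTIES ('delta.feature.catalogOwned-preview' = 'supported')
        |""".stripMargin)
    // ... assertions that the streaming plan resolves to the V2 SparkTable ...
  }
}
```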
```scala
  }
}

test("path-based table uses V1 even when config enabled") {
```
We need to test the V2 streaming code path with path-based tables as well, so we should probably create a new config for testing only.
I am not sure how this would work: SparkTable requires an Identifier as a constructor parameter. Unless we want to use a randomly generated identifier?
delta.`tablePath`
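i.e., the path itself can serve as the table name under the delta namespace. A rough sketch of building such an Identifier (the exact construction the PR would use is an assumption, and the path is a placeholder):

```scala
import org.apache.spark.sql.connector.catalog.Identifier

val tablePath = "/tmp/delta/events" // hypothetical path
// Mirrors the SQL form delta.`/tmp/delta/events`: namespace "delta",
// name = the raw path (backticks are only SQL quoting, not part of the name).
val ident = Identifier.of(Array("delta"), tablePath)
```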
```scala
 * Check if the DataSource is a catalog-managed Delta table.
 * We only convert catalog-managed tables to V2, not path-based tables.
 */
private def isCatalogManagedDeltaTable(dataSource: DataSource): Boolean = {
```
#5477 will introduce utils to check whether a table is a ccv2 table or a UC ccv2 table. Maybe try to use/patch it for testing?
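Until those utils land, one way to write the check using only Spark's DataSource fields might look like the following; treating a defined catalogTable with provider "delta" as "catalog-managed" is an assumption, not the PR's final logic:

```scala
import org.apache.spark.sql.execution.datasources.DataSource

// Sketch only: a table is considered catalog-managed here if it was resolved
// through the catalog (catalogTable is defined) and its provider is Delta.
private def isCatalogManagedDeltaTable(dataSource: DataSource): Boolean = {
  dataSource.catalogTable.exists { t =>
    t.provider.exists(_.equalsIgnoreCase("delta"))
  }
}
```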