feat(cdc): persist the backfill state for table-on-source #13276
Conversation
Force-pushed from 96991d1 to 7b20a95 (compare)
Codecov Report
Attention: project coverage changed by -0.01%. Additional details and impacted files:
@@ Coverage Diff @@
##             main   #13276      +/-   ##
==========================================
- Coverage   67.87%   67.86%   -0.01%
==========================================
  Files        1526     1526
  Lines      259952   260043      +91
==========================================
+ Hits       176432   176479      +47
- Misses      83520    83564      +44
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Is there some test for cdc backfill over recovery?
Right now it only has an e2e test. I insert new data into the MV created in …
}

/// Mark the backfill has done and save the last cdc offset
pub async fn mutate_state(&mut self, state_item: CdcStateItem) -> StreamExecutorResult<()> {
Suggested change:
-pub async fn mutate_state(&mut self, state_item: CdcStateItem) -> StreamExecutorResult<()> {
+pub async fn finish_state(&mut self, state_item: CdcStateItem) -> StreamExecutorResult<()> {
The comment is misleading; the function is not called only after the backfill is complete. We use this function to update the backfill state.
Ok(CdcStateItem {
    is_finished: finished,
    ..Default::default()
Hmm, should we only recover is_finished? What about the last cdc offset, row count, etc.?
The cdc offset and row count are only for observability; the executor doesn't rely on these fields.
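A minimal sketch of the state item being discussed, assuming field names taken from this PR's state columns (the actual CdcStateItem definition in the codebase may differ):

// Sketch only: field names follow the PR description; the real
// CdcStateItem may be defined differently.
#[derive(Default)]
pub struct CdcStateItem {
    /// The only field the executor consults when restoring backfill state.
    pub is_finished: bool,
    /// Observability: number of rows backfilled so far.
    pub row_count: u64,
    /// Observability: last cdc offset seen, serialized as a string.
    pub cdc_offset: Option<String>,
}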
Existing comments addressed. PTAL @hzxa21 @fuyufjh @BugenZhao
Mostly LGTM.
This PR only records a boolean flag (completed or not) in the persisted state, instead of real progress (marked by PK, perhaps). I think we had better complete this before releasing to users, because we can't just tell users the progress is persisted but is either 0% or 100%.
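For illustration, recording real progress could look roughly like the sketch below; the type and field names are hypothetical, not this PR's actual state:

// Hypothetical pk-marked progress, as suggested in the comment above.
pub enum BackfillProgress {
    /// No snapshot rows read yet.
    NotStarted,
    /// Snapshot scan in flight; recovery can resume right after this pk.
    InProgress { last_seen_pk: Vec<u8> },
    /// Snapshot phase done; only upstream cdc events remain.
    Finished,
}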
 pub enum CdcBackfillStateImpl<S: StateStore> {
     Undefined,
-    SingleTable(SingleTableState<S>),
+    SingleTable(SingleBackfillState<S>),
+    MultiTable(MultiBackfillState<S>),
 }
In the future, shall we deprecate the SingleTable option (and remove the enum)? It seems not worth maintaining it because MultiTable can completely cover its use case.
I also want to deprecate it. But we need to change the original Table job plan from source -> mview to source -> stream_scan -> mview so that it can have its own state; technically it is feasible.
+1. Currently we have 3 ways to use mysql-cdc:
1. create table xxx with (....)
2. set cdc_backfill='true' + create table xxx with (...)
3. set cdc_backfill=true + create source with (...) + create table from ...
It is confusing and not easy to explain to users which one to choose. Ideally we should only have 2 ways:
1. create table xxx with (....)
2. create source with (...) + create table from ...
1 and 2 can share the same implementation, and 1 can be syntax sugar for 2 if the user only needs to cdc one table from the db.
The question is whether we need to ensure compatibility for existing jobs created via set cdc_backfill='true' + create table xxx with (...). IMO, we can keep it simple and avoid compatibility code here since cdc_backfill is announced as an experimental feature only.
I updated the PR to support recoverable backfill, that is, to record the backfill progress and continue from it upon cluster recovery.
Force-pushed from bec0206 to 26a6113 (compare)
 let mut key = self.split_id.to_string();
 key.push_str(BACKFILL_STATE_KEY_SUFFIX);
 // write backfill finished flag
 self.source_state_handler
-    .set(key.into(), JsonbVal::from(Value::Bool(true)))
+    .set(
+        key.into(),
I just realized that we are treating the source state table as a key-value store and setting two keys (<split_id>_backfill here and <split_id> in L298). I feel that this can make the source state table contain different primary key schemas. Will this break the read path of the source state table?
Since for the single-table backfill scenario we didn't introduce a new operator into the query plan, we just wrap the source executor in the cdc backfill executor and reuse the state table of Source. We access the source state table only via point query with the key column, so it won't break.
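A self-contained illustration of the access pattern described here, with the state table reduced to a plain key-value map (the handler API and value encoding are simplifications, not the actual RisingWave code):

// Illustration only: the real source state table is not a HashMap, but the
// point-get pattern on the two key shapes is the same idea.
use std::collections::HashMap;

const BACKFILL_STATE_KEY_SUFFIX: &str = "_backfill";

fn backfill_state_key(split_id: &str) -> String {
    format!("{}{}", split_id, BACKFILL_STATE_KEY_SUFFIX)
}

fn main() {
    let mut state: HashMap<String, String> = HashMap::new();
    // The split offset state lives under "<split_id>" ...
    state.insert("1002".into(), r#"{"snapshot_done": false, ...}"#.into());
    // ... and the backfill-finished flag under "<split_id>_backfill".
    state.insert(backfill_state_key("1002"), "true".into());

    // The executor only issues point gets on a known key, so the two key
    // shapes never collide and no scan over mixed keys is required.
    let finished = state.get(&backfill_state_key("1002")).map(String::as_str) == Some("true");
    assert!(finished);
}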
Box::new(source_exec),
True. If only get is used in the streaming job, there won't be unexpected results. But when querying the source state via SQL, we can see something like this:
partition_id | offset_info
---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1002 | {"split_info": {"mysql_split": {"inner": {"snapshot_done": false, "split_id": 1002, "start_offset": "{\"sourcePartition\":{\"server\":\"RW_CDC_1002\"},\"sourceOffset\":{\"transaction_id\":null,\"ts_sec\":1699866101,\"file\":\"binlog.000012\",\"pos\":157,\"server_id\":1},\"isHeartbeat\":true}"}}, "pg_split": null}, "split_type": "mysql-cdc"}
1002_backfill | true
(2 rows)
This looks very strange to me. How about storing the backfill finished flag inside the json as a new field under "mysql_split"? This issue was introduced in #12535 and is not related to the changes in this PR. I am okay with handling it separately.
Yeah, it is a bit dirty. As mentioned in #13276 (comment), I plan to refactor the plan of single-table cdc backfill to make cdc backfill a separate operator instead of just wrapping the source inside it, so that we don't need to write state to the source state table.
LGTM
Could you update the PR description and add a test for this as well (a separate PR is fine, as long as there's an issue to track it)? Previously the test only covers the case where cdc backfill is done.
Also, do we need both the pk and the cdc offset?
Thanks for the reminder. I will add more tests for the recoverable backfill part, tracked by #13400.
The persisted cdc_offset acts as a low watermark for upstream events: events with an offset less than the cdc_offset are ignored. It may also help us debug the backfill process.
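A minimal sketch of that low-watermark check, with the offset simplified to an integer (real cdc offsets are structured binlog/LSN positions, so this is an illustration only):

// Illustration: offsets reduced to integers for comparison.
fn should_apply(event_offset: u64, persisted_cdc_offset: Option<u64>) -> bool {
    match persisted_cdc_offset {
        // No watermark persisted yet: apply every upstream event.
        None => true,
        // Ignore events whose offset is less than the persisted cdc_offset,
        // as described above; they are already covered by the snapshot.
        Some(watermark) => event_offset >= watermark,
    }
}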
LGTM
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Add a state table (table_id | pk columns... | backfill_finished | row_count | cdc_offset) to the CdcBackfill executor to persist the backfill progress during backfill. The pk columns would be multiple columns if the table has composite primary keys.
close #13204
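For illustration, one row of that state table could be modeled roughly as below; the Rust types are assumptions, and only the column names come from the description above:

// Sketch of one persisted backfill-state row; concrete types are assumed.
struct BackfillStateRow {
    /// Upstream table being backfilled.
    table_id: u32,
    /// Current backfill position; several values when the upstream table
    /// has a composite primary key.
    pk_values: Vec<String>,
    /// Whether the snapshot phase for this table has completed.
    backfill_finished: bool,
    /// Snapshot rows consumed so far.
    row_count: u64,
    /// Last cdc offset seen, usable as a low watermark on recovery.
    cdc_offset: Option<String>,
}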
Checklist
./risedev check (or alias, ./risedev c)
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.