
[copy_from]: Initial implementation, add OneshotSource and OneshotFormat, support appending Batches to Tables #30942

Merged
merged 12 commits into MaterializeInc:main from copy/from-s3-initial-branch
Jan 17, 2025

Conversation

Member

@ParkMyCar ParkMyCar commented Jan 3, 2025

This PR is an initial implementation of COPY ... FROM <url>, aka "COPY FROM S3".

Goals for this PR

Note: traditionally we might have written a design doc for this feature, but instead I would like to try a lighter-weight approach where we make explicit decisions on the core changes necessary for this feature and later record them in a Decision Log. Those decisions are:

  1. How do we handle appending large amounts of data to a Table?
    • This PR implements this by creating Persist Batches in clusterd and then handing them back to environmentd for final linking into the Persist shard; these changes are in the 3rd commit. A different idea would be to implement "renditions" in Persist, but that is a much larger change.
  2. Should this "oneshot ingestion" live in "storage" or "compute"?
    • I added this implementation to "storage" because it seemed easier to do.
  3. (really 2a) Do the current changes to the Storage Controller API/Protocol make sense?
    • The Storage Controller already has lots of responsibilities; I want to make sure the changes I made resonate with folks and fit into any "north star"-ish visions we have. These changes are in the 2nd commit.
    • Specifically, I could see wanting to fold StorageCommand::RunOneshotIngestion into StorageCommand::RunIngestion, but given that a oneshot ingestion is ephemeral and shouldn't be restarted when clusterd crashes, keeping them separate seemed reasonable.

What this PR implements

  • A framework for "oneshot sources" and formats via two new traits, OneshotSource and OneshotFormat.
    • There is a doc comment on src/storage-operators/src/oneshot_source.rs that should explain how these traits are used. If that comment is not clear please let me know! (A rough sketch of the trait shapes follows this list.)
  • Changes to the Storage protocol for creating a "oneshot ingestion", and having it asynchronously respond to the Coordinator/environmentd with ProtoBatches that can be linked into a Table.
  • Changes to txn-wal and the Coordinator to support appending ProtoBatches to a Table instead of just Vec<Row>.
  • Parsing, Planning, and Sequencing changes to support COPY ... FROM <url>
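
As mentioned above, here is a minimal, hypothetical sketch of the shape of the OneshotSource / OneshotFormat split. It is illustrative only: the real definitions in src/storage-operators/src/oneshot_source.rs are async, carry error and checksum types, and differ in their exact signatures.

// Hypothetical sketch of the two-trait split; assumed shapes, not the PR's
// actual trait definitions.

/// Where the data lives (e.g. an HTTP URL or an S3 bucket).
trait OneshotSource {
    /// Description of a single object (file) that can be fetched.
    type Object: Clone + std::fmt::Debug;

    /// List the objects available from this source.
    fn list(&self) -> Vec<Self::Object>;
    /// Fetch the raw bytes of one object.
    fn fetch(&self, object: &Self::Object) -> Vec<u8>;
}

/// How fetched bytes are decoded (e.g. CSV or Parquet).
trait OneshotFormat {
    /// A unit of decoding work, so one large object can be split across workers.
    type WorkRequest: Clone + std::fmt::Debug;
    /// The decoded output; in the real implementation this is a `Row`.
    type Record;

    /// Split one object into independently decodable chunks.
    fn split_work<S: OneshotSource>(&self, object: S::Object) -> Vec<Self::WorkRequest>;
    /// Decode one chunk of bytes into records.
    fn decode(&self, work: Self::WorkRequest, bytes: &[u8]) -> Vec<Self::Record>;
}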

Feature Gating

The COPY ... FROM <url> feature is currently gated behind a LaunchDarkly flag called enable_copy_from_remote, so none of the Storage-related code is reachable unless this flag is turned on. Only the SQL parser and Table appending changes are reachable without the flag.

Motivation

Progress towards https://github.com/MaterializeInc/database-issues/issues/6575

Tips for reviewer

I did my best to split this PR into logically separate commits to make them easier to review:

  1. Initial implementation of "oneshot ingestions". This commit defines the OneshotSource and OneshotFormat traits, and implements the dataflow rendering for a "oneshot ingestion".
  2. Changes to the Storage Controller to support rendering and sending results of a "oneshot ingestion". ⭐
  3. Changes to the Coordinator and txn-wal to support appending Batches to tables. ⭐
  4. Parsing, Planning, and Sequencing changes in the Adapter.
  5. Formatting, Linting, etc.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@ParkMyCar ParkMyCar force-pushed the copy/from-s3-initial-branch branch 2 times, most recently from 324c097 to a2429c7 Compare January 6, 2025 16:36
@ParkMyCar ParkMyCar marked this pull request as ready for review January 6, 2025 17:21
@ParkMyCar ParkMyCar requested review from a team and jkosh44 as code owners January 6, 2025 17:21
@antiguru antiguru self-requested a review January 6, 2025 20:07
Contributor

@def- def- left a comment

I'd like to see some tests for this:

  • testdrive
  • platform-checks
  • parallel-workload

I can work on that myself later this week if that's OK with you.

Contributor

@jkosh44 jkosh44 left a comment

Adapter parts LGTM, didn't look closely at the storage parts.

Comment on lines +971 to +1000
#[derive(Debug, Clone, PartialEq)]
pub enum TableData {
    /// Rows that still need to be persisted and appended.
    ///
    /// The contained [`Row`]s are _not_ consolidated.
    Rows(Vec<(Row, Diff)>),
    /// Batches already staged in Persist ready to be appended.
    Batches(SmallVec<[ProtoBatch; 1]>),
}

impl TableData {
    pub fn is_empty(&self) -> bool {
        match self {
            TableData::Rows(rows) => rows.is_empty(),
            TableData::Batches(batches) => batches.is_empty(),
        }
    }
}

Contributor

This is totally subjective and just an idea, so feel free to ignore/disagree. It seems like we almost always wrap this enum in a Vec, which gives us a slightly awkward vec of vecs. Would we be better off using something like the following so we can consolidate all the inner vecs?

struct TableData {
    rows: Vec<(Row, Diff)>,
    batches: SmallVec<[ProtoBatch; 1]>,
}
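
For illustration, a sketch of how the struct form could fold per-statement data into one flat value instead of a Vec of enum values. Row, Diff, and ProtoBatch are stand-ins here so the snippet compiles on its own; the real types live in mz_repr and mz_persist_client.

// Sketch only: stand-in types, not the real mz_repr / persist definitions.
use smallvec::SmallVec;

#[derive(Debug, Clone)]
struct Row;
type Diff = i64;
#[derive(Debug, Clone)]
struct ProtoBatch;

#[derive(Debug, Clone, Default)]
struct TableData {
    rows: Vec<(Row, Diff)>,
    batches: SmallVec<[ProtoBatch; 1]>,
}

impl TableData {
    /// Folding another statement's data in keeps a single flat pair of
    /// vectors, avoiding the Vec<TableData>-of-enums nesting.
    fn merge(&mut self, other: TableData) {
        self.rows.extend(other.rows);
        self.batches.extend(other.batches);
    }

    fn is_empty(&self) -> bool {
        self.rows.is_empty() && self.batches.is_empty()
    }
}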

Member Author

Good idea, I'll try this out and see how it feels

Comment on lines +104 to +112
// Stash the execute context so we can cancel the COPY.
self.active_copies
    .insert(ctx.session().conn_id().clone(), ctx);
Contributor

I'm not sure if this has been discussed, but if we cancel after the batch has been staged, then will we leak the batch in persist?

Member Author

We would leak the batches. If we cancel the request, we could spawn a task that waits for the response and cleans them up; concurrently, Persist is also working on a leaked-blob detector, so this shouldn't be too much of an issue.
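
A rough sketch of that cleanup idea, purely hypothetical: the staged-batch receiver type and the delete_leaked_batch helper are made-up stand-ins, not existing persist APIs.

// Hypothetical sketch: on cancellation, keep waiting for the staged-batch
// response in a background task and discard whatever comes back, rather than
// dropping the receiver and leaking the blobs.
use tokio::sync::oneshot;

async fn delete_leaked_batch(_batch: Vec<u8>) {
    // stand-in for whatever persist blob cleanup we end up using
}

fn spawn_cancel_cleanup(staged_batches: oneshot::Receiver<Vec<Vec<u8>>>) {
    tokio::spawn(async move {
        if let Ok(batches) = staged_batches.await {
            for batch in batches {
                delete_leaked_batch(batch).await;
            }
        }
    });
}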

if let Err(err) = stage_write {
    ctx.retire(Err(err));
} else {
    ctx.retire(Ok(ExecuteResponse::Copied(row_count.cast_into())));
Contributor

Why Copied and not CopyFrom?

Member Author

The CopyFrom execute response is actually what drives the existing COPY FROM implementation, so it doesn't really work as the response type here. When ending a session with ExecuteResponse::CopyFrom we actually move the Session to a separate task that streams in data.

@ParkMyCar ParkMyCar requested a review from aljoscha as a code owner January 7, 2025 21:58
Contributor

@teskje teskje left a comment

I've only gotten through the first commit so far; posting my comments up to this point.

@@ -137,7 +142,7 @@ impl<T> StorageCommand<T> {
| AllowWrites
| UpdateConfiguration(_)
| AllowCompaction(_) => false,
RunIngestions(_) | RunSinks(_) => true,
RunIngestions(_) | RunSinks(_) | RunOneshotIngestion(_) => true,
Contributor

note to self: come back to this one

Are these different because they're not permanent objects? What eventually cleans them up or do they shut down themselves? Then again, if they shut down themselves, might the response side wait forever for a response?

@@ -71,6 +71,9 @@ impl<T: std::fmt::Debug> CommandHistory<T> {
RunIngestions(x) => metrics.run_ingestions_count.add(x.len().cast_into()),
RunSinks(x) => metrics.run_sinks_count.add(x.len().cast_into()),
AllowCompaction(x) => metrics.allow_compaction_count.add(x.len().cast_into()),
RunOneshotIngestion(_) => {
// TODO(parkmycar): Metrics.
Contributor

For this PR or a follow-up? I'm just curious.

@@ -147,6 +155,9 @@ impl<T: std::fmt::Debug> CommandHistory<T> {
run_sinks.push(sink);
}

// TODO(parkmycar): ???
Contributor

??? 🤔

again, note to self: what eventually cleans one-shot ingestions out of the command history? For regular ingestions it's an AllowCompaction to the empty antichain. See here for that logic:

// Discard ingestions that have been dropped, keep the rest.

Contributor

IMO we should handle this the same way we handle peeks in compute: have a CancelOneshotIngestion command and have the controller send that command as soon as it receives a StagedBatches response. When compacting the command history, a CancelOneshotIngestion command cancels out the corresponding RunOneshotIngestion command. In a (hopefully near) future where we support replication for storage, CancelOneshotIngestion also tells the other replicas that they don't need to bother continuing their ingestion and can save some work.

... or I guess AllowCompaction could fulfill the same purpose if we handle it accordingly in the controller and on the replica side. Not sure if that's the case currently.
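
A hypothetical sketch of that compaction rule, mirroring how compute compacts Peek/CancelPeek. The shapes are simplified stand-ins; the real StorageCommand variants carry full ingestion descriptions rather than bare IDs.

// Hypothetical sketch: once a CancelOneshotIngestion exists for an ID, both it
// and the matching RunOneshotIngestion can be dropped from the command
// history, so a reconnecting replica never sees work that is already done.
use std::collections::BTreeSet;

type IngestionId = u64;

#[derive(Debug, Clone, PartialEq)]
enum OneshotHistoryEntry {
    Run(IngestionId),
    Cancel(IngestionId),
}

fn reduce_oneshot_history(history: Vec<OneshotHistoryEntry>) -> Vec<OneshotHistoryEntry> {
    // Collect all cancellations first.
    let cancelled: BTreeSet<IngestionId> = history
        .iter()
        .filter_map(|entry| match entry {
            OneshotHistoryEntry::Cancel(id) => Some(*id),
            _ => None,
        })
        .collect();

    // Drop cancelled ingestions and the cancellations themselves.
    history
        .into_iter()
        .filter(|entry| match entry {
            OneshotHistoryEntry::Run(id) | OneshotHistoryEntry::Cancel(id) => {
                !cancelled.contains(id)
            }
        })
        .collect()
}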

Member Author

Chatted with Petros about this today and we aligned with what you describe, @teskje: a CancelOneshotIngestion message similar to CancelPeek. Planning to do this in a follow-up if that's okay with y'all?

Contributor

Fine with me, as long as we do this before making the feature available to users.

@@ -900,6 +954,10 @@ impl<'w, A: Allocate> Worker<'w, A> {
}
}
}
StorageCommand::RunOneshotIngestion(oneshot) => {
info!(%worker_id, ?oneshot, "reconcile: received RunOneshotIngestion command");
Contributor

Might the response side sit and wait forever when this happens? Is there a timeout?

But we also don't clean out stale one-shot ingestions from our state? This is where normal ingestions are cleaned out:

let stale_objects = self

Member Author

Good question! Cancellation is still a TODO and something I'll follow up on. As described above, I'll add a CancelOneshotIngestion command; I added a TODO(cf1) here so it'll be fixed before releasing to users.

// Add table advancements for all tables.
for table in self.catalog().entries().filter(|entry| entry.is_table()) {
    appends.entry(table.id()).or_default();
}
let appends: Vec<_> = appends

// Consolidate all Rows for a given table.
Contributor

So we're only consolidating the "raw" rows we might have; batched data is passed through untouched, yes?

Member Author

Exactly, I left a comment describing as much
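
A tiny example of what that consolidation does to the raw rows, using differential-dataflow's helper; the stringly-typed rows stand in for mz_repr::Row, and staged batches are simply passed through untouched.

// Identical rows have their diffs summed; rows whose diff nets to zero are dropped.
use differential_dataflow::consolidation::consolidate;

fn main() {
    let mut rows: Vec<(String, i64)> = vec![
        ("a".to_string(), 1),
        ("b".to_string(), 1),
        ("a".to_string(), 1),
        ("b".to_string(), -1), // nets out to zero and disappears
    ];
    consolidate(&mut rows);
    assert_eq!(rows, vec![("a".to_string(), 2_i64)]);
}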

itertools::Either::Left(iter)
}
TableData::Batches(_) => {
// TODO(cf1): Handle Batches of updates in ReadOnlyTableWorker.
Contributor

@jkosh44 We might want to decide to never support this? The read-only table worker is only used to write to newly migrated builtin tables in read-only mode, so ... 🤷‍♂️

Member Author

Works for me, will chat with y'all in Slack

@@ -968,6 +968,25 @@ pub struct TimestamplessUpdate {
pub diff: Diff,
}

#[derive(Debug, Clone, PartialEq)]
Contributor

When I initially saw this used above, I thought we'd removed TimestamplessUpdate, but apparently I can't have nice things... 😅 These days the initial sentence in its description doesn't even make sense; most people don't know what it would mean.

Member Author

Heh maybe one day soon! I left a TODO(cf2) to see if I can remove that type

@ParkMyCar
Member Author

Thanks @teskje and @aljoscha for the reviews! Something I should have described earlier is that I annotated the TODOs with cf#. Roughly what I'm thinking for these is:

  • cf1: needs to be fixed before giving to any users (Private Preview blockers)
  • cf2: followups that should be solved before announcing anything publicly/adding docs (Public Preview blockers)
  • cf3: long tail fixes that we should fix before "GA" but most likely can go to "public preview" without them (GA blockers)

Contributor

@teskje teskje left a comment

I reviewed the first two commits (+ commits addressing review comments) and they lgtm!

Contributor

@petrosagg petrosagg left a comment

There are quite a few new concepts and abstractions introduced in this PR. We should discuss whether they fill a new need or whether we can reuse existing patterns, which would reduce the overall complexity.


//! "Oneshot" sources are a one-time ingestion of data from an external system, unlike traditional
//! sources, they __do not__ run continuously. Oneshot sources are generally used for `COPY FROM`
//! SQL statements.
Contributor

The existing source framework already supports sources that run to completion, like a load generator source that produces some finite amount of data and then completes. It also already supports the concept of decoding the incoming data using a format. What is the motivation behind building a new set of abstractions for these operators?

Member Author

I did initially look at the SourceRender trait and thought through adding a type like OneshotSourceConnection that would implement it, but it didn't seem like the right fit. Granted I'm new to the storage stack, but there is a lot of machinery around progress tracking, remapping timestamps, error handling, multiple exports, health tracking, and possibly more, all of which oneshot ingestions wouldn't use. So it felt like trying to get oneshot ingestions to fit into the existing Source render pipeline was more effort than it was worth? Happy to revisit this in the future if you think the two should be merged though!

//! ┃ Work 1 ┃ ┃ Work n ┃
//! ┗━━━━━┯━━━━┛ ┗━━━━━┯━━━━┛
//! │ │
//! ├───< Distribute >───┤
Contributor

We should only distribute once in the beginning, ideally just the descriptions of the objects to be fetched, and then run the required work on one worker. Distributing multiple times down the dataflow incurs a network cost and makes the dataflow more susceptible to runaway memory usage, since backpressure cannot be communicated cross-worker by default.

Member Author

The reason we distribute multiple times is to handle cases where there is a single large file that users want to ingest. For example, a single 200GB Parquet file can be split into hundreds of Row Groups that can be parallelized across any number of workers. Maybe this means we should combine the Discover and Split Work stages into a single operator, but to keep the implementation as flexible as possible I made them separate.
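
To illustrate the trade-off, a toy sketch of splitting one large object into work items and spreading them over workers; the modulo assignment mirrors what a timely Exchange pact keyed on the item index would do. WorkItem and split_and_assign are illustrative names, not operators from this PR.

// One big Parquet file becomes one work item per row group, assigned
// round-robin across workers so decoding can be parallelized.
#[derive(Debug)]
struct WorkItem {
    object: String,
    row_group: usize,
}

fn split_and_assign(object: &str, row_groups: usize, workers: usize) -> Vec<Vec<WorkItem>> {
    let mut per_worker: Vec<Vec<WorkItem>> = (0..workers).map(|_| Vec::new()).collect();
    for rg in 0..row_groups {
        per_worker[rg % workers].push(WorkItem {
            object: object.to_string(),
            row_group: rg,
        });
    }
    per_worker
}

fn main() {
    // A single large Parquet file with 400 row groups spread over 8 workers.
    let assignment = split_and_assign("s3://bucket/big.parquet", 400, 8);
    assert!(assignment.iter().all(|per_worker| per_worker.len() == 400 / 8));
    println!("worker 0 gets {:?} ...", assignment[0].first());
}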


/// Render an operator that given a stream of [`Row`]s will stage them in Persist and return a
/// stream of [`ProtoBatch`]es that can later be linked into a shard.
pub fn render_stage_batches_operator<G>(
Contributor

I have a similar reusability question here. The persist_sink implementation we have in storage already has a stage that writes batches to persist and then collects the batch descriptions onto one worker to perform the CaA call. I was expecting that we'd reuse those operators and only replace the last append operator with one that sends the descriptions back to the controller.

Member Author

I read through the write_batches operator in the persist_sink and opted not to use it because, similar to the SourceRender trait, it seems to be set up to do much more than we require. I'm more optimistic that we could refactor this code to use write_batches, but given that render_stage_batches_operator from this PR is essentially just a loop that appends to a Persist Batch, the amount of code duplication isn't that large, so a new operator seemed okay to me.

@ParkMyCar ParkMyCar force-pushed the copy/from-s3-initial-branch branch from 45cee62 to e1febaa Compare January 15, 2025 15:30
* add OneshotSource and OneshotFormat traits
* add HttpSource and CsvFormat
* implement render(...) function to build a dataflow given a OneshotSource and OneshotFormat
* add new StorageCommand::RunOneshotIngestion
* add new StorageResponse::StagedBatches
* add build_oneshot_ingestion_dataflow function which calls render(...) from the previous commit
* introduce a TableData enum that supports Rows or Batches
* refactor the append codepath to use the new enum
* refactor txn-wal to support appending a ProtoBatch in addition to Rows
* support specifying an Expr in COPY FROM
* update plan_copy_from to handle copying from a remote source
* add sequence_copy_from which calls the storage-controller to render a oneshot ingestion
* add a new Coordinator message StagedBatches to handle the result of the oneshot ingestion asynchronously
* remove duplicated code
* update comments
* update error types
* refactor use of pact::Distribute and .distribute operator
* update comments
* restructure the tokio task spawning a bit
* use the stage_batches initial capability instead of a CapabilitySet
* add comments describing existing behavior
* update a lot of comments with TODO(cf) format for better tracking
* mostly leave comments around cancellation, which is still TODO
@ParkMyCar ParkMyCar force-pushed the copy/from-s3-initial-branch branch from de4f593 to 51fad2c Compare January 17, 2025 22:37
@ParkMyCar ParkMyCar enabled auto-merge (squash) January 17, 2025 22:38
@ParkMyCar
Member Author

Chatted with @petrosagg in Slack and he gave a thumbs up for merging

@ParkMyCar ParkMyCar merged commit f58b528 into MaterializeInc:main Jan 17, 2025
79 checks passed