[copy_from]: Initial implementation, add OneshotSource and OneshotFormat, support appending Batches to Tables #30942
Conversation
Force-pushed 324c097 to a2429c7
I'd like to see some tests for this:
- testdrive
- platform-checks
- parallel-workload
I can work on that myself later this week if that's okay with you.
Adapter parts LGTM, didn't look closely at the storage parts.
#[derive(Debug, Clone, PartialEq)]
pub enum TableData {
    /// Rows that still need to be persisted and appended.
    ///
    /// The contained [`Row`]s are _not_ consolidated.
    Rows(Vec<(Row, Diff)>),
    /// Batches already staged in Persist ready to be appended.
    Batches(SmallVec<[ProtoBatch; 1]>),
}

impl TableData {
    pub fn is_empty(&self) -> bool {
        match self {
            TableData::Rows(rows) => rows.is_empty(),
            TableData::Batches(batches) => batches.is_empty(),
        }
    }
}
This is totally subjective and just an idea, so feel free to ignore/disagree. It seems like we almost always wrap this enum in a Vec, which gives us a slightly awkward vec of vecs. Would we be better off using something like the following so we can consolidate all the inner vecs?
struct TableData {
    rows: Vec<(Row, Diff)>,
    batches: SmallVec<[ProtoBatch; 1]>,
}
Good idea, I'll try this out and see how it feels.
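To make the suggestion concrete, here is a minimal sketch of how the consolidated struct could look, using stand-in types for Row, Diff, and ProtoBatch; this is illustrative, not the PR's actual code:

// Stand-ins for the real types, so the sketch is self-contained.
use smallvec::SmallVec;

type Row = String; // stand-in for the real Row type
type Diff = i64; // stand-in for the real Diff type

#[derive(Debug, Clone, PartialEq)]
struct ProtoBatch; // stand-in for persist's ProtoBatch

#[derive(Debug, Clone, PartialEq, Default)]
struct TableData {
    rows: Vec<(Row, Diff)>,
    batches: SmallVec<[ProtoBatch; 1]>,
}

impl TableData {
    fn is_empty(&self) -> bool {
        self.rows.is_empty() && self.batches.is_empty()
    }

    // Merging per-table data becomes a flat extend instead of pushing yet
    // another element onto a Vec<TableData>.
    fn extend(&mut self, other: TableData) {
        self.rows.extend(other.rows);
        self.batches.extend(other.batches);
    }
}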
// Stash the execute context so we can cancel the COPY.
self.active_copies
    .insert(ctx.session().conn_id().clone(), ctx);
I'm not sure if this has been discussed, but if we cancel after the batch has been staged, then will we leak the batch in persist?
We would leak the batches. If we cancel the request we could spawn a task that will wait for the response and clean them up, but concurrently Persist is also working on a leaked-blob detector, so this shouldn't be too much of an issue.
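As a rough sketch of the cleanup idea being described (all names here are hypothetical stand-ins, not the actual coordinator or persist API):

use tokio::sync::oneshot;

// Hypothetical handle to a batch staged in persist but never linked into
// the shard; `delete` stands in for whatever cleanup persist exposes.
struct StagedBatch;

impl StagedBatch {
    async fn delete(self) {
        // e.g., delete the backing blobs
    }
}

// On cancellation, instead of dropping the response channel on the floor,
// spawn a task that waits for the staged batches and deletes them so their
// blobs aren't leaked (persist's leaked-blob detector would eventually
// catch them regardless).
fn cleanup_after_cancel(response: oneshot::Receiver<Vec<StagedBatch>>) {
    tokio::spawn(async move {
        if let Ok(batches) = response.await {
            for batch in batches {
                batch.delete().await;
            }
        }
    });
}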
if let Err(err) = stage_write {
    ctx.retire(Err(err));
} else {
    ctx.retire(Ok(ExecuteResponse::Copied(row_count.cast_into())));
Why Copied and not CopyFrom?
The CopyFrom execute response is actually what drives the existing COPY FROM implementation, so it doesn't really work as the response type here. When ending a session with ExecuteResponse::CopyFrom we actually move the Session to a separate task which streams in data.
I've only gotten through the first commit so far, posting my comments as I have them.
@@ -137,7 +142,7 @@ impl<T> StorageCommand<T> {
            | AllowWrites
            | UpdateConfiguration(_)
            | AllowCompaction(_) => false,
-           RunIngestions(_) | RunSinks(_) => true,
+           RunIngestions(_) | RunSinks(_) | RunOneshotIngestion(_) => true,
note to self: come back to this one
Are these different because they're not permanent objects? What eventually cleans them up or do they shut down themselves? Then again, if they shut down themselves, might the response side wait forever for a response?
@@ -71,6 +71,9 @@ impl<T: std::fmt::Debug> CommandHistory<T> {
            RunIngestions(x) => metrics.run_ingestions_count.add(x.len().cast_into()),
            RunSinks(x) => metrics.run_sinks_count.add(x.len().cast_into()),
            AllowCompaction(x) => metrics.allow_compaction_count.add(x.len().cast_into()),
+           RunOneshotIngestion(_) => {
+               // TODO(parkmycar): Metrics.
for this PR or a follow-up? I'm just curious
@@ -147,6 +155,9 @@ impl<T: std::fmt::Debug> CommandHistory<T> {
            run_sinks.push(sink);
        }

+       // TODO(parkmycar): ???
??? 🤔

again, note to self: what eventually cleans one-shot ingestions out of the command history? For regular ingestions it's an AllowCompaction to the empty antichain. See here for that logic:

// Discard ingestions that have been dropped, keep the rest.
IMO we should handle this the same way we handle peeks in compute: have a CancelOneshotIngestion command and have the controller send that command as soon as it receives a StagedBatches response. When compacting the command history, a CancelOneshotIngestion command cancels out the corresponding RunOneshotIngestion command. In a (hopefully near) future where we support replication for storage, CancelOneshotIngestion also tells the other replicas that they don't need to bother continuing their ingestion and can save some work.

... or I guess AllowCompaction could fulfill the same purpose if we handle it accordingly in the controller and on the replica side. Not sure if that's the case currently.
Chatted with Petros about this today and aligned with what you describe @teskje: a CancelOneshotIngestion-like message, similar to CancelPeek. Planning to do this in a follow-up if that's okay with y'all?
Fine with me, as long as we do this before making the feature available to users.
@@ -900,6 +954,10 @@ impl<'w, A: Allocate> Worker<'w, A> {
                    }
                }
            }
+           StorageCommand::RunOneshotIngestion(oneshot) => {
+               info!(%worker_id, ?oneshot, "reconcile: received RunOneshotIngestion command");
Might the response side sit and wait forever when this happens? Is there a timeout?
But we also don't clean out stale one-shot ingestions from our state? This is where normal ingestions are cleaned out:
materialize/src/storage/src/storage_state.rs, line 1102 @ 7667ddf:

let stale_objects = self
Good question! Cancellation is still a TODO and something I'll follow up on, but as described above I'll add a CancelOneshotIngestion command. I added a TODO(cf1) here so it'll be fixed before releasing to users.
src/adapter/src/coord/appends.rs (outdated)
// Add table advancements for all tables.
for table in self.catalog().entries().filter(|entry| entry.is_table()) {
    appends.entry(table.id()).or_default();
}
let appends: Vec<_> = appends

// Consolidate all Rows for a given table.
So we're only consolidating the "raw" rows we might have; batched data is passed through untouched, yes?
Exactly, I left a comment describing as much.
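Roughly, that consolidation step has this shape; a sketch with stand-in types (the real code uses the repo's Row/Diff types and differential's consolidation helpers):

use std::collections::BTreeMap;

type Row = String; // stand-in for the real Row type
type Diff = i64; // stand-in for the real Diff type

#[derive(Debug)]
enum TableData {
    Rows(Vec<(Row, Diff)>),
    Batches(Vec<String>), // stand-in for SmallVec<[ProtoBatch; 1]>
}

// Consolidate only the "raw" rows: sum the diffs of identical rows and drop
// rows whose diff nets out to zero. Already-staged batches pass through
// untouched.
fn consolidate_appends(data: Vec<TableData>) -> Vec<TableData> {
    let mut summed: BTreeMap<Row, Diff> = BTreeMap::new();
    let mut out = Vec::new();
    for entry in data {
        match entry {
            TableData::Rows(rows) => {
                for (row, diff) in rows {
                    *summed.entry(row).or_insert(0) += diff;
                }
            }
            batches @ TableData::Batches(_) => out.push(batches),
        }
    }
    let rows: Vec<_> = summed.into_iter().filter(|(_, diff)| *diff != 0).collect();
    if !rows.is_empty() {
        out.push(TableData::Rows(rows));
    }
    out
}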
        itertools::Either::Left(iter)
    }
    TableData::Batches(_) => {
        // TODO(cf1): Handle Batches of updates in ReadOnlyTableWorker.
@jkosh44 We might want to decide to never support this? The read-only table worker is only used to write to newly migrated builtin tables in read-only mode, so ... 🤷♂️
Works for me, will chat with y'all in Slack
@@ -968,6 +968,25 @@ pub struct TimestamplessUpdate {
    pub diff: Diff,
}

+#[derive(Debug, Clone, PartialEq)]
When I initially saw this used above, I thought we'd removed TimestamplessUpdate, but apparently I can't have nice things... 😅 These days, the initial sentence in its description doesn't even make sense / most people don't know what it would mean.
Heh, maybe one day soon! I left a TODO(cf2) to see if I can remove that type.
Thanks @teskje and @aljoscha for the reviews! Something I should have described earlier is that I annotated the TODOs with a TODO(cf) format for better tracking.
I reviewed the first two commits (+ commits addressing review comments) and they lgtm!
There are quite a few new concepts and abstractions introduced in this PR. We should discuss whether they fill a new need or whether we can reuse existing patterns, which would reduce the overall complexity.
//! "Oneshot" sources are a one-time ingestion of data from an external system; unlike traditional
//! sources, they __do not__ run continuously. Oneshot sources are generally used for `COPY FROM`
//! SQL statements.
The existing source framework already supports sources that run to completion, like a load generator source that produces some finite amount of data and then completes. It also already supports the concept of decoding the incoming data using a format. What is the motivation behind building a new set of abstractions for these operators?
I did initially look at the SourceRender trait and thought through adding a type like OneshotSourceConnection that would implement it, but it didn't seem like the right fit. Granted, I'm new to the storage stack, but there is a lot of machinery around progress tracking, remapping timestamps, error handling, multiple exports, health tracking, and possibly more, all of which oneshot ingestions wouldn't use. So it felt like trying to get oneshot ingestions to fit into the existing source render pipeline was more effort than it was worth? Happy to revisit this in the future if you think the two should be merged though!
//! ┃  Work 1  ┃     ┃  Work n  ┃
//! ┗━━━━━┯━━━━┛     ┗━━━━━┯━━━━┛
//!       │                │
//!       ├───< Distribute >───┤
We should only distribute once in the beginning, ideally just the descriptions of the objects to be fetched, and then run the required work on one worker. Distributing multiple times down the dataflow incurs a network cost and makes the dataflow more susceptible to runaway memory usage, since backpressure cannot be communicated cross-worker by default.
The reason we distribute multiple times is for the cases where there is a single large file that users want to ingest. For example, a single 200GB Parquet file can be split into hundreds of Row Groups that can be parallelized across any number of workers. Maybe this means we should combine the Discover and Split Work stages into a single operator, but to keep the implementation as flexible as possible I made them separate.
/// Render an operator that, given a stream of [`Row`]s, will stage them in Persist and return a
/// stream of [`ProtoBatch`]es that can later be linked into a shard.
pub fn render_stage_batches_operator<G>(
I have a similar reusability question here. The persist_sink implementation we have in storage already has a stage that writes batches to persist and then collects the batch descriptions onto one worker to perform the CaA call. I was expecting that we'd reuse those operators and only replace the last append operator with one that sends the descriptions back to the controller.
I read through the write_batches operator in the persist_sink and opted not to use it because, similar to the SourceRender trait, it seems to be set up to do much more than we require. I'm more optimistic we could refactor this code to use write_batches, but given that render_stage_batches_operator from this PR is essentially just a loop that appends to a Persist Batch, the amount of code duplication isn't that large, so a new operator seemed okay to me.
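For reference, the shape being described is roughly the following; the builder type is a hypothetical stand-in, not persist's actual batch-building API:

// Hypothetical stand-ins; the real types live in the persist client crate
// and have richer signatures.
struct BatchBuilder {
    updates: Vec<(String, i64)>,
}

struct ProtoBatch {
    num_updates: usize,
}

impl BatchBuilder {
    fn new() -> Self {
        BatchBuilder { updates: Vec::new() }
    }
    fn add(&mut self, row: String, diff: i64) {
        self.updates.push((row, diff));
    }
    fn finish(self) -> ProtoBatch {
        ProtoBatch { num_updates: self.updates.len() }
    }
}

// The essence of render_stage_batches_operator: drain the incoming rows
// into one batch and emit the finished ProtoBatch downstream.
fn stage_batch(rows: impl IntoIterator<Item = (String, i64)>) -> ProtoBatch {
    let mut builder = BatchBuilder::new();
    for (row, diff) in rows {
        builder.add(row, diff);
    }
    builder.finish()
}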
Force-pushed 45cee62 to e1febaa
* add OneshotSource and OneshotFormat traits
* add HttpSource and CsvFormat
* implement render(...) function to build a dataflow given a OneshotSource and OneshotFormat

* add new StorageCommand::RunOneshotIngestion
* add new StorageResponse::StagedBatches
* add build_oneshot_ingestion_dataflow function which calls render(...) from the previous commit

* introduce a TableData enum that supports Rows or Batches
* refactor the append codepath to use the new enum
* refactor txn-wal to support appending a ProtoBatch in addition to Rows

* support specifying an Expr in COPY FROM
* update plan_copy_from to handle copying from a remote source
* add sequence_copy_from which calls the storage-controller to render a oneshot ingestion
* add a new Coordinator message StagedBatches to handle the result of the oneshot ingestion asynchronously

* remove duplicated code
* update comments
* update error types

* refactor use of pact::Distribute and the .distribute operator
* update comments
* restructure the tokio task spawning a bit

* use the stage_batches initial capability instead of a CapabilitySet
* add comments describing existing behavior

* update a lot of comments with TODO(cf) format for better tracking
* mostly leave comments around cancellation, which is still a TODO
Force-pushed de4f593 to 51fad2c
Chatted with @petrosagg in Slack and he gave a thumbs up for merging.
This PR is an initial implementation of COPY ... FROM <url>, aka "COPY FROM S3".

Goals for this PR

Note: traditionally we may have written a design doc for this feature, but I would instead like to try a lighter-weight approach where we specifically make a decision on the core changes necessary for this feature, and later record those in a Decision Log. Those are:

- Staging Batches in clusterd and then handing them back to environmentd for final linking into the Persist shard; these changes are in the 3rd commit. A different idea would be to implement "renditions" in Persist, but that is a much larger change.
- A separate StorageCommand::RunOneshotIngestion rather than folding it into StorageCommand::RunIngestion; given a oneshot ingestion is ephemeral and shouldn't be restarted when clusterd crashes, keeping them separate seemed reasonable.

What this PR implements
- New OneshotSource and OneshotFormat traits. There is a doc comment in src/storage-operators/src/oneshot_source.rs that should explain how these traits are used. If that comment is not clear please let me know!
- A dataflow that stages data in Persist as ProtoBatches that can be linked into a Table.
- Changes to txn-wal and the Coordinator to support appending ProtoBatches to a Table instead of just Vec<Row>.
- SQL-level support for COPY ... FROM <url>.
Feature Gating

The COPY ... FROM <url> feature is currently gated behind a LaunchDarkly flag called enable_copy_from_remote, so none of the Storage-related code is reachable unless this flag is turned on. Only the SQL parser and Table appending changes are reachable without the flag.

Motivation
Progress towards https://github.com/MaterializeInc/database-issues/issues/6575
Tips for reviewer

I did my best to split this PR into logically separate commits to make them easier to review:

- Adds the OneshotSource and OneshotFormat traits, and implements the dataflow rendering for a "oneshot ingestion".
- Refactors txn-wal to support appending Batches to tables. ⭐

Checklist

- If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.