-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[PERF]: parallelize applying log to segment types #3134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PERF]: parallelize applying log to segment types #3134
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
@@ -787,11 +788,12 @@ impl<'me> LogMaterializer<'me> { | |||
|
|||
// This needs to be public for testing | |||
#[allow(async_fn_in_trait)] | |||
pub trait SegmentWriter<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lifetime doesn't need to be on the struct
740db31
to
c40d5f2
Compare
@@ -628,7 +630,7 @@ pub struct RecordSegmentReader<'me> { | |||
user_id_to_id: BlockfileReader<'me, &'me str, u32>, | |||
id_to_user_id: BlockfileReader<'me, u32, &'me str>, | |||
id_to_data: BlockfileReader<'me, u32, DataRecord<'me>>, | |||
curr_max_offset_id: Arc<AtomicU32>, | |||
max_offset_id: u32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed to change this because the shared state led to a bug in the log materialization logic when running multiple materializations in parallel (see comment in compact.rs
). Readers should not expose mutable state.
88e8e29
to
bfd894f
Compare
rust/worker/src/execution/operators/apply_log_to_segment_writer.rs
Outdated
Show resolved
Hide resolved
panic!("Error creating record segment reader"); | ||
} | ||
RecordSegmentReaderCreationError::UserRecordNotFound(_) => { | ||
_ => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should not panic here / can we use this as a moment to clean up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line is inside a test; were you referring to something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can we clean up the panics we bail out to in metadata segment?
- Can we expand the comment for the atomics dance that explains that we view materialization as idempotent and that it should result in the same results on each thread. Otherwise we can have horrible bugs.
- Is there anyway to aggressively test this atomics dance?
7be9113
to
cffe2a3
Compare
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
e696cd7
to
851511a
Compare
e30c087
to
cb43e94
Compare
92a0f28
to
f8ab4ac
Compare
@@ -309,7 +310,7 @@ impl CompactOrchestrator { | |||
}, | |||
}; | |||
|
|||
self.num_write_tasks = partitions.len() as i32; | |||
self.num_write_tasks = partitions.len() as i32 * 3; // 3 different segment types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems worth calling out as an assumption for when we add the index. I don't know the best way to log it. I'd do a TODO(feat)
, but we need to agree on what that would mean.
cb43e94
to
5e19390
Compare
1588bc4
to
b8d61ce
Compare
Superseded by #3359 (new branch was easier than rebasing as approach has changed). |
Description of changes
Applies log updates to segment types in parallel rather than sequentially:
Pipelining flushes to S3/applying to blockfile is in the next PR in this stack.
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?
n/a