feat: Implement custom RecordBatch serde for shuffle for improved performance #1190

andygrove · 2024-12-20T16:51:09Z

Which issue does this PR close?

Builds on #1192

Rationale for this change

Arrow IPC is a good general-purpose serde framework, but we can get better performance by implementing specialized code optimized for Comet, which encodes single batches to shuffle blocks.

This PR implements a new BatchWriter and BatchReader and updates the shuffle writer to use them when possible (that is, when all data types in the batch are supported), falling back to Arrow IPC in all other cases.
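The fallback rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `DataType` enum is a simplified stand-in for Arrow's type, `choose_encoding` and `is_fast_path_type` are hypothetical names, and treating `List` as unsupported is an assumption made only so the example has a fallback case.

```rust
/// Simplified stand-in for Arrow's `DataType` (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    Int16,
    Int32,
    Float32,
    Utf8,
    List(Box<DataType>),
}

/// Which serde path a batch takes.
#[derive(Debug, PartialEq)]
enum Encoding {
    Fast,
    ArrowIpc,
}

/// Assumption for this sketch: nested types are not handled by the fast path.
fn is_fast_path_type(dt: &DataType) -> bool {
    !matches!(dt, DataType::List(_))
}

/// Use the fast path only when it is enabled and every column type is supported.
fn choose_encoding(schema: &[DataType], enable_fast_encoding: bool) -> Encoding {
    if enable_fast_encoding && schema.iter().all(is_fast_path_type) {
        Encoding::Fast
    } else {
        Encoding::ArrowIpc
    }
}

fn main() {
    let flat = vec![DataType::Int32, DataType::Utf8];
    let nested = vec![DataType::Int8, DataType::List(Box::new(DataType::Int16))];
    println!("{:?}", choose_encoding(&flat, true));
    println!("{:?}", choose_encoding(&nested, true));
    println!("{:?}", choose_encoding(&[DataType::Float32], false));
}
```

Making the decision per schema rather than per column keeps each shuffle block in a single format, which is one plausible reason for an all-or-nothing check.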

Specializations include:

- The schema gets encoded to bytes just once per shuffle operation rather than once per batch. The encoded schema bytes are then written out directly with each shuffle block, avoiding the schema serde cost per batch.
- Raw data, offset, and null buffers are written out directly with no flatbuffer encoding, no alignment, and no metadata.
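As a rough sketch of these two specializations, the example below encodes a schema once and then writes each block as the cached schema bytes followed by length-prefixed raw buffers. Everything here is an assumption for illustration: `BlockWriter`, `ColumnBuffers`, the length-prefix layout, and the presence byte for optional buffers are hypothetical and do not reflect the actual wire format in `batch_serde.rs`.

```rust
use std::io::{self, Write};

/// Hypothetical, simplified column: raw buffers only.
struct ColumnBuffers {
    data: Vec<u8>,
    offsets: Option<Vec<u8>>, // present for variable-length types such as Utf8
    nulls: Option<Vec<u8>>,   // validity bitmap, absent when there are no nulls
}

/// Writes a u64 length prefix followed by the raw bytes, with no padding.
fn write_buffer<W: Write>(out: &mut W, buf: &[u8]) -> io::Result<()> {
    out.write_all(&(buf.len() as u64).to_le_bytes())?;
    out.write_all(buf)
}

/// Writes a one-byte presence flag, then the buffer if there is one.
fn write_optional<W: Write>(out: &mut W, buf: Option<&[u8]>) -> io::Result<()> {
    match buf {
        Some(b) => {
            out.write_all(&[1])?;
            write_buffer(out, b)
        }
        None => out.write_all(&[0]),
    }
}

struct BlockWriter {
    schema_bytes: Vec<u8>,
}

impl BlockWriter {
    /// Encodes the schema once (here only field names, each length-prefixed).
    fn new(field_names: &[&str]) -> Self {
        let mut schema_bytes = Vec::new();
        schema_bytes.extend_from_slice(&(field_names.len() as u32).to_le_bytes());
        for name in field_names {
            schema_bytes.extend_from_slice(&(name.len() as u32).to_le_bytes());
            schema_bytes.extend_from_slice(name.as_bytes());
        }
        Self { schema_bytes }
    }

    /// Writes one block: cached schema bytes, row count, then raw buffers.
    fn write_block<W: Write>(
        &self,
        out: &mut W,
        num_rows: u64,
        cols: &[ColumnBuffers],
    ) -> io::Result<()> {
        out.write_all(&self.schema_bytes)?; // reused as-is, not re-encoded per batch
        out.write_all(&num_rows.to_le_bytes())?;
        for c in cols {
            write_buffer(out, &c.data)?;
            write_optional(out, c.offsets.as_deref())?;
            write_optional(out, c.nulls.as_deref())?;
        }
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let writer = BlockWriter::new(&["c0"]);
    // Utf8 column holding "ab" and "c": data bytes plus i32 offsets [0, 2, 3].
    let col = ColumnBuffers {
        data: b"abc".to_vec(),
        offsets: Some([0i32, 2, 3].iter().flat_map(|o| o.to_le_bytes()).collect()),
        nulls: None,
    };
    let mut block = Vec::new();
    writer.write_block(&mut block, 2, &[col])?;
    println!("block is {} bytes", block.len());
    Ok(())
}
```

The per-block cost in this sketch is a handful of `write_all` calls over existing byte slices, which is the property the list above describes.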

Microbenchmarks (encoding only, no compression)

Without compression, we see an almost 3x speedup in writes.

shuffle_writer/shuffle_writer: write encoded (enable_fast_encoding=true, compression=None)
                        time:   [6.2751 µs 6.2906 µs 6.2906 µs]
shuffle_writer/shuffle_writer: write encoded (enable_fast_encoding=false, compression=None)
                        time:   [18.591 µs 18.599 µs 18.599 µs]

Note that the time saved is tiny compared to compression costs, but it still helps. With this PR I am seeing a TPC-H time of 329s compared to 336s in #1192, which this PR builds on.

Spark takes 644s, so with this PR, we are 1.96x faster than Spark. We need to shave off another 7 seconds now to get to 2x (we may get this with the new ParquetExec work).
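For anyone wanting to check the arithmetic behind these figures, here is a small sketch using only the numbers quoted above. The helper names `speedup` and `seconds_to_target` are made up for this example.

```rust
/// Ratio of the baseline time to the improved time.
fn speedup(baseline_secs: f64, improved_secs: f64) -> f64 {
    baseline_secs / improved_secs
}

/// Seconds that still need to be removed from `current_secs`
/// to reach `target` times the baseline's speed.
fn seconds_to_target(baseline_secs: f64, current_secs: f64, target: f64) -> f64 {
    current_secs - baseline_secs / target
}

fn main() {
    let spark = 644.0; // TPC-H total for Spark, in seconds
    let comet = 329.0; // TPC-H total with this PR, in seconds
    println!("speedup over Spark: {:.2}x", speedup(spark, comet));
    println!("seconds left to reach 2x: {:.0}", seconds_to_target(spark, comet, 2.0));
    // Microbenchmark point estimates in microseconds (Arrow IPC vs. fast encoding).
    println!("microbenchmark write speedup: {:.2}x", speedup(18.599, 6.2906));
}
```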

Benchmark Results

Single node TPC-H.

Single node TPC-DS with optimized version of q72 (better join order).

TPC-H q3

Encoding + compression is now much closer to Gluten + Velox for the lineitem exchange (8.6s versus 6.2s).

Comet:

Gluten + Velox:

What changes are included in this PR?

How are these changes tested?

parthchandra · 2024-12-20T21:46:58Z

native/core/src/execution/shuffle/batch_serde.rs

+use std::io::Write;
+use std::sync::Arc;
+
+pub fn write_batch_fast(


Are you going to end up implementing a form of arrow (stream) IPC?

I discovered that we may be able to just use https://docs.rs/arrow-ipc/latest/arrow_ipc/writer/struct.IpcDataGenerator.html#method.encoded_batch and am going to look into that next

> Are you going to end up implementing a form of arrow (stream) IPC?

Yes, but without using flatbuffers to align and encode anything, just the raw bytes, and without the metadata messages.
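To illustrate what skipping alignment can save, the sketch below compares the total size of a few buffers written back to back with the size when each buffer is padded to a fixed boundary. The 8-byte boundary and the buffer sizes are assumptions chosen for illustration; the alignment actually used by the Arrow IPC path is not stated in this thread, and `padded_len` and `total_bytes` are hypothetical helpers.

```rust
/// Rounds `len` up to the next multiple of `alignment` (which must be non-zero).
fn padded_len(len: usize, alignment: usize) -> usize {
    (len + alignment - 1) / alignment * alignment
}

/// Total bytes written when every buffer is padded to `alignment`.
/// An alignment of 1 means the buffers are written back to back.
fn total_bytes(buffer_lens: &[usize], alignment: usize) -> usize {
    buffer_lens.iter().map(|&len| padded_len(len, alignment)).sum()
}

fn main() {
    // Hypothetical sizes: a validity bitmap, an offsets buffer, and string data.
    let lens = [1, 12, 3];
    let raw = total_bytes(&lens, 1);
    let aligned = total_bytes(&lens, 8);
    println!("raw: {raw} bytes, padded to 8: {aligned} bytes");
}
```

The relative overhead of padding shrinks as buffers grow, so any saving of this kind matters most for small batches.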

comphead · 2024-12-20T23:02:01Z

native/core/benches/batch_serde.rs

+
+fn create_batch() -> RecordBatch {
+    let schema = Arc::new(Schema::new(vec![
+        Field::new("c0", DataType::Utf8, true),


Thanks @andygrove. It would be interesting to see whether other data types keep the same performance benefit.

codecov-commenter · 2024-12-30T14:13:09Z

Codecov Report

Attention: Patch coverage is 79.50820% with 25 lines in your changes missing coverage. Please review.

Project coverage is 34.83%. Comparing base (2e0f00a) to head (ab95a9b).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...execution/shuffle/NativeBatchDecoderIterator.scala | 71.95% | 12 Missing and 11 partials ⚠️ |
| ...t/execution/shuffle/CometShuffleExchangeExec.scala | 80.00% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@              Coverage Diff              @@
##               main    #1190       +/-   ##
=============================================
- Coverage     56.94%   34.83%   -22.11%     
- Complexity      929      990       +61     
=============================================
  Files           112      116        +4     
  Lines         10985    43844    +32859     
  Branches       2119     9564     +7445     
=============================================
+ Hits           6255    15274     +9019     
- Misses         3617    25599    +21982     
- Partials       1113     2971     +1858


andygrove force-pushed the experimental-fast-batch-serde branch from 49d0c27 to f7d8cce Compare December 20, 2024 17:07

kazuyukitanimura reviewed Dec 20, 2024

native/core/src/execution/shuffle/batch_serde.rs (outdated)

parthchandra reviewed Dec 20, 2024

comphead reviewed Dec 20, 2024

andygrove changed the title ~~feat: Implement fast serde for single record batches~~ [do not review] feat: Implement fast serde for single record batches Dec 21, 2024

andygrove added 25 commits December 22, 2024 11:43

Implement native decoding and decompression

8ce9bb5

revert some variable renaming for smaller diff

a9a0593

fix oom issues?

11320a5

upmerge

e2f28f9

make NativeBatchDecoderIterator more consistent with ArrowReaderIterator

c97eb58

fix oom and prep for review

4ffe47d

format

68d2331

Add LZ4 support

a3fb105

clippy, new benchmark

b593e80

rename metrics, clean up lz4 code

4078551

update test

f286309

save

e5a29ef

Add support for snappy

fbc2124

format

bed543a

change default back to lz4

e13d72f

make metrics more accurate

f1ed927

format

45b020e

save

02f0d3f

clippy

587feee

roundtrip passes

ed4b6db

roundtrip passes with nulls

c50c693

Save

41affcc

Save

0cfdab2

Save

86ace48

Save

cc83b97

andygrove added 8 commits December 29, 2024 10:57

clippy

290b57a

fix benches

d05567a

format

1808231

precompute batch header

f0793f0

fix benchmark

21628eb

fix: support empty batch with row count

3dcbec6

fix bug with null buffer length

db055e1

support more data types (Int8, Int16, Float32)

74b6f0e

andygrove changed the title ~~feat: Implement fast serde for single record batches~~ feat: Implement custom RecordBatch serde for shuffle for improved performance Dec 30, 2024

andygrove added 2 commits December 29, 2024 20:29

fix regression

e16d24a

add timestamp support

34dd489

andygrove added 14 commits December 30, 2024 08:42

timestamp tests + fix

874a02f

add config to enable fast encoding

430ac67

update bench

7c4e1a8

Make compression codec configurable for columnar shuffle

0b2f0e9

clippy

ad1adc1

fix bench

3e15b12

fmt

1c08a4b

upmerge

573748a

address feedback

7dd2ff6

address feedback

1fc3d49

address feedback

5a3fb2e

minor code simplification

2b88bbd

upmerge

b4b6aff

cargo fmt

78340a1

andygrove mentioned this pull request Jan 3, 2025

feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support #1192

Open

andygrove added 3 commits January 3, 2025 16:44

overflow check

69b54d9

rename compression level config

f41180d

upmerge

ab95a9b
