Skip to content

Conversation


@Ext3h Ext3h commented Nov 20, 2025

Rationale for this change

Avoid costly reallocations of ZSTD context when reusing ZSTDCodec instances.

What changes are included in this PR?

Replace calls to ZSTD_compress / ZSTD_decompress, which allocate the ZSTD context internally on every call, with the corresponding APIs that take an explicitly managed context.
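
A minimal sketch of the pattern, using the standard ZSTD explicit-context APIs (the helper names and wrappers here are illustrative, not the actual Arrow code):

#include <zstd.h>
#include <memory>

// Deleters so the contexts can be held in smart pointers and reused.
struct CCtxDeleter { void operator()(ZSTD_CCtx* c) const { ZSTD_freeCCtx(c); } };
struct DCtxDeleter { void operator()(ZSTD_DCtx* d) const { ZSTD_freeDCtx(d); } };

// Same one-shot semantics as ZSTD_compress() / ZSTD_decompress(), but the
// context, including its internal workspace, survives across calls.
size_t CompressReusing(ZSTD_CCtx* cctx, void* dst, size_t dst_cap,
                       const void* src, size_t src_len, int level) {
  return ZSTD_compressCCtx(cctx, dst, dst_cap, src, src_len, level);
}

size_t DecompressReusing(ZSTD_DCtx* dctx, void* dst, size_t dst_cap,
                         const void* src, size_t src_len) {
  return ZSTD_decompressDCtx(dctx, dst, dst_cap, src, src_len);
}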

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions

⚠️ GitHub issue #48187 has been automatically assigned in GitHub to PR creator.

@Ext3h Ext3h force-pushed the zstd_context_reuse branch from 729ee5e to 859bf82 on November 20, 2025 15:25
@felipecrv felipecrv (Contributor) left a comment


LGTM

@thisisnic thisisnic changed the title from "GH-48187: Cache ZSTD compression/decompression context" to "GH-48187: [C++] Cache ZSTD compression/decompression context" Nov 20, 2025
@github-actions github-actions bot added the awaiting merge label and removed the awaiting review label Nov 20, 2025
@Ext3h Ext3h marked this pull request as draft November 20, 2025 16:42
@Ext3h Ext3h marked this pull request as ready for review November 25, 2025 07:32
@Ext3h Ext3h force-pushed the zstd_context_reuse branch from cea69b6 to f9f5e60 on November 25, 2025 07:47

size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
input, static_cast<size_t>(input_len));
// Decompression context for ZSTD contains several large heap allocations.
Member:

How large is "large" here? This is proposing to keep those per-thread heap allocations alive until the threads themselves are joined (which typically happens at process exit for a thread pool).

Author:

IIRC 5-10 MB in total. Enough to hurt performance with small blocks (e.g. Parquet with 8 kB row groups), both due to memory management overhead and cache thrashing, but not enough to hurt in terms of total memory footprint.

Would have liked to tie those allocations to the Arrow default memory pool for proper tracing, but that feature is exclusive to the static linkage of ZSTD.
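
For reference, a sketch of that static-linkage-only hook. The allocator shims here just forward to malloc/free as placeholders; a real integration would forward to arrow::MemoryPool:

#define ZSTD_STATIC_LINKING_ONLY  // ZSTD_customMem is gated behind this
#include <zstd.h>
#include <cstdlib>

// Placeholder shims matching ZSTD_allocFunction / ZSTD_freeFunction.
static void* PoolAlloc(void* opaque, size_t size) { return std::malloc(size); }
static void PoolFree(void* opaque, void* address) { std::free(address); }

ZSTD_DCtx* CreateTracedDCtx() {
  ZSTD_customMem mem{PoolAlloc, PoolFree, /*opaque=*/nullptr};
  return ZSTD_createDCtx_advanced(mem);  // only available with static linkage
}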

I deliberately avoided managing a pool per instance, assuming that there may be many more instances of this class than threads in the thread pools.

Member:

So that means that reuse of the contexts should be governed at a higher level, for example the Parquet reader. Perhaps do as the Rust implementation did and expose some kind of "decompression context" API?

@Ext3h Ext3h (Author) commented Nov 25, 2025

Unsure about that - the problematic free-threaded case there is the use of thread pools within Feather/IPC. They'd need a thread_local-like pattern in any case, which means that instead of one central thread_local there would simply be one at 3+ code locations.

Exposing the context for use with Parquet would require exposing it all the way out to parquet::WriterProperties::Builder - and then you'd possibly still end up with multiple writer instances wrongly sharing a context, rendering threading of those writers suddenly impossible. If anything, you'd need to export a threading-aware "context pool" rather than a context, but that would amount to reinventing thread_local, except worse in terms of cache locality and with undesirable extra synchronization primitives.

The Rust implementation did not encounter those issues, as no sharing of the context is permitted in the first place due to language constraints - and correspondingly there is no aggressive threading using (potentially) shared state.

Ultimately, having exactly one cached context per thread for the single-shot compression/decompression API is the usage pattern recommended by the ZSTD maintainers, and it aligns best with the available API:

/*= Decompression context
 *  When decompressing many times,
 *  it is recommended to allocate a context only once,
 *  and reuse it for each successive compression operation.
 *  This will make workload friendlier for system's memory.
 *  Use one context per thread for parallel execution. */
typedef struct ZSTD_DCtx_s ZSTD_DCtx;
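
A minimal sketch of that recommendation, assuming a hypothetical helper (not the exact code in this PR): one lazily created context per thread, freed only when the thread exits.

#include <zstd.h>
#include <memory>

struct DCtxDeleter { void operator()(ZSTD_DCtx* d) const { ZSTD_freeDCtx(d); } };

// One decompression context per thread, created on first use and kept
// alive until the thread is joined, as recommended above.
static ZSTD_DCtx* ThreadLocalDCtx() {
  static thread_local std::unique_ptr<ZSTD_DCtx, DCtxDeleter> ctx{ZSTD_createDCtx()};
  return ctx.get();
}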

Author:

... after checking the Rust implementation, it really should have used a thread_local!-scoped context as well. That went badly: it's now creating that ZSTD context even if LZ4 is selected, it's creating one distinct context per usage location, and it's still creating a new context for a lot of potentially short-lived objects. It also missed that there is a need not just for the CompressionContext but also the DecompressionContext, specifically in the IPC library, which uses compression in both directions...
