Skip to content

Conversation


@Ext3h Ext3h commented Nov 20, 2025

Rationale for this change

Avoid costly reallocations of ZSTD context when reusing ZSTDCodec instances.

What changes are included in this PR?

Replace calls to ZSTD_compress / ZSTD_decompress, which allocate the ZSTD context internally on every call, with the corresponding APIs that take an explicitly managed context.
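
A minimal sketch of the pattern, using the standard ZSTD explicit-context APIs (the helper names and wrappers here are illustrative, not the actual Arrow code):

#include <zstd.h>
#include <memory>

// Deleters so the contexts can be held in smart pointers and reused.
struct CCtxDeleter { void operator()(ZSTD_CCtx* c) const { ZSTD_freeCCtx(c); } };
struct DCtxDeleter { void operator()(ZSTD_DCtx* d) const { ZSTD_freeDCtx(d); } };

// Same one-shot semantics as ZSTD_compress() / ZSTD_decompress(), but the
// context, including its internal workspace, survives across calls.
size_t CompressReusing(ZSTD_CCtx* cctx, void* dst, size_t dst_cap,
                       const void* src, size_t src_len, int level) {
  return ZSTD_compressCCtx(cctx, dst, dst_cap, src, src_len, level);
}

size_t DecompressReusing(ZSTD_DCtx* dctx, void* dst, size_t dst_cap,
                         const void* src, size_t src_len) {
  return ZSTD_decompressDCtx(dctx, dst, dst_cap, src, src_len);
}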

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions

⚠️ GitHub issue #48187 has been automatically assigned in GitHub to PR creator.

@Ext3h Ext3h force-pushed the zstd_context_reuse branch from 729ee5e to 859bf82 on November 20, 2025 15:25
@felipecrv felipecrv (Contributor) left a comment


LGTM

@thisisnic thisisnic changed the title from "GH-48187: Cache ZSTD compression/decompression context" to "GH-48187: [C++] Cache ZSTD compression/decompression context" Nov 20, 2025
@github-actions github-actions bot added the awaiting merge label and removed the awaiting review label Nov 20, 2025
@Ext3h Ext3h marked this pull request as draft November 20, 2025 16:42
@Ext3h Ext3h marked this pull request as ready for review November 25, 2025 07:32
@Ext3h Ext3h force-pushed the zstd_context_reuse branch from cea69b6 to f9f5e60 on November 25, 2025 07:47

size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
input, static_cast<size_t>(input_len));
// Decompression context for ZSTD contains several large heap allocations.
Member:

How large is "large" here? This is proposing to keep those per-thread heap allocations alive until the threads themselves are joined (which typically happens at process exit for a thread pool).

Author:

IIRC 5-10 MB in total. Enough to hurt performance with small blocks (e.g. Parquet with 8 kB row groups), both due to memory management overhead and cache thrashing, but not enough to hurt in terms of total memory footprint.

Would have liked to tie those allocations to the Arrow default memory pool for proper tracing, but that feature is exclusive to the static linkage of ZSTD.
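
For reference, a sketch of that static-linkage-only hook. The allocator shims here just forward to malloc/free as placeholders; a real integration would forward to arrow::MemoryPool:

#define ZSTD_STATIC_LINKING_ONLY  // ZSTD_customMem is gated behind this
#include <zstd.h>
#include <cstdlib>

// Placeholder shims matching ZSTD_allocFunction / ZSTD_freeFunction.
static void* PoolAlloc(void* opaque, size_t size) { return std::malloc(size); }
static void PoolFree(void* opaque, void* address) { std::free(address); }

ZSTD_DCtx* CreateTracedDCtx() {
  ZSTD_customMem mem{PoolAlloc, PoolFree, /*opaque=*/nullptr};
  return ZSTD_createDCtx_advanced(mem);  // only available with static linkage
}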

I deliberately avoided managing a pool per instance, assuming that there may be many more instances of this class than threads in the thread pools.

Member:

So that means that reuse of the contexts should be governed at a higher level, for example the Parquet reader. Perhaps do as the Rust implementation did and expose some kind of "decompression context" API?

@Ext3h Ext3h (Author) commented Nov 25, 2025

Unsure about that - the problematic free-threaded case there is the use of thread pools within Feather/IPC. They'd need a thread_local-like pattern in any case, which means that instead of one central thread_local there would simply be one at 3+ code locations.

Exposing the context for use with Parquet would require exposing it all the way out to parquet::WriterProperties::Builder - and then you'd possibly still end up with multiple writer instances wrongly sharing a context, rendering threading of those writers suddenly impossible. If anything, you'd need to export a threading-aware "context pool" rather than a context, but that would amount to reinventing thread_local, except worse in terms of cache locality and with undesirable extra synchronization primitives.

The Rust implementation did not encounter those issues, as no sharing of the context is permitted in the first place due to language constraints - and correspondingly there is no aggressive threading using (potentially) shared state.

Ultimately, having exactly one cached context per thread for the single-shot compression/decompression API is the usage pattern recommended by the ZSTD maintainers, and it aligns best with the available API:

/*= Decompression context
 *  When decompressing many times,
 *  it is recommended to allocate a context only once,
 *  and reuse it for each successive compression operation.
 *  This will make workload friendlier for system's memory.
 *  Use one context per thread for parallel execution. */
typedef struct ZSTD_DCtx_s ZSTD_DCtx;
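
A minimal sketch of that recommendation, assuming a hypothetical helper (not the exact code in this PR): one lazily created context per thread, freed only when the thread exits.

#include <zstd.h>
#include <memory>

struct DCtxDeleter { void operator()(ZSTD_DCtx* d) const { ZSTD_freeDCtx(d); } };

// One decompression context per thread, created on first use and kept
// alive until the thread is joined, as recommended above.
static ZSTD_DCtx* ThreadLocalDCtx() {
  static thread_local std::unique_ptr<ZSTD_DCtx, DCtxDeleter> ctx{ZSTD_createDCtx()};
  return ctx.get();
}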

Author:

... after checking the Rust implementation, it really should have used a thread_local!-scoped context as well. That went badly: it's now creating that ZSTD context even if LZ4 is selected, it's creating one distinct context per usage location, and it's still creating a new context for a lot of potentially short-lived objects. It also missed that there is a need not just for the CompressionContext but also the DecompressionContext, specifically in the IPC library, which uses compression in both directions...
