GH-48187: [C++] Cache ZSTD compression/decompression context #48192
base: main
Conversation
729ee5e to 859bf82
felipecrv left a comment
LGTM
cea69b6 to f9f5e60
```cpp
size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
                             input, static_cast<size_t>(input_len));
// Decompression context for ZSTD contains several large heap allocations.
```
How large is "large" here? This is proposing to keep those per-thread heap allocations alive until the threads themselves are joined (which typically happens at process exit for a thread pool).
IIRC 5-10 MB in total. Enough to hurt performance with small blocks (e.g. Parquet with 8 kB row groups), both due to memory management overhead and cache thrashing, but not enough to hurt in terms of total memory footprint.
I would have liked to tie those allocations to the Arrow default memory pool for proper tracing, but that feature is only available with static linkage of ZSTD.
I did deliberately avoid managing a pool per instance, assuming that there may be many instances of this class, more than threads in the thread pools.
So that means reuse of the contexts should be governed at a higher level, for example the Parquet reader. Perhaps do what the Rust implementation did and expose some kind of "decompression context" API?
Unsure about that - the problematic free-threaded case here is the use of thread pools within Feather/IPC. They'd need thread_local-like patterns in any case, which means that instead of one central thread_local there would simply be one at 3+ code locations.
Exposing the context for use with Parquet would require exposing it all the way out to parquet::WriterProperties::Builder - and then you'd possibly still end up with multiple writer instances wrongly sharing a context, suddenly making threading of those writers impossible. If anything, you'd need to export a threading-aware "context pool" rather than a context, but that would amount to reinventing thread_local, except worse in terms of cache locality and with undesirable extra synchronization primitives.
The Rust implementation did not encounter those issues because sharing of the context is not permitted there in the first place, being constrained by language features. Correspondingly, there is also no aggressive threading over (potentially) shared state.
Ultimately, having exactly one cached context per thread for the single shot compression/decompression API is the recommended usage pattern from the ZSTD maintainers, and aligns best with the available API:
```cpp
/*= Decompression context
 *  When decompressing many times,
 *  it is recommended to allocate a context only once,
 *  and reuse it for each successive compression operation.
 *  This will make workload friendlier for system's memory.
 *  Use one context per thread for parallel execution. */
typedef struct ZSTD_DCtx_s ZSTD_DCtx;
```
... after checking the Rust implementation, it really should have used a thread_local!-scoped context as well. As it stands, it creates that ZSTD context even if LZ4 is selected, it creates one distinct context per usage location, and it still creates a new context for a lot of potentially short-lived objects. It also missed that the IPC library, which uses compression in both directions, needs not just the CompressionContext but also the DecompressionContext...
Rationale for this change

Avoid costly reallocations of the ZSTD context when reusing ZSTDCodec instances.

What changes are included in this PR?

Replace calls to ZSTD_compress/ZSTD_decompress, which allocate the ZSTD context internally, with the corresponding APIs that take an explicitly managed context.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.