GH-48187: [C++] Cache ZSTD compression/decompression context #48192
Open
Ext3h wants to merge 3 commits into apache:main from Ext3h:zstd_context_reuse
How large is "large" here? This is proposing to keep those per-thread heap allocations alive until the threads themselves are joined (which typically happens at process exit for a thread pool).
IIRC 5-10 MB in total. Enough to hurt performance with small blocks (e.g. Parquet with 8 kB row groups), both due to memory management and cache thrashing, but not enough to hurt in terms of total memory footprint.

I would have liked to tie those allocations to the Arrow default memory pool for proper tracing, but that feature is exclusive to static linkage of ZSTD.

I deliberately avoided managing a pool per instance, assuming that there may be many instances of this class - more instances than threads in the thread pools.
To me that means that reuse of the contexts should be governed at a higher level, for example the Parquet reader. Perhaps do as the Rust implementation did and expose some kind of "decompression context" API?
Unsure about that - the problematic free-threaded case there is the use of thread pools within feather/ipc. They'd need a `thread_local`-like pattern in any case, which means that instead of one central `thread_local` there would simply be one at 3+ code locations.

Exposing the context for use with Parquet would require exposing it all the way out to `parquet::WriterProperties::Builder` - and then you'd possibly still end up with multiple writer instances wrongly sharing a context, rendering threading of those writers suddenly impossible. If anything, you'd need to export a threading-aware "context pool" rather than a context, but that would be equal to reinventing `thread_local`, except worse in terms of cache locality and with undesirable extra synchronization primitives.

The Rust implementation did not encounter those issues, as no sharing of the context is permitted there in the first place, being constrained by language features - and correspondingly there is also no aggressive threading using a (potentially) shared state.

Ultimately, having exactly one cached context per thread for the single-shot compression/decompression API is the recommended usage pattern from the ZSTD maintainers, and it aligns best with the available API.
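A minimal sketch of that per-thread pattern, assuming the plain dynamic-linkage ZSTD API (the wrapper function names are illustrative, not the PR's actual diff):

```cpp
#include <zstd.h>

#include <cstddef>
#include <memory>

// One cached context per thread for ZSTD's single-shot API. Each thread
// allocates its context lazily on first use and keeps it until thread exit,
// so repeated small (de)compressions skip the internal workspace allocation
// that a fresh ZSTD_compress()/ZSTD_decompress() call pays every time.
namespace {
ZSTD_CCtx* ThreadLocalCCtx() {
  thread_local std::unique_ptr<ZSTD_CCtx, std::size_t (*)(ZSTD_CCtx*)> cctx{
      ZSTD_createCCtx(), &ZSTD_freeCCtx};
  return cctx.get();
}

ZSTD_DCtx* ThreadLocalDCtx() {
  thread_local std::unique_ptr<ZSTD_DCtx, std::size_t (*)(ZSTD_DCtx*)> dctx{
      ZSTD_createDCtx(), &ZSTD_freeDCtx};
  return dctx.get();
}
}  // namespace

// Single-shot helpers reusing the calling thread's cached contexts.
std::size_t CompressWithCachedCtx(void* dst, std::size_t dst_capacity,
                                  const void* src, std::size_t src_size,
                                  int level) {
  return ZSTD_compressCCtx(ThreadLocalCCtx(), dst, dst_capacity, src, src_size,
                           level);
}

std::size_t DecompressWithCachedCtx(void* dst, std::size_t dst_capacity,
                                    const void* src, std::size_t src_size) {
  return ZSTD_decompressDCtx(ThreadLocalDCtx(), dst, dst_capacity, src,
                             src_size);
}
```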
... after checking the Rust implementation, it really should have used a `thread_local!`-scoped context as well. That went badly: it's now creating the ZSTD context even if LZ4 is selected, it's creating one distinct context per usage location, and it's still creating a new context for a lot of potentially short-lived objects. It also missed that there is a need not just for the `CompressionContext` but also for the `DecompressionContext`, specifically when talking about the IPC library, which uses compression in both directions...
You mean like:
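(The snippet originally attached here failed to load; the following is a hedged reconstruction based only on the identifiers discussed below - `ContextFactory`, `MakeContext`, and the shape of the `instances_` map are assumed names.)

```cpp
#include <map>
#include <memory>

// Sketch of a non-static ThreadLocal<...> helper: per-thread storage is a
// std::map keyed by the owning factory, so Codec instances that share a
// factory also share one context per thread. std::owner_less<> (C++17)
// provides transparent owner-based ordering; std::unordered_map would
// instead need std::weak_ptr<T>::owner_hash, which only arrives in C++26.
template <typename ContextFactory, typename Context>
class ThreadLocal {
 public:
  explicit ThreadLocal(std::shared_ptr<ContextFactory> factory)
      : factory_(std::move(factory)) {}

  // Returns this thread's context for the owning factory, creating it on
  // first use. Entries whose factory has expired linger until scrubbed.
  std::shared_ptr<Context> Get() {
    thread_local std::map<std::weak_ptr<ContextFactory>,
                          std::shared_ptr<Context>, std::owner_less<>>
        instances_;
    auto it = instances_.find(factory_);
    if (it == instances_.end()) {
      // MakeContext() is an assumed factory method.
      it = instances_.emplace(factory_, factory_->MakeContext()).first;
    }
    return it->second;
  }

 private:
  std::shared_ptr<ContextFactory> factory_;
};
```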
At minimum, this would require a (reasonable...) backport of `std::weak_ptr<T>::owner_equal` to work with older C++ standards in the line `instances_.find(factory)`. We cannot use `std::unordered_map` with `std::weak_ptr`, because `std::weak_ptr<T>::owner_hash` - unlike `owner_equal` - is not possible to backport, as it requires private interfaces of `std::weak_ptr`.

Apart from that, it should behave acceptably. `std::map` in uncontested cache lines / TLS is surprisingly fast, and any creation and destruction of instances is kept strictly out of contested code locations.
Can't really say I like the idea, though: bad use of that interface (i.e. not explicitly sharing the factory as the user of the outermost interface!) would still inevitably result in carrying multiple instances of the thread-local scope.

It makes the interface harder to use, and potentially even backfires in terms of peak memory consumption. Even worse, that map in the TLS accumulates a memory leak of expired weak_ptrs - which, due to the extensive use of `std::make_shared`, usually keep fused allocations alive.

I would rather take the risk of leaking a few MB of memory per thread, assuming that threads are usually a resource developers are good at tracking and tearing down when no longer intended for re-use.
Two non-exclusive answers: 1) documentation, 2) make caching optional in the `Codec` constructor.

Right, but the underlying compression context will be destroyed anyway, which is what matters.

(And a mitigation for this is to scrub expired weak_ptrs depending on heuristics; see the sketch below.)

Arrow uses its internal thread pool extensively, so that doesn't apply here.
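A minimal sketch of such a scrub pass over the thread-local map from the earlier sketch; the every-N-lookups heuristic and the `kScrubInterval` value are illustrative assumptions:

```cpp
#include <cstddef>

// Every kScrubInterval lookups, drop entries whose factory has expired, so
// the thread-local map cannot grow without bound (and so the expired keys
// stop pinning their fused make_shared allocations).
template <typename Map>
void MaybeScrubExpired(Map& instances, std::size_t& lookup_count) {
  constexpr std::size_t kScrubInterval = 64;  // hypothetical tuning knob
  if (++lookup_count % kScrubInterval != 0) return;
  for (auto it = instances.begin(); it != instances.end();) {
    if (it->first.expired()) {
      it = instances.erase(it);
    } else {
      ++it;
    }
  }
}
```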
By the way, a simpler alternative to this non-static `ThreadLocal` class is to manage a bounded-size freelist of contexts at the `Codec` level (see the sketch below). This would probably cut down on implementation complexity, though it would also limit context reuse if the number of decompressing threads goes above the limit.
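A hedged sketch of that alternative - the mutex-protected vector and the `kMaxCached` bound are assumptions, not code from this PR:

```cpp
#include <zstd.h>

#include <mutex>
#include <vector>

// A bounded freelist of decompression contexts owned by the Codec. Threads
// pop a context, use it, and return it; at most kMaxCached contexts are
// retained, so excess contexts are freed instead of cached.
class ZstdDCtxFreelist {
 public:
  static constexpr size_t kMaxCached = 8;  // hypothetical bound

  ~ZstdDCtxFreelist() {
    for (ZSTD_DCtx* ctx : free_) ZSTD_freeDCtx(ctx);
  }

  ZSTD_DCtx* Acquire() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!free_.empty()) {
        ZSTD_DCtx* ctx = free_.back();
        free_.pop_back();
        return ctx;
      }
    }
    return ZSTD_createDCtx();  // freelist empty: allocate a fresh context
  }

  void Release(ZSTD_DCtx* ctx) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (free_.size() < kMaxCached) {
        free_.push_back(ctx);
        return;
      }
    }
    ZSTD_freeDCtx(ctx);  // over the bound: don't cache, just free
  }

 private:
  std::mutex mutex_;
  std::vector<ZSTD_DCtx*> free_;
};
```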
One global thread pool, and the size of that pool is bounded by the number of CPU cores. From my perspective, this means there is already a sane limit on the number of threads.

Meanwhile, for any application I can think of, it should be more or less safe to assume that any thread that required a `thread_local`-scoped context once will either need it again during the lifetime of the process, or the thread will have a limited lifetime itself. It's not like someone is going to spawn thousands of threads for single-shot compression, and then somehow additionally keep all of them around indefinitely without ever using them for the same task again. (While also not having enough RAM to pay for that allocated context, yet still having enough RAM to pay for the remaining overhead of an OS thread...)

A freelist would imply a global synchronization primitive, and it also completely messes up cache locality in case the cached context jumps cores, due to the non-local properties of any naively implemented freelist. You'd at least have to implement an instance-stealing pattern in order to preserve locality on the good path. Also, "bounded-size" - assuming you wanted to block until an instance became free - means there's now a source of priority inversion, and the only way to avoid it is to "coincidentally" set the upper bound exactly to the number of threads.

The only potential benefit would be in the case of potentially short-lived threads - but the overhead of creating a fresh thread dominates the cost of that ZSTD context allocation bound to the TLS by orders of magnitude, so it's certainly not worth the effort to optimize for.
By all means, `thread_local` is appropriate here for holding the context.