ggml: Implement yield barrier using futex for improved thread scheduling efficiency #13079
+235
−2
Description:
This PR replaces the original spin-based barrier in GGML with a futex-based yield barrier to improve thread scheduling efficiency and overall system performance.
Currently, the feature can be controlled using the CMake parameter GGML_YIELD_BARRIER, allowing users to enable or disable the yield barrier as needed.
Key Benefits:
Improved Scalability
The futex-based barrier allows threads to yield instead of busy-waiting. This reduces CPU waste and improves scalability when the number of threads exceeds the number of physical cores, or when other workloads are competing for CPU time.
Better Performance on Hybrid Architectures
On systems with heterogeneous cores (e.g., big.LITTLE or Intel Hybrid Architecture), yielding helps critical threads get scheduled on performance cores, potentially improving throughput (e.g., PP performance in multi-threaded inference).
Power Efficiency and Thermal Stability
By avoiding unnecessary spinning, this change can reduce power consumption and help maintain higher sustained performance, especially on thermally constrained devices. It may also mitigate CPU throttling under load.
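To make the yield-instead-of-spin idea concrete, here is a minimal sketch of a futex-based barrier. It is not the code from this PR: it assumes Linux (the raw futex syscall), and the names yield_barrier, yield_barrier_wait, and run_barrier_demo are hypothetical, chosen only for illustration. Other platforms would need a different wait primitive.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical barrier type for illustration; not the PR's actual struct. */
typedef struct {
    _Atomic uint32_t arrived;   /* threads that reached the barrier this round */
    _Atomic uint32_t phase;     /* round counter; also used as the futex word */
    uint32_t         n_threads;
} yield_barrier;

/* Thin wrapper around the Linux futex syscall (no timeout, no second address). */
static long futex_call(_Atomic uint32_t *addr, int op, uint32_t val) {
    return syscall(SYS_futex, (void *)addr, op, val, NULL, NULL, 0);
}

static void yield_barrier_wait(yield_barrier *b) {
    uint32_t phase = atomic_load(&b->phase);
    if (atomic_fetch_add(&b->arrived, 1) + 1 == b->n_threads) {
        /* Last arrival: reset the count, open the next round, wake sleepers. */
        atomic_store(&b->arrived, 0);
        atomic_fetch_add(&b->phase, 1);
        futex_call(&b->phase, FUTEX_WAKE_PRIVATE, INT_MAX);
        return;
    }
    /* Sleep in the kernel instead of spinning. FUTEX_WAIT returns at once if
     * the phase already changed; the loop absorbs spurious wakeups. */
    while (atomic_load(&b->phase) == phase) {
        futex_call(&b->phase, FUTEX_WAIT_PRIVATE, phase);
    }
}

/* Demo harness (hypothetical): each thread bumps a counter, then waits. */
typedef struct {
    yield_barrier    barrier;
    _Atomic uint32_t counter;
    _Atomic int      violations;
    int              rounds;
} demo_state;

static void *demo_worker(void *p) {
    demo_state *s = p;
    for (int r = 0; r < s->rounds; r++) {
        atomic_fetch_add(&s->counter, 1);
        yield_barrier_wait(&s->barrier);
        /* After the barrier, every thread must have finished round r. */
        if (atomic_load(&s->counter) < (uint32_t)(r + 1) * s->barrier.n_threads) {
            atomic_store(&s->violations, 1);
        }
    }
    return NULL;
}

/* Returns the total number of increments, or -1 if the barrier leaked. */
static int run_barrier_demo(int n_threads, int rounds) {
    enum { MAX_THREADS = 64 };
    pthread_t tids[MAX_THREADS];
    demo_state s;
    if (n_threads < 1 || n_threads > MAX_THREADS) return -1;
    atomic_init(&s.barrier.arrived, 0);
    atomic_init(&s.barrier.phase, 0);
    s.barrier.n_threads = (uint32_t)n_threads;
    atomic_init(&s.counter, 0);
    atomic_init(&s.violations, 0);
    s.rounds = rounds;
    for (int i = 0; i < n_threads; i++) {
        int rc = pthread_create(&tids[i], NULL, demo_worker, &s);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < n_threads; i++) pthread_join(tids[i], NULL);
    return atomic_load(&s.violations) ? -1 : (int)atomic_load(&s.counter);
}
```

Calling run_barrier_demo(4, 1000) from a main function (built with -pthread) exercises the barrier for 1000 rounds on 4 threads. The design choice this sketch illustrates is that waiting threads block in the kernel rather than burning CPU time, which is what frees cores when threads outnumber cores; the cost is a syscall and a possible context switch per wait.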
Benchmark:
based on build: 42eb248 (5025)
Apple M1 (4P+4E) (disable Accelerate framework and Metal)
before:
after:
Apple M3 Pro (5P + 6E) (disable Accelerate framework and Metal)
before:
after:
Apple M4 (compile on M1 native)
before:
after:
before:
after:
Snapdragon 888 (X1 + A78x3 + A55x4)
before:
after:
before:
after:
Snapdragon 6Gen1 (A78x4 + A55x4)
before:
after:
before:
after:
Ryzen 9950X (light thermal throttling observed)
before:
after:
before:
after:
Ryzen 9950X (spin-based bottleneck: threads > cores)
before:
after:
Conclusion:
Across most tested devices, the pp512 workload consistently benefits from the futex-based yield barrier, showing noticeable throughput improvements. This is especially evident on high-core-count or hybrid-core systems, where reduced spinning improves scheduling fairness and efficiency.
However, for tg128, which is typically less compute-intensive and more sensitive to load imbalance, performance may degrade slightly in some cases. This is likely due to the lower thread saturation and increased context switching overhead introduced by yielding, which affects lighter workloads more noticeably.