
ggml: Implement yield barrier using futex for improved thread scheduling efficiency #13079


Open · wants to merge 1 commit into master

Conversation

SongXiaoXi (Contributor)

Description:
This PR replaces the original spin-based barrier in GGML with a futex-based yield barrier to improve thread scheduling efficiency and overall system performance.

Currently, the feature can be controlled via the CMake option GGML_YIELD_BARRIER (e.g. -DGGML_YIELD_BARRIER=ON), allowing users to enable or disable the yield barrier as needed.
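
To make the mechanism concrete, below is a minimal sketch of such a barrier on Linux. This is illustrative only, not the code from this PR: the helper names (futex_wait, futex_wake_all, yield_barrier) and the spin budget are made up for the example.

```c
// Illustrative only -- not the PR's implementation. A minimal
// generation-counting barrier that spins briefly, then sleeps in the
// kernel via futex(2) instead of busy-waiting (Linux-specific).
#include <limits.h>
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait(atomic_int *addr, int expected) {
    // Sleeps only if *addr still equals `expected`, so a wake that
    // happens just before the call cannot be lost.
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static void futex_wake_all(atomic_int *addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
}

typedef struct {
    atomic_int count;      // threads that have arrived in this round
    atomic_int generation; // bumped each time the barrier opens
    int        n_threads;
} yield_barrier;

void yield_barrier_wait(yield_barrier *b) {
    const int gen = atomic_load(&b->generation);
    if (atomic_fetch_add(&b->count, 1) == b->n_threads - 1) {
        // Last arrival: reset for the next round and release everyone.
        atomic_store(&b->count, 0);
        atomic_fetch_add(&b->generation, 1);
        futex_wake_all(&b->generation);
    } else {
        // A short spin keeps the fast path cheap when threads arrive
        // close together (the budget of 1000 is arbitrary here)...
        for (int i = 0; i < 1000; i++) {
            if (atomic_load(&b->generation) != gen) return;
        }
        // ...then yield the CPU to the scheduler until released.
        while (atomic_load(&b->generation) == gen) {
            futex_wait(&b->generation, gen);
        }
    }
}
```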

Key Benefits:

  1. Improved Scalability
    The futex-based barrier allows threads to yield instead of busy-waiting. This reduces CPU waste and improves scalability when the number of threads exceeds the number of physical cores, or when other workloads are competing for CPU time.

  2. Better Performance on Hybrid Architectures
    On systems with heterogeneous cores (e.g., big.LITTLE or Intel Hybrid Architecture), yielding helps critical threads get scheduled on performance cores, potentially improving throughput (e.g., PP performance in multi-threaded inference).

  3. Power Efficiency and Thermal Stability
    By avoiding unnecessary spinning, this change can reduce power consumption and help maintain higher sustained performance, especially on thermally constrained devices. It may also mitigate CPU throttling under load.
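
Note that futex(2) is Linux-specific, while the benchmarks below also cover Apple silicon. A cross-platform build would need an equivalent wait-on-address primitive per OS. The shim below is a hypothetical sketch of how that could look, not what this PR ships:

```c
// Hypothetical portability shim -- an assumption, not code from this PR.
// Wraps "wait on 32-bit address" / "wake all waiters" per platform.
#include <stdint.h>

#if defined(__linux__)
    #include <limits.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    static void addr_wait(uint32_t *addr, uint32_t expected) {
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
    }
    static void addr_wake_all(uint32_t *addr) {
        syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
    }
#elif defined(__APPLE__)
    // __ulock_wait/__ulock_wake are private Apple syscall wrappers
    // (used internally by libc++); there is no public futex equivalent
    // before the os_sync_wait_on_address API in recent SDKs.
    extern int __ulock_wait(uint32_t op, void *addr, uint64_t value, uint32_t timeout_us);
    extern int __ulock_wake(uint32_t op, void *addr, uint64_t wake_value);
    #define UL_COMPARE_AND_WAIT 1
    #define ULF_WAKE_ALL 0x00000100
    static void addr_wait(uint32_t *addr, uint32_t expected) {
        __ulock_wait(UL_COMPARE_AND_WAIT, addr, expected, 0);
    }
    static void addr_wake_all(uint32_t *addr) {
        __ulock_wake(UL_COMPARE_AND_WAIT | ULF_WAKE_ALL, addr, 0);
    }
#elif defined(_WIN32)
    #include <windows.h> // WaitOnAddress/WakeByAddressAll, Windows 8+
    #pragma comment(lib, "Synchronization.lib")
    static void addr_wait(uint32_t *addr, uint32_t expected) {
        WaitOnAddress(addr, &expected, sizeof(expected), INFINITE);
    }
    static void addr_wake_all(uint32_t *addr) {
        WakeByAddressAll(addr);
    }
#endif
```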

Benchmark:

Based on build: 42eb248 (5025)

Apple M1 (4P+4E), Accelerate framework and Metal disabled

before:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 488.30 ± 28.06 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 108.54 ± 19.58 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 824.37 ± 7.58 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 62.45 ± 0.14 |

Apple M3 Pro (5P+6E), Accelerate framework and Metal disabled

before:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 72.28 ± 0.39 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 11.89 ± 0.42 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 91.85 ± 1.59 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.84 ± 0.20 |

Apple M4 (binary compiled natively on M1)

before (4 threads):

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | pp512 | 15.33 ± 0.01 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | tg128 | 4.85 ± 0.00 |

after (4 threads):

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | pp512 | 15.32 ± 0.01 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | tg128 | 4.73 ± 0.00 |

before (10 threads):

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | pp512 | 27.93 ± 0.07 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | tg128 | 5.98 ± 0.08 |

after (10 threads):

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | pp512 | 28.38 ± 0.07 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | tg128 | 6.10 ± 0.00 |

Snapdragon 888 (X1 + A78x3 + A55x4)

before:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 210.31 ± 3.34 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 39.36 ± 0.35 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 300.16 ± 5.45 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 14.33 ± 0.08 |

before:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 80.65 ± 8.72 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 8.05 ± 0.05 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 95.31 ± 1.67 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 6.45 ± 0.05 |

Snapdragon 6Gen1 (A78x4 + A55x4)

before:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 196.30 ± 0.58 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 30.97 ± 0.17 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 261.19 ± 2.26 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 11.07 ± 0.11 |

before:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 79.43 ± 0.40 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 5.78 ± 0.04 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 79.56 ± 0.34 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 4.45 ± 0.01 |

Ryzen 9950X (light thermal throttling observed)

before (16 threads):

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | pp512 | 216.12 ± 0.17 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | tg128 | 4.15 ± 0.00 |

after (16 threads):

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | pp512 | 222.44 ± 2.12 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | tg128 | 4.15 ± 0.00 |

before (32 threads):

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | pp512 | 221.41 ± 2.07 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | tg128 | 3.94 ± 0.00 |

after (32 threads):

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | pp512 | 222.19 ± 4.64 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | tg128 | 3.76 ± 0.04 |

Ryzen 9950X (spin-based bottleneck: threads > cores)

before:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | pp512 | 59.36 ± 0.43 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | tg128 | 0.26 ± 0.00 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | pp512 | 2052.45 ± 4.99 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | tg128 | 47.28 ± 1.20 |

Conclusion:

Across most tested devices, the pp512 workload consistently benefits from the futex-based yield barrier, showing noticeable throughput improvements. This is especially evident on high-core-count or hybrid-core systems, where reduced spinning improves scheduling fairness and efficiency.

However, for tg128, which is typically less compute-intensive and more sensitive to load imbalance, performance may degrade slightly in some cases. This is likely due to lower thread saturation and the extra context-switching overhead introduced by yielding, which affects lighter workloads more noticeably.
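
If the tg128 regressions matter, one possible mitigation (purely a hypothetical sketch, not something this PR implements) is to expose the spin budget as a tuning knob, so latency-sensitive runs stay on the spin fast path and only oversubscribed runs fall back to the kernel sleep. Reusing the yield_barrier type and futex_wait helper from the sketch above:

```c
// Hypothetical tuning knob -- not part of this PR. A larger budget
// favors latency-sensitive runs such as tg128; a smaller one favors
// oversubscribed or hybrid-core systems.
static int n_spin_before_sleep = 2000; // illustrative value

static void yield_barrier_poll(yield_barrier *b, int gen) {
    for (int i = 0; i < n_spin_before_sleep; i++) {
        if (atomic_load(&b->generation) != gen) return; // released while spinning
    }
    while (atomic_load(&b->generation) == gen) {
        futex_wait(&b->generation, gen); // sleep until the barrier opens
    }
}
```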

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 23, 2025
SongXiaoXi (Contributor, Author)

Hi, I would like to ask for your opinion regarding the use of futex-based yield barriers versus traditional spin barriers.
While yielding improves scalability and efficiency on overloaded systems or hybrid architectures, it may introduce additional context-switching overhead for lighter workloads.

Would appreciate your thoughts on whether a yield-based approach is a good fit for GGML’s threading model on mobile devices or servers under heavy load.

Thank you for your consideration!
