@DajanaV DajanaV commented Nov 8, 2025

Mirrored from ggml-org/llama.cpp#17103

Adding -tgs to llama-batched-bench would make it decode the sequences separately, one by one:

```
# no -tgs
0123 0123 0123 ...

# -tgs
0 0 0 ... 1 1 1 ... 2 2 2 ... 3 3 3 ...
```
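
For illustration only, here is a minimal, self-contained C++ sketch (not the actual llama-batched-bench code) that prints the two submission orders; the sequence and token counts are made-up toy values:

```cpp
#include <cstdio>

// Illustration only (not the actual llama-batched-bench code): print the order
// in which tokens are submitted for decoding, with and without -tgs.
int main() {
    const int n_seq = 4; // parallel sequences
    const int n_tg  = 3; // generated tokens per sequence (shortened for clarity)

    // default: each decode step contains one token from every sequence
    printf("no -tgs: ");
    for (int t = 0; t < n_tg; ++t) {
        for (int s = 0; s < n_seq; ++s) {
            printf("%d", s);
        }
        printf(" ");
    }
    printf("...\n");

    // -tgs: decode one sequence to completion before starting the next
    printf("-tgs:    ");
    for (int s = 0; s < n_seq; ++s) {
        for (int t = 0; t < n_tg; ++t) {
            printf("%d ", s);
        }
        printf("... ");
    }
    printf("\n");
    return 0;
}
```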

Decoding the sequences one at a time is useful for benchmarking the unified KV cache, where it is important to detect and skip the masked regions of the KQ mask.
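
Below is a minimal, self-contained C++ sketch of the idea behind the optimization, not the Metal kernel itself: a first pass records which C-wide blocks of the KQ mask contain at least one non-masked entry (analogous to kernel_flash_attn_ext_blk), and the attention loop then skips blocks that are entirely -INF. The mask contents and sizes below are made-up toy values:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Illustration only: mask[j] is the KQ mask of one query row over n_kv key
// positions, with -INFINITY marking positions the query must not attend to.
int main() {
    const int C    = 32;    // block size along the KV dimension
    const int n_kv = 8*C;   // toy number of key/value positions

    std::vector<float> mask(n_kv, -INFINITY);
    // only the last two blocks are visible to this query (e.g. its own sequence)
    for (int j = 6*C; j < n_kv; ++j) mask[j] = 0.0f;

    // pass 1: one flag per C-wide block - does it contain any non-masked entry?
    const int nblk = (n_kv + C - 1)/C;
    std::vector<char> blk_has_data(nblk, 0);
    for (int b = 0; b < nblk; ++b) {
        for (int j = b*C; j < (b + 1)*C && j < n_kv; ++j) {
            if (mask[j] != -INFINITY) { blk_has_data[b] = 1; break; }
        }
    }

    // pass 2: the attention loop only touches blocks that can contribute
    int visited = 0;
    for (int b = 0; b < nblk; ++b) {
        if (!blk_has_data[b]) continue; // skip -INF blocks
        // ... load K/V for this block and accumulate attention here ...
        visited++;
    }

    printf("visited %d of %d blocks\n", visited, nblk);
    return 0;
}
```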

Example with the Metal backend:

```
# unified KV cache with up to 4 sequences, running one by one
llama-batched-bench -m ../models/gemma-3-4b-it/ggml-model-f16.gguf -c 33792 -npp 8192 -ntg 32 -npl 1,2,4 -kvu -tgs

# the cache looks like this
#
#                        prompt processing ends here v
# 000...[8192 tokens]...000111...111222...222333...333000...[32 tokens]...000111...111222...222333...333
#                              text generation starts ^
```
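
As a rough back-of-the-envelope estimate (assuming the layout above, with n_seq prompts of n_pp tokens each sharing the unified cache), a single sequence can attend only to its own prompt and generated tokens, so the fraction of the cache that is masked for it grows with the number of sequences:

```cpp
#include <cstdio>

// Rough estimate (illustration only): how much of the unified KV cache is
// masked out for a single query of one sequence near the end of the run,
// assuming the layout shown above: n_seq prompts of n_pp tokens each, plus
// n_tg generated tokens per sequence, all sharing the same cache.
int main() {
    const int n_seq = 4;
    const int n_pp  = 8192;
    const int n_tg  = 32;

    const int n_kv_total = n_seq*(n_pp + n_tg); // everything in the cache
    const int n_kv_own   = n_pp + n_tg;         // what one sequence can attend to, at most

    printf("masked fraction ~ %.0f%%\n", 100.0*(n_kv_total - n_kv_own)/n_kv_total);
    // prints ~75% with these numbers: the more sequences share the cache,
    // the more -INF blocks a per-sequence decode can skip
    return 0;
}
```

With 4 sequences of 8192 prompt tokens, roughly 75% of the KV positions are masked for any one sequence, which is why the benefit of the skip grows with the number of sequences in the tables below.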

With the -INF block optimizations in the FA kernels:

main: n_kv_max = 33792, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    3.101 |  2641.88 |    0.478 |    66.95 |    3.579 |  2298.00 |
|  8192 |     32 |    2 |  16448 |    6.091 |  2689.76 |    0.971 |    65.90 |    7.062 |  2328.95 |
|  8192 |     32 |    4 |  32896 |   12.373 |  2648.43 |    1.965 |    65.15 |   14.337 |  2294.45 |

Disabling the -INF block optimizations in the FA kernels (the patch below marks every block as containing data and disables the per-block skip in the vec kernel):

Patch:

```diff
diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index cea535ade..6c249fb56 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -4633,7 +4633,7 @@ kernel void kernel_flash_attn_ext_blk(
     const int32_t nblk0 = ((args.ne30 + C - 1)/C);
 
     if (tiisg == 0) {
-        dst[((i3*args.ne32 + i2)*nblk1 + i1)*nblk0 + i0] = res;
+        dst[((i3*args.ne32 + i2)*nblk1 + i1)*nblk0 + i0] = 1;
     }
 }
 
@@ -5660,7 +5660,7 @@ void kernel_flash_attn_ext_vec_impl(
             }
 
             // skip -INF blocks
-            if (simd_max(sm[tiisg]) == -INFINITY) {
+            if (false) {
                 continue;
             }
 
```

main: n_kv_max = 33792, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    3.528 |  2321.91 |    0.500 |    64.05 |    4.028 |  2041.84 |
|  8192 |     32 |    2 |  16448 |    7.393 |  2216.28 |    1.027 |    62.30 |    8.420 |  1953.47 |
|  8192 |     32 |    4 |  32896 |   16.157 |  2028.06 |    2.159 |    59.30 |   18.316 |  1796.04 |

Observe that both PP and TG performance are worse, and the regression is amplified as more sequences are added to the cache.

@DajanaV DajanaV closed this Nov 13, 2025