
Misc. bug: llama-server crashes with gpt-oss-20b pos_min == -1, but n_past > 0 - should not happen #17118

@lukas-wresch

Description


Name and Version

llama-server --version
version: 6992 (aa3b7a9)
built with clang version 19.1.5 for x86_64-pc-windows-msvc

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa on --jinja --reasoning-format none --port 8080 --n-gpu-layers 11

Problem description & steps to reproduce

Start the server and send requests to the completions endpoint.
After some requests the server crashes with the output below (a sketch of the kind of client traffic is included after the excerpt):

slot launch_slot_: id  1 | task 2821 | processing task
slot update_slots: id  1 | task 2821 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 6325
slot update_slots: id  1 | task 2821 | n_past = 331, slot.prompt.tokens.size() = 1517, seq_id = 1, pos_min = 1378, n_swa = 128
slot update_slots: id  1 | task 2821 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  1 | task 2821 | erased invalidated context checkpoint (pos_min = 460, pos_max = 1102, n_swa = 128, size = 15.078 MiB)
slot update_slots: id  1 | task 2821 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 2821 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.323320
slot update_slots: id  1 | task 2821 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id  1 | task 2821 | prompt processing progress, n_tokens = 4090, batch.n_tokens = 2048, progress = 0.646640
slot update_slots: id  1 | task 2821 | n_tokens = 4090, memory_seq_rm [4090, end)
slot update_slots: id  1 | task 2821 | prompt processing progress, n_tokens = 6135, batch.n_tokens = 2048, progress = 0.969960
slot update_slots: id  1 | task 2821 | n_tokens = 6135, memory_seq_rm [6135, end)
slot update_slots: id  1 | task 2821 | prompt processing progress, n_tokens = 6261, batch.n_tokens = 129, progress = 0.989881
slot update_slots: id  1 | task 2821 | n_tokens = 6261, memory_seq_rm [6261, end)
slot update_slots: id  1 | task 2821 | prompt processing progress, n_tokens = 6325, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  1 | task 2821 | prompt done, n_tokens = 6325, batch.n_tokens = 67
slot update_slots: id  1 | task 2821 | created context checkpoint 1 of 8 (pos_min = 5621, pos_max = 6260, size = 15.008 MiB)
slot print_timing: id  2 | task 2816 |
prompt eval time =   80068.15 ms /  3072 tokens (   26.06 ms per token,    38.37 tokens per second)
       eval time =  866753.46 ms /   755 tokens ( 1148.02 ms per token,     0.87 tokens per second)
      total time =  946821.62 ms /  3827 tokens
slot      release: id  2 | task 2816 | stop processing: n_tokens = 3826, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  2 | task -1 | selected slot by LCP similarity, sim_best = 0.371 (> 0.100 thold), f_keep = 0.087
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 3826, total state size = 93.632 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.087, sim = 0.371
srv          load:  - found better prompt with f_keep = 0.264, sim = 0.420
state_read_meta: failed to find available cells in kv cache
state_seq_set_data: error loading state: failed to restore kv cache
srv          load: failed to restore state with size 39783992
D:/a/llama.cpp/llama.cpp/tools/server/server.cpp:3843: pos_min == -1, but n_past > 0 - should not happen: https://github.com/ggml-org/llama.cpp/pull/13833#discussion_r2116181237
slot  prompt_load: id  2 | task -1 | failed to load prompt from cache
srv        update:  - cache state: 14 prompts, 1000.673 MiB (limits: 8192.000 MiB, 131072 tokens, 246847 est)
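
For reference, a minimal sketch of the kind of client traffic that drives this code path: several parallel chat-completion requests with prompts of varying length, so that multiple slots and the prompt cache are exercised at once. The script is hypothetical; the endpoint and payload fields follow the OpenAI-compatible API exposed by llama-server, but the prompts, worker count, and request count are placeholders, not the exact workload that triggered the crash.

    # Hypothetical load generator (not the original workload):
    # sends concurrent /v1/chat/completions requests with varying
    # prompt lengths so that several slots and the prompt cache
    # are used in parallel.
    import concurrent.futures
    import requests

    URL = "http://127.0.0.1:8080/v1/chat/completions"  # matches --port 8080

    def send(i: int) -> int:
        body = {
            "model": "gpt-oss-20b",  # placeholder; llama-server serves the loaded model
            "messages": [
                {"role": "user",
                 "content": f"Request {i}: " + ("lorem ipsum " * (50 * (i % 8 + 1)))},
            ],
            "max_tokens": 256,
        }
        r = requests.post(URL, json=body, timeout=3600)
        return r.status_code

    # 4 workers assumed, matching the four slots (id 0-3) visible in the log.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for status in pool.map(send, range(64)):
            print(status)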

First Bad Commit

No response

Relevant log output

slot launch_slot_: id  1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 2806 | processing task
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 2807 | processing task
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = 360477965
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 2436, total state size = 60.100 MiB
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.027, sim = 0.083
srv        update:  - cache state: 1 prompts, 70.582 MiB (limits: 8192.000 MiB, 131072 tokens, 282730 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv  get_availabl: prompt cache update took 33.75 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 2808 | processing task
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = 590723549
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 3972, total state size = 114.173 MiB
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.016, sim = 0.059
srv        update:  - cache state: 2 prompts, 205.790 MiB (limits: 8192.000 MiB, 131072 tokens, 255087 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv  get_availabl: prompt cache update took 56.64 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 2809 | processing task
slot update_slots: id  0 | task 2807 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 824
slot update_slots: id  0 | task 2807 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 2807 | prompt processing progress, n_tokens = 760, batch.n_tokens = 760, progress = 0.922330
slot update_slots: id  1 | task 2806 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 811
slot update_slots: id  1 | task 2806 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 2806 | prompt processing progress, n_tokens = 747, batch.n_tokens = 1507, progress = 0.921085
slot update_slots: id  2 | task 2809 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1097
slot update_slots: id  2 | task 2809 | n_past = 65, slot.prompt.tokens.size() = 3972, seq_id = 2, pos_min = 3075, n_swa = 128
slot update_slots: id  2 | task 2809 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 2809 | erased invalidated context checkpoint (pos_min = 2137, pos_max = 3033, n_swa = 128, size = 21.034 MiB)
slot update_slots: id  2 | task 2809 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 2809 | prompt processing progress, n_tokens = 541, batch.n_tokens = 2048, progress = 0.493163
slot update_slots: id  0 | task 2807 | n_tokens = 760, memory_seq_rm [760, end)
slot update_slots: id  0 | task 2807 | prompt processing progress, n_tokens = 824, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id  0 | task 2807 | prompt done, n_tokens = 824, batch.n_tokens = 64
slot update_slots: id  0 | task 2807 | created context checkpoint 1 of 8 (pos_min = 633, pos_max = 759, size = 2.978 MiB)
slot update_slots: id  1 | task 2806 | n_tokens = 747, memory_seq_rm [747, end)
slot update_slots: id  1 | task 2806 | prompt processing progress, n_tokens = 811, batch.n_tokens = 128, progress = 1.000000
slot update_slots: id  1 | task 2806 | prompt done, n_tokens = 811, batch.n_tokens = 128
slot update_slots: id  1 | task 2806 | created context checkpoint 1 of 8 (pos_min = 518, pos_max = 746, size = 5.370 MiB)
slot update_slots: id  2 | task 2809 | n_tokens = 541, memory_seq_rm [541, end)
slot update_slots: id  2 | task 2809 | prompt processing progress, n_tokens = 1033, batch.n_tokens = 620, progress = 0.941659
slot update_slots: id  3 | task 2808 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 786
slot update_slots: id  3 | task 2808 | n_past = 65, slot.prompt.tokens.size() = 2436, seq_id = 3, pos_min = 2309, n_swa = 128
slot update_slots: id  3 | task 2808 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  3 | task 2808 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 2808 | prompt processing progress, n_tokens = 722, batch.n_tokens = 1342, progress = 0.918575
slot update_slots: id  2 | task 2809 | n_tokens = 1033, memory_seq_rm [1033, end)
slot update_slots: id  2 | task 2809 | prompt processing progress, n_tokens = 1097, batch.n_tokens = 66, progress = 1.000000
slot update_slots: id  2 | task 2809 | prompt done, n_tokens = 1097, batch.n_tokens = 66
slot update_slots: id  2 | task 2809 | created context checkpoint 1 of 8 (pos_min = 906, pos_max = 1032, size = 2.978 MiB)
slot update_slots: id  3 | task 2808 | n_tokens = 722, memory_seq_rm [722, end)
slot update_slots: id  3 | task 2808 | prompt processing progress, n_tokens = 786, batch.n_tokens = 130, progress = 1.000000
slot update_slots: id  3 | task 2808 | prompt done, n_tokens = 786, batch.n_tokens = 130
slot update_slots: id  3 | task 2808 | created context checkpoint 2 of 8 (pos_min = 79, pos_max = 721, size = 15.078 MiB)
slot print_timing: id  1 | task 2806 |
prompt eval time =   95657.19 ms /   811 tokens (  117.95 ms per token,     8.48 tokens per second)
       eval time =  166955.00 ms /   329 tokens (  507.46 ms per token,     1.97 tokens per second)
      total time =  262612.19 ms /  1140 tokens
slot      release: id  1 | task 2806 | stop processing: n_tokens = 1139, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 0.296 (> 0.100 thold), f_keep = 0.291
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1139, total state size = 32.712 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.291, sim = 0.296
srv        update:  - cache state: 3 prompts, 243.872 MiB (limits: 8192.000 MiB, 131072 tokens, 253514 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv  get_availabl: prompt cache update took 522.21 ms
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 2810 | processing task
slot update_slots: id  1 | task 2810 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1118
slot update_slots: id  1 | task 2810 | n_past = 331, slot.prompt.tokens.size() = 1139, seq_id = 1, pos_min = 883, n_swa = 128
slot update_slots: id  1 | task 2810 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  1 | task 2810 | erased invalidated context checkpoint (pos_min = 518, pos_max = 746, n_swa = 128, size = 5.370 MiB)
slot update_slots: id  1 | task 2810 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 2810 | prompt processing progress, n_tokens = 1054, batch.n_tokens = 1057, progress = 0.942755
slot update_slots: id  1 | task 2810 | n_tokens = 1054, memory_seq_rm [1054, end)
slot update_slots: id  1 | task 2810 | prompt processing progress, n_tokens = 1118, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  1 | task 2810 | prompt done, n_tokens = 1118, batch.n_tokens = 67
slot update_slots: id  1 | task 2810 | created context checkpoint 1 of 8 (pos_min = 411, pos_max = 1053, size = 15.078 MiB)
slot print_timing: id  0 | task 2807 |
prompt eval time =   95656.19 ms /   824 tokens (  116.09 ms per token,     8.61 tokens per second)
       eval time =  266007.77 ms /   463 tokens (  574.53 ms per token,     1.74 tokens per second)
      total time =  361663.96 ms /  1287 tokens
slot      release: id  0 | task 2807 | stop processing: n_tokens = 1286, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 958041133
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1286, total state size = 36.253 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.257, sim = 0.083
srv        update:  - cache state: 4 prompts, 283.103 MiB (limits: 8192.000 MiB, 131072 tokens, 255596 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv  get_availabl: prompt cache update took 639.83 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 2811 | processing task
slot update_slots: id  0 | task 2811 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 3971
slot update_slots: id  0 | task 2811 | n_past = 331, slot.prompt.tokens.size() = 1286, seq_id = 0, pos_min = 1026, n_swa = 128
slot update_slots: id  0 | task 2811 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 2811 | erased invalidated context checkpoint (pos_min = 633, pos_max = 759, n_swa = 128, size = 2.978 MiB)
slot update_slots: id  0 | task 2811 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 2811 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.514984
slot update_slots: id  0 | task 2811 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id  0 | task 2811 | prompt processing progress, n_tokens = 3907, batch.n_tokens = 1865, progress = 0.983883
slot update_slots: id  0 | task 2811 | n_tokens = 3907, memory_seq_rm [3907, end)
slot update_slots: id  0 | task 2811 | prompt processing progress, n_tokens = 3971, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  0 | task 2811 | prompt done, n_tokens = 3971, batch.n_tokens = 67
slot update_slots: id  0 | task 2811 | created context checkpoint 1 of 8 (pos_min = 3264, pos_max = 3906, size = 15.078 MiB)
slot print_timing: id  3 | task 2808 |
prompt eval time =   46249.84 ms /   786 tokens (   58.84 ms per token,    16.99 tokens per second)
       eval time =  369453.44 ms /   476 tokens (  776.16 ms per token,     1.29 tokens per second)
      total time =  415703.28 ms /  1262 tokens
slot      release: id  3 | task 2808 | stop processing: n_tokens = 1261, truncated = 0
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.144 (> 0.100 thold), f_keep = 0.262
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1261, total state size = 32.829 MiB
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.262, sim = 0.144
srv        update:  - cache state: 5 prompts, 341.492 MiB (limits: 8192.000 MiB, 131072 tokens, 242143 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv        update:    - prompt 0000023FD00829A0:    1261 tokens, checkpoints:  2,    58.389 MiB
srv  get_availabl: prompt cache update took 625.95 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 2812 | processing task
slot update_slots: id  3 | task 2812 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 2303
slot update_slots: id  3 | task 2812 | n_past = 331, slot.prompt.tokens.size() = 1261, seq_id = 3, pos_min = 1122, n_swa = 128
state_read_meta: failed to find available cells in kv cache
state_seq_set_data: error loading state: failed to restore kv cache
slot update_slots: id  3 | task 2812 | failed to restore context checkpoint (pos_min = 79, pos_max = 721, size = 15.078 MiB)
slot update_slots: id  3 | task 2812 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  3 | task 2812 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 2812 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.887972
slot update_slots: id  3 | task 2812 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id  3 | task 2812 | prompt processing progress, n_tokens = 2239, batch.n_tokens = 197, progress = 0.972210
slot update_slots: id  3 | task 2812 | n_tokens = 2239, memory_seq_rm [2239, end)
slot update_slots: id  3 | task 2812 | prompt processing progress, n_tokens = 2303, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  3 | task 2812 | prompt done, n_tokens = 2303, batch.n_tokens = 67
slot update_slots: id  3 | task 2812 | created context checkpoint 3 of 8 (pos_min = 1599, pos_max = 2238, size = 15.008 MiB)
slot print_timing: id  2 | task 2809 |
prompt eval time =  100466.86 ms /  1097 tokens (   91.58 ms per token,    10.92 tokens per second)
       eval time =  429549.97 ms /   481 tokens (  893.04 ms per token,     1.12 tokens per second)
      total time =  530016.83 ms /  1578 tokens
slot      release: id  2 | task 2809 | stop processing: n_tokens = 1577, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  2 | task -1 | selected slot by LCP similarity, sim_best = 0.329 (> 0.100 thold), f_keep = 0.210
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1577, total state size = 40.051 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.210, sim = 0.329
srv        update:  - cache state: 6 prompts, 384.521 MiB (limits: 8192.000 MiB, 131072 tokens, 248643 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv        update:    - prompt 0000023FD00829A0:    1261 tokens, checkpoints:  2,    58.389 MiB
srv        update:    - prompt 0000023FD0091A00:    1577 tokens, checkpoints:  1,    43.030 MiB
srv  get_availabl: prompt cache update took 611.19 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 2813 | processing task
slot update_slots: id  2 | task 2813 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1005
slot update_slots: id  2 | task 2813 | n_past = 331, slot.prompt.tokens.size() = 1577, seq_id = 2, pos_min = 1446, n_swa = 128
slot update_slots: id  2 | task 2813 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 2813 | erased invalidated context checkpoint (pos_min = 906, pos_max = 1032, n_swa = 128, size = 2.978 MiB)
slot update_slots: id  2 | task 2813 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 2813 | prompt processing progress, n_tokens = 941, batch.n_tokens = 944, progress = 0.936318
slot update_slots: id  2 | task 2813 | n_tokens = 941, memory_seq_rm [941, end)
slot update_slots: id  2 | task 2813 | prompt processing progress, n_tokens = 1005, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  2 | task 2813 | prompt done, n_tokens = 1005, batch.n_tokens = 67
slot update_slots: id  2 | task 2813 | created context checkpoint 1 of 8 (pos_min = 299, pos_max = 940, size = 15.055 MiB)
slot print_timing: id  1 | task 2810 |
prompt eval time =   30955.28 ms /  1118 tokens (   27.69 ms per token,    36.12 tokens per second)
       eval time =  341132.77 ms /   300 tokens ( 1137.11 ms per token,     0.88 tokens per second)
      total time =  372088.05 ms /  1418 tokens
slot      release: id  1 | task 2810 | stop processing: n_tokens = 1417, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 0.328 (> 0.100 thold), f_keep = 0.234
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1417, total state size = 37.073 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.234, sim = 0.328
srv        update:  - cache state: 7 prompts, 436.673 MiB (limits: 8192.000 MiB, 131072 tokens, 245531 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv        update:    - prompt 0000023FD00829A0:    1261 tokens, checkpoints:  2,    58.389 MiB
srv        update:    - prompt 0000023FD0091A00:    1577 tokens, checkpoints:  1,    43.030 MiB
srv        update:    - prompt 0000023F7E06FD60:    1417 tokens, checkpoints:  1,    52.151 MiB
srv  get_availabl: prompt cache update took 638.57 ms
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 2814 | processing task
slot update_slots: id  1 | task 2814 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1010
slot update_slots: id  1 | task 2814 | n_past = 331, slot.prompt.tokens.size() = 1417, seq_id = 1, pos_min = 1253, n_swa = 128
slot update_slots: id  1 | task 2814 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  1 | task 2814 | erased invalidated context checkpoint (pos_min = 411, pos_max = 1053, n_swa = 128, size = 15.078 MiB)
slot update_slots: id  1 | task 2814 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 2814 | prompt processing progress, n_tokens = 946, batch.n_tokens = 949, progress = 0.936634
slot update_slots: id  1 | task 2814 | n_tokens = 946, memory_seq_rm [946, end)
slot update_slots: id  1 | task 2814 | prompt processing progress, n_tokens = 1010, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  1 | task 2814 | prompt done, n_tokens = 1010, batch.n_tokens = 67
slot update_slots: id  1 | task 2814 | created context checkpoint 1 of 8 (pos_min = 303, pos_max = 945, size = 15.078 MiB)
slot print_timing: id  0 | task 2811 |
prompt eval time =  101880.49 ms /  3971 tokens (   25.66 ms per token,    38.98 tokens per second)
       eval time =  258165.89 ms /   277 tokens (  932.01 ms per token,     1.07 tokens per second)
      total time =  360046.39 ms /  4248 tokens
slot      release: id  0 | task 2811 | stop processing: n_tokens = 4247, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.230 (> 0.100 thold), f_keep = 0.078
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 4247, total state size = 103.879 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.078, sim = 0.230
srv        update:  - cache state: 8 prompts, 555.630 MiB (limits: 8192.000 MiB, 131072 tokens, 255580 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv        update:    - prompt 0000023FD00829A0:    1261 tokens, checkpoints:  2,    58.389 MiB
srv        update:    - prompt 0000023FD0091A00:    1577 tokens, checkpoints:  1,    43.030 MiB
srv        update:    - prompt 0000023F7E06FD60:    1417 tokens, checkpoints:  1,    52.151 MiB
srv        update:    - prompt 0000023FD00806B0:    4247 tokens, checkpoints:  1,   118.957 MiB
srv  get_availabl: prompt cache update took 775.48 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 2815 | processing task
slot update_slots: id  0 | task 2815 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1441
slot update_slots: id  0 | task 2815 | n_past = 331, slot.prompt.tokens.size() = 4247, seq_id = 0, pos_min = 4064, n_swa = 128
slot update_slots: id  0 | task 2815 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 2815 | erased invalidated context checkpoint (pos_min = 3264, pos_max = 3906, n_swa = 128, size = 15.078 MiB)
slot update_slots: id  0 | task 2815 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 2815 | prompt processing progress, n_tokens = 1377, batch.n_tokens = 1380, progress = 0.955586
slot update_slots: id  0 | task 2815 | n_tokens = 1377, memory_seq_rm [1377, end)
slot update_slots: id  0 | task 2815 | prompt processing progress, n_tokens = 1441, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  0 | task 2815 | prompt done, n_tokens = 1441, batch.n_tokens = 67
slot update_slots: id  0 | task 2815 | created context checkpoint 1 of 8 (pos_min = 734, pos_max = 1376, size = 15.078 MiB)
slot print_timing: id  2 | task 2813 |
prompt eval time =   26964.44 ms /  1005 tokens (   26.83 ms per token,    37.27 tokens per second)
       eval time =  205077.87 ms /   262 tokens (  782.74 ms per token,     1.28 tokens per second)
      total time =  232042.31 ms /  1267 tokens
slot      release: id  2 | task 2813 | stop processing: n_tokens = 1266, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  2 | task -1 | selected slot by LCP similarity, sim_best = 0.108 (> 0.100 thold), f_keep = 0.261
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1266, total state size = 32.712 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.261, sim = 0.108
srv        update:  - cache state: 9 prompts, 603.396 MiB (limits: 8192.000 MiB, 131072 tokens, 252536 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv        update:    - prompt 0000023FD00829A0:    1261 tokens, checkpoints:  2,    58.389 MiB
srv        update:    - prompt 0000023FD0091A00:    1577 tokens, checkpoints:  1,    43.030 MiB
srv        update:    - prompt 0000023F7E06FD60:    1417 tokens, checkpoints:  1,    52.151 MiB
srv        update:    - prompt 0000023FD00806B0:    4247 tokens, checkpoints:  1,   118.957 MiB
srv        update:    - prompt 0000023FD0BD15E0:    1266 tokens, checkpoints:  1,    47.766 MiB
srv  get_availabl: prompt cache update took 582.22 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  2 | task 2816 | processing task
slot update_slots: id  2 | task 2816 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 3072
slot update_slots: id  2 | task 2816 | n_past = 331, slot.prompt.tokens.size() = 1266, seq_id = 2, pos_min = 1137, n_swa = 128
slot update_slots: id  2 | task 2816 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 2816 | erased invalidated context checkpoint (pos_min = 299, pos_max = 940, n_swa = 128, size = 15.055 MiB)
slot update_slots: id  2 | task 2816 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  2 | task 2816 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.665690
slot update_slots: id  2 | task 2816 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id  2 | task 2816 | prompt processing progress, n_tokens = 3008, batch.n_tokens = 966, progress = 0.979167
slot update_slots: id  2 | task 2816 | n_tokens = 3008, memory_seq_rm [3008, end)
slot update_slots: id  2 | task 2816 | prompt processing progress, n_tokens = 3072, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  2 | task 2816 | prompt done, n_tokens = 3072, batch.n_tokens = 67
slot update_slots: id  2 | task 2816 | created context checkpoint 1 of 8 (pos_min = 2365, pos_max = 3007, size = 15.078 MiB)
slot print_timing: id  3 | task 2812 |
prompt eval time =   58219.87 ms /  2303 tokens (   25.28 ms per token,    39.56 tokens per second)
       eval time =  400455.46 ms /   431 tokens (  929.13 ms per token,     1.08 tokens per second)
      total time =  458675.32 ms /  2734 tokens
slot      release: id  3 | task 2812 | stop processing: n_tokens = 2733, truncated = 0
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.380 (> 0.100 thold), f_keep = 0.121
srv  get_availabl: updating prompt cache
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv   prompt_save:  - saving prompt with length 2733, total state size = 70.582 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.121, sim = 0.380
srv          load:  - found better prompt with f_keep = 0.263, sim = 0.383
srv        update:  - cache state: 9 prompts, 666.779 MiB (limits: 8192.000 MiB, 131072 tokens, 246553 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv        update:    - prompt 0000023FD00829A0:    1261 tokens, checkpoints:  2,    58.389 MiB
srv        update:    - prompt 0000023FD0091A00:    1577 tokens, checkpoints:  1,    43.030 MiB
srv        update:    - prompt 0000023F7E06FD60:    1417 tokens, checkpoints:  1,    52.151 MiB
srv        update:    - prompt 0000023FD00806B0:    4247 tokens, checkpoints:  1,   118.957 MiB
srv        update:    - prompt 0000023FD08B5B90:    2733 tokens, checkpoints:  3,   111.149 MiB
srv  get_availabl: prompt cache update took 995.10 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 2817 | processing task
slot update_slots: id  3 | task 2817 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 870
slot update_slots: id  3 | task 2817 | n_past = 333, slot.prompt.tokens.size() = 1266, seq_id = 3, pos_min = 1137, n_swa = 128
slot update_slots: id  3 | task 2817 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  3 | task 2817 | erased invalidated context checkpoint (pos_min = 299, pos_max = 940, n_swa = 128, size = 15.055 MiB)
slot update_slots: id  3 | task 2817 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 2817 | prompt processing progress, n_tokens = 806, batch.n_tokens = 809, progress = 0.926437
slot update_slots: id  3 | task 2817 | n_tokens = 806, memory_seq_rm [806, end)
slot update_slots: id  3 | task 2817 | prompt processing progress, n_tokens = 870, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  3 | task 2817 | prompt done, n_tokens = 870, batch.n_tokens = 67
slot update_slots: id  3 | task 2817 | created context checkpoint 1 of 8 (pos_min = 166, pos_max = 805, size = 15.008 MiB)
slot print_timing: id  1 | task 2814 |
prompt eval time =   26832.53 ms /  1010 tokens (   26.57 ms per token,    37.64 tokens per second)
       eval time =  360577.32 ms /   410 tokens (  879.46 ms per token,     1.14 tokens per second)
      total time =  387409.84 ms /  1420 tokens
slot      release: id  1 | task 2814 | stop processing: n_tokens = 1419, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 0.284 (> 0.100 thold), f_keep = 0.233
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1419, total state size = 37.941 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.233, sim = 0.284
srv        update:  - cache state: 10 prompts, 719.798 MiB (limits: 8192.000 MiB, 131072 tokens, 244542 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv        update:    - prompt 0000023FD00829A0:    1261 tokens, checkpoints:  2,    58.389 MiB
srv        update:    - prompt 0000023FD0091A00:    1577 tokens, checkpoints:  1,    43.030 MiB
srv        update:    - prompt 0000023F7E06FD60:    1417 tokens, checkpoints:  1,    52.151 MiB
srv        update:    - prompt 0000023FD00806B0:    4247 tokens, checkpoints:  1,   118.957 MiB
srv        update:    - prompt 0000023FD08B5B90:    2733 tokens, checkpoints:  3,   111.149 MiB
srv        update:    - prompt 0000023FD02D51A0:    1419 tokens, checkpoints:  1,    53.019 MiB
srv  get_availabl: prompt cache update took 693.50 ms
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 2818 | processing task
slot update_slots: id  1 | task 2818 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1167
slot update_slots: id  1 | task 2818 | n_past = 331, slot.prompt.tokens.size() = 1419, seq_id = 1, pos_min = 1220, n_swa = 128
slot update_slots: id  1 | task 2818 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  1 | task 2818 | erased invalidated context checkpoint (pos_min = 303, pos_max = 945, n_swa = 128, size = 15.078 MiB)
slot update_slots: id  1 | task 2818 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 2818 | prompt processing progress, n_tokens = 1103, batch.n_tokens = 1106, progress = 0.945159
slot update_slots: id  1 | task 2818 | n_tokens = 1103, memory_seq_rm [1103, end)
slot update_slots: id  1 | task 2818 | prompt processing progress, n_tokens = 1167, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  1 | task 2818 | prompt done, n_tokens = 1167, batch.n_tokens = 67
slot update_slots: id  1 | task 2818 | created context checkpoint 1 of 8 (pos_min = 460, pos_max = 1102, size = 15.078 MiB)
slot print_timing: id  0 | task 2815 |
prompt eval time =   39020.95 ms /  1441 tokens (   27.08 ms per token,    36.93 tokens per second)
       eval time =  420117.91 ms /   546 tokens (  769.45 ms per token,     1.30 tokens per second)
      total time =  459138.86 ms /  1987 tokens
slot      release: id  0 | task 2815 | stop processing: n_tokens = 1986, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 1778651419
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1986, total state size = 52.339 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.167, sim = 0.058
srv        update:  - cache state: 11 prompts, 787.215 MiB (limits: 8192.000 MiB, 131072 tokens, 244267 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv        update:    - prompt 0000023FD00829A0:    1261 tokens, checkpoints:  2,    58.389 MiB
srv        update:    - prompt 0000023FD0091A00:    1577 tokens, checkpoints:  1,    43.030 MiB
srv        update:    - prompt 0000023F7E06FD60:    1417 tokens, checkpoints:  1,    52.151 MiB
srv        update:    - prompt 0000023FD00806B0:    4247 tokens, checkpoints:  1,   118.957 MiB
srv        update:    - prompt 0000023FD08B5B90:    2733 tokens, checkpoints:  3,   111.149 MiB
srv        update:    - prompt 0000023FD02D51A0:    1419 tokens, checkpoints:  1,    53.019 MiB
srv        update:    - prompt 0000023F7E1CBE30:    1986 tokens, checkpoints:  1,    67.417 MiB
srv  get_availabl: prompt cache update took 904.06 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 2819 | processing task
slot update_slots: id  0 | task 2819 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 5752
slot update_slots: id  0 | task 2819 | n_past = 331, slot.prompt.tokens.size() = 1986, seq_id = 0, pos_min = 1740, n_swa = 128
slot update_slots: id  0 | task 2819 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 2819 | erased invalidated context checkpoint (pos_min = 734, pos_max = 1376, n_swa = 128, size = 15.078 MiB)
slot update_slots: id  0 | task 2819 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 2819 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.355529
slot update_slots: id  0 | task 2819 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id  0 | task 2819 | prompt processing progress, n_tokens = 4090, batch.n_tokens = 2048, progress = 0.711057
slot update_slots: id  0 | task 2819 | n_tokens = 4090, memory_seq_rm [4090, end)
slot update_slots: id  0 | task 2819 | prompt processing progress, n_tokens = 5688, batch.n_tokens = 1601, progress = 0.988873
slot update_slots: id  0 | task 2819 | n_tokens = 5688, memory_seq_rm [5688, end)
slot update_slots: id  0 | task 2819 | prompt processing progress, n_tokens = 5752, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  0 | task 2819 | prompt done, n_tokens = 5752, batch.n_tokens = 67
slot update_slots: id  0 | task 2819 | created context checkpoint 1 of 8 (pos_min = 5045, pos_max = 5687, size = 15.078 MiB)
slot print_timing: id  3 | task 2817 |
prompt eval time =   25345.06 ms /   870 tokens (   29.13 ms per token,    34.33 tokens per second)
       eval time =  430894.74 ms /   468 tokens (  920.72 ms per token,     1.09 tokens per second)
      total time =  456239.80 ms /  1338 tokens
slot      release: id  3 | task 2817 | stop processing: n_tokens = 1337, truncated = 0
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.102 (> 0.100 thold), f_keep = 0.248
srv  get_availabl: updating prompt cache
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv   prompt_save:  - saving prompt with length 1337, total state size = 35.831 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.248, sim = 0.102
srv        update:  - cache state: 12 prompts, 838.053 MiB (limits: 8192.000 MiB, 131072 tokens, 242518 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv        update:    - prompt 0000023FD00829A0:    1261 tokens, checkpoints:  2,    58.389 MiB
srv        update:    - prompt 0000023FD0091A00:    1577 tokens, checkpoints:  1,    43.030 MiB
srv        update:    - prompt 0000023F7E06FD60:    1417 tokens, checkpoints:  1,    52.151 MiB
srv        update:    - prompt 0000023FD00806B0:    4247 tokens, checkpoints:  1,   118.957 MiB
srv        update:    - prompt 0000023FD08B5B90:    2733 tokens, checkpoints:  3,   111.149 MiB
srv        update:    - prompt 0000023FD02D51A0:    1419 tokens, checkpoints:  1,    53.019 MiB
srv        update:    - prompt 0000023F7E1CBE30:    1986 tokens, checkpoints:  1,    67.417 MiB
srv        update:    - prompt 0000023F7DCAE6E0:    1337 tokens, checkpoints:  1,    50.838 MiB
srv  get_availabl: prompt cache update took 818.14 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 2820 | processing task
slot update_slots: id  3 | task 2820 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 3255
slot update_slots: id  3 | task 2820 | n_past = 331, slot.prompt.tokens.size() = 1337, seq_id = 3, pos_min = 1146, n_swa = 128
state_read_meta: failed to find available cells in kv cache
state_seq_set_data: error loading state: failed to restore kv cache
slot update_slots: id  3 | task 2820 | failed to restore context checkpoint (pos_min = 166, pos_max = 805, size = 15.008 MiB)
slot update_slots: id  3 | task 2820 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  3 | task 2820 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 2820 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.628264
slot update_slots: id  3 | task 2820 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id  3 | task 2820 | prompt processing progress, n_tokens = 3191, batch.n_tokens = 1149, progress = 0.980338
slot update_slots: id  3 | task 2820 | n_tokens = 3191, memory_seq_rm [3191, end)
slot update_slots: id  3 | task 2820 | prompt processing progress, n_tokens = 3255, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  3 | task 2820 | prompt done, n_tokens = 3255, batch.n_tokens = 67
slot update_slots: id  3 | task 2820 | created context checkpoint 2 of 8 (pos_min = 2548, pos_max = 3190, size = 15.078 MiB)
slot print_timing: id  1 | task 2818 |
prompt eval time =   33308.17 ms /  1167 tokens (   28.54 ms per token,    35.04 tokens per second)
       eval time =  422492.45 ms /   351 tokens ( 1203.68 ms per token,     0.83 tokens per second)
      total time =  455800.62 ms /  1518 tokens
slot      release: id  1 | task 2818 | stop processing: n_tokens = 1517, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = 2076151011
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 1517, total state size = 38.832 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.218, sim = 0.052
srv        update:  - cache state: 13 prompts, 891.963 MiB (limits: 8192.000 MiB, 131072 tokens, 241793 est)
srv        update:    - prompt 0000023F7DA5B150:    2436 tokens, checkpoints:  1,    70.582 MiB
srv        update:    - prompt 0000023F7E263940:    3972 tokens, checkpoints:  1,   135.207 MiB
srv        update:    - prompt 0000023FD0091AE0:    1139 tokens, checkpoints:  1,    38.082 MiB
srv        update:    - prompt 0000023F7DEC66D0:    1286 tokens, checkpoints:  1,    39.231 MiB
srv        update:    - prompt 0000023FD00829A0:    1261 tokens, checkpoints:  2,    58.389 MiB
srv        update:    - prompt 0000023FD0091A00:    1577 tokens, checkpoints:  1,    43.030 MiB
srv        update:    - prompt 0000023F7E06FD60:    1417 tokens, checkpoints:  1,    52.151 MiB
srv        update:    - prompt 0000023FD00806B0:    4247 tokens, checkpoints:  1,   118.957 MiB
srv        update:    - prompt 0000023FD08B5B90:    2733 tokens, checkpoints:  3,   111.149 MiB
srv        update:    - prompt 0000023FD02D51A0:    1419 tokens, checkpoints:  1,    53.019 MiB
srv        update:    - prompt 0000023F7E1CBE30:    1986 tokens, checkpoints:  1,    67.417 MiB
srv        update:    - prompt 0000023F7DCAE6E0:    1337 tokens, checkpoints:  1,    50.838 MiB
srv        update:    - prompt 0000023FD0B76580:    1517 tokens, checkpoints:  1,    53.910 MiB
srv  get_availabl: prompt cache update took 728.26 ms
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 2821 | processing task
slot update_slots: id  1 | task 2821 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 6325
slot update_slots: id  1 | task 2821 | n_past = 331, slot.prompt.tokens.size() = 1517, seq_id = 1, pos_min = 1378, n_swa = 128
slot update_slots: id  1 | task 2821 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  1 | task 2821 | erased invalidated context checkpoint (pos_min = 460, pos_max = 1102, n_swa = 128, size = 15.078 MiB)
slot update_slots: id  1 | task 2821 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 2821 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.323320
slot update_slots: id  1 | task 2821 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id  1 | task 2821 | prompt processing progress, n_tokens = 4090, batch.n_tokens = 2048, progress = 0.646640
slot update_slots: id  1 | task 2821 | n_tokens = 4090, memory_seq_rm [4090, end)
slot update_slots: id  1 | task 2821 | prompt processing progress, n_tokens = 6135, batch.n_tokens = 2048, progress = 0.969960
slot update_slots: id  1 | task 2821 | n_tokens = 6135, memory_seq_rm [6135, end)
slot update_slots: id  1 | task 2821 | prompt processing progress, n_tokens = 6261, batch.n_tokens = 129, progress = 0.989881
slot update_slots: id  1 | task 2821 | n_tokens = 6261, memory_seq_rm [6261, end)
slot update_slots: id  1 | task 2821 | prompt processing progress, n_tokens = 6325, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id  1 | task 2821 | prompt done, n_tokens = 6325, batch.n_tokens = 67
slot update_slots: id  1 | task 2821 | created context checkpoint 1 of 8 (pos_min = 5621, pos_max = 6260, size = 15.008 MiB)
slot print_timing: id  2 | task 2816 |
prompt eval time =   80068.15 ms /  3072 tokens (   26.06 ms per token,    38.37 tokens per second)
       eval time =  866753.46 ms /   755 tokens ( 1148.02 ms per token,     0.87 tokens per second)
      total time =  946821.62 ms /  3827 tokens
slot      release: id  2 | task 2816 | stop processing: n_tokens = 3826, truncated = 0
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id  2 | task -1 | selected slot by LCP similarity, sim_best = 0.371 (> 0.100 thold), f_keep = 0.087
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 3826, total state size = 93.632 MiB
srv  params_from_: Chat format: GPT-OSS
srv          load:  - looking for better prompt, base f_keep = 0.087, sim = 0.371
srv          load:  - found better prompt with f_keep = 0.264, sim = 0.420
state_read_meta: failed to find available cells in kv cache
state_seq_set_data: error loading state: failed to restore kv cache
srv          load: failed to restore state with size 39783992
D:/a/llama.cpp/llama.cpp/tools/server/server.cpp:3843: pos_min == -1, but n_past > 0 - should not happen: https://github.com/ggml-org/llama.cpp/pull/13833#discussion_r2116181237
slot  prompt_load: id  2 | task -1 | failed to load prompt from cache
srv        update:  - cache state: 14 prompts, 1000.673 MiB (limits: 8192.000 MiB, 131072 tokens, 246847 est)
