Name and Version
llama-server --version
version: 6992 (aa3b7a9)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa on --jinja --reasoning-format none --port 8080 --n-gpu-layers 11
Problem description & steps to reproduce
Start the server and send requests to the completions endpoint.
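The original request payloads are not part of this report; the sketch below only illustrates the kind of traffic involved, assuming the server is reachable on the port from the command line above and that several chat completion requests with varying prompts are sent one after another (prompt contents and request count are placeholders, not the original data).

# Minimal reproduction sketch (assumptions: server started as above on 127.0.0.1:8080;
# prompts and loop count are placeholders, not the requests from the failing session).
import json
import urllib.request

URL = "http://127.0.0.1:8080/v1/chat/completions"

def chat(prompt):
    # Send one OpenAI-compatible chat completion request and return the reply text.
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode("utf-8")
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Vary the prompt length so the server keeps saving/restoring prompt cache state between requests.
for i in range(1, 100):
    print(chat("Summarize this text: " + ("some filler text. " * (10 * i))))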
After some requests, the server crashes with:
slot launch_slot_: id 1 | task 2821 | processing task
slot update_slots: id 1 | task 2821 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 6325
slot update_slots: id 1 | task 2821 | n_past = 331, slot.prompt.tokens.size() = 1517, seq_id = 1, pos_min = 1378, n_swa = 128
slot update_slots: id 1 | task 2821 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 1 | task 2821 | erased invalidated context checkpoint (pos_min = 460, pos_max = 1102, n_swa = 128, size = 15.078 MiB)
slot update_slots: id 1 | task 2821 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 1 | task 2821 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.323320
slot update_slots: id 1 | task 2821 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id 1 | task 2821 | prompt processing progress, n_tokens = 4090, batch.n_tokens = 2048, progress = 0.646640
slot update_slots: id 1 | task 2821 | n_tokens = 4090, memory_seq_rm [4090, end)
slot update_slots: id 1 | task 2821 | prompt processing progress, n_tokens = 6135, batch.n_tokens = 2048, progress = 0.969960
slot update_slots: id 1 | task 2821 | n_tokens = 6135, memory_seq_rm [6135, end)
slot update_slots: id 1 | task 2821 | prompt processing progress, n_tokens = 6261, batch.n_tokens = 129, progress = 0.989881
slot update_slots: id 1 | task 2821 | n_tokens = 6261, memory_seq_rm [6261, end)
slot update_slots: id 1 | task 2821 | prompt processing progress, n_tokens = 6325, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 1 | task 2821 | prompt done, n_tokens = 6325, batch.n_tokens = 67
slot update_slots: id 1 | task 2821 | created context checkpoint 1 of 8 (pos_min = 5621, pos_max = 6260, size = 15.008 MiB)
slot print_timing: id 2 | task 2816 |
prompt eval time = 80068.15 ms / 3072 tokens ( 26.06 ms per token, 38.37 tokens per second)
eval time = 866753.46 ms / 755 tokens ( 1148.02 ms per token, 0.87 tokens per second)
total time = 946821.62 ms / 3827 tokens
slot release: id 2 | task 2816 | stop processing: n_tokens = 3826, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.371 (> 0.100 thold), f_keep = 0.087
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 3826, total state size = 93.632 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.087, sim = 0.371
srv load: - found better prompt with f_keep = 0.264, sim = 0.420
state_read_meta: failed to find available cells in kv cache
state_seq_set_data: error loading state: failed to restore kv cache
srv load: failed to restore state with size 39783992
D:/a/llama.cpp/llama.cpp/tools/server/server.cpp:3843: pos_min == -1, but n_past > 0 - should not happen: https://github.com/ggml-org/llama.cpp/pull/13833#discussion_r2116181237
slot prompt_load: id 2 | task -1 | failed to load prompt from cache
srv update: - cache state: 14 prompts, 1000.673 MiB (limits: 8192.000 MiB, 131072 tokens, 246847 est)
First Bad Commit
No response
Relevant log output
slot launch_slot_: id 1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 1 | task 2806 | processing task
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 2807 | processing task
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = 360477965
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 2436, total state size = 60.100 MiB
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.027, sim = 0.083
srv update: - cache state: 1 prompts, 70.582 MiB (limits: 8192.000 MiB, 131072 tokens, 282730 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv get_availabl: prompt cache update took 33.75 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 2808 | processing task
slot get_availabl: id 2 | task -1 | selected slot by LRU, t_last = 590723549
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 3972, total state size = 114.173 MiB
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.016, sim = 0.059
srv update: - cache state: 2 prompts, 205.790 MiB (limits: 8192.000 MiB, 131072 tokens, 255087 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv get_availabl: prompt cache update took 56.64 ms
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 2809 | processing task
slot update_slots: id 0 | task 2807 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 824
slot update_slots: id 0 | task 2807 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 2807 | prompt processing progress, n_tokens = 760, batch.n_tokens = 760, progress = 0.922330
slot update_slots: id 1 | task 2806 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 811
slot update_slots: id 1 | task 2806 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 1 | task 2806 | prompt processing progress, n_tokens = 747, batch.n_tokens = 1507, progress = 0.921085
slot update_slots: id 2 | task 2809 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1097
slot update_slots: id 2 | task 2809 | n_past = 65, slot.prompt.tokens.size() = 3972, seq_id = 2, pos_min = 3075, n_swa = 128
slot update_slots: id 2 | task 2809 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 2 | task 2809 | erased invalidated context checkpoint (pos_min = 2137, pos_max = 3033, n_swa = 128, size = 21.034 MiB)
slot update_slots: id 2 | task 2809 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 2 | task 2809 | prompt processing progress, n_tokens = 541, batch.n_tokens = 2048, progress = 0.493163
slot update_slots: id 0 | task 2807 | n_tokens = 760, memory_seq_rm [760, end)
slot update_slots: id 0 | task 2807 | prompt processing progress, n_tokens = 824, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 2807 | prompt done, n_tokens = 824, batch.n_tokens = 64
slot update_slots: id 0 | task 2807 | created context checkpoint 1 of 8 (pos_min = 633, pos_max = 759, size = 2.978 MiB)
slot update_slots: id 1 | task 2806 | n_tokens = 747, memory_seq_rm [747, end)
slot update_slots: id 1 | task 2806 | prompt processing progress, n_tokens = 811, batch.n_tokens = 128, progress = 1.000000
slot update_slots: id 1 | task 2806 | prompt done, n_tokens = 811, batch.n_tokens = 128
slot update_slots: id 1 | task 2806 | created context checkpoint 1 of 8 (pos_min = 518, pos_max = 746, size = 5.370 MiB)
slot update_slots: id 2 | task 2809 | n_tokens = 541, memory_seq_rm [541, end)
slot update_slots: id 2 | task 2809 | prompt processing progress, n_tokens = 1033, batch.n_tokens = 620, progress = 0.941659
slot update_slots: id 3 | task 2808 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 786
slot update_slots: id 3 | task 2808 | n_past = 65, slot.prompt.tokens.size() = 2436, seq_id = 3, pos_min = 2309, n_swa = 128
slot update_slots: id 3 | task 2808 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 3 | task 2808 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 2808 | prompt processing progress, n_tokens = 722, batch.n_tokens = 1342, progress = 0.918575
slot update_slots: id 2 | task 2809 | n_tokens = 1033, memory_seq_rm [1033, end)
slot update_slots: id 2 | task 2809 | prompt processing progress, n_tokens = 1097, batch.n_tokens = 66, progress = 1.000000
slot update_slots: id 2 | task 2809 | prompt done, n_tokens = 1097, batch.n_tokens = 66
slot update_slots: id 2 | task 2809 | created context checkpoint 1 of 8 (pos_min = 906, pos_max = 1032, size = 2.978 MiB)
slot update_slots: id 3 | task 2808 | n_tokens = 722, memory_seq_rm [722, end)
slot update_slots: id 3 | task 2808 | prompt processing progress, n_tokens = 786, batch.n_tokens = 130, progress = 1.000000
slot update_slots: id 3 | task 2808 | prompt done, n_tokens = 786, batch.n_tokens = 130
slot update_slots: id 3 | task 2808 | created context checkpoint 2 of 8 (pos_min = 79, pos_max = 721, size = 15.078 MiB)
slot print_timing: id 1 | task 2806 |
prompt eval time = 95657.19 ms / 811 tokens ( 117.95 ms per token, 8.48 tokens per second)
eval time = 166955.00 ms / 329 tokens ( 507.46 ms per token, 1.97 tokens per second)
total time = 262612.19 ms / 1140 tokens
slot release: id 1 | task 2806 | stop processing: n_tokens = 1139, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 1 | task -1 | selected slot by LCP similarity, sim_best = 0.296 (> 0.100 thold), f_keep = 0.291
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1139, total state size = 32.712 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.291, sim = 0.296
srv update: - cache state: 3 prompts, 243.872 MiB (limits: 8192.000 MiB, 131072 tokens, 253514 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv get_availabl: prompt cache update took 522.21 ms
slot launch_slot_: id 1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 1 | task 2810 | processing task
slot update_slots: id 1 | task 2810 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1118
slot update_slots: id 1 | task 2810 | n_past = 331, slot.prompt.tokens.size() = 1139, seq_id = 1, pos_min = 883, n_swa = 128
slot update_slots: id 1 | task 2810 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 1 | task 2810 | erased invalidated context checkpoint (pos_min = 518, pos_max = 746, n_swa = 128, size = 5.370 MiB)
slot update_slots: id 1 | task 2810 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 1 | task 2810 | prompt processing progress, n_tokens = 1054, batch.n_tokens = 1057, progress = 0.942755
slot update_slots: id 1 | task 2810 | n_tokens = 1054, memory_seq_rm [1054, end)
slot update_slots: id 1 | task 2810 | prompt processing progress, n_tokens = 1118, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 1 | task 2810 | prompt done, n_tokens = 1118, batch.n_tokens = 67
slot update_slots: id 1 | task 2810 | created context checkpoint 1 of 8 (pos_min = 411, pos_max = 1053, size = 15.078 MiB)
slot print_timing: id 0 | task 2807 |
prompt eval time = 95656.19 ms / 824 tokens ( 116.09 ms per token, 8.61 tokens per second)
eval time = 266007.77 ms / 463 tokens ( 574.53 ms per token, 1.74 tokens per second)
total time = 361663.96 ms / 1287 tokens
slot release: id 0 | task 2807 | stop processing: n_tokens = 1286, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 958041133
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1286, total state size = 36.253 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.257, sim = 0.083
srv update: - cache state: 4 prompts, 283.103 MiB (limits: 8192.000 MiB, 131072 tokens, 255596 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv get_availabl: prompt cache update took 639.83 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 2811 | processing task
slot update_slots: id 0 | task 2811 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 3971
slot update_slots: id 0 | task 2811 | n_past = 331, slot.prompt.tokens.size() = 1286, seq_id = 0, pos_min = 1026, n_swa = 128
slot update_slots: id 0 | task 2811 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 2811 | erased invalidated context checkpoint (pos_min = 633, pos_max = 759, n_swa = 128, size = 2.978 MiB)
slot update_slots: id 0 | task 2811 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 2811 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.514984
slot update_slots: id 0 | task 2811 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id 0 | task 2811 | prompt processing progress, n_tokens = 3907, batch.n_tokens = 1865, progress = 0.983883
slot update_slots: id 0 | task 2811 | n_tokens = 3907, memory_seq_rm [3907, end)
slot update_slots: id 0 | task 2811 | prompt processing progress, n_tokens = 3971, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 0 | task 2811 | prompt done, n_tokens = 3971, batch.n_tokens = 67
slot update_slots: id 0 | task 2811 | created context checkpoint 1 of 8 (pos_min = 3264, pos_max = 3906, size = 15.078 MiB)
slot print_timing: id 3 | task 2808 |
prompt eval time = 46249.84 ms / 786 tokens ( 58.84 ms per token, 16.99 tokens per second)
eval time = 369453.44 ms / 476 tokens ( 776.16 ms per token, 1.29 tokens per second)
total time = 415703.28 ms / 1262 tokens
slot release: id 3 | task 2808 | stop processing: n_tokens = 1261, truncated = 0
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.144 (> 0.100 thold), f_keep = 0.262
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1261, total state size = 32.829 MiB
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.262, sim = 0.144
srv update: - cache state: 5 prompts, 341.492 MiB (limits: 8192.000 MiB, 131072 tokens, 242143 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv update: - prompt 0000023FD00829A0: 1261 tokens, checkpoints: 2, 58.389 MiB
srv get_availabl: prompt cache update took 625.95 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 2812 | processing task
slot update_slots: id 3 | task 2812 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 2303
slot update_slots: id 3 | task 2812 | n_past = 331, slot.prompt.tokens.size() = 1261, seq_id = 3, pos_min = 1122, n_swa = 128
state_read_meta: failed to find available cells in kv cache
state_seq_set_data: error loading state: failed to restore kv cache
slot update_slots: id 3 | task 2812 | failed to restore context checkpoint (pos_min = 79, pos_max = 721, size = 15.078 MiB)
slot update_slots: id 3 | task 2812 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 3 | task 2812 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 2812 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.887972
slot update_slots: id 3 | task 2812 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id 3 | task 2812 | prompt processing progress, n_tokens = 2239, batch.n_tokens = 197, progress = 0.972210
slot update_slots: id 3 | task 2812 | n_tokens = 2239, memory_seq_rm [2239, end)
slot update_slots: id 3 | task 2812 | prompt processing progress, n_tokens = 2303, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 3 | task 2812 | prompt done, n_tokens = 2303, batch.n_tokens = 67
slot update_slots: id 3 | task 2812 | created context checkpoint 3 of 8 (pos_min = 1599, pos_max = 2238, size = 15.008 MiB)
slot print_timing: id 2 | task 2809 |
prompt eval time = 100466.86 ms / 1097 tokens ( 91.58 ms per token, 10.92 tokens per second)
eval time = 429549.97 ms / 481 tokens ( 893.04 ms per token, 1.12 tokens per second)
total time = 530016.83 ms / 1578 tokens
slot release: id 2 | task 2809 | stop processing: n_tokens = 1577, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.329 (> 0.100 thold), f_keep = 0.210
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1577, total state size = 40.051 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.210, sim = 0.329
srv update: - cache state: 6 prompts, 384.521 MiB (limits: 8192.000 MiB, 131072 tokens, 248643 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv update: - prompt 0000023FD00829A0: 1261 tokens, checkpoints: 2, 58.389 MiB
srv update: - prompt 0000023FD0091A00: 1577 tokens, checkpoints: 1, 43.030 MiB
srv get_availabl: prompt cache update took 611.19 ms
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 2813 | processing task
slot update_slots: id 2 | task 2813 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1005
slot update_slots: id 2 | task 2813 | n_past = 331, slot.prompt.tokens.size() = 1577, seq_id = 2, pos_min = 1446, n_swa = 128
slot update_slots: id 2 | task 2813 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 2 | task 2813 | erased invalidated context checkpoint (pos_min = 906, pos_max = 1032, n_swa = 128, size = 2.978 MiB)
slot update_slots: id 2 | task 2813 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 2 | task 2813 | prompt processing progress, n_tokens = 941, batch.n_tokens = 944, progress = 0.936318
slot update_slots: id 2 | task 2813 | n_tokens = 941, memory_seq_rm [941, end)
slot update_slots: id 2 | task 2813 | prompt processing progress, n_tokens = 1005, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 2 | task 2813 | prompt done, n_tokens = 1005, batch.n_tokens = 67
slot update_slots: id 2 | task 2813 | created context checkpoint 1 of 8 (pos_min = 299, pos_max = 940, size = 15.055 MiB)
slot print_timing: id 1 | task 2810 |
prompt eval time = 30955.28 ms / 1118 tokens ( 27.69 ms per token, 36.12 tokens per second)
eval time = 341132.77 ms / 300 tokens ( 1137.11 ms per token, 0.88 tokens per second)
total time = 372088.05 ms / 1418 tokens
slot release: id 1 | task 2810 | stop processing: n_tokens = 1417, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 1 | task -1 | selected slot by LCP similarity, sim_best = 0.328 (> 0.100 thold), f_keep = 0.234
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1417, total state size = 37.073 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.234, sim = 0.328
srv update: - cache state: 7 prompts, 436.673 MiB (limits: 8192.000 MiB, 131072 tokens, 245531 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv update: - prompt 0000023FD00829A0: 1261 tokens, checkpoints: 2, 58.389 MiB
srv update: - prompt 0000023FD0091A00: 1577 tokens, checkpoints: 1, 43.030 MiB
srv update: - prompt 0000023F7E06FD60: 1417 tokens, checkpoints: 1, 52.151 MiB
srv get_availabl: prompt cache update took 638.57 ms
slot launch_slot_: id 1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 1 | task 2814 | processing task
slot update_slots: id 1 | task 2814 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1010
slot update_slots: id 1 | task 2814 | n_past = 331, slot.prompt.tokens.size() = 1417, seq_id = 1, pos_min = 1253, n_swa = 128
slot update_slots: id 1 | task 2814 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 1 | task 2814 | erased invalidated context checkpoint (pos_min = 411, pos_max = 1053, n_swa = 128, size = 15.078 MiB)
slot update_slots: id 1 | task 2814 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 1 | task 2814 | prompt processing progress, n_tokens = 946, batch.n_tokens = 949, progress = 0.936634
slot update_slots: id 1 | task 2814 | n_tokens = 946, memory_seq_rm [946, end)
slot update_slots: id 1 | task 2814 | prompt processing progress, n_tokens = 1010, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 1 | task 2814 | prompt done, n_tokens = 1010, batch.n_tokens = 67
slot update_slots: id 1 | task 2814 | created context checkpoint 1 of 8 (pos_min = 303, pos_max = 945, size = 15.078 MiB)
slot print_timing: id 0 | task 2811 |
prompt eval time = 101880.49 ms / 3971 tokens ( 25.66 ms per token, 38.98 tokens per second)
eval time = 258165.89 ms / 277 tokens ( 932.01 ms per token, 1.07 tokens per second)
total time = 360046.39 ms / 4248 tokens
slot release: id 0 | task 2811 | stop processing: n_tokens = 4247, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.230 (> 0.100 thold), f_keep = 0.078
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 4247, total state size = 103.879 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.078, sim = 0.230
srv update: - cache state: 8 prompts, 555.630 MiB (limits: 8192.000 MiB, 131072 tokens, 255580 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv update: - prompt 0000023FD00829A0: 1261 tokens, checkpoints: 2, 58.389 MiB
srv update: - prompt 0000023FD0091A00: 1577 tokens, checkpoints: 1, 43.030 MiB
srv update: - prompt 0000023F7E06FD60: 1417 tokens, checkpoints: 1, 52.151 MiB
srv update: - prompt 0000023FD00806B0: 4247 tokens, checkpoints: 1, 118.957 MiB
srv get_availabl: prompt cache update took 775.48 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 2815 | processing task
slot update_slots: id 0 | task 2815 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1441
slot update_slots: id 0 | task 2815 | n_past = 331, slot.prompt.tokens.size() = 4247, seq_id = 0, pos_min = 4064, n_swa = 128
slot update_slots: id 0 | task 2815 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 2815 | erased invalidated context checkpoint (pos_min = 3264, pos_max = 3906, n_swa = 128, size = 15.078 MiB)
slot update_slots: id 0 | task 2815 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 2815 | prompt processing progress, n_tokens = 1377, batch.n_tokens = 1380, progress = 0.955586
slot update_slots: id 0 | task 2815 | n_tokens = 1377, memory_seq_rm [1377, end)
slot update_slots: id 0 | task 2815 | prompt processing progress, n_tokens = 1441, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 0 | task 2815 | prompt done, n_tokens = 1441, batch.n_tokens = 67
slot update_slots: id 0 | task 2815 | created context checkpoint 1 of 8 (pos_min = 734, pos_max = 1376, size = 15.078 MiB)
slot print_timing: id 2 | task 2813 |
prompt eval time = 26964.44 ms / 1005 tokens ( 26.83 ms per token, 37.27 tokens per second)
eval time = 205077.87 ms / 262 tokens ( 782.74 ms per token, 1.28 tokens per second)
total time = 232042.31 ms / 1267 tokens
slot release: id 2 | task 2813 | stop processing: n_tokens = 1266, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.108 (> 0.100 thold), f_keep = 0.261
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1266, total state size = 32.712 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.261, sim = 0.108
srv update: - cache state: 9 prompts, 603.396 MiB (limits: 8192.000 MiB, 131072 tokens, 252536 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv update: - prompt 0000023FD00829A0: 1261 tokens, checkpoints: 2, 58.389 MiB
srv update: - prompt 0000023FD0091A00: 1577 tokens, checkpoints: 1, 43.030 MiB
srv update: - prompt 0000023F7E06FD60: 1417 tokens, checkpoints: 1, 52.151 MiB
srv update: - prompt 0000023FD00806B0: 4247 tokens, checkpoints: 1, 118.957 MiB
srv update: - prompt 0000023FD0BD15E0: 1266 tokens, checkpoints: 1, 47.766 MiB
srv get_availabl: prompt cache update took 582.22 ms
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 2816 | processing task
slot update_slots: id 2 | task 2816 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 3072
slot update_slots: id 2 | task 2816 | n_past = 331, slot.prompt.tokens.size() = 1266, seq_id = 2, pos_min = 1137, n_swa = 128
slot update_slots: id 2 | task 2816 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 2 | task 2816 | erased invalidated context checkpoint (pos_min = 299, pos_max = 940, n_swa = 128, size = 15.055 MiB)
slot update_slots: id 2 | task 2816 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 2 | task 2816 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.665690
slot update_slots: id 2 | task 2816 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id 2 | task 2816 | prompt processing progress, n_tokens = 3008, batch.n_tokens = 966, progress = 0.979167
slot update_slots: id 2 | task 2816 | n_tokens = 3008, memory_seq_rm [3008, end)
slot update_slots: id 2 | task 2816 | prompt processing progress, n_tokens = 3072, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 2 | task 2816 | prompt done, n_tokens = 3072, batch.n_tokens = 67
slot update_slots: id 2 | task 2816 | created context checkpoint 1 of 8 (pos_min = 2365, pos_max = 3007, size = 15.078 MiB)
slot print_timing: id 3 | task 2812 |
prompt eval time = 58219.87 ms / 2303 tokens ( 25.28 ms per token, 39.56 tokens per second)
eval time = 400455.46 ms / 431 tokens ( 929.13 ms per token, 1.08 tokens per second)
total time = 458675.32 ms / 2734 tokens
slot release: id 3 | task 2812 | stop processing: n_tokens = 2733, truncated = 0
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.380 (> 0.100 thold), f_keep = 0.121
srv get_availabl: updating prompt cache
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv prompt_save: - saving prompt with length 2733, total state size = 70.582 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.121, sim = 0.380
srv load: - found better prompt with f_keep = 0.263, sim = 0.383
srv update: - cache state: 9 prompts, 666.779 MiB (limits: 8192.000 MiB, 131072 tokens, 246553 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv update: - prompt 0000023FD00829A0: 1261 tokens, checkpoints: 2, 58.389 MiB
srv update: - prompt 0000023FD0091A00: 1577 tokens, checkpoints: 1, 43.030 MiB
srv update: - prompt 0000023F7E06FD60: 1417 tokens, checkpoints: 1, 52.151 MiB
srv update: - prompt 0000023FD00806B0: 4247 tokens, checkpoints: 1, 118.957 MiB
srv update: - prompt 0000023FD08B5B90: 2733 tokens, checkpoints: 3, 111.149 MiB
srv get_availabl: prompt cache update took 995.10 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 2817 | processing task
slot update_slots: id 3 | task 2817 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 870
slot update_slots: id 3 | task 2817 | n_past = 333, slot.prompt.tokens.size() = 1266, seq_id = 3, pos_min = 1137, n_swa = 128
slot update_slots: id 3 | task 2817 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 3 | task 2817 | erased invalidated context checkpoint (pos_min = 299, pos_max = 940, n_swa = 128, size = 15.055 MiB)
slot update_slots: id 3 | task 2817 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 2817 | prompt processing progress, n_tokens = 806, batch.n_tokens = 809, progress = 0.926437
slot update_slots: id 3 | task 2817 | n_tokens = 806, memory_seq_rm [806, end)
slot update_slots: id 3 | task 2817 | prompt processing progress, n_tokens = 870, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 3 | task 2817 | prompt done, n_tokens = 870, batch.n_tokens = 67
slot update_slots: id 3 | task 2817 | created context checkpoint 1 of 8 (pos_min = 166, pos_max = 805, size = 15.008 MiB)
slot print_timing: id 1 | task 2814 |
prompt eval time = 26832.53 ms / 1010 tokens ( 26.57 ms per token, 37.64 tokens per second)
eval time = 360577.32 ms / 410 tokens ( 879.46 ms per token, 1.14 tokens per second)
total time = 387409.84 ms / 1420 tokens
slot release: id 1 | task 2814 | stop processing: n_tokens = 1419, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 1 | task -1 | selected slot by LCP similarity, sim_best = 0.284 (> 0.100 thold), f_keep = 0.233
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1419, total state size = 37.941 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.233, sim = 0.284
srv update: - cache state: 10 prompts, 719.798 MiB (limits: 8192.000 MiB, 131072 tokens, 244542 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv update: - prompt 0000023FD00829A0: 1261 tokens, checkpoints: 2, 58.389 MiB
srv update: - prompt 0000023FD0091A00: 1577 tokens, checkpoints: 1, 43.030 MiB
srv update: - prompt 0000023F7E06FD60: 1417 tokens, checkpoints: 1, 52.151 MiB
srv update: - prompt 0000023FD00806B0: 4247 tokens, checkpoints: 1, 118.957 MiB
srv update: - prompt 0000023FD08B5B90: 2733 tokens, checkpoints: 3, 111.149 MiB
srv update: - prompt 0000023FD02D51A0: 1419 tokens, checkpoints: 1, 53.019 MiB
srv get_availabl: prompt cache update took 693.50 ms
slot launch_slot_: id 1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 1 | task 2818 | processing task
slot update_slots: id 1 | task 2818 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1167
slot update_slots: id 1 | task 2818 | n_past = 331, slot.prompt.tokens.size() = 1419, seq_id = 1, pos_min = 1220, n_swa = 128
slot update_slots: id 1 | task 2818 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 1 | task 2818 | erased invalidated context checkpoint (pos_min = 303, pos_max = 945, n_swa = 128, size = 15.078 MiB)
slot update_slots: id 1 | task 2818 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 1 | task 2818 | prompt processing progress, n_tokens = 1103, batch.n_tokens = 1106, progress = 0.945159
slot update_slots: id 1 | task 2818 | n_tokens = 1103, memory_seq_rm [1103, end)
slot update_slots: id 1 | task 2818 | prompt processing progress, n_tokens = 1167, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 1 | task 2818 | prompt done, n_tokens = 1167, batch.n_tokens = 67
slot update_slots: id 1 | task 2818 | created context checkpoint 1 of 8 (pos_min = 460, pos_max = 1102, size = 15.078 MiB)
slot print_timing: id 0 | task 2815 |
prompt eval time = 39020.95 ms / 1441 tokens ( 27.08 ms per token, 36.93 tokens per second)
eval time = 420117.91 ms / 546 tokens ( 769.45 ms per token, 1.30 tokens per second)
total time = 459138.86 ms / 1987 tokens
slot release: id 0 | task 2815 | stop processing: n_tokens = 1986, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 1778651419
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1986, total state size = 52.339 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.167, sim = 0.058
srv update: - cache state: 11 prompts, 787.215 MiB (limits: 8192.000 MiB, 131072 tokens, 244267 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv update: - prompt 0000023FD00829A0: 1261 tokens, checkpoints: 2, 58.389 MiB
srv update: - prompt 0000023FD0091A00: 1577 tokens, checkpoints: 1, 43.030 MiB
srv update: - prompt 0000023F7E06FD60: 1417 tokens, checkpoints: 1, 52.151 MiB
srv update: - prompt 0000023FD00806B0: 4247 tokens, checkpoints: 1, 118.957 MiB
srv update: - prompt 0000023FD08B5B90: 2733 tokens, checkpoints: 3, 111.149 MiB
srv update: - prompt 0000023FD02D51A0: 1419 tokens, checkpoints: 1, 53.019 MiB
srv update: - prompt 0000023F7E1CBE30: 1986 tokens, checkpoints: 1, 67.417 MiB
srv get_availabl: prompt cache update took 904.06 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 2819 | processing task
slot update_slots: id 0 | task 2819 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 5752
slot update_slots: id 0 | task 2819 | n_past = 331, slot.prompt.tokens.size() = 1986, seq_id = 0, pos_min = 1740, n_swa = 128
slot update_slots: id 0 | task 2819 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 2819 | erased invalidated context checkpoint (pos_min = 734, pos_max = 1376, n_swa = 128, size = 15.078 MiB)
slot update_slots: id 0 | task 2819 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 2819 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.355529
slot update_slots: id 0 | task 2819 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id 0 | task 2819 | prompt processing progress, n_tokens = 4090, batch.n_tokens = 2048, progress = 0.711057
slot update_slots: id 0 | task 2819 | n_tokens = 4090, memory_seq_rm [4090, end)
slot update_slots: id 0 | task 2819 | prompt processing progress, n_tokens = 5688, batch.n_tokens = 1601, progress = 0.988873
slot update_slots: id 0 | task 2819 | n_tokens = 5688, memory_seq_rm [5688, end)
slot update_slots: id 0 | task 2819 | prompt processing progress, n_tokens = 5752, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 0 | task 2819 | prompt done, n_tokens = 5752, batch.n_tokens = 67
slot update_slots: id 0 | task 2819 | created context checkpoint 1 of 8 (pos_min = 5045, pos_max = 5687, size = 15.078 MiB)
slot print_timing: id 3 | task 2817 |
prompt eval time = 25345.06 ms / 870 tokens ( 29.13 ms per token, 34.33 tokens per second)
eval time = 430894.74 ms / 468 tokens ( 920.72 ms per token, 1.09 tokens per second)
total time = 456239.80 ms / 1338 tokens
slot release: id 3 | task 2817 | stop processing: n_tokens = 1337, truncated = 0
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.102 (> 0.100 thold), f_keep = 0.248
srv get_availabl: updating prompt cache
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv prompt_save: - saving prompt with length 1337, total state size = 35.831 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.248, sim = 0.102
srv update: - cache state: 12 prompts, 838.053 MiB (limits: 8192.000 MiB, 131072 tokens, 242518 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv update: - prompt 0000023FD00829A0: 1261 tokens, checkpoints: 2, 58.389 MiB
srv update: - prompt 0000023FD0091A00: 1577 tokens, checkpoints: 1, 43.030 MiB
srv update: - prompt 0000023F7E06FD60: 1417 tokens, checkpoints: 1, 52.151 MiB
srv update: - prompt 0000023FD00806B0: 4247 tokens, checkpoints: 1, 118.957 MiB
srv update: - prompt 0000023FD08B5B90: 2733 tokens, checkpoints: 3, 111.149 MiB
srv update: - prompt 0000023FD02D51A0: 1419 tokens, checkpoints: 1, 53.019 MiB
srv update: - prompt 0000023F7E1CBE30: 1986 tokens, checkpoints: 1, 67.417 MiB
srv update: - prompt 0000023F7DCAE6E0: 1337 tokens, checkpoints: 1, 50.838 MiB
srv get_availabl: prompt cache update took 818.14 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 2820 | processing task
slot update_slots: id 3 | task 2820 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 3255
slot update_slots: id 3 | task 2820 | n_past = 331, slot.prompt.tokens.size() = 1337, seq_id = 3, pos_min = 1146, n_swa = 128
state_read_meta: failed to find available cells in kv cache
state_seq_set_data: error loading state: failed to restore kv cache
slot update_slots: id 3 | task 2820 | failed to restore context checkpoint (pos_min = 166, pos_max = 805, size = 15.008 MiB)
slot update_slots: id 3 | task 2820 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 3 | task 2820 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 2820 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.628264
slot update_slots: id 3 | task 2820 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id 3 | task 2820 | prompt processing progress, n_tokens = 3191, batch.n_tokens = 1149, progress = 0.980338
slot update_slots: id 3 | task 2820 | n_tokens = 3191, memory_seq_rm [3191, end)
slot update_slots: id 3 | task 2820 | prompt processing progress, n_tokens = 3255, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 3 | task 2820 | prompt done, n_tokens = 3255, batch.n_tokens = 67
slot update_slots: id 3 | task 2820 | created context checkpoint 2 of 8 (pos_min = 2548, pos_max = 3190, size = 15.078 MiB)
slot print_timing: id 1 | task 2818 |
prompt eval time = 33308.17 ms / 1167 tokens ( 28.54 ms per token, 35.04 tokens per second)
eval time = 422492.45 ms / 351 tokens ( 1203.68 ms per token, 0.83 tokens per second)
total time = 455800.62 ms / 1518 tokens
slot release: id 1 | task 2818 | stop processing: n_tokens = 1517, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 1 | task -1 | selected slot by LRU, t_last = 2076151011
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 1517, total state size = 38.832 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.218, sim = 0.052
srv update: - cache state: 13 prompts, 891.963 MiB (limits: 8192.000 MiB, 131072 tokens, 241793 est)
srv update: - prompt 0000023F7DA5B150: 2436 tokens, checkpoints: 1, 70.582 MiB
srv update: - prompt 0000023F7E263940: 3972 tokens, checkpoints: 1, 135.207 MiB
srv update: - prompt 0000023FD0091AE0: 1139 tokens, checkpoints: 1, 38.082 MiB
srv update: - prompt 0000023F7DEC66D0: 1286 tokens, checkpoints: 1, 39.231 MiB
srv update: - prompt 0000023FD00829A0: 1261 tokens, checkpoints: 2, 58.389 MiB
srv update: - prompt 0000023FD0091A00: 1577 tokens, checkpoints: 1, 43.030 MiB
srv update: - prompt 0000023F7E06FD60: 1417 tokens, checkpoints: 1, 52.151 MiB
srv update: - prompt 0000023FD00806B0: 4247 tokens, checkpoints: 1, 118.957 MiB
srv update: - prompt 0000023FD08B5B90: 2733 tokens, checkpoints: 3, 111.149 MiB
srv update: - prompt 0000023FD02D51A0: 1419 tokens, checkpoints: 1, 53.019 MiB
srv update: - prompt 0000023F7E1CBE30: 1986 tokens, checkpoints: 1, 67.417 MiB
srv update: - prompt 0000023F7DCAE6E0: 1337 tokens, checkpoints: 1, 50.838 MiB
srv update: - prompt 0000023FD0B76580: 1517 tokens, checkpoints: 1, 53.910 MiB
srv get_availabl: prompt cache update took 728.26 ms
slot launch_slot_: id 1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 1 | task 2821 | processing task
slot update_slots: id 1 | task 2821 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 6325
slot update_slots: id 1 | task 2821 | n_past = 331, slot.prompt.tokens.size() = 1517, seq_id = 1, pos_min = 1378, n_swa = 128
slot update_slots: id 1 | task 2821 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 1 | task 2821 | erased invalidated context checkpoint (pos_min = 460, pos_max = 1102, n_swa = 128, size = 15.078 MiB)
slot update_slots: id 1 | task 2821 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 1 | task 2821 | prompt processing progress, n_tokens = 2045, batch.n_tokens = 2048, progress = 0.323320
slot update_slots: id 1 | task 2821 | n_tokens = 2045, memory_seq_rm [2045, end)
slot update_slots: id 1 | task 2821 | prompt processing progress, n_tokens = 4090, batch.n_tokens = 2048, progress = 0.646640
slot update_slots: id 1 | task 2821 | n_tokens = 4090, memory_seq_rm [4090, end)
slot update_slots: id 1 | task 2821 | prompt processing progress, n_tokens = 6135, batch.n_tokens = 2048, progress = 0.969960
slot update_slots: id 1 | task 2821 | n_tokens = 6135, memory_seq_rm [6135, end)
slot update_slots: id 1 | task 2821 | prompt processing progress, n_tokens = 6261, batch.n_tokens = 129, progress = 0.989881
slot update_slots: id 1 | task 2821 | n_tokens = 6261, memory_seq_rm [6261, end)
slot update_slots: id 1 | task 2821 | prompt processing progress, n_tokens = 6325, batch.n_tokens = 67, progress = 1.000000
slot update_slots: id 1 | task 2821 | prompt done, n_tokens = 6325, batch.n_tokens = 67
slot update_slots: id 1 | task 2821 | created context checkpoint 1 of 8 (pos_min = 5621, pos_max = 6260, size = 15.008 MiB)
slot print_timing: id 2 | task 2816 |
prompt eval time = 80068.15 ms / 3072 tokens ( 26.06 ms per token, 38.37 tokens per second)
eval time = 866753.46 ms / 755 tokens ( 1148.02 ms per token, 0.87 tokens per second)
total time = 946821.62 ms / 3827 tokens
slot release: id 2 | task 2816 | stop processing: n_tokens = 3826, truncated = 0
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.371 (> 0.100 thold), f_keep = 0.087
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 3826, total state size = 93.632 MiB
srv params_from_: Chat format: GPT-OSS
srv load: - looking for better prompt, base f_keep = 0.087, sim = 0.371
srv load: - found better prompt with f_keep = 0.264, sim = 0.420
state_read_meta: failed to find available cells in kv cache
state_seq_set_data: error loading state: failed to restore kv cache
srv load: failed to restore state with size 39783992
D:/a/llama.cpp/llama.cpp/tools/server/server.cpp:3843: pos_min == -1, but n_past > 0 - should not happen: https://github.com/ggml-org/llama.cpp/pull/13833#discussion_r2116181237
slot prompt_load: id 2 | task -1 | failed to load prompt from cache
srv update: - cache state: 14 prompts, 1000.673 MiB (limits: 8192.000 MiB, 131072 tokens, 246847 est)