[Bug]: Output includes tokens from other prompts when host kv cache enabled #8813

@pathorn

Description

System Info

  • CPU architecture: x86_64
  • CPU/Host memory size: 2 TB
  • GPU properties: NVIDIA H100 and NVIDIA B200
  • Docker image: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1
    (also tried commit 2aade46)
  • NVIDIA driver version:

H100 System:
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |

B200 System:
| NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |

  • OS: Ubuntu 22.04 (H100) and Ubuntu 24.04 (B200)

Who can help?

@eopXD

This is a follow-up to issue #8274.
This issue was easier to reproduce prior to nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1, but it still happens after the fix.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. docker run -it --rm --shm-size=250g --gpus "device=4" -p 8000:8000 --entrypoint /bin/bash -v /data:/data nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1
  2. Use the following extra_llm.yaml:
enable_chunked_prefill: true
cuda_graph_config:
    max_batch_size: 60
enable_attention_dp: false
kv_cache_config:
    free_gpu_memory_fraction: 0.85
    host_cache_size: 2000000000
  3. trtllm-serve nvidia/Meta-Llama-3.1-8B-Instruct-FP8 --tp_size=1 --backend=pytorch --host=0.0.0.0 --port=8000 --max_seq_len=131072 --max_batch_size=60 --max_num_tokens=2048 --extra_llm_api_options extra_llm.yaml
  4. Open two terminal windows (or use screen). In the first window, run:
    pkill -f curl
    for i in {0..10000}; do ( curl -Z -s http://localhost:8000/v1/chat/completions'?[0-5]' -H "Content-Type: application/json" -d '{"model": "none", "messages": [{"role":"system","content":"Be a helpful assistant #'$RANDOM'. Respond concisely in one sentence."}, {"role": "user","content": "The quick brown fox jumps"}],"reasoning_effort":"high","response_format":{"type":"text"},"temperature":0,"top_k":1,"max_tokens":100,"n":1}' | tr } '\n' | grep --color=never -o 'content":".*' & ); sleep 0.51; done
    This spawns a stream of curl processes. Make sure to run pkill -f curl between tests, as some processes get stuck.
  5. In the second terminal window, use the following Python script and txt file (tested with pip install --user httpx==0.28.1); a hypothetical sketch of such a client appears after these steps.

client_bench.py
war_and_peace_excerpt.txt

Run python client_bench.py. It prints many httpx.ReadError errors; the exceptions are ignored, and this is normal. The run produces many cancelled requests, which is fine and may actually help reproduce the error.
6. Keep both commands running for a while. It may take 5-10 minutes to reproduce the error. Since version 1.2.0rc1, the error happens far less frequently, and usually only a few bad responses show up. When done, make sure to run pkill -f curl.
7. Test again with host_cache_size: 0. There will be no bad responses, no matter how many times the test is run.
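
For reference, the attached client_bench.py is not inlined in this issue, so here is a minimal sketch of what a client along those lines might look like. Everything in it (the prompt slicing, timeout values, and concurrency) is an illustrative assumption, not the actual attachment; the point is long prompts from war_and_peace_excerpt.txt plus aggressive timeouts that cancel many requests mid-flight:

# Hypothetical sketch only -- NOT the attached client_bench.py.
# Assumes httpx==0.28.1 and a local war_and_peace_excerpt.txt.
import asyncio
import httpx

URL = "http://localhost:8000/v1/chat/completions"

async def one_request(client, prompt, timeout):
    body = {
        "model": "none",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 100,
    }
    try:
        # Tight timeouts abandon many requests mid-flight; the resulting
        # server-side cancellations churn the (host) KV cache.
        await client.post(URL, json=body, timeout=timeout)
    except (httpx.ReadError, httpx.TimeoutException):
        pass  # expected and ignored, exactly as described in step 5

async def main():
    text = open("war_and_peace_excerpt.txt").read()
    async with httpx.AsyncClient() as client:
        for i in range(10_000):
            # Vary the prompt length so requests share, and then evict,
            # each other's KV cache blocks.
            prompt = text[: 2_000 + (i % 50) * 1_000]
            await asyncio.gather(
                *(one_request(client, prompt, 0.5 + 0.1 * j) for j in range(8))
            )

if __name__ == "__main__":
    asyncio.run(main())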

Expected behavior

Each response should be a single sentence, something like this:

content":"Over the lazy dog, which is a well-known pangram sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"That's a well-known pangram, a sentence that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"That is a well-known pangram, a sentence that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The sentence \"The quick brown fox jumps\" is a well-known pangram, a phrase that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog, which is a well-known pangram sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog, which is a well-known pangram sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The sentence \"The quick brown fox jumps\" is a well-known pangram, a phrase that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]

There should be no repeated words, stray # characters, or runs of whitespace.

Actual behavior

The prompt in the above test case is "Be a helpful assistant #'$RANDOM'. Respond concisely in one sentence. ... The quick brown fox jumps", so seeing these words verbatim in the output is unexpected.

Most responses are fine. Bad responses consist mostly or entirely of words from the prompt and chat template, in some combination like the following:

content":"#     assistant#   assistant# assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#             #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#        ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#       ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#       ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#  #   quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#  #   quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#  #   quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#  #   quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"assistant #quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#2661.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#2661.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#26601.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#26601.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#26601.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#2661. Respond concisely in one sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#           #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#      ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#       ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox","reasoning_content":null,"reasoning":null,"tool_calls":[]

Note the assistant numbers above in particular: the test picks a random number ("Be a helpful assistant #12345") and performs 5 requests per random number. If 6 responses include the same assistant number, that is proof that prompts are being mixed up across multiple requests, which is a security issue.

In general, when the issue occurs, the output contains a subset of the prompt tokens, so running a string comparison against the prompt might be a way to automate detecting this bug; a rough sketch follows.
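
One possible shape for such a checker (hypothetical, not part of the attached scripts; the 0.6 threshold and the regex below are assumptions):

import re
from collections import Counter

REQUESTS_PER_NUMBER = 5  # the curl loop above sends 5 requests per $RANDOM

def leaked_assistant_numbers(responses):
    """responses: list of (number_sent, response_text) pairs. Returns any
    assistant number seen in more responses than the requests that carried
    it -- which can only happen if prompts leak across requests."""
    seen = Counter()
    for _, text in responses:
        for num in set(re.findall(r"#(\d+)", text)):
            seen[num] += 1
    return {n: c for n, c in seen.items() if c > REQUESTS_PER_NUMBER}

def looks_like_prompt_echo(prompt, response, threshold=0.6):
    # Flag a response whose words are mostly verbatim substrings of the prompt.
    words = response.split()
    return bool(words) and sum(w in prompt for w in words) / len(words) >= threshold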

Additional notes

This has been very difficult to reproduce, but it has been an issue for a long time: since at least August, and probably longer. The issue is also confirmed in 1.1.0.x and subsequent releases.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

KV-Cache Management (kv-cache management for efficient LLM inference), bug (Something isn't working)
