[Bug]: Output includes tokens from other prompts when host kv cache enabled #8813

@pathorn

Description

System Info

  • CPU architecture: x86_64
  • CPU/Host memory size: 2 TB
  • GPU properties: NVIDIA H100 and NVIDIA B200
  • Docker image: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1
    (also tried commit 2aade46)
  • NVIDIA driver version:

H100 System:
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |

B200 System:
| NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |

  • OS: Ubuntu 22.04 (H100) and Ubuntu 24.04 (B200)

Who can help?

@eopXD

This is a follow-up to issue #8274.
This issue was easier to reproduce prior to nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1, but it still happens after the fix.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. docker run -it --rm --shm-size=250g --gpus "device=4" -p 8000:8000 --entrypoint /bin/bash -v /data:/data nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1
  2. Use the following extra_llm.yaml:
enable_chunked_prefill: true
cuda_graph_config:
    max_batch_size: 60
enable_attention_dp: false
kv_cache_config:
    free_gpu_memory_fraction: 0.85
    host_cache_size: 2000000000
  3. trtllm-serve nvidia/Meta-Llama-3.1-8B-Instruct-FP8 --tp_size=1 --backend=pytorch --host=0.0.0.0 --port=8000 --max_seq_len=131072 --max_batch_size=60 --max_num_tokens=2048 --extra_llm_api_options extra_llm.yaml
  4. Open two terminal windows (or use screen). In the first window, run:
    pkill -f curl
    for i in {0..10000}; do ( curl -Z -s http://localhost:8000/v1/chat/completions'?[0-5]' -H "Content-Type: application/json" -d '{"model": "none", "messages": [{"role":"system","content":"Be a helpful assistant #'$RANDOM'. Respond concisely in one sentence."}, {"role": "user","content": "The quick brown fox jumps"}],"reasoning_effort":"high","response_format":{"type":"text"},"temperature":0,"top_k":1,"max_tokens":100,"n":1}' | tr } '\n' | grep --color=never -o 'content":".*' & ); sleep 0.51; done
    This spawns a stream of curl processes. Make sure to run pkill -f curl between tests, as some processes get stuck.
  5. In the second terminal window, use the following Python script and txt file (tested with pip install --user httpx==0.28.1); a hypothetical sketch of such a client appears after these steps.

client_bench.py
war_and_peace_excerpt.txt

Run python client_bench.py. It prints many httpx.ReadError errors; the exceptions are ignored, and this is normal. The run produces many cancelled requests, which is fine and may actually help reproduce the error.
6. Keep both commands running for a while. It may take 5-10 minutes to reproduce the error. Since version 1.2.0rc1, the error happens far less frequently, and usually only a few bad responses show up. When done, make sure to run pkill -f curl.
7. Test again with host_cache_size: 0. There will be no bad responses, no matter how many times the test is run.
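
For reference, the attached client_bench.py is not inlined in this issue, so here is a minimal sketch of what a client along those lines might look like. Everything in it (the prompt slicing, timeout values, and concurrency) is an illustrative assumption, not the actual attachment; the point is long prompts from war_and_peace_excerpt.txt plus aggressive timeouts that cancel many requests mid-flight:

# Hypothetical sketch only -- NOT the attached client_bench.py.
# Assumes httpx==0.28.1 and a local war_and_peace_excerpt.txt.
import asyncio
import httpx

URL = "http://localhost:8000/v1/chat/completions"

async def one_request(client, prompt, timeout):
    body = {
        "model": "none",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 100,
    }
    try:
        # Tight timeouts abandon many requests mid-flight; the resulting
        # server-side cancellations churn the (host) KV cache.
        await client.post(URL, json=body, timeout=timeout)
    except (httpx.ReadError, httpx.TimeoutException):
        pass  # expected and ignored, exactly as described in step 5

async def main():
    text = open("war_and_peace_excerpt.txt").read()
    async with httpx.AsyncClient() as client:
        for i in range(10_000):
            # Vary the prompt length so requests share, and then evict,
            # each other's KV cache blocks.
            prompt = text[: 2_000 + (i % 50) * 1_000]
            await asyncio.gather(
                *(one_request(client, prompt, 0.5 + 0.1 * j) for j in range(8))
            )

if __name__ == "__main__":
    asyncio.run(main())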

Expected behavior

Each response should be a single sentence, something like this:

content":"Over the lazy dog, which is a well-known pangram sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"That's a well-known pangram, a sentence that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"That is a well-known pangram, a sentence that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The sentence \"The quick brown fox jumps\" is a well-known pangram, a phrase that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog, which is a well-known pangram sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog, which is a well-known pangram sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The sentence \"The quick brown fox jumps\" is a well-known pangram, a phrase that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]

There should be no repeated words, stray # characters, or runs of whitespace.

Actual behavior

The prompt in the above test case is "Be a helpful assistant #'$RANDOM'. Respond concisely in one sentence. ... The quick brown fox jumps", so seeing these words verbatim in the output is unexpected.

Most responses are fine. Bad responses consist mostly or entirely of words from the prompt and chat template, in some combination like the following:

content":"#     assistant#   assistant# assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#             #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#        ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#       ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#       ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#  #   quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#  #   quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#  #   quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#  #   quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"assistant #quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#2661.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#2661.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#26601.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#26601.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#26601.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#2661. Respond concisely in one sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#           #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#      ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#       ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox","reasoning_content":null,"reasoning":null,"tool_calls":[]

Note the assistant numbers above in particular: the test picks a random number ("Be a helpful assistant #12345") and performs 5 requests per random number. If 6 responses include the same assistant number, that is proof that prompts are being mixed up across multiple requests, which is a security issue.

In general, when the issue occurs, the output contains a subset of the prompt tokens, so running a string comparison against the prompt might be a way to automate detecting this bug; a rough sketch follows.
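
One possible shape for such a checker (hypothetical, not part of the attached scripts; the 0.6 threshold and the regex below are assumptions):

import re
from collections import Counter

REQUESTS_PER_NUMBER = 5  # the curl loop above sends 5 requests per $RANDOM

def leaked_assistant_numbers(responses):
    """responses: list of (number_sent, response_text) pairs. Returns any
    assistant number seen in more responses than the requests that carried
    it -- which can only happen if prompts leak across requests."""
    seen = Counter()
    for _, text in responses:
        for num in set(re.findall(r"#(\d+)", text)):
            seen[num] += 1
    return {n: c for n, c in seen.items() if c > REQUESTS_PER_NUMBER}

def looks_like_prompt_echo(prompt, response, threshold=0.6):
    # Flag a response whose words are mostly verbatim substrings of the prompt.
    words = response.split()
    return bool(words) and sum(w in prompt for w in words) / len(words) >= threshold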

Additional notes

This has been very difficult to reproduce, but it has been an issue for a long time: since at least August, and probably longer. The issue is also confirmed in 1.1.0.x and subsequent releases.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

KV-Cache Management (kv-cache management for efficient LLM inference), bug (Something isn't working)
