Description
System Info
- CPU architecture (e.g., x86_64, aarch64): x86_64
- CPU/Host memory size (if known): 2T
- GPU properties: NVIDIA H100 and NVIDIA B200
- Docker image: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1 (also tried 2aade46)
- NVIDIA driver version:
  - H100 system: NVIDIA-SMI 535.154.05, Driver Version 535.154.05, CUDA Version 12.2
  - B200 system: NVIDIA-SMI 570.172.08, Driver Version 570.172.08, CUDA Version 12.8
- OS (Ubuntu 24.04, CentOS 8): Ubuntu 22.04 (H100) and 24.04 (B200)
Who can help?
This is a follow-up to issue #8274.
This issue was easier to reproduce prior to nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1, but it still happens after the fix.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
1. docker run -it --rm --shm-size=250g --gpus "device=4" -p 8000:8000 --entrypoint /bin/bash -v /data:/data nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1
2. Use the following extra_llm.yaml:
enable_chunked_prefill: true
cuda_graph_config:
  max_batch_size: 60
enable_attention_dp: false
kv_cache_config:
  free_gpu_memory_fraction: 0.85
  host_cache_size: 2000000000
3. Start the server:
trtllm-serve nvidia/Meta-Llama-3.1-8B-Instruct-FP8 --tp_size=1 --backend=pytorch --host=0.0.0.0 --port=8000 --max_seq_len=131072 --max_batch_size=60 --max_num_tokens=2048 --extra_llm_api_options extra_llm.yaml
4. Open two terminal windows or a screen session. In the first window, run:
pkill -f curl
for i in {0..10000}; do ( curl -Z -s http://localhost:8000/v1/chat/completions'?[0-5]' -H "Content-Type: application/json" -d '{"model": "none", "messages": [{"role":"system","content":"Be a helpful assistant #'$RANDOM'. Respond concisely in one sentence."}, {"role": "user","content": "The quick brown fox jumps"}],"reasoning_effort":"high","response_format":{"type":"text"},"temperature":0,"top_k":1,"max_tokens":100,"n":1}' | tr } '\n' | grep --color=never -o 'content":".*' & ); sleep 0.51; done
This will spawn many curl processes. Make sure to pkill -f curl between tests, as some of them get stuck.
5. In the second terminal window, use the following Python script and text file (tested with pip install --user httpx==0.28.1):
client_bench.py
war_and_peace_excerpt.txt
Run python client_bench.py - it prints lots of httpx.ReadError errors. The exceptions are ignored, and this is normal. The script creates many cancelled requests, which is fine and may help reproduce the error. (A rough sketch of this kind of client is included after this list.)
6. Keep both commands running for a while. It may take 5-10 minutes to reproduce the error. Since version 1.2.0rc1, the error happens far less frequently and usually only a few bad responses will show up. When done, make sure to pkill -f curl
7. Test again with host_cache_size: 0. There will be no bad responses, no matter how many times the test is run.
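The attached client_bench.py is not inlined above, so here is a rough sketch of the kind of client step 5 describes: concurrent chat-completion requests padded with text from war_and_peace_excerpt.txt, with a short read timeout so that many requests are cancelled and raise httpx.ReadError. This is an assumption-laden illustration, not the actual benchmark script; the endpoint matches the server above, but the timeout, concurrency, and payload details are guesses.

# Rough sketch only - NOT the attached client_bench.py.
# Sends concurrent long-prompt requests with a short read timeout so that many
# of them are cancelled mid-generation and raise httpx.ReadError.
import asyncio
from pathlib import Path

import httpx

URL = "http://localhost:8000/v1/chat/completions"  # server started in step 3
EXCERPT = Path("war_and_peace_excerpt.txt").read_text(encoding="utf-8")

async def one_request(client: httpx.AsyncClient, i: int) -> None:
    payload = {
        "model": "none",
        "messages": [
            {"role": "system", "content": f"Be a helpful assistant #{i}. Respond concisely in one sentence."},
            {"role": "user", "content": EXCERPT},
        ],
        "temperature": 0,
        "max_tokens": 100,
    }
    try:
        r = await client.post(URL, json=payload)
        print(r.json()["choices"][0]["message"]["content"])
    except (httpx.ReadError, httpx.ReadTimeout) as exc:
        # Cancelled/aborted requests are expected and ignored, as noted in step 5.
        print(f"request {i}: {type(exc).__name__} (ignored)")

async def main() -> None:
    timeout = httpx.Timeout(10.0, read=2.0)  # short read timeout forces cancellations
    async with httpx.AsyncClient(timeout=timeout) as client:
        for batch in range(1000):
            # 8 concurrent requests per batch; adjust to taste
            await asyncio.gather(*(one_request(client, batch) for _ in range(8)))

if __name__ == "__main__":
    asyncio.run(main())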
Expected behavior
The responses should all be a sentence, something like this:
content":"Over the lazy dog, which is a well-known pangram sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"That's a well-known pangram, a sentence that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"That is a well-known pangram, a sentence that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The sentence \"The quick brown fox jumps\" is a well-known pangram, a phrase that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog, which is a well-known pangram sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"Over the lazy dog, which is a well-known pangram sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The sentence \"The quick brown fox jumps\" is a well-known pangram, a phrase that uses all the letters of the alphabet at least once.","reasoning_content":null,"reasoning":null,"tool_calls":[]
There should be no repeated words, no '#' characters, and no runs of repeated whitespace.
actual behavior
The prompt in the above test case is "Be a helpful assistant #'$RANDOM'. Respond concisely in one sentence. ... The quick brown fox jumps" - so seeing these words verbatim in the output is unexpected.
Most responses are good. Bad responses consist mostly or entirely of words from the prompt and chat template, in some combination like the following:
content":"# assistant# assistant# assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#assistant#","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# # quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# # quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# # quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# # quickbrownfoxjumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"assistant #quick brown fox jumps","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#2661.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#2661.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#26601.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#26601.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#26601.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"#2661. Respond concisely in one sentence.","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox jumps\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"# ","reasoning_content":null,"reasoning":null,"tool_calls":[]
content":"The quick brown fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox fox","reasoning_content":null,"reasoning":null,"tool_calls":[]
Note in particular the responses above that contain assistant numbers: my script puts a random number in the system prompt ("assistant #12345") and performs 5 requests per random number. If there are 6 responses that include the same assistant number, this is proof that prompts are being mixed up across multiple requests, which is a security issue.
In general, when the issue happens the output is made up of a subset of the prompt tokens, so running a string comparison against the prompt might be a way to automate detecting this bug (a rough sketch of such a check follows).
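A rough illustration of that kind of automated check (not part of the original test scripts; the threshold, the regex, and the 5-requests-per-number budget are taken from the description above or simply assumed):

# Hypothetical post-hoc checker for the two symptoms described above:
# (1) responses made up mostly of prompt words, '#' runs, or whitespace, and
# (2) an assistant number that shows up in more responses than requests sent with it.
import re
from collections import Counter

PROMPT_WORDS = {
    "be", "a", "helpful", "assistant", "respond", "concisely", "in", "one",
    "sentence", "the", "quick", "brown", "fox", "jumps",
}
REQUESTS_PER_NUMBER = 5  # per the description: 5 requests are sent per random assistant number

def looks_like_prompt_echo(content: str, threshold: float = 0.8) -> bool:
    """Flag a response whose words are overwhelmingly drawn from the prompt."""
    words = [w.strip('#.,"\'').lower() for w in content.split()]
    if not words:
        return True  # empty or whitespace-only output is also a bad response
    echoed = sum(1 for w in words if not w or w in PROMPT_WORDS or w.isdigit())
    return echoed / len(words) >= threshold

def leaked_assistant_numbers(contents: list[str]) -> list[str]:
    """Return assistant numbers seen in more responses than were requested with them."""
    hits: Counter[str] = Counter()
    for content in contents:
        for num in set(re.findall(r"#(\d+)", content)):
            hits[num] += 1
    return [num for num, count in hits.items() if count > REQUESTS_PER_NUMBER]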
additional notes
This has been very difficult to reproduce, but it has been an issue for a long time - since at least August, probably longer. The issue is also confirmed in the 1.1.0x releases.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.