[Shortfin LLM Server] Server continues generating after client cancels request #1111

Open · stbaione opened this issue Mar 18, 2025 · 0 comments
Labels: bug (Something isn't working)

Description

When a request is submitted to our server, the connection remains open until generation is complete. However, if the client kills the request, the server does not interrupt generation; it continues to finish the request internally. This wastes throughput and compute: once a request is cancelled, there is no point in finishing the generation.

We should gracefully kill the GenerateItemProcess of a given request if that request's client closes the connection before generation is complete; a rough sketch of the shape of the fix follows.
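As a minimal sketch of the idea, not shortfin's actual plumbing: assuming a FastAPI-style endpoint, the handler can run generation as a background task, poll Starlette's request.is_disconnected(), and cancel the task when the client goes away. run_generation below is a hypothetical stand-in for the per-request GenerateItemProcess.

import asyncio

from fastapi import FastAPI, Request, Response
from fastapi.responses import JSONResponse

app = FastAPI()

async def run_generation(payload: dict) -> dict:
    # Hypothetical stand-in for the per-request GenerateItemProcess:
    # pretend each decode step takes 100 ms and emits one fake token.
    max_tokens = payload.get("sampling_params", {}).get("max_completion_tokens", 50)
    tokens = []
    for i in range(max_tokens):
        await asyncio.sleep(0.1)
        tokens.append(f"tok{i}")
    return {"text": " ".join(tokens)}

@app.post("/generate")
async def generate(request: Request):
    gen_task = asyncio.create_task(run_generation(await request.json()))
    # Watch for a client disconnect while generation runs in the background.
    while not gen_task.done():
        if await request.is_disconnected():
            # CancelledError is raised inside run_generation at its next
            # await, so no further decode steps are issued.
            gen_task.cancel()
            return Response(status_code=204)
        await asyncio.sleep(0.1)
    return JSONResponse(content=gen_task.result())

Polling is the simplest shape; an alternative is racing the generation task against a disconnect-watcher task with asyncio.wait and cancelling whichever loses.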

Repro

  1. Set SHORTFIN_APPS_LOG_LEVEL=DEBUG to view the log output as the server generates.
  2. Follow the llama_serving guide to spin up the shortfin server.
  3. Send a request, leaving yourself enough time to kill it before it completes:
curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "<|begin_of_text|>Generate 50 random characters.<|eot_id|>",
        "sampling_params": {"max_completion_tokens": 50}
    }'
  4. Ctrl+C the curl request to close the connection from the client side (a scriptable equivalent is sketched after the log output below).
  5. Observe that the server continues to generate tokens, e.g.:
[2025-03-18 15:14:56.558] [info] [batcher.py:254] INVOKE ProgramFunction(decode_bs4$async: 0rrrrrrr_r): 
  0: [4, 1]
  1: [4]
  2: [4]
  3: [4, 5]
  4: [512, 2097152]
DEBUG:shortfin_apps.llm.components.batcher:Prefill bs=4, bsl=160
INFO:shortfin_apps.llm.components.batcher:INVOKE ProgramFunction(decode_bs4$async: 0rrrrrrr_r): 
  0: [4, 1]
  1: [4]
  2: [4]
  3: [4, 5]
  4: [512, 2097152]
[2025-03-18 15:14:56.559] [info] [batcher.py:254] INVOKE ProgramFunction(decode_bs4$async: 0rrrrrrr_r): 
  0: [4, 1]
  1: [4]
  2: [4]
  3: [4, 5]
  4: [512, 2097152]
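
For a scriptable version of steps 3 and 4, a hypothetical client like the following reproduces the disconnect by abandoning the request after a short read timeout (server URL and payload as in the curl example above):

import requests

payload = {
    "text": "<|begin_of_text|>Generate 50 random characters.<|eot_id|>",
    "sampling_params": {"max_completion_tokens": 50},
}

try:
    # (connect timeout, read timeout): the 1 s read timeout drops the
    # connection mid-generation, just like Ctrl+C on curl.
    requests.post("http://localhost:8000/generate", json=payload, timeout=(5, 1))
except requests.exceptions.ReadTimeout:
    print("Client gave up; the server logs keep emitting INVOKE lines.")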
stbaione added the bug (Something isn't working) label on Mar 18, 2025