Fix vLLM queue overflow with serialized semaphore release #316
base: main
Conversation
Hmm, what GPU were you testing with? On an H100, you can do 600 pages at a time and it's quite good. How much RAM is on your machine? Also, did you try with v0.3.3? Because I had fixed a root-cause issue of vLLM using too many worker threads.
I'm testing on a 4090 (24 GB of VRAM) on a system with 20 GB of DDR4. What I see happening (after adding some debug code) with pages_per_group=100 looks more or less like this: Yes, I did use v0.3.3.
Weird, the idea is that it should not submit more until the queue goes down a bit more. I feel like there is some simpler way to fix this, like don't unlock the semaphore until at least one page processes in the previous worker.
The only real advantage of my approach is smoother queue depth. With your approach the queue would jump between 50→550→50→550.
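The alternative floated above, admitting the next worker only after the previous one has processed at least one page, could be sketched with an `asyncio.Event` gate. This is a hypothetical illustration, not olmOCR's actual code; `worker`, `run_groups`, and `process_page` are invented names, and it assumes every group has at least one page (an empty group's event would never fire).

```python
import asyncio

async def worker(pages, process_page, first_page_done: asyncio.Event) -> None:
    """Process one group of pages; signal after the first page completes."""
    for i, page in enumerate(pages):
        await process_page(page)
        if i == 0:
            first_page_done.set()  # safe to admit the next worker now

async def run_groups(groups, process_page) -> None:
    """Start each group only after the previous group finished a page."""
    prev: asyncio.Event | None = None
    tasks = []
    for pages in groups:
        if prev is not None:
            await prev.wait()  # gate on the previous worker's progress
        prev = asyncio.Event()
        tasks.append(asyncio.create_task(worker(pages, process_page, prev)))
    await asyncio.gather(*tasks)
```

As the comment above notes, this trades the smoother queue depth of the lock-based approach for simplicity: submissions arrive in group-sized steps rather than being metered page by page.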
Force-pushed from b31376d to 0c74322
Multiple workers could acquire the semaphore in rapid succession when the queue dropped, causing bursts of 1000+ page submissions and vLLM crashes. The race condition was in the semaphore release logic: multiple threads could evaluate the conditions and release simultaneously before the queue depth updated. Add an asyncio.Lock() to serialize release checks, ensuring atomic evaluation and release. All condition checks now happen inside the lock.
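The race described in the commit message can be illustrated with a minimal sketch. This is not olmOCR's code: `MAX_QUEUE`, `queue_depth`, and `worker` are invented names, and the shared queue depth is deliberately frozen to mimic the window before vLLM reports new submissions. Without synchronization, every worker reads the same stale depth, decides a release is safe, and the semaphore is released many times at once.

```python
import asyncio

MAX_QUEUE = 500
queue_depth = 450  # shared state; in the real system it only updates
                   # after submissions land, so all workers see 450

async def worker(sem: asyncio.Semaphore, releases: list) -> None:
    # Unsynchronized check-then-release: each worker independently sees
    # queue_depth < MAX_QUEUE, so all of them release.
    if queue_depth < MAX_QUEUE:
        sem.release()
        releases.append(1)

async def main() -> int:
    sem = asyncio.Semaphore(0)
    releases: list = []
    await asyncio.gather(*(worker(sem, releases) for _ in range(10)))
    return len(releases)  # 10 releases instead of 1 → a burst of submissions
```

Ten workers releasing back-to-back is exactly the burst pattern that drove the queue past 600 pages.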
Force-pushed from 0c74322 to 0742014
Tested locally on my 4090.
Problem
When processing large batches of PDFs, multiple workers could acquire the semaphore in rapid succession, causing bursts of 1000+ page submissions that led to vLLM crashes (queue depth 600+, KV cache 97-99%).
Solution
Add asyncio.Lock() to serialize semaphore release checks. This ensures atomic evaluation of the conditions and prevents the race where multiple workers release simultaneously.
Changes
- Add release_lock to make release decisions atomic
Testing
Tested with 150 PDFs - queue stayed under 460 (previously crashed at 600+) and no double releases occurred.