Conversation

@lgibelli lgibelli commented Aug 19, 2025

Problem

When processing large batches of PDFs, multiple workers could acquire the semaphore in rapid succession, causing bursts of 1000+ page submissions that led to vLLM crashes (queue depth 600+, KV cache 97-99%).

Solution

Add an asyncio.Lock() to serialize the semaphore release checks. This makes the condition evaluation atomic and prevents the race where multiple workers start submitting simultaneously (see the sketch below).

Changes

  • Added release_lock to make release decisions atomic
  • All condition checks now happen inside the lock
  • Preserves original backpressure logic (10% of peak, 30s cooldown)
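
A minimal sketch of the pattern, assuming the class and attribute names below (they are illustrative, not the actual olmocr code):

```python
# Sketch only: a shared lock serializes the release decision so only one
# worker at a time evaluates the backpressure conditions and releases.
import asyncio
import time


class ReleaseGate:
    def __init__(self, max_concurrent_workers: int, cooldown_s: float = 30.0):
        self.semaphore = asyncio.Semaphore(max_concurrent_workers)
        self.release_lock = asyncio.Lock()  # serializes release checks
        self.cooldown_s = cooldown_s
        self.peak_queue_depth = 0
        self.last_release_ts = 0.0

    async def acquire(self) -> None:
        # Each worker must pass the gate before submitting its pages.
        await self.semaphore.acquire()

    async def maybe_release(self, current_queue_depth: int) -> None:
        # All condition checks happen inside the lock, so two workers can
        # never both see "queue is low enough" and release at the same time.
        async with self.release_lock:
            self.peak_queue_depth = max(self.peak_queue_depth, current_queue_depth)
            queue_low = current_queue_depth <= 0.1 * self.peak_queue_depth
            cooled_down = time.monotonic() - self.last_release_ts >= self.cooldown_s
            if queue_low and cooled_down:
                self.last_release_ts = time.monotonic()
                self.semaphore.release()
```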

Testing

Tested with 150 PDFs: the queue stayed under 460 (it previously crashed at 600+) and no double releases occurred.

jakep-allenai commented Aug 19, 2025

Hmm, what GPU were you testing with? On an H100, you can do 600 pages at a time and it's quite good. How much RAM is on your machine?
And did you try setting --pages_per_group to a smaller number?

Also, did you try with v0.3.3? I had fixed a root-cause issue there where vLLM used too many worker threads.

lgibelli commented Aug 19, 2025

I'm testing on a 4090 (24GB of VRAM) on a system with 20 GB of DDR4.
The reason lowering --pages_per_group doesn't help is that each worker still submits all of its pages at once, so the vLLM queue overflows anyway. On a 4090, vLLM processes maybe 10-20 pages per second; even with a limit of 100 pages per worker, the queue builds up faster than vLLM can drain it, and each queued request consumes GPU KV cache memory.

What I see happening (after adding some debug code) with pages_per_group=100 looks more or less like this:
Worker 1: Submits 100 pages → Queue: 100
Worker 2: Submits 100 pages → Queue: 200
Worker 3: Submits 100 pages → Queue: 300
Worker 4: Submits 100 pages → Queue: 400
Worker 5: Submits 100 pages → Queue: 500
vLLM processes 20 pages... → Queue: 480
Workers keep submitting... → Queue: 600+
→ crash

Yes, I did use v0.3.3.

@jakep-allenai

Weird, the idea is that it should not submit more until the queue goes down a bit more. I feel like there is some simpler way to fix this, like don't unlock the semaphore until at least one page processes in the previous worker.

@lgibelli

The only real advantage of my approach is smoother queue depth. With your approach the queue would jump between 50→550→50→550.
I will try your approach tomorrow and report back if it is stable on the 4090.
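
If I understand the suggestion correctly, a rough sketch of that alternative would look like this (the helper names here are hypothetical, not the actual olmocr code):

```python
# Sketch: hold the semaphore until the first page of the current group
# completes, so the next worker only starts once vLLM is actually draining.
import asyncio


async def process_group(semaphore: asyncio.Semaphore, pages, submit_page):
    await semaphore.acquire()
    released = False
    try:
        tasks = [asyncio.create_task(submit_page(page)) for page in pages]
        for finished in asyncio.as_completed(tasks):
            await finished
            if not released:
                # First page of this group is done: let the next worker in.
                semaphore.release()
                released = True
    finally:
        # Make sure the semaphore is released even if every page fails early.
        if not released:
            semaphore.release()
```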

@lgibelli lgibelli marked this pull request as draft August 27, 2025 09:04
@lgibelli lgibelli force-pushed the fix-vllm-backpressure branch from b31376d to 0c74322 on August 27, 2025 at 11:39
@lgibelli lgibelli changed the title from "Fix vLLM queue overflow from burst page submissions" to "Fix vLLM queue overflow with serialized semaphore release" on Aug 27, 2025
@lgibelli lgibelli marked this pull request as ready for review August 27, 2025 11:41
Multiple workers could acquire the semaphore in rapid succession when the queue dropped,
causing bursts of 1000+ page submissions and vLLM crashes.

Race condition in the semaphore release logic: multiple threads could evaluate the
conditions and release simultaneously before the queue updated.

Add asyncio.Lock() to serialize the release checks, ensuring atomic evaluation
and release. All condition checks now happen inside the lock.
@lgibelli lgibelli force-pushed the fix-vllm-backpressure branch from 0c74322 to 0742014 on August 27, 2025 at 11:45
@lgibelli

Tested locally on my 4090.
