vLLM queue overflow from burst page submissions #317


I'm testing olmocr on 3500+ PDFs with an average of 80 pages per PDF and a 4090 GPU.

Despite the `pages_per_group` limit (default: 500 pages per work item), the pipeline submits all pages within a work item simultaneously. When a worker processes its work item, it runs all of that item's PDFs in parallel, and each PDF submits all of its pages at once, creating bursts of 500+ requests.
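
To illustrate the burst pattern, here is a minimal sketch using plain `asyncio`. It is not olmocr's code: `fake_page_request`, `process_pdf`, `process_work_item`, and `peak_in_flight` are hypothetical names, and the request itself is simulated by yielding to the event loop once.

```python
import asyncio

async def fake_page_request(stats):
    # Hypothetical stand-in for one page request to the inference server.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0)  # yield once, standing in for the network wait
    stats["in_flight"] -= 1

async def process_pdf(num_pages, stats):
    # Every page of the PDF is submitted at once.
    await asyncio.gather(*(fake_page_request(stats) for _ in range(num_pages)))

async def process_work_item(pages_per_pdf, stats):
    # Every PDF in the work item runs in parallel.
    await asyncio.gather(*(process_pdf(n, stats) for n in pages_per_pdf))

def peak_in_flight(pages_per_pdf):
    stats = {"in_flight": 0, "peak": 0}
    asyncio.run(process_work_item(pages_per_pdf, stats))
    return stats["peak"]

if __name__ == "__main__":
    # Six 80-page PDFs, close to one 500-page work item.
    print(peak_in_flight([80] * 6))
```

In this toy model nothing bounds the number of simultaneous requests, so the peak equals the total page count of the work item.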
    
The backpressure mechanism has several issues:
- Initially, it waits for the queue to reach 0 (`peak_running_req` starts at 0).
- Once a peak is established, it waits for the queue to drop to ≤ 10% of that peak.
- It only checks every 30 seconds.
- When the semaphore releases, the next worker creates another 500-request burst.
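
The gate described in the list above can be modeled roughly as follows. This is a sketch of the behavior as I understand it, not the pipeline's actual code: the function names and the `burst_size` parameter are hypothetical, and it assumes for simplicity that a released burst lands entirely in the queue.

```python
def may_start_next_worker(queued: int, peak_running: int) -> bool:
    """Gate as described: require an empty queue until a peak is known,
    afterwards require the queue to be at most 10% of that peak."""
    if peak_running == 0:
        return queued == 0
    # Integer form of: queued <= 0.10 * peak_running
    return queued * 10 <= peak_running

def queue_after_release(queued: int, peak_running: int, burst_size: int) -> int:
    """Queue length right after one gate check (assumption: the whole burst
    of the next work item is enqueued at once if the gate opens)."""
    if may_start_next_worker(queued, peak_running):
        return queued + burst_size
    return queued

if __name__ == "__main__":
    # Gate opens at 10 queued with a peak of 100, then a 500-page burst arrives.
    print(queue_after_release(10, 100, 500))
```

The point of the model is that the gate only decides when the next burst starts. It does nothing to limit how large that burst is.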
    
This overflows the vLLM queue (600+ observed) and causes frequent crashes on my 4090. I opened PR #316 to address this.
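
For illustration, one common way to smooth such bursts is to cap the number of in-flight page requests with a shared `asyncio.Semaphore`. The sketch below is a generic pattern and not necessarily what PR #316 does. `bounded_page_request`, `run_capped`, `capped_run`, and the `max_in_flight` value are hypothetical, and the request is simulated.

```python
import asyncio

async def bounded_page_request(limit, stats):
    # Only max_in_flight requests can be inside this block at any time.
    async with limit:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0)  # simulated request
        stats["in_flight"] -= 1
        stats["completed"] += 1

async def run_capped(pages_per_pdf, max_in_flight, stats):
    # One semaphore shared by all PDFs in the work item.
    limit = asyncio.Semaphore(max_in_flight)

    async def process_pdf(num_pages):
        await asyncio.gather(
            *(bounded_page_request(limit, stats) for _ in range(num_pages))
        )

    await asyncio.gather(*(process_pdf(n) for n in pages_per_pdf))

def capped_run(pages_per_pdf, max_in_flight):
    stats = {"in_flight": 0, "peak": 0, "completed": 0}
    asyncio.run(run_capped(pages_per_pdf, max_in_flight, stats))
    return stats

if __name__ == "__main__":
    result = capped_run([80] * 6, max_in_flight=32)
    print(result["peak"], result["completed"])
```

With a cap like this, all pages still complete, but the server never sees more than `max_in_flight` requests from the pipeline at once. A suitable cap would depend on the GPU and model.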
    


### Versions

Python 3.11.13
aiohappyeyeballs==2.6.1
aiohttp==3.12.13
aiosignal==1.4.0
airportsdata==20250622
annotated-types==0.7.0
anyio==4.9.0
astor==0.8.1
attrs==25.3.0
azure-core==1.35.0
azure-identity==1.24.0
azure-storage-blob==12.26.0
beaker-py==2.4.4
beautifulsoup4==4.13.4
blake3==1.0.5
bleach==6.2.0
blinker==1.9.0
boto3==1.39.3
botocore==1.39.3
cached_path==1.7.3
cachetools==5.5.2
certifi==2025.6.15
cffi==1.17.1
charset-normalizer==3.4.2
click==8.2.1
cloudpickle==3.1.1
compressed-tensors==0.10.1
cryptography==45.0.5
cupy-cuda12x==13.4.1
defusedxml==0.7.1
Deprecated==1.2.18
depyf==0.18.0
dill==0.4.0
diskcache==5.6.3
distro==1.9.0
dnspython==2.7.0
einops==0.8.1
email_validator==2.2.0
eval_type_backport==0.2.2
fastapi==0.115.14
fastapi-cli==0.0.7
fastjsonschema==2.21.1
fastrlock==0.8.3
filelock==3.18.0
flashinfer-python @ https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl#sha256=08e50216841dcf07a0f097b040cd4e7012117e9e59fc97c14a95daafc69633d7
Flask==3.1.1
frozenlist==1.7.0
fsspec==2025.5.1
ftfy==6.3.1
fuzzysearch==0.8.0
gguf==0.17.1
google-api-core==2.25.1
google-auth==2.40.3
google-cloud-core==2.4.3
google-cloud-storage==2.19.0
google-crc32c==1.7.1
google-genai==1.24.0
google-resumable-media==2.7.2
googleapis-common-protos==1.70.0
greenlet==3.2.3
grpcio==1.73.1
h11==0.16.0
hf-xet==1.1.5
httpcore==1.0.9
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.33.2
idna==3.10
img2pdf==0.6.1
importlib_metadata==8.7.0
interegular==0.3.3
isodate==0.7.2
itsdangerous==2.2.0
Jinja2==3.1.6
jiter==0.10.0
jmespath==1.0.1
jsonschema==4.24.0
jsonschema-specifications==2025.4.1
jupyter_client==8.6.3
jupyter_core==5.8.1
jupyterlab_pygments==0.3.0
lark==1.2.2
lingua-language-detector==2.1.1
llguidance==0.7.30
llvmlite==0.44.0
lm-format-enforcer==0.10.11
lxml==6.0.0
markdown-it-py==3.0.0
markdown2==2.5.3
MarkupSafe==3.0.2
mdurl==0.1.2
mistral_common==1.6.3
mistralai==1.9.1
mistune==3.1.3
mpmath==1.3.0
msal==1.33.0
msal-extensions==1.3.1
msgpack==1.1.1
msgspec==0.19.0
multidict==6.6.3
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.5
ninja==1.11.1.4
numba==0.61.2
numpy==2.2.6
nvidia-cublas-cu12==12.8.3.14
nvidia-cuda-cupti-cu12==12.8.57
nvidia-cuda-nvrtc-cu12==12.8.61
nvidia-cuda-runtime-cu12==12.8.57
nvidia-cudnn-cu12==9.7.1.26
nvidia-cufft-cu12==11.3.3.41
nvidia-cufile-cu12==1.13.0.11
nvidia-curand-cu12==10.3.9.55
nvidia-cusolver-cu12==11.7.2.55
nvidia-cusparse-cu12==12.5.7.53
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.8.61
nvidia-nvtx-cu12==12.8.55
-e git+https://github.com/allenai/olmocr@b31376df0d7e3129e5472e041bcff0e044a094f4#egg=olmocr
openai==1.93.0
opencv-python-headless==4.11.0.86
opentelemetry-api==1.34.1
opentelemetry-exporter-otlp==1.34.1
opentelemetry-exporter-otlp-proto-common==1.34.1
opentelemetry-exporter-otlp-proto-grpc==1.34.1
opentelemetry-exporter-otlp-proto-http==1.34.1
opentelemetry-proto==1.34.1
opentelemetry-sdk==1.34.1
opentelemetry-semantic-conventions==0.55b1
opentelemetry-semantic-conventions-ai==0.4.10
orjson==3.10.18
outlines==0.1.11
outlines_core==0.1.26
packaging==25.0
pandocfilters==1.5.1
partial-json-parser==0.2.1.1.post6
pikepdf==9.9.0
pillow==11.3.0
platformdirs==4.3.8
playwright==1.53.0
prometheus-fastapi-instrumentator==7.1.0
prometheus_client==0.22.1
propcache==0.3.2
proto-plus==1.26.1
protobuf==5.29.5
psutil==7.0.0
py-cpuinfo==9.0.0
pyasn1==0.6.1
pyasn1_modules==0.4.2
pycountry==24.6.1
pycparser==2.22
pydantic==2.11.7
pydantic_core==2.33.2
pyee==13.0.0
Pygments==2.19.2
PyJWT==2.10.1
pypdf==5.7.0
pypdfium2==4.30.1
python-dateutil==2.9.0.post0
python-dotenv==1.1.1
python-json-logger==3.3.0
python-magic==0.4.27
python-multipart==0.0.20
PyYAML==6.0.2
pyzmq==27.0.0
RapidFuzz==3.13.0
ray==2.47.1
referencing==0.36.2
regex==2024.11.6
requests==2.32.4
rich==13.9.4
rich-toolkit==0.14.8
rpds-py==0.26.0
rsa==4.9.1
s3transfer==0.13.0
safetensors==0.5.3
scipy==1.16.0
sentencepiece==0.2.0
sequence_align==0.3.0
shellingham==1.5.4
six==1.17.0
smart_open==7.3.0.post1
sniffio==1.3.1
soupsieve==2.7
starlette==0.46.2
sympy==1.14.0
syntok==1.4.4
tenacity==8.5.0
tiktoken==0.9.0
tinycss2==1.4.0
tinyhost==0.4.18
tokenizers==0.21.2
torch==2.7.0+cu128
torchaudio==2.7.0+cu128
torchvision==0.22.0+cu128
tornado==6.5.1
tqdm==4.67.1
traitlets==5.14.3
transformers==4.52.4
triton==3.3.0
typer==0.16.0
typing-inspection==0.4.1
typing_extensions==4.14.1
urllib3==2.5.0
uvicorn==0.35.0
uvloop==0.21.0
vllm==0.9.1
watchfiles==1.1.0
wcwidth==0.2.13
webencodings==0.5.1
websockets==15.0.1
Werkzeug==3.1.3
wrapt==1.17.2
xformers==0.0.30
xgrammar==0.1.19
yarl==1.20.1
zipp==3.23.0
zstandard==0.23.0

