@dreadnode-renovate-bot (bot, Contributor) commented on Aug 21, 2025

This PR contains the following updates:

| Package | Change             |
| ------- | ------------------ |
| vllm    | ^0.5.0 -> ^0.10.0  |

Release Notes

vllm-project/vllm (vllm)

v0.10.2


Highlights

This release contains 740 commits from 266 contributors (97 new)!

Breaking Changes: This release includes PyTorch 2.8.0 upgrade, V0 deprecations, and API changes - please review the changelog carefully.

aarch64 support: This release adds native aarch64 support, enabling vLLM on the GB200 platform. The Docker image vllm/vllm-openai should already be multi-platform. To install the wheels, download them from this release's artifacts or install via:

uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto
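
A quick post-install sanity check (a minimal sketch, assuming the wheel installed into the currently active environment) is to import the package and print its version:

python -c "import vllm; print(vllm.__version__)"
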
Model Support
Engine Core
  • V1 engine maturation: Extended V1 support to compute capability < 8.0 (#​23614, #​24022), added cross-attention KV cache for encoder-decoder models (#​23664), request-level logits processor integration (#​23656), and KV events from connectors (#​19737).
  • Backend expansion: Terratorch backend integration (#​23513), enabling non-language tasks such as semantic segmentation and geospatial applications via --model-impl terratorch (a hedged usage sketch follows this list).
  • Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#​22594), disabled prefix caching for hybrid/Mamba models (#​23716), added FP32 SSM kernel support (#​23506), full CUDA graph support for Mamba1 (#​23035), and V1 as default for Mamba models (#​23650).
  • Performance core improvements: --safetensors-load-strategy for accelerating NFS-based file loading (#​24469), critical CUDA graph capture throughput fix (#​24128), scheduler optimization for single completions (#​21917), multi-threaded model weight loading (#​23928), and tensor core usage enforcement for FlashInfer decode (#​23214).
  • Multimodal enhancements: Multimodal cache tracking with mm_hash (#​22711), UUID-based multimodal identifiers (#​23394), improved V1 video embedding estimation (#​24312), and simplified multimodal UUID handling (#​24271).
  • Sampling and structured outputs: Support for all prompt logprobs (#​23868), final logprobs (#​22387), grammar bitmask optimization (#​23361), and user-configurable KV cache memory size (#​21489).
  • Distributed: Decode Context Parallel (DCP) support for MLA (#​23734).
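
As a rough illustration of the Terratorch backend flag mentioned above (the checkpoint path below is a placeholder, not a tested configuration), a non-language model would be served by selecting the implementation explicitly:

# hypothetical checkpoint path; substitute a real Terratorch model
vllm serve path/to/terratorch-checkpoint --model-impl terratorch
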
Hardware & Performance
  • NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#​23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#​23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#​23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#​22357), MXFP4 fused CUTLASS MoE (#​23696), default MXFP4 MoE on Blackwell (#​23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#​23608).
  • Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#​24521).
  • Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#​14258, #​23958), V1 cross-attention support (#​23297), FP8 support for FlashMLA (#​22668), fused grouped TopK for MoE (#​23274), Flash Linear Attention kernels (#​24518), and W4A8 support on Hopper (#​23198).
  • Performance improvements: 13.7x speedup for token conversion (#​20413), TTIT/TTFT improvements for disaggregated serving (#​22760), symmetric memory all-reduce by default (#​24111), FlashInfer warmup during startup (#​23439), V1 model execution overlap (#​23569), and various Triton configuration tuning (#​23748, #​23939).
  • Platform expansion: Apple Silicon bfloat16 support for M2+ (#​24129), IBM Z V1 engine support (#​22725), Intel XPU torch.compile (#​22609), XPU MoE data parallelism (#​22887), XPU Triton attention (#​24149), XPU FP8 quantization (#​23148), and ROCm pipeline parallelism with Ray (#​24275).
  • Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#​24698, #​24688, #​24699, #​24695), GLM-4.5-Air-FP8 B200 configs (#​23695), Kimi K2 optimization (#​24597), and QWEN3 Coder/Thinking configs (#​24266, #​24330).
Quantization
  • New quantization capabilities: Per-layer quantization routing (#​23556), GGUF quantization with layer skipping (#​23188), NFP4+FP8 MoE support (#​22674), W4A8 channel scales (#​23570), and AMD CDNA2/CDNA3 FP4 support (#​22527).
  • Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#​22486) enabling techniques like SpinQuantR1R2R4 and QuIP quantization methods.
  • FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#​24197), FP8-qkv attention kernels (#​23647), and FP8 per-tensor GEMMs (#​22895).
  • Platform-specific quantization: ROCm TorchAO quantization enablement (#​24400) and TorchAO module swap configuration (#​21982).
  • Performance optimizations: MXFP4 MoE loading cache optimization (#​24154) and compressed tensors version updates (#​23202).
  • Breaking change: Removed original Marlin quantization format (#​23204).
API & Frontend
  • OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#​23735), transcription response usage statistics (#​23576), and return_token_ids parameter (#​22587).
  • Response API improvements: Streaming support for non-harmony responses (#​23741), non-streaming logprobs (#​23319), MCP tool background mode (#​23494), MCP streaming+background support (#​23927), and tool output token reporting (#​24285).
  • Frontend optimizations: Error stack traces with --log-error-stack (#​22960), collective RPC endpoint (#​23075), beam search concurrency optimization (#​23599), unnecessary detokenization skipping (#​24236), and custom media UUIDs (#​23449).
  • Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#​23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#​23595), EPLB configuration parameter (#​20562), embedding endpoint chat request support (#​23931), and LM Format Enforcer V1 integration (#​22564).
Dependencies
  • Major updates: PyTorch 2.8.0 upgrade (#​20358) - breaking change requiring environment updates, FlashInfer v0.3.0 upgrade (#​24086), and FlashInfer 0.2.14.post1 maintenance update (#​23537).
  • Supporting updates: XGrammar 0.1.23 (#​22988), TPU core dump fix with tpu_info 0.4.0 (#​23135), and compressed tensors version bump (#​23202).
  • Deployment improvements: FlashInfer cubin directory environment variable (#​22675) for offline environments and pre-cached CUDA binaries.
V0 Deprecation
  • Backend removals: V0 Neuron backend deprecation (#​21159), V0 pooling model support removal (#​23434), V0 FlashInfer attention backend removal (#​22776), and V0 test cleanup (#​23418, #​23862).
  • API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#​18800), LoRA extra vocab size deprecation warning (#​23635), LoRA bias parameter deprecation (#​24339), and metrics naming change from TPOT to ITL (#​24110).
Breaking Changes
  1. PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions (a quick environment check follows this list)
  2. FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
  3. V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
  4. Quantization - Removed the quantized Mixtral hack implementation and the original Marlin format.
  5. Metrics renaming - TPOT deprecated in favor of ITL
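
Because the PyTorch 2.8.0 bump changes environment requirements, it may be worth confirming what the current environment already provides before upgrading; a minimal check, assuming torch is installed, is:

python -c "import torch; print(torch.__version__, torch.version.cuda)"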

What's Changed


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Enabled.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

| datasource | package | from  | to     |
| ---------- | ------- | ----- | ------ |
| pypi       | vllm    | 0.5.5 | 0.10.1 |
@dreadnode-renovate-bot (bot) requested a review from a team as a code owner on August 21, 2025, 20:04
@dreadnode-renovate-bot (bot, Contributor, Author) commented:

⚠️ Artifact update problem

Renovate failed to update an artifact related to this branch. You probably do not want to merge this PR as-is.

♻ Renovate will retry this branch, including artifacts, only when one of the following happens:

  • any of the package files in this branch needs updating, or
  • the branch becomes conflicted, or
  • you click the rebase/retry checkbox if found above, or
  • you rename this PR's title to start with "rebase!" to trigger it manually

The artifact failure details are included below:

File name: poetry.lock
Updating dependencies
Resolving dependencies...


The current project's supported Python range (>=3.10,<3.14) is not compatible with some of the required packages Python requirement:
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14

Because no versions of vllm match >0.10.0,<0.10.1 || >0.10.1,<0.10.1.1 || >0.10.1.1,<0.11.0
 and vllm (0.10.0) requires Python <3.13,>=3.9, vllm is forbidden.
And because vllm (0.10.1) requires Python <3.13,>=3.9, vllm is forbidden.
So, because vllm (0.10.1.1) requires Python <3.13,>=3.9
 and rigging depends on vllm (^0.10.0), version solving failed.

  * Check your dependencies Python requirement: The Python requirement can be specified via the `python` or `markers` properties

    For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"
For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"
For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"

    https://python-poetry.org/docs/dependency-specification/#python-restricted-dependencies,
    https://python-poetry.org/docs/dependency-specification/#using-environment-markers
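
A minimal sketch of the fix Poetry suggests, assuming it is acceptable for vllm to simply not install on Python 3.13 within the project's >=3.10,<3.14 range: restrict the dependency's own python marker rather than narrowing the whole project. One way to apply it from the command line (the version spec mirrors the ^0.10.0 range in this PR) is:

# assumes vllm may be skipped on Python 3.13 for this project
poetry add "vllm@^0.10.0" --python ">=3.10,<3.13"
poetry lock

This records a python = ">=3.10,<3.13" property on the vllm entry in pyproject.toml, which is the `python` restriction the resolver message points to; the alternative is to narrow the project-wide Python range to <3.13, at the cost of dropping Python 3.13 support entirely.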


@dreadnode-renovate-bot (bot) added the type/digest (Dependency digest updates) and area/python (Changes to Python package configuration and dependencies) labels on Aug 21, 2025