@dreadnode-renovate-bot (bot, Contributor) commented on Aug 21, 2025

This PR contains the following updates:

| Package | Change             |
| ------- | ------------------ |
| vllm    | ^0.5.0 -> ^0.10.0  |

Release Notes

vllm-project/vllm (vllm)

v0.10.2


Highlights

This release contains 740 commits from 266 contributors (97 new)!

Breaking Changes: This release includes PyTorch 2.8.0 upgrade, V0 deprecations, and API changes - please review the changelog carefully.

aarch64 support: This release adds native aarch64 support, enabling vLLM on the GB200 platform. The Docker image vllm/vllm-openai should already be multi-platform. To install the wheels, download them from this release's artifacts or install via:

uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto
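
A quick post-install sanity check (a minimal sketch, assuming the wheel installed into the currently active environment) is to import the package and print its version:

python -c "import vllm; print(vllm.__version__)"
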
Model Support
Engine Core
  • V1 engine maturation: Extended V1 support to compute capability < 8.0 (#​23614, #​24022), added cross-attention KV cache for encoder-decoder models (#​23664), request-level logits processor integration (#​23656), and KV events from connectors (#​19737).
  • Backend expansion: Terratorch backend integration (#​23513), enabling non-language tasks such as semantic segmentation and geospatial applications via --model-impl terratorch (a hedged usage sketch follows this list).
  • Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#​22594), disabled prefix caching for hybrid/Mamba models (#​23716), added FP32 SSM kernel support (#​23506), full CUDA graph support for Mamba1 (#​23035), and V1 as default for Mamba models (#​23650).
  • Performance core improvements: --safetensors-load-strategy for accelerating NFS-based file loading (#​24469), critical CUDA graph capture throughput fix (#​24128), scheduler optimization for single completions (#​21917), multi-threaded model weight loading (#​23928), and tensor core usage enforcement for FlashInfer decode (#​23214).
  • Multimodal enhancements: Multimodal cache tracking with mm_hash (#​22711), UUID-based multimodal identifiers (#​23394), improved V1 video embedding estimation (#​24312), and simplified multimodal UUID handling (#​24271).
  • Sampling and structured outputs: Support for all prompt logprobs (#​23868), final logprobs (#​22387), grammar bitmask optimization (#​23361), and user-configurable KV cache memory size (#​21489).
  • Distributed: Decode Context Parallel (DCP) support for MLA (#​23734).
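
As a rough illustration of the Terratorch backend flag mentioned above (the checkpoint path below is a placeholder, not a tested configuration), a non-language model would be served by selecting the implementation explicitly:

# hypothetical checkpoint path; substitute a real Terratorch model
vllm serve path/to/terratorch-checkpoint --model-impl terratorch
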
Hardware & Performance
  • NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#​23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#​23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#​23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#​22357), MXFP4 fused CUTLASS MoE (#​23696), default MXFP4 MoE on Blackwell (#​23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#​23608).
  • Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#​24521).
  • Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#​14258, #​23958), V1 cross-attention support (#​23297), FP8 support for FlashMLA (#​22668), fused grouped TopK for MoE (#​23274), Flash Linear Attention kernels (#​24518), and W4A8 support on Hopper (#​23198).
  • Performance improvements: 13.7x speedup for token conversion (#​20413), TTIT/TTFT improvements for disaggregated serving (#​22760), symmetric memory all-reduce by default (#​24111), FlashInfer warmup during startup (#​23439), V1 model execution overlap (#​23569), and various Triton configuration tuning (#​23748, #​23939).
  • Platform expansion: Apple Silicon bfloat16 support for M2+ (#​24129), IBM Z V1 engine support (#​22725), Intel XPU torch.compile (#​22609), XPU MoE data parallelism (#​22887), XPU Triton attention (#​24149), XPU FP8 quantization (#​23148), and ROCm pipeline parallelism with Ray (#​24275).
  • Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#​24698, #​24688, #​24699, #​24695), GLM-4.5-Air-FP8 B200 configs (#​23695), Kimi K2 optimization (#​24597), and QWEN3 Coder/Thinking configs (#​24266, #​24330).
Quantization
  • New quantization capabilities: Per-layer quantization routing (#​23556), GGUF quantization with layer skipping (#​23188), NFP4+FP8 MoE support (#​22674), W4A8 channel scales (#​23570), and AMD CDNA2/CDNA3 FP4 support (#​22527).
  • Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#​22486) enabling techniques like SpinQuantR1R2R4 and QuIP quantization methods.
  • FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#​24197), FP8-qkv attention kernels (#​23647), and FP8 per-tensor GEMMs (#​22895).
  • Platform-specific quantization: ROCm TorchAO quantization enablement (#​24400) and TorchAO module swap configuration (#​21982).
  • Performance optimizations: MXFP4 MoE loading cache optimization (#​24154) and compressed tensors version updates (#​23202).
  • Breaking change: Removed original Marlin quantization format (#​23204).
API & Frontend
  • OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#​23735), transcription response usage statistics (#​23576), and return_token_ids parameter (#​22587).
  • Response API improvements: Streaming support for non-harmony responses (#​23741), non-streaming logprobs (#​23319), MCP tool background mode (#​23494), MCP streaming+background support (#​23927), and tool output token reporting (#​24285).
  • Frontend optimizations: Error stack traces with --log-error-stack (#​22960), collective RPC endpoint (#​23075), beam search concurrency optimization (#​23599), unnecessary detokenization skipping (#​24236), and custom media UUIDs (#​23449).
  • Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#​23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#​23595), EPLB configuration parameter (#​20562), embedding endpoint chat request support (#​23931), and LM Format Enforcer V1 integration (#​22564).
Dependencies
  • Major updates: PyTorch 2.8.0 upgrade (#​20358) - breaking change requiring environment updates, FlashInfer v0.3.0 upgrade (#​24086), and FlashInfer 0.2.14.post1 maintenance update (#​23537).
  • Supporting updates: XGrammar 0.1.23 (#​22988), TPU core dump fix with tpu_info 0.4.0 (#​23135), and compressed tensors version bump (#​23202).
  • Deployment improvements: FlashInfer cubin directory environment variable (#​22675) for offline environments and pre-cached CUDA binaries.
V0 Deprecation
  • Backend removals: V0 Neuron backend deprecation (#​21159), V0 pooling model support removal (#​23434), V0 FlashInfer attention backend removal (#​22776), and V0 test cleanup (#​23418, #​23862).
  • API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#​18800), LoRA extra vocab size deprecation warning (#​23635), LoRA bias parameter deprecation (#​24339), and metrics naming change from TPOT to ITL (#​24110).
Breaking Changes
  1. PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions (a quick environment check follows this list)
  2. FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
  3. V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
  4. Quantization - Removed the quantized Mixtral hack implementation and the original Marlin format.
  5. Metrics renaming - TPOT deprecated in favor of ITL
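
Because the PyTorch 2.8.0 bump changes environment requirements, it may be worth confirming what the current environment already provides before upgrading; a minimal check, assuming torch is installed, is:

python -c "import torch; print(torch.__version__, torch.version.cuda)"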

What's Changed


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Enabled.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

| datasource | package | from  | to     |
| ---------- | ------- | ----- | ------ |
| pypi       | vllm    | 0.5.5 | 0.10.1 |
@dreadnode-renovate-bot (bot) requested a review from a team as a code owner on August 21, 2025, 20:04
@dreadnode-renovate-bot (bot, Contributor, Author) commented:

⚠️ Artifact update problem

Renovate failed to update an artifact related to this branch. You probably do not want to merge this PR as-is.

♻ Renovate will retry this branch, including artifacts, only when one of the following happens:

  • any of the package files in this branch needs updating, or
  • the branch becomes conflicted, or
  • you click the rebase/retry checkbox if found above, or
  • you rename this PR's title to start with "rebase!" to trigger it manually

The artifact failure details are included below:

File name: poetry.lock
Updating dependencies
Resolving dependencies...


The current project's supported Python range (>=3.10,<3.14) is not compatible with some of the required packages Python requirement:
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14

Because no versions of vllm match >0.10.0,<0.10.1 || >0.10.1,<0.10.1.1 || >0.10.1.1,<0.11.0
 and vllm (0.10.0) requires Python <3.13,>=3.9, vllm is forbidden.
And because vllm (0.10.1) requires Python <3.13,>=3.9, vllm is forbidden.
So, because vllm (0.10.1.1) requires Python <3.13,>=3.9
 and rigging depends on vllm (^0.10.0), version solving failed.

  * Check your dependencies Python requirement: The Python requirement can be specified via the `python` or `markers` properties

    For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"
For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"
For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"

    https://python-poetry.org/docs/dependency-specification/#python-restricted-dependencies,
    https://python-poetry.org/docs/dependency-specification/#using-environment-markers
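
A minimal sketch of the fix Poetry suggests, assuming it is acceptable for vllm to simply not install on Python 3.13 within the project's >=3.10,<3.14 range: restrict the dependency's own python marker rather than narrowing the whole project. One way to apply it from the command line (the version spec mirrors the ^0.10.0 range in this PR) is:

# assumes vllm may be skipped on Python 3.13 for this project
poetry add "vllm@^0.10.0" --python ">=3.10,<3.13"
poetry lock

This records a python = ">=3.10,<3.13" property on the vllm entry in pyproject.toml, which is the `python` restriction the resolver message points to; the alternative is to narrow the project-wide Python range to <3.13, at the cost of dropping Python 3.13 support entirely.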


@dreadnode-renovate-bot (bot) added the type/digest (Dependency digest updates) and area/python (Changes to Python package configuration and dependencies) labels on Aug 21, 2025