Changes from all commits (125 commits)
644d57d
[Model] Add Ernie4.5 VL Model Support (#22514)
CSWYF3634076 Aug 27, 2025
3210264
[Frontend] Add --log-error-stack to print stack trace for error respo…
heheda12345 Aug 27, 2025
142ac08
[Frontend] Optimize beam search performance by limiting concurrency (…
heheda12345 Aug 27, 2025
d272415
[Quantization] Expand compressed-tensors MoE matching logic to suppor…
dsikka Aug 27, 2025
fce10db
[XPU] Add xpu torch.compile support (#22609)
jikunshang Aug 27, 2025
9de25c2
[CI/Build] Remove redundant LoRA model tests (#23706)
jeejeelee Aug 27, 2025
8dbf6ed
[Bugfix] fix when config.yaml config value is list parse error (#23528)
lengrongfu Aug 27, 2025
69244e6
[Core] Use key-only cache for `BaseMultiModalProcessor` (#23018)
DarkLight1337 Aug 27, 2025
6446677
[XPU]fix cuda event used in XPU model runner (#23708)
jikunshang Aug 27, 2025
91e382c
[CI/Build] Remove redundant register in model init tests (#23715)
DarkLight1337 Aug 27, 2025
5bd9f84
[Docs] Fix an admonition important (#23726)
windsonsea Aug 27, 2025
6578e87
Optimize input preparation for FlashInfer [2/N] (#23174)
WoosukKwon Aug 27, 2025
04ff1e4
[Misc] Move CpuGpuBuffer to vllm/v1/utils.py (#23728)
WoosukKwon Aug 27, 2025
11eddf0
[FlashInfer] Cache hyper params in metadata builder (#23732)
WoosukKwon Aug 27, 2025
e039407
[CI/Build] Reduce LoRA layer test cases (#23721)
jeejeelee Aug 27, 2025
8f0d7ea
[XPU] Fix OOM issue for data parallel with Ray backend (#22500)
faaany Aug 27, 2025
1f7a9c9
[Docs] Fix a 1-2-3 list and style issues in tpu.md (#23729)
windsonsea Aug 27, 2025
9d30de4
[model] Support MiniCPM-V 4.5 (#23586)
tc-mb Aug 27, 2025
8c13820
[Bugfix] Fix task field initialization when PYTHONOPTIMIZE is enabled…
cndoit18 Aug 27, 2025
a403d0f
[Misc] Remove unnecessary `_send_reconfig_message()` in `core_client.…
njhill Aug 27, 2025
704432a
[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-b…
tdoublep Aug 27, 2025
5eeef1b
[Model] Explicit `default_pooling_type` interface (#23736)
DarkLight1337 Aug 27, 2025
8dd2baa
Add vLLM Korea Meetup in the README.md and meetups.md (#23746)
rebel-hongseok Aug 27, 2025
16dc405
Fix pre-commit on main (#23747)
hmellor Aug 27, 2025
fe8d7b6
[Model] Interface to enable batch-level DP support (#23733)
DarkLight1337 Aug 27, 2025
513c1fe
Only run `get_attr_docs` if generating help text (#23723)
hmellor Aug 27, 2025
3af47c3
[Feature] Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt (#23666)
yewentao256 Aug 27, 2025
8414904
[Model] Enable native HF format InternVL support (#23742)
Isotr0py Aug 27, 2025
83f555f
[Doc]: upgrade version of crate-ci tool for improved typo detection (…
didier-durand Aug 27, 2025
3ce8285
[LogitsProcs] Deduplicate built-in LP implementation logic (#23362)
njhill Aug 27, 2025
2b61d2e
[Docs] Remove in-tree Gaudi install instructions (#23628)
hmellor Aug 27, 2025
4f35be1
[BugFix] Fix topk_softmax assert (#19764)
ProExpertProg Aug 27, 2025
52883ed
[Model] Merge `SupportsMultiModalWithRawInput` with `SupportsMultiMod…
DarkLight1337 Aug 27, 2025
dd58932
[V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Tex…
tdoublep Aug 27, 2025
4e4d017
[Docs] Fix warnings in `mkdocs build` (continued) (#23743)
Zerohertz Aug 27, 2025
3c0ef76
ci: Add arm64 docker build to release pipeline (#23210)
seemethere Aug 27, 2025
0585a9e
Disable `torch.compile` for dynamic rope models in Transformers backe…
hmellor Aug 27, 2025
8bf6266
[Multimodal] Generate mm_hash based on request metadata when caching …
ywang96 Aug 27, 2025
853c371
[V1][Mamba] - Enable V1 by default for Mamba Models (#23650)
Josephasafg Aug 27, 2025
082cc07
DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100 (#23608)
zyongye Aug 27, 2025
f9ca2b4
[Bugfix] Fix Marlin NVFP4 for modelopt (#23659)
mgoin Aug 27, 2025
321938e
[Feature] Add `VLLM_DISABLE_PAD_FOR_CUDAGRAPH` to Avoid Hang Issue (#…
yewentao256 Aug 27, 2025
5da4f5d
[Bugfix] Fix for V1 priority scheduling crashes at preemption (#23713)
Hanchenli Aug 28, 2025
a69693e
Migrate Qwen inputs to TensorSchema (#23473)
bbeckca Aug 28, 2025
1b7b161
[Feature] models: pass layer prefix to replace_linear_class for per-l…
Shrey1306 Aug 28, 2025
a781e84
[Perf] Tune configs for triton block fp8 gemm H100/H200 (#23748)
mgoin Aug 28, 2025
a11adaf
Gracefully handle edge cases in harmony utils (#23155)
Ithanil Aug 28, 2025
f48a9af
[CI] make all multi-gpu weight loading tests run nightly (#23792)
killershrimp Aug 28, 2025
c8851a4
Add deprecation warning for lora_extra_vocab_size (#23635)
ahengljh Aug 28, 2025
22feac8
[Transform] [Quantization] Add transforms to compressed tensors (#22486)
kylesayrs Aug 28, 2025
c07a733
[CI] enable idefics3 and fuyu-8b test in multimodal test (#23790)
ZJY0516 Aug 28, 2025
daa1273
[Bugfix] when set offline model running error (#23711)
lengrongfu Aug 28, 2025
186aced
[Kernel] cuda kernels for upcoming decode context parallel feature (#…
youzhedian Aug 28, 2025
11a7faf
[New Model]: Support GteNewModelForSequenceClassification (#23524)
noooop Aug 28, 2025
c5d004a
[Model] Add PP support and VLM backbone compatability for GPT-OSS (#2…
Isotr0py Aug 28, 2025
3462c1c
[FIXBUG] Add return_success parameter to moe_wna16_weight_loader func…
JartX Aug 28, 2025
d99c3a4
[Doc]: fix typos in .md files (including those of #23751) (#23825)
didier-durand Aug 28, 2025
67cee40
[CI/Build][Bugfix] Fix Qwen VL tests on CPU (#23818)
bigPYJ1151 Aug 28, 2025
a3432f1
[BugFix][Spec Decode] Use float64 for uniform_probs (#23803)
WoosukKwon Aug 28, 2025
bfab219
[Model] [gpt-oss] fix gpt-oss pp support (#23815)
ZJY0516 Aug 28, 2025
d3da2ee
[Doc]: fix typos in Python scripts (#23828)
didier-durand Aug 28, 2025
66548f6
[Bugfix] Fix benchmark_moe.py for blockwise fp8. (#23823)
crischeng Aug 28, 2025
1f096f9
[CI] Fix linting error on main (#23835)
tdoublep Aug 28, 2025
9508960
[Model][gpt-oss] Support DP+EP for GPT-OSS with FlashInfer trtllm-gen…
nvpohanh Aug 28, 2025
db74d60
[Bugfix] Add fake mode around passes (#23349)
angelayi Aug 28, 2025
0583578
[ci] breaks down V1 Test into 3 groups of approx 30 minutes runtime (…
jeanschmidt Aug 28, 2025
8805ad9
Add scale_config.yml file for Meta autoscalers for GH Actions (#23840)
jeanschmidt Aug 28, 2025
f32a5bc
Migrate Llama4ImagePatchInputs to TensorSchema (#22021)
bbeckca Aug 28, 2025
04d1dd7
[ROCm][Aiter] Add triton fp8 bmm kernel for mla (#23264)
divakar-amd Aug 28, 2025
57d4ede
[bugfix] [spec-decoding] fix data race in sample_recovered_tokens_ker…
He-Jingkai Aug 28, 2025
16a45b3
[NVIDIA] Support SiluMul + NVFP4 quant fusion (#23671)
elvischenv Aug 28, 2025
27e88ce
chore: build release image by default (#23852)
simon-mo Aug 28, 2025
7ffbf27
[BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr…
WoosukKwon Aug 28, 2025
cb293f6
[V1] Enable prefill optimization for Gemma3n (#22628)
sarckk Aug 28, 2025
d3d2aad
[Log] Use Debug Once for DeepGEMM E8M0 When not Enabled (#23858)
yewentao256 Aug 28, 2025
b668055
[V0 Deprecation] Remove V0 Samplers test (#23862)
WoosukKwon Aug 29, 2025
235c9db
[XPU] support data parallel for MoE models on XPU (#22887)
chaojun-zhang Aug 29, 2025
de533ab
[Models] Improve iteration over layers (#19497)
lgeiger Aug 29, 2025
006477e
[ROCm][Fix] Fix rocm build caused by #23791 (#23847)
charlifu Aug 29, 2025
c8b3b29
[tests] Improve speed and reliability of test_transcription_api_corre…
russellb Aug 29, 2025
98ac0cb
[Bugfix] Use `ReplicatedLinear` for SequenceClassification head (#23836)
Isotr0py Aug 29, 2025
5264015
[BugFix][AMD][Deepseek] fix a dtype mismatch error for deepseek runni…
KingsleyZhang123 Aug 29, 2025
6597d7a
[Platform] import activation_quant_fusion for CUDA only (#23882)
wangxiyuan Aug 29, 2025
05d839c
Fix(async): Add support for truncate_prompt_tokens in AsyncLLM (#23800)
oneraghavan Aug 29, 2025
b4f9e96
[CI/Build] Clean up LoRA test (#23890)
jeejeelee Aug 29, 2025
2d0afcc
[mrope][Qwen2-VL] Fix edge case where getting index of image/video to…
huachenheli Aug 29, 2025
885ca6d
[Misc] Fix warnings for mistral model (#23552)
ZJY0516 Aug 29, 2025
934bebf
Better errors for Transformers backend missing features (#23759)
hmellor Aug 29, 2025
2554b27
[V0 Deprecation] Remove pooling model support in V0 (#23434)
maxdebayser Aug 29, 2025
ad39106
[CPU] Enable data parallel for CPU backend (#23903)
bigPYJ1151 Aug 29, 2025
d9e00db
[Performance] V1 Classify Models E2E Performance Optimization (#23541)
noooop Aug 29, 2025
69f4635
[Multimodal] Consolidate mm inputs into MultiModalFeatureSpec (#23779)
sfeng33 Aug 29, 2025
67c1490
Update PyTorch to 2.8.0 (#20358)
huydhn Aug 29, 2025
4f7cde7
Adds `json_count_leaves` utility function (#23899)
aditchawdhary Aug 29, 2025
1cf3753
[MODEL] `Apertus` and `XIELU` (#23068)
EduardDurech Aug 29, 2025
0a2f4c0
[Models] Use in-place adds in Idefics2Vision (#23932)
lgeiger Aug 29, 2025
d90d8eb
[BugFix] Async scheduling and PP compatibility with DP (#23770)
njhill Aug 29, 2025
72a6913
[CI] Add `aiter` to matching list of issue auto labeller for `rocm` …
vllmellm Aug 29, 2025
0dc9532
[BUGFIX ] fix undefined silu_and_mul_nvfp4_quant (#23929)
youzhedian Aug 29, 2025
4d7fe40
[RL][BugFix] Fix missing tokenizer error for token-in-token-out (#23904)
22quinn Aug 29, 2025
b7adf94
Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj (#23939)
mgoin Aug 29, 2025
1c26b42
[Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-bas…
tdoublep Aug 29, 2025
8c3e199
Revert gemma3n fast prefill changes (#23897)
sarckk Aug 29, 2025
5674a40
[Misc] Make `download_weights_from_hf` more reliable (#23863)
hmellor Aug 29, 2025
d660c98
[CI] Fix unavailable image remote URL (#23966)
ywang96 Aug 29, 2025
5b31cb1
[Bugfix] Fix --config arg expansion called from api_server.py (#23944)
dubejf Aug 30, 2025
8fb85b7
Add routed_scaling_factor to MoE grouped topk (#23123)
xyang16 Aug 30, 2025
ee52a32
[CI] Move testing image from remote URL to S3 (#23980)
ywang96 Aug 30, 2025
9748c51
[CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant f…
sarckk Aug 30, 2025
f1bddbd
[Core] Cleanup TPU model runner for MM (#23894)
DarkLight1337 Aug 30, 2025
4071c76
[V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba (#23831)
tdoublep Aug 30, 2025
628d00c
[Bugfix] Fix test_lora_resolvers.py (#23984)
jeejeelee Aug 30, 2025
5490d63
[UT] fix unify_kv_cache_configs when kv cache config needs sort (#23843)
andyxning Aug 30, 2025
3a6acad
[Model] Enable encoder DP for MiniCPM-V (#23948)
ZJY0516 Aug 30, 2025
379ea28
Add LoRA support for DeepSeek models (V2, V3, R1-0528) (#23971)
sadeghja1070 Aug 30, 2025
fb4983e
[Misc] add reorder_batch AttentionMetadataBuilder (#23798)
andyxning Aug 30, 2025
e80bca3
[Refactor] refactor freezing_value/cuda_event initialize outside try …
andyxning Aug 30, 2025
68a3491
[Misc] enhance type hint for rearrange return value (#23519)
andyxning Aug 30, 2025
038e9be
[LoRA] Much faster startup when LoRA is enabled (#23777)
andylolu2 Aug 30, 2025
5b8077b
Fix wrong truncate_prompt_tokens type hint (#22761)
gmarinho2 Aug 30, 2025
749be00
[Core][Multimodal] Allow passing `multi_modal_uuids` as multimodal id…
ywang96 Aug 31, 2025
9701352
[Doc]: fix typos in Python comments (#24001)
didier-durand Aug 31, 2025
81eea3d
vllm fix check on max vocab size (#22471)
xw285cornell Aug 31, 2025
752d2e1
[Minor] Fix some random typos in comments (#24009)
njhill Aug 31, 2025
14b4326
v1: Support KV events from connectors (#19737)
orozery Sep 1, 2025
40 changes: 31 additions & 9 deletions .buildkite/release-pipeline.yaml
@@ -7,7 +7,7 @@ steps:
commands:
# #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
# https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --build-arg torch_cuda_arch_list='8.7 9.0 10.0+PTX' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --build-arg torch_cuda_arch_list='8.7 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/scripts/upload-wheels.sh"
@@ -62,23 +62,45 @@ steps:
env:
DOCKER_BUILDKIT: "1"

- block: "Build release image"
- label: "Build release image (x86)"
depends_on: ~
key: block-release-image-build

- label: "Build release image"
depends_on: block-release-image-build
id: build-release-image
id: build-release-image-x86
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain -f docker/Dockerfile ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)"
# re-tag to default image tag and push, just in case arm64 build fails
- "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

- label: "Build release image (arm64)"
depends_on: ~
id: build-release-image-arm64
agents:
queue: arm64_cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --build-arg torch_cuda_arch_list='8.7 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)"

# Add job to create multi-arch manifest
- label: "Create multi-arch manifest"
depends_on:
- build-release-image-x86
- build-release-image-arm64
id: create-multi-arch-manifest
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "docker manifest create public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-x86_64 public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-aarch64 --amend"
- "docker manifest push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

- label: "Annotate release workflow"
depends_on:
- build-release-image
- create-multi-arch-manifest
- build-wheel-cuda-12-8
- build-wheel-cuda-12-6
- build-wheel-cuda-11-8
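
The x86 and arm64 jobs each push an architecture-suffixed tag ($BUILDKITE_COMMIT-x86_64 and $BUILDKITE_COMMIT-aarch64), and the manifest job then stitches them together under the plain commit tag. As a rough local check, not part of the pipeline and assuming pull access to the public ECR repository, the combined tag could be inspected like this:

# Sketch only: confirm the multi-arch manifest references both images.
IMAGE="public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT:?set BUILDKITE_COMMIT}"
docker manifest inspect "$IMAGE" | grep -E '"architecture"|"os"'   # expect amd64 and arm64 entries
# buildx gives a more compact per-platform summary of the same manifest list
docker buildx imagetools inspect "$IMAGE"
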
1 change: 0 additions & 1 deletion .buildkite/scripts/hardware_ci/run-amd-test.sh
@@ -164,7 +164,6 @@ if [[ $commands == *" entrypoints/llm "* ]]; then
--ignore=entrypoints/llm/test_chat.py \
--ignore=entrypoints/llm/test_accuracy.py \
--ignore=entrypoints/llm/test_init.py \
--ignore=entrypoints/llm/test_generate_multiple_loras.py \
--ignore=entrypoints/llm/test_prompt_validation.py "}
fi

44 changes: 30 additions & 14 deletions .buildkite/scripts/hardware_ci/run-cpu-test.sh
@@ -25,8 +25,8 @@ numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$NUMA_NODE"
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" --tag cpu-test-"$NUMA_NODE"-avx2 --target vllm-test -f docker/Dockerfile.cpu .

# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=16 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=16 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2

function cpu_tests() {
set -e
@@ -49,57 +49,73 @@ function cpu_tests() {
# Run kernel tests
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -v -s tests/kernels/test_onednn.py"
pytest -x -v -s tests/kernels/test_onednn.py"

# Run basic model test
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
# Note: disable until supports V1
# pytest -v -s tests/kernels/attention/test_cache.py -m cpu_model
# pytest -v -s tests/kernels/attention/test_mla_decode_cpu.py -m cpu_model
# pytest -x -v -s tests/kernels/attention/test_cache.py -m cpu_model
# pytest -x -v -s tests/kernels/attention/test_mla_decode_cpu.py -m cpu_model

# Note: disable Bart until supports V1
pytest -v -s tests/models/language/generation -m cpu_model \
pytest -x -v -s tests/models/language/generation -m cpu_model \
--ignore=tests/models/language/generation/test_bart.py
VLLM_CPU_SGL_KERNEL=1 pytest -v -s tests/models/language/generation -m cpu_model \
VLLM_CPU_SGL_KERNEL=1 pytest -x -v -s tests/models/language/generation -m cpu_model \
--ignore=tests/models/language/generation/test_bart.py

pytest -v -s tests/models/language/pooling -m cpu_model
pytest -v -s tests/models/multimodal/generation \
pytest -x -v -s tests/models/language/pooling -m cpu_model
pytest -x -v -s tests/models/multimodal/generation \
--ignore=tests/models/multimodal/generation/test_mllama.py \
--ignore=tests/models/multimodal/generation/test_pixtral.py \
-m cpu_model"

# Run compressed-tensor test
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
pytest -x -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs[False-10-32-neuralmagic/Llama-3.2-1B-quantized.w8a8]"

# Note: disable it until supports V1
# Run AWQ test
# docker exec cpu-test-"$NUMA_NODE" bash -c "
# set -e
# VLLM_USE_V1=0 pytest -s -v \
# VLLM_USE_V1=0 pytest -x -s -v \
# tests/quantization/test_ipex_quant.py"

# Run multi-lora tests
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
pytest -x -s -v \
tests/lora/test_qwen2vl.py"

# online serving
# online serving: tp+pp
docker exec cpu-test-"$NUMA_NODE" bash -c '
set -e
VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 &
server_pid=$!
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
vllm bench serve \
--backend vllm \
--dataset-name random \
--model meta-llama/Llama-3.2-3B-Instruct \
--num-prompts 20 \
--endpoint /v1/completions'
--endpoint /v1/completions
kill -s SIGTERM $server_pid &'

# online serving: tp+dp
docker exec cpu-test-"$NUMA_NODE" bash -c '
set -e
VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -dp=2 &
server_pid=$!
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
vllm bench serve \
--backend vllm \
--dataset-name random \
--model meta-llama/Llama-3.2-3B-Instruct \
--num-prompts 20 \
--endpoint /v1/completions
kill -s SIGTERM $server_pid &'
}

# All of CPU tests are expected to be finished less than 40 mins.
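
Both online-serving blocks above follow the same launch, wait, benchmark, and teardown shape. A condensed sketch of that pattern, with model, port, and flags taken from the script and trap-based cleanup shown as an alternative to the trailing kill:

# Sketch of the serving-test pattern used above, not a replacement for it.
set -e
VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 \
  vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -dp=2 &
server_pid=$!
trap 'kill -s SIGTERM "$server_pid" 2>/dev/null || true' EXIT   # always stop the server on exit
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
vllm bench serve --backend vllm --dataset-name random \
  --model meta-llama/Llama-3.2-3B-Instruct --num-prompts 20 --endpoint /v1/completions
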
1 change: 1 addition & 0 deletions .buildkite/scripts/hardware_ci/run-xpu-test.sh
@@ -31,6 +31,7 @@ docker run \
set -e
echo $ZE_AFFINITY_MASK
VLLM_USE_V1=1 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager
VLLM_USE_V1=1 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 -O3 -O.cudagraph_mode=NONE
VLLM_USE_V1=1 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend ray
VLLM_USE_V1=1 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend mp
cd tests
42 changes: 29 additions & 13 deletions .buildkite/test-pipeline.yaml
@@ -109,10 +109,9 @@ steps:
- tests/entrypoints/offline_mode
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_collective_rpc.py
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
- pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
- VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests

- label: Entrypoints Test (API Server) # 40min
@@ -234,16 +233,33 @@ steps:
# OOM in the CI unless we run this separately
- pytest -v -s tokenization

- label: V1 Test
- label: V1 Test e2e + engine
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- vllm/
- tests/v1
commands:
# split the test to avoid interference
- pytest -v -s v1/core
# TODO: accuracy does not match, whether setting
# VLLM_USE_FLASHINFER_SAMPLER or not on H100.
- pytest -v -s v1/e2e
- pytest -v -s v1/engine

- label: V1 Test entrypoints
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- vllm/
- tests/v1
commands:
- pytest -v -s v1/entrypoints

- label: V1 Test others
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- vllm/
- tests/v1
commands:
# split the test to avoid interference
- pytest -v -s v1/core
- pytest -v -s v1/executor
- pytest -v -s v1/sample
- pytest -v -s v1/logits_processors
@@ -256,9 +272,6 @@
- pytest -v -s v1/test_utils.py
- pytest -v -s v1/test_oracle.py
- pytest -v -s v1/test_metrics_reader.py
# TODO: accuracy does not match, whether setting
# VLLM_USE_FLASHINFER_SAMPLER or not on H100.
- pytest -v -s v1/e2e
# Integration test for streaming correctness (requires special branch).
- pip install -U git+https://github.com/robertgshaw2-redhat/lm-evaluation-harness.git@streaming-api
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
@@ -312,7 +325,7 @@
source_file_dependencies:
- vllm/lora
- tests/lora
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_llm_with_multi_loras.py
parallelism: 4

- label: PyTorch Compilation Unit Tests
@@ -449,8 +462,8 @@ steps:
- tests/quantization
commands:
# temporary install here since we need nightly, will move to requirements/test.in
# after torchao 0.12 release
- pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
# after torchao 0.12 release, and pin a working version of torchao nightly here
- pip install --pre torchao==0.13.0.dev20250814 --index-url https://download.pytorch.org/whl/nightly/cu128
- VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization

- label: LM Eval Small Models # 53min
@@ -654,6 +667,7 @@ steps:
# Quantization
- pytest -v -s tests/kernels/quantization/test_cutlass_scaled_mm.py -k 'fp8'
- pytest -v -s tests/kernels/quantization/test_nvfp4_quant.py
- pytest -v -s tests/kernels/quantization/test_silu_nvfp4_quant_fusion.py
- pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
- pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
- pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
@@ -663,6 +677,7 @@ steps:
- pytest -v -s tests/compile/test_fusion_all_reduce.py
- pytest -v -s tests/compile/test_fusion_attn.py::test_attention_quant_pattern
- pytest -v -s tests/kernels/moe/test_flashinfer.py
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py

##### 1 GPU test #####
##### multi gpus test #####
@@ -791,13 +806,14 @@ steps:
# requires multi-GPU testing for validation.
- pytest -v -s -x lora/test_chatglm3_tp.py
- pytest -v -s -x lora/test_llama_tp.py
- pytest -v -s -x lora/test_multi_loras_with_tp.py
- pytest -v -s -x lora/test_llm_with_multi_loras.py


- label: Weight Loading Multiple GPU Test # 33min
mirror_hardwares: [amdexperimental]
working_dir: "/vllm-workspace/tests"
num_gpus: 2
num_gpus: 2
optional: true
source_file_dependencies:
- vllm/
- tests/weight_loading
21 changes: 21 additions & 0 deletions .github/scale-config.yml
@@ -0,0 +1,21 @@
# scale-config.yml:
# Powers what instance types are available for GHA auto-scaled
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
# runner_types:
# runner_label:
# instance_type: m4.large
# os: linux
# # min_available defaults to the global cfg in the ALI Terraform
# min_available: undefined
# # when max_available value is not defined, no max runners is enforced
# max_available: undefined
# disk_size: 50
# is_ephemeral: true

runner_types:
linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: true
os: linux
4 changes: 4 additions & 0 deletions .github/workflows/issue_autolabel.yml
@@ -49,6 +49,10 @@ jobs:
term: "VLLM_ROCM_",
searchIn: "both"
},
{
term: "aiter",
searchIn: "title"
},
{
term: "rocm",
searchIn: "title"
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -21,7 +21,7 @@ repos:
- id: ruff-format
files: ^(.buildkite|benchmarks|examples)/.*
- repo: https://github.com/crate-ci/typos
rev: v1.34.0
rev: v1.35.5
hooks:
- id: typos
- repo: https://github.com/PyCQA/isort
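
With the rev bumped, pre-commit installs the newer typos release automatically on the next run; a quick local check might look like this, assuming pre-commit is installed in the dev environment:

# Sketch: run only the crate-ci/typos hook across the repository.
pre-commit run typos --all-files
# A rev bump like this one is typically produced with:
#   pre-commit autoupdate --repo https://github.com/crate-ci/typos
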
6 changes: 4 additions & 2 deletions CMakeLists.txt
@@ -45,8 +45,8 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1
# requirements.txt files and should be kept consistent. The ROCm torch
# versions are derived from docker/Dockerfile.rocm
#
set(TORCH_SUPPORTED_VERSION_CUDA "2.7.1")
set(TORCH_SUPPORTED_VERSION_ROCM "2.7.0")
set(TORCH_SUPPORTED_VERSION_CUDA "2.8.0")
set(TORCH_SUPPORTED_VERSION_ROCM "2.8.0")

#
# Try to find python package with an executable that exactly matches
@@ -541,6 +541,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND FP4_ARCHS)
set(SRCS
"csrc/quantization/fp4/nvfp4_quant_kernels.cu"
"csrc/quantization/fp4/activation_nvfp4_quant_fusion_kernels.cu"
"csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
@@ -559,6 +560,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND FP4_ARCHS)
set(SRCS
"csrc/quantization/fp4/nvfp4_quant_kernels.cu"
"csrc/quantization/fp4/activation_nvfp4_quant_fusion_kernels.cu"
"csrc/quantization/fp4/nvfp4_experts_quant.cu"
"csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu"
"csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu")
1 change: 1 addition & 0 deletions README.md
@@ -19,6 +19,7 @@ Easy, fast, and cheap LLM serving for everyone
*Latest News* 🔥

- [2025/08] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg) focusing on building, developing, and integrating with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH).
- [2025/08] We hosted [vLLM Korea Meetup](https://luma.com/cgcgprmh) with Red Hat and Rebellions! We shared the latest advancements in vLLM along with project spotlights from the vLLM Korea community. Please find the meetup slides [here](https://drive.google.com/file/d/1bcrrAE1rxUgx0mjIeOWT6hNe2RefC5Hm/view).
- [2025/08] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/dgkWg1WFpWGO2jCdTqQHxA) focusing on large-scale LLM deployment! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF) and the recording [here](https://www.chaspark.com/#/live/1166916873711665152).
- [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/).
- [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).
1 change: 0 additions & 1 deletion benchmarks/benchmark_throughput.py
@@ -96,7 +96,6 @@ def run_vllm(
end = time.perf_counter()
else:
assert lora_requests is None, "BeamSearch API does not support LoRA"
prompts = [request.prompt for request in requests]
# output_len should be the same for all requests.
output_len = requests[0].expected_output_len
for request in requests: