Closed
Changes from all commits
55 commits
bc964ec
Initial Commit GPT-OSS
hlahkar Oct 9, 2025
6bfaf28
Add Naive Attention
hlahkar Oct 17, 2025
4a4f5ea
[Bugfix] Fix bucketing UT (#367)
kzawora-intel Oct 9, 2025
7959616
[GITHUB ACTION] Remove commits comparison so we can rerun (#373)
xuechendi Oct 9, 2025
96aaa3b
[CI] Set seeds for e2e tests (#368)
kzawora-intel Oct 9, 2025
aa34036
Fix dp padding after upstream change #25768 (#362)
wuxun-zhang Oct 9, 2025
340aa0b
Create LICENSE (#379)
kzawora-intel Oct 10, 2025
1d72e52
Change to starting page and installation (#371)
PatrykWo Oct 10, 2025
d365704
[FIX_FOR_VLLM_LATEST] Fix upstream crash introduced by #24486 + #2492…
iboiko-habana Oct 10, 2025
a5c694f
Enable Parallel Compilation feature for compile mode by default (#370)
jwieczorekhabana Oct 13, 2025
b8d38e0
[SW-239226] Adjust junit xml filenames for retry mechanism (#382)
tlipinski1337 Oct 13, 2025
6b42a79
Docs installation, quick start and build fixes (#384)
PatrykWo Oct 13, 2025
940af0f
Correct htexp._data_ptr utility (#387)
xinyu-intel Oct 13, 2025
a48630f
ray: pin ray to <2.49.0 (#386)
xinyu-intel Oct 13, 2025
b9d407d
[FIX_FOR_VLLM_LATEST] Fix #24172, [Refactor]: Use M-RoPE interface di…
iboiko-habana Oct 13, 2025
8d6e12a
[Bugfix] Fix min linear decode value (#391)
adobrzyn Oct 13, 2025
4851cd5
[SW-241908] Omit all prompt buckets that exceed max_num_batched_token…
skavulya Oct 14, 2025
1bc0606
Experimental - fatal errro from 0.12 release (#398)
adobrzyn Oct 14, 2025
469054d
Port: [Docs] CI failures chapter (#276) (#389)
adobrzyn Oct 14, 2025
c4c14b7
Fix issue with async_scheduling when dealing with chunked input (#360)
tianmu-li Oct 14, 2025
8d5ceec
nixl: support mla kvcache transfer (#403)
xinyu-intel Oct 14, 2025
a4e92c6
Unified Attention Accuracy Bugfixes (#393)
kzawora-intel Oct 15, 2025
492fdf7
Minor optimizationm for bucketing calc (#395)
michalkuligowski Oct 15, 2025
247b8c7
Fix linear assert (#401)
kamil-kaczor Oct 15, 2025
fb87a5e
Enviroment logs - disable prefix caching with conti pa + add vllm brn…
adobrzyn Oct 15, 2025
71e6283
[FIX_FOR_VLLM_LATEST] Upstream vllm fixes for #26355 and #26737 (#407)
iboiko-habana Oct 15, 2025
1faed1c
Cherrypick cd docker fixes/commits from v0.10.2 to main v0.11.0 (#341)
nngokhale Oct 16, 2025
a385e03
Unit test for prefix caching in Gaudi plugin (#349)
iirzynska Oct 16, 2025
8fbdf6c
Add missing prompt bucket to warmup, when max_ctx is 0 (#352)
iboiko-habana Oct 16, 2025
f976f23
Unified attention improvemets (#363)
adobrzyn Oct 16, 2025
d42c1d2
[NIXL][BUGFIX][Gaudi2Gaudi accuracy] use 4d kv_cache for nixl_connect…
xuechendi Oct 16, 2025
adfff3a
Multi-image generation CI tests (#377)
MohitIntel Oct 16, 2025
5efe6a2
[FIX_FOR_VLLM_LATEST] Fix for Separate out vllm.utils.collections #26…
iboiko-habana Oct 16, 2025
d47f341
Add fp8 calibration procedure (#309)
afierka-intel Oct 17, 2025
5883300
[FIX_FOR_VLLM_LATEST] Fix for #27022 (#418)
adobrzyn Oct 18, 2025
c29a226
[CI]unified attn is too easy to fail, add small RTOL (#422)
xuechendi Oct 18, 2025
d0492f2
Update supported_features.md (#180)
mgawarkiewicz-intel Oct 20, 2025
4065723
[FIX_FOR_VLLM_LATEST] Fixes for upstream #26908 and #27143 and #27169…
iboiko-habana Oct 20, 2025
e07b23f
[NIXL]Enable prefill TP < Decode TP with host_buffer (#421)
xuechendi Oct 20, 2025
10ca8c7
Fix typo in installation.md: correct script name to install_nixl.py (…
yafshar Oct 20, 2025
b91c957
[SW-242466] Update not_over_max_model_len filter to fix warmup perf r…
skavulya Oct 21, 2025
f8f2827
Docs update post v0.11 (#428)
PatrykWo Oct 21, 2025
e624381
[FIX_FOR_VLLM_LATEST] Fix for #26440 (#442)
iboiko-habana Oct 22, 2025
4c195c8
[main] Defragmenter warmup accuracy workaround (#436)
kzawora-intel Oct 22, 2025
46f9ad8
Update docs: Quickstart - Executing inference (#410)
pawel-olejniczak Oct 22, 2025
272c110
[Security] Update requirements.txt (#443) (#445)
afierka-intel Oct 22, 2025
9b9eddd
[GITHUB ACTION] Always run same job to same node (#450)
xuechendi Oct 22, 2025
2535e05
reuse DP allgather tensor across layers (#415)
wuxun-zhang Oct 22, 2025
02e40f8
Update test case
hlahkar Oct 23, 2025
b4aed97
Support DP for unified attention (#242)
wuxun-zhang Oct 22, 2025
943852a
[Linear warmup] Default values optimization (#426)
adobrzyn Oct 23, 2025
d84e734
Buckets from file - alpha version (#375)
adobrzyn Oct 23, 2025
eb5f491
Fix math log2 exponential bucket error if max_model_len <= block_size…
skavulya Oct 23, 2025
11dc18e
Fix requirements filtering in HPU Dockerfiles (#419)
jakub-sochacki Oct 23, 2025
38170ff
View to be shape agnostic
hlahkar Oct 24, 2025
38 changes: 26 additions & 12 deletions .cd/Dockerfile.rhel.tenc.pytorch.vllm
@@ -13,8 +13,9 @@ ARG TORCH_TYPE_SUFFIX
FROM ${DOCKER_URL}/${VERSION}/${BASE_NAME}/${REPO_TYPE}/pytorch-${TORCH_TYPE_SUFFIX}installer-${PT_VERSION}:${REVISION}

# Parameterize commit/branch for vllm-fork checkout
ARG VLLM_GAUDI_COMMIT=v0.10.1
ARG VLLM_PROJECT_COMMIT=v0.10.1
ARG VLLM_GAUDI_COMMIT=main
# leave empty to use last-good-commit-for-vllm-gaudi
ARG VLLM_PROJECT_COMMIT=

ARG BASE_NAME
ENV BASE_NAME=${BASE_NAME}
@@ -39,23 +40,36 @@ ENV VLLM_PATH=/workspace/vllm-project
ENV VLLM_PATH2=/workspace/vllm-gaudi

# Clone the vllm-project repository and install inside the container
RUN mkdir -p $VLLM_PATH && \
# --- START: COMBINED RUN COMMAND ---
RUN \
# Clone vllm-gaudi and get the commit hash for the vllm-project/vllm
set -e && \
mkdir -p $VLLM_PATH2 && \
git clone https://github.com/vllm-project/vllm-gaudi.git $VLLM_PATH2 && \
cd $VLLM_PATH2 && \
if [ -z "${VLLM_PROJECT_COMMIT}" ]; then \
VLLM_PROJECT_COMMIT=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null) && \
echo "Found vLLM commit hash: ${VLLM_PROJECT_COMMIT}"; \
else \
echo "Using vLLM commit : ${VLLM_PROJECT_COMMIT}"; \
fi && \
mkdir -p $VLLM_PATH && \
# Clone vllm-project/vllm and use configured or last good commit hash
git clone https://github.com/vllm-project/vllm.git $VLLM_PATH && \
cd $VLLM_PATH && \
git remote add upstream https://github.com/vllm-project/vllm.git && \
git fetch upstream --tags || true && \
git checkout ${VLLM_PROJECT_COMMIT} && \
bash -c "pip install -r <(sed '/^[torch]/d' requirements/build.txt)" && \
VLLM_TARGET_DEVICE=empty pip install --no-build-isolation -e .

# Clone the vllm-gaudi repository and install inside the container
RUN mkdir -p $VLLM_PATH2 && \
git clone https://github.com/vllm-project/vllm-gaudi.git $VLLM_PATH2 && \
# Install vllm-project/vllm
bash -c "pip install -r <(sed '/^torch/d' requirements/build.txt)" && \
VLLM_TARGET_DEVICE=empty pip install --no-build-isolation . && \
# Install vllm-gaudi plugin
cd $VLLM_PATH2 && \
git checkout ${VLLM_GAUDI_COMMIT} && \
VLLM_TARGET_DEVICE=hpu && pip install -v -e $VLLM_PATH2
VLLM_TARGET_DEVICE=hpu pip install -v . --no-build-isolation
# --- END: COMBINED RUN COMMAND ---

# to be enabled later PWolsza
# to be enabled later PWolsza
# RUN pip3 install -v -e $VLLM_PATH/tests/vllm_test_utils

# Install additional Python packages
@@ -70,4 +84,4 @@ COPY benchmark /root/scripts/benchmark/
WORKDIR /root/scripts

# Set entrypoint script
ENTRYPOINT ["python3", "-m", "entrypoints.entrypoint_main"]
ENTRYPOINT ["python3", "-m", "entrypoints.entrypoint_main"]
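The key change here (mirrored in the Ubuntu Dockerfile below) is that VLLM_PROJECT_COMMIT may now be left empty, in which case the build reads the pinned vLLM commit from the VLLM_STABLE_COMMIT file on vllm-gaudi's vllm/last-good-commit-for-vllm-gaudi branch. A minimal Python sketch of that resolution step, handy for checking locally which commit a build would pick up (the repo path in the usage line is a placeholder, not taken from this PR):

# Sketch of the Dockerfile's commit-resolution logic, assuming a local
# clone of vllm-gaudi with its "origin" remote fetched.
import subprocess

def resolve_vllm_commit(vllm_gaudi_repo, override=""):
    # Mirror the RUN step: use the explicit commit if given, otherwise
    # read VLLM_STABLE_COMMIT from the last-good-commit branch.
    if override:
        return override
    result = subprocess.run(
        ["git", "show",
         "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT"],
        cwd=vllm_gaudi_repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Example (placeholder path):
# print(resolve_vllm_commit("/workspace/vllm-gaudi"))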
42 changes: 26 additions & 16 deletions .cd/Dockerfile.ubuntu.pytorch.vllm
@@ -4,17 +4,17 @@
# Parameterize base image components
ARG DOCKER_URL=vault.habana.ai/gaudi-docker
ARG VERSION=1.22.0
ARG BASE_NAME=ubuntu22.04
ARG BASE_NAME=ubuntu24.04
ARG PT_VERSION=2.7.1
ARG REVISION=latest
ARG REPO_TYPE=habanalabs

FROM ${DOCKER_URL}/${VERSION}/${BASE_NAME}/${REPO_TYPE}/pytorch-installer-${PT_VERSION}:${REVISION}

# Parameterize commit/branch for vllm-project & vllm-gaudi checkout
ARG VLLM_GAUDI_COMMIT=v0.10.2_next
ARG VLLM_PROJECT_COMMIT=v0.10.2

ARG VLLM_GAUDI_COMMIT=main
# leave empty to use last-good-commit-for-vllm-gaudi
ARG VLLM_PROJECT_COMMIT=
ENV OMPI_MCA_btl_vader_single_copy_mechanism=none

RUN apt update && \
@@ -30,24 +30,34 @@ RUN echo "dash dash/sh boolean false" | debconf-set-selections && \
ENV ENV=~/.profile

# Clone the vllm-project repository and install inside the container

RUN mkdir -p $VLLM_PATH && \
# --- START: COMBINED RUN COMMAND ---
RUN \
# Clone vllm-gaudi and get the commit hash for the vllm-project/vllm
set -e && \
mkdir -p $VLLM_PATH2 && \
git clone https://github.com/vllm-project/vllm-gaudi.git $VLLM_PATH2 && \
cd $VLLM_PATH2 && \
if [ -z "${VLLM_PROJECT_COMMIT}" ]; then \
VLLM_PROJECT_COMMIT=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null) && \
echo "Found vLLM commit hash: ${VLLM_PROJECT_COMMIT}"; \
else \
echo "Using vLLM commit : ${VLLM_PROJECT_COMMIT}"; \
fi && \
mkdir -p $VLLM_PATH && \
# Clone vllm-project/vllm and use configured or last good commit hash
git clone https://github.com/vllm-project/vllm.git $VLLM_PATH && \
cd $VLLM_PATH && \
git remote add upstream https://github.com/vllm-project/vllm.git && \
git fetch upstream --tags || true && \
git checkout ${VLLM_PROJECT_COMMIT} && \
bash -c "pip install -r <(sed '/^[torch]/d' requirements/build.txt)" && \
VLLM_TARGET_DEVICE=empty pip install --no-build-isolation .

# Clone the vllm-gaudi repository and install inside the container

RUN mkdir -p $VLLM_PATH2 && \
git clone https://github.com/vllm-project/vllm-gaudi.git $VLLM_PATH2 && \
# Install vllm-project/vllm
bash -c "pip install -r <(sed '/^torch/d' requirements/build.txt)" && \
VLLM_TARGET_DEVICE=empty pip install --no-build-isolation . && \
# Install vllm-gaudi plugin
cd $VLLM_PATH2 && \
# Comment: enable if vllm-gaudi release version is used otherwise main
git checkout ${VLLM_GAUDI_COMMIT} && \
VLLM_TARGET_DEVICE=hpu && pip install -v $VLLM_PATH2 --no-build-isolation
git checkout ${VLLM_GAUDI_COMMIT} && \
VLLM_TARGET_DEVICE=hpu pip install -v . --no-build-isolation
# --- END: COMBINED RUN COMMAND ---

# Install additional Python packages
RUN pip install datasets && \
2 changes: 1 addition & 1 deletion .cd/Dockerfile.ubuntu.pytorch.vllm.nixl.latest
@@ -45,7 +45,7 @@ RUN \
git remote add upstream https://github.com/vllm-project/vllm.git && \
git fetch upstream --tags || true && \
git checkout ${VLLM_COMMIT_HASH} && \
pip install -r <(sed '/^[torch]/d' requirements/build.txt) && \
pip install -r <(sed '/^torch/d' requirements/build.txt) && \
VLLM_TARGET_DEVICE=empty pip install --no-build-isolation . && \
\
# Install vllm-gaudi
4 changes: 1 addition & 3 deletions .cd/benchmark/benchmark_defaults.yaml
@@ -29,12 +29,10 @@ model_text:

model_vision:
MODELS:
- meta-llama/Llama-3.2-11B-Vision-Instruct
- meta-llama/Llama-3.2-90B-Vision-Instruct
- Qwen/Qwen2.5-VL-7B-Instruct
DATASET: lmarena-ai/vision-arena-bench-v0.1
DATASET_NAME: hf
BACKEND: openai-chat
ENDPOINT: /v1/chat/completions
CONCURRENT_REQ: 64
NUM_PROMPTS: 500
NUM_PROMPTS: 500
6 changes: 0 additions & 6 deletions .cd/benchmark/benchmark_scenarios_vision.yaml
@@ -1,8 +1,2 @@
llama32-11B-Vision-Instruct:
MODEL: meta-llama/Llama-3.2-11B-Vision-Instruct

llama32-90B-Vision-Instruct:
MODEL: meta-llama/Llama-3.2-90B-Vision-Instruct

qwen2.5-vl-7b-instruct:
MODEL: Qwen/Qwen2.5-VL-7B-Instruct
1 change: 1 addition & 0 deletions .cd/entrypoints/entrypoint_main.py
@@ -190,6 +190,7 @@ def run(self):
output_script_path="vllm_server.sh",
variables=variables,
log_dir="logs",
varlist_conf_path="server/server_output.env",
).create_and_run()
elif self.mode == "benchmark":
print("[INFO] Starting container in benchmark mode.")
14 changes: 12 additions & 2 deletions .cd/entrypoints/script_generator.py
@@ -4,8 +4,9 @@

class ScriptGenerator:

def __init__(self, template_script_path, output_script_path, variables, log_dir="logs"):
def __init__(self, template_script_path, output_script_path, variables, log_dir="logs", varlist_conf_path=None):
self.template_script_path = template_script_path
self.varlist_conf_path = varlist_conf_path
self.output_script_path = output_script_path
self.variables = variables
self.log_dir = log_dir
@@ -19,7 +20,16 @@ def generate_script(self, vars_dict):
"""
with open(self.template_script_path) as f:
template = f.read()
export_lines = "\n".join([f"export {k}={v}" for k, v in vars_dict.items()])
# Create our output list
if self.varlist_conf_path:
output_dict = {}
with open(self.varlist_conf_path) as var_file:
for line in var_file:
param = line.strip()
output_dict[param] = vars_dict[param]
export_lines = "\n".join([f"export {k}={v}" for k, v in output_dict.items()])
else:
export_lines = "\n".join([f"export {k}={v}" for k, v in vars_dict.items()])
script_content = template.replace("#@VARS", export_lines)
with open(self.output_script_path, 'w') as f:
f.write(script_content)
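Taken together with the entrypoint change above (which passes varlist_conf_path="server/server_output.env"), generate_script now exports only the variables named in that file instead of every entry in vars_dict. A self-contained sketch of the same filtering idea follows; the blank-line guard, the missing-name skip, and the sample values are additions for illustration, not part of the PR:

# Sketch: build export lines for only the variables listed in a varlist
# file (one variable name per line), falling back to all variables.
def build_export_lines(vars_dict, varlist_path=None):
    if varlist_path:
        with open(varlist_path) as var_file:
            wanted = [line.strip() for line in var_file if line.strip()]
        # Missing names are skipped here; the PR's version assumes every
        # listed name is present in vars_dict.
        selected = {name: vars_dict[name] for name in wanted if name in vars_dict}
    else:
        selected = vars_dict
    return "\n".join(f"export {k}={v}" for k, v in selected.items())

# Illustrative usage with made-up values:
# print(build_export_lines(
#     {"MODEL": "meta-llama/Llama-3.1-8B-Instruct", "BLOCK_SIZE": 128},
#     varlist_path="server/server_output.env"))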
60 changes: 60 additions & 0 deletions .cd/server/server_output.env
@@ -0,0 +1,60 @@
MODEL
DTYPE
DEVICE_NAME
TENSOR_PARALLEL_SIZE
MAX_MODEL_LEN
TOTAL_GPU_MEM
MODEL_DTYPE
QUANT_DTYPE
BLOCK_SIZE
VLLM_PROMPT_BS_BUCKET_MIN
VLLM_PROMPT_BS_BUCKET_STEP
VLLM_DECODE_BS_BUCKET_MIN
VLLM_DECODE_BS_BUCKET_STEP
VLLM_PROMPT_SEQ_BUCKET_MIN
VLLM_PROMPT_SEQ_BUCKET_STEP
VLLM_DECODE_BLOCK_BUCKET_MIN
VLLM_DECODE_BLOCK_BUCKET_STEP
MAX_NUM_PREFILL_SEQS
NUM_HIDDEN_LAYERS
HIDDEN_SIZE
NUM_KEY_VALUE_HEADS
NUM_ATTENTION_HEADS
CACHE_DTYPE_BYTES
LIMIT_MODEL_LEN
PT_HPU_LAZY_MODE
VLLM_DELAYED_SAMPLING
VLLM_SKIP_WARMUP
EXPERIMENTAL_WEIGHT_SHARING
VLLM_EXPONENTIAL_BUCKETING
MAX_NUM_BATCHED_TOKENS
PT_HPU_ENABLE_LAZY_COLLECTIVES
DEVICE_HPU_MEM
MODEL_MEM_IN_GB
USABLE_MEM
GPU_MEM_UTILIZATION
KV_CACHE_PER_SEQ
EST_MAX_NUM_SEQS
EST_HPU_BLOCKS
DECODE_BS_RAMP_GRAPHS
DECODE_BS_STEP_GRAPHS
DECODE_BLOCK_RAMP_GRAPHS
DECODE_BLOCK_STEP_GRAPHS
NUM_DECODE_GRAPHS
PROMPT_BS_RAMP_GRAPHS
PROMPT_BS_STEP_GRAPHS
PROMPT_SEQ_RAMP_GRAPHS
PROMPT_SEQ_STEP_GRAPHS
EST_NUM_PROMPT_GRAPHS
EST_GRAPH_PROMPT_RATIO
VLLM_GRAPH_PROMPT_RATIO
DECODE_GRAPH_TARGET_GB
EST_GRAPH_RESERVE_MEM
VLLM_GRAPH_RESERVED_MEM
KV_CACHE_MEM
MAX_NUM_SEQS
VLLM_PROMPT_SEQ_BUCKET_MAX
VLLM_CONTIGUOUS_PA
VLLM_DEFRAG
ASYNC_SCHEDULING
VLLM_WEIGHT_LOAD_FORCE_SYNC
1 change: 1 addition & 0 deletions .cd/server/server_user.env
@@ -10,3 +10,4 @@ MAX_NUM_SEQS
TENSOR_PARALLEL_SIZE
VLLM_EXPONENTIAL_BUCKETING
GPU_MEM_UTILIZATION
ASYNC_SCHEDULING
40 changes: 19 additions & 21 deletions .cd/server/settings_vllm.csv
@@ -1,21 +1,19 @@
MODEL,TENSOR_PARALLEL_SIZE,MAX_MODEL_LEN,TOTAL_GPU_MEM,UNAVAILABLE_MEM_ABS,MODEL_MEM_FROM_CONFIG,MODEL_DTYPE,QUANT_DTYPE,MODEL_MEM,PROFILER_MEM_OVERHEAD,APPROX_MEM_PER_GRAPH_MB,fsdpa,GPU_FREE_MEM_TARGET,BLOCK_SIZE,VLLM_PROMPT_BS_BUCKET_MIN,VLLM_PROMPT_BS_BUCKET_STEP,VLLM_DECODE_BS_BUCKET_MIN,VLLM_DECODE_BS_BUCKET_STEP,VLLM_PROMPT_SEQ_BUCKET_MIN,VLLM_PROMPT_SEQ_BUCKET_STEP,VLLM_DECODE_BLOCK_BUCKET_MIN,VLLM_DECODE_BLOCK_BUCKET_STEP,MAX_NUM_PREFILL_SEQS,NUM_HIDDEN_LAYERS,HIDDEN_SIZE,NUM_KEY_VALUE_HEADS,NUM_ATTENTION_HEADS,CACHE_DTYPE_BYTES,LIMIT_MODEL_LEN,PT_HPU_LAZY_MODE,VLLM_DELAYED_SAMPLING,VLLM_SKIP_WARMUP,EXPERIMENTAL_WEIGHT_SHARING,VLLM_EXPONENTIAL_BUCKETING,MAX_NUM_BATCHED_TOKENS
meta-llama/Llama-3.1-8B-Instruct,1,4352,128,2,16060522496,2,2,14.95752716,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,32,4096,8,32,2,131072,1,TRUE,FALSE,0,FALSE,2048
meta-llama/Llama-3.1-70B-Instruct,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,20,1,1,128,1,32,1,32,128,256,128,256,1,80,8192,8,64,2,131072,1,TRUE,FALSE,0,FALSE,2048
meta-llama/Llama-3.3-70B-Instruct,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,20,1,1,128,1,32,1,32,128,256,128,256,1,80,8192,8,64,2,131072,1,TRUE,FALSE,0,FALSE,2048
meta-llama/Llama-3.2-1B-Instruct,1,4352,128,2,2471645608,2,2,2.301899351,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,16,2048,8,32,2,131072,1,TRUE,FALSE,0,FALSE,2048
meta-llama/Llama-3.2-3B-Instruct,1,4352,128,2,6425499648,2,2,5.984212875,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,28,3072,8,24,2,131072,1,TRUE,FALSE,0,FALSE,2048
mistralai/Mixtral-8x7B-Instruct-v0.1,2,4352,256,2,93405585408,2,2,86.99073029,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,32,4096,8,32,2,32768,1,TRUE,FALSE,0,FALSE,2048
mistralai/Mixtral-8x22B-Instruct-v0.1,4,4352,512,2,2.8126E+11,2,2,261.9439201,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,56,6144,8,48,2,65536,1,TRUE,FALSE,0,FALSE,2048
mistralai/Mistral-7B-Instruct-v0.2,1,4352,128,2,14483464192,2,2,13.48877716,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,32,4096,8,32,2,32768,1,TRUE,FALSE,0,FALSE,2048
meta-llama/Llama-3.1-405B-Instruct,8,4352,1024,2,8.11707E+11,2,2,755.9608459,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,126,16384,8,128,2,131072,1,TRUE,FALSE,0,FALSE,2048
Qwen/Qwen2.5-14B-Instruct,1,4352,128,2,29540067328,2,2,27.51133156,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,48,5120,8,40,2,32768,1,TRUE,FALSE,0,FALSE,2048
deepseek-ai/DeepSeek-R1-Distill-Llama-70B,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,20,1,1,128,1,32,1,32,128,256,128,256,1,80,8192,8,64,2,131072,1,TRUE,FALSE,0,FALSE,2048
Qwen/Qwen2.5-32B-Instruct,1,4352,128,2,65527752704,2,2,61.02747536,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,64,5120,8,40,2,32768,1,TRUE,FALSE,0,FALSE,2048
Qwen/Qwen2.5-72B-Instruct,4,4352,512,2,1.45412E+11,2,2,135.4258575,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,80,8192,8,64,2,32768,1,TRUE,FALSE,0,FALSE,2048
Qwen/Qwen2.5-7B-Instruct,1,4352,128,2,15231233024,2,2,14.18519115,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,28,3584,4,28,2,32768,1,TRUE,FALSE,0,FALSE,2048
Qwen/Qwen2.5-32B-Instruct,1,4352,128,2,65527752704,2,2,61.02747536,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,64,5120,8,40,2,32768,1,TRUE,FALSE,0,FALSE,2048
meta-llama/Llama-3.2-11B-Vision-Instruct,1,8448,128,2,21340441670,2,2,19.87483507,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,40,4096,8,32,2,131072,1,TRUE,FALSE,0,FALSE,2048
meta-llama/Llama-3.2-90B-Vision-Instruct,4,8448,512,2,177186710646,2,2,165.0179835,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,100,8192,8,64,2,131072,1,TRUE,FALSE,0,FALSE,2048
ibm-granite/granite-8b-code-instruct-4k,1,2048,128,2,21474836480,2,2,20.00000000,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,64,5120,8,40,2,32768,1,TRUE,FALSE,0,FALSE,2048
ibm-granite/granite-20b-code-instruct-8k,1,2048,128,2,53687091200,2,2,48.00000000,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,80,8192,16,80,2,65536,1,TRUE,FALSE,0,FALSE,2048
Qwen/Qwen2.5-VL-7B-Instruct,1,8448,128,2,15231233024,2,2,14.18519115,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,28,3584,4,28,2,32768,1,TRUE,FALSE,0,FALSE,2048
MODEL,TENSOR_PARALLEL_SIZE,MAX_MODEL_LEN,TOTAL_GPU_MEM,UNAVAILABLE_MEM_ABS,MODEL_MEM_FROM_CONFIG,MODEL_DTYPE,QUANT_DTYPE,MODEL_MEM,PROFILER_MEM_OVERHEAD,APPROX_MEM_PER_GRAPH_MB,fsdpa,GPU_FREE_MEM_TARGET,BLOCK_SIZE,VLLM_PROMPT_BS_BUCKET_MIN,VLLM_PROMPT_BS_BUCKET_STEP,VLLM_DECODE_BS_BUCKET_MIN,VLLM_DECODE_BS_BUCKET_STEP,VLLM_PROMPT_SEQ_BUCKET_MIN,VLLM_PROMPT_SEQ_BUCKET_STEP,VLLM_DECODE_BLOCK_BUCKET_MIN,VLLM_DECODE_BLOCK_BUCKET_STEP,MAX_NUM_PREFILL_SEQS,NUM_HIDDEN_LAYERS,HIDDEN_SIZE,NUM_KEY_VALUE_HEADS,NUM_ATTENTION_HEADS,CACHE_DTYPE_BYTES,LIMIT_MODEL_LEN,PT_HPU_LAZY_MODE,VLLM_DELAYED_SAMPLING,VLLM_SKIP_WARMUP,EXPERIMENTAL_WEIGHT_SHARING,VLLM_EXPONENTIAL_BUCKETING,MAX_NUM_BATCHED_TOKENS,VLLM_CONTIGUOUS_PA,VLLM_DEFRAG,ASYNC_SCHEDULING,VLLM_WEIGHT_LOAD_FORCE_SYNC
meta-llama/Llama-3.1-8B-Instruct,1,4352,128,2,16060522496,2,2,14.95752716,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,32,4096,8,32,2,131072,1,TRUE,FALSE,0,FALSE,2048,true,true,0,0
meta-llama/Llama-3.1-70B-Instruct,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,20,1,1,128,1,32,1,32,128,256,128,256,1,80,8192,8,64,2,131072,1,TRUE,FALSE,0,FALSE,2048,true,true,0,0
meta-llama/Llama-3.3-70B-Instruct,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,20,1,1,128,1,32,1,32,128,256,128,256,1,80,8192,8,64,2,131072,1,TRUE,FALSE,0,FALSE,2048,true,true,0,0
meta-llama/Llama-3.2-1B-Instruct,1,4352,128,2,2471645608,2,2,2.301899351,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,16,2048,8,32,2,131072,1,TRUE,FALSE,0,FALSE,2048,true,true,0,0
meta-llama/Llama-3.2-3B-Instruct,1,4352,128,2,6425499648,2,2,5.984212875,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,28,3072,8,24,2,131072,1,TRUE,FALSE,0,FALSE,2048,true,true,0,0
mistralai/Mixtral-8x7B-Instruct-v0.1,2,4352,256,2,93405585408,2,2,86.99073029,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,32,4096,8,32,2,32768,1,TRUE,FALSE,0,FALSE,2048,false,false,0,0
mistralai/Mixtral-8x22B-Instruct-v0.1,4,4352,512,2,2.8126E+11,2,2,261.9439201,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,56,6144,8,48,2,65536,1,TRUE,FALSE,0,FALSE,2048,false,false,0,0
mistralai/Mistral-7B-Instruct-v0.2,1,4352,128,2,14483464192,2,2,13.48877716,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,32,4096,8,32,2,32768,1,TRUE,FALSE,0,FALSE,2048,false,false,0,0
meta-llama/Llama-3.1-405B-Instruct,8,4352,1024,2,8.11707E+11,2,2,755.9608459,5.5,20,1,1,128,1,32,1,32,128,256,128,256,1,126,16384,8,128,2,131072,1,TRUE,FALSE,0,FALSE,2048,true,true,0,1
Qwen/Qwen2.5-14B-Instruct,1,4352,128,2,29540067328,2,2,27.51133156,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,48,5120,8,40,2,32768,1,TRUE,FALSE,0,FALSE,2048,false,false,0,0
deepseek-ai/DeepSeek-R1-Distill-Llama-70B,4,4352,512,2,1.41107E+11,2,2,131.4165192,5.5,20,1,1,128,1,32,1,32,128,256,128,256,1,80,8192,8,64,2,131072,1,TRUE,FALSE,0,FALSE,2048,true,true,0,0
Qwen/Qwen2.5-32B-Instruct,1,4352,128,2,65527752704,2,2,61.02747536,5.5,10,1,1,128,1,32,1,32,128,256,128,256,1,64,5120,8,40,2,32768,1,TRUE,FALSE,0,FALSE,2048,false,false,0,0
Qwen/Qwen2.5-72B-Instruct,4,4352,512,2,1.45412E+11,2,2,135.4258575,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,80,8192,8,64,2,32768,1,TRUE,FALSE,0,FALSE,2048,false,false,0,0
Qwen/Qwen2.5-7B-Instruct,1,4352,128,2,15231233024,2,2,14.18519115,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,28,3584,4,28,2,32768,1,TRUE,FALSE,0,FALSE,2048,false,false,0,0
Qwen/Qwen2.5-32B-Instruct,1,4352,128,2,65527752704,2,2,61.02747536,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,64,5120,8,40,2,32768,1,TRUE,FALSE,0,FALSE,2048,false,false,0,0
ibm-granite/granite-8b-code-instruct-4k,1,4096,128,2,21474836480,2,2,20.00000000,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,36,4096,8,32,2,32768,1,TRUE,FALSE,0,FALSE,2048,true,true,0,0
ibm-granite/granite-20b-code-instruct-8k,1,4352,128,2,53687091200,2,2,48.00000000,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,52,6144,1,48,2,65536,1,TRUE,FALSE,0,FALSE,2048,true,true,0,0
Qwen/Qwen2.5-VL-7B-Instruct,1,8448,128,2,15231233024,2,2,14.18519115,5.5,10,0,3,128,1,32,1,32,128,256,128,256,1,28,3584,4,28,2,32768,1,TRUE,FALSE,0,FALSE,2048,false,false,0,0
8 changes: 6 additions & 2 deletions .cd/server/vllm_autocalc_rules.py
@@ -82,8 +82,12 @@ def calc_DECODE_BLOCK_STEP_GRAPHS(ctx):

def calc_NUM_DECODE_GRAPHS(ctx):
# 3d update
return ((ctx['DECODE_BS_RAMP_GRAPHS'] + ctx['DECODE_BS_STEP_GRAPHS']) *
(ctx['DECODE_BLOCK_RAMP_GRAPHS'] + ctx['DECODE_BLOCK_STEP_GRAPHS'])) / 2
decode_graphs = ((ctx['DECODE_BS_RAMP_GRAPHS'] + ctx['DECODE_BS_STEP_GRAPHS']) *
(ctx['DECODE_BLOCK_RAMP_GRAPHS'] + ctx['DECODE_BLOCK_STEP_GRAPHS']))
if ctx['VLLM_CONTIGUOUS_PA']:
return decode_graphs
else:
return decode_graphs / 2


def calc_PROMPT_BS_RAMP_GRAPHS(ctx):
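The updated rule makes the decode-graph estimate depend on VLLM_CONTIGUOUS_PA: with contiguous paged attention enabled, the full batch-size x block-bucket product is counted; otherwise it is halved as before. A standalone sketch with an invented context for illustration:

# Sketch of the updated decode-graph estimate: the halving only applies
# when contiguous PA is disabled.
def num_decode_graphs(ctx):
    decode_graphs = ((ctx['DECODE_BS_RAMP_GRAPHS'] + ctx['DECODE_BS_STEP_GRAPHS']) *
                     (ctx['DECODE_BLOCK_RAMP_GRAPHS'] + ctx['DECODE_BLOCK_STEP_GRAPHS']))
    return decode_graphs if ctx['VLLM_CONTIGUOUS_PA'] else decode_graphs / 2

# Invented example: (5 + 3) * (7 + 4) = 88 graphs with contiguous PA,
# 44.0 when it is disabled.
# ctx = {'DECODE_BS_RAMP_GRAPHS': 5, 'DECODE_BS_STEP_GRAPHS': 3,
#        'DECODE_BLOCK_RAMP_GRAPHS': 7, 'DECODE_BLOCK_STEP_GRAPHS': 4,
#        'VLLM_CONTIGUOUS_PA': True}
# print(num_decode_graphs(ctx))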
4 changes: 2 additions & 2 deletions .cd/templates/template_vllm_benchmark.sh
@@ -3,7 +3,7 @@
#@VARS

# Wait for vLLM server to be ready
until curl -s http://localhost:8000${ENDPOINT} > /dev/null; do
until curl -s http://localhost:8000/v1/models > /dev/null; do
echo "Waiting for vLLM server to be ready..."
sleep 15
done
@@ -35,4 +35,4 @@ vllm bench serve \
--metric-percentiles 90 \
--ignore-eos \
--trust-remote-code \
2>&1 | tee -a logs/perftest_inp${INPUT_TOK}_out${OUTPUT_TOK}_user${CONCURRENT_REQ}.log
2>&1 | tee -a logs/perftest_inp${INPUT_TOK}_out${OUTPUT_TOK}_user${CONCURRENT_REQ}.log
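The readiness probe now polls the fixed /v1/models endpoint instead of the benchmark's ${ENDPOINT}. For reference, a hedged Python equivalent of that wait loop (URL and the 15-second interval are taken from the template; everything else is illustrative):

# Sketch: block until the vLLM server responds on /v1/models, mirroring
# the curl loop in the benchmark template.
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://localhost:8000/v1/models", interval=15):
    while True:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, ConnectionError):
            pass
        print("Waiting for vLLM server to be ready...")
        time.sleep(interval)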
8 changes: 7 additions & 1 deletion .cd/templates/template_vllm_server.sh
@@ -2,6 +2,10 @@

#@VARS

if [ $ASYNC_SCHEDULING -gt 0 ]; then # Checks if using async scheduling
EXTRA_ARGS+=" --async_scheduling"
fi

## Start server
vllm serve $MODEL \
--block-size $BLOCK_SIZE \
@@ -11,5 +15,7 @@ vllm serve $MODEL \
--max-model-len $MAX_MODEL_LEN \
--gpu-memory-utilization $GPU_MEM_UTILIZATION \
--max-num-seqs $MAX_NUM_SEQS \
--disable-log-requests \
--generation-config vllm \
--max_num_batched_tokens $MAX_NUM_BATCHED_TOKENS \
--disable-log-requests ${EXTRA_ARGS} \
2>&1 | tee -a logs/vllm_server.log