Benchmark HF optimum-executorch #11450

Merged

merged 7 commits on Jun 12, 2025

12 changes: 7 additions & 5 deletions .ci/scripts/gather_benchmark_configs.py
@@ -32,7 +32,8 @@
BENCHMARK_CONFIGS = {
"xplat": [
"xnnpack_q8",
"hf_xnnpack_fp32",
"hf_xnnpack_custom_spda_kv_cache_8da4w",
"et_xnnpack_custom_spda_kv_cache_8da4w",
"llama3_fb16",
"llama3_spinquant",
"llama3_qlora",
@@ -129,25 +130,26 @@ def generate_compatible_configs(model_name: str, target_os=None) -> List[str]:
"""
configs = []
if is_valid_huggingface_model_id(model_name):
configs.append("hf_xnnpack_custom_spda_kv_cache_8da4w")
if model_name.startswith("meta-llama/"):
# LLaMA models
# etLLM recipes for Llama
repo_name = model_name.split("meta-llama/")[1]
if "qlora" in repo_name.lower():
configs.append("llama3_qlora")
elif "spinquant" in repo_name.lower():
configs.append("llama3_spinquant")
else:
configs.append("llama3_fb16")
configs.append("et_xnnpack_custom_spda_kv_cache_8da4w")
configs.extend(
[
config
for config in BENCHMARK_CONFIGS.get(target_os, [])
if config.startswith("llama")
]
)
else:
# Non-LLaMA models
configs.append("hf_xnnpack_fp32")
if model_name.startswith("Qwen/Qwen3"):
configs.append("et_xnnpack_custom_spda_kv_cache_8da4w")
elif model_name in MODEL_NAME_TO_MODEL:
# ExecuTorch in-tree non-GenAI models
configs.append("xnnpack_q8")
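
For illustration, a minimal usage sketch of the updated selection logic above (the import path is an assumption, and is_valid_huggingface_model_id may query the Hugging Face Hub, so network access is assumed):

# Hypothetical sketch based on the diff above; assumes .ci/scripts is on sys.path.
from gather_benchmark_configs import generate_compatible_configs

# Every valid Hugging Face model ID now gets the optimum-executorch recipe,
# and Qwen3 additionally gets the etLLM recipe.
qwen_configs = generate_compatible_configs("Qwen/Qwen3-0.6B", target_os="android")
assert "hf_xnnpack_custom_spda_kv_cache_8da4w" in qwen_configs
assert "et_xnnpack_custom_spda_kv_cache_8da4w" in qwen_configs

# Plain Llama repos keep the etLLM recipes (llama3_fb16 plus the llama* configs
# for the target OS) in addition to the new HF recipe.
llama_configs = generate_compatible_configs("meta-llama/Llama-3.2-1B", target_os="android")
assert "llama3_fb16" in llama_configs
assert "et_xnnpack_custom_spda_kv_cache_8da4w" in llama_configs
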
6 changes: 3 additions & 3 deletions .github/workflows/android-perf-private-device-experiment.yml
@@ -18,7 +18,7 @@ on:
description: Models to be benchmarked
required: false
type: string
default: mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8
default: google/gemma-3-1b-it,Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,meta-llama/Llama-3.2-1B,allenai/OLMo-1B-hf
devices:
description: Target devices to run benchmark
required: false
@@ -34,7 +34,7 @@ on:
description: Models to be benchmarked
required: false
type: string
default: mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8
default: google/gemma-3-1b-it,Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,meta-llama/Llama-3.2-1B,allenai/OLMo-1B-hf
devices:
description: Target devices to run benchmark
required: false
@@ -57,6 +57,6 @@ jobs:
id-token: write
contents: read
with:
models: ${{ inputs.models || 'mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8' }}
models: ${{ inputs.models || 'Qwen/Qwen3-0.6B' }}
devices: samsung_galaxy_s22_private
benchmark_configs: ${{ inputs.benchmark_configs }}
95 changes: 82 additions & 13 deletions .github/workflows/android-perf.yml
@@ -70,7 +70,7 @@ jobs:
# Separate default values from the workflow dispatch. To ensure defaults are accessible
# during scheduled runs and to provide flexibility for different defaults between
# on-demand and periodic benchmarking.
CRON_DEFAULT_MODELS: ${{ github.event_name == 'schedule' && 'llama,mv3,mv2,ic4,ic3,resnet50,edsr,mobilebert,w2l,meta-llama/Llama-3.2-1B,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8' || 'llama' }}
CRON_DEFAULT_MODELS: ${{ github.event_name == 'schedule' && 'llama,mv3,mv2,ic4,ic3,resnet50,edsr,mobilebert,w2l,meta-llama/Llama-3.2-1B,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8,google/gemma-3-1b-it,Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,allenai/OLMo-1B-hf' || 'llama' }}
CRON_DEFAULT_DEVICES: samsung_galaxy_s22
run: |
set -eux
@@ -201,8 +201,8 @@ jobs:
HF_MODEL_REPO=${{ matrix.model }}
OUT_ET_MODEL_NAME="$(echo "$HF_MODEL_REPO" | awk -F'/' '{print $2}' | sed 's/_/-/g' | tr '[:upper:]' '[:lower:]')_${{ matrix.config }}"

# Convert HF checkpoint to ET via etLLM path
if [[ "$HF_MODEL_REPO" == meta-llama/* ]]; then
# Llama models on Hugging Face
if [[ ${{ matrix.config }} == "llama3_spinquant" ]]; then
# SpinQuant
# Download prequantized checkpoint from Hugging Face
@@ -272,6 +272,21 @@ jobs:
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
--output_name="${OUT_ET_MODEL_NAME}.pte"
ls -lh "${OUT_ET_MODEL_NAME}.pte"
elif [[ ${{ matrix.config }} == "et_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model" "params.json" "consolidated.00.pth")
python -m examples.models.llama.export_llama \
--model llama3_2 \
--checkpoint "${DOWNLOADED_PATH}/consolidated.00.pth" \
--params "${DOWNLOADED_PATH}/params.json" \
-kv \
--use_sdpa_with_kv_cache \
-d fp32 \
-X \
--xnnpack-extended-ops \
-qmode 8da4w -G 32 -E 8,0 \
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \

Contributor:

are these for llama_3_2?

Contributor Author:

@kimishpatel Yeah, for llama_3_2.

Contributor Author:

@kimishpatel @jackzhxng Can you confirm whether this is the correct config we should use to export Qwen3 via the etLLM path? The perf numbers reported here don't make sense to me: #11450 (comment)

Contributor:

I don't know for Qwen3. Can you compare the file sizes for the two? Also use --xnnpack-extended-ops

Contributor:

Oh, never mind. You are using the option I mentioned.

Contributor:

For this one I don't see an HF counterpart.

Contributor Author:

Yeah, it's for Qwen. I compared the command in this doc:

python -m examples.models.llama.export_llama \
--model qwen3-0_6b \
--params examples/models/qwen3/0_6b_config.json \
-kv \
--use_sdpa_with_kv_cache \
-d fp32 \
-X \
--xnnpack-extended-ops \
-qmode 8da4w \
--metadata '{"get_bos_id": 151644, "get_eos_ids":[151645]}' \
--output_name="qwen3-0_6b.pte" \
--verbose

The only difference is -G 32 -E 8,0. I think we will need it because we did the same embedding quant for optimum-et models.
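
For reference, a hedged sketch of that same Qwen3 export with the two extra flags added; the flag meanings in the comments are my reading of export_llama's conventions, not quoted from its docs, and the command mirrors the one this PR adds to the workflow:

# Sketch only: -G 32 is assumed to set the weight-quantization group size to 32, and
# -E 8,0 to request 8-bit embedding quantization with group size 0 (per-channel),
# matching the --qlinear/--qembedding quantization used for the optimum-et export.
python -m examples.models.llama.export_llama \
  --model qwen3-0_6b \
  --params examples/models/qwen3/0_6b_config.json \
  -kv \
  --use_sdpa_with_kv_cache \
  -d fp32 \
  -X \
  --xnnpack-extended-ops \
  -qmode 8da4w \
  -G 32 \
  -E 8,0 \
  --metadata '{"get_bos_id": 151644, "get_eos_ids":[151645]}' \
  --output_name="qwen3-0_6b.pte"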

Contributor Author:

> For this one I don't see an HF counterpart.

The HF counterpart is simply obtained via optimum-cli, down at line 365.
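
For concreteness, a hedged one-command form of that optimum-cli counterpart for Qwen3, assembled from the ARGS array this PR adds to the workflow (the output directory is a placeholder, and --use_custom_kv_cache is included because Qwen3 is not the gemma-3 HybridCache case):

optimum-cli export executorch \
  --model Qwen/Qwen3-0.6B \
  --task text-generation \
  --recipe xnnpack \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear \
  --qembedding \
  --output_dir ./qwen3_et_export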

Contributor Author:

> Can you compare the file sizes for the two?

PTE size from etLLM: 470M. See this line in CI: https://github.com/pytorch/executorch/actions/runs/15549915868/job/43778415688#step:15:12848
PTE size from optimum-et: 506M. See this line in CI: https://github.com/pytorch/executorch/actions/runs/15549915868/job/43778415678#step:15:13591

@kimishpatel The size makes sense to me. The performance doesn't: unless the PTE exported via etLLM has an incorrect config, I can't explain why optimum-et is 5x faster than etLLM. cc: @jackzhxng

Contributor:

Is there a debug build vs. release build difference for the apps? You have to try and repro this locally. If it doesn't repro locally, there is likely a benchmark infra issue.
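
One hedged way to attempt that local repro is to run both exported .pte files through the example Llama runner and compare tokens/sec. The binary path and flag names below follow the examples/models/llama runner and are assumptions rather than something verified in this PR; substitute the actual export and tokenizer paths:

# Sketch: compare decode speed of the etLLM export vs. the optimum-et export locally.
cmake-out/examples/models/llama/llama_main \
  --model_path=<etllm_export>.pte \
  --tokenizer_path=<downloaded>/tokenizer.json \
  --prompt="Once upon a time" \
  --seq_len=128

cmake-out/examples/models/llama/llama_main \
  --model_path=<optimum_et_export>.pte \
  --tokenizer_path=<downloaded>/tokenizer.json \
  --prompt="Once upon a time" \
  --seq_len=128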

--output_name="${OUT_ET_MODEL_NAME}.pte"
ls -lh "${OUT_ET_MODEL_NAME}.pte"
elif [[ ${{ matrix.config }} == "llama3_qnn_htp" ]]; then
export QNN_SDK_ROOT=/tmp/qnn/2.28.0.241029
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang/
@@ -292,21 +307,75 @@ jobs:
OUT_ET_MODEL_NAME="llama3_2_qnn" # Qualcomm hard-coded it in their script
find . -name "${OUT_ET_MODEL_NAME}.pte" -not -path "./${OUT_ET_MODEL_NAME}.pte" -exec mv {} ./ \;
ls -lh "${OUT_ET_MODEL_NAME}.pte"
else
# By default, test with the Hugging Face model and the xnnpack recipe
DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model")
python -m extension.export_util.export_hf_model -hfm="$HF_MODEL_REPO" -o "$OUT_ET_MODEL_NAME"
ls -lh "${OUT_ET_MODEL_NAME}.pte"
fi
else
echo "Unsupported model ${{ matrix.model }}"
exit 1
elif [[ "$HF_MODEL_REPO" == "Qwen/Qwen3-0.6B" ]]; then
if [[ ${{ matrix.config }} == "et_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "." --files "tokenizer.json")
python -m examples.models.llama.export_llama \
--model qwen3-0_6b \
--params examples/models/qwen3/0_6b_config.json \
-kv \
--use_sdpa_with_kv_cache \
-d fp32 \
-X \
--xnnpack-extended-ops \
-qmode 8da4w \
-G 32 \
-E 8,0 \
--metadata '{"get_bos_id": 151644, "get_eos_ids":[151645]}' \
--output_name="${OUT_ET_MODEL_NAME}.pte"
ls -lh "${OUT_ET_MODEL_NAME}.pte"
fi
fi

if [[ ${{ matrix.config }} == "hf_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
DOWNLOADED_PATH=$(
bash .ci/scripts/download_hf_hub.sh \
--model_id "${HF_MODEL_REPO}" \
--files "tokenizer.json"
)
echo "tokenizer.json is downloaded to $DOWNLOADED_PATH"

# Install optimum-executorch
git clone https://github.com/huggingface/optimum-executorch
pushd optimum-executorch
# There is no release yet; for CI stability, always test from the same commit on main
git checkout 1c653dc49812fc431a22312c7295d97005d22e12
python install_dev.py
pip list

ARGS=(
"--model" "${HF_MODEL_REPO}"
"--task" "text-generation"
"--recipe" "xnnpack"
"--use_custom_sdpa"
"--qlinear"
"--qembedding"
"--output_dir" ".."
)

# Add conditional arguments based on model
case "${HF_MODEL_REPO}" in
*"google/gemma-3-1b-it"*)
echo "--use_custom_kv_cache cannot be used for HybridCache"
;;
*)
ARGS+=("--use_custom_kv_cache")
;;
esac

optimum-cli export executorch "${ARGS[@]}"
popd

mv model.pte ${OUT_ET_MODEL_NAME}.pte
ls -lh "${OUT_ET_MODEL_NAME}.pte"
fi

zip -j model.zip "${OUT_ET_MODEL_NAME}.pte" "${DOWNLOADED_PATH}/tokenizer.model"
zip -j model.zip ${OUT_ET_MODEL_NAME}.pte ${DOWNLOADED_PATH}/tokenizer.*
ls -lh model.zip
mkdir -p "${ARTIFACTS_DIR_NAME}"
mv model.zip "${ARTIFACTS_DIR_NAME}"
mkdir -p ${ARTIFACTS_DIR_NAME}
mv model.zip ${ARTIFACTS_DIR_NAME}
ls -lh ${ARTIFACTS_DIR_NAME}
elif [[ ${{ matrix.model }} == "llama" ]]; then
# Install requirements for export_llama
PYTHON_EXECUTABLE=python bash examples/models/llama/install_requirements.sh
6 changes: 3 additions & 3 deletions .github/workflows/apple-perf-private-device-experiment.yml
@@ -18,7 +18,7 @@ on:
description: Models to be benchmarked
required: false
type: string
default: mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8
default: google/gemma-3-1b-it,Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,meta-llama/Llama-3.2-1B,allenai/OLMo-1B-hf

Contributor Author:

@huydhn FYI, I will have to offload the private device by temporarily removing some models; otherwise we would hit the API limits. Please add the models and other existing runs back when you refactor the workflow to run as many as possible given the number of instances in the private pool.

devices:
description: Target devices to run benchmark
required: false
@@ -34,7 +34,7 @@ on:
description: Models to be benchmarked
required: false
type: string
default: mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8
default: Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,meta-llama/Llama-3.2-1B,allenai/OLMo-1B-hf
devices:
description: Target devices to run benchmark
required: false
@@ -57,6 +57,6 @@ jobs:
id-token: write
contents: read
with:
models: ${{ inputs.models || 'mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8' }}
models: ${{ inputs.models || 'Qwen/Qwen3-0.6B' }}
devices: apple_iphone_15_private
benchmark_configs: ${{ inputs.benchmark_configs }}
94 changes: 84 additions & 10 deletions .github/workflows/apple-perf.yml
@@ -70,7 +70,7 @@ jobs:
# Separate default values from the workflow dispatch. To ensure defaults are accessible
# during scheduled runs and to provide flexibility for different defaults between
# on-demand and periodic benchmarking.
CRON_DEFAULT_MODELS: ${{ github.event_name == 'schedule' && 'llama,mv3,mv2,ic4,ic3,resnet50,edsr,mobilebert,w2l,meta-llama/Llama-3.2-1B,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8' || 'llama' }}
CRON_DEFAULT_MODELS: ${{ github.event_name == 'schedule' && 'llama,mv3,mv2,ic4,ic3,resnet50,edsr,mobilebert,w2l,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8,google/gemma-3-1b-it,Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,meta-llama/Llama-3.2-1B,allenai/OLMo-1B-hf' || 'llama' }}
CRON_DEFAULT_DEVICES: apple_iphone_15
run: |
set -eux
@@ -207,7 +207,10 @@ jobs:
HF_MODEL_REPO=${{ matrix.model }}
OUT_ET_MODEL_NAME="$(echo "$HF_MODEL_REPO" | awk -F'/' '{print $2}' | sed 's/_/-/g' | tr '[:upper:]' '[:lower:]')_${{ matrix.config }}"

# Convert HF checkpoint to ET via etLLM path
if [[ "$HF_MODEL_REPO" == meta-llama/* ]]; then
# The benchmark app relies on the _llm suffix to determine whether the model is an LLM or not
OUT_ET_MODEL_NAME=${OUT_ET_MODEL_NAME}_llm
# Llama models on Hugging Face
if [[ ${{ matrix.config }} == "llama3_spinquant" ]]; then
# SpinQuant
@@ -278,6 +281,21 @@ jobs:
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
--output_name="${OUT_ET_MODEL_NAME}.pte"
ls -lh "${OUT_ET_MODEL_NAME}.pte"
elif [[ ${{ matrix.config }} == "et_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model" "params.json" "consolidated.00.pth")
${CONDA_RUN} python -m examples.models.llama.export_llama \
--model llama3_2 \
--checkpoint "${DOWNLOADED_PATH}/consolidated.00.pth" \
--params "${DOWNLOADED_PATH}/params.json" \
-kv \
--use_sdpa_with_kv_cache \
-d fp32 \
-X \
--xnnpack-extended-ops \
-qmode 8da4w -G 32 -E 8,0 \
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
--output_name="${OUT_ET_MODEL_NAME}.pte"
ls -lh "${OUT_ET_MODEL_NAME}.pte"
elif [[ ${{ matrix.config }} == "llama3_coreml_ane" ]]; then
# ANE
DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model" "params.json" "consolidated.00.pth")
@@ -293,18 +311,74 @@ jobs:
--coreml-compute-units cpu_and_ne \
--output_name="${OUT_ET_MODEL_NAME}.pte"
ls -lh "${OUT_ET_MODEL_NAME}.pte"
else
# By default, test with the Hugging Face model and the xnnpack recipe
DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model")
${CONDA_RUN} python -m extension.export_util.export_hf_model -hfm="$HF_MODEL_REPO" -o "$OUT_ET_MODEL_NAME"
ls -lh "${OUT_ET_MODEL_NAME}.pte"
fi
else
echo "Unsupported model ${{ matrix.model }}"
exit 1
elif [[ "$HF_MODEL_REPO" == "Qwen/Qwen3-0.6B" ]]; then
OUT_ET_MODEL_NAME=${OUT_ET_MODEL_NAME}_llm
if [[ ${{ matrix.config }} == "et_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "." --files "tokenizer.json")
${CONDA_RUN} python -m examples.models.llama.export_llama \
--model qwen3-0_6b \
--params examples/models/qwen3/0_6b_config.json \
-kv \
--use_sdpa_with_kv_cache \
-d fp32 \
-X \
--xnnpack-extended-ops \
-qmode 8da4w \
-G 32 \
-E 8,0 \
--metadata '{"get_bos_id": 151644, "get_eos_ids":[151645]}' \
--output_name="${OUT_ET_MODEL_NAME}.pte"
ls -lh "${OUT_ET_MODEL_NAME}.pte"
fi
fi

if [[ ${{ matrix.config }} == "hf_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
DOWNLOADED_PATH=$(
bash .ci/scripts/download_hf_hub.sh \
--model_id "${HF_MODEL_REPO}" \
--files "tokenizer.json"
)
echo "tokenizer.json is downloaded to $DOWNLOADED_PATH"

# Install optimum-executorch
git clone https://github.com/huggingface/optimum-executorch
pushd optimum-executorch
# There is no release yet; for CI stability, always test from the same commit on main
git checkout 1c653dc49812fc431a22312c7295d97005d22e12
${CONDA_RUN} python install_dev.py
pip list

ARGS=(
"--model" "${HF_MODEL_REPO}"
"--task" "text-generation"
"--recipe" "xnnpack"
"--use_custom_sdpa"
"--qlinear"
"--qembedding"
"--output_dir" ".."
)

# Add conditional arguments based on model
case "${HF_MODEL_REPO}" in
*"google/gemma-3-1b-it"*)
echo "--use_custom_kv_cache cannot be used for HybridCache"
;;
*)
ARGS+=("--use_custom_kv_cache")
;;
esac

${CONDA_RUN} optimum-cli export executorch "${ARGS[@]}"
popd

# The benchmark app relies on the _llm suffix to determine whether the model is an LLM or not
OUT_ET_MODEL_NAME=${OUT_ET_MODEL_NAME}_llm
mv model.pte ${OUT_ET_MODEL_NAME}.pte
ls -lh "${OUT_ET_MODEL_NAME}.pte"
fi

zip -j model.zip "${OUT_ET_MODEL_NAME}.pte" "${DOWNLOADED_PATH}/tokenizer.model"
zip -j model.zip ${OUT_ET_MODEL_NAME}.pte ${DOWNLOADED_PATH}/tokenizer.*
ls -lh model.zip
mkdir -p "${ARTIFACTS_DIR_NAME}"
mv model.zip "${ARTIFACTS_DIR_NAME}"