Benchmark HF optimum-executorch #11450
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11450
Note: Links to docs will display an error until the docs builds have been completed.
❌ 6 New Failures, 1 Pending as of commit b25c0d2 with merge base cbd3874. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@huydhn Okay, it turns out that I need to run install with …
Now I can see the TPS reported from the Android app. I manually pulled the results, and it seems the TPS is almost the same between the etLLM-generated Qwen3 and the optimum-et-generated Qwen3.
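For context on the metric, here is a back-of-the-envelope sketch of how a TPS number can be derived from raw timings, assuming TPS simply means generated tokens divided by wall-clock decode time (the benchmark app's exact accounting may differ):

```python
# Hedged sketch: deriving a tokens-per-second (TPS) figure from raw numbers.
# Assumes TPS = generated tokens / wall-clock decode time; whether prefill is
# included depends on the app's accounting, which is not asserted here.
def tokens_per_second(num_generated_tokens: int, decode_time_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock decode time."""
    if decode_time_s <= 0:
        raise ValueError("decode_time_s must be positive")
    return num_generated_tokens / decode_time_s

# Example: 115 tokens generated in 2.3 seconds -> 50.0 TPS.
print(tokens_per_second(115, 2.3))
```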
For some reason, I don't see the results from the hf recipe uploaded here. After manually checking the results, the perf is the same on iOS devices as well.
Benchmark LLMs from optimum-executorch. With all the work recently happening in optimum-executorch, we are able to boost the out-of-the-box performance. This PR puts these models on the benchmark infra to gather perf numbers and understand the remaining perf gap against the in-house models generated via export_llama.

We are able to do an apples-to-apples comparison for the CPU backend by introducing quantization, custom SDPA, and a custom KV cache to native Hugging Face models in optimum-executorch: hf_xnnpack_custom_spda_kv_cache_8da4w represents the recipe used by optimum-et, and et_xnnpack_custom_spda_kv_cache_8da4w is the counterpart for etLLM.
Here are the benchmark jobs in our infra:

Note there may be failures when running optimum-et models on-device due to the lack of HF tokenizer support in the benchmark apps. I had to stop packing tokenizer.json into the .zip in order to unblock collecting raw latency on the forward() call.
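The repackaging step itself is mechanical; here is a minimal sketch, assuming the benchmark artifact is a plain .zip that bundles the .pte together with tokenizer.json (the real artifact layout and file names are assumptions):

```python
# Hedged sketch: drop tokenizer.json from a benchmark artifact .zip so the
# benchmark apps (which lack HF tokenizer support) can still measure raw
# forward() latency. Paths and file names are assumptions for illustration.
import zipfile

SRC = "model_artifact.zip"         # assumed input artifact
DST = "model_artifact_no_tok.zip"  # repackaged artifact without tokenizer.json

with zipfile.ZipFile(SRC, "r") as src:
    with zipfile.ZipFile(DST, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            # Skip the HF tokenizer file; copy everything else unchanged.
            if item.filename.endswith("tokenizer.json"):
                continue
            dst.writestr(item, src.read(item.filename))
```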