Consolidate Intel Quantization Toolkit Integration in vLLM #31716
Conversation
Documentation preview: https://vllm--31716.org.readthedocs.build/en/31716/
Code Review
This pull request effectively consolidates the Intel Quantization Toolkit integration by merging AutoRound into the Intel Neural Compressor (INC) configuration. The changes are well-structured, removing redundant files and clarifying the documentation. The code refactoring correctly updates the quantization configuration mappings and removes obsolete logic. My only feedback is a minor but important typo in the documentation that could confuse users.
cc @robertgshaw2-redhat @xuechendi PTAL
Hi @yiliu30, the pre-commit checks have failed. Please run the commands below, then commit the changes and push to your branch.
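For reference, the full sequence as a single shell snippet (these are exactly the commands listed above, with no extra flags or steps assumed):

```bash
# Install pre-commit, register the git hook, and run all hooks over the repo
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```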
Force-pushed from 2f4a616 to 43ef7c6.
Also cc @yewentao256
This pull request has merge conflicts that must be resolved before it can be merged.
yewentao256 left a comment
Thanks for the work! I just found one error in the doc.
I might not be the best person to review this PR, so perhaps you can find someone else to give a formal review. @heheda12345
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for the hard work on this!
Resolve #30663
Rendered version:
TODO
- `INCConfig` for HPU (vllm-gaudi#779)

cc @hshen14 @thuang6 @wenhuach21 @jikunshang @kzawora-intel @xuechendi
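As a quick illustration of what the consolidation means for users, here is a minimal usage sketch; the model ID is a placeholder and the checkpoint-detection behavior is only as described in this PR, not verified here:

```python
from vllm import LLM

# Hypothetical usage sketch: with this PR, quantization="auto-round" is
# served by INCConfig rather than a dedicated AutoRoundConfig, so an
# AutoRound-quantized checkpoint loads through the INC path.
llm = LLM(
    model="your-org/your-model-int4-auto-round",  # placeholder checkpoint ID
    quantization="auto-round",
)
out = llm.generate("Hello, my name is")
print(out[0].outputs[0].text)
```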
Note
Unifies Intel quantization under `INC` and removes the standalone AutoRound implementation.
- Map `quantization="auto-round"` to `INCConfig`; delete `auto_round.py` and related imports; add an override so `auto-round` checkpoints resolve to `inc`
- Expand `inc.py` to support weight-only recipes (e.g., `W4A16`/`W8A16`), per-layer configs, fp16/bf16 activations, and backends (GPTQ/AWQ/Marlin/IPEX)
- Resolve `auto-round` to `INCConfig`; include `inc` in the override order
- Treat only `gguf` as config-less

Written by Cursor Bugbot for commit 6b499ff. This will update automatically on new commits.
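To make the routing described in the note concrete, here is a minimal illustrative sketch of a method-to-config mapping; the class and function names below are stand-ins, not the actual vLLM internals touched by this PR:

```python
# Illustrative only: these names approximate the idea of the change,
# not vLLM's real quantization registry.
class QuantizationConfig: ...
class INCConfig(QuantizationConfig): ...
class GGUFConfig(QuantizationConfig): ...

# "auto-round" no longer has its own config class; it resolves to INCConfig.
METHOD_TO_CONFIG: dict[str, type[QuantizationConfig]] = {
    "inc": INCConfig,
    "auto-round": INCConfig,  # consolidated: the standalone AutoRound config is gone
    "gguf": GGUFConfig,       # the remaining "config-less" special case
}

def get_config_cls(method: str) -> type[QuantizationConfig]:
    try:
        return METHOD_TO_CONFIG[method]
    except KeyError as err:
        raise ValueError(f"Unknown quantization method: {method!r}") from err
```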