Consolidate Intel Quantization Toolkit Integration in vLLM #31716
Conversation
Documentation preview: https://vllm--31716.org.readthedocs.build/en/31716/
Code Review
This pull request effectively consolidates the Intel Quantization Toolkit integration by merging AutoRound into the Intel Neural Compressor (INC) configuration. The changes are well-structured, removing redundant files and clarifying the documentation. The code refactoring correctly updates the quantization configuration mappings and removes obsolete logic. My only feedback is a minor but important typo in the documentation that could confuse users.
cc @robertgshaw2-redhat @xuechendi PTAL
Hi @yiliu30, the pre-commit checks have failed. Please run the commands below, then commit the changes and push to your branch.
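For reference, the full sequence as a single shell snippet (these are exactly the commands listed above, with no extra flags or steps assumed):

```bash
# Install pre-commit, register the git hook, and run all hooks over the repo
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```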
Force-pushed from 2f4a616 to 43ef7c6.
Also cc @yewentao256
This pull request has merge conflicts that must be resolved before it can be merged.
yewentao256 left a comment
Thanks for the work! I just found one error in the doc.
I might not be the best person to review this PR, so perhaps you can find someone else to give a formal review. @heheda12345
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for the hard work on this!
Resolve #30663
Rendered version:
TODO
- `INCConfig` for HPU (vllm-gaudi#779)

cc @hshen14 @thuang6 @wenhuach21 @jikunshang @kzawora-intel @xuechendi
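As a quick illustration of what the consolidation means for users, here is a minimal usage sketch; the model ID is a placeholder and the checkpoint-detection behavior is only as described in this PR, not verified here:

```python
from vllm import LLM

# Hypothetical usage sketch: with this PR, quantization="auto-round" is
# served by INCConfig rather than a dedicated AutoRoundConfig, so an
# AutoRound-quantized checkpoint loads through the INC path.
llm = LLM(
    model="your-org/your-model-int4-auto-round",  # placeholder checkpoint ID
    quantization="auto-round",
)
out = llm.generate("Hello, my name is")
print(out[0].outputs[0].text)
```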
Note
Unifies Intel quantization under `INC` and removes the standalone AutoRound implementation.
- Map `quantization="auto-round"` to `INCConfig`; delete `auto_round.py` and related imports; add an override so `auto-round` checkpoints resolve to `inc`
- Expand `inc.py` to support weight-only recipes (e.g., `W4A16`/`W8A16`), per-layer configs, fp16/bf16 activations, and backends (GPTQ/AWQ/Marlin/IPEX)
- Resolve `auto-round` to `INCConfig`; include `inc` in the override order
- Treat only `gguf` as config-less

Written by Cursor Bugbot for commit 6b499ff. This will update automatically on new commits.
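To make the routing described in the note concrete, here is a minimal illustrative sketch of a method-to-config mapping; the class and function names below are stand-ins, not the actual vLLM internals touched by this PR:

```python
# Illustrative only: these names approximate the idea of the change,
# not vLLM's real quantization registry.
class QuantizationConfig: ...
class INCConfig(QuantizationConfig): ...
class GGUFConfig(QuantizationConfig): ...

# "auto-round" no longer has its own config class; it resolves to INCConfig.
METHOD_TO_CONFIG: dict[str, type[QuantizationConfig]] = {
    "inc": INCConfig,
    "auto-round": INCConfig,  # consolidated: the standalone AutoRound config is gone
    "gguf": GGUFConfig,       # the remaining "config-less" special case
}

def get_config_cls(method: str) -> type[QuantizationConfig]:
    try:
        return METHOD_TO_CONFIG[method]
    except KeyError as err:
        raise ValueError(f"Unknown quantization method: {method!r}") from err
```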