Postprocessing to share lm_head weights to embedding #1461
Conversation
# Subtract zero point from casted weights
sub_node = helper.make_node(
    'Sub',
    inputs=["casted_quant_weights", "casted_zero_point"],
    outputs=["centered_weights"],
    name='/model/embed_tokens/SubtractZeroPoint'
)

# Multiply by scale
dequantized_output = "dequantized_embeddings"
mul_node = helper.make_node(
    'Mul',
    inputs=["centered_weights", "gathered_scales"],
    outputs=[dequantized_output],
    name='/model/embed_tokens/MultiplyByScale'
)
Can we use the DequantizeLinear op?
https://onnx.ai/onnx/operators/onnx__DequantizeLinear.html
Could you please elaborate on how to construct it?
Use helper.make_node to create a DequantizeLinear node, and feed the quantized lm_head weight, together with the same scales and zero points used by the last MatMulNBits node, into DequantizeLinear. That gives you the dequantized weights, and you can then Gather from them based on input_ids.
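A minimal sketch of that construction with onnx.helper. The tensor names ("lm_head.qweight", "lm_head.scales", "lm_head.qzeros", "input_ids") and the group size of 32 are illustrative assumptions, not the names used in this PR; the real ones come from the initializers already feeding the last MatMulNBits node.

from onnx import helper

# Dequantize the shared lm_head weight (block-wise DequantizeLinear needs opset 21+).
dequant_node = helper.make_node(
    "DequantizeLinear",
    inputs=["lm_head.qweight", "lm_head.scales", "lm_head.qzeros"],
    outputs=["dequantized_lm_head_weights"],
    name="/model/embed_tokens/DequantizeLinear",
    block_size=32,  # assumed group size; must match the MatMulNBits group size
    axis=-1,
)

# Gather the embedding rows for the current input_ids from the dequantized weight matrix.
gather_node = helper.make_node(
    "Gather",
    inputs=["dequantized_lm_head_weights", "input_ids"],
    outputs=["inputs_embeds"],
    name="/model/embed_tokens/Gather",
    axis=0,
)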
I agree that we should use DequantizeLinear. It is already constructed in the model builder.
onnxruntime-genai/src/python/py/models/builder.py
Lines 861 to 886 in 8eed730
def make_dequantize_linear(self, dequantize_name, quantized_op):
    # Input weights are quantized, save quantized MatMul weights for onnx model
    qweight = dequantize_name[1:].replace("/", ".") + ".qweight"
    qweight_npy = quantized_op.qweight.detach().cpu()
    qweight_npy = qweight_npy.reshape(*qweight_npy.shape[:-2], qweight_npy.shape[-2] * qweight_npy.shape[-1])
    self.make_external_tensor(qweight_npy.contiguous(), qweight, True)

    scales = dequantize_name[1:].replace("/", ".") + ".scales"
    scales_npy = quantized_op.scales.detach().cpu().to(self.to_torch_dtype[self.io_dtype])
    scales_npy = scales_npy.reshape(*qweight_npy.shape[:-1], qweight_npy.shape[-1] * 2 // quantized_op.group_size)
    self.make_external_tensor(scales_npy.contiguous(), scales)

    dequantize_inputs = [qweight, scales]

    if hasattr(quantized_op, "qzeros") and quantized_op.qzeros is not None:
        zeros = dequantize_name[1:].replace("/", ".") + ".qzeros"
        zeros_npy = quantized_op.qzeros.detach().cpu()
        zeros_npy = zeros_npy.reshape(*qweight_npy.shape[:-1], qweight_npy.shape[-1] // quantized_op.group_size)
        self.make_external_tensor(zeros_npy.contiguous(), zeros, True)
        dequantize_inputs.append(zeros)

    dequantize_output = f"{dequantize_name}/output_0"
    self.make_node("DequantizeLinear", inputs=dequantize_inputs, outputs=[dequantize_output], name=dequantize_name, block_size=quantized_op.group_size, axis=-1)
    self.make_value_info(dequantize_output, self.io_dtype, shape=[*scales_npy.shape[:-1], scales_npy.shape[-1] * quantized_op.group_size])

    return dequantize_output
It will also be easier to construct the temporary subgraph for GatherBlockQuantized in the model builder directly.
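As a rough sketch of that direction, assuming the com.microsoft GatherBlockQuantized contrib op; the tensor names, attribute values, and group size below are illustrative assumptions, not taken from this PR or the builder:

from onnx import helper

# GatherBlockQuantized gathers and dequantizes in one step, so the full weight
# matrix never has to be materialized in io_dtype.
gbq_node = helper.make_node(
    "GatherBlockQuantized",
    inputs=["lm_head.qweight", "input_ids", "lm_head.scales", "lm_head.qzeros"],
    outputs=["inputs_embeds"],
    name="/model/embed_tokens/GatherBlockQuantized",
    domain="com.microsoft",
    gather_axis=0,    # gather along the vocab dimension
    quantize_axis=1,  # weights are block-quantized along the hidden dimension
    block_size=32,    # assumed group size; must match the MatMulNBits group size
)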
Force-pushed from 207b13a to 4fefaad.
# Input A and the scale have the same type, but the scale is stored in external data, so we can only get the type from A here.
scale_value_type = get_tensor_type_from_graph(graph, matmul_node.input[0])
if scale_value_type:
    scale_value_type = scale_value_type.elem_type
Check notice (Code scanning / CodeQL): Unused local variable
This PR is based on #1437, but we don't need to convert MatMul here.