
Conversation

@YangKai0616

Similar to #2473. Our internal CI testing for XPU found that the results for cases like test_bnb_regression.py::test_opt_350m_4bit_quant_storage and test_regression.py::TestOpt4bitBnb::test_lora_4bit, just like those of test_bnb_regression.py::test_opt_350m_4bit, vary depending on the bitsandbytes version and the hardware platform.

XPU currently passes all unit test files with the latest bitsandbytes version. Could we therefore temporarily disable these specific tests on XPU to prevent CI errors?

@YangKai0616
Author

Hi @BenjaminBossan , please help review. Thanks!

@BenjaminBossan
Member

Hi, thanks for the PR. I assume you are from the same team as @yao-matrix?

XPU currently passes all unit test files with the latest bitsandbytes version

Do you mean the latest release from bitsandbytes makes the tests pass? Why is that version not used for CI then? Or do you mean the tests require bitsandbytes installed from source? In that case, I would change the condition such that:

  • test runs if CUDA available
  • test runs if XPU is available AND bitsandbytes version is greater than x.y.z (presumably >= 0.49)
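If it helps, here is a minimal sketch of that condition as a pytest decorator (not PEFT's actual helper; the 0.49.0 threshold and the decorator name are assumptions that would need to be confirmed):

```python
# Sketch only: the version threshold and helper name are assumptions, not PEFT code.
import bitsandbytes as bnb
import pytest
import torch
from packaging import version

BNB_XPU_MIN_VERSION = "0.49.0"  # assumed minimum bnb version for XPU support

def require_bitsandbytes_backend(test_func):
    """Run on CUDA, or on XPU only if the installed bitsandbytes is new enough."""
    cuda_ok = torch.cuda.is_available()
    xpu_ok = (
        hasattr(torch, "xpu")
        and torch.xpu.is_available()
        and version.parse(bnb.__version__) >= version.parse(BNB_XPU_MIN_VERSION)
    )
    return pytest.mark.skipif(
        not (cuda_ok or xpu_ok),
        reason=f"requires CUDA, or XPU with bitsandbytes >= {BNB_XPU_MIN_VERSION}",
    )(test_func)
```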

@yao-matrix
Contributor

yao-matrix commented Oct 16, 2025

Echoing Benjamin: our target is to cover all cases on XPU, so as a general principle, let's be very cautious about platform-specific skips. Please follow Benjamin's suggestion and figure out which bnb version is needed to pass. @YangKai0616

@YangKai0616
Author

Do you mean the latest release from bitsandbytes makes the tests pass? Why is that version not used for CI then? Or do you mean the tests require bitsandbytes installed from source? In that case, I would change the condition such that:

  • test runs if CUDA available
  • test runs if XPU is available AND bitsandbytes version is greater than x.y.z (presumably >= 0.49)

Sorry for my unclear wording. What I meant is that XPU can currently pass all of the bitsandbytes tests. I brought this up because you mentioned that test_bnb_regression.py exists to test bitsandbytes (link).

For the test case test_opt_350m_4bit_quant_storage, I compared the outputs of CUDA (A100) and XPU layer by layer and found no anomalous errors. The test failure looks more like the result of reasonable error accumulation when comparing across different platforms.

Additionally, I tested FP32 and 4-bit matmul, Linear outputs, and quantization precision on both CUDA (A100) and XPU, and performed a cross-platform comparison. The results are as follows:

Run mode: Run on XPU and compare with CUDA
✅ Results loaded: cuda_results.pkl
   PyTorch version: 2.8.0+cu126
   BNB version: 0.48.1

================================================================================
Start comparative analysis...
================================================================================

================================================================================
FP32 Baseline Comparison (CUDA vs XPU)
================================================================================

1. FP32 matmul Comparison:

Size                      Max abs Diff         Mean abs Diff
------------------------------------------------------------
(64, 512, 256)            0.0000190735         0.0000031811
(128, 1024, 512)          0.0000419617         0.0000061635
(256, 2048, 1024)         0.0001602173         0.0000192979

2. FP32 Linear Comparison:

  Output max diff: 0.0000009537
  Output mean diff: 0.0000001015
  Weight max diff: 0.0000000000
  Weight mean diff: 0.0000000000

3. FP32 MLP Comparison:

  Max diff: 0.0000001192
  Mean diff: 0.0000000290

================================================================================
4-bit Quantization Comparison (CUDA vs XPU)
================================================================================

1. Quantization / Dequantization Comparison:

Size             Metric                          CUDA            XPU             Diff          
--------------------------------------------------------------------------------------------
128x512         Quantization max abs error       0.55697584      0.55697584      0.00000000
                Quantization mean abs error      0.09671264      0.09671264      0.00000000
                Dequant result max diff                                          0.00000000
                Dequant result mean diff                                         0.00000000
--------------------------------------------------------------------------------------------
1024x1024       Quantization max abs error       0.64687514      0.64687514      0.00000000
                Quantization mean abs error      0.09654428      0.09654428      0.00000000
                Dequant result max diff                                          0.00000000
                Dequant result mean diff                                         0.00000000
--------------------------------------------------------------------------------------------
2048x4096       Quantization max abs error       0.69555235      0.69555235      0.00000000
                Quantization mean abs error      0.09655710      0.09655712      0.00000001
                Dequant result max diff                                          0.00000000
                Dequant result mean diff                                         0.00000000
--------------------------------------------------------------------------------------------

2. 4-bit matmul Comparison:

  Max abs diff: 0.00002289
  Mean abs diff: 0.00000319

3. Linear4bit Comparison:

  Max abs diff: 0.00000143
  Mean abs diff: 0.00000010

The results show that the 4-bit quantization precision on XPU and CUDA (A100) aligns perfectly. The differences in the Linear, MLP, and matmul computations are reasonable, and the larger differences at bigger matmul sizes reflect the expected accumulation of floating-point error.
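For reference, here is a rough sketch of the kind of quantization round-trip comparison behind the numbers above (not the actual script I used; the helper name and the cuda_results.pkl workflow are assumptions for illustration):

```python
# Rough sketch only: collect results on one device, pickle them, compare on the other.
import pickle

import torch
import bitsandbytes.functional as F

def collect_results(device):
    """Quantize/dequantize a fixed tensor and record the reconstruction error."""
    torch.manual_seed(0)
    # Generate on CPU first so both platforms see identical input values.
    weight = torch.randn(1024, 1024, dtype=torch.float32).to(device)
    quantized, quant_state = F.quantize_4bit(weight, quant_type="nf4")
    dequantized = F.dequantize_4bit(quantized, quant_state)
    err = (weight - dequantized).abs()
    return {
        "max_abs_error": err.max().cpu(),
        "mean_abs_error": err.mean().cpu(),
        "dequantized": dequantized.cpu(),
    }

# On the CUDA machine:
#     pickle.dump(collect_results("cuda"), open("cuda_results.pkl", "wb"))
# On the XPU machine: load cuda_results.pkl and compare it against collect_results("xpu").
```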

Considering that the ground truth for tests such as test_opt_350m_4bit_quant_storage is calculated on the CUDA platform, should we add a ground truth for the XPU platform for testing? What are your thoughts on this?

Thanks!

YangKai0616 reopened this on Oct 21, 2025
@BenjaminBossan
Member

I see, thanks for explaining. Honestly, I'm considering removing these tests completely, as they are actually tests for bitsandbytes and not for PEFT. We added them to PEFT because bnb didn't have its own CI for this at the time, but the picture has since changed. Therefore, I would like to avoid putting too much extra work into this, like providing separate artifacts per architecture.

I started a discussion with bnb maintainers about removing the bnb tests in PEFT and will reply here once we have come up with a conclusion.

@BenjaminBossan
Member

@YangKai0616 After discussing with bnb colleagues, we decided to remove test_bnb_regression.py (#2858). Regarding test_regression.py, do we have the same situation there?

@YangKai0616
Author

@YangKai0616 After discussing with bnb colleagues, we decided to remove test_bnb_regression.py (#2858). Regarding test_regression.py, do we have the same situation there?

Thank you for your prompt response!

Yes, you can see that the 4-bit BNB model is also used in the test case test_regression.py::TestOpt4bitBnb::test_lora_4bit. The CUDA-based ground truth is located at link. Should we maintain a similar ground truth for XPU?

@BenjaminBossan
Member

@YangKai0616 Okay, got it. So what could be done:

  1. You create your own HF repo to upload the artifacts to.
  2. In test_regression.py, HF_REPO points to your repo if XPU is detected; otherwise it stays on "peft-internal-testing/regression-tests" (see the sketch below).
  3. On an XPU-enabled machine, you run the regression tests in "creation mode" (see the comment at the top). This will upload the XPU artifacts to your repo. It is recommended to use a PEFT release version for this, not the main branch.
  4. In subsequent runs, if XPU is detected, the artifacts from your repo will be used.

Let's add a comment to let the user know that if they run with XPU, the regression artifacts are loaded from a repo outside of Hugging Face's control, so they run this at their own risk. We use torch.load for the tensors, so there is a potential for vulnerabilities. Of course, we trust Intel here, but it's still good to let users know.
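A minimal sketch of what steps 2 and 4 could look like in test_regression.py (the XPU repo name below is a hypothetical placeholder, not an existing repo):

```python
# Sketch only: the XPU repo name is a placeholder and must be replaced with the real one.
import torch

if hasattr(torch, "xpu") and torch.xpu.is_available():
    # Warning: these artifacts come from a repo outside of Hugging Face's control and are
    # loaded with torch.load, so users run the XPU regression tests at their own risk.
    HF_REPO = "YangKai0616/peft-regression-tests-xpu"  # hypothetical placeholder
else:
    HF_REPO = "peft-internal-testing/regression-tests"
```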
