Aarch64 and X64 have different behaviour in u8s8 reorder for convolution #2412
Labels
help wanted
platform:cpu-aarch64 (codeowner: @oneapi-src/onednn-cpu-aarch64)
sighting (suspicious library behavior; should be promoted to a bug when confirmed)
Summary
This issue was identified from the PyTorch unit test failures in pytorch/pytorch#144770, and it appears to me that it is caused by the weight prepacking leaving the weights in unexpected memory locations.
https://github.com/pytorch/pytorch/blob/dc8692b0eb093d5af150ae0f3a29a0957c3e4c0d/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp#L437-L458
The code related to the test failure is linked above, and I can confirm that the tests pass if the following line is commented out:
https://github.com/pytorch/pytorch/blob/dc8692b0eb093d5af150ae0f3a29a0957c3e4c0d/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp#L415
op_attr.set_zero_points_mask(DNNL_ARG_SRC, /* zero_points_mask= */0);
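For context, here is a minimal sketch (mine, not taken from the PyTorch or ideep sources) of how this attribute is typically used with the oneDNN 3.x C++ API: a mask of 0 requests a single per-tensor source zero point, whose value is then supplied at execution time via DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_SRC. The shapes below are illustrative only, not the failing case.

#include <iostream>
#include "dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);

    // Illustrative int8 convolution shapes: N=1, IC=8, OC=8, 5x5 input, 3x3 kernel.
    memory::desc src_md({1, 8, 5, 5}, memory::data_type::u8, memory::format_tag::any);
    memory::desc wei_md({8, 8, 3, 3}, memory::data_type::s8, memory::format_tag::any);
    memory::desc dst_md({1, 8, 3, 3}, memory::data_type::s32, memory::format_tag::any);

    primitive_attr op_attr;
    // mask = 0 => one common (per-tensor) zero point for the source.
    op_attr.set_zero_points_mask(DNNL_ARG_SRC, /* zero_points_mask= */ 0);

    auto conv_pd = convolution_forward::primitive_desc(eng,
            prop_kind::forward_inference, algorithm::convolution_direct,
            src_md, wei_md, dst_md,
            /* strides   */ {1, 1},
            /* padding_l */ {0, 0},
            /* padding_r */ {0, 0}, op_attr);

    // Prints which convolution implementation was selected.
    std::cout << "impl: " << conv_pd.impl_info_str() << "\n";
    return 0;
}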
A reproducer is in the comments below.
But it seems that a reorder with the reversed memory format (ab -> ba) of a 135x2 int8 matrix leaves zeros in the resulting memory on Aarch64, but not on x64; see Observed behaviour below.
Note - this seems to go through the reorder implementation jit_uni_reorder and the convolution implementation gemm_x8s8s32x_convolution_fwd_t.
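As a standalone illustration of that reorder, here is a minimal sketch (my own, not the reproducer attached in the comments) that fills a 135x2 s8 matrix with non-zero values, reorders it from ab to ba, and counts zeros in the destination. If the behaviour described above holds for this plain reorder, the count should be non-zero on Aarch64 and zero on x64; running it with ONEDNN_VERBOSE=1 should also confirm whether jit_uni_reorder is the implementation that gets dispatched.

#include <cstdint>
#include <iostream>
#include <vector>
#include "dnnl.hpp"

int main() {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dims dims = {135, 2};
    memory::desc src_md(dims, memory::data_type::s8, memory::format_tag::ab);
    memory::desc dst_md(dims, memory::data_type::s8, memory::format_tag::ba);

    // Fill the source with strictly non-zero values so that any zero in the
    // destination means an element was not copied by the reorder.
    std::vector<int8_t> src_data(135 * 2);
    for (size_t i = 0; i < src_data.size(); ++i)
        src_data[i] = static_cast<int8_t>(i % 100 + 1);

    memory src_mem(src_md, eng, src_data.data());
    memory dst_mem(dst_md, eng);

    reorder(src_mem, dst_mem).execute(strm, src_mem, dst_mem);
    strm.wait();

    const int8_t *dst = static_cast<const int8_t *>(dst_mem.get_data_handle());
    int zeros = 0;
    for (size_t i = 0; i < src_data.size(); ++i)
        if (dst[i] == 0) ++zeros;

    std::cout << "zero elements in dst: " << zeros << " (expected 0)\n";
    return 0;
}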
Version
Ideep = 77f18a4a44492e79b8c738e4fcd2698699b1acba
oneDNN = 66f0cb9 (v3.5.3)
Environment
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: ARM
Model: 1
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 1
Stepping: r1p1
BogoMIPS: 2100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
Steps to reproduce
See the comment below with the reproducer attached.
Observed behavior
Tensor dimensions are: 135 (27 groups x 5 input channels), 2, 1, 1.
Expected behavior
If these lines https://github.com/oneapi-src/oneDNN/blob/381134e4fdbdcb1099fc584291ee2805fafdb5f0/src/cpu/cpu_convolution_list.cpp#L592C1-L612C82 are commented out to force x86 to use gemm_x8s8s32x_convolution_fwd_t, then we get the correct output on x86.