
How to debug conv2d backward Unexpected page fault error #2864

Open · hoshibara opened this issue Mar 12, 2025 · 2 comments
hoshibara commented Mar 12, 2025

Hi, I recently encountered an issue related to conv2d backward.
After updating oneDNN to 3.7.0, conv2d backward throws an Unexpected page fault when it runs after a matmul with a specific shape.
I tried to reproduce the issue following the oneDNN verbose logs with another tensor shape, but couldn't replicate it.
My guess is that it's caused by operator inconsistencies generated under specific shapes.
Could I get your suggestions on how to investigate or debug this further?


PyTorch reproducer

import os

import torch
import torch.nn as nn

cpu_device = torch.device("cpu")
dpcpp_device = torch.device("xpu")

a_size = int(os.environ.get("A_SIZE", 1))

# This reproducer throws a page fault when a_size is in 1..8.
a = torch.rand([a_size, 1024], device="xpu:0", dtype=torch.float16)
b = torch.rand([1024, 1024], device="xpu:0", dtype=torch.float16)
torch.mm(a, b)

dtype=torch.double

x_cpu = torch.randn(
    [1, 64, 256, 256], dtype=dtype, device=cpu_device, requires_grad=True
)
grad_cpu = torch.full(
    [1, 64, 256, 256], 1e-3, dtype=dtype, device=cpu_device, requires_grad=True
)
conv_cpu = nn.Conv2d(
    64, 64, kernel_size=3, stride=1, padding=1, bias=False
).double()

x_dpcpp = x_cpu.to(dpcpp_device).requires_grad_()
grad_dpcpp = grad_cpu.to(dpcpp_device)
conv_dpcpp = conv_cpu.to(dpcpp_device)
y_dpcpp = conv_dpcpp(x_dpcpp)
y_dpcpp.backward(grad_dpcpp)  # the page fault fires during this backward call

oneDNN Verbose

onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:56
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support 
onednn_verbose,v1,info,gpu,runtime:DPC++
onednn_verbose,v1,info,gpu,engine,sycl gpu device count:8 
onednn_verbose,v1,info,gpu,engine,0,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,1,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,2,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,3,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,4,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,5,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,6,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,7,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend
onednn_verbose,v1,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src:f16::blocked:ab::f0 wei:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-scratchpad:user,,1x1024:1024x1024
onednn_verbose,v1,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1
FATAL: Unexpected page fault from GPU at 0xff0000000e97a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
FATAL: Unexpected page fault from GPU at 0xff0000000e97a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
Abort was called at 288 line in file:
./shared/source/os_interface/linux/drm_neo.cpp
Aborted
@dzarukin (Contributor)

Hi @hoshibara, thank you for providing verbose logs; that gives some perspective and tips. But if you could provide ONEDNN_VERBOSE=all output (as opposed to ONEDNN_VERBOSE=1), that may give the team some more context for triaging the issue.
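For reference, one way to collect that is ONEDNN_VERBOSE=all python reproducer.py > verbose_all.log 2>&1, or from Python directly. A minimal sketch, assuming the variable must be in the environment before torch initializes oneDNN:

import os

# Assumption: ONEDNN_VERBOSE has to be set before oneDNN initializes,
# so export it before importing torch.
os.environ["ONEDNN_VERBOSE"] = "all"

import torch  # oneDNN picks up the setting during initialization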

If the problem were in the execution of the backward path, the log should contain a create line for that backward convolution, and that would help to see whether the parameters matched and what implementation was used.

It was also mentioned that this started after the upgrade to v3.7. Does that imply the previous version was working well? And if so, which version was that?

And in case you are eager to help us with debugging, here are some more questions:

  • Does the issue happen if and only if there's this specific matmul in there? What happens if matmul op gets removed?
  • Does the issue happen with f16 matmul on any other shape? Like 1x256:256x256? Or maybe 16x16:16x16?
  • Does it happen with f32 or f64 matmul used instead of f16?
  • Does it happen for this specific convolution shape? Will any other fail?

Answering these questions will help narrow down the scope and potentially simplify a standalone oneDNN reproducer; a sketch of one way to script such a sweep follows.
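For illustration only, a hypothetical driver that runs the reproducer once per configuration in a fresh process (the page fault aborts the whole process, so one process per run is needed). MM_DTYPE and MM_SHAPE are made-up environment variables that reproducer.py would have to read; they are not part of the existing script.

import itertools
import os
import subprocess
import sys

# Sweep the matmul dtypes and shapes asked about above, one fresh
# process per configuration since a page fault kills the process.
for dtype, shape in itertools.product(
    ("float16", "float32", "float64"),
    ("1x1024:1024x1024", "1x256:256x256", "16x16:16x16"),
):
    result = subprocess.run(
        [sys.executable, "reproducer.py"],
        env={**os.environ, "MM_DTYPE": dtype, "MM_SHAPE": shape},
        capture_output=True,
    )
    status = "ok" if result.returncode == 0 else "FAULT"
    print(f"{status}: dtype={dtype} shape={shape}")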

Thank you.


hoshibara commented Mar 13, 2025

Hi @dzarukin, thank you for your patience and suggestions.
I followed your advice and did further testing. Here are the results:

It was also mentioned that this started after the upgrade to v3.7. Does that imply the previous version was working well? And if so, which version was that?

I rolled back the oneDNN version integrated in the latest PyTorch and confirmed that this issue first appeared in v3.7 (5e92240).
When I switched to the v3.6.2 branch (2eb3dd1), the test ran properly.
In addition, I collected ONEDNN_VERBOSE=all logs for both versions while running the reproducer. The differences are as follows:

diff --git a/shape_1.3_6_2.processed.log b/shape_1.3_7.processed.log
index d3bd144..25dad21 100644
--- a/shape_1.3_6_2.processed.log
+++ b/shape_1.3_7.processed.log
@@ -1,13 +1,14 @@
-onednn_verbose,v1,info,oneDNN v3.6.2 (commit 2eb3dd1082db767fab171e934c551c609008289a)
+onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
 onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:56
 onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support 
 onednn_verbose,v1,info,gpu,runtime:DPC++
+onednn_verbose,v1,info,gpu,engine,sycl gpu device count:1 
 onednn_verbose,v1,info,gpu,engine,0,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
 onednn_verbose,v1,info,graph,backend,0:dnnl_backend
 onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
 onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
-onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:69
-onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:69
+onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:80
+onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:80
 onednn_verbose,v1,info,gpu,gemm,consider:64x40,4x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:64x32,4x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:32x48,8x4x1,score:xxxxxx
@@ -33,7 +34,7 @@ onednn_verbose,v1,info,gpu,gemm,consider:64x32,4x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:64x2,2x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:64x4,8x2x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:64x4,8x2x1,score:xxxxxx
-onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,ocl:gemm_with_po:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,failed to create nested primitive jit:gemm:any,src/gpu/intel/ocl/gemm/gemm_with_post_ops.cpp:76
+onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,ocl:gemm_with_po:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,failed to create nested jit:gemm:any primitive,src/gpu/intel/ocl/gemm/gemm_with_post_ops.cpp:76
 onednn_verbose,v1,info,gpu,gemm,consider:64x40,4x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:64x32,4x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:32x48,8x4x1,score:xxxxxx
@@ -67,6 +68,7 @@ onednn_verbose,v1,primitive,exec,gpu,matmul,jit:gemm:any,undef,src:f16::blocked:
 onednn_verbose,v1,primitive,create:cache_miss,gpu,convolution,jit:ir,forward_training,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
 onednn_verbose,v1,primitive,exec,gpu,convolution,jit:ir,forward_training,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
 onednn_verbose,v1,primitive,create:cache_miss,gpu,convolution,jit:ir,backward_data,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
-onednn_verbose,v1,primitive,exec,gpu,convolution,jit:ir,backward_data,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
-onednn_verbose,v1,primitive,create:cache_miss,gpu,convolution,jit:ir,backward_weights,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
-onednn_verbose,v1,primitive,exec,gpu,convolution,jit:ir,backward_weights,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
+FATAL: Unexpected page fault from GPU at 0xff0000000ea3a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
+FATAL: Unexpected page fault from GPU at 0xff0000000ea3a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
+Abort was called at 288 line in file:
+./shared/source/os_interface/linux/drm_neo.cpp

Does the issue happen if and only if there's this specific matmul in there? What happens if matmul op gets removed?

Yes. If I remove the torch.mm operation in the code, conv2d runs without any errors.

Does the issue happen with f16 matmul on any other shape? Like 1x256:256x256? Or maybe 16x16:16x16?

Yes. The smallest shape I found that can reproduce the issue is (1~8)x8 : 8x4.

Does it happen with f32 or f64 matmul used instead of f16?

No. If I use f32 or f64 for the matmul with the same tensor shapes, conv2d works fine and no error occurs.

Does it happen for this specific convolution shape? Will any other fail?

Yes. The smallest convolution shape I found so far that reproduces the issue is as follows:

x_cpu = torch.randn([1, 32, 256, 256], dtype=dtype, device=cpu_device, requires_grad=True)
grad_cpu = torch.full([1, 9, 256, 256], 1e-3, dtype=dtype, device=cpu_device, requires_grad=True)
conv_cpu = nn.Conv2d(32, 9, kernel_size=3, stride=1, padding=1, bias=False).double()
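Putting those minima together, a candidate minimized reproducer might look like the sketch below. This is an assumption: the thread reports the smallest matmul and the smallest convolution separately, so whether they still fail when combined is unverified.

import torch
import torch.nn as nn

# Hypothetical minimized reproducer combining the reported minima.
a = torch.rand([1, 8], device="xpu:0", dtype=torch.float16)
b = torch.rand([8, 4], device="xpu:0", dtype=torch.float16)
torch.mm(a, b)

x = torch.randn([1, 32, 256, 256], dtype=torch.double, device="xpu", requires_grad=True)
grad = torch.full([1, 9, 256, 256], 1e-3, dtype=torch.double, device="xpu")
conv = nn.Conv2d(32, 9, kernel_size=3, stride=1, padding=1, bias=False).double().to("xpu")
conv(x).backward(grad)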

Thank you again for your time and support!
