
How to debug conv2d backward Unexpected page fault error #2864

Open · hoshibara opened this issue Mar 12, 2025 · 2 comments
hoshibara commented Mar 12, 2025

Hi, I recently encountered an issue related to conv2d backward.
After updating oneDNN to 3.7.0, conv2d backward throws an Unexpected page fault when it runs after a matmul with a specific shape.
I tried to reproduce the issue following the oneDNN verbose logs with another tensor shape, but couldn't replicate it.
My guess is that it's caused by operator inconsistencies generated under specific shapes.
Could I get your suggestions on how to investigate or debug this further?


PyTorch reproducer

import os

import torch
import torch.nn as nn

cpu_device = torch.device("cpu")
dpcpp_device = torch.device("xpu")

a_size = int(os.environ.get("A_SIZE", 1))

# This reproducer throws a page fault when a_size is in 1..8.
a = torch.rand([a_size, 1024], device="xpu:0", dtype=torch.float16)
b = torch.rand([1024, 1024], device="xpu:0", dtype=torch.float16)
torch.mm(a, b)

dtype=torch.double

x_cpu = torch.randn(
    [1, 64, 256, 256], dtype=dtype, device=cpu_device, requires_grad=True
)
grad_cpu = torch.full(
    [1, 64, 256, 256], 1e-3, dtype=dtype, device=cpu_device, requires_grad=True
)
conv_cpu = nn.Conv2d(
    64, 64, kernel_size=3, stride=1, padding=1, bias=False
).double()

x_dpcpp = x_cpu.to(dpcpp_device).requires_grad_()
grad_dpcpp = grad_cpu.to(dpcpp_device)
conv_dpcpp = conv_cpu.to(dpcpp_device)
y_dpcpp = conv_dpcpp(x_dpcpp)
y_dpcpp.backward(grad_dpcpp)  # the page fault fires during this backward call

oneDNN Verbose

onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:56
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support 
onednn_verbose,v1,info,gpu,runtime:DPC++
onednn_verbose,v1,info,gpu,engine,sycl gpu device count:8 
onednn_verbose,v1,info,gpu,engine,0,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,1,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,2,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,3,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,4,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,5,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,6,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,gpu,engine,7,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend
onednn_verbose,v1,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src:f16::blocked:ab::f0 wei:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-scratchpad:user,,1x1024:1024x1024
onednn_verbose,v1,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1
FATAL: Unexpected page fault from GPU at 0xff0000000e97a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
FATAL: Unexpected page fault from GPU at 0xff0000000e97a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
Abort was called at 288 line in file:
./shared/source/os_interface/linux/drm_neo.cpp
Aborted
@dzarukin (Contributor)

Hi @hoshibara, thank you for providing verbose logs; that gives some perspective and tips. But if you could provide ONEDNN_VERBOSE=all output (as opposed to ONEDNN_VERBOSE=1), that may give the team some more context for triaging the issue.
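For reference, one way to collect that is ONEDNN_VERBOSE=all python reproducer.py > verbose_all.log 2>&1, or from Python directly. A minimal sketch, assuming the variable must be in the environment before torch initializes oneDNN:

import os

# Assumption: ONEDNN_VERBOSE has to be set before oneDNN initializes,
# so export it before importing torch.
os.environ["ONEDNN_VERBOSE"] = "all"

import torch  # oneDNN picks up the setting during initialization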

If the problem were in the execution of the backward path, the log should contain a create line for that backward convolution, and that would help to see whether the parameters matched and what implementation was used.

It was also mentioned that this started after the upgrade to v3.7. Does that imply the previous version was working well? And if so, which version was that?

And in case you are eager to help us with debugging, here are some more questions:

  • Does the issue happen if and only if there's this specific matmul in there? What happens if matmul op gets removed?
  • Does the issue happen with f16 matmul on any other shape? Like 1x256:256x256? Or maybe 16x16:16x16?
  • Does it happen with f32 or f64 matmul used instead of f16?
  • Does it happen for this specific convolution shape? Will any other fail?

Answering these questions will help narrow down the scope and potentially simplify a standalone oneDNN reproducer; a sketch of one way to script such a sweep follows.
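For illustration only, a hypothetical driver that runs the reproducer once per configuration in a fresh process (the page fault aborts the whole process, so one process per run is needed). MM_DTYPE and MM_SHAPE are made-up environment variables that reproducer.py would have to read; they are not part of the existing script.

import itertools
import os
import subprocess
import sys

# Sweep the matmul dtypes and shapes asked about above, one fresh
# process per configuration since a page fault kills the process.
for dtype, shape in itertools.product(
    ("float16", "float32", "float64"),
    ("1x1024:1024x1024", "1x256:256x256", "16x16:16x16"),
):
    result = subprocess.run(
        [sys.executable, "reproducer.py"],
        env={**os.environ, "MM_DTYPE": dtype, "MM_SHAPE": shape},
        capture_output=True,
    )
    status = "ok" if result.returncode == 0 else "FAULT"
    print(f"{status}: dtype={dtype} shape={shape}")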

Thank you.


hoshibara commented Mar 13, 2025

Hi @dzarukin, thank you for your patience and suggestions.
I followed your advice and did further testing. Here are the results:

It was also mentioned that this started after the upgrade to v3.7. Does that imply the previous version was working well? And if so, which version was that?

I rolled back the oneDNN version integrated in the latest PyTorch and confirmed that this issue first appeared in v3.7 (5e92240).
When I switched to the v3.6.2 branch (2eb3dd1), the test ran properly.
In addition, I collected ONEDNN_VERBOSE=all logs for both versions while running the reproducer. The differences are as follows:

diff --git a/shape_1.3_6_2.processed.log b/shape_1.3_7.processed.log
index d3bd144..25dad21 100644
--- a/shape_1.3_6_2.processed.log
+++ b/shape_1.3_7.processed.log
@@ -1,13 +1,14 @@
-onednn_verbose,v1,info,oneDNN v3.6.2 (commit 2eb3dd1082db767fab171e934c551c609008289a)
+onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
 onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:56
 onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support 
 onednn_verbose,v1,info,gpu,runtime:DPC++
+onednn_verbose,v1,info,gpu,engine,sycl gpu device count:1 
 onednn_verbose,v1,info,gpu,engine,0,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
 onednn_verbose,v1,info,graph,backend,0:dnnl_backend
 onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
 onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
-onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:69
-onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:69
+onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:80
+onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:80
 onednn_verbose,v1,info,gpu,gemm,consider:64x40,4x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:64x32,4x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:32x48,8x4x1,score:xxxxxx
@@ -33,7 +34,7 @@ onednn_verbose,v1,info,gpu,gemm,consider:64x32,4x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:64x2,2x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:64x4,8x2x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:64x4,8x2x1,score:xxxxxx
-onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,ocl:gemm_with_po:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,failed to create nested primitive jit:gemm:any,src/gpu/intel/ocl/gemm/gemm_with_post_ops.cpp:76
+onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,ocl:gemm_with_po:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,failed to create nested jit:gemm:any primitive,src/gpu/intel/ocl/gemm/gemm_with_post_ops.cpp:76
 onednn_verbose,v1,info,gpu,gemm,consider:64x40,4x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:64x32,4x8x1,score:xxxxxx
 onednn_verbose,v1,info,gpu,gemm,consider:32x48,8x4x1,score:xxxxxx
@@ -67,6 +68,7 @@ onednn_verbose,v1,primitive,exec,gpu,matmul,jit:gemm:any,undef,src:f16::blocked:
 onednn_verbose,v1,primitive,create:cache_miss,gpu,convolution,jit:ir,forward_training,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
 onednn_verbose,v1,primitive,exec,gpu,convolution,jit:ir,forward_training,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
 onednn_verbose,v1,primitive,create:cache_miss,gpu,convolution,jit:ir,backward_data,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
-onednn_verbose,v1,primitive,exec,gpu,convolution,jit:ir,backward_data,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
-onednn_verbose,v1,primitive,create:cache_miss,gpu,convolution,jit:ir,backward_weights,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
-onednn_verbose,v1,primitive,exec,gpu,convolution,jit:ir,backward_weights,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
+FATAL: Unexpected page fault from GPU at 0xff0000000ea3a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
+FATAL: Unexpected page fault from GPU at 0xff0000000ea3a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
+Abort was called at 288 line in file:
+./shared/source/os_interface/linux/drm_neo.cpp

Does the issue happen if and only if there's this specific matmul in there? What happens if matmul op gets removed?

Yes. If I remove the torch.mm operation in the code, conv2d runs without any errors.

Does the issue happen with f16 matmul on any other shape? Like 1x256:256x256? Or maybe 16x16:16x16?

Yes. The smallest shape I found that can reproduce the issue is (1~8)x8 : 8x4.

Does it happen with f32 or f64 matmul used instead of f16?

No. If I use f32 or f64 for the matmul with the same tensor shapes, conv2d works fine and no error occurs.

Does it happen for this specific convolution shape? Will any other fail?

Yes. The smallest convolution shape I found so far that reproduces the issue is as follows:

x_cpu = torch.randn([1, 32, 256, 256], dtype=dtype, device=cpu_device, requires_grad=True)
grad_cpu = torch.full([1, 9, 256, 256], 1e-3, dtype=dtype, device=cpu_device, requires_grad=True)
conv_cpu = nn.Conv2d(32, 9, kernel_size=3, stride=1, padding=1, bias=False).double()
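Putting those minima together, a candidate minimized reproducer might look like the sketch below. This is an assumption: the thread reports the smallest matmul and the smallest convolution separately, so whether they still fail when combined is unverified.

import torch
import torch.nn as nn

# Hypothetical minimized reproducer combining the reported minima.
a = torch.rand([1, 8], device="xpu:0", dtype=torch.float16)
b = torch.rand([8, 4], device="xpu:0", dtype=torch.float16)
torch.mm(a, b)

x = torch.randn([1, 32, 256, 256], dtype=torch.double, device="xpu", requires_grad=True)
grad = torch.full([1, 9, 256, 256], 1e-3, dtype=torch.double, device="xpu")
conv = nn.Conv2d(32, 9, kernel_size=3, stride=1, padding=1, bias=False).double().to("xpu")
conv(x).backward(grad)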

Thank you again for your time and support!
