How to debug conv2d backward "Unexpected page fault" error #2864

Comments
Hi @hoshibara, thank you for providing verbose logs; that gives some perspective and tips, but if you could provide ONEDNN_VERBOSE=all output (as opposed to 1), that may give some more context for the team triaging the issue. If the problem is in the execution of the backward path, it should report a create line for that backward convolution, and it would help to see whether parameters matched and which implementation was used. It was also mentioned it started after the upgrade to v3.7. Does that imply the previous version was working well? And if so, which version was that? And just in case you are eager to help us with debugging, there are some more questions:

- Does the issue happen if and only if there's this specific matmul in there? What happens if matmul op gets removed?
- Does the issue happen with f16 matmul on any other shape? Like 1x256:256x256? Or maybe 16x16:16x16?
- Does it happen with f32 or f64 matmul used instead of f16?
- Does it happen for this specific convolution shape? Will any other fail?
Answering these questions will help to narrow down the scope and potentially simplify the standalone oneDNN reproducer. Thank you.
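For reference, ONEDNN_VERBOSE=all can also be enabled from inside the reproducer rather than on the command line; a minimal sketch, assuming the variable needs to be set before torch (and hence oneDNN) is loaded — the script structure here is hypothetical, not the attached reproducer:

```python
import os

# Set before importing torch so oneDNN picks the setting up at initialization
os.environ["ONEDNN_VERBOSE"] = "all"

import torch  # noqa: E402

# ... run the conv2d backward reproducer here, redirecting stdout/stderr to a file
```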
Hi @dzarukin, thank you for your patience and suggestions.

> It was also mentioned it started after upgrade to v3.7. Does it imply the previous version was working well? And if so, which version was that?

I rolled back the oneDNN version integrated in the latest PyTorch and confirmed that this issue first appeared in v3.7.0; v3.6.2 works fine. Below is a diff of the processed verbose logs between the two versions:

diff --git a/shape_1.3_6_2.processed.log b/shape_1.3_7.processed.log
index d3bd144..25dad21 100644
--- a/shape_1.3_6_2.processed.log
+++ b/shape_1.3_7.processed.log
@@ -1,13 +1,14 @@
-onednn_verbose,v1,info,oneDNN v3.6.2 (commit 2eb3dd1082db767fab171e934c551c609008289a)
+onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:56
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support
onednn_verbose,v1,info,gpu,runtime:DPC++
+onednn_verbose,v1,info,gpu,engine,sycl gpu device count:1
onednn_verbose,v1,info,gpu,engine,0,backend:Level Zero,name:Intel(R) Data Center GPU Max 1550,driver_version:1.6.31294,binary_kernels:enabled
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
-onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:69
-onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:69
+onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:80
+onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,jit:xe_hp:gemm:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,unsupported format tag,src/gpu/intel/jit/gemm/xe_hp_systolic_gemm.cpp:80
onednn_verbose,v1,info,gpu,gemm,consider:64x40,4x8x1,score:xxxxxx
onednn_verbose,v1,info,gpu,gemm,consider:64x32,4x8x1,score:xxxxxx
onednn_verbose,v1,info,gpu,gemm,consider:32x48,8x4x1,score:xxxxxx
@@ -33,7 +34,7 @@ onednn_verbose,v1,info,gpu,gemm,consider:64x32,4x8x1,score:xxxxxx
onednn_verbose,v1,info,gpu,gemm,consider:64x2,2x8x1,score:xxxxxx
onednn_verbose,v1,info,gpu,gemm,consider:64x4,8x2x1,score:xxxxxx
onednn_verbose,v1,info,gpu,gemm,consider:64x4,8x2x1,score:xxxxxx
-onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,ocl:gemm_with_po:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,failed to create nested primitive jit:gemm:any,src/gpu/intel/ocl/gemm/gemm_with_post_ops.cpp:76
+onednn_verbose,v1,primitive,create:dispatch,gemm,gpu,gemm,ocl:gemm_with_po:any,undef,src_a:f16::blocked:ab::f0 src_b:f16::blocked:ab::f0 dst:f16::blocked:ab::f0,,,1x1024:1024x1024,failed to create nested jit:gemm:any primitive,src/gpu/intel/ocl/gemm/gemm_with_post_ops.cpp:76
onednn_verbose,v1,info,gpu,gemm,consider:64x40,4x8x1,score:xxxxxx
onednn_verbose,v1,info,gpu,gemm,consider:64x32,4x8x1,score:xxxxxx
onednn_verbose,v1,info,gpu,gemm,consider:32x48,8x4x1,score:xxxxxx
@@ -67,6 +68,7 @@ onednn_verbose,v1,primitive,exec,gpu,matmul,jit:gemm:any,undef,src:f16::blocked:
onednn_verbose,v1,primitive,create:cache_miss,gpu,convolution,jit:ir,forward_training,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
onednn_verbose,v1,primitive,exec,gpu,convolution,jit:ir,forward_training,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
onednn_verbose,v1,primitive,create:cache_miss,gpu,convolution,jit:ir,backward_data,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
-onednn_verbose,v1,primitive,exec,gpu,convolution,jit:ir,backward_data,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
-onednn_verbose,v1,primitive,create:cache_miss,gpu,convolution,jit:ir,backward_weights,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
-onednn_verbose,v1,primitive,exec,gpu,convolution,jit:ir,backward_weights,src:f64::blocked:abcd::f0 wei:f64::blocked:abcd::f0 bia:undef::undef::: dst:f64::blocked:abcd::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic64oc64_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,xxxxxx
+FATAL: Unexpected page fault from GPU at 0xff0000000ea3a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
+FATAL: Unexpected page fault from GPU at 0xff0000000ea3a000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
+Abort was called at 288 line in file:
./shared/source/os_interface/linux/drm_neo.cpp

> Does the issue happen if and only if there's this specific matmul in there? What happens if matmul op gets removed?

Yes. If I remove the matmul, the page fault no longer occurs.

> Does the issue happen with f16 matmul on any other shape? Like 1x256:256x256? Or maybe 16x16:16x16?

Yes. The smallest shape I found that can reproduce the issue is

> Does it happen with f32 or f64 matmul used instead of f16?

No. If I use f32 or f64 for the matmul with the same tensor shapes, conv2d works fine and no error occurs.

> Does it happen for this specific convolution shape? Will any other fail?

Yes. The smallest convolution shape I found so far that reproduces the issue is as follows:

x_cpu shape: [1, 32, 256, 256]
grad_cpu shape: [1, 9, 256, 256]
conv_cpu = nn.Conv2d(32, 9, kernel_size=3, stride=1, padding=1, bias=False).double()

Thank you again for your time and support!
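Putting those pieces together, the failing sequence looks roughly like the sketch below. This is reconstructed from the shapes reported in this thread, not the actual attached script; it assumes a PyTorch build with Intel XPU support, and the "xpu" device string and torch.xpu.synchronize() call are assumptions on my part:

```python
import torch
import torch.nn as nn

device = "xpu"  # Intel GPU via PyTorch XPU support (assumed available)

# f16 matmul of the shape seen in the verbose log (1x1024 @ 1024x1024)
a = torch.randn(1, 1024, dtype=torch.float16, device=device)
b = torch.randn(1024, 1024, dtype=torch.float16, device=device)
_ = a @ b

# f64 conv2d forward + backward with the smallest failing shape reported above
x = torch.randn(1, 32, 256, 256, dtype=torch.float64, device=device,
                requires_grad=True)
grad = torch.randn(1, 9, 256, 256, dtype=torch.float64, device=device)
conv = nn.Conv2d(32, 9, kernel_size=3, stride=1, padding=1,
                 bias=False).double().to(device)

out = conv(x)
out.backward(grad)       # page fault reportedly fires during the backward pass
torch.xpu.synchronize()  # force completion so the failure surfaces here
```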
Hi, I recently encountered an issue related to conv2d backward.
After updating oneDNN to v3.7.0, conv2d backward throws an "Unexpected page fault" error when it runs after a matmul operation of a specific shape. I tried to reproduce the issue by following the oneDNN verbose logs with another tensor shape, but couldn't replicate it.
I guess it's caused by inconsistencies in the operators generated under specific shapes.
Could I get your suggestions on how to further investigate or debug this issue?
PyTorch reproducer
oneDNN Verbose