Fix accuracy of max-pooling backpropagation for bfloat16 data #2386

Draft · asimonov1 wants to merge 6 commits into main from asimonov/maxpool_bwd_bf16_accuracy
Conversation

@asimonov1 asimonov1 commented Jan 13, 2025

[MFDNN-11050] (bf16 backward max pooling returns incorrect results)
[MFDNN-11396] (BF16 pooling_backward performance regression on SPR)
(Also [MFDNN-12863] (JIT max pool implementation works incorrectly for small data types and large kernels))

It was found earlier (MFDNN-11050) that bf16 backward max pooling returns incorrect results. An initial accuracy fix led to a significant performance regression (MFDNN-11396) and was rolled back.

The root cause of the accuracy issue is that summation in bf16 is inexact even for relatively small numbers, e.g. bf16(256.0) + bf16(1.0) equals bf16(256.0). Such summation takes place when some pooling strides are smaller than the corresponding kernel sizes, so several dst_diff elements accumulate into the same src_diff element.
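For illustration, a minimal standalone sketch (not oneDNN code; it emulates bf16 by simple truncation) that reproduces this effect:

```cpp
// Minimal sketch: emulate bf16 by keeping only the top 16 bits of an f32
// (plain truncation for brevity; round-to-nearest-even gives the same
// result for these values) to show how bf16 accumulation loses updates.
#include <cstdint>
#include <cstdio>
#include <cstring>

static float as_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u; // drop the 16 low mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float acc_bf16 = as_bf16(256.0f);
    float acc_f32 = 256.0f;
    // Add 1.0 several times, as overlapping pooling windows do when a stride
    // is smaller than the kernel size.
    for (int i = 0; i < 8; ++i) {
        acc_bf16 = as_bf16(acc_bf16 + 1.0f); // every update is lost: stays 256
        acc_f32 += 1.0f;                     // exact in f32: reaches 264
    }
    std::printf("bf16 accumulation: %g\n", acc_bf16); // prints 256
    std::printf("f32  accumulation: %g\n", acc_f32);  // prints 264
    return 0;
}
```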

The current fix uses additional accumulation arrays of f32's, one array per thread. The size of those arrays for src_diff is the same as in the existing ncsp implementation (the ncsp implementation creates f32 arrays for dst_diff, src_diff and indices, reorders the data, and uses those arrays during the calculation). The ncsp case is not affected by this PR.
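As a rough, non-JIT illustration of the idea (hypothetical scalar code for a 1D case; the actual change lives inside the JIT kernel and books the per-thread accumulator in the scratchpad under key_pool_src_f32_accum):

```cpp
// Schematic sketch of the accumulation strategy: one thread's slice of
// backward max pooling, 1D, no padding/dilation. The indices array is
// assumed to hold, for each dst element, the position in src_diff where
// the maximum was taken (a simplification of the real workspace layout).
#include <cstdint>
#include <cstring>
#include <vector>

using bf16_t = uint16_t; // raw bf16 bits

static float bf16_to_f32(bf16_t v) {
    uint32_t bits = uint32_t(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

static bf16_t f32_to_bf16(float f) { // truncation, for brevity
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return bf16_t(bits >> 16);
}

void maxpool_bwd_slice(const bf16_t *dst_diff, const int32_t *indices,
        bf16_t *src_diff, int ow, int iw) {
    // Per-thread f32 accumulator instead of accumulating directly in bf16.
    std::vector<float> acc(iw, 0.0f);
    for (int o = 0; o < ow; ++o)
        acc[indices[o]] += bf16_to_f32(dst_diff[o]);
    // Convert to bf16 once at the end: a single rounding per element.
    for (int i = 0; i < iw; ++i)
        src_diff[i] = f32_to_bf16(acc[i]);
}
```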

I have done some manual measurements on a machine with an SPR processor. In some cases this implementation is faster than the original version and in some cases slower, but it is significantly faster than the non-optimized implementation (which was used after the first fix of MFDNN-11050).

The following tables contain performance data for the axb and aBx16b layouts for the original implementation (main branch), the fixed version (this PR), and the fallback implementation (the one used when the optimized implementation is skipped).

Sketch of the script used to run the tests:

export KMP_AFFINITY=granularity=fine,compact,1,0
export OMP_NUM_THREADS=56
...
export LD_PRELOAD=${libiomp5_loc}/libiomp5.so
numactl --physcpubind=0-59 --membind=0 ./benchdnn -v5 --mode=p --pool --reset --allow-enum-tags-only=0 --engine=cpu --dir=BWD_D --alg=pooling_max --dt=bf16:bf16 --tag=<tag> <problem>

axb

| problem | original (ms) | fixed (ms) | other (simple_nhwc) (ms) |
|---|---|---|---|
| mb200_ic32_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 13 | 13 | 451 |
| mb200_ic35_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 29 | 22 | 1685 |
| mb200_ic32_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 16 | 19 | 857 |
| mb200_ic35_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 55 | 33 | 3960 |
| mb128ic128_ih112oh56kh3sh2_iw112ow56kw3sw2 | 6.7 | 14 | 117 |
| mb512_ic512_iw2048kw129sw1ow1920 | 142 | 88 | 2480 |
| mb128ic64_ih112oh56kh3sh2dh0ph0_iw112ow56kw3sw2dw0pw0 (from MFDNN-11396) | 1.85 | 5.34 | 67 |

aBx16b

| problem | original (ms) | fixed (ms) | other (ref) (ms) |
|---|---|---|---|
| mb200_ic32_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 13 | 7 | 310 |
| mb200_ic35_id20ih40iw30_od18oh38ow28_kd3kh3kw3 | 20 | 10 | 338 |
| mb200_ic32_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 16 | 14 | 155 |
| mb200_ic35_id40ih40iw30_od10oh35ow25_kd4kh6kw6_sd4sh1sw1 | 25 | 19 | 177 |
| mb128ic128_ih112oh56kh3sh2_iw112ow56kw3sw2 | 3.6 | 3.7 | 147 |
| mb512_ic512_iw2048kw129sw1ow1920 | 121 | 60 | 1310 |
| mb128ic64_ih112oh56kh3sh2dh0ph0_iw112ow56kw3sw2dw0pw0 (from MFDNN-11396) | 1.33 | 1.55 | 73 |

The current implementation also fixes the bug [MFDNN-12863] (JIT max pool implementation works incorrectly for small data types and large kernels).

@github-actions github-actions bot added platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 component:tests Codeowner: @oneapi-src/onednn-arch labels Jan 13, 2025
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 17ab064 to 2820609 Compare January 13, 2025 18:39
@asimonov1 asimonov1 changed the title Use float32 accumulator for max-pooling backpropagation for bfloat16 data Fix accuracy of max-pooling backpropagation for bfloat16 data Jan 13, 2025
@asimonov1
Author

make test
disable device_gpu
disable benchdnn_all
enable benchdnn_pool
enable benchdnn_nightly

@asimonov1
Author

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 2820609 to 73b85e2 Compare January 14, 2025 16:09
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 73b85e2 to dce529d Compare January 15, 2025 16:49
@@ -367,6 +368,11 @@ status_t jit_uni_pool_kernel<isa>::init_conf(jit_pool_conf_t &jpp,
}
assert(jpp.ur > 0);

jpp.needs_f32_accum_for_bf16 = jpp.is_bf16
&& jpp.alg == alg_kind::pooling_max && jpp.is_backward
Contributor

I don't think the issue is limited to the max algorithm; accumulation on backward happens for all algorithms... We will need to adjust the threshold in benchdnn for lower data types, but that's not tied to this PR.

Author

Yes, the average pooling algorithms have a similar loss of accuracy. The scope of the task is defined by MFDNN-11050 and MFDNN-11396 (actually, MFDNN-11050 should be reopened, since its fix was rolled back).

@@ -18,6 +18,7 @@
#include <bitset>

#include "common/dnnl_thread.hpp"
#include "common/memory_desc.hpp"
Contributor

This one should come with "cpu/cpu_pooling_pd.hpp", so it is not needed as a standalone include.

Author

Removed.

memory_desc_init_by_tag(
jpp.tmp_md, ndims, dims, data_type::f32, fmt_tag);

scratchpad.book<char>(key_pool_src_f32_accum, tmp_d.size());
Contributor

Please consider improving the code by splitting an init_scratchpad function out of init_conf, as is done in most other places.

Author

Done.

}

template <cpu_isa_t isa>
inline void jit_uni_pool_kernel<isa>::load32(const int idx,
Contributor

There are already a lot of existing routines to support loading. I highly recommend changing the loading/storing implementation to rely on the io_injector, which is more flexible.

Author

Yes, the loading/storing functions need refactoring. I did not know how to do that properly, and I was not aware of the io_injector; I will have to investigate it. Could it be done as a separate task?

Contributor

If doing it as a separate task, I definitely recommend refactoring the existing implementation first, and then applying f32 accumulation together with acc_mode support.

Author

The io_injector is now used to load/store tensor data, but indices are processed the same way as before because the io_injector does not support their data types: it converts integers to floats during loading (however, it does not convert data stored as s32; s32 and f32 are stored by one common function, store_f32, which looks like a bug).

@@ -483,6 +483,151 @@ class bwd_pooling_transpose_facade_t
const dim_t c_tail_;
};

struct bwd_f32_accum_for_bf16_t {
Contributor

Could you please help me understand why a new class is needed, when the only change is inside the kernel (using a different buffer and different instructions to accumulate the inputs) and should not be at the parallelization/balancing level?

Author (@asimonov1, Jan 16, 2025)

I do not quite understand the idea. This class contains the implementation of copying the data, with conversion to bf16, into the corresponding place in src_diff (the bwd_f32_accum_for_bf16_t::cvt_* functions).
Merge it with bwd_pooling_transpose_facade_t? The bwd_pooling_transpose_facade_t class supports the ncsp format: it reorders both tensors (src and dst) to a blocked format and then back to the original format, and it does not support the case ur_bc > 1 (only one block is processed per iteration). The bwd_f32_accum_for_bf16_t class is simpler: it only has to convert dst (dst_diff) from f32 to bf16 without reordering the data, but it supports the case ur_bc > 1 for nspc.

@@ -526,6 +526,8 @@ struct jit_pool_conf_t {
bool with_binary;
int nthr;
memory_desc_t tmp_md;
bool needs_f32_accum_for_bf16;
Contributor

Please add support for acc_mode (a separate commit is totally fine), which would allow preserving the former behavior (accumulation in bf16) and avoid issues like the one recently reported for softmax.

Author

Yes, this is in progress.

Author

I added partial support for acc_mode. The jit_uni_pooling implementation of max-pooling backpropagation for bf16 with the axb and aBx16b/aBx8b layouts switches to the old implementation (without the f32 accumulator) if the 'relaxed' or 'any' accumulation mode is specified. Of the available modes (strict, relaxed, any, f32, s32, f16), s32 and f16 are simply ignored, and the f32 accumulator is used for the strict and f32 modes.

benchdnn is updated to use a zero error threshold for max-pooling with the strict and f32 accumulation modes.

If this approach is OK, the docs should be updated. It looks like the docs are not correct/complete (https://oneapi-src.github.io/oneDNN/dev_guide_pooling.html): for the abx format our jit_uni_pooling implementation converts inputs/outputs to/from f32 arrays, so its accumulation mode is effectively always strict. The 'relaxed' mode is not necessarily faster than strict, but it uses less memory. The f64 data type can be used on GPUs only (?).

I also noticed that the GPU version appears to be out of scope of MFDNN-11050 and MFDNN-11396, so I did not test it.
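For reference, a sketch of how a user would select the accumulation mode through the public attribute API (this assumes the existing dnnl::primitive_attr::set_accumulation_mode knob; it is not part of this PR):

```cpp
// Sketch: requesting 'relaxed' (or 'any') keeps the former bf16 accumulation
// path, while 'strict' or 'f32' selects the f32 accumulator added here.
#include "oneapi/dnnl/dnnl.hpp"

dnnl::primitive_attr make_pooling_bwd_attr(bool allow_relaxed_accumulation) {
    dnnl::primitive_attr attr;
    attr.set_accumulation_mode(allow_relaxed_accumulation
            ? dnnl::accumulation_mode::relaxed
            : dnnl::accumulation_mode::strict);
    // The attribute is passed to pooling_backward::primitive_desc like any
    // other primitive attribute.
    return attr;
}
```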

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 2 times, most recently from 8f5d49e to 8d3b54c Compare January 21, 2025 16:22
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 3 times, most recently from 25cbb9a to f5d5a1d Compare January 26, 2025 14:25
@asimonov1
Author

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 4 times, most recently from d98dd7d to 5734498 Compare February 5, 2025 16:39
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from 5734498 to 3ee89a1 Compare February 8, 2025 00:39
@asimonov1
Author

make test
disable benchdnn_all
disable test_device_gpu
disable build_gpu_runtime_ocl
disable build_gpu_runtime_sycl
enable benchdnn_nightly
enable benchdnn_pool
enable arch_cpu_adl
enable arch_cpu_clx
enable arch_cpu_dmr
enable arch_cpu_gnr
enable arch_cpu_hsw
enable arch_cpu_nhm
enable arch_cpu_nvl
enable arch_cpu_skx
enable arch_cpu_snb
enable arch_cpu_spr
enable arch_cpu_srf

@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch 4 times, most recently from 32be96c to fbf38e7 Compare February 12, 2025 11:36
Do not use f32 accumulator in jit_uni_pooling for max pooling back
propagation with bf16 if 'relaxed' or 'any' accumulation mode
is specified.
Use zero error threshold in tests for max pooling if 'strict' or
'f32' accumulation mode is specified.
@asimonov1 asimonov1 force-pushed the asimonov/maxpool_bwd_bf16_accuracy branch from fbf38e7 to 3abcafb Compare February 12, 2025 12:18