
gpu: ocl: benchdnn: enable RNG memory fill in --mode=F #2583

Open

wants to merge 6 commits into main

Conversation

yehudaorel

Description

Systems where the driver enables data compression by default won't generate meaningful performance data with benchdnn ... --mode=F.

  • Reuse the ocl_philox.h kernel to fill GPU memory directly with random values (a rough sketch is included below)
  • Cold-cache support

Fixes MFDNN-12589
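
For reference, here is a minimal, self-contained sketch of such a fill kernel. It is illustrative only: the PHILOX_M*/PHILOX_W* constants follow the standard Philox4x32 parameters, but the kernel name, its signature, and the INF_NAN_MASK value are assumptions here rather than the actual ocl_philox.h code.

// Philox4x32-style counter-based RNG: each work-item derives its value from
// its own offset, so the fill is deterministic and embarrassingly parallel.
#define PHILOX_M0 0xD2511F53u
#define PHILOX_M1 0xCD9E8D57u
#define PHILOX_W0 0x9E3779B9u
#define PHILOX_W1 0xBB67AE85u
// Assumed mask: clearing the top bit of every byte keeps any f32 view of the
// data finite (the exponent can't be all-ones) and integer views non-negative.
#define INF_NAN_MASK 0x7F7F7F7Fu

uint philox_4x32(ulong offset, uint seed) {
    uint4 ctr = (uint4)((uint)offset, (uint)(offset >> 32), 0u, 0u);
    uint2 key = (uint2)(seed, 0u);
    for (int r = 0; r < 10; ++r) {
        uint lo0 = PHILOX_M0 * ctr.s0, hi0 = mul_hi(PHILOX_M0, ctr.s0);
        uint lo1 = PHILOX_M1 * ctr.s2, hi1 = mul_hi(PHILOX_M1, ctr.s2);
        ctr = (uint4)(hi1 ^ ctr.s1 ^ key.s0, lo1, hi0 ^ ctr.s3 ^ key.s1, lo0);
        key += (uint2)(PHILOX_W0, PHILOX_W1);
    }
    return ctr.s0;
}

__kernel void philox_fill_kernel(__global uint *data, uint seed, ulong nelems) {
    ulong i = get_global_id(0);
    if (i >= nelems) return;
    data[i] = philox_4x32(i * sizeof(uint), seed) & INF_NAN_MASK;
}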

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

@yehudaorel yehudaorel requested a review from a team as a code owner February 3, 2025 18:34
@github-actions github-actions bot added the component:tests Codeowner: @oneapi-src/onednn-arch label Feb 3, 2025
@yehudaorel yehudaorel added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Feb 3, 2025
@yehudaorel yehudaorel requested a review from dzarukin February 3, 2025 18:38
} else if (is_sycl) {
#ifdef DNNL_WITH_SYCL
    // TODO: add sycl support
    return dnnl_status_t::dnnl_unimplemented;
Contributor

@rjoursler rjoursler Feb 3, 2025

There should be some fixed-size window within which data compression can happen, and I believe it is on the order of 256 bytes, so it is not particularly big. What if we (ab)use the binary primitive to broadcast incompressible data into the buffer by doing binary_max(incompressible_src, incompressible_src) with broadcast semantics? That should be a bit simpler, as then we won't have to maintain all this runtime-specific code, although it will introduce memory size alignment requirements.

Contributor

I don't follow - how does binary_max generate random data?

Contributor

@rjoursler rjoursler Feb 4, 2025

Binary max doesn't generate the data; instead we set up one incompressible source buffer at startup. Essentially, we do something like:

dnnl_mem get_mem_rng(size_t size) {
    // 1x256 u8 memory descriptor, filled once with incompressible data
    static dnnl_mem isrc = []() {
        uint8_t rand_data[256];
        fill_rng(rand_data);
        dnnl_mem isrc;
        reorder(isrc, rand_data);
        return isrc;
    }();

    // size/256 x 256 u8 memory descriptor
    dnnl_mem out_data(utils::rnd_up(size, 256));
    // broadcast the 256-byte incompressible pattern across the whole buffer
    out_data = binary_max(isrc, isrc);
    return out_data;
}

Contributor

@rjoursler Do you have details on how the compression is done? For example, if it can detect a repeating pattern at page granularity (4 kilobytes), then any pattern of power-of-two size would result in some subset of pages having identical data, which exposes compression opportunities to the hardware.

Contributor

Discussed with Roy offline - the current hardware seems to work with 256-byte blocks (and future hardware may possibly have larger blocks). So using a small block of 256 bytes of random data as the pattern should be enough.

But I'm not sure about this trick with broadcasting via binary primitive as it looks like dst must match either src0 or src1:

VCHECK_BINARY(IMPLICATION(src0_md->dims[d] != dims[d],
src1_md->dims[d] == dims[d]),

Contributor

Thanks for finding that @echeresh, it sounds like a separate kernel would still be required.

@github-actions github-actions bot removed the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Feb 3, 2025
@echeresh
Contributor

echeresh commented Feb 4, 2025

@yehudaorel Did you compare it with the current data filling approach? How much additional overhead does the RNG-based data filling introduce?

@yehudaorel
Author

@yehudaorel Did you compare it with the current data filling approach? How much additional overhead does the RNG-based data filling introduce?

Based on my testing, this will introduce ~5% overhead, which in realistic terms is only a couple of seconds' difference when running larger tests.

timed 1perf: [A570M]

  • base(main) with cold-cache

    • real 4m19.663s
    • user 2m32.354s
    • sys 0m56.664s
  • base(main) with no-cold-cache:

    • real 3m18.686s
    • user 2m15.223s
    • sys 0m27.828s
  • PRNG with cold-cache:

    • real 4m11.391s
    • user 2m1.066s
    • sys 0m58.836s
  • PRNG with no-cold-cache:

    • real 4m3.358s
    • user 2m17.656s
    • sys 0m30.760s
perf,--mode=F --matmul --engine=gpu --memory-kind=buffer --stag=ab --wtag=ba --dtag=ab --attr-fpmath=f16:true 1x1024:1024x1024,0.160625,26.1634,13.0562
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.160625 avg(ms):0.164043
total: 0.07s; fill: 0.00s (0%);

=== Device Timing Summary ===

                Total Execution Time (ns):             74905580
Total Device Time for CL GPU backend (ns):             12576009

== CL GPU Backend ==

                                            Kernel,       Calls,           Time (ns),    Time (%),        Average (ns),            Min (ns),            Max (ns)
         gemm_kernel[SIMD8 {512; 8; 4} {16; 8; 4}],          73,            11977260,       95.24,              164072,              160625,              167395
philox_fill_kernel[SIMD16 {65536; 1; 1} {0; 0; 0}],           1,              559479,        4.45,              559479,              559479,              559479
  philox_fill_kernel[SIMD16 {256; 1; 1} {0; 0; 0}],           2,               39270,        0.31,               19635,               19583,               19687


@karturov
Contributor

karturov commented Feb 5, 2025

@yehudaorel, 5% overhead is acceptable.

ctr = PHILOX_4UINT_ROUND(ctr_mul, ctr, key1.s23);

return ctr[~idx & 3L];
}
Contributor

It seems that this function is identical to what we have inside the library, though ocl_philox.h there has some dependencies that are unnecessary for benchdnn.
I suggest taking the philox_XXX functions out of the .h and putting them into a new ocl_philox.cl, including that file in ocl_philox.h inside the library, and including it here as well, so the code is shared across both places. Otherwise, it's pretty easy to forget to update one copy and pretty hard to find the issue once the two versions diverge.
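
Roughly, the benchdnn side could then look like the sketch below. The file names, the include mechanism (a -I option passed when the program is built), and the assumption that philox_4x32 and INF_NAN_MASK come from the shared file are illustrative, not the actual layout.

// Hypothetical layout:
//   ocl_philox.cl - only the shared philox_* helpers, no library includes
//   ocl_philox.h  - library header that now just includes ocl_philox.cl
// benchdnn builds its fill kernel from the same shared file:
static const char *fill_kernel_src = R"CLC(
#include "ocl_philox.cl" // resolved via a -I option passed at build time
__kernel void philox_fill_kernel(__global uint *data, uint seed) {
    size_t i = get_global_id(0); // bounds check omitted for brevity
    data[i] = philox_4x32(i * sizeof(uint), seed) & INF_NAN_MASK;
}
)CLC";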

data[i] = philox_4x32(i * sizeof(uint), seed) & INF_NAN_MASK;
}
}
)CLC";
Contributor

Please move this next to the spot where it is compiled.
With this change and the one above, the file should dissolve entirely.

dnn_mem_t src_m(1, dims, dnnl_f32, tag::abx, engine);
dnn_mem_t dst_m(1, dims, dnnl_f32, tag::abx, engine);
const dnn_mem_t::handle_info_t no_rng_m_handle
        = {false, DNNL_MEMORY_ALLOCATE, false};
Contributor

Adding an argument solves the mode=F part, where the kernel is no longer called, but it doesn't help mode=P, which would still initialize memory that is not needed here.
Instead, I suggest replacing DNNL_MEMORY_ALLOCATE with an already allocated memory (one for src, one for dst) and dropping the new argument.

auto s = this->memset_rng(sz);
if (s != dnnl_status_t::dnnl_success) {
    this->memset(dnnl_mem_default_value, sz);
}
Contributor

Only this part of the if branch should remain, and there should be no fallback; a fallback would lead to potentially misleading results that would be hard to track down to this difference.

@karturov
Contributor

make test perf-gpu
disable arch_gpu_xe-hpc
disable arch_gpu_xe-lp
disable arch_gpu_xe2-lpg

1 similar comment

@yehudaorel yehudaorel requested a review from a team as a code owner March 10, 2025 20:20
@github-actions github-actions bot added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Mar 10, 2025
@petercad
Contributor

petercad commented Mar 10, 2025

One side effect of random fill is that it will increase power usage, leading to significant performance variability for power-limited sizes. Has anyone checked if 1perf handles this situation properly (e.g. no false regressions)?

@echeresh I remember that when you first introduced --mode=po you mentioned the low power usage of constant/zero fill as an asset in performance tracking.

@echeresh
Contributor

echeresh commented Mar 10, 2025

One side effect of random fill is that it will increase power usage, leading to significant performance variability for power-limited sizes. Has anyone checked if 1perf handles this situation properly (e.g. no false regressions)?

@echeresh I remember that when you first introduced --mode=po you mentioned the low power usage of constant/zero fill as an asset in performance tracking.

Good point. Yes, this is a known issue/behavior. @yehudaorel I'd suggest doing a couple of performance runs with the same commit and looking at the run-to-run variability. We need to confirm that it's stable with random data filling.

I'll list two performance effects and their implications for our performance testing. Unfortunately, it seems there is no approach that is both reliable and fast enough (but still, I hope we can manage it with our performance testing infrastructure).

  • Zero (or constant) initialized data vs random data requires less power and causes less frequency throttling
    • Implications: performance with random data is 1) generally lower and 2) less stable - there is a higher chance of throttling (which depends on the test cases and hardware)
    • A good thing: a lot of cases in our performance testing are not compute-bound (moreover we track cold-cache performance) so it seems even if frequency drops, it often bounces back without compromising the numbers
    • A bad thing: the most stable way to measure performance is to run a case long enough to ensure we run at a sustained frequency. I assume it's hundreds of milliseconds per case vs ~10ms per case as of now
  • L3 cache write-back overhead. If a kernel writes to global memory, it may write to L3 cache only, leaving the L3 -> global memory transfer to a future queue synchronization (which triggers a cache flush) or to future kernel submissions (if they evict that L3 data to free it up for their own data)
    • Implications: first kernel runs are generally faster - they have "fresh" L3 and don't have to evict previous data

Both effects result in a significant gap between minimum/average times. When moving to random data filling, stability should suffer more due to the first effect.

As for the practical side, we have some "mitigations" in our performance testing so it's not that bad:

  • We use the minimum time for performance reports. It's more stable and suffers less from throttling and L3 cache impact
  • When a regression is detected, we do several re-runs to confirm the regression is there
  • We have quite a few non-compute-bound cases, which limits the impact of the first effect

@petercad
Contributor

petercad commented Mar 10, 2025

  • Zero (or constant) initialized data vs random data requires less power and causes less frequency throttling

    • Implications: performance with random data is 1) generally lower and 2) less stable - there is a higher chance of throttling (which depends on the test cases and hardware)

Another side-effect (actually, the one I'm most worried about) is that frequency throttling causes the performance of one power-bound layer to affect the performance of following layers in batch testing. Some possible results:

  • Reordering layers in batch testing may affect the performance reported for each layer.
  • An optimization that leads to greater compute efficiency in one power-bound layer may reduce frequency for the following layers, causing (false) apparent regressions for those layers.

@echeresh
Contributor

Another side-effect (actually, the one I'm most worried about) is that frequency throttling causes the performance of one power-bound layer to affect the performance of following layers in batch testing. Some possible results:

  • Reordering layers in batch testing may affect the performance reported for each layer.
  • An optimization that leads to greater compute efficiency in one power-bound layer may reduce frequency for the following layers, causing (false) apparent regressions for those layers.

I agree, in isolation throttling is less of a problem as the first few runs are not usually affected (unless the sizes are huge).

I guess we need to step into some related issues to see what helps and what doesn't. For example, we can add a short delay after compute-bound shapes in benchdnn to let the frequency recover, or, if nothing helps, switch to longer runs.
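
To make the delay idea concrete, a minimal sketch; the hook point and the 100 ms value are made up, not an existing benchdnn knob:

#include <chrono>
#include <thread>

// Hypothetical hook called between cases: after a compute-bound shape, give
// the GPU a moment to return to its sustained frequency before measuring the
// next case.
void cooldown_after_case(bool was_compute_bound) {
    if (was_compute_bound)
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
}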

@petercad
Contributor

I agree, in isolation throttling is less of a problem as the first few runs are not usually affected (unless the sizes are huge).

I guess we need to step into some related issues to see what helps and what doesn't. For example, we can add a short delay after compute-bound shapes in benchdnn to let the frequency recover, or, if nothing helps, switch to longer runs.

Can we detect memory-bound vs. compute-bound cases automatically (for select primitives at least)? So by default fill mode would be constant for compute-bound, random for memory-bound, with a knob to override the default.
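
A very rough sketch of what such a default policy could look like; the flops/bytes accounting, the machine-balance threshold, and all names are assumptions, not an existing benchdnn facility:

enum class fill_kind_t { constant, random };

// Classify by arithmetic intensity (flops per byte moved) against the device's
// machine balance (peak flops / peak bandwidth): compute-bound cases keep the
// constant fill (steadier power draw), memory-bound cases get random fill so
// driver-side compression can't inflate the numbers.
fill_kind_t default_fill(double flops, double bytes, double machine_balance) {
    const double intensity = flops / bytes;
    return intensity >= machine_balance ? fill_kind_t::constant
                                        : fill_kind_t::random;
}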

@yehudaorel
Author

Can we detect memory-bound vs. compute-bound cases automatically (for select primitives at least)? So by default fill mode would be constant for compute-bound, random for memory-bound, with a knob to override the default.

I believe this could be done, although for compute-bound cases with constant filling we might end up with the original issue of inflated performance data again due to data compression.
