
gpu: ocl: benchdnn: enable RNG memory fill in --mode=F #2583

Open

wants to merge 6 commits into main

Conversation

yehudaorel

Description

Systems where the driver enables data compression by default won't generate meaningful performance data with benchdnn ... --mode=F.

  • Reuse the ocl_philox.h kernel to fill GPU memory directly with random values (a rough sketch is included below)
  • Cold-cache support

Fixes MFDNN-12589
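
For reference, here is a minimal, self-contained sketch of such a fill kernel. It is illustrative only: the PHILOX_M*/PHILOX_W* constants follow the standard Philox4x32 parameters, but the kernel name, its signature, and the INF_NAN_MASK value are assumptions here rather than the actual ocl_philox.h code.

// Philox4x32-style counter-based RNG: each work-item derives its value from
// its own offset, so the fill is deterministic and embarrassingly parallel.
#define PHILOX_M0 0xD2511F53u
#define PHILOX_M1 0xCD9E8D57u
#define PHILOX_W0 0x9E3779B9u
#define PHILOX_W1 0xBB67AE85u
// Assumed mask: clearing the top bit of every byte keeps any f32 view of the
// data finite (the exponent can't be all-ones) and integer views non-negative.
#define INF_NAN_MASK 0x7F7F7F7Fu

uint philox_4x32(ulong offset, uint seed) {
    uint4 ctr = (uint4)((uint)offset, (uint)(offset >> 32), 0u, 0u);
    uint2 key = (uint2)(seed, 0u);
    for (int r = 0; r < 10; ++r) {
        uint lo0 = PHILOX_M0 * ctr.s0, hi0 = mul_hi(PHILOX_M0, ctr.s0);
        uint lo1 = PHILOX_M1 * ctr.s2, hi1 = mul_hi(PHILOX_M1, ctr.s2);
        ctr = (uint4)(hi1 ^ ctr.s1 ^ key.s0, lo1, hi0 ^ ctr.s3 ^ key.s1, lo0);
        key += (uint2)(PHILOX_W0, PHILOX_W1);
    }
    return ctr.s0;
}

__kernel void philox_fill_kernel(__global uint *data, uint seed, ulong nelems) {
    ulong i = get_global_id(0);
    if (i >= nelems) return;
    data[i] = philox_4x32(i * sizeof(uint), seed) & INF_NAN_MASK;
}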

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

@yehudaorel yehudaorel requested a review from a team as a code owner February 3, 2025 18:34
@github-actions github-actions bot added the component:tests Codeowner: @oneapi-src/onednn-arch label Feb 3, 2025
@yehudaorel yehudaorel added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Feb 3, 2025
@yehudaorel yehudaorel requested a review from dzarukin February 3, 2025 18:38
} else if (is_sycl) {
#ifdef DNNL_WITH_SYCL
    // TODO: add sycl support
    return dnnl_status_t::dnnl_unimplemented;
Contributor

@rjoursler rjoursler Feb 3, 2025

There should be some fixed-size window within which data compression can happen, and I believe it is on the order of 256 bytes, so it is not particularly big. What if we (ab)use the binary primitive to broadcast incompressible data into the buffer by doing binary_max(incompressible_src, incompressible_src) with broadcast semantics? That should be a bit simpler, as then we won't have to maintain all this runtime-specific code, although it will introduce memory size alignment requirements.

Contributor

I don't follow - how does binary_max generate random data?

Contributor

@rjoursler rjoursler Feb 4, 2025

Binary max doesn't generate the data; instead we set up one incompressible source buffer at startup. Essentially, we do something like:

dnnl_mem get_mem_rng(size_t size) {
    // 1x256 u8 memory descriptor, filled once with incompressible data
    static dnnl_mem isrc = []() {
        uint8_t rand_data[256];
        fill_rng(rand_data);
        dnnl_mem isrc;
        reorder(isrc, rand_data);
        return isrc;
    }();

    // size/256 x 256 u8 memory descriptor
    dnnl_mem out_data(utils::rnd_up(size, 256));
    // broadcast the 256-byte incompressible pattern across the whole buffer
    out_data = binary_max(isrc, isrc);
    return out_data;
}

Contributor

@rjoursler Do you have details on how the compression is done? For example, if it can detect a repeating pattern at page granularity (4 kilobytes), then any pattern of power-of-two size would result in some subset of pages having identical data, which exposes compression opportunities to the hardware.

Contributor

Discussed with Roy offline - the current hardware seems to work with 256-byte blocks (and future hardware may possibly have larger blocks). So using a small block of 256 bytes of random data as the pattern should be enough.

But I'm not sure about this trick with broadcasting via binary primitive as it looks like dst must match either src0 or src1:

VCHECK_BINARY(IMPLICATION(src0_md->dims[d] != dims[d],
src1_md->dims[d] == dims[d]),

Contributor

Thanks for finding that @echeresh, it sounds like a separate kernel would still be required.

@github-actions github-actions bot removed the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Feb 3, 2025
@echeresh
Contributor

echeresh commented Feb 4, 2025

@yehudaorel Did you compare it with the current data filling approach? How much additional overhead does the RNG-based data filling introduce?

@yehudaorel
Author

@yehudaorel Did you compare it with the current data filling approach? How much additional overhead does the RNG-based data filling introduce?

Based on my testing, this will introduce ~5% overhead, which in realistic terms is only a couple of seconds' difference when running larger tests.

timed 1perf: [A570M]

  • base(main) with cold-cache

    • real 4m19.663s
    • user 2m32.354s
    • sys 0m56.664s
  • base(main) with no-cold-cache:

    • real 3m18.686s
    • user 2m15.223s
    • sys 0m27.828s
  • PRNG with cold-cache:

    • real 4m11.391s
    • user 2m1.066s
    • sys 0m58.836s
  • PRNG with no-cold-cache:

    • real 4m3.358s
    • user 2m17.656s
    • sys 0m30.760s
perf,--mode=F --matmul --engine=gpu --memory-kind=buffer --stag=ab --wtag=ba --dtag=ab --attr-fpmath=f16:true 1x1024:1024x1024,0.160625,26.1634,13.0562
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0.160625 avg(ms):0.164043
total: 0.07s; fill: 0.00s (0%);

=== Device Timing Summary ===

                Total Execution Time (ns):             74905580
Total Device Time for CL GPU backend (ns):             12576009

== CL GPU Backend ==

                                            Kernel,       Calls,           Time (ns),    Time (%),        Average (ns),            Min (ns),            Max (ns)
         gemm_kernel[SIMD8 {512; 8; 4} {16; 8; 4}],          73,            11977260,       95.24,              164072,              160625,              167395
philox_fill_kernel[SIMD16 {65536; 1; 1} {0; 0; 0}],           1,              559479,        4.45,              559479,              559479,              559479
  philox_fill_kernel[SIMD16 {256; 1; 1} {0; 0; 0}],           2,               39270,        0.31,               19635,               19583,               19687


@karturov
Contributor

karturov commented Feb 5, 2025

@yehudaorel, 5% overhead is acceptable.

ctr = PHILOX_4UINT_ROUND(ctr_mul, ctr, key1.s23);

return ctr[~idx & 3L];
}
Contributor

It seems that this function is identical to what we have inside the library, though ocl_philox.h there has some dependencies that are unnecessary for benchdnn.
I suggest taking the philox_XXX functions out of the .h and putting them into a new ocl_philox.cl, including that file in ocl_philox.h inside the library, and including it here as well, so the code is shared across both places. Otherwise, it's pretty easy to forget to update one copy and pretty hard to find the issue once the two versions diverge.
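
Roughly, the benchdnn side could then look like the sketch below. The file names, the include mechanism (a -I option passed when the program is built), and the assumption that philox_4x32 and INF_NAN_MASK come from the shared file are illustrative, not the actual layout.

// Hypothetical layout:
//   ocl_philox.cl - only the shared philox_* helpers, no library includes
//   ocl_philox.h  - library header that now just includes ocl_philox.cl
// benchdnn builds its fill kernel from the same shared file:
static const char *fill_kernel_src = R"CLC(
#include "ocl_philox.cl" // resolved via a -I option passed at build time
__kernel void philox_fill_kernel(__global uint *data, uint seed) {
    size_t i = get_global_id(0); // bounds check omitted for brevity
    data[i] = philox_4x32(i * sizeof(uint), seed) & INF_NAN_MASK;
}
)CLC";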

data[i] = philox_4x32(i * sizeof(uint), seed) & INF_NAN_MASK;
}
}
)CLC";
Contributor

Please move this next to the spot where it is compiled.
With this change and the one above, the file should dissolve entirely.

dnn_mem_t src_m(1, dims, dnnl_f32, tag::abx, engine);
dnn_mem_t dst_m(1, dims, dnnl_f32, tag::abx, engine);
const dnn_mem_t::handle_info_t no_rng_m_handle
        = {false, DNNL_MEMORY_ALLOCATE, false};
Contributor

Adding an argument solves the mode=F part, where the kernel is no longer called, but it doesn't help mode=P, which would still initialize memory that is not needed here.
Instead, I suggest replacing DNNL_MEMORY_ALLOCATE with an already allocated memory (one for src, one for dst) and dropping the new argument.

auto s = this->memset_rng(sz);
if (s != dnnl_status_t::dnnl_success) {
    this->memset(dnnl_mem_default_value, sz);
}
Contributor

Only this part of the if branch should remain, and there should be no fallback; a fallback would lead to potentially misleading results that would be hard to track down to this difference.

@karturov
Contributor

make test perf-gpu
disable arch_gpu_xe-hpc
disable arch_gpu_xe-lp
disable arch_gpu_xe2-lpg

1 similar comment

@yehudaorel yehudaorel requested a review from a team as a code owner March 10, 2025 20:20
@github-actions github-actions bot added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Mar 10, 2025
@petercad
Contributor

petercad commented Mar 10, 2025

One side effect of random fill is that it will increase power usage, leading to significant performance variability for power-limited sizes. Has anyone checked if 1perf handles this situation properly (e.g. no false regressions)?

@echeresh I remember that when you first introduced --mode=po you mentioned the low power usage of constant/zero fill as an asset in performance tracking.

@echeresh
Contributor

echeresh commented Mar 10, 2025

One side effect of random fill is that it will increase power usage, leading to significant performance variability for power-limited sizes. Has anyone checked if 1perf handles this situation properly (e.g. no false regressions)?

@echeresh I remember that when you first introduced --mode=po you mentioned the low power usage of constant/zero fill as an asset in performance tracking.

Good point. Yes, this is a known issue/behavior. @yehudaorel I'd suggest doing a couple of performance runs with the same commit and looking at the run-to-run variability. We need to confirm that it's stable with random data filling.

I'll list two performance effects and their implications for our performance testing. Unfortunately, it seems there is no approach that is both reliable and fast enough (but still, I hope we can manage it with our performance testing infrastructure).

  • Zero (or constant) initialized data vs random data requires less power and causes less frequency throttling
    • Implications: performance with random data is 1) generally lower and 2) less stable - there is a higher chance of throttling (which depends on the test cases and hardware)
    • A good thing: a lot of cases in our performance testing are not compute-bound (moreover we track cold-cache performance) so it seems even if frequency drops, it often bounces back without compromising the numbers
    • A bad thing: the most stable way to measure performance is to run a case long enough to ensure we run at a sustained frequency. I assume it's hundreds of milliseconds per case vs ~10ms per case as of now
  • L3 cache write-back overhead. If a kernel writes to global memory, it may write to L3 cache only, leaving the L3 -> global memory transfer to a future queue synchronization (which triggers a cache flush) or to future kernel submissions (if they evict that L3 data to free it up for their own data)
    • Implications: first kernel runs are generally faster - they have "fresh" L3 and don't have to evict previous data

Both effects result in a significant gap between minimum/average times. When moving to random data filling, stability should suffer more due to the first effect.

As for the practical side, we have some "mitigations" in our performance testing so it's not that bad:

  • We use the minimum time for performance reports. It's more stable and suffers less from throttling and L3 cache impact
  • When a regression is detected, we do several re-runs to confirm the regression is there
  • We have quite a few non-compute-bound cases, which limits the impact of the first effect

@petercad
Contributor

petercad commented Mar 10, 2025

  • Zero (or constant) initialized data vs random data requires less power and causes less frequency throttling

    • Implications: performance with random data is 1) generally lower and 2) less stable - there is a higher chance of throttling (which depends on the test cases and hardware)

Another side-effect (actually, the one I'm most worried about) is that frequency throttling causes the performance of one power-bound layer to affect the performance of following layers in batch testing. Some possible results:

  • Reordering layers in batch testing may affect the performance reported for each layer.
  • An optimization that leads to greater compute efficiency in one power-bound layer may reduce frequency for the following layers, causing (false) apparent regressions for those layers.

@echeresh
Contributor

Another side-effect (actually, the one I'm most worried about) is that frequency throttling causes the performance of one power-bound layer to affect the performance of following layers in batch testing. Some possible results:

  • Reordering layers in batch testing may affect the performance reported for each layer.
  • An optimization that leads to greater compute efficiency in one power-bound layer may reduce frequency for the following layers, causing (false) apparent regressions for those layers.

I agree, in isolation throttling is less of a problem as the first few runs are not usually affected (unless the sizes are huge).

I guess we need to step into some related issues to see what helps and what doesn't. For example, we can add a short delay after compute-bound shapes in benchdnn to let the frequency recover, or, if nothing helps, switch to longer runs.
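
To make the delay idea concrete, a minimal sketch; the hook point and the 100 ms value are made up, not an existing benchdnn knob:

#include <chrono>
#include <thread>

// Hypothetical hook called between cases: after a compute-bound shape, give
// the GPU a moment to return to its sustained frequency before measuring the
// next case.
void cooldown_after_case(bool was_compute_bound) {
    if (was_compute_bound)
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
}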

@petercad
Contributor

I agree, in isolation throttling is less of a problem as the first few runs are not usually affected (unless the sizes are huge).

I guess we need to step into some related issues to see what helps and what doesn't. For example, we can add a short delay after compute-bound shapes in benchdnn to let the frequency recover, or, if nothing helps, switch to longer runs.

Can we detect memory-bound vs. compute-bound cases automatically (for select primitives at least)? So by default fill mode would be constant for compute-bound, random for memory-bound, with a knob to override the default.
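
A very rough sketch of what such a default policy could look like; the flops/bytes accounting, the machine-balance threshold, and all names are assumptions, not an existing benchdnn facility:

enum class fill_kind_t { constant, random };

// Classify by arithmetic intensity (flops per byte moved) against the device's
// machine balance (peak flops / peak bandwidth): compute-bound cases keep the
// constant fill (steadier power draw), memory-bound cases get random fill so
// driver-side compression can't inflate the numbers.
fill_kind_t default_fill(double flops, double bytes, double machine_balance) {
    const double intensity = flops / bytes;
    return intensity >= machine_balance ? fill_kind_t::constant
                                        : fill_kind_t::random;
}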

@yehudaorel
Author

Can we detect memory-bound vs. compute-bound cases automatically (for select primitives at least)? So by default fill mode would be constant for compute-bound, random for memory-bound, with a knob to override the default.

I believe this could be done, although for compute-bound cases with constant filling we might end up with the original issue of inflated performance data again due to data compression.
