gpu: ocl: benchdnn: enable RNG memory fill in --mode=F #2583
base: main
Conversation
tests/benchdnn/dnnl_memory.cpp
Outdated
    } else if (is_sycl) {
#ifdef DNNL_WITH_SYCL
        // TODO: add sycl support
        return dnnl_status_t::dnnl_unimplemented;
There should be some fixed-size window where data compression can happen, and I believe it is on the order of 256 bytes, so it is not particularly big. What if we (ab)use the binary primitive to broadcast incompressible data into the buffer by doing binary_max(incompressible_src, incompressible_src) with broadcast semantics? That should be a bit simpler, as then we won't have to maintain all this runtime-specific code, although it will introduce memory size alignment requirements.
I don't follow - how does binary_max generate random data?
Binary max doesn't generate the data; instead we set up one incompressible source buffer at startup. Essentially, we do something like:
dnnl_mem get_mem_rng(size_t size) {
    // 1x256 u8 memory descriptor, filled once with incompressible data
    static dnnl_mem isrc = []() {
        uint8_t rand_data[256];
        fill_rng(rand_data);
        dnnl_mem isrc;
        reorder(isrc, rand_data);
        return isrc;
    }();
    // size/256 x 256 u8 memory descriptor
    dnnl_mem out_data(utils::rnd_up(size, 256));
    out_data = binary_max(isrc, isrc); // broadcast isrc over the whole buffer
    return out_data;
}
@rjoursler Do you have details on how compression is done? For example, if it can detect a repeating pattern at page granularity (4 kilobytes), then any pattern of a power-of-two size would result in some subset of pages having the same data, which exposes compression opportunities for the hardware.
Discussed with Roy offline - the current hardware seems to work with 256-byte blocks (and future hardware may possibly have larger blocks). So using a small 256-byte block of random data as a pattern should be enough.
But I'm not sure about this trick with broadcasting via the binary primitive, as it looks like dst must match either src0 or src1:
Lines 130 to 131 in a6a303a
        VCHECK_BINARY(IMPLICATION(src0_md->dims[d] != dims[d],
                              src1_md->dims[d] == dims[d]),
Thanks for finding that @echeresh, it sounds like a separate kernel would still be required.
@yehudaorel Did you compare it with the current data filling approach? How much additional overhead does RNG-based data filling introduce?
Based on my testing, this will introduce ~5% overhead, which in realistic terms is only a couple of seconds' difference when running larger tests. timed 1perf: [A570M]
@yehudaorel, 5% overhead is acceptable.
tests/benchdnn/utils/ocl_philox.h
Outdated
    ctr = PHILOX_4UINT_ROUND(ctr_mul, ctr, key1.s23);

    return ctr[~idx & 3L];
}
It seems that the function is identical to what we have inside the library, though the library's ocl_philox.h has some unnecessary dependencies for benchdnn. I suggest taking the philox_XXX functions out of the .h and putting them into a new ocl_philox.cl, including that file in ocl_philox.h inside the library, and including it here as well so the code is shared across both places. Otherwise, it's pretty easy to forget to update it and pretty hard to find the issue once the two copies diverge.
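A minimal sketch of one way such sharing could look, assuming a new ocl_philox.cl file whose entire content is wrapped in a raw string literal; the file name and the wrapper are illustrative, not the actual library layout:

// ocl_philox.cl (hypothetical shared file) would contain exactly one raw
// string literal with the philox_* OpenCL functions, e.g.
//     R"CLC(
//     uint philox_4x32(ulong idx, ulong seed) { /* ... */ }
//     )CLC"
//
// Both the library's ocl_philox.h and benchdnn then embed the same source
// text by textually including that file:
static const char *philox_ocl_source =
#include "ocl_philox.cl"
        ;
// Each consumer appends its own kernel wrapper around philox_ocl_source
// before compiling the OpenCL program, so the device code stays in one place.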
tests/benchdnn/utils/ocl_philox.h
Outdated
        data[i] = philox_4x32(i * sizeof(uint), seed) & INF_NAN_MASK;
    }
}
)CLC";
Please move this next to the spot where it is compiled. With this and the change above, the file should dissolve.
tests/benchdnn/utils/cold_cache.cpp
Outdated
    dnn_mem_t src_m(1, dims, dnnl_f32, tag::abx, engine);
    dnn_mem_t dst_m(1, dims, dnnl_f32, tag::abx, engine);
    const dnn_mem_t::handle_info_t no_rng_m_handle
            = {false, DNNL_MEMORY_ALLOCATE, false};
Adding the argument solves the mode=F part, since the kernel is no longer called, but it doesn't help mode=P, which would still initialize memory that is not needed here. Instead, I suggest replacing DNNL_MEMORY_ALLOCATE with an already allocated memory (one for src, one for dst) and dropping the new argument.
tests/benchdnn/dnnl_memory.cpp
Outdated
    auto s = this->memset_rng(sz);
    if (s != dnnl_status_t::dnnl_success) {
        this->memset(dnnl_mem_default_value, sz);
    }
Only this part of the if branch should remain, and there should be no fallback; a fallback would lead to potentially misleading results that would be hard to track down to this difference.
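A minimal sketch of the no-fallback variant being asked for; whether the enclosing function can return a status directly is an assumption here:

    // Keep only the RNG fill and surface failures instead of silently
    // falling back to a constant fill, which could mask the difference.
    auto s = this->memset_rng(sz);
    if (s != dnnl_status_t::dnnl_success) return s;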
make test perf-gpu
Force-pushed from d961381 to 3390967.
One side effect of random fill is that it will increase power usage, leading to significant performance variability for power-limited sizes. Has anyone checked if 1perf handles this situation properly (e.g. no false regressions)? @echeresh I remember that when you first introduced
Good point. Yes, this is a known issue/behavior. @yehudaorel I'd suggest doing a couple of performance runs with the same commit and looking at the run-to-run variability. We need to confirm that it's stable with random data filling. I'll list two performance effects and their implications for our performance testing. Unfortunately, it seems there is no approach that is both reliable and fast enough for our performance testing (but still, I hope we can manage it with our performance testing infrastructure).
Both effects result in a significant gap between minimum/average times. When moving to random data filling, stability should suffer more due to the first effect. As for the practical side, we have some "mitigations" in our performance testing, so it's not that bad:
Another side effect (actually, the one I'm most worried about) is that frequency throttling causes the performance of one power-bound layer to affect the performance of following layers in batch testing. Some possible results:
I agree; in isolation, throttling is less of a problem, as the first few runs are usually not affected (unless the sizes are huge). I guess we need to dig into some related issues to see what helps and what doesn't. For example, we can add a short delay after compute-bound shapes in benchdnn to let the frequency recover, or, if nothing helps, switch to longer runs.
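For illustration only, such a delay could be a tiny helper like the one below; the duration and the place where it would hook into benchdnn's driver loop are assumptions:

#include <chrono>
#include <thread>

// Hypothetical cool-down between test cases: sleep briefly after a
// power-hungry shape so GPU frequency can recover before the next measurement.
static void cool_down(
        std::chrono::milliseconds pause = std::chrono::milliseconds(200)) {
    std::this_thread::sleep_for(pause);
}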
Can we detect memory-bound vs. compute-bound cases automatically (at least for select primitives)? Then, by default, the fill mode would be constant for compute-bound cases and random for memory-bound ones, with a knob to override the default.
I believe this could be done, although for compute-bound cases with constant filling we might end up with the original issue of inflated performance data again due to data compression.
Description
Systems with data compression enabled by the driver by default won't generate meaningful performance data with benchdnn ... --mode=F.
Fixes MFDNN-12589
Checklist
General
Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?