benchdnn: cold cache improvement #4529
Conversation
abort() prevents singleton destructors from running; those destructors are designed to catch errors in a correctly finished application, so when the application terminates abnormally they print misleading and annoying messages. abort() also makes it possible to catch errors on the spot with gdb instead of setting a breakpoint on a specific line mentioned in the message.
The timer can now identify spikes in the collected results and discard them when reporting final results. This is done to stabilize the best-time statistics for GPU performance validation.
make test perf-gpu
```cpp
constexpr double magnitude_threshold = 1.1;
size_t major_magnitude = SIZE_MAX;
for (size_t i = 0; i < deltas.size(); i++) {
    deltas[i] = ms_vec_[i + 1] / ms_vec_[i];
```
Do we need to check bounds? Say, if ms_vec_.size() == 1 and deltas_size == 1, this goes out of bounds.
Another comment: do we really need the deltas vector? It looks like its values are not used outside of the loop (see the sketch after the suggestion below).
```cpp
    }
}
// ...
if (major_magnitude < deltas.size()) {
```
Suggested change:
```diff
-if (major_magnitude < deltas.size()) {
+if (major_magnitude < deltas_size) {
```
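A sketch addressing both review comments (not the PR's actual code): bail out when `ms_vec_` has fewer than two entries, and compute the ratio on the fly instead of materializing a `deltas` vector. The function name and the break-at-first-jump behavior are assumptions made for illustration.
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the first "magnitude jump" between consecutive
// samples, or SIZE_MAX when no jump above the threshold is found.
// `ms_vec` is assumed to hold per-sample times sorted in ascending order.
size_t find_major_magnitude(const std::vector<double> &ms_vec) {
    constexpr double magnitude_threshold = 1.1;
    size_t major_magnitude = SIZE_MAX;
    // Guard: with fewer than two samples there is no consecutive pair to
    // compare, so the i + 1 access below can never go out of bounds.
    if (ms_vec.size() < 2) return major_magnitude;
    for (size_t i = 0; i + 1 < ms_vec.size(); i++) {
        // Ratio computed on the fly; no separate `deltas` storage needed.
        const double ratio = ms_vec[i + 1] / ms_vec[i];
        if (ratio >= magnitude_threshold) {
            major_magnitude = i;
            break;
        }
    }
    return major_magnitude;
}
```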
For certain shapes, LNL may report higher BW than the HW provides (the last number is expected to be <= 108):
The issue lies in the hardware behavior: in certain situations the L3 cache is not flushed, and input memories remain resident there even in cold-cache mode.
The resolution is to introduce a flushing kernel that is submitted after each execution to trigger L3 cache flushing. Credit for the idea, the library implementation, and pivoting of the PoC goes to @echeresh.
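For illustration only, here is a minimal sketch of what such a flushing kernel might look like with OpenCL (this is not the PR's implementation; the kernel name, buffer size, and access pattern are assumptions). The idea is to enqueue, right after each measured execution, a kernel that streams through a buffer sized to exceed the GPU L3, so cached input lines are evicted before the next iteration:
```cpp
#include <CL/cl.h>
#include <cstddef>

// Illustrative kernel source: each work-item touches one element of a buffer
// that is sized larger than the GPU L3, evicting previously cached lines.
static const char *l3_flush_src = R"CL(
__kernel void l3_flush(__global uint *buf) {
    size_t idx = get_global_id(0);
    buf[idx] += 1u;
}
)CL";

// Enqueues the flush kernel after a measured execution. Creating the context,
// queue, program, and `flush_buf` is standard OpenCL boilerplate omitted
// here; error handling is also skipped for brevity.
void enqueue_l3_flush(cl_command_queue queue, cl_kernel flush_kernel,
        cl_mem flush_buf, size_t flush_bytes) {
    clSetKernelArg(flush_kernel, 0, sizeof(cl_mem), &flush_buf);
    size_t gws = flush_bytes / sizeof(cl_uint);
    clEnqueueNDRangeKernel(queue, flush_kernel, 1, /*offset=*/nullptr, &gws,
            /*local_size=*/nullptr, 0, nullptr, nullptr);
    // The flush itself is not timed; assuming an in-order queue, it only
    // needs to be submitted before the next measured execution.
}
```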
Note: even though the flushing kernel stabilizes the results significantly, there may still be one or two hardware "shots" where the number is higher than expected, as reported above; the cause of this effect is unknown at the moment.
The workaround for those "shots" is to drop the several fastest results from the timer collection.
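A sketch of that workaround (illustrative; the number of dropped samples and the helper name are assumptions), removing the N fastest samples before best-time statistics are computed:
```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Drops the `n_drop` fastest samples so that abnormally fast "shots" cannot
// become the reported best time. `ms` holds per-sample times in milliseconds.
std::vector<double> drop_fastest(std::vector<double> ms, size_t n_drop) {
    if (ms.size() <= n_drop) return ms; // keep tiny collections intact
    std::sort(ms.begin(), ms.end());
    ms.erase(ms.begin(), ms.begin() + n_drop);
    return ms;
}
```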
Finally, the flushing kernel costs time, but the improved stability it brings allows the sample size to be decreased, offsetting that cost.
These three changes combined should address the "not-cold-enough" cold-cache issue.