Conversation

@dzarukin
Contributor

LNL may report higher BW than HW provides for certain shapes (the last number is expected to be <= 108):

Output template: %prb%,%-time%,%-Gflops%,%-Gbw%
--mode=P --matmul --engine=gpu --cold-cache=all --dt=bf16:bf16:bf16 --stag=ab --wtag=ab --dtag=ab --attr-scratchpad=user 1024x8:8x4096,0.026458,2536.43,320.15

The issue lies in hardware behavior: in certain situations the L3 cache won't be flushed, and input memories remain resident there even in cold-cache mode.
The resolution is to introduce a flushing kernel submitted after each execution to trigger that L3 cache flush. Credit for the idea, the library implementation, and pivoting of the PoC goes to @echeresh.

Note: even though the flushing kernel stabilizes results significantly, one or two hardware "shots" may still occur where the number is higher, as reported above; the cause is unknown at this moment.
The workaround for those "shots" is to drop the several fastest results from the timer collection.
Finally, the flushing kernel costs time, but the improved stability allows the sample size to be decreased.
These three changes combined should address the "not-cold-enough" cold-cache issue.

Output template: %prb%,%-time%,%-Gflops%,%-Gbw%
--mode=F --matmul --engine=gpu --cold-cache=all --dt=bf16:bf16:bf16 --stag=ab --wtag=ab --dtag=ab 1024x8:8x4096,0.08,838.861,105.882
--mode=P --matmul --engine=gpu --cold-cache=all --dt=bf16:bf16:bf16 --stag=ab --wtag=ab --dtag=ab --attr-scratchpad=user 1024x8:8x4096,0.078437,855.577,107.991

echeresh and others added 4 commits January 12, 2026 15:29
abort() prevents singleton destruction. The destructors are designed to
catch errors in a correctly finished application, so they print
misleading and annoying messages when the application finishes
incorrectly.

It also allows catching errors on the spot with gdb instead of breaking
on a specific line reported in the message.
Timer can now identify spikes in collected results and discard them
when reporting final results. This stabilizes the best-time statistics
for GPU performance validation.
@dzarukin dzarukin requested review from a team as code owners January 12, 2026 23:45
@github-actions github-actions bot added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel component:tests Codeowner: @oneapi-src/onednn-arch component:common labels Jan 12, 2026
@dzarukin
Contributor Author

make test perf-gpu
set primitive=reorder sum concat binary conv deconv pool

constexpr double magnitude_threshold = 1.1;
size_t major_magnitude = SIZE_MAX;
for (size_t i = 0; i < deltas.size(); i++) {
    deltas[i] = ms_vec_[i + 1] / ms_vec_[i];
Contributor
Do we need to check bounds? Say, if ms_vec_.size() == 1 and deltas_size == 1, this goes out of bounds.

Contributor

Another comment: do we really need deltas vector? Looks like its values are not used outside of the loop.

}
}

if (major_magnitude < deltas.size()) {
Contributor
Suggested change
if (major_magnitude < deltas.size()) {
if (major_magnitude < deltas_size) {
