Conversation

@dzarukin
Contributor

LNL may report higher BW than HW provides for certain shapes (the last number is expected to be <= 108):

Output template: %prb%,%-time%,%-Gflops%,%-Gbw%
--mode=P --matmul --engine=gpu --cold-cache=all --dt=bf16:bf16:bf16 --stag=ab --wtag=ab --dtag=ab --attr-scratchpad=user 1024x8:8x4096,0.026458,2536.43,320.15

The issue lies in hardware behavior: in certain situations the L3 cache won't be flushed, and input memories remain resident there even in cold-cache mode.
The resolution is to introduce a flushing kernel submitted after each execution to trigger that L3 cache flush. Credit for the idea, the library implementation, and pivoting of the PoC goes to @echeresh.

Note: even though the flushing kernel stabilizes results significantly, one or two hardware "shots" may still occur where the number is higher, as reported above; the cause is unknown at this moment.
The workaround for those "shots" is to drop the several fastest results from the timer collection.
Finally, the flushing kernel costs time, but the improved stability allows the sample size to be decreased.
These three changes combined should address the "not-cold-enough" cold-cache issue.

Output template: %prb%,%-time%,%-Gflops%,%-Gbw%
--mode=F --matmul --engine=gpu --cold-cache=all --dt=bf16:bf16:bf16 --stag=ab --wtag=ab --dtag=ab 1024x8:8x4096,0.08,838.861,105.882
--mode=P --matmul --engine=gpu --cold-cache=all --dt=bf16:bf16:bf16 --stag=ab --wtag=ab --dtag=ab --attr-scratchpad=user 1024x8:8x4096,0.078437,855.577,107.991

echeresh and others added 4 commits January 12, 2026 15:29
abort() prevents singleton destruction. The destructors are designed to
catch errors in a correctly finished application, so they print
misleading and annoying messages when the application finishes
incorrectly.

It also allows catching errors on the spot with gdb instead of breaking
on a specific line reported in the message.
Timer can now identify spikes in collected results and discard them
when reporting final results. This stabilizes the best-time statistics
for GPU performance validation.
@dzarukin dzarukin requested review from a team as code owners January 12, 2026 23:45
@github-actions github-actions bot added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel component:tests Codeowner: @oneapi-src/onednn-arch component:common labels Jan 12, 2026
@dzarukin
Contributor Author

make test perf-gpu
set primitive=reorder sum concat binary conv deconv pool

constexpr double magnitude_threshold = 1.1;
size_t major_magnitude = SIZE_MAX;
for (size_t i = 0; i < deltas.size(); i++) {
    deltas[i] = ms_vec_[i + 1] / ms_vec_[i];
Contributor
Do we need to check bounds? Say, if ms_vec_.size() == 1 and deltas_size == 1, this goes out of bounds.

Contributor

Another comment: do we really need deltas vector? Looks like its values are not used outside of the loop.

}
}

if (major_magnitude < deltas.size()) {
Contributor
Suggested change
if (major_magnitude < deltas.size()) {
if (major_magnitude < deltas_size) {
