perf stat recipes

Miscellaneous notes on perf and especially perf stat and related functions that count or sample based on hardware performance counters.

Memory

You should probably avoid the builtin hardware events like L1-dcache-load-misses since they have questionable definitions.

L2

The L2_RQSTS events are good, but the description in the manual is confusing and incorrect in places. For Skylake and derived archs (and probably Haswell), the functionality offered is fairly simple: every completed request has two attributes: its origin (where the request came from) and its result (did the request hit in the cache). There are 5 possible origins and 3 possible results, and the umask filters for any specified combination of result AND origin.

The 3 possible results are encoded in bits 5-7 (the 3 most significant bits of the umask) as follows:

umask bit	result type	notes
`0x80`	L2 hit M-state	The prior state of the line was M
`0x40`	L2 hit E/S-state	The prior state of the line was E/S¹
`0x20`	L2 miss	The line was not in L2

The 5 possible origins of L2 requests are encoded in bits 0-4 (the least significant 5 bits) as follows:

umask bit	origin	notes
`0x01`	Demand read requests	Demand read requests originating from the core (does not include SW prefetch)
`0x02`	RFO requests	RFOs originating from stores in the core - generally only "blind stores" without a read first
`0x04`	Instruction reads	Reads originating from misses in the L1I cache
`0x08`	L1 prefetch requests	Requests originating from the L1 HW prefetcher or software prefetch requests, possibly also NPP²
`0x10`	L2 HW prefetcher	Requests originating from within the L2 HW prefetcher itself

The masks can be combined in any way, and all events that match any of the selected origins and any of the selected results will be counted. This means that you always need to include at least one origin bit and one result bit, or else the count is always zero.
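The matching rule can be sketched in a few lines of Python. This is only an illustration of the bit layout in the two tables; the helper names `decode_umask` and `counts_anything` are made up for this sketch and are not part of perf.

```python
# Bit assignments as listed in the result and origin tables.
RESULT_BITS = {
    0x80: "L2 hit M-state",
    0x40: "L2 hit E/S-state",
    0x20: "L2 miss",
}
ORIGIN_BITS = {
    0x01: "Demand read requests",
    0x02: "RFO requests",
    0x04: "Instruction reads",
    0x08: "L1 prefetch requests",
    0x10: "L2 HW prefetcher",
}


def decode_umask(umask):
    """Split a umask into the origin names and result names it selects."""
    origins = [name for bit, name in ORIGIN_BITS.items() if umask & bit]
    results = [name for bit, name in RESULT_BITS.items() if umask & bit]
    return origins, results


def counts_anything(umask):
    """True only if at least one origin bit AND one result bit are set."""
    origins, results = decode_umask(umask)
    return bool(origins) and bool(results)


if __name__ == "__main__":
    # 0x21 selects one origin and one result; 0x20 and 0x1F each lack one side.
    for umask in (0x21, 0x20, 0x1F):
        print(hex(umask), decode_umask(umask), counts_anything(umask))
```

A umask for which `counts_anything` returns False has only origin bits or only result bits set, which is the case described above where the counter stays at zero.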

For example, to count just demand data loads that miss, use 0x01 | 0x20 == 0x21. To count all misses of any type, use 0x20 | 0x1F == 0x3F. To count all RFO requests regardless of the result, use 0xE0 | 0x02 == 0xE2, and so on.
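A minimal sketch of turning these combinations into raw event strings for `perf stat -e`. Two things here are assumptions that do not come from the text above: the event select code `0x24` for L2_RQSTS (check it against the event list for your CPU), and the `name=` labels, which are arbitrary. The helper functions are hypothetical.

```python
# Origin bits (umask bits 0-4).
DEMAND_READ, RFO, CODE_READ, L1_PREFETCH, L2_PREFETCH = 0x01, 0x02, 0x04, 0x08, 0x10
# Result bits (umask bits 5-7).
MISS, HIT_ES, HIT_M = 0x20, 0x40, 0x80

ALL_ORIGINS = 0x1F
ALL_RESULTS = 0xE0

# Assumption: L2_RQSTS event select on Skylake-family cores; not stated in the text.
L2_RQSTS_EVENT = 0x24


def l2_rqsts_umask(origins, results):
    """OR the selected origin and result bits, rejecting masks that count nothing."""
    if (origins & ALL_ORIGINS) == 0 or (results & ALL_RESULTS) == 0:
        raise ValueError("need at least one origin bit and one result bit")
    return (origins & ALL_ORIGINS) | (results & ALL_RESULTS)


def perf_event(umask, name):
    """Format a raw PMU event in perf's cpu/event=...,umask=.../ syntax."""
    return f"cpu/event={L2_RQSTS_EVENT:#x},umask={umask:#x},name={name}/"


if __name__ == "__main__":
    events = [
        perf_event(l2_rqsts_umask(DEMAND_READ, MISS), "l2_demand_rd_miss"),
        perf_event(l2_rqsts_umask(ALL_ORIGINS, MISS), "l2_all_miss"),
        perf_event(l2_rqsts_umask(RFO, ALL_RESULTS), "l2_all_rfo"),
    ]
    # Paste the output after "perf stat -e" followed by your command.
    print(",".join(events))
```

The printed list can be passed as a single `-e` argument, for example `perf stat -e <list> ./your_program`, where `./your_program` stands in for whatever you are measuring.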

¹ I haven't actually jumped through the hoops to carefully test that both the E and S state are covered, just that non-M lines fall into this category - but it seems very likely that would be the case. It can be quite hard to actually ensure you have a line in the E state (as opposed to S) since the cache may decide to bring it in either state depending on opaque heuristics.

² NPP is the next-page prefetcher, about which little information is available (some limited and empirical observations can be found here). Even when all other prefetchers are disabled, I observe one umask=0x80 event for every accessed page, so perhaps the NPP makes some type of request to the L2 which is flagged in this category.
