Intel Performance Quirks
Weird performance quirks discovered (not necessarily by me) on modern Intel chips. We don't cover things already mentioned in Agner's guides or the Intel optimization manual.
When multiple L1-miss loads hit the same L2 line, all but the first have a longer latency of 19 cycles vs 12
It seems to happen only if the two loads issue in the same cycle: if the loads issue in different cycles the penalty appears to be reduced to 1 or 2 cycles. Goes back to Sandy Bridge (probably), but not Nehalem (which can't issue two loads per cycle).
Benchmarks in uarch-bench can be run with --test-name=studies/memory/l2-doubleload/*.
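To make the triggering condition concrete, here is a minimal C sketch (mine, not the uarch-bench test) of a latency-bound loop where two loads to the same line can issue in the same cycle. The node layout, the pad field and the chase function are hypothetical; the list is assumed to be spread over a region that fits in L2 but not L1 and visited in an order that evicts each node from L1 between visits, so both loads are L1 misses that hit the same L2 line.

#include <stdint.h>

struct node {
    struct node *next;   /* pointer to the next node in the (shuffled) list */
    uintptr_t    pad;    /* always 0, lives in the same 64-byte line        */
};

uintptr_t chase(struct node *p, long iters) {
    for (long i = 0; i < iters; i++) {
        /* both loads depend only on p, so they can issue in the same cycle */
        struct node *n   = p->next;
        uintptr_t    pad = p->pad;
        /* the next address depends on both loads, so the iteration time is
           bounded by the slower of the two (19 vs 12 cycles per the text) */
        p = (struct node *)((uintptr_t)n | pad);
    }
    return (uintptr_t)p;
}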
adc with a zero immediate, i.e., adc reg, 0 is twice as fast as with any other immediate or register source on Haswell-ish machines
Normally adc is 2 uops (dispatched to p0156) with 2 cycles of latency, but in the special case where the immediate is zero it takes only 1 uop and has 1 cycle of latency on Haswell machines. This is a pretty important optimization for adc since adc reg, 0 is a common pattern used to accumulate the result of comparisons and other branchless techniques. Presumably the same optimization applies to sbb reg, 0. In Broadwell and beyond, adc is usually a single uop with 1 cycle latency, regardless of the immediate or register source.
Discussion at the bottom of this SO answer and the comments. The test in uarch-bench can be run with --test-name=misc/adc*.
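For context, here is a sketch (my own, not the uarch-bench test) of the kind of branchless code that makes adc reg, 0 hot: counting how many elements fall below a threshold by accumulating the carry flag. The function name and the GNU (AT&T-syntax) inline asm wrapper are mine; the cmp/adc pair is the pattern the quirk is about.

#include <stddef.h>
#include <stdint.h>

size_t count_below(const uint64_t *a, size_t n, uint64_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        __asm__("cmp %[t], %[v]\n\t"   /* CF = (a[i] < threshold), unsigned   */
                "adc $0, %[c]"         /* count += CF: the fast adc-zero form */
                : [c] "+r"(count)
                : [v] "r"(a[i]), [t] "r"(threshold)
                : "cc");
    }
    return count;
}

On Haswell the adc $0 here is 1 uop with 1 cycle of latency on the count chain thanks to the special case; with any other immediate it would cost 2 uops and 2 cycles.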
Short form adc and sbb using the accumulator (rax, eax, ax, al) are two uops on Broadwell and Skylake
In Broadwell and Skylake, most uses of adc and sbb take only one uop, versus two on prior generations. However, the "short form" specially encoded versions, which use the rax register (and its sub-registers eax, ax and al), still take 2 uops.
This is likely to occur in practice with any immediate that doesn't fit in a single byte: assemblers only emit the short accumulator form when a full 4-byte immediate is needed, since smaller immediates can use the sign-extended imm8 form, which is available for every register and isn't affected.
I first saw it mentioned by Andreas Abel in this comment thread.
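To illustrate which encodings are involved, here is a small sketch (mine) using GNU inline asm; the function exists only to give the assembler something to encode, and the uop counts in the comments are the Broadwell/Skylake behavior described above.

#include <stdint.h>

uint64_t adc_forms(uint64_t x, uint64_t y) {
    __asm__("stc\n\t"                   /* set CF so the adcs do something     */
            "adc $0x12345, %%rax\n\t"   /* short form (48 15 imm32): 2 uops    */
            "adc $0x12345, %%rcx\n\t"   /* ModRM form (48 81 D1 imm32): 1 uop  */
            "adc $1, %%rax"             /* imm8 form (48 83 D0 01): 1 uop even
                                           though it targets rax               */
            : "+a"(x), "+c"(y)
            :
            : "cc");
    return x + y;
}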
Minimum store-forwarding latency is 3 on new(ish) chips, but the load has to arrive at exactly the right time to achieve this
In particular, if the load address is ready as soon as the store address is, or at most 1 or 2 cycles later (you have to look at the dependency chains leading into each address to determine this), you'll get a variable store-forwarding delay of 4 or 5 cycles (apparently a 50% chance of each, giving an average of 4.5 cycles). If the load address is available exactly 3 cycles later, you'll get the faster 3-cycle store forwarding (i.e., the load isn't delayed any further). This can lead to weird effects like adding extra work speeding up a loop or a call/ret sequence, because the extra work delays the load just enough to get the "ideal" store-forwarding latency.
Initial discussion here.
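Here is a rough measurement sketch (mine, not the code from the linked discussion) using GNU (AT&T-syntax) inline asm. Both loops forward a pointer to itself through memory; in the first the load address is ready as soon as the store's, so each iteration pays the variable 4-5 cycle forwarding latency, while in the second three dependent 1-cycle LEAs delay only the load address, so forwarding can complete in the minimum 3 cycles (subtract the 3 LEA cycles from the measured iteration time to recover it).

#include <stdint.h>

uintptr_t sf_prompt(uint64_t *slot, long iters) {
    uintptr_t a = (uintptr_t)slot;
    __asm__ volatile(
        "1:\n\t"
        "mov %[a], (%[a])\n\t"   /* store the pointer into its own slot     */
        "mov (%[a]), %[a]\n\t"   /* load it right back: 4-5 cycle forward   */
        "dec %[n]\n\t"
        "jnz 1b"
        : [a] "+r"(a), [n] "+r"(iters) :: "memory", "cc");
    return a;
}

uintptr_t sf_delayed(uint64_t *slot, long iters) {
    uintptr_t a = (uintptr_t)slot, t;
    __asm__ volatile(
        "1:\n\t"
        "mov %[a], (%[a])\n\t"   /* store with the prompt address            */
        "lea 1(%[a]), %[t]\n\t"  /* three dependent LEAs delay the load      */
        "lea 1(%[t]), %[t]\n\t"  /* address by ~3 cycles relative to the     */
        "lea -2(%[t]), %[t]\n\t" /* store address (t ends up equal to a)     */
        "mov (%[t]), %[a]\n\t"   /* load arrives "on time": 3-cycle forward  */
        "dec %[n]\n\t"
        "jnz 1b"
        : [a] "+r"(a), [t] "=&r"(t), [n] "+r"(iters) :: "memory", "cc");
    return a;
}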
Stores to a cache line that is an L1-miss but L2-hit are unexpectedly slow if interleaved with stores to other lines
When stores that miss in L1 but hit in L2 are interleaved with stores that hit in L1 (for example), the throughput is much lower than one might expect (considering L2 latencies, fill buffer counts, and store buffer prefetching), and also weirdly bimodal: each L2 store (paired with an L1 store) might take 9 or 18 cycles. This effect can be eliminated by prefetching the lines into L1 before the store (or using a demand load to achieve the same effect), at which point you get the expected performance (similar to what you'd get for independent loads). A corollary is that this only happens with "blind stores", i.e., stores that write to a location without first reading it: if the location is read first, the load determines the caching behavior and the problem goes away.
Discussion at RWT and StackOverflow.
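A sketch of the access pattern (mine, not the code from the linked discussions): interleave a store that hits L1 with a blind store to a line that is assumed to already be in L2 but not L1 (the function and buffer names are hypothetical, and priming l2_buf into that state beforehand is left out). Passing prefetch=1 issues a prefetch of each line before storing to it, which per the above removes the penalty.

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch */

#define LINE 64

void blind_stores(uint8_t *l1_buf, uint8_t *l2_buf, size_t lines, int prefetch) {
    for (size_t i = 0; i < lines; i++) {
        uint8_t *l2_line = l2_buf + i * LINE;
        if (prefetch)
            _mm_prefetch((const char *)l2_line, _MM_HINT_T0); /* pull line toward L1 first */
        *(volatile uint64_t *)l1_buf  = i;   /* store that hits L1            */
        *(volatile uint64_t *)l2_line = i;   /* blind store: L1 miss, L2 hit  */
    }
}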
The 4-cycle best-case L1 load latency only applies when the base register comes directly from another load
A load using an addressing mode like [reg + offset] may have a latency of 4 cycles, which is the often-quoted best-case latency for recent Intel L1 caches - but this only applies when the value of reg was the result of an earlier load, e.g., in a pure pointer-chasing loop. If the value of reg was instead calculated by an ALU instruction, the fast path is not applicable, so the load will take 5 cycles (in addition to whatever latency the ALU operation adds).
For example, the following pointer-chase runs at 4 cycles per iteration in a loop:
mov rax, [rax + 128]
While the following equivalent code, which adds only a single-cycle latency ALU operation to the dependency chain, runs in 6 cycles per iteration rather than the 5 you might expect:
mov rax, [rax]
add rax, 128
The add instruction feeds the subsequent load, which disables the 4-cycle L1-hit latency path.
Peter Cordes mentioned this behavior somewhere on Stack Overflow but I can't find the place at the moment.
The 4-cycle best-case load latency fails and the load must be replayed when the base register points to a different page
Consider again a pointer-chasing loop like the following:
top:
mov rax, [rax + 128]
test rax, rax
jnz top
Since this meets the documented conditions for the 4-cycle best-case latency (simple addressing mode, non-negative offset < 2048) we expect it to take 4 cycles per iteration. Indeed, it usually does: but in the case that the base pointer [rax] and the full address including the offset [rax + 128] fall into different 4K pages, the load actually takes 9 or 10 cycles on recent Intel chips and dispatches twice (i.e., it is replayed after the different-page condition is detected).
Apparently, to achieve the 4-cycle latency the TLB lookup starts from the base register even before the full address is calculated (the address calculation presumably happens in parallel), and if the offset ends up putting the load in a different page this is detected and the load has to be replayed. On Haswell loads can "mispredict" continually in this manner, but on Skylake a mispredicted load forces the next load to use the normal 5-cycle path, which can't mispredict; this leads to alternating 5 and 10 cycle loads when the full address is always in a different page, for an average latency of 7.5 cycles.
Originally reported by user harold on StackOverflow and investigated in depth by Peter Cordes.
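A layout sketch (mine, not harold's test case) showing how one might set up the two cases in C: build_chain places one node per 4K page at a chosen offset within its page, and chase compiles to the mov rax, [rax + 128] pattern above. With page_off = 0 the base register and base+128 stay in the same page, so the 4-cycle path works; with page_off = PAGE - OFFSET every load crosses into the next page and gets replayed. The buffer should be allocated with one extra page of slack (e.g., aligned_alloc(PAGE, (n + 1) * PAGE)) and n kept small enough that the chain stays cache-resident, so the page-crossing effect is isolated from cache misses.

#include <stddef.h>
#include <stdint.h>

#define PAGE   4096
#define OFFSET 128

/* Build a circular chain with one node per page of mem; the next-node
   pointer for each node is stored at node + OFFSET and every node sits
   page_off bytes into its page. */
static char *build_chain(char *mem, size_t n, size_t page_off) {
    char *first = mem + page_off;
    for (size_t i = 0; i < n; i++) {
        char *node = mem + i * PAGE + page_off;
        char *next = (i + 1 < n) ? mem + (i + 1) * PAGE + page_off : first;
        *(char **)(node + OFFSET) = next;
    }
    return first;
}

/* Equivalent of the "mov rax, [rax + 128]" chase above. */
static char *chase(char *p, long iters) {
    for (long i = 0; i < iters; i++)
        p = *(char **)(p + OFFSET);
    return p;
}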
As reported in Evaluating the Cost of Atomic Operations on Modern Architectures, the latency of accessing a line in L3 depends on whether the last accesses were read-only accesses by other cores, or a write by another core. In the read case, the line may be silently evicted from the L1/L2 of the reading core, which doesn't update the "core valid" bits, and hence on a new access the L1/L2 of the core(s) that earlier accessed the line must be snooped and invalidated even if they no longer contain it. In the case of a modified line, it is written back on eviction, so the core valid bits are updated (cleared) and no invalidation of other cores' private caches needs to occur. Quoting from the above paper, pages 6-7:
In the S/E states executing an atomic on the data held by a different core (on the same CPU) is not influenced by the data location (L1, L2 or L3) ... The data is evicted silently, with neither writebacks nor updating the core valid bit in L3. Thus, all [subsequent] accesses snoop L1/L2, making the latency identical .... M cache lines are written back when evicted updating the core valid bits. Thus, there is no invalidation when reading an M line in L3 that is not present in any local cache. This explains why M lines have lower latency in L3 than E lines.
I haven't confirmed quirks in this section myself, but they have been reported elsewhere.
This seems like it would obviously affect L1 <-> L2 throughput, since you would need an additional L1 access to read the line to write back and an additional L2 access to accept the evicted line (and probably also an L2 -> L3 writeback), but in the linked post the claim was actually that it increased the latency of L2-resident pointer chasing. It isn't clear on what uarch the results were obtained.