Intel Performance Quirks
Weird performance quirks discovered (not necessarily by me) on modern Intel chips. We don't cover things already mentioned in Agner's guides or the Intel optimization manual.
L1-miss loads other than the first hitting the same L2 line have a longer latency of 19 cycles vs 12
It seems to happen only if the two loads issue in the same cycle: if the loads issue in different cycles the penalty appears to be reduced to 1 or 2 cycles. Goes back to Sandy Bridge (probably), but not Nehalem (which can't issue two loads per cycle).
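A minimal sketch of the pattern (NASM-style syntax; assume rdi points to a line that misses L1 but hits L2, and that the two loads can issue in the same cycle):
mov rax, [rdi]        ; first load to the line: ~12 cycle L2-hit latency
mov rdx, [rdi + 8]    ; second load to the same line, issuing in the same
                      ; cycle: observed latency closer to 19 cycles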
Benchmarks in uarch-bench can be run with --test-name=studies/memory/l2-doubleload/*.
adc with a zero immediate, i.e., adc reg, 0 is twice as fast as with any other immediate or register source on Haswell-ish machines
Normally adc is 2 uops (to p0156) and 2 cycles of latency, but in the special case that a zero immediate is used, it takes only 1 uop and 1 cycle of latency on Haswell machines. This is a pretty important optimization for adc, since adc reg, 0 is a common pattern used to accumulate the result of comparisons and other branchless techniques. Presumably the same optimization applies to sbb reg, 0. In Broadwell and beyond, adc is usually a single uop with 1 cycle latency, regardless of the immediate or register source.
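For example, a branchless accumulation of a compare result might look like the following sketch (NASM-style syntax; register choices are illustrative):
xor ecx, ecx
cmp rsi, rdx          ; sets CF if rsi < rdx (unsigned)
adc rcx, 0            ; zero immediate: 1 uop, 1 cycle latency on Haswell
adc rcx, 1            ; any other immediate (or register): 2 uops, 2 cycles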
Discussion at the bottom of this SO answer and the comments. Test in uarch-bench can be run with --test-name=misc/adc*.
Short form adc and sbb using the accumulator (rax, eax, ax, al) are two uops on Broadwell and Skylake
In Broadwell and Skylake, most uses of adc and sbb take only one uop, versus two on prior generations. However, the "short form" specially encoded versions, which use the rax register (and all the sub-registers like eax, ax and al), still take two uops. This short form is only used with an immediate operand of the same size as the destination register (or 32 bits for rax), so it wouldn't be used (for example) for an addition to eax whose immediate fits in a byte, because the imm8 form with its longer opcode would still be 2 bytes shorter overall (one byte longer opcode, but three bytes shorter immediate).
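A sketch of the encodings involved (NASM-style syntax; the byte listings in the comments are the standard x86 encodings, and the uop counts refer to Broadwell/Skylake as described above):
adc ecx, 12345        ; 81 D1 39 30 00 00  ModRM imm32 form: 1 uop
adc eax, 12345        ; 15 39 30 00 00     short (accumulator) form: 2 uops
adc eax, 1            ; 83 D0 01           imm8 form is chosen even for eax,
                      ;                    so this stays at 1 uop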
I first saw it mentioned by Andreas Abel in this comment thread.
Minimum store-forwarding latency is 3 on new(ish) chips, but the load has to arrive at exactly the right time to achieve this
In particular, if the load address is ready to go as soon as the store address is, or at most 1 or 2 cycles later (you have to look at the dependency chains leading into each address to determine this), you'll get a variable store forwarding delay of 4 or 5 cycles (seems to be a 50% chance of each, giving an average of 4.5 cycles). If the load address is available exactly 3 cycles later, you'll get a faster 3 cycle store forwarding (i.e., the load won't be further delayed at all). This can lead to weird effects like adding extra work speeding up a loop, or call/ret sequence, because it delayed the load enough to get the "ideal" store-forwarding latency.
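As a rough sketch (NASM-style syntax) of a round trip where the load address is ready immediately, so you'd expect the variable 4-5 cycle forwarding rather than the 3-cycle best case:
.loop:
mov [rsp - 8], rax    ; store the current value
mov rax, [rsp - 8]    ; immediately load it back (store-forwarded)
dec ecx
jnz .loop
Counterintuitively, delaying the load (for example by computing its address through a short dependent chain so it becomes ready exactly 3 cycles after the store address) can reduce the measured store-forwarding latency to 3 cycles.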
Initial discussion here.
Stores to a cache line that is an L1-miss but L2-hit are unexpectedly slow if interleaved with stores to other lines
When stores that miss in L1 but hit in L2 are interleaved with stores that hit in L1 (for example), the throughput is much lower than one might expect (considering L2 latencies, fill buffer counts, and store buffer prefetching), and it is also weirdly bi-modal. In particular, each L2 store (paired with an L1 store) might take 9 or 18 cycles. This effect can be eliminated by prefetching the lines into L1 before the store (or by using a demand load to achieve the same effect), at which point you get the expected performance (similar to what you'd get for independent loads). A corollary is that it only happens with "blind stores" (stores that write to a location without first reading it): if the location is read first, the load determines the caching behavior and the problem goes away.
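A sketch of the interleaving (NASM-style syntax; buffers and registers are illustrative: assume [rsi] stays L1-resident while the lines at [rdi] miss L1 but hit L2):
.loop:
mov [rsi], eax        ; store to a line that hits in L1
mov [rdi], eax        ; "blind" store to a line that misses L1 but hits L2
add rdi, 64           ; move to the next L2-resident line
dec ecx
jnz .loop
Per the description above, inserting prefetcht0 [rdi] (or a demand load of [rdi]) ahead of the store reportedly restores the expected throughput.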
Discussion at RWT and StackOverflow.
A load using an addressing mode like [reg + offset] may have a latency of 4 cycles, which is the often-quoted best-case latency for recent Intel L1 caches - but this only applies when the value of reg was the result of an earlier load, e.g., in a pure pointer-chasing loop. If the value of reg was instead calculated by an ALU instruction, the fast path is not applicable and the load will take 5 cycles (in addition to whatever latency the ALU operation adds).
For example, the following pointer-chase runs at 4 cycles per iteration in a loop:
mov rax, [rax + 128]
Meanwhile the following equivalent code, which adds only a single-cycle-latency ALU operation to the dependency chain, runs at 6 cycles per iteration rather than the 5 you might expect:
mov rax, [rax]
add rax, 128
The add instruction feeds the subsequent load, which disables the 4-cycle L1-hit latency path.
Peter Cordes mentioned this behavior somewhere on Stack Overflow but I can't find the place at the moment.
The 4-cycle best-case load latency fails and the load must be replayed when the base register points to a different page
Consider again a pointer-chasing loop like the following:
top:
mov rax, [rax + 128]
test rax, rax
jnz top
Since this meets the documented conditions for the 4-cycle best-case latency (simple addressing mode, non-negative offset < 2048) we expect it to take 4 cycles per iteration. Indeed, it usually does: but in the case that the base pointer [rax] and the full address including the offset [rax + 128] fall into different 4k pages, the load actually takes 9 or 10 cycles on recent Intel chips and dispatches twice: i.e., it is replayed after the different-page condition is detected. Update: actually, it is likely the load following the different-page load that dispatches twice: it first tries to dispatch 4 cycles after the prior load, but finds its operand not available (since the prior load takes 9-10 cycles) and has to try again later.
Apparently, to achieve the 4-cycle latency the TLB lookup happens based on the base register alone, even before the full address is calculated (which presumably happens in parallel); the result is then checked and the load has to be replayed if the offset puts the load in a different page. On Haswell, loads can "mispredict" continually in this manner, but on Skylake a mispredicted load forces the next load to use the normal full 5-cycle path, which won't mispredict; this leads to alternating 5 and 10 cycle loads when the full address is always on a different page, for an average latency of 7.5 cycles.
Originally reported by user harold on StackOverflow and investigated in depth by Peter Cordes.
As reported in Evaluating the Cost of Atomic Operations on Modern Architectures, the latency of accessing a line in the L3 depends on whether the last access was a read-only access by some other core, or a write by another core. In the read case, the line may be silently evicted from the L1/L2 of the reading core, which doesn't update the "core valid" bits, and hence on a new access the L1/L2 of the core(s) that earlier accessed the line must be snooped and invalidated even if they no longer contain the line. In the case of a modified line, it is written back when evicted and hence the core valid bits are updated (cleared), so no invalidation of other cores' private caches needs to occur. Quoting from the above paper, pages 6-7:
In the S/E states executing an atomic on the data held by a different core (on the same CPU) is not influenced by the data location (L1, L2 or L3) ... The data is evicted silently, with neither writebacks nor updating the core valid bit in L3. Thus, all [subsequent] accesses snoop L1/L2, making the latency identical .... M cache lines are written back when evicted updating the core valid bits. Thus, there is no invalidation when reading an M line in L3 that is not present in any local cache. This explains why M lines have lower latency in L3 than E lines.
An address that would otherwise be complex may be treated as simple if the index register is zeroed via idiom
General purpose register loads have a best-case latency of either 4 or 5 cycles, depending mostly on whether the address is simple or complex. In general, a simple address has the form [reg + offset] where offset < 2048, and a complex address is anything with a too-large offset, or which involves an index register, like [reg1 + reg2 * 4]. However, in the special case that the index register has been zeroed via a zeroing idiom, the address is treated as simple and is eligible for the 4-cycle latency.
So a pointer-chasing loop like this:
xor esi, esi
.loop
mov rdi, [rdi + rsi*4]
test rdi, rdi ; exit on null ptr
jnz .loop
This loop can run at 4 cycles per iteration, but the identical loop runs at 5 cycles per iteration if the initial zeroing of rsi is changed from xor esi, esi to mov esi, 0, since the former is a zeroing idiom while the latter is not.
This is mostly a curiosity, since if you know the index register rsi is always zero (as in the above example), you'd simply omit it from the addressing entirely. However, perhaps you have a scenario where rsi is often but not always zero; in that case a check for zero followed by an explicit xor-zeroing could speed things up by a cycle when the index is zero:
test rsi, rsi
jnz notzero
xor esi, esi ; semantically redundant, but speeds things up!
notzero:
; loop goes here
Perhaps more interesting than this fairly obscure optimization possibility is the implication for the micro-architecture. It implies that the decision on whether an address generation is simple or complex is made at least as late as the rename stage, since that is where this zeroed-register information is (dynamically) available. It means that simple-vs-complex is not decided earlier, near the decode stage, based on the static form of the address - as one might have expected (as I did).
Reported and discussed on RWT.
When the most recent modification of a register was zeroing it with vzeroall, the register seems to stay in a special state that causes slowdowns when it is used as a source (this stops as soon as the register is written again). In particular, a dependent series of vpaddq instructions using a vzeroall-zeroed ymm register as a source seems to take 1.67 cycles per instruction, rather than the expected 1 cycle, on SKL and SKX.
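A sketch of the reported scenario (NASM-style syntax; the loop counter and register choices are illustrative):
vzeroall                  ; ymm1 (among others) is now in the special zeroed state
.loop:
vpaddq ymm0, ymm0, ymm1   ; dependent chain on ymm0, sourcing the zeroed ymm1:
dec ecx                   ; ~1.67 cycles per vpaddq reported on SKL/SKX
jnz .loop
Writing ymm1 normally (e.g., with vpxor) before the loop reportedly restores the expected 1 cycle per iteration.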
More details and initial report at RWT.
Normally, if an instruction requires more than one fused-domain uop, the component uops can be renamed without regard to instruction boundaries (the rename bottleneck of 4 or 5 uops per cycle is one of the narrowest bottlenecks on most Intel CPUs). That is, there is no issue if uops from a single instruction fall into different rename groups. As an example, alternating two-uop and three-uop instructions could be renamed as AABB BAAB BBAA BBBA ..., where AA and BBB are the two- and three-uop sequences from each instruction and the spacing shows the 4-uop rename groups: uops from a single instruction span freely across groups.
However, in the case that AA are the two uops of an unlaminated instruction (i.e., one which takes 1 slot in the uop cache but two rename slots), they apparently must appear in the same rename (allocation) group. Taking the same example as above, but with A now a 1:2 unlaminated instruction and B unchanged as a 3-uop-at-decode instruction, the rename pattern would look like AABB BAAB BBAA BBBx AABB ..., where x is a 1-slot rename bubble, not present in the first example, caused by the requirement that the unlaminated uops be in the same rename group.
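As a concrete (and assumed) example of an A-type instruction, based on the usual Skylake unlamination rules:
vpaddd xmm0, xmm0, [rdi + rsi*4]  ; indexed memory source on a 3-operand op:
                                  ; micro-fused in the uop cache (1 slot) but
                                  ; unlaminated to 2 uops before rename, and
                                  ; both uops must land in the same rename group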
In some cases with a high density of unlaminated ops this could cause a significant reduction in rename throughput; it can sometimes be avoided with careful reordering.
First known report by Andreas Abel in this StackOverflow question.
The usual encoding for the pop r12 instruction is slower to decode (it goes to the complex decoder, so is limited to 1 per cycle) than popping to other registers (except pop rsp, which is also slow). The cause seems to be that the short form of pop encodes the register in 3 bits of the opcode byte: those three bits can encode only 8 registers, of course, so the extra bit that chooses between the classic registers (rax et al.) and the x86-64 registers (r8 through r15) lives in the REX prefix byte. In this encoding, pop rsp and pop r12 share the same 3 bits and hence the same opcode byte. Since pop rsp needs to be handled specially, it gets handled by the complex decoder, and pop r12 gets sucked along for the ride (although it does ultimately decode to a single uop, unlike pop rsp).
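The relevant encodings (the opcode bytes shown in the comments are the standard x86 encodings):
pop rcx               ; 59     short form: register in the low 3 opcode bits
pop rsp               ; 5C     special-cased, complex decoder
pop r12               ; 41 5C  REX.B + 5C: same opcode byte as pop rsp,
                      ;        so it also goes to the complex decoder
pop r13               ; 41 5D  shares its opcode byte with pop rbp: fine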
Reported by Andreas Abel with a detailed explanation of the effect given by Peter Cordes.
Single uop instructions which have other forms that decode to 2 fused-domain uops go to the complex decoder
This is the general case of the pop r12 quirk mentioned above. For example, cdq results in only 1 uop, but it can only be decoded by the complex decoder. It seems like these instructions are mostly (all?) ones where some other variant sharing the same opcode needs >= 2 fused-domain uops. Presumably the predecoder doesn't look beyond the opcode to distinguish the cases and sends them all to the complex decoder, which then ends up emitting only 1 uop in the simple cases.
Reported by Andreas Abel on StackOverflow with the theory above relating to opcodes shared with 2+ uop variants fleshed out by myself, Andreas and Peter Cordes in the comments.
I haven't confirmed quirks in this section myself, but they have been reported elsewhere.
This seems like it would obviously affect L1 <-> L2 throughput since you would need an additional L1 access to read the line to write back and an additional access to L2 to accept the evicted line (and also probably an L2 -> L3 writeback), but on the linked post the claim was actually that it increased the latency of L2-resident pointer chasing. It isn't clear on what uarch the results were obtained.
I could not reproduce this effect in my tests.
When an FP instruction receives an input from a SIMD integer instruction, we expect a bypass delay, but Peter Cordes reports that this delay can affect all operands indefinitely, even when the other operands are FP and when bypass definitely no longer occurs:
xsave/rstor can fix the issue where writing a register with a SIMD-integer instruction like paddd creates extra latency indefinitely for reading it with an FP instruction, affecting latency from both inputs. e.g. paddd xmm0, xmm0 then in a loop addps xmm1, xmm0 has 5c latency instead of the usual 4, until the next save/restore. It's bypass latency but still happens even if you don't touch the register until after the paddd has definitely retired (by padding with >ROB uops) before the loop.
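A sketch of the scenario described in the quote (NASM-style syntax; the filler and loop structure are illustrative):
paddd xmm0, xmm0      ; SIMD-integer write to xmm0
; ... more than a ROB's worth of filler uops, so paddd has certainly retired ...
.loop:
addps xmm1, xmm0      ; FP add reading xmm0: ~5 cycle latency reported,
dec ecx               ; instead of the usual 4, until xmm0 is rewritten
jnz .loop             ; (or an xsave/xrstor occurs)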
Most operations can dispatch in the cycle following allocation, but according to IACA, loads can only dispatch in the 4th cycle after allocation. I have not tested whether real hardware follows this rule or if it is an error in IACA's model. Update: I didn't find any evidence for this effect.
Reported by Rivet Amber and discussed on Twitter. My tests seem to indicate that this effect does not occur.