Cranelift: consider adding a loop-peeling optimization pass #10111

Open
fitzgen opened this issue Jan 24, 2025 · 2 comments
Labels
cranelift:goal:optimize-speed (Focus area: the speed of the code produced by Cranelift), cranelift (Issues related to the Cranelift code generator)

Comments

fitzgen (Member) commented Jan 24, 2025

That is, a pass that unrolls one iteration of the loop body just before the actual loop. This is often beneficial because loop bodies might contain conditional code that only executes on the first iteration.

There could be an additional benefit for Cranelift in particular: right now, Cranelift will deduplicate various kinds of side-effecting-but-idempotent instructions (such as trapz), but it will not move those instructions out of their original blocks to hoist them just above a loop, even when it would be valid to do so. If we have already unrolled the first iteration of a loop, however, that unrolled first iteration dominates the rest of the loop, so when we dedupe the instructions inside the loop body against their unrolled counterparts, we effectively perform exactly the code motion we wanted to do before but could not.
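To make that concrete, here is a hand-written CLIF sketch (illustrative only; the shapes, trap code, and value numbering are made up, not real compiler output). Before peeling, the trapz is loop-invariant but re-executes on every iteration, because we will not hoist a side-effecting instruction out of its block:

```clif
function %loop_trap(i64, i32) -> i32 {
block0(v0: i64, v1: i32):
    jump block1(v1)

block1(v2: i32):
    trapz v0, user0          ;; loop-invariant, idempotent, but side-effecting
    v3 = load.i32 v0
    v4 = isub v2, v3
    brif v4, block1(v4), block2

block2:
    return v4
}
```

After peeling one iteration, the copy of the trapz in the peeled iteration dominates the loop header, so GVN can dedupe the in-loop trapz against it -- which is exactly the hoisting we wanted:

```clif
function %loop_trap_peeled(i64, i32) -> i32 {
block0(v0: i64, v1: i32):
    trapz v0, user0          ;; peeled first iteration dominates the loop
    v3 = load.i32 v0
    v4 = isub v1, v3
    brif v4, block1(v4), block2(v4)

block1(v5: i32):
    ;; the trapz that used to be here is now redundant with the dominating
    ;; copy in block0, so GVN removes it
    v6 = load.i32 v0
    v7 = isub v5, v6
    brif v7, block1(v7), block2(v7)

block2(v8: i32):
    return v8
}
```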

Random tidbits:

  • Could we do this at the same time as the aegraphs pass? I think so, but it would live in the same traversal rather than actually in ISLE, similar to e.g. the alias analysis optimization.

  • We will need some heuristics for when to peel a loop and when not to. I don't think we want the code bloat of peeling every single loop in the whole program. Maybe just innermost loops? Maybe just loops up to a certain size? There is a lot of experimenting to be done here.

fitzgen added the cranelift and cranelift:goal:optimize-speed labels Jan 24, 2025
fitzgen (Member, Author) commented Jan 25, 2025

If we don't care about cleaning up conditional code that only runs in the first iteration of the loop (and can statically be shown to do so after peeling), and we only care about being able to LICM side-effectful-but-idempotent instructions, then we could instead investigate extending aegraph elaboration to support LICM'ing the idempotent parts of the side-effectful skeleton whenever that wouldn't move one side-effectful instruction across another. But that feels a little too specific and one-off for my tastes compared to peeling, which effectively gives us that and more all at once (i.e. it is more general and powerful, and probably not very different in terms of engineering effort).
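A rough illustration of that ordering constraint (again hand-written fragments, not real output): in the first fragment the trapz is the first element of the body's side-effecting skeleton, so elaborating it into the pre-header would not reorder it with any other side effect; in the second, hoisting it would move the trap above the store's first execution, so it would have to stay put.

```clif
;; fragment A -- hoistable: no other side effect precedes the trapz in the body
block1(v1: i64, v2: i32):
    trapz v1, user0
    v3 = load.i32 v1
    ;; ...

;; fragment B -- not hoistable: moving the trapz to the pre-header would
;; reorder the trap with the store's first execution
block1(v1: i64, v2: i32):
    store v2, v1
    trapz v2, user0
    ;; ...
```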

cfallin (Member) commented Jan 25, 2025

I'd be interested in working through some example IR you've seen that motivates this -- my preference is actually the opposite, toward better LICM rather than general peeling. The main reasons are:

  • Peeling doubles the size of the loop (in the general case), as you mention in your second bullet point; that seems to me like it will be the dominant negative effect and a large one. Consider the two universes we could choose with our heuristics:
    • We peel only inner loops or small loops, as suggested. Then any benefit of hoisting traps happens only for some code in the program; and there are surprising performance cliffs that occur. (Adding a few instructions causes the whole loop to slow significantly; adding another nested loop somewhere turns the main loop into "not an inner loop anymore".) This seems generally contrary to the "predictable performance" design goal of Wasm and by extension of Cranelift.
    • We peel all loops. Then we have significant code expansion and compile-time slowdown: somewhere between 1x and 2x, depending on how much of a program's body lives in loops (or more, depending on how we handle nested loops). Note that this will also turn into a runtime slowdown if there is any icache pressure.
  • Peeling is a pretty complex change to the aegraph visitor pass; it breaks some fundamental invariants and adds a new dimension of complexity (output CFG is not input CFG, with all the associated implications for SSA, the domtree, and otherwise).
  • The main benefit of peeling on its own is LICM-for-free (but at the cost of doubling code size); additional benefits, like specializing the first iteration, require other work such as branch folding, which we don't do at all currently. We could revisit peeling when we do branch folding, since both are control-flow/code-motion optimizations.

In contrast, "better LICM" feels much more tractable to me: it doesn't require bulk CFG manipulation (only instruction placement/scheduling, which we're already doing) and gets the same concrete benefits on e.g. repeated trap checks without paying the large associated cost of duplicating the loop body.
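For concreteness, on the hand-written loop sketched earlier in this thread, the end state "better LICM" would aim for is just the trapz elaborated once into the pre-header, with no duplication of the body (again an illustrative sketch, not actual output):

```clif
function %loop_trap_licm(i64, i32) -> i32 {
block0(v0: i64, v1: i32):
    trapz v0, user0          ;; elaborated once, into the pre-header
    jump block1(v1)

block1(v2: i32):
    v3 = load.i32 v0
    v4 = isub v2, v3
    brif v4, block1(v4), block2

block2:
    return v4
}
```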
