
Fix order #4914

Open · wants to merge 2 commits into main
Conversation

rawnhenry

Makes the default order from contiguity row-major. It used to be transposed by default, which could lead to unnecessary convert layouts in the final TTGIR if the contiguity in all dimensions was equal.

Adds a new getOrderFromContiguity function to handle the default case differently, since the order reversal would be surprising from a call to argsort.
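The helper's intent can be sketched in Python (hypothetical names and logic inferred from this description; the real implementation is C++ in the coalesce pass): when every dimension has the same contiguity, fall back to the row-major default order `[rank-1, ..., 1, 0]` instead of trusting a stable argsort, whose tie-breaking returns the transposed order.

```python
def arg_sort_desc(values):
    # Stable descending argsort: tied elements keep their original
    # index order, so an all-equal input returns [0, 1, ..., rank-1].
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

def get_order_from_contiguity(contiguity):
    # Hypothetical sketch of the PR's getOrderFromContiguity helper:
    # if all dimensions are equally contiguous, return the row-major
    # default order [rank-1, ..., 1, 0] rather than the argsort result.
    if len(set(contiguity)) == 1:
        return list(reversed(range(len(contiguity))))
    return arg_sort_desc(contiguity)

print(arg_sort_desc([1, 1]))              # [0, 1]: transposed vs. the default
print(get_order_from_contiguity([1, 1]))  # [1, 0]: keeps the row-major default
print(get_order_from_contiguity([4, 1]))  # [0, 1]: dim 0 really is fastest
```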

Complete the following tasks before sending your PR, and replace [ ] with
[x] to indicate you have done them.

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@Jokeren
Contributor

Jokeren commented Oct 15, 2024

I don't understand what problems this PR is trying to address. Is there any specific case you find something wrong?

@rawnhenry
Author

rawnhenry commented Oct 15, 2024

It solves a case where your tensor has a contiguity of all 1s (or, more generally, where the contiguity of all the dimensions in the tensor is equal). In that case, using argSort to infer the order will return the transpose of the original layout.

In my use case, I had a load (which cannot be vectorized) that was consumed by a store (which can be vectorized). The transpose introduced by the coalesce pass led to an unnecessary convertLayout in the final code. The root cause is that argSort returns the transpose of the original order if a layout has no contiguity.

This caused some performance issues which go away with this PR.

@Jokeren
Contributor

Jokeren commented Oct 15, 2024

Seems to me that it's better to take a look at the coalesce pass to check why the convert layout is generated. Also, not all convert layout ops come with significant overhead; sometimes a convert layout is just a permutation of registers.

@rawnhenry
Author

rawnhenry commented Oct 15, 2024

The convert layout is generated because the argSort here returns {0, 1} when the contiguity is {1, 1}. I would expect it to return {1, 0} in order to maintain the default ordering of the tensor. I added a new function since this index reversal seems unexpected for a vanilla argSort function.

I understand that not all layout conversions are expensive. However, this one is indeed expensive and unnecessary.
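The tie behavior can be reproduced with a plain stable sort in Python (a sketch, not Triton's actual C++ argSort): with all keys equal, a stable descending argsort leaves the indices in ascending order, which is the transpose of the default order [1, 0].

```python
contiguity = [1, 1]  # compiler knows nothing about contiguity
# Stable descending argsort: ties keep their original index order.
order = sorted(range(len(contiguity)),
               key=lambda i: contiguity[i], reverse=True)
print(order)  # [0, 1], the transpose of the default order [1, 0]
```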

@Jokeren
Contributor

Jokeren commented Oct 15, 2024

> The convert layout is generated because the argSort here returns {0, 1} when the contiguity is {1, 1}. I would expect it to return {1, 0} in order to maintain the default ordering of the tensor. I added a new function since this index reversal seems unexpected for a vanilla argSort function.

Either [1, 0] or [0, 1] is correct. I don't know yet what the performance impact of changing it to [1, 0] is for other test cases, so I'll block the PR for now.
What I proposed is that there might be a way to eliminate the conversion by checking the contiguity for both cases.

@ThomasRaoux
Collaborator

please add a lit test in https://github.com/triton-lang/triton/blob/main/test/TritonGPU/coalesce.mlir

I agree with Keren that this doesn't address the problem of avoiding incompatible layouts. That being said, I find it a bit weird that we pick the transposed layout when the compiler cannot tell anything about contiguity. Right now triton to triton GPU defaults to [1, 0]; I think we should keep it like that if we have no information about contiguity, which is what this PR does. @Jokeren, what do you think?

@Jokeren
Contributor

Jokeren commented Oct 15, 2024

I think it's fine to merge the PR if there's no perf regression issue. Otherwise you may have to bisect the commits later anyway

@ThomasRaoux
Collaborator

> I think it's fine to merge the PR if there's no perf regression issue. Otherwise you may have to bisect the commits later anyway

Yes, definitely. We can test for that or let the nightly runs catch it.
