
Fix order #4914

Open · wants to merge 2 commits into main
Conversation

rawnhenry

Makes the default order from contiguity row-major. It used to be transposed by default, which could lead to unnecessary convert layouts in the final TTGIR if the contiguity in all dimensions was equal.

Adds a new getOrderFromContiguity function to handle the default case differently, since the order reversal would be surprising from a call to argsort.
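The helper's intent can be sketched in Python (hypothetical names and logic inferred from this description; the real implementation is C++ in the coalesce pass): when every dimension has the same contiguity, fall back to the row-major default order `[rank-1, ..., 1, 0]` instead of trusting a stable argsort, whose tie-breaking returns the transposed order.

```python
def arg_sort_desc(values):
    # Stable descending argsort: tied elements keep their original
    # index order, so an all-equal input returns [0, 1, ..., rank-1].
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

def get_order_from_contiguity(contiguity):
    # Hypothetical sketch of the PR's getOrderFromContiguity helper:
    # if all dimensions are equally contiguous, return the row-major
    # default order [rank-1, ..., 1, 0] rather than the argsort result.
    if len(set(contiguity)) == 1:
        return list(reversed(range(len(contiguity))))
    return arg_sort_desc(contiguity)

print(arg_sort_desc([1, 1]))              # [0, 1]: transposed vs. the default
print(get_order_from_contiguity([1, 1]))  # [1, 0]: keeps the row-major default
print(get_order_from_contiguity([4, 1]))  # [0, 1]: dim 0 really is fastest
```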

Complete the following tasks before sending your PR, and replace [ ] with
[x] to indicate you have done them.

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@Jokeren
Contributor

Jokeren commented Oct 15, 2024

I don't understand what problems this PR is trying to address. Is there any specific case you find something wrong?

@rawnhenry
Author

rawnhenry commented Oct 15, 2024

It solves a case where your tensor has a contiguity of all 1s (or, more generally, where the contiguity of all the dimensions in the tensor is equal). In that case, using argSort to infer the order will return the transpose of the original layout.

In my use case, I had a load (which cannot be vectorized) that was consumed by a store (which can be vectorized). The transpose introduced by the coalesce pass led to an unnecessary convertLayout in the final code. The root cause is that argSort returns the transpose of the original order if a layout has no contiguity.

This caused some performance issues which go away with this PR.

@Jokeren
Contributor

Jokeren commented Oct 15, 2024

Seems to me that it's better to take a look at the coalesce pass to check why the convert layout is generated. Also, not all convert layout ops come with significant overhead; sometimes a convert layout is just a permutation of registers.

@rawnhenry
Author

rawnhenry commented Oct 15, 2024

The convert layout is generated because the argSort here returns {0, 1} when the contiguity is {1, 1}. I would expect it to return {1, 0} in order to maintain the default ordering of the tensor. I added a new function since this index reversal seems unexpected for a vanilla argSort function.

I understand that not all layout conversions are expensive. However, this one is indeed expensive and unnecessary.
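The tie behavior can be reproduced with a plain stable sort in Python (a sketch, not Triton's actual C++ argSort): with all keys equal, a stable descending argsort leaves the indices in ascending order, which is the transpose of the default order [1, 0].

```python
contiguity = [1, 1]  # compiler knows nothing about contiguity
# Stable descending argsort: ties keep their original index order.
order = sorted(range(len(contiguity)),
               key=lambda i: contiguity[i], reverse=True)
print(order)  # [0, 1], the transpose of the default order [1, 0]
```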

@Jokeren
Contributor

Jokeren commented Oct 15, 2024

> The convert layout is generated because the argSort here returns {0, 1} when the contiguity is {1, 1}. I would expect it to return {1, 0} in order to maintain the default ordering of the tensor. I added a new function since this index reversal seems unexpected for a vanilla argSort function.

Either [1, 0] or [0, 1] is correct. I don't know yet what the performance impact of changing it to [1, 0] is for other test cases, so I'll block the PR for now.
What I proposed is that there might be a way to eliminate the conversion by checking the contiguity for both cases.

@ThomasRaoux
Collaborator

please add a lit test in https://github.com/triton-lang/triton/blob/main/test/TritonGPU/coalesce.mlir

I agree with Keren that this doesn't address the problem of avoiding incompatible layouts. That being said, I find it a bit weird that we pick the transposed layout when the compiler cannot tell anything about contiguity. Right now triton to triton GPU defaults to [1, 0]; I think we should keep it like that if we have no information about contiguity, which is what this PR does. @Jokeren, what do you think?

@Jokeren
Contributor

Jokeren commented Oct 15, 2024

I think it's fine to merge the PR if there's no perf regression issue. Otherwise you may have to bisect the commits later anyway

@ThomasRaoux
Collaborator

> I think it's fine to merge the PR if there's no perf regression issue. Otherwise you may have to bisect the commits later anyway

Yes, definitely. We can test for that or let the nightly runs catch it.
