[DO NOT MERGE] Reintroduce "[SWP] attempt to remove a workaround for a triton llvm codegen bug (#4873)" #4973

Open · wants to merge 1 commit into base: main
@@ -450,37 +450,6 @@ assignMemoryLayouts(llvm::SmallVector<std::tuple<Operation *, int, Operation *>>
       // If we can't agree on a shared encoding skip pipelining the load.
       if (incompatible)
         continue;
-
-      // HACK: Triton LLVM codegen has a bug where local_loads from #shared to
-      // #mma layout can lead to invalid code if the loaded shape is smaller
-      // than the mma tile (e.g. loading a 128x1 tensor for an MMAv2 dot with
-      // tile {16,8} is bad because 1 < 8). To work around this, don't
-      // pipeline such loads.
-      //
-      // The codegen bug is caught by an assertion, so if you think you've
-      // fixed it, feel free to delete this code and see if the assert still
-      // fails. :)
-      if (!loadInfo.sharedEncoding) {
-        if (auto dotEnc = dyn_cast<ttg::NvidiaMmaEncodingAttr>(
-                dot.getResult().getType().getEncoding())) {
-          auto loadTy = cast<RankedTensorType>(op->getResultTypes()[0]);
-          auto mmaInstrShape = dotEnc.getInstrShape();
-          if (loadTy.getRank() < mmaInstrShape.size())
-            continue;
-          bool ok = true;
-          for (int i = 0; i < mmaInstrShape.size(); i++) {
-            if (loadTy.getShape()[loadTy.getRank() - mmaInstrShape.size() +
-                                  i] < mmaInstrShape[i]) {
-              ok = false;
-              break;
-            }
-          }
-          // If this load might trigger the bug, don't do the fallback logic
-          // below, which might allow the load to be pipelined.
-          if (!ok)
-            continue;
-        }
-      }
     }
   } else if (auto loadOp = dyn_cast<tt::LoadOp>(use)) {
     // The use of this loadOp is another loadOp. If the use is not in the
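For context, the guard being deleted compares the trailing dimensions of the loaded tensor against the MMA instruction shape and skips pipelining whenever any dimension falls short. A minimal standalone sketch of that comparison follows; `loadCoversMmaTile` is a hypothetical helper name, and plain `std::vector` stands in for the MLIR types (`RankedTensorType`, the `ArrayRef` returned by `getInstrShape()`) used in the real pass.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of the deleted workaround's shape check (hypothetical helper; the
// real logic is inlined in assignMemoryLayouts). Returns true when every
// trailing dimension of the loaded shape covers the corresponding MMA
// instruction-shape dimension, i.e. the load would be safe to pipeline.
static bool loadCoversMmaTile(const std::vector<int64_t> &loadShape,
                              const std::vector<int64_t> &mmaInstrShape) {
  if (loadShape.size() < mmaInstrShape.size())
    return false; // rank too small; the original code skipped pipelining here
  size_t offset = loadShape.size() - mmaInstrShape.size();
  for (size_t i = 0; i < mmaInstrShape.size(); ++i)
    if (loadShape[offset + i] < mmaInstrShape[i])
      return false; // some tile dimension is not covered by the load
  return true;
}

int main() {
  // The example from the deleted comment: a 128x1 load feeding an MMAv2 dot
  // with instruction shape {16,8} trips the codegen bug because 1 < 8.
  bool ok = loadCoversMmaTile({128, 1}, {16, 8});
  std::printf("128x1 vs {16,8}: %s\n",
              ok ? "safe to pipeline" : "skip pipelining");
  return 0;
}
```

The deleted comment invites exactly this experiment: remove the guard and see whether the codegen assertion still fires, which presumably explains the [DO NOT MERGE] title while that is being verified.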
test/TritonGPU/loop-pipeline.mlir (3 changes: 2 additions & 1 deletion)

@@ -1450,7 +1450,8 @@ module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 2 :
 // -----

 // COMMON-LABEL: @dont_pipeline_128x1
-// COMMON-NOT: local_load{{.*}}128x1
+// AMD-NOT: local_load{{.*}}128x1
+// CHECK: local_load{{.*}}128x1
 #blocked = #triton_gpu.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 1], warpsPerCTA = [4, 1], order = [0, 1]}>
 #mma = #triton_gpu.nvidia_mma<{versionMajor = 2, versionMinor = 0, warpsPerCTA = [4, 1], instrShape = [16, 8]}>
 module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 : i32} {