-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hip mfma tests #246
base: develop
Are you sure you want to change the base?
hip mfma tests #246
Conversation
…lemented for gfx908 and gfx90a
… not availble. kernel is currently empty
…tructions are available
This is looking much better. |
fix spacing
constexpr Index_type Ne = m_Ne; | ||
constexpr Index_type NeNe = m_Ne * m_Ne; | ||
|
||
dim3 gridDim (1, 1, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mfma instructions operate on a per-wavefront basis, as opposed to per thread. We're using 4 groups of 16 threads for each outer product, so we only need single block per grid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried that once we saturate flops for a CU we'll be leaving flops on the table.
src/common/HipDataUtils.hpp
Outdated
hipDeviceProp_t devProp; | ||
hipGetDeviceProperties(&devProp, 0); | ||
std::string gcnArchName(devProp.gcnArchName); | ||
std::string hipArch = gcnArchName.substr(0, 6); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is 0, 6 the right thing for all architectures, aren't there 7 digit gpu names like gfx10##?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Currently we're only using it to test gfx908 and gfx90a features but if we want to use this function more generally (e.g. for testing xnack-ness) I suppose we should really grab the entire string and then sub select it based on what we're going to query. The full name example would be like:
gfx908:sramecc-:xnack-
gfx1010:sramecc-:xnack-
#define MAT_FUSED_MUL_ADD_BODY \ | ||
Real_type dot = 0; \ | ||
for (Index_type k = 0; k < Ne; ++k) { \ | ||
dot += A[row*Ne + k + ii*(Ne*Ne)] * B[k*Ne + col + ii*(Ne*Ne)]; \ | ||
} \ | ||
D[row*Ne + col + ii*(Ne*Ne)] = dot; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be doing D = A*B
or D = A*B+C
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mfma instructions are for computing D = A x B + C but we're assuming for this case C is zeros and ignored. It's defined this way so we can expand future cases with a non-trivial C matrix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of the kernel is confusing then if we're not actually doing the ADD part.
src/basic/MAT_FUSED_MUL_ADD-Seq.cpp
Outdated
|
||
startTimer(); | ||
for (Index_type irep = 0; irep < run_reps; ++irep) { | ||
for(Index_type ii = 0; ii != (N/(Ne*Ne)); ++ii){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is N/(Ne*Ne) the number of elements, should we make it a named quantity?
src/apps/DEL_DOT_VEC_2D-Hip.cpp
Outdated
@@ -122,7 +122,7 @@ void DEL_DOT_VEC_2D::runHipVariantImpl(VariantID vid) | |||
|
|||
const size_t grid_size = RAJA_DIVIDE_CEILING_INT(iend, block_size); | |||
|
|||
hipLaunchKernelGGL((lambda_hip_forall<block_size, decltype(deldotvec2d_lambda)>), | |||
hipLaunchKernelGGL((lambda_hip_forall_1D<block_size, decltype(deldotvec2d_lambda)>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this update be in another PR to keep this focused on the Kernel? I try to keep cuda and hip in sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can revert it and/or move it to a new branch, it got pulled in from one of your review comments and I foresee adding a lambda_hip_forall using 2D thread indexing and wanted to get in front of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not rename it here.
const Index_type N_Elem = N/(Ne*Ne); | ||
for(Index_type ii = 0; ii != N_Elem; ++ii){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to parallelize over elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is to mirror an FE mass matrix solve we need to assume that each element is independent and shares no common data (they could but this is the worst case).
This PR adds basic functionality test of leveraging the matrix cores on AMD gfx908 and gfx90a hardware for dense matrix products.