
Add matmul with float16 #39

Merged 1 commit into AnswerDotAI:main on Aug 8, 2024

Conversation

@junjihashimoto (Collaborator) commented Aug 5, 2024

This PR implements matrix multiplication with float16.
The Mac's memory bandwidth is not sufficient to reach the GPU's peak floating-point throughput.
To improve performance, we need to reduce memory traffic.
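
For intuition on why halving the element size matters when the bottleneck is memory bandwidth, here is a back-of-the-envelope sketch of a lower bound on traffic for a square matmul (each of A, B, C crosses the memory bus at least once; real kernels move more, but the 2x ratio from the element size still holds):

```cpp
#include <cstddef>
#include <cstdio>

// Lower bound on bytes moved for C = A * B with square N x N matrices:
// read A, read B, write C -- N*N elements each.
static std::size_t matmulTrafficBytes(std::size_t n, std::size_t bytesPerElement) {
  return 3 * n * n * bytesPerElement;
}

int main() {
  const std::size_t n = 4096;
  std::printf("f32 traffic: %zu MiB\n", matmulTrafficBytes(n, 4) >> 20);
  std::printf("f16 traffic: %zu MiB\n", matmulTrafficBytes(n, 2) >> 20);
  // Half-precision inputs/outputs move half the bytes, which is the win on a
  // bandwidth-bound GPU even when the arithmetic itself isn't faster.
  return 0;
}
```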

#40 (comment)

The results on an M2 Pro are as follows:

| Matmul version | GFLOPS |
| --- | --- |
| 2D Matmul with float32 (version = 8) | 1672.18 |
| Transposed 2D Matrix with float32 (version = 9) | 1702.70 |
| 2D Matmul with float16 (version = 10) | 3174.96 |
| Transposed 2D Matmul with float16 (version = 11) | 2889.82 |

gpu.h Outdated
*/
template<class precision>
Tensor createTensor(Context &ctx, const Shape &shape, precision *data) {
throw std::runtime_error("Unsupported number type for reateTensor(Context &ctx, const Shape &shape, precision *data) ");
Collaborator Author:

typo

};

inline half operator + (const half& a, const half& b) {
Contributor:

Although the expectation is that host code is ahead-of-time and not performance critical, if we're going to include these at all and it's not too complex to implement something reasonably performant, we should consider that. We're already citing Mike Acton's implementation for the conversions, and there are at least some reference implementations here: https://github.com/macton/ps3-archive/blob/master/src/half-precision/Half.v2/int_insn.h

Alternatively we could leave this out of scope unless we have a strong case for this being valuable. It could be a reasonable expectation for users to bring their own computations on the host side - they could use f32 or their own choice of library or implementation and limit gpu.cpp to casting / data movement.

Maybe there's a rationale that's not obvious, but it seems like having operator implementations that are implemented via casts internally could be worse than either option? If f32 conversions are happening within operations, wouldn't it be better to use f32 and cast at the end?

Collaborator Author:

randn needs these operators to generate values of half type, but I can instead run randn in float32 and cast the values to half afterwards.
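
Something like this sketch, where all the random generation stays in f32 and only the final cast touches half (the float-to-half conversion call here is illustrative, standing in for whatever conversion gpu.h exposes):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Sketch only: `half` and the float -> half conversion are assumed to come
// from gpu.h (the Mike Acton-based conversions); the constructor syntax is
// illustrative.
std::vector<half> randnHalf(std::size_t n, std::mt19937 &gen,
                            float mean = 0.0f, float stddev = 1.0f) {
  std::normal_distribution<float> dist(mean, stddev);
  std::vector<float> tmp(n);
  for (auto &v : tmp) {
    v = dist(gen); // all arithmetic stays in f32, no half operators needed
  }
  std::vector<half> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = half(tmp[i]); // single cast at the end
  }
  return out;
}
```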

@@ -285,7 +306,8 @@ inline void flip(float *a, size_t R, size_t C, bool horizontal = true) {
* @param tol The tolerance for closeness.
* @return bool True if the arrays are close, false otherwise.
*/
bool isclose(float *a, float *b, size_t n, float tol = 1e-3) {
template<class precision=float>
Contributor:

I think isclose can be kept as two concrete overloads (one for half, one for f32) - that keeps this code easier to quickly scan without getting tripped up thinking about which template overload is going to get triggered. If in the future we have more precision types and this trivial overloading starts to get too repetitive, we can re-evaluate then.

But optimizing for code that's easy to scan and reason about is valuable enough that simple overloads should be the default, unless there's something I'm missing.
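
For illustration, the two concrete overloads could look roughly like this sketch (halfToFloat stands in for whatever conversion gpu.h exposes, and the tolerances are illustrative, not the PR's values):

```cpp
#include <cmath>
#include <cstddef>

// f32 overload, same shape as the existing signature.
bool isclose(float *a, float *b, std::size_t n, float tol = 1e-3) {
  for (std::size_t i = 0; i < n; ++i) {
    if (std::abs(a[i] - b[i]) > tol) {
      return false;
    }
  }
  return true;
}

// half overload: convert to f32 for the comparison (halfToFloat assumed from
// gpu.h). A looser default tolerance may suit the reduced precision.
bool isclose(half *a, half *b, std::size_t n, float tol = 1e-2) {
  for (std::size_t i = 0; i < n; ++i) {
    if (std::abs(halfToFloat(a[i]) - halfToFloat(b[i])) > tol) {
      return false;
    }
  }
  return true;
}
```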

void randn(float *a, size_t N, std::mt19937 &gen, float mean = 0.0,
float std = 1.0) {
std::normal_distribution<float> dist(mean, std);
template<class precision=float>
Contributor:

I think instead of committing to implementing variants of every function here (randn, transpose, etc.) for each precision type, let's start with a clean f32 implementation, with the expectation that most tasks can be handled by deferring the casting until after any AOT host operations are done.

randn and transpose are already written with the expectation that they're only used for testing or non-critical AOT data preparation. AFAICT, implementing support for half here doesn't add much - either in performance or in what's possible - over deferring the half cast until the CPU preparation/testing code has run in f32.

If I'm missing something let me know.

gpu.h Outdated
return createTensor(ctx, shape, kf16 ,data);
}

template<class precision>
Contributor:

see above comments on whether we can get rid of getPrecision entirely.

gpu.h Outdated
* @endcode
*/
template<class precision>
Tensor createTensor(Context &ctx, const Shape &shape, precision *data) {
Contributor:

As with the other comments, let's keep createTensor at the value level unless there's something for which we really need the templated variants.

kernel = createKernel(ctx, matmul, bindings,
/*nWorkgroups*/ nWorkgroups);
}
return kernel;
}

template<class precision=float>
Contributor:

After learning my lesson from the type-templated early iterations of gemma.cpp, my bias here is to implement the core implementations with precision types at the value level by default, and to only reach for templates when there's a strong case for them.

Templated overloads are okay as more typesafe convenience API sugar - it's easy to wrap a value-level core implementation in a more strongly templated wrapper. It's a much bigger refactoring to go in the other direction and unwind a core implementation back to the value level.

My suggestion, unless there's some value I'm missing, would be:

  • Either have runTest take a precision argument or have a separate runTest (e.g. runtest16) for alternative precision values.
  • Avoid having getPrecision to pull values out of types; remove it entirely unless there's something we need it for.
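
As a sketch of that layering - a value-level core plus an optional templated wrapper, so no getPrecision helper is needed (the NumType enum name and the core signature are assumptions here, not the actual gpu.h API):

```cpp
#include <type_traits>
// Assumes gpu.h provides Tensor, Context, Shape, half, and kf32/kf16.

// Value-level core: the dtype is an ordinary runtime argument
// (declaration only, for the shape of the API).
Tensor createTensor(Context &ctx, const Shape &shape, NumType dtype,
                    void *data);

// Optional templated sugar: the type-to-dtype mapping lives only here.
template <class precision>
Tensor createTensor(Context &ctx, const Shape &shape, precision *data) {
  static_assert(std::is_same_v<precision, float> ||
                    std::is_same_v<precision, half>,
                "unsupported precision type for createTensor");
  if constexpr (std::is_same_v<precision, float>) {
    return createTensor(ctx, shape, kf32, data);
  } else {
    return createTensor(ctx, shape, kf16, data);
  }
}
```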

@austinvhuang (Contributor)

Thank you! This is a massive performance boost. Most of my comments are really the same idea (favor value-level, especially for the core implementations, unless there's a strong rationale for templates), which I don't think should be that big a change, but let me know if there's a rationale I'm missing.

Minor question - are the M2 numbers for Pallas and the PyTorch numbers @ghostplant mentions in #40 (comment) for 2-byte or 4-byte floating point representations?

Look forward to trying this out on my M1 when it lands, nice work!

@austinvhuang (Contributor) commented Aug 6, 2024

BTW, I've been meaning to write some contributor notes, but the TLDR is something like:

  • Prefer value types over templates where possible, even more so for core implementations.
  • For compile-time polymorphism, prefer trivial function overloads over templates.
  • Prefer passing by reference into a function over methods. Mutating functions generalize more cleanly to operations that have side effects on more than one variable, whereas methods privilege the owning class and treat the single-variable case as special (see the sketch after this list).
  • Methods are usually reserved for the privileged constructor/destructor/operator cases.
  • Use createX factory functions for requesting GPU resources - createTensor, createKernel, createContext, etc. Use (as-trivial-as-possible) constructors for supporting specification types like Shape, KernelCode, etc.
  • Use pools as a single point of control to manage sets of resources. Consider incorporating a pool in Context if the resource is universal enough to the overall API.
  • Prefer stack allocation for ownership; use unique_ptr for ownership when the heap is needed. Use raw pointers for non-owning views. When possible: stack > make_unique > malloc. Avoid shared_ptr.
  • Use struct over class. Prefer transparency over encapsulation. Don't use abstract classes as interface specifications - the library is the interface.
  • In addition to performance, time-to-grok the codebase, compilation time, and the number of build failure modes are worth optimizing for.
  • Increase the implementation surface area only when there's a clear goal behind doing so.
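
A tiny hypothetical illustration of the pass-by-reference and struct-over-class points (not code from gpu.h):

```cpp
#include <cstddef>

// struct over class: fields are transparent, no accessors.
struct Accumulator {
  float sum = 0.0f;
  std::size_t count = 0;
};

// Mutating free function instead of a method: the mutated state is an
// explicit reference parameter rather than an implicit `this`.
void accumulate(Accumulator &acc, float value) {
  acc.sum += value;
  acc.count += 1;
}

// Generalizes cleanly when more than one variable has side effects, where a
// method would have to privilege one owner.
void merge(Accumulator &into, Accumulator &from) {
  into.sum += from.sum;
  into.count += from.count;
  from = Accumulator{};
}
```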

@junjihashimoto (Collaborator Author)

@austinvhuang Thank you for your review!

@austinvhuang (Contributor)

LGTM, thanks for the revisions!

@austinvhuang merged commit 228084f into AnswerDotAI:main on Aug 8, 2024
1 check passed
@junjihashimoto (Collaborator Author)

Thank you, too.
