
Add matmul with float16 #39

Merged 1 commit into AnswerDotAI:main on Aug 8, 2024

Conversation

@junjihashimoto (Collaborator) commented Aug 5, 2024

This PR implements matrix multiplication with float16.
The Mac's memory bandwidth is not sufficient to reach the GPU's peak floating-point throughput.
To improve performance, we need to reduce memory traffic.
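
For intuition on why halving the element size matters when the bottleneck is memory bandwidth, here is a back-of-the-envelope sketch of a lower bound on traffic for a square matmul (each of A, B, C crosses the memory bus at least once; real kernels move more, but the 2x ratio from the element size still holds):

```cpp
#include <cstddef>
#include <cstdio>

// Lower bound on bytes moved for C = A * B with square N x N matrices:
// read A, read B, write C -- N*N elements each.
static std::size_t matmulTrafficBytes(std::size_t n, std::size_t bytesPerElement) {
  return 3 * n * n * bytesPerElement;
}

int main() {
  const std::size_t n = 4096;
  std::printf("f32 traffic: %zu MiB\n", matmulTrafficBytes(n, 4) >> 20);
  std::printf("f16 traffic: %zu MiB\n", matmulTrafficBytes(n, 2) >> 20);
  // Half-precision inputs/outputs move half the bytes, which is the win on a
  // bandwidth-bound GPU even when the arithmetic itself isn't faster.
  return 0;
}
```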

#40 (comment)

The results on an M2 Pro are as follows:

| Matmul version | GFLOPS |
| --- | --- |
| 2D Matmul with float32 (version = 8) | 1672.18 |
| Transposed 2D Matrix with float32 (version = 9) | 1702.70 |
| 2D Matmul with float16 (version = 10) | 3174.96 |
| Transposed 2D Matmul with float16 (version = 11) | 2889.82 |

gpu.h Outdated
*/
template<class precision>
Tensor createTensor(Context &ctx, const Shape &shape, precision *data) {
throw std::runtime_error("Unsupported number type for reateTensor(Context &ctx, const Shape &shape, precision *data) ");
Collaborator Author:

typo

};

inline half operator + (const half& a, const half& b) {
Contributor:

Although the expectation is that host code is ahead-of-time and not performance critical, if we're going to include these at all and it's not too complex to implement something reasonably performant, we should consider that. We're already citing Mike Acton's implementation for the conversions, and there are at least some reference implementations here: https://github.com/macton/ps3-archive/blob/master/src/half-precision/Half.v2/int_insn.h

Alternatively we could leave this out of scope unless we have a strong case for this being valuable. It could be a reasonable expectation for users to bring their own computations on the host side - they could use f32 or their own choice of library or implementation and limit gpu.cpp to casting / data movement.

Maybe there's a rationale that's not obvious, but it seems like having operator implementations that are implemented via casts internally could be worse than either option? If f32 conversions are happening within operations, wouldn't it be better to use f32 and cast at the end?

Collaborator Author:

randn needs these operators to generate values of half type, but I can instead run randn in float32 and cast the values to half afterwards.
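
Something like this sketch, where all the random generation stays in f32 and only the final cast touches half (the float-to-half conversion call here is illustrative, standing in for whatever conversion gpu.h exposes):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Sketch only: `half` and the float -> half conversion are assumed to come
// from gpu.h (the Mike Acton-based conversions); the constructor syntax is
// illustrative.
std::vector<half> randnHalf(std::size_t n, std::mt19937 &gen,
                            float mean = 0.0f, float stddev = 1.0f) {
  std::normal_distribution<float> dist(mean, stddev);
  std::vector<float> tmp(n);
  for (auto &v : tmp) {
    v = dist(gen); // all arithmetic stays in f32, no half operators needed
  }
  std::vector<half> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = half(tmp[i]); // single cast at the end
  }
  return out;
}
```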

@@ -285,7 +306,8 @@ inline void flip(float *a, size_t R, size_t C, bool horizontal = true) {
* @param tol The tolerance for closeness.
* @return bool True if the arrays are close, false otherwise.
*/
bool isclose(float *a, float *b, size_t n, float tol = 1e-3) {
template<class precision=float>
Contributor:

I think isclose can be kept as two concrete overloads (one for half, one for f32) - that keeps this code easier to quickly scan without getting tripped up thinking about which template overload is going to get triggered. If in the future we have more precision types and this trivial overloading starts to get too repetitive, we can re-evaluate then.

But optimizing for code that's easy to scan and reason about is valuable enough that simple overloads should be the default, unless there's something I'm missing.
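
For illustration, the two concrete overloads could look roughly like this sketch (halfToFloat stands in for whatever conversion gpu.h exposes, and the tolerances are illustrative, not the PR's values):

```cpp
#include <cmath>
#include <cstddef>

// f32 overload, same shape as the existing signature.
bool isclose(float *a, float *b, std::size_t n, float tol = 1e-3) {
  for (std::size_t i = 0; i < n; ++i) {
    if (std::abs(a[i] - b[i]) > tol) {
      return false;
    }
  }
  return true;
}

// half overload: convert to f32 for the comparison (halfToFloat assumed from
// gpu.h). A looser default tolerance may suit the reduced precision.
bool isclose(half *a, half *b, std::size_t n, float tol = 1e-2) {
  for (std::size_t i = 0; i < n; ++i) {
    if (std::abs(halfToFloat(a[i]) - halfToFloat(b[i])) > tol) {
      return false;
    }
  }
  return true;
}
```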

void randn(float *a, size_t N, std::mt19937 &gen, float mean = 0.0,
float std = 1.0) {
std::normal_distribution<float> dist(mean, std);
template<class precision=float>
Contributor:

I think instead of committing to implementing variants of every function here (randn, transpose, etc.) for each precision type, let's start with a clean f32 implementation, with the expectation that most tasks can be handled by deferring the casting until after any AOT host operations are done.

randn and transpose are already written with the expectation that they're only used for testing or non-critical AOT data preparation. AFAICT, implementing support for half here doesn't add much - either in performance or in what's possible - over deferring the half cast until the CPU preparation/testing code has run in f32.

If I'm missing something let me know.

gpu.h Outdated
return createTensor(ctx, shape, kf16 ,data);
}

template<class precision>
Contributor:

see above comments on whether we can get rid of getPrecision entirely.

gpu.h Outdated
* @endcode
*/
template<class precision>
Tensor createTensor(Context &ctx, const Shape &shape, precision *data) {
Contributor:

As with the other comments, let's keep createTensor at the value level unless there's something for which we really need the templated variants.

kernel = createKernel(ctx, matmul, bindings,
/*nWorkgroups*/ nWorkgroups);
}
return kernel;
}

template<class precision=float>
Contributor:

After learning my lesson from the type-templated early iterations of gemma.cpp, my bias here is to implement the core implementations with precision types at the value level by default, and to only reach for templates when there's a strong case for them.

Templated overloads are okay as more typesafe convenience API sugar - it's easy to wrap a value-level core implementation in a more strongly templated wrapper. It's a much bigger refactoring to go in the other direction and unwind a core implementation back to the value level.

My suggestion, unless there's some value I'm missing, would be:

  • Either have runTest take a precision argument or have a separate runTest (e.g. runtest16) for alternative precision values.
  • Avoid having getPrecision to pull values out of types; remove it entirely unless there's something we need it for.
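
As a sketch of that layering - a value-level core plus an optional templated wrapper, so no getPrecision helper is needed (the NumType enum name and the core signature are assumptions here, not the actual gpu.h API):

```cpp
#include <type_traits>
// Assumes gpu.h provides Tensor, Context, Shape, half, and kf32/kf16.

// Value-level core: the dtype is an ordinary runtime argument
// (declaration only, for the shape of the API).
Tensor createTensor(Context &ctx, const Shape &shape, NumType dtype,
                    void *data);

// Optional templated sugar: the type-to-dtype mapping lives only here.
template <class precision>
Tensor createTensor(Context &ctx, const Shape &shape, precision *data) {
  static_assert(std::is_same_v<precision, float> ||
                    std::is_same_v<precision, half>,
                "unsupported precision type for createTensor");
  if constexpr (std::is_same_v<precision, float>) {
    return createTensor(ctx, shape, kf32, data);
  } else {
    return createTensor(ctx, shape, kf16, data);
  }
}
```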

@austinvhuang (Contributor)

Thank you! This is a massive performance boost. Most of my comments are really the same idea (favor value-level, especially for the core implementations, unless there's a strong rationale for templates), which I don't think should be that big a change, but let me know if there's a rationale I'm missing.

Minor question - are the M2 numbers for Pallas and the PyTorch numbers @ghostplant mentions in #40 (comment) for 2-byte or 4-byte floating point representations?

Look forward to trying this out on my M1 when it lands, nice work!

@austinvhuang (Contributor) commented Aug 6, 2024

BTW, I've been meaning to write some contributor notes, but the TLDR is something like:

  • Prefer value types over templates where possible, even more so for core implementations.
  • For compile-time polymorphism, prefer trivial function overloads over templates.
  • Prefer passing by reference into a function over methods. Mutating functions generalize more cleanly to operations that have side effects on more than one variable, whereas methods privilege the owning class and treat the single-variable case as special (see the sketch after this list).
  • Methods are usually reserved for the privileged constructor/destructor/operator cases.
  • Use createX factory functions for requesting GPU resources - createTensor, createKernel, createContext, etc. Use (as-trivial-as-possible) constructors for supporting specification types like Shape, KernelCode, etc.
  • Use pools as a single point of control to manage sets of resources. Consider incorporating a pool in Context if the resource is universal enough to the overall API.
  • Prefer stack allocation for ownership; use unique_ptr for ownership when the heap is needed. Use raw pointers for non-owning views. When possible: stack > make_unique > malloc. Avoid shared_ptr.
  • Use struct over class. Prefer transparency over encapsulation. Don't use abstract classes as interface specifications - the library is the interface.
  • In addition to performance, time-to-grok the codebase, compilation time, and the number of build failure modes are worth optimizing for.
  • Increase the implementation surface area only when there's a clear goal behind doing so.
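
A tiny hypothetical illustration of the pass-by-reference and struct-over-class points (not code from gpu.h):

```cpp
#include <cstddef>

// struct over class: fields are transparent, no accessors.
struct Accumulator {
  float sum = 0.0f;
  std::size_t count = 0;
};

// Mutating free function instead of a method: the mutated state is an
// explicit reference parameter rather than an implicit `this`.
void accumulate(Accumulator &acc, float value) {
  acc.sum += value;
  acc.count += 1;
}

// Generalizes cleanly when more than one variable has side effects, where a
// method would have to privilege one owner.
void merge(Accumulator &into, Accumulator &from) {
  into.sum += from.sum;
  into.count += from.count;
  from = Accumulator{};
}
```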

@junjihashimoto (Collaborator Author)

@austinvhuang Thank you for your review!

@austinvhuang (Contributor)

LGTM, thanks for the revisions!

@austinvhuang merged commit 228084f into AnswerDotAI:main on Aug 8, 2024
1 check passed
@junjihashimoto (Collaborator Author)

Thank you, too.
