
Implement kernel cache #68

Merged (13 commits, Oct 20, 2024)
Conversation

@junjihashimoto (Collaborator) commented Oct 16, 2024:

This PR implements shader caching to reduce the cost of createKernel.
In the case of matmul (#67), that cost is about 50 times that of the GPU operation itself.
To do the caching, it changes the Kernel data type to a shared pointer.

@junjihashimoto junjihashimoto marked this pull request as ready for review October 16, 2024 11:07
@austinvhuang austinvhuang self-assigned this Oct 18, 2024
@junjihashimoto junjihashimoto force-pushed the feature/cache branch 3 times, most recently from 7484178 to 7b896c0 on October 18, 2024 08:33
Tensor output = createTensor(ctx, Shape{b * t * c}, kf32);
printf("Created tensors\n");
// Generate the cache key from the arguments.
std::string key = "encoder_forward_" + std::to_string(B) + "_" + std::to_string(T) + "_" + std::to_string(C);
@austinvhuang (Contributor):
This might be okay as a start, but we probably don't want to keep all the implicit string allocations here from the to_string calls and concatenation.

As a start, we might use snprintf for the string construction; later we can consider some alternative structure to a string->Kernel map for the pool cache.
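A minimal sketch of the snprintf variant inside the op body (requires <cstdio>; the buffer size and key format are illustrative, and it assumes B, T, and C are integral):

```cpp
// Build the key in a stack buffer: one formatting pass, no
// temporary std::string per to_string call or concatenation.
char key[64];
std::snprintf(key, sizeof(key), "encoder_forward_%d_%d_%d",
              static_cast<int>(B), static_cast<int>(T),
              static_cast<int>(C));
// The pool lookup still builds one std::string from `key`,
// but the per-argument temporaries are gone.
```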

Tensor mean_t = createTensor(ctx, Shape{b * t}, kf32);
Tensor rstd_t = createTensor(ctx, Shape{b * t}, kf32);
// Generate the cache key from the arguments.
std::string key = "layernorm_forward_" + std::to_string(B) + "_" + std::to_string(T) + "_" + std::to_string(C);
@austinvhuang (Contributor):
Same as above re: strings (in general, we're trying not to use too much STL unless it's too onerous to find a lighter alternative).

std::string key = "encoder_forward_" + std::to_string(B) + "_" + std::to_string(T) + "_" + std::to_string(C);
Kernel op;
if (ctx.kernelPool.data.find(key) == ctx.kernelPool.data.end()) {
Tensor input = createTensor(ctx, Shape{b * t}, ki32);
@austinvhuang (Contributor) commented Oct 19, 2024:

This might be out of scope for this PR, but eventually there shouldn't be any createTensor operations in ops; they should all be passed in.

In the ideal state (we don't have to tackle this all in this PR; see the sketch below), an op should:

  • take in any resources needed for a dispatch and submit the dispatch;
  • not do any GPU allocation or perform any CPU/GPU data movement;
  • probably take the GPU resources themselves as inputs, instead of pointers to CPU resources.
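A rough sketch of that end state (the names and signature are hypothetical, and the dispatch is shown schematically rather than as the library's actual API):

```cpp
// Hypothetical op shape: the caller has already created every GPU
// resource and compiled the kernel; the op only binds and submits.
void encoderForward(Context &ctx, Kernel op,
                    Tensor &input, Tensor &wte, Tensor &wpe,
                    Tensor &output) {
  // No createTensor and no host<->device copies in here.
  // Schematically: bind {input, wte, wpe, output} to `op` and
  // submit the dispatch on ctx's queue.
}
```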

};

typedef std::shared_ptr<RawKernel> Kernel;
@austinvhuang (Contributor):
Could this be a unique_ptr, with operations that need a non-owning view taking a raw pointer?

shared_ptr often makes it unclear which component is responsible for ownership/lifetime. In this case, I think it should be clear that ownership and lifetime are handled by the KernelPool.
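A minimal sketch of that ownership model (names are illustrative, not the PR's actual code):

```cpp
#include <memory>
#include <string>
#include <unordered_map>

struct RawKernel { /* pipeline, bind group, workgroup sizes, ... */ };

// The pool is the single owner; kernel lifetime ends with the pool.
struct KernelPool {
  std::unordered_map<std::string, std::unique_ptr<RawKernel>> data;
};

// Callers receive a non-owning view, valid as long as the pool lives.
RawKernel *getOrCreate(KernelPool &pool, const std::string &key) {
  auto it = pool.data.find(key);
  if (it == pool.data.end()) {
    it = pool.data.emplace(key, std::make_unique<RawKernel>()).first;
  }
  return it->second.get();
}
```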

@austinvhuang (Contributor):
Also, what if we skip the typedef indirection and keep RawKernel as Kernel? Leaving the pointer type visible makes it clearer how it's used (e.g. -> vs .).

@junjihashimoto (Collaborator, Author) commented Oct 19, 2024:

> Also, what if we skip the typedef indirection and keep RawKernel as Kernel? Leaving the pointer type visible makes it clearer how it's used (e.g. -> vs .).

If you cache a RawKernel, it is owned by the KernelPool, so createKernel must return a reference to the Kernel, not the Kernel itself.

When we cache it, it is owned by the KernelPool; when we do not cache it, it is not. That would require two separate functions, one returning a reference and one returning a value, depending on whether we cache. For now, I used shared_ptr to avoid having to provide both.
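A sketch of the point, assuming the pool maps strings to shared pointers (the real createKernel signature differs): with Kernel = std::shared_ptr<RawKernel>, one signature serves both paths.

```cpp
// Hypothetical single entry point: with a shared pointer, the cached
// and uncached paths return the same type to the caller.
Kernel createKernelSketch(Context &ctx, const std::string &key,
                          bool cache) {
  if (cache) {
    auto it = ctx.kernelPool.data.find(key);
    if (it != ctx.kernelPool.data.end())
      return it->second; // pool and caller share ownership
  }
  Kernel k = std::make_shared<RawKernel>(/* compiled shader, ... */);
  if (cache)
    ctx.kernelPool.data[key] = k; // pool keeps a reference
  return k;                       // caller keeps one either way
}
```

With a raw RawKernel instead, the cached path would have to return RawKernel& (into the pool) and the uncached path RawKernel by value, which is exactly the two-function split described above.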

@junjihashimoto (Collaborator, Author):

Alternatively, we might want a separate function that the user can call to put the kernel into the cache after calling createKernel; then we wouldn't need two functions, one that returns a reference and one that returns the value.
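A sketch of that alternative (cacheKernel is hypothetical, not part of this PR):

```cpp
// createKernel stays cache-agnostic; the caller opts into pooling
// with an explicit second call.
void cacheKernel(KernelPool &pool, const std::string &key, Kernel k) {
  pool.data[key] = std::move(k); // pool shares ownership from here on
}

// Usage sketch:
//   Kernel op = createKernel(ctx, /* shader, bindings, ... */);
//   cacheKernel(ctx.kernelPool, "encoder_forward_...", op);
```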

@austinvhuang austinvhuang changed the base branch from main to dev October 19, 2024 12:26
@austinvhuang (Contributor) commented Oct 19, 2024:

Thanks a lot! Targeting dev so we can merge and follow up on broader refactoring in the dev branch. Have a look at the comments and we can go ahead and merge there.

I feel like a lot of things will become much clearer once we settle what the function signature for an op is. It should eventually look pretty different from the current state (GPU resources as inputs, no allocations or data movement), but I probably need to implement a few examples of this to get the ball rolling (or find gaps/flaws in my conceptualization).

@austinvhuang austinvhuang merged commit f4e1683 into AnswerDotAI:dev Oct 20, 2024
1 check passed
@junjihashimoto (Collaborator, Author):

Thank you for your review!
