
Implement kernel cache #68

Merged (13 commits, Oct 20, 2024)
Conversation

@junjihashimoto (Collaborator) commented Oct 16, 2024:

This PR implements shader caching to reduce the cost of createKernel.
In the case of matmul (#67), that cost is about 50 times that of the GPU operation itself.
To do the caching, it changes the Kernel data type to a shared pointer.

@junjihashimoto junjihashimoto marked this pull request as ready for review October 16, 2024 11:07
@austinvhuang austinvhuang self-assigned this Oct 18, 2024
@junjihashimoto junjihashimoto force-pushed the feature/cache branch 3 times, most recently from 7484178 to 7b896c0 on October 18, 2024 08:33
Tensor output = createTensor(ctx, Shape{b * t * c}, kf32);
printf("Created tensors\n");
// Generate the cache key from the arguments.
std::string key = "encoder_forward_" + std::to_string(B) + "_" + std::to_string(T) + "_" + std::to_string(C);
@austinvhuang (Contributor):
This might be okay as a start, but we probably don't want to keep all the implicit string allocations here from the to_string calls and concatenation.

As a start, we might use snprintf for the string construction; later we can consider some alternative structure to a string->Kernel map for the pool cache.
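A minimal sketch of the snprintf variant inside the op body (requires <cstdio>; the buffer size and key format are illustrative, and it assumes B, T, and C are integral):

```cpp
// Build the key in a stack buffer: one formatting pass, no
// temporary std::string per to_string call or concatenation.
char key[64];
std::snprintf(key, sizeof(key), "encoder_forward_%d_%d_%d",
              static_cast<int>(B), static_cast<int>(T),
              static_cast<int>(C));
// The pool lookup still builds one std::string from `key`,
// but the per-argument temporaries are gone.
```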

Tensor mean_t = createTensor(ctx, Shape{b * t}, kf32);
Tensor rstd_t = createTensor(ctx, Shape{b * t}, kf32);
// Generate the cache key from the arguments.
std::string key = "layernorm_forward_" + std::to_string(B) + "_" + std::to_string(T) + "_" + std::to_string(C);
@austinvhuang (Contributor):
Same as above re: strings (in general, we're trying not to use too much STL unless it's too onerous to find a lighter alternative).

std::string key = "encoder_forward_" + std::to_string(B) + "_" + std::to_string(T) + "_" + std::to_string(C);
Kernel op;
if (ctx.kernelPool.data.find(key) == ctx.kernelPool.data.end()) {
Tensor input = createTensor(ctx, Shape{b * t}, ki32);
@austinvhuang (Contributor) commented Oct 19, 2024:

This might be out of scope for this PR, but eventually there shouldn't be any createTensor operations in ops; they should all be passed in.

In the ideal state (we don't have to tackle this all in this PR; see the sketch below), an op should:

  • take in any resources needed for a dispatch and submit the dispatch;
  • not do any GPU allocation or perform any CPU/GPU data movement;
  • probably take the GPU resources themselves as inputs, instead of pointers to CPU resources.
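A rough sketch of that end state (the names and signature are hypothetical, and the dispatch is shown schematically rather than as the library's actual API):

```cpp
// Hypothetical op shape: the caller has already created every GPU
// resource and compiled the kernel; the op only binds and submits.
void encoderForward(Context &ctx, Kernel op,
                    Tensor &input, Tensor &wte, Tensor &wpe,
                    Tensor &output) {
  // No createTensor and no host<->device copies in here.
  // Schematically: bind {input, wte, wpe, output} to `op` and
  // submit the dispatch on ctx's queue.
}
```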

};

typedef std::shared_ptr<RawKernel> Kernel;
@austinvhuang (Contributor):
Could this be a unique_ptr, with operations that need a non-owning view taking a raw pointer?

shared_ptr often makes it unclear which component is responsible for ownership/lifetime. In this case, I think it should be clear that ownership and lifetime are handled by the KernelPool.
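A minimal sketch of that ownership model (names are illustrative, not the PR's actual code):

```cpp
#include <memory>
#include <string>
#include <unordered_map>

struct RawKernel { /* pipeline, bind group, workgroup sizes, ... */ };

// The pool is the single owner; kernel lifetime ends with the pool.
struct KernelPool {
  std::unordered_map<std::string, std::unique_ptr<RawKernel>> data;
};

// Callers receive a non-owning view, valid as long as the pool lives.
RawKernel *getOrCreate(KernelPool &pool, const std::string &key) {
  auto it = pool.data.find(key);
  if (it == pool.data.end()) {
    it = pool.data.emplace(key, std::make_unique<RawKernel>()).first;
  }
  return it->second.get();
}
```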

@austinvhuang (Contributor):
Also, what if we skip the typedef indirection and keep RawKernel as Kernel? Leaving the pointer type visible makes it clearer how it's used (e.g. -> vs .).

@junjihashimoto (Collaborator, Author) commented Oct 19, 2024:

> Also, what if we skip the typedef indirection and keep RawKernel as Kernel? Leaving the pointer type visible makes it clearer how it's used (e.g. -> vs .).

If you cache a RawKernel, it is owned by the KernelPool, so createKernel must return a reference to the Kernel, not the Kernel itself.

When we cache it, it is owned by the KernelPool; when we do not cache it, it is not. That would require two separate functions, one returning a reference and one returning a value, depending on whether we cache. For now, I used shared_ptr to avoid having to provide both.
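A sketch of the point, assuming the pool maps strings to shared pointers (the real createKernel signature differs): with Kernel = std::shared_ptr<RawKernel>, one signature serves both paths.

```cpp
// Hypothetical single entry point: with a shared pointer, the cached
// and uncached paths return the same type to the caller.
Kernel createKernelSketch(Context &ctx, const std::string &key,
                          bool cache) {
  if (cache) {
    auto it = ctx.kernelPool.data.find(key);
    if (it != ctx.kernelPool.data.end())
      return it->second; // pool and caller share ownership
  }
  Kernel k = std::make_shared<RawKernel>(/* compiled shader, ... */);
  if (cache)
    ctx.kernelPool.data[key] = k; // pool keeps a reference
  return k;                       // caller keeps one either way
}
```

With a raw RawKernel instead, the cached path would have to return RawKernel& (into the pool) and the uncached path RawKernel by value, which is exactly the two-function split described above.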

@junjihashimoto (Collaborator, Author):

Alternatively, we might want a separate function that the user can call to put the kernel into the cache after calling createKernel; then we wouldn't need two functions, one that returns a reference and one that returns the value.
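A sketch of that alternative (cacheKernel is hypothetical, not part of this PR):

```cpp
// createKernel stays cache-agnostic; the caller opts into pooling
// with an explicit second call.
void cacheKernel(KernelPool &pool, const std::string &key, Kernel k) {
  pool.data[key] = std::move(k); // pool shares ownership from here on
}

// Usage sketch:
//   Kernel op = createKernel(ctx, /* shader, bindings, ... */);
//   cacheKernel(ctx.kernelPool, "encoder_forward_...", op);
```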

@austinvhuang austinvhuang changed the base branch from main to dev October 19, 2024 12:26
@austinvhuang (Contributor) commented Oct 19, 2024:

Thanks a lot! Targeting dev so we can merge and follow up on broader refactoring in the dev branch. Have a look at the comments and we can go ahead and merge there.

I feel like a lot of things will become much clearer once we settle what the function signature for an op is. It should eventually look pretty different from the current state (GPU resources as inputs, no allocations or data movement), but I probably need to implement a few examples of this to get the ball rolling (or find gaps/flaws in my conceptualization).

@austinvhuang austinvhuang merged commit f4e1683 into AnswerDotAI:dev Oct 20, 2024
1 check passed
@junjihashimoto (Collaborator, Author):

Thank you for your review!
