Explore / discuss for potential ideas or improvements #52
Hello,
I have also created a post on the Vulkan subreddit; people there may be interested too.
These are very useful insights @DTolm, thank you very much for sharing your thoughts! And thanks for extending the discussion into the Vulkan subreddit; I didn't know that sub existed, but it looks very useful. Here are my thoughts on your points (numbered):

1. Users looking for a GPU optimization framework

I totally see what you mean, and I agree: I don't think it would make sense for this framework to target people that don't want to use the optimizations that Vulkan provides. The initial motivation for this framework came primarily from seeing quite a few people writing a lot of similar code to abstract specialized non-NVIDIA GPU hardware (such as mobile) for advanced data processing such as ML.

2. Catering for Vulkan developers

I totally agree with you; that is exactly why I wanted to drive forward with the BYOV (bring your own Vulkan) principle, where the framework should augment the capabilities of Vulkan developers through powerful abstractions, without limiting lower-level access to the Vulkan APIs. I would be very keen on exploring the best way to ensure Kompute doesn't get in the way, and provides a baseline for people to work from efficiently (increasing developer workflow efficiency).

3. Shader code

I could not agree more; this is one of the main motivations for the concepts of [...]. I see what you mean; I have been researching whether there are any tools that can be used to write shaders via C++, but there is potentially an opportunity to provide these types of abstractions at a higher level; that is, once users are able to build a large number of [...].

4. Library primitives

I totally agree, and I am very curious to dive into the VkFFT codebase, as that does sound quite interesting. This is something that is currently being explored in Kompute, by exposing the ability to "pre-record" [...].

5. Zero GPU on CPU dependency

This sounds really interesting. I'm not sure I fully understand though: what do you mean by asynchronous data saves? Is this specifically into host-visible memory, or do you mean asynchronous data saves from the GPU? If it refers to "recreating command buffers", I definitely know what you mean, and I would be keen to hear your thoughts on this; currently this is achieved through operations like [...].

Thank you very much for taking the time to share your thoughts @DTolm; these are very interesting points, and I would be keen to hear any further thoughts!
Hi Alejandro, congratulations on EthicalML's work with Kompute. It really does simplify the use of Vulkan. One suggestion that I think could help you take it to the next level is to try to implement it as a low-level backend to one of the main deep learning libraries (TensorFlow and PyTorch), similar to what Apple did with TensorFlow for macOS recently. This would enable a much larger share of ML-interested folks to harness the power of their GPUs while benefitting from the existing, highly developed ecosystems around these libraries. Another alternative route is to do the same, but with probabilistic programming packages such as PyMC3 and others, which could really benefit from GPU acceleration. Anyway, just some thoughts, along with my continued encouragement.
@dkgaraujo I hugely appreciate your suggestions, and I could not agree more! The initial motivations (https://github.com/EthicalML/vulkan-kompute#motivations) that led to the creation of this project were exactly those; it would be an absolutely fantastic milestone to explore integrating Kompute as the backend of one of the existing main deep learning libraries. If this is something that you have knowledge of, I would be keen to get some pointers on what would be the best library to start with. At this point PyTorch does seem to be growing in popularity, so it could be a good place to start. Do you have experience with the C++ backend of PyTorch by any chance? If not, I can open an issue for now and start documenting initial investigations there.
Many thanks for the positive feedback, @axsaucedo. PyTorch does indeed seem like a good place to start, although of course TensorFlow would also come with an ecosystem of functionalities. Now, while unfortunately my C++ skills are almost nil for practical purposes, looking at the source code of PyTorch, TensorFlow, and RStudio's implementation of PyTorch in R (mlverse/torch), my subjective impression is that perhaps the latter could be a good place to start, given that the source code appears to be more streamlined (again, my subjective impression, and probably correlated with the fact that R Torch is not a wrapper on PyTorch, but a new implementation altogether). Another possibility, if the team wants to test the waters before embarking on a more ambitious project, could be to implement Vulkan Kompute as the backend of a more streamlined neural network library; an example that recently crossed my path is iperov/litenn. It basically uses numpy together with an OpenCL backend, so it could be more amenable to a first try at using Kompute as a neural network backend, and possibly help scout out any design issues or bugs in the process, thus laying the groundwork for using it as a backend to the major libraries.
Another option (something I am planning to do) is to write a backend for Jax.
@alexander-g that sounds quite exciting, I would be very keen to get your thoughts on what may be required in order to achieve the integration as a backend for Jax, mainly because previous integrations like the Android JNI and Godot Module ones required further features to be in place. One of the things that is still outstanding is to extend feature completeness on the Vulkan side, such as enabling shader types beyond buffers (image2d, image3d, etc.), data types beyond floats (int, int32, uint, etc.), and further support for native operations (currently I have only implemented op_mult; others such as op_sum, op_log, etc. are still pending). Please do let me know if you run into any blockers, and I would certainly be interested in your findings as well. Separate to this, I will be doing a talk on Vulkan & Kompute during the upcoming FOSDEM 2021 (https://fosdem.org/2021/) in the HPC / Data Science track, and would be very keen to showcase some of these findings then if there is any progress. There's still quite a bit of time until then, so it would be great to explore further until then, and of course also after.

@dkgaraujo thank you for the pointers to the other implementations; I agree that other smaller libraries could be an interesting route as well. I will also have a look at this, and potentially take the initial use case with Jax that Alexander is looking at as a first starter to explore what features and requirements should be on the roadmap to enable these types of use cases. Speed/efficiency will also be key, so optimizations that ensure the best performance will be an important component, especially with the Python bindings.
Right now, one of Vulkan Kompute's really great advantages is that it's lightweight and easy to install, but as it gains more features and capabilities, it might eventually grow in size, with some features ending up included even though not everyone uses them. So, how about creating a section, possibly another repository (or repositories), that people can choose extensions from if they need them? That way, the core of Vulkan Kompute will remain lightweight and simple, while still having lots of features available.
@aliPMPAINT good point, I think we came across this issue when @alexander-g started exploring adding the GLSL shader compilation. Having said that, that is less of an actual extension and more like utility functions. This would however make a lot of sense for things like Operations. I would be very keen if people are interested in contributing operations (such as an FFT, or a parallel sum aggregate, #27). At this point I would still be happy to add these operations to the main repo, but I do think at some point it would make sense to have them in a different repository.
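To make the parallel sum aggregate idea (#27) concrete, here is a small CPU reference sketch of the tree-based reduction that such a compute shader would typically perform per workgroup. This is purely illustrative; the function name and structure are not part of the Kompute API, and it assumes a power-of-two workgroup size.

```python
# CPU reference of a GPU-style tree reduction: pad to the workgroup
# size, reduce each workgroup by halving the active range, then
# combine the per-group partial sums (as a second dispatch would).
# Illustrative only; not Kompute API. Assumes workgroup_size is a
# power of two.

def parallel_sum(data, workgroup_size=4):
    """Sum `data` the way a per-workgroup GPU reduction would."""
    values = list(data) + [0.0] * (-len(data) % workgroup_size)
    partial_sums = []
    for base in range(0, len(values), workgroup_size):
        group = values[base:base + workgroup_size]
        stride = workgroup_size // 2
        while stride > 0:
            for i in range(stride):  # one "thread" per index
                group[i] += group[i + stride]
            stride //= 2
        partial_sums.append(group[0])  # lane 0 holds the group result
    # A second dispatch (or a host pass) combines per-group results.
    return sum(partial_sums)

print(parallel_sum([1.0, 2.0, 3.0, 4.0, 5.0]))  # 15.0
```

The same halving pattern maps directly onto `gl_LocalInvocationID` indexing and shared memory in a GLSL compute shader.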
One thing that I would be interested to ask everyone is for feedback / thoughts on the talks for the FOSDEM conference I mentioned earlier. I have finished recording both talks around Kompute, and would be very keen to get thoughts / ideas on these videos or potentially other material (as well as ways to share it with the community). One thing to mention is that in both videos the intro / motivations / Vulkan overview is almost the same, so I'll add a timestamp where you'll be able to skip to the Kompute content.

Talk 1
Track: FOSDEM Python
Talk Title: "Beyond CUDA: GPU Accelerated Python on Cross-Vendor Graphics Cards with Vulkan & Kompute"

Talk 2
Track: FOSDEM HPC Track
Talk Title: "GPU computing using Vulkan & Kompute for Cross-vendor Graphic Cards (AMD, Qualcomm, NVIDIA & friends)"
Video (skip to 13:33): https://www.youtube.com/watch?v=Xz4fiQNmGSA

By the way, the FOSDEM conference is free to attend, so I certainly recommend checking out other tracks / talks if there is interest!
Just watched the video, and I actually think it's thoughtful and comprehensive.
@aliPMPAINT that's great positive feedback, thank you for sharing the key reasons why you found interest in the Kompute framework. These are also the principles that drive our motivation to continue furthering the features and functionality of this framework, namely 1) integrating the Kompute framework into a popular ML / scientific toolkit to enable cross-vendor (and mobile) ML, and 2) contributing to the ongoing discussion around the Vulkan SDK and the topic of open source, cross-vendor general purpose GPU computing. Looking forward to continuing to work with this great community to expand and further these and the rest of the core principles of Kompute!
An update from my side: I've gone public with my vkJAX project, a JAX interpreter for Vulkan.
This is absolutely awesome @alexander-g! I will have a proper look today and share further thoughts, but this is amazing, especially the ResNet50 example, that looks EPIC! Regarding the point on speed, that makes absolute sense; I think there are several optimizations that can be explored in the library as well as in Vulkan Kompute to ensure we achieve the best performance possible. Really keen to dive further into this. I have also identified some interesting areas of optimization for how the Tensors are used, which may be useful to explore further.
Following up on this thread, I want to request further thoughts on the road towards 1.0. We have now been able to extend the library to broader Vulkan capabilities, making this discussion more tangible. I have added an issue to capture the current discussions, as well as a project where the issues will be tracked. It would be great to hear people's thoughts / ideas:
Hello, |
Hi,

Right now the only synchronization options (that I can see) are running [...].

For example, suppose I have algorithm A using tensors a, algorithm B using tensors b, and algorithm C using tensors a, b, c. A and B are independent, but C is dependent on the results of A and B. We only need the result from C, not the intermediate results from A and B. This is how I wish the code would look in Python:

```python
timeline_a = kp.TimelineSemaphore()
timeline_b = kp.TimelineSemaphore()
timeline_c = kp.TimelineSemaphore()

sequence \
    .record(kp.OpTensorSyncDevice(params_a)) \
    .eval_async(timeline_a(wait=0, signal=1)) \  # copy params_a to device asap
    .record(kp.OpAlgoDispatch(algo_a)) \
    .eval_async(timeline_a(wait=1, signal=2)) \  # run algo_a after params_a is copied to device
    .record(kp.OpTensorSyncDevice(params_b)) \
    .eval_async(timeline_b(wait=0, signal=1)) \  # copy params_b to device asap
    .record(kp.OpAlgoDispatch(algo_b)) \
    .eval_async(timeline_b(wait=1, signal=2)) \  # run algo_b after params_b is copied to device
    .record(kp.OpTensorSyncDevice(params_c)) \
    .eval_async(timeline_c(wait=0, signal=1)) \  # copy params_c to device asap
    .record(kp.OpAlgoDispatch(algo_c)) \
    .eval_async(
        timeline_a(wait=2, signal=4),
        timeline_b(wait=2, signal=4),
        timeline_c(wait=1, signal=2)) \  # run algo_c after algo_a and algo_b finish, and params_c is copied
    .record(kp.OpTensorSyncLocal(params_c)) \
    .eval_async(timeline_c(wait=2, signal=3)) \  # copy params_c to host after algo_c is done
    .eval_await(timeline_c(wait=3, signal=4))    # wait for params_c to be copied to host

# now we can use the result from C on host
print([param.data() for param in params_c])
```

There is a (partial) workaround by creating multiple threads and [...].
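The thread-based workaround mentioned above can be sketched with Python's standard `concurrent.futures`. The `run_sequence_*` functions below are hypothetical stand-ins for blocking `sequence.eval()` calls on independent Kompute sequences, not real Kompute API:

```python
# Sketch of the workaround: run independent sequences (A and B) on
# separate threads, then run C once both have completed. The
# run_sequence_* functions are placeholders for blocking eval() calls.
from concurrent.futures import ThreadPoolExecutor

def run_sequence_a():
    return "a"  # stand-in for: sync params_a + dispatch algo_a

def run_sequence_b():
    return "b"  # stand-in for: sync params_b + dispatch algo_b

def run_sequence_c(result_a, result_b):
    return result_a + result_b + "c"  # stand-in for algo_c + sync back

with ThreadPoolExecutor(max_workers=2) as pool:
    future_a = pool.submit(run_sequence_a)
    future_b = pool.submit(run_sequence_b)
    # .result() blocks until each sequence has finished, giving the
    # (A and B) -> C dependency without GPU-side semaphores.
    result_c = run_sequence_c(future_a.result(), future_b.result())

print(result_c)  # abc
```

This gets the ordering right, but the synchronization happens on the host rather than on the GPU timeline, which is what the semaphore proposal above would avoid.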
I think that's a good idea. Currently we're exploring creating a library of "kernels" as operations that can be reused, but the idea would be that higher-level SDKs can be developed on top of Kompute, or use it as a backend to provide more advanced, use-case-specific interfaces.
@ChenKuo I think that's a really good idea actually. I am thinking there may be a way to provide a higher-level abstraction, but that would be a good principle to build it on. To be more specific, we already support memory barriers, which enable control on the GPU itself, and as you pointed out, we currently support fences for host synchronisation, namely through eval_async / eval_await. In this case I think adding the semaphore functionality would make complete sense; I will open an issue to continue the discussion there.
@ChenKuo I have just opened #238 to continue the discussion; it would be great if you could provide further thoughts there, and also some insight into whether the current OpMemoryBarrier could help you address the current work without the need for timeline semaphores. You can see an example of this here: https://github.com/KomputeProject/kompute/blob/master/test/TestMultipleAlgoExecutions.cpp#L99-L115

```cpp
std::shared_ptr<kp::OpMemoryBarrier> shaderBarrier{
    new kp::OpMemoryBarrier({ tensorA },
                            vk::AccessFlagBits::eTransferRead,
                            vk::AccessFlagBits::eShaderWrite,
                            vk::PipelineStageFlagBits::eComputeShader,
                            vk::PipelineStageFlagBits::eComputeShader)
};

mgr.sequence()
  ->record<kp::OpTensorSyncDevice>({ tensorA })
  ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
  ->record(shaderBarrier)
  ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
  ->record(shaderBarrier)
  ->record<kp::OpAlgoDispatch>(mgr.algorithm({ tensorA }, spirv))
  ->record<kp::OpTensorSyncLocal>({ tensorA })
  ->eval();
```
@axsaucedo Thanks for your response. I see how I can use OpMemoryBarrier to implement dependencies. This way it can also submit everything in one batch, so it should be more efficient than coarse-grained synchronization using semaphores. The way I think semaphores would still be useful is that they let us synchronize across different queues, so we can use the result of one queue in another queue. For example, we could run algo_a in queue1 and algo_b in queue2, then use the results from algo_a and algo_b to run algo_c in queue3.
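To make the wait/signal semantics in the proposal above concrete, here is a minimal host-side model of a Vulkan timeline semaphore counter. Real cross-queue synchronization would use a `VkSemaphore` of type `VK_SEMAPHORE_TYPE_TIMELINE`; this pure-Python class only mimics the monotonic counter behaviour, with threads standing in for queues:

```python
# Host-side model of a timeline semaphore: wait blocks until the
# counter reaches a value; signal raises the counter monotonically.
# Threads stand in for GPU queues. Illustrative only, not Kompute API.
import threading

class TimelineSemaphore:
    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def wait(self, value):
        """Block until the counter reaches `value` (cf. vkWaitSemaphores)."""
        with self._cond:
            self._cond.wait_for(lambda: self._value >= value)

    def signal(self, value):
        """Raise the counter to at least `value` and wake waiters."""
        with self._cond:
            self._value = max(self._value, value)
            self._cond.notify_all()

# Three "queues": A and B run independently, C waits on both.
sem_a, sem_b = TimelineSemaphore(), TimelineSemaphore()
results = []

def queue_a():
    results.append("a")   # stand-in for algo_a on queue1
    sem_a.signal(2)

def queue_b():
    results.append("b")   # stand-in for algo_b on queue2
    sem_b.signal(2)

def queue_c():
    sem_a.wait(2)         # algo_c on queue3 starts only after A and B
    sem_b.wait(2)
    results.append("c")

threads = [threading.Thread(target=f) for f in (queue_c, queue_a, queue_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results[-1])  # c
```

Whatever order A and B finish in, C always observes both results first, which is exactly the cross-queue guarantee a GPU-side timeline semaphore would provide without any host round trip.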
In my opinion, trying to create a library of premade "kernels" will not be useful for developing "higher level SDKs" or "use-case specific interfaces", or at least not beyond the prototyping phase, for the following reasons:
Based on (*), I think the direction you should go for is to make writing a custom shader the primary method for creating an [...].
I also think this is beyond the scope of this project. The responsibility of writing shaders and making sure they work should rest with the users. However, we can add utilities to help users write reusable and composable shader code, while still giving them full control over it. Some ideas I have in mind are shader factory methods, shader templates, shader function imports, basic validation, and an integration helper for [...].
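As an illustrative sketch of the "shader template" idea, a utility could render complete GLSL compute shaders for different scalar types and user-supplied expressions from a single template. Everything below (the template, `make_shader`, the default expression) is hypothetical, not existing Kompute API:

```python
# Hypothetical shader-template helper: one GLSL template, rendered
# for a caller-chosen scalar type and per-element expression. The
# user keeps full control of the shader body via `expression`.
SHADER_TEMPLATE = """#version 450
layout (local_size_x = 1) in;
layout (set = 0, binding = 0) buffer BufIn  {{ {dtype} data_in[];  }};
layout (set = 0, binding = 1) buffer BufOut {{ {dtype} data_out[]; }};
void main() {{
    uint i = gl_GlobalInvocationID.x;
    data_out[i] = {expression};
}}
"""

def make_shader(dtype="float", expression="data_in[i] * data_in[i]"):
    """Render a complete GLSL compute shader for the given scalar
    type and per-element expression."""
    return SHADER_TEMPLATE.format(dtype=dtype, expression=expression)

square_f32 = make_shader("float")
square_i32 = make_shader("int")
print("int data_in[]" in square_i32)  # True
```

The rendered string would then go through the usual GLSL-to-SPIR-V compilation step before being handed to an algorithm; basic validation could be as simple as compiling each rendered variant at build time.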
If we think this direction is worth exploring, I can make a more detailed proposal later on.
I am still learning C++, and haven't learned Vulkan, GLSL or GPGPU yet, so I guess I was looking at it as a full SDK. To explain myself: I was thinking of doing something like cuBLAS or cuSOLVER or their ROCm equivalents, but I am not sure how they work or how to use them in GPGPU. I think I can see why your idea is best.
Thank you for sharing your thoughts @ChenKuo, I do agree in large part with your sentiment, and I do feel like there should be a core focus on making Kompute serve as a flexible backend for higher-level frameworks. I do feel like there would still be value in providing two things:
@axsaucedo I do not understand the use case of OperationAlgoFactory very well. I think some code examples (tentative is fine) would help us understand how it is going to be used. Does it generate shaders dynamically for different types? Or do we pre-generate all possible variations of each shader and load them on demand? From your code in the method [...]
Open issue to openly discuss potential ideas or improvements, whether on documentation, interfaces, examples, bug fixes, etc.