
wasm-SPIRV #9491

Open
SkillfulElectro opened this issue Oct 21, 2024 · 8 comments

@SkillfulElectro

Feature

SPIR-V compilation target

Benefit

Adding a SPIR-V compilation target for Wasm would make it the best way to write code once and use it on both the CPU and the GPU, so it would be a powerful option.

Implementation

I think we should convert the code to naga IR, and then use wgpu to run the resulting SPIR-V. I also think an abstraction over memory allocation, copies, and other CPU<->GPU transfers could improve development time.
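
To make the proposal concrete, here is a minimal sketch of the naga half of such a pipeline, assuming some yet-to-be-written frontend has already turned the Wasm into WGSL. It needs naga's `wgsl-in` and `spv-out` features, and exact signatures vary slightly across naga versions:

```rust
use naga::back::spv;
use naga::valid::{Capabilities, ValidationFlags, Validator};

// Hypothetical pipeline tail: once a frontend has translated a Wasm function
// into WGSL (or directly into naga IR), naga validates the module and emits
// SPIR-V words that wgpu can execute.
fn wgsl_to_spirv(source: &str) -> Result<Vec<u32>, Box<dyn std::error::Error>> {
    let module = naga::front::wgsl::parse_str(source)?;
    let info = Validator::new(ValidationFlags::all(), Capabilities::empty())
        .validate(&module)?;
    let words = spv::write_vec(&module, &info, &spv::Options::default(), None)?;
    Ok(words)
}
```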

Alternatives

Compiling directly to SPIR-V.

@bjorn3
Contributor

bjorn3 commented Oct 21, 2024

Wasm and SPIR-V have fundamentally different memory models. Wasm models memory as a single array of bytes, while SPIR-V models it as a bunch of typed objects. Some of these may be arrays into which you can index, but it fundamentally doesn't support arbitrary pointers like wasm does. https://github.com/EmbarkStudios/spirt can lift some uses of untyped memory (rust, wasm, ...) into typed memory (spir-v), but can't lift all of them. Also, to make any effective use of a GPU you need to support work-group local memory and more, which Wasm doesn't support.
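
As a concrete illustration (hypothetical code, just for this discussion): the following is a single load from linear memory on Wasm, but it has no general SPIR-V equivalent, because SPIR-V storage is typed and pointers can't be reinterpreted freely.

```rust
// Reading an f32 from an arbitrary byte offset in a byte array: trivial on
// Wasm's single linear memory, but not expressible in general in SPIR-V,
// where every object has a fixed type and pointers cannot be freely cast.
fn read_f32_at(heap: &[u8], byte_offset: usize) -> f32 {
    let bytes: [u8; 4] = heap[byte_offset..byte_offset + 4]
        .try_into()
        .expect("offset in bounds");
    f32::from_le_bytes(bytes)
}
```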

@SkillfulElectro
Author

Wasm and SPIR-V have fundamentally different memory models. Wasm models memory as a single array of bytes, while SPIR-V models it as a bunch of typed objects. Some of these may be arrays into which you can index, but it fundamentally doesn't support arbitrary pointers like wasm does. https://github.com/EmbarkStudios/spirt can lift some uses of untyped memory (rust, wasm, ...) into typed memory (spir-v), but can't lift all of them. Also, to make any effective use of a GPU you need to support work-group local memory and more, which Wasm doesn't support.

Everything you say is right, but it's still possible.
For the last point, for example, we could add an option for the user to set the work-group sizes, defaulting to 1,1,1 if unset. As for the first point, the goal is that the job gets done, not how it gets done.

@cfallin
Member

cfallin commented Oct 21, 2024

@SkillfulElectro thanks for the issue!

I was involved in some discussions around this in 2020 or so -- and the conclusion then was essentially the same as @bjorn3's points now, that the target is quite different and this would not be an easy adaptation. The use-case in question ended up finding a different way to program GPUs portably.

That discussion was purely about a Cranelift port, but the Wasm runtime as well is an even bigger question mark: what would it mean for Wasmtime to run on a GPU where there is no operating system, (sometimes) no virtual memory, etc.? Or does the Wasm VM get split between GPU and CPU, with (expensive) calls between them?

And then how does one actually take advantage of the parallelism? Do we need a new "vectorized Wasm call" API in Wasmtime? (Keep in mind that a single thread of a GPU has lower performance than a single thread on a CPU; GPUs only make sense when leveraging the SIMT model. And SIMT != SIMD, i.e., the programming model is not the same as what Wasm has exposed for data parallelism.)

What do we do about branch divergence? Do we have estimates or modeling that show this would be reasonably low overhead for typical Wasms?

For all these reasons I'm pretty skeptical. That doesn't mean we should shut down discussion now, at all. What it does mean is that probably there should be a more detailed writeup: what is the use-case, how would all of these high-level design questions be resolved, etc. This should probably take the form of an RFC discussion eventually, but before that, it would help if you could write a bit more about motivation and these other questions here.

@SkillfulElectro
Author

@SkillfulElectro thanks for the issue!

I was involved in some discussions around this in 2020 or so -- and the conclusion then was essentially the same as @bjorn3's points now, that the target is quite different and this would not be an easy adaptation. The use-case in question ended up finding a different way to program GPUs portably.

That discussion was purely about a Cranelift port, but the Wasm runtime as well is an even bigger question mark: what would it mean for Wasmtime to run on a GPU where there is no operating system, (sometimes) no virtual memory, etc.? Or does the Wasm VM get split between GPU and CPU, with (expensive) calls between them?

And then how does one actually take advantage of the parallelism? Do we need a new "vectorized Wasm call" API in Wasmtime? (Keep in mind that a single thread of a GPU has lower performance than a single thread on a CPU; GPUs only make sense when leveraging the SIMT model. And SIMT != SIMD, i.e., the programming model is not the same as what Wasm has exposed for data parallelism.)

What do we do about branch divergence? Do we have estimates or modeling that show this would be reasonably low overhead for typical Wasms?

For all these reasons I'm pretty skeptical. That doesn't mean we should shut down discussion now, at all. What it does mean is that probably there should be a more detailed writeup: what is the use-case, how would all of these high-level design questions be resolved, etc. This should probably take the form of an RFC discussion eventually, but before that, it would help if you could write a bit more about motivation and these other questions here.

Well, all of his points are correct, but look: all languages compile to Wasm, so if we can run Wasm on the GPU, we can run all of our ordinary code there without changes. Also, what about compiling to WGSL, if managing things the SPIR-V way is hard? We just need to add functionality for the user to specify the number of blocks and the workgroup size. Compilation only needs to be done once, which is a cheap price to pay to make simple code run on the GPU, and we can reuse that module again without recompiling. For this we can use wgpu and naga.
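
As a purely hypothetical sketch of the shape I have in mind (none of these types exist anywhere; the Wasm-to-WGSL translation itself is the unsolved part):

```rust
// Purely hypothetical API sketch: launch geometry is set by the user,
// translation to WGSL happens once, and the module is reused across runs.
// Neither type exists; `compile_to_wgsl` stands in for the unsolved part.
struct LaunchConfig {
    blocks: [u32; 3],          // number of workgroups per dimension
    workgroup_size: [u32; 3],  // threads per workgroup
}

struct GpuModule {
    wgsl: String, // produced once, cached, reused without recompiling
}

impl GpuModule {
    fn compile_to_wgsl(wasm_bytes: &[u8]) -> Self {
        // The hard part: translating Wasm bytecode to WGSL.
        let _ = wasm_bytes;
        unimplemented!()
    }

    fn dispatch(&self, cfg: &LaunchConfig, buffers: &mut [Vec<f32>]) {
        // Build a wgpu compute pipeline from self.wgsl, upload `buffers`,
        // dispatch `cfg.blocks` workgroups, and read the results back.
        let _ = (cfg, buffers);
        unimplemented!()
    }
}
```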

@cfallin
Member

cfallin commented Oct 22, 2024

so if we can run Wasm on the GPU, we can run all of our ordinary code there without changes

Yes, I don't think anyone doubts that having this target would be very useful. The difficult design questions are really the heart of the problem though -- the question is how to map Wasm to the GPU programming abstraction in a way that makes sense and yields speedup. I'd invite you to give your thoughts on any of the questions I wrote out above!

(I'll actually say a little more directly: the way open-source works is that interested parties come in with time and energy and drive interesting new directions or additions to projects. Leaving a comment asking for a very general high-level goal, and then arguing why you want it without driving the engineering, isn't likely to lead anywhere. What I'm trying to steer you toward is driving the design exploration here yourself, in a way that could break the problem down into actionable pieces.)

@SkillfulElectro
Author

@cfallin okay, so I think: first of all, why would we need to use a GPU? Parallel computing. So some kinds of Wasm modules can't be compiled to GPU kernel functions, namely those that use WASI or other features unrelated to computation. Second, we create a struct that stores the number of blocks in each dimension and the number of workgroup threads in each block. Third, the Wasm function must take an index of type int as its first parameter and an array of WGSL-supported data types as its second parameter. With these simple rules, most code that compiles to Wasm can run on the GPU. Now we compile the Wasm bytecode to WGSL and pass it to wgpu (I say wgpu because I am familiar with it). The index becomes the global invocation ID, the data-storage arrays or textures become the inputs, and we pass them to the GPU in wgpu buffers. Also, Wasm functions compiled to run on the GPU must not return anything; their results must be written into the input arrays.
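
To make the convention concrete, here is a hypothetical example of the WGSL that one translated function could lower to; the names and workgroup size are just illustrative:

```rust
// Hypothetical WGSL a translated Wasm function might lower to under the
// proposed convention: the index parameter becomes the global invocation id,
// and the array parameter becomes a storage buffer that results are written
// back into (no return value).
const KERNEL_WGSL: &str = r#"
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn translated_wasm_fn(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if (i < arrayLength(&data)) {
        data[i] = data[i] * 2.0; // results overwrite the input array
    }
}
"#;
```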

We may have multiple GPU devices, for example in a server. So we need to add a way to iterate over them by index so the user can choose the preferred device, something like https://github.com/SkillfulElectro/EMCompute/blob/main/src/gpu_device.rs .
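
wgpu already exposes adapter enumeration on native targets, so the device-selection part could be a thin wrapper over something like this (exact API details vary between wgpu versions):

```rust
// List available adapters so the user can pick one by index.
// Note: enumerate_adapters is only available on native targets, not on the web.
fn list_adapters() {
    let instance = wgpu::Instance::default();
    for (i, adapter) in instance
        .enumerate_adapters(wgpu::Backends::all())
        .into_iter()
        .enumerate()
    {
        let info = adapter.get_info();
        println!("{i}: {} ({:?})", info.name, info.backend);
    }
}
```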

@cfallin
Member

cfallin commented Oct 22, 2024

@SkillfulElectro thanks for your reply. I think there needs to be a deeper exploration of the engineering tradeoffs here. I'll go through your points and my questions above to try to help guide you a bit.

first of all, why would we need to use a GPU? Parallel computing

Sure, again, no one is doubting how useful this would be if it were built!

second, we create a struct that stores the number of blocks in each dimension and the number of workgroup threads in each block. Third, the Wasm function must take an index of type int as its first parameter and an array of WGSL-supported data types as its second parameter. With these simple rules, most code that compiles to Wasm can run on the GPU

This is a very high-level and vague description of a more detailed system design that I think you have in your head. A few followup questions that could help expand it:

  • What kind of computation is this intended for? Is there one Wasm instance overall, or are there many invocations of a single Wasm instance, and we are taking blocks of them to run as GPU warps? (I suspect the latter, but let's say it explicitly.)
  • You say "the Wasm function must take an index ... and an array of WGSL-supported data types ...": here you seem to be confusing the Wasm abstraction layer with a higher-level ABI of some sort. Wasm in general supports functions of any allowable signature. Are you describing a way to use this parallelized Wasm instance invocation for a certain problem type?
  • " also wasm functions which are compiled to run on GPU must not return anything" -- as above: a Wasm engine has to be able to support any Wasm module; we can't ship something that only works for a small subset of Wasm, as that wouldn't be Wasm anymore.
  • "now we compile the wasm bytecode to wgsl and pass it to wgpu" -- this single statement is encapsulating the hardest part, with many many open questions. All of the questions above apply, and we need to argue that we can support all the needed abstractions (Wasm heaps, tables, hostcalls, etc). Not to mention the questions around compiling to the target: register allocation, do we make use of scratchpad memory or not, etc.

At a higher level, I'll repeat the questions I wrote above; we need crisp answers to all of these I think:

  • Does the Wasmtime runtime itself run on the GPU or the CPU? If the CPU, how do we handle hostcalls? If the GPU, are we convinced that every abstraction Wasmtime needs is available on the GPU, with no operating system underneath it?

  • Do we have a "vectorized Wasm call" API in Wasmtime? Or some other way to build the parallel invocation (lazy batching or something)?

  • What do we do about branch divergence?

  • Given the answers to the above, do we have some early evidence, even napkin math of some sort, that this will yield feasible performance?
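
To make the second question concrete, here is a purely hypothetical sketch of what such an entry point might look like; nothing like this exists in Wasmtime today:

```rust
// Purely hypothetical: instead of one CPU call per Wasm invocation, the host
// queues N logical invocations of the same export and the engine maps them
// onto GPU lanes (or falls back to a CPU loop).
trait VectorizedCall {
    /// Run `export` for lane indices 0..n_lanes; each lane reads and writes
    /// `buffers` at lane-dependent offsets instead of returning a value.
    fn call_vectorized(
        &mut self,
        export: &str,
        n_lanes: u32,
        buffers: &mut [Vec<u8>],
    ) -> Result<(), Box<dyn std::error::Error>>;
}
```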

@SkillfulElectro
Author

@cfallin well, you are right, Wasmtime is just a runtime for Wasm.

What kind of computation is this intended for? Is there one Wasm instance overall, or are there many invocations of a single Wasm instance, and we are taking blocks of them to run as GPU warps? (I suspect the latter, but let's say it explicitly.)

  • Look, what I described is a way to escape these overheads with a few rules, because, as you know, nothing comes free in engineering. By compiling to WGSL and using wgpu we sidestep those overheads, so we need none of that machinery: we just use the Wasm code to produce equivalent WGSL code and then manage only wgpu instances, so we need a single instance, plus one device and queue per function, in wgpu-style code.

You say "the Wasm function must get index ... and an array of supported data types ...": here you seem to be confusing the Wasm abstraction layer with a higher-level ABI of some sort. Wasm in general supports functions of any allowable signature. Are you describing a way to use this parallelized Wasm instance invocation for a certain problem type?

  • Well, like you said, if we do not use the SIMT model, using GPU threads instead of CPU ones won't be that effective, because CPU threads are so much faster. So not every Wasm function is a good fit for the GPU side, only the ones capable of exploiting it. This restriction is a cost that helps us convert Wasm to WGSL or SPIR-V for GPU use.

" also wasm functions which are compiled to run on GPU must not return anything" -- as above: a Wasm engine has to be able to support any Wasm module; we can't ship something that only works for a small subset of Wasm, as that wouldn't be Wasm anymore.

  • Well, we could still run such modules on the CPU. I consider what I said a simple rule that makes it possible to apply the SIMT model to Wasm functions so they can be used on the GPU side.

"now we compile the wasm bytecode to wgsl and pass it to wgpu" -- this single statement is encapsulating the hardest part, with many many open questions. All of the questions above apply, and we need to argue that we can support all the needed abstractions (Wasm heaps, tables, hostcalls, etc). Not to mention the questions around compiling to the target: register allocation, do we make use of scratchpad memory or not, etc.

  • Most of these go away, because the only thing we need is the answer to our calculation, and wgpu takes care of that. By the way, not all functions are meant to run on the GPU.

Does the Wasmtime runtime itself run on the GPU or the CPU? If the CPU, how do we handle hostcalls? If the GPU, are we convinced that every abstraction Wasmtime needs is available on the GPU, with no operating system underneath it?

  • With what I said, Wasmtime would be the bridge between the CPU and wgpu, and wgpu would handle most of it for us.

Given the answers to the above, do we have some early evidence, even napkin math of some sort, that this will yield feasible performance?

  • It depends on the function and the type of calculation.

Do we have a "vectorized Wasm call" API in Wasmtime? Or some other way to build the parallel invocation (lazy batching or something)?

  • I don't fully understand this question, due to my lack of knowledge here.
