[webgl] Removal of GPGPU example #478

Open
dleeftink opened this issue Jul 9, 2024 · 7 comments

@dleeftink

Just wondering why this code example (or the GPGPU class in general) has been removed from the current build:

https://github.com/thi-ng/umbrella/blob/161b4f8afaef0df742a8e2c7776993b828662589/examples/webgl-gpgpu-basics/src/index.ts

This seems like a very useful abstraction (especially the jobrunner), but I can understand if it has since been superseded by a newer version. If that is the case, could you point me to a recent example that demonstrates GPGPU capabilities?

@postspectacular
Member

Fret not, @dleeftink! The feature is still there (and actually expanded in scope). Here are a few examples showing the new approach (fundamentally the same; only reading the results back from FBOs/textures is currently out of scope and must be handled separately, see below):

Demo: https://demo.thi.ng/umbrella/webgl-game-of-life/
Source: https://github.com/thi-ng/umbrella/blob/develop/examples/webgl-game-of-life/src/index.ts

Demo: https://demo.thi.ng/umbrella/webgl-texture-paint/
Source: https://github.com/thi-ng/umbrella/blob/develop/examples/webgl-texture-paint/src/index.ts

Demo: https://demo.thi.ng/umbrella/webgl-multipass/
Source: https://github.com/thi-ng/umbrella/blob/develop/examples/webgl-multipass/src/index.ts

Demo: https://demo.thi.ng/umbrella/webgl-float-fbo/
Source: https://github.com/thi-ng/umbrella/blob/develop/examples/webgl-float-fbo/src/index.ts

For an example of how to read back a texture, have a look here:

// assumed imports: readPixels & the texture enums come from @thi.ng/webgl;
// flipY is a small local helper which flips pixel rows in place
import { readPixels, TextureFormat, TextureType } from "@thi.ng/webgl";

// read texture from fbo
const copyCurrentFrame = () => {
    // get pixel buffer of 2D canvas
    const idata = ctx.getImageData(0, 0, width, height);
    // bind the WebGL frame buffer of the 1st shader pass
    app.fbos[0].bind();
    // read that shader pass' output texture (aka the "curr" texture)
    // (readPixels always reads from the currently bound frame buffer)
    readPixels(
        gl,
        0,
        0,
        width,
        height,
        TextureFormat.RGBA,
        TextureType.UNSIGNED_BYTE,
        // target pixel array is the image data of the 2d canvas
        idata.data
    );
    // unbind the frame buffer
    app.fbos[0].unbind();
    // WebGL textures are stored "upside down";
    // depending on intended use we might need to manually flip the texture
    flipY(idata.data, width, height);
    // copy to 2D canvas
    ctx.putImageData(idata, 0, 0);
};

Hth! :)

@dleeftink
Author

Thank you for pointing me to these, and the webgl layer in general!

Regarding IO via readPixels, I wonder whether using DataView + transform feedback would prove more performant for data IO? The use case I am envisioning is parallel reduction on the GPU, and of the many WebGL packages I've tested, not many (if any) provide parallel reduction out of the box. It would be cool to try and implement this using the @thi.ng/webgl package.

For instance, here's an older example using regl:

https://github.com/regl-project/regl/blob/gh-pages/example/reduction.js
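
To sketch the core idea behind that kind of GPU reduction (not the regl example verbatim, just a hypothetical illustration): each pass reads a 2x2 block of the previous pass' texture and writes its max, halving the texture size per pass until only a single pixel remains. A fragment shader for one such pass (as a GLSL ES 3.0 string; u_src is a placeholder uniform name) could look roughly like:

const REDUCE_PASS_FS = `#version 300 es
precision highp float;
uniform sampler2D u_src;
out vec4 o_max;
void main() {
    // each output pixel covers a 2x2 block of the (twice as large) source texture
    ivec2 p = ivec2(gl_FragCoord.xy) * 2;
    vec4 a = texelFetch(u_src, p, 0);
    vec4 b = texelFetch(u_src, p + ivec2(1, 0), 0);
    vec4 c = texelFetch(u_src, p + ivec2(0, 1), 0);
    vec4 d = texelFetch(u_src, p + ivec2(1, 1), 0);
    o_max = max(max(a, b), max(c, d));
}`;

Rendering log2(N) such passes into ever smaller FBOs leaves the overall max in a 1x1 texture.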

@postspectacular
Member

hi @dleeftink - this has tickled my interest and I've just uploaded a new example showing a version of this kind of reduction (using thi.ng/webgl & thi.ng/shader-ast):

Demo:
https://demo.thi.ng/umbrella/gpgpu-reduce/

Source code:
https://github.com/thi-ng/umbrella/blob/develop/examples/gpgpu-reduce/src/index.ts

Readme w/ benchmark results:
https://github.com/thi-ng/umbrella/tree/develop/examples/gpgpu-reduce

PS: Can you please explain your "DataView w/ transform feedback" approach? Not sure how it fits into the picture here... 😉

@dleeftink
Author

dleeftink commented Jul 11, 2024

Thank you for taking the time. I will have a bit of a play around to see how regl and @thi.ng/webgl compare in terms of setting up a parallel reduction pipeline. The performance profile of the example you provided seems promising enough!

Re: the DataView/transform feedback approach: see this cell on Observable, which is part of a notebook where I compare various GPGPU libraries. Beware that you may have to hit the 'profile' button after all libraries have run once to get a proper comparison (I will have to clean up the examples some more, isolate the GL contexts and add proper cooldowns and disposal between runs).

In any case, the approach in the linked cell uses the forked WebGP library to write a sizeable array to an array of N x N textures, which are processed using a transform feedback mechanism. Instead of readPixels, you can construct a TypedArray directly from the result buffer, after which the max is found on the CPU and pushed to an output array. Here, reduction is thus applied in CPU rather than GPU land, but afaik the transform feedback mechanism allows you to run multiple passes on the bound buffer before passing the data back to the CPU to reduce to a final result.

Both the 'subgpu' and 'supgpu' cells show that quite a fast CPU > GPU > CPU turnaround can be achieved using this approach. My thinking is to do as much work as possible on the GPU (e.g. map/reduce) before handing over a relatively small array to the CPU for final processing (e.g. deriving an array of centroids from a large N x N matrix).

The relevant lines in the WebGP source:

https://github.com/glennirwin/webgp/blob/d6139188401fdce7379d83c705b77a105b0dfbe8/src/webgp.js#L393-L407
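
For reference, here's a minimal sketch of that readback mechanism using the raw WebGL2 API (no libraries; the value-doubling vertex shader is just a stand-in for real per-element work, and error handling is omitted):

const canvas = document.createElement("canvas");
const gl = canvas.getContext("webgl2")!;

// vertex shader does the actual per-element work (here: just doubling each value)
const VS = `#version 300 es
in float a_in;
out float v_out;
void main() { v_out = a_in * 2.0; }`;

// fragment shader is unused (rasterizer is disabled below), but required for linking
const FS = `#version 300 es
precision highp float;
out vec4 frag;
void main() { frag = vec4(0.0); }`;

const compile = (type: number, src: string) => {
    const sh = gl.createShader(type)!;
    gl.shaderSource(sh, src);
    gl.compileShader(sh);
    return sh;
};

const prog = gl.createProgram()!;
gl.attachShader(prog, compile(gl.VERTEX_SHADER, VS));
gl.attachShader(prog, compile(gl.FRAGMENT_SHADER, FS));
// declare which varyings get captured via transform feedback (before linking!)
gl.transformFeedbackVaryings(prog, ["v_out"], gl.SEPARATE_ATTRIBS);
gl.linkProgram(prog);
gl.useProgram(prog);

const N = 4096;
const input = new Float32Array(N).map(() => Math.random());

// input attribute buffer
const inBuf = gl.createBuffer()!;
gl.bindBuffer(gl.ARRAY_BUFFER, inBuf);
gl.bufferData(gl.ARRAY_BUFFER, input, gl.STATIC_DRAW);
const loc = gl.getAttribLocation(prog, "a_in");
gl.enableVertexAttribArray(loc);
gl.vertexAttribPointer(loc, 1, gl.FLOAT, false, 0, 0);

// output buffer capturing the varying
const outBuf = gl.createBuffer()!;
gl.bindBuffer(gl.ARRAY_BUFFER, outBuf);
gl.bufferData(gl.ARRAY_BUFFER, input.byteLength, gl.STATIC_READ);
gl.bindBuffer(gl.ARRAY_BUFFER, null);

const tf = gl.createTransformFeedback()!;
gl.bindTransformFeedback(gl.TRANSFORM_FEEDBACK, tf);
gl.bindBufferBase(gl.TRANSFORM_FEEDBACK_BUFFER, 0, outBuf);

// run a single vertex-stage-only pass (no rasterization)
gl.enable(gl.RASTERIZER_DISCARD);
gl.beginTransformFeedback(gl.POINTS);
gl.drawArrays(gl.POINTS, 0, N);
gl.endTransformFeedback();
gl.disable(gl.RASTERIZER_DISCARD);
gl.bindTransformFeedback(gl.TRANSFORM_FEEDBACK, null);

// read the results straight into a typed array (no FBO / readPixels involved)
const result = new Float32Array(N);
gl.bindBuffer(gl.ARRAY_BUFFER, outBuf);
gl.getBufferSubData(gl.ARRAY_BUFFER, 0, result);

The point is the last few lines: the results land directly in a Float32Array (or could be wrapped in a DataView) via getBufferSubData(), with no FBO or readPixels involved.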

@postspectacular
Member

Thanks for this! I only just now realized that you were talking about the built-in WebGL2 transform feedback feature, whereas I previously assumed you meant some generic/custom mechanism 🤦 Alas, I have not yet had a need for this feature and so don't have any direct experience with it...

So I'm still trying to wrap my head around how this would work here & I know I'll have to do more reading about this (and read some of the code you linked to)... From the little understanding I have of createTransformFeedback(), my hunch is that you're proposing to perform all GPU processing in the vertex shader stage, rather than in the fragment shader as I've been doing so far? And then you'd use getBufferSubData() on a still-bound vertex buffer instead of readPixels() on an FBO? Hmmm, if that is what you're talking about, then the approach would involve a major restructuring (or really a full rewrite)... 🤔

FWIW for texture sizes up to 64x64 (aka up to 4096 result values), the process of binding and reading an FBO takes 0.04-0.09ms on my M1 (avg of 1000 iterations)... I'd say that's absolutely acceptable in relation to the main computation time...

@dleeftink
Author

dleeftink commented Jul 11, 2024

Yes, you summarised it better than I did. If major restructuring is required, then please ignore! I think the example is more than serviceable for demonstrating an important GPGPU use case.

Re: transform feedback, the following gist provides a concise example: https://gist.github.com/CodyJasonBennett/34c36b91719171c45ec50e850dc38a34

Although I haven't been able to get instancing to work fully yet for the above gist, this would theoretically allow you to achieve even greater speed-ups as described here:
https://webgl2fundamentals.org/webgl/lessons/webgl-instanced-drawing.html
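
For context, here's a minimal instanced drawing sketch in raw WebGL2 along the lines of that article (a hypothetical stand-alone example: one quad drawn N times, with a per-instance offset attribute advanced via vertexAttribDivisor(); error handling omitted):

const gl = document.createElement("canvas").getContext("webgl2")!;

const VS = `#version 300 es
in vec2 a_pos;       // per-vertex
in vec2 a_offset;    // per-instance
void main() { gl_Position = vec4(a_pos * 0.05 + a_offset, 0.0, 1.0); }`;

const FS = `#version 300 es
precision highp float;
out vec4 frag;
void main() { frag = vec4(1.0); }`;

const compile = (type: number, src: string) => {
    const sh = gl.createShader(type)!;
    gl.shaderSource(sh, src);
    gl.compileShader(sh);
    return sh;
};

const prog = gl.createProgram()!;
gl.attachShader(prog, compile(gl.VERTEX_SHADER, VS));
gl.attachShader(prog, compile(gl.FRAGMENT_SHADER, FS));
gl.linkProgram(prog);
gl.useProgram(prog);

// per-vertex geometry (2 triangles forming a quad)
const quad = new Float32Array([-1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1]);
const posBuf = gl.createBuffer()!;
gl.bindBuffer(gl.ARRAY_BUFFER, posBuf);
gl.bufferData(gl.ARRAY_BUFFER, quad, gl.STATIC_DRAW);
const posLoc = gl.getAttribLocation(prog, "a_pos");
gl.enableVertexAttribArray(posLoc);
gl.vertexAttribPointer(posLoc, 2, gl.FLOAT, false, 0, 0);

// per-instance offsets
const NUM = 1000;
const offsets = new Float32Array(NUM * 2).map(() => Math.random() * 2 - 1);
const offBuf = gl.createBuffer()!;
gl.bindBuffer(gl.ARRAY_BUFFER, offBuf);
gl.bufferData(gl.ARRAY_BUFFER, offsets, gl.STATIC_DRAW);
const offLoc = gl.getAttribLocation(prog, "a_offset");
gl.enableVertexAttribArray(offLoc);
gl.vertexAttribPointer(offLoc, 2, gl.FLOAT, false, 0, 0);
// advance this attribute once per instance (not once per vertex)
gl.vertexAttribDivisor(offLoc, 1);

// single draw call for all instances
gl.drawArraysInstanced(gl.TRIANGLES, 0, 6, NUM);

The single drawArraysInstanced() call replaces N individual draw calls, which is where the speed-up comes from.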

Beyond that, I do see value in providing a DataView to a vertex buffer, as there is less copying involved. For 4096 values the difference might be negligible compared to readPixels(), but I am investigating how to quickly process 2**24/31 array elements in a first pass with an optional reduction in a second pass (TextEncoder() on CPU -> byte-pair encoding on GPU -> byte-pair frequency tables on GPU -> sorting frequency tables on CPU).

Shader-ast seems like it will be of great help in implementing this functionality.

@postspectacular
Member

postspectacular commented Jul 11, 2024

Again, thank you! I will try to take a look at these links over the weekend. Just some side notes here:

  1. Instancing is fully supported by thi.ng/webgl, also (individually configurable) for passes in a multi-pass pipeline. You can find some instancing examples here (still have to extract some more interesting ones from other projects):
  2. Re: 4096 values: by that I mean that if it's a proper (full) reduction, you'd only ever have to read back a single pixel (vec4); all other data would stay on the GPU and wouldn't have to be read back. The 4096 comes from the example I built earlier today, where I'm also reading out all the intermediate textures. But I'll try to do some experiments with that other approach too (see the sketch below)... 👍
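
For illustration, the final readback would then be a hypothetical one-pixel read along these lines (lastFBO & gl are placeholder names; reading floats assumes a float-renderable texture in the final 1x1 pass):

import { readPixels, TextureFormat, TextureType } from "@thi.ng/webgl";

// with a full reduction only a single vec4 is read back, regardless of input size
const result = new Float32Array(4);
lastFBO.bind();
readPixels(gl, 0, 0, 1, 1, TextureFormat.RGBA, TextureType.FLOAT, result);
lastFBO.unbind();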
