Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Divide large constant buffer into subsets and implement push constants for performance #855

Open
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

SRSaunders
Copy link

@SRSaunders SRSaunders commented Jan 25, 2024

This PR replaces the performance part of #818, which will be closed and not merged. It has one dependency on nvrhi changes: RobertBeckebans/nvrhi#6 for relaxing nvrhi push constant limits to permit platform-specific runtime checks. (UPDATE: dependency now merged into nvrhi)

This fixes the performance part of #763.

Details are as follows:

  1. Separates the single large constant buffer into renderparm subsets (12 in total: 3 of 128 bytes in size, 6 greater than 128 bytes but less than 256 bytes, and 3 greater than 256 bytes but less than 1024 bytes).
  2. Adds new binding layout types to associate and differentiate between the new subsets (BINDING_LAYOUT_GBUFFER, BINDING_LAYOUT_GBUFFER_SKINNED, BINDING_LAYOUT_TEXTURE, BINDING_LAYOUT_TEXTURE_SKINNED, BINDING_LAYOUT_WOBBLESKY, BINDING_LAYOUT_SSGI, BINDING_LAYOUT_SSGI_SKINNED, BINDING_LAYOUT_POST_PROCESS).
  3. Implements push constants for Vulkan and DX12 across all platforms: Linux, macOS, Windows. This has varying degrees of performance improvement, the largest being on Vulkan for Linux and macOS. Windows Vulkan shows modest improvement dependant on the GPU vendor (Nvidia's 256 byte limit is better than AMD's 128 byte limit on Windows). Windows DX12 shows similar performance when using push constants vs. volatile constant buffers. However, DX12 does show a performance improvement due to constant buffer subsetting with better change detection logic. I have defined a new boolean r_useDX12PushConstants cvar which is turned off by default. This can optionally be turned on using autoexec.cfg for experimentation.
  4. Reduced the volatile constant max buffer count from 16,384 to 8,192. I believe this is sufficient but if testing reveals differently, then it could be boosted back up. Note that when push constants are enabled it reduces the requirement.
  5. Adds basic infrastructure for static constant buffers but these are disabled for now. This could be a possibility for the future but further subset refactoring would likely be needed, and sync issues would have to be resolved.
  6. Modified uniforms change detection logic (orthogonal to push constants) which has a very significant positive impact on performance. See performance timings below. Also added new cvar r_useVulkanPushConstants (default on) which is useful for performance comparisons.
  7. Don't allocate constant buffers unless required (i.e. when push constants disabled for a given binding layout type).
  8. Fixed ImmediateMode (mainly for debug tools) to work with push constants enabled.
  9. Additional Info: One other benefit of these changes is a reduction in Vulkan GPU memory usage. For example, on the Marine HQ hallway scene, GPU memory usage is reduced from ~2400 MB to ~1800 MB, or about 25%. The benefits come from three areas (from largest to smallest impact): a) splitting up the single large uniforms buffer into smaller subsets which avoids duplication of unused renderparms memory for each binding layout type, b) reducing the number of volatile constant buffer copies required for each binding layout type, and c) enabling push constants. For DX12, interestingly there appears to be no difference in GPU memory usage (~1900 MB) before and after this PR.

Tested on Windows 10 (AMD and Nvidia), Linux Manjaro, and macOS Ventura 13.5

Performance timings for this PR vs. current master, generated using a simple home-made timedemo:

Windows Nvidia System (1070 Ti)
DX12: 263 fps before, 360 fps after (with r_useDX12PushConstants = 0) --> significant improvement
Vulkan: 218 fps before, 333 fps after --> significant improvement

Windows AMD System (6600 XT)
DX12: 295 fps before, 305 fps after (with r_useDX12PushConstants = 0) --> neutral/positive improvement
Vulkan: 150 fps before, 160 fps after --> neutral/positive improvement

Linux AMD System (6600 XT)
Vulkan: 150 fps before, 270 fps after --> large improvement

macOS AMD System (6600 XT)
Vulkan: 77 fps before, 245 fps after --> very large improvement

macOS Apple Silicon System (M1 Air)
Vulkan: 6 fps before, 85 fps after --> massive improvement

See on-screen HUD statistics (FPS, GPU Memory, CPU/GPU Relative Usage % for com_fixedTic = 1) in the following screenshots showing the independent impact of: a) uniforms buffer subsetting, and b) push constants.

Notes re test setup:

  1. For my specific h/w setup (Intel Core i7 + AMD 6600XT GPU) macOS saturates on the CPU before the GPU. This can happen due to Vulkan to Metal encoding which occurs only after the renderer backend is finished. If this takes longer than the GPU to finish its work, then the frame must wait for completion and frame sync times will increase. For other hardware setups including Apple Silicon, this may not be the case and the GPU may saturate before the CPU. If the extra encoding step was not required on macOS, FPS numbers would likely be closer to linux where Vulkan is native.
  2. Note all tests were done at 1920x1080 except for Windows DX12, which was at 2560x1440. This was necessary to show DX12 benchmark differences before and after this PR. Otherwise FPS rates would be CPU-limited (to around 250 fps for this scene) and no differences would be observable. As a result, you cannot directly compare DX12 frame rates with Vulkan frame rates, but only relative before-and-after differences.

macOS Vulkan: Baseline using current master + PR #854 but without this PR:
rbdoom-3-bfg-20240130-092353-001

macOS Vulkan: Impact of uniforms buffer subsetting with push constants disabled:
rbdoom-3-bfg-20240130-093651-001

macOS Vulkan: Impact of uniforms buffer subsetting with push constants enabled:
rbdoom-3-bfg-20240130-161150-002 fixedTic

linux Vulkan: Baseline using current master + PR #854 but without this PR:
rbdoom-3-bfg-20240130-125300-003

linux Vulkan: Impact of uniforms buffer subsetting with push constants disabled:
rbdoom-3-bfg-20240130-161807-002 fixedTic

linux Vulkan: Impact of uniforms buffer subsetting with push constants enabled:
rbdoom-3-bfg-20240130-162011-002 fixedTic

Windows Vulkan: Baseline using current master + PR #854 but without this PR:
rbdoom-3-bfg-20240130-180511-001

Windows Vulkan: Impact of uniforms buffer subsetting with push constants disabled:
rbdoom-3-bfg-20240130-180959-001

Windows Vulkan: Impact of uniforms buffer subsetting with push constants enabled:
rbdoom-3-bfg-20240130-181205-002

Windows DX12: Baseline using current master + PR #854 but without this PR:
rbdoom-3-bfg-20240130-182409-002

Windows DX12: Impact of uniforms buffer subsetting with push constants disabled:
rbdoom-3-bfg-20240130-182731-002

Windows DX12: Impact of uniforms buffer subsetting with push constants enabled:
rbdoom-3-bfg-20240130-182847-003

@SRSaunders
Copy link
Author

Now updated to your latest master with retro rendering modes and crt post processing. The merge was large given the number of changes to master, but fairly straight forward with respect to the new shaders. Fortunately they all fell into the BINDING_LAYOUT_POST_PROCESS_FINAL binding layout type with its existing renderparm subset. I did not have to make any changes to the mappings, just simply mod the new shaders to add the pc. preface and include the correct subset.

Here is the updated mapping spreadsheet: Binding to Shader Mapping v5.xlsx

At first I was not sure about the new modes, but now I really like the PSX + Newpixie CRT setting. Feels very 90s!

@RobertBeckebans
Copy link
Owner

Yeah the retro modes are something I had in mind for a longer time because there are enough people who lurk into this engine for indie dev and I personally grew up as a kid with the C64, CPC 6128 and Amiga 600.
I also thought about forking this project and doing the retro there just for standalone development but everyone is focused on RBDOOM and I don't want to do the marketing and backporting of new features. The PSX rendering isn't done yet. The PSX branch also introduces a new renderparm which is used among many shaders for the vertex snapping and texture warping effects. They will be optional by r_psx* cvars but still be needed for a faithful PSX fake experience.

@SRSaunders
Copy link
Author

Thanks for the heads up re your PSX branch and a new renderparm. Is there any way I could have a look at that to understand the implications for this branch and future subset handing?

@SRSaunders
Copy link
Author

SRSaunders commented Feb 25, 2024

Updated to be compatible with nvrhi + ShaderMake rebase.

Still depends on RobertBeckebans/nvrhi#6. (this dependency now merged into nvrhi)

@SRSaunders
Copy link
Author

SRSaunders commented Apr 2, 2024

r_useVulkanPushConstants renamed to r_vkUsePushConstants on merge.
r_useDX12PushConstants renamed to r_dxUsePushConstants on merge.

Also set ShaderMake retryCount=20 to recover from any macOS/linux shell failures during parallel SPIRV compilation. Increased from default retryCount=10 to handle doubling of shader combinations due to USE_PUSH_CONSTANTS.

@SRSaunders SRSaunders mentioned this pull request Jul 6, 2024
@SRSaunders
Copy link
Author

SRSaunders commented Aug 19, 2024

@RobertBeckebans I have updated this branch/PR to be compatible with your new PSX vertex jitter and affine rendering modes. It was a bit more work than last time since the renderparms had changed for many shaders. I had to do a bit of refactoring to make as much as possible fit into Vulkan push constant limits: 128 bytes for AMD on Windows, 256 bytes for nVidia on Windows/linux plus AMD on linux. macOS was not an issue since the push constant limit there is 4096 bytes, but smaller is still faster and helps with performance on Apple. I have attached the new renderparm to shader mapping spreadsheet here for anyone who cares to see how it works. Binding to Shader Mapping v6.xlsx

I am wondering what to do with this branch going forward. I know you will likely not merge this into main, but I am putting it out there for other macOS, linux, and Windows users who want a bit more performance out of Vulkan. Do you have the concept of optional components or branches (e.g. VR) that you may want to distribute in the future? Perhaps this is something that could fit into that kind of optionality framework, at least for a fixed release version.

In any case, great work on the PSX rendermode stuff! I find the PSX retro look combined with the New Pixie crt filter to be pretty cool. Should be excellent as an engine for developing retro games with the original Tomb Raider look/style.

Here is a screen grab of PSX mode running on macOS Ventura x86 with an AMD 6600XT card at full speed 120 fps with this PR. You can see the affine warping very well on the floor tiles:
rbdoom-3-bfg-20240819-175800-005

@RobertBeckebans
Copy link
Owner

Ah well I made a backup of your new shader table in my GDrive and it is interesting what kind of an impact it has to just reference a few more renderparms besides the new rpPSXDistortions. I'm glad that you could track the necessary changes.

I would tend to keep your changes into a macOS branch.

As you saw I also fixed a math problem with the SSAO shader. More precisely the reconstruction of a world position vector from any non-linear depth value which is usually necessary for all kinds of screen space related raytracing / marching effects. Therefore I would like to give #498 a try before freezing RBDoom 1.6 and that feature will not only add a new shader but will also require changes in the gbuffer shader and the indirect lighting / lightgrid shaders. However I will give that feature a week and that does not work out then it will be post poned to 1.7.

I also would like to update NVRHI & ShaderMake again to have the latest bugfixes available.

@SRSaunders
Copy link
Author

SRSaunders commented Aug 20, 2024

it is interesting what kind of an impact it has to just reference a few more renderparms besides the new rpPSXDistortions

This is mainly due to the heavy restriction on Vulkan push constant sizes in Windows and linux: 128 bytes for some implementations and 256 for others. That's not a lot of data when you are doing innovative shader calculations that require many float parameters, matrices, etc. It's a struggle to fit the parameters into these small limits.

I would tend to keep your changes into a macOS branch.

Interesting thought. Up to now I have been thinking about this problem as a general performance solution for Vulkan across all platforms. If I were to look at a solution for macOS only, I might approach it differently since push constant limits are not a concern on that platform, and juggling to fit things into 128 or 256 bytes would not be necessary. Size optimization is still useful there, but eliminating these low threshold limits would relax the need to worry about every single shader parm addition when doing new work. In addition, if I were solving this for macOS only, the number of render parm sets could be reduced or eliminated and the general-purpose code for dynamic activation of push constant sets at runtime could be simplified.

If I were to come up with a new macOS-only design for this would you be interested?

@SRSaunders
Copy link
Author

Updated Binding to Shader Mapping v7.xlsx spreadsheet that includes the new SMAA binding layouts.

@SRSaunders
Copy link
Author

SRSaunders commented Dec 9, 2024

Updated Binding to Shader Mapping v8.xlsx spreadsheet that includes the new Octahedron Cube binding layouts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants