qcommon: create q_math.h and transform most functions as inline ones #505
Please remove unnecessary whitespace changes in q_math.cpp.
This PR is about reorganizing existing code, for example there were math functions in both
This is basically a PR about rearranging the code and beautifying it a bit so that it is easier to read for both humans and compilers, so whitespace rewriting may happen (for humans). Some things do not need to be reviewed or commented on:
Things that do expect comments:
Other comments that may come up:
Example:
Review: I'm not very happy with the plane/collision code being in that
Review: I'm not very happy with some type definition like
Review: I'm unhappy with the
So, I updated my workspace for building Unvanquished to make it easier to test various compilers and build options. And I tested both
I then ran the builds on the following hardware:
As usual, this is the scene I benchmark with
The purpose is to test the performance gain on hardware requiring
So I have very bad news: while I can reliably reproduce the gain from this reorganization on Intel hardware known to need optimizations, it only happens with the Clang 12 build without LTO on this specific hardware. All the other use cases (GCC with or without LTO, Clang 12 on other hardware, etc.) get a huge slowdown instead: a 20% performance loss with GCC, with or without LTO, on both machines.
What matters most for performance is LTO. I recall that some people told me LTO was not likely to produce big changes in our case, and I unfortunately trusted that, so I usually did not enable it in order to speed up linking. In fact LTO has a huge impact. See also Unvanquished/release-scripts#12 for more details. So we had better write classic functions in .cpp files and rely heavily on LTO rather than on inline functions. Our release builds are known to use LTO since 0.52. So I'm closing this PR in its current shape; it has to be entirely redone from scratch, though some things may still be good to do, like moving everything from various files in
In #494 I wrote:
So to see how that behaves, I transformed most functions into inline ones. While I was at it, I created a dedicated q_math.h (q_math.cpp already existed) to strip down q_shared.h. q_shared.h still includes q_math.h, so don't expect a compilation speed-up; that's not the topic (though some future PR may rely on this effort to improve that).

Edit: this PR also moves some mathematical functions from q_shared.cpp into the q_math.* files.

Unfortunately, to see better performance, it looks like more work is required to make sure vectorized code is produced. For example, I copied some code into godbolt and tried three variants of the same code: GCC turned variant one into SSE code, Clang turned variant two into SSE code, and MSVC turned variant three into SSE code. This is unpredictable, so to make sure SSE code is produced, one may want to write SSE code explicitly.
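To illustrate the difference between hoping for auto-vectorization and writing SSE explicitly, here is a minimal sketch. It is not code from this PR: `Vec4Sketch`, `AddScalarSketch`, `AddExplicitSketch` and the `SKETCH_HAS_SSE` macro are hypothetical names made up for the example, and the 16-byte alignment is an assumption of the sketch. Only the `<xmmintrin.h>` intrinsics (`_mm_load_ps`, `_mm_add_ps`, `_mm_store_ps`) are real.

```cpp
// Use SSE intrinsics only where the compiler advertises SSE support.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical 4-float vector, 16-byte aligned so that aligned loads/stores are valid.
struct alignas(16) Vec4Sketch {
    float v[4];
};

// Scalar variant: whether this loop becomes SSE code is left to the compiler.
inline Vec4Sketch AddScalarSketch(const Vec4Sketch& a, const Vec4Sketch& b) {
    Vec4Sketch out;
    for (int i = 0; i < 4; i++) {
        out.v[i] = a.v[i] + b.v[i];
    }
    return out;
}

// Explicit variant: requests a packed SSE add directly when SSE is available,
// and falls back to the scalar loop otherwise.
inline Vec4Sketch AddExplicitSketch(const Vec4Sketch& a, const Vec4Sketch& b) {
#if SKETCH_HAS_SSE
    Vec4Sketch out;
    _mm_store_ps(out.v, _mm_add_ps(_mm_load_ps(a.v), _mm_load_ps(b.v)));
    return out;
#else
    return AddScalarSketch(a, b);
#endif
}
```

Both variants return the same values; the difference is only that the second one does not depend on a given compiler recognizing the loop.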
Still, this is a good base for implementing future optimizations. Once more functions are turned into SSE code, we may see more benefit from inlining: several SSE inline functions called in a row on the same data would just keep working on the same registers.
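A rough sketch of that idea, under stated assumptions: the helpers below (`Reg4Sketch`, `LoadSketch`, `ScaleSketch`, `AddSketch`, `StoreSketch`, `MulAddSketch`) are hypothetical names for illustration, not functions from q_math.h. Because the small helpers take and return a register-sized value and are inline, a compiler that inlines them can typically keep the intermediate results in SSE registers between calls instead of writing them back to memory.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>

// Hypothetical register-sized value: one SSE register holding 4 floats.
typedef __m128 Reg4Sketch;

inline Reg4Sketch LoadSketch(const float* p) { return _mm_loadu_ps(p); }
inline void StoreSketch(float* p, Reg4Sketch r) { _mm_storeu_ps(p, r); }
inline Reg4Sketch ScaleSketch(Reg4Sketch r, float s) { return _mm_mul_ps(r, _mm_set1_ps(s)); }
inline Reg4Sketch AddSketch(Reg4Sketch a, Reg4Sketch b) { return _mm_add_ps(a, b); }

#else

// Scalar fallback with the same interface, for targets without SSE.
struct Reg4Sketch { float f[4]; };

inline Reg4Sketch LoadSketch(const float* p) {
    Reg4Sketch r;
    for (int i = 0; i < 4; i++) r.f[i] = p[i];
    return r;
}
inline void StoreSketch(float* p, Reg4Sketch r) {
    for (int i = 0; i < 4; i++) p[i] = r.f[i];
}
inline Reg4Sketch ScaleSketch(Reg4Sketch r, float s) {
    for (int i = 0; i < 4; i++) r.f[i] *= s;
    return r;
}
inline Reg4Sketch AddSketch(Reg4Sketch a, Reg4Sketch b) {
    for (int i = 0; i < 4; i++) a.f[i] += b.f[i];
    return a;
}

#endif

// out = base + dir * scale, composed from the small inline helpers above.
// All four pointers must reference at least 4 floats.
inline void MulAddSketch(const float* base, float scale, const float* dir, float* out) {
    StoreSketch(out, AddSketch(LoadSketch(base), ScaleSketch(LoadSketch(dir), scale)));
}
```

If the same helpers were ordinary functions defined in a .cpp file and built without LTO, each call would have to pass its operands across a real function call boundary, which is the overhead this PR was trying to measure.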
The measured gain is currently low, because of that lack of SSE code.
On an old computer with a slow CPU and a slow GPU, when using this code on top of the IQM revamp code (see #389), the framerate went from 15 to 16 fps. That performance gain may sound small (1 fps…) or big (6%). For reference, on this hardware, which can't hardware-accelerate the model animation, the Mesa CPU fallback implementation gets 8 fps, while our own CPU fallback implementation now gets 16 fps.
IQM revamp:
IQM revamp + inline q_math:
Turning those inline functions into SSE code may improve performance further.
Note: this scene is very heavy to render; there are in fact 12 buildables to render.