qcommon: create q_math.h and transform most functions as inline ones #505

illwieckz · 2021-07-11T18:22:56Z

In #494 I wrote:

In non-VBO IQM code I'm trying to optimize in #389, I just noticed many of the vector functions are not inline ones, they come from q_math.cpp and what they do is not known by the compiler at the time the tr_surface.cpp file is compiled, the function calls are solved at build time.

So to see how that behaves, I transformed most functions as inline ones, while I was at it, I created a dedicted q_math.h (q_math.cpp already existed) to strip down q_shared.h. The q_shard.h still includes q_math.h so don't expect compilation speed up, that's not the topic (though, someone future PR may rely on this effort to improve that).

Edit: this PR also moves some mathematics function from q_shared.cpp into q_math.* file.

Unfortunately, to see better performances, it looks like more work is required to make sure to produce vectored bytecode. For example I copied some code into godbolt, and trying for example three variant of the same code, I noticed gcc turned variant one into SSE code while clang was turning variant two into SSE code and MSVC was turning variant 3 into SSE code. This is unpredictable and to make sure to produce SSE code, one may want to write SSE code explicitly.

Though, this is a good base for implementing future optimizations. When more functions will be turned into SSE code, we may see the inline benefit more: using multiple SSE inline functions in a row on the same data would just apply on the same registers.

The measured gain is currently low, because of that lack of SSE code.

On an old computer with slow CPU and slow GPU, when using this code over the IQM revamp code (see #389) the framerate went from 15 to 16 fps. That performance gain may sound small (1 fps…) or big (6%). For reference on this hardware that can't hardware accelerate the model animation, the Mesa CPU implementation fallback get 8 fps, our own CPU implementation fallback now gets 16 fps.

IQM revamp:

IQM revamp + inline q_math:

Turning those inline functions into SSE code may improve more.

Note: this scene is very heavy to render, there are in fact 12 buildables to render.

slipher · 2021-07-12T02:27:23Z

Please remove unnecessary whitespace changes in q_math.cpp.

illwieckz · 2021-07-12T09:05:36Z

Please remove unnecessary whitespace changes in q_math.cpp.

This PR is about reorganizing existing code, for example there was math functions in both q_shared.cpp and q_math.cpp (and there was no q_math.h) and now all of those math functions are expected to be moved in a q_math.* file. White space, definition or declaration ordering may be changed. At the same time some of those functions were turned into inline ones.

This is basically a PR about rearranging the code and beautifying it a bit in a way the code is easier to read by both humans and compilers, so white space rewriting may happen (for humans).

There are things that do not need be to be reviewed or be commented:

function bodies, it's a rearrangement
function definition and declaration order, it's a rearrangement
whitespace, it's a rearrangement

Things that are expecting comments:

global opinion on moving math functions from q_shared to q_math
opinion on moving a specific function from q_shared to q_math
global opinion on inlining math functions
opinion on inlining a specific function

Other comment that may happen:

suggests for solving some chicken-and-egg problems between q_shared.h and q_math.h like type definition. Though, those suggests are very likely to be implemented in subsequent PR using this one as a foundation.

Example:

Review: I'm not very happy with the plane/collision code being in that q_math thing.
Answer: Doing it that way seems to be the easier way to solve chicken-and-egg type definition problems and creating a specific .cpp/.h couple of files would be too much intrusive for this PR.

Review: I'm not very happy with some type definition like byte) living in q_math.h, a way to solve chicken-and-egg problem would be to create a third header for type definition.
Answer: I agree but this would be too much intrusive for this PR.

Review: I'm unhappy with the Q_UNUSED define being done in q_math.h.
Answer: This is very unfortunate but solving the related chicken-and-egg problems may skyrocket the amount of files changed, and that would move this PR outside of math topic. We may better live with it for now.

illwieckz · 2021-07-22T19:13:53Z

So, I updated my workspace to build Unvanquished to make it easier to test various compilers and build options easily.

And I tested both master and qmath branch with GCC 9 and Clang 12, each with or without LTO.

I then ran the builds on those hardware:

CPU: Intel Core 2 Duo L7500
GPU: Intel GMA965
CPU: AMD FX-9590
GPU: AMD R9 390X

As usual, this is the scene I benchmark with r_vboVertexSkinning 0:

+devmap plat23 +delay 100f setviewpos 1893 1920 0 0 0.

The purpose is to test the performance gain on hardware requiring r_vboVertexSkinning 0.

So I have very bad news, while I can firmly reproduce the gain of such reorganization on Intel hardware known to need optimizations, this only happens on Clang 12 build without LTO on this specific hardware.

All the other uses case, GCC with or without LTO, Clang12 on other hardware, etc. get a huge slow down instead, a 20% performance loss on GCC with or without LTO on both hardware.

compiler	vm	branch	L7500	FX-9590
gcc9	nexe	master	15	48
gcc9+lto	nexe	master	15	55
gcc9	nexe	qmath	12	38
gcc9+lto	nexe	qmath	12	43
clang12	nexe	master	14	56
clang12+lto	nexe	master	15	59
clang12	nexe	qmath	16	54
clang12+lto	nexe	qmath	15	52

What matters the most about performances is LTO, I remind some people told me LTO were not likely to produce big changes in our case and I unfortunately trusted that so I usually did not enable it to speed-up linkage. In fact LTO has huge impact. See also Unvanquished/release-scripts#12 for more details.

So we better do classic functions in .cpp files and rely heavily on LTO than inlining functions.

Our release builds are known to use LTO since 0.52.

So I'm closing this PR with the current shape, it has to be entirely redone from scratch, though some things may still be good to do, like moving everything from various files in q_math-named files.

illwieckz mentioned this pull request Jul 11, 2021

some revamp on IQM and MD5 model code #389

Merged

illwieckz force-pushed the qmath branch 2 times, most recently from 69560a5 to 8174671 Compare July 11, 2021 19:08

qcommon: create q_math.h and transform most functions as inline ones

69498a3

illwieckz force-pushed the qmath branch from 8174671 to 69498a3 Compare July 11, 2021 19:53

illwieckz mentioned this pull request Jul 12, 2021

SSE/AVX alternative to lrintf and other things like that? Or use a third party library others have well optimized? #494

Open

illwieckz closed this Jul 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

qcommon: create q_math.h and transform most functions as inline ones #505

qcommon: create q_math.h and transform most functions as inline ones #505

illwieckz commented Jul 11, 2021 •

edited

Loading

slipher commented Jul 12, 2021

illwieckz commented Jul 12, 2021

illwieckz commented Jul 22, 2021 •

edited

Loading

qcommon: create q_math.h and transform most functions as inline ones #505

qcommon: create q_math.h and transform most functions as inline ones #505

Conversation

illwieckz commented Jul 11, 2021 • edited Loading

slipher commented Jul 12, 2021

illwieckz commented Jul 12, 2021

illwieckz commented Jul 22, 2021 • edited Loading

illwieckz commented Jul 11, 2021 •

edited

Loading

illwieckz commented Jul 22, 2021 •

edited

Loading