Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qcommon: create q_math.h and transform most functions as inline ones #505

Closed
wants to merge 1 commit into from

Conversation

illwieckz
Copy link
Member

@illwieckz illwieckz commented Jul 11, 2021

In #494 I wrote:

In non-VBO IQM code I'm trying to optimize in #389, I just noticed many of the vector functions are not inline ones, they come from q_math.cpp and what they do is not known by the compiler at the time the tr_surface.cpp file is compiled, the function calls are solved at build time.

So to see how that behaves, I transformed most functions as inline ones, while I was at it, I created a dedicted q_math.h (q_math.cpp already existed) to strip down q_shared.h. The q_shard.h still includes q_math.h so don't expect compilation speed up, that's not the topic (though, someone future PR may rely on this effort to improve that).

Edit: this PR also moves some mathematics function from q_shared.cpp into q_math.* file.

Unfortunately, to see better performances, it looks like more work is required to make sure to produce vectored bytecode. For example I copied some code into godbolt, and trying for example three variant of the same code, I noticed gcc turned variant one into SSE code while clang was turning variant two into SSE code and MSVC was turning variant 3 into SSE code. This is unpredictable and to make sure to produce SSE code, one may want to write SSE code explicitly.

Though, this is a good base for implementing future optimizations. When more functions will be turned into SSE code, we may see the inline benefit more: using multiple SSE inline functions in a row on the same data would just apply on the same registers.

The measured gain is currently low, because of that lack of SSE code.

On an old computer with slow CPU and slow GPU, when using this code over the IQM revamp code (see #389) the framerate went from 15 to 16 fps. That performance gain may sound small (1 fps…) or big (6%). For reference on this hardware that can't hardware accelerate the model animation, the Mesa CPU implementation fallback get 8 fps, our own CPU implementation fallback now gets 16 fps.

IQM revamp:

iqm slow on cpu

IQM revamp + inline q_math:

iqm slow on cpu

Turning those inline functions into SSE code may improve more.

Note: this scene is very heavy to render, there are in fact 12 buildables to render.

@illwieckz illwieckz force-pushed the qmath branch 2 times, most recently from 69560a5 to 8174671 Compare July 11, 2021 19:08
@slipher
Copy link
Member

slipher commented Jul 12, 2021

Please remove unnecessary whitespace changes in q_math.cpp.

@illwieckz
Copy link
Member Author

Please remove unnecessary whitespace changes in q_math.cpp.

This PR is about reorganizing existing code, for example there was math functions in both q_shared.cpp and q_math.cpp (and there was no q_math.h) and now all of those math functions are expected to be moved in a q_math.* file. White space, definition or declaration ordering may be changed. At the same time some of those functions were turned into inline ones.

This is basically a PR about rearranging the code and beautifying it a bit in a way the code is easier to read by both humans and compilers, so white space rewriting may happen (for humans).

There are things that do not need be to be reviewed or be commented:

  • function bodies, it's a rearrangement
  • function definition and declaration order, it's a rearrangement
  • whitespace, it's a rearrangement

Things that are expecting comments:

  • global opinion on moving math functions from q_shared to q_math
  • opinion on moving a specific function from q_shared to q_math
  • global opinion on inlining math functions
  • opinion on inlining a specific function

Other comment that may happen:

  • suggests for solving some chicken-and-egg problems between q_shared.h and q_math.h like type definition. Though, those suggests are very likely to be implemented in subsequent PR using this one as a foundation.

Example:

Review: I'm not very happy with the plane/collision code being in that q_math thing.
Answer: Doing it that way seems to be the easier way to solve chicken-and-egg type definition problems and creating a specific .cpp/.h couple of files would be too much intrusive for this PR.

Review: I'm not very happy with some type definition like byte) living in q_math.h, a way to solve chicken-and-egg problem would be to create a third header for type definition.
Answer: I agree but this would be too much intrusive for this PR.

Review: I'm unhappy with the Q_UNUSED define being done in q_math.h.
Answer: This is very unfortunate but solving the related chicken-and-egg problems may skyrocket the amount of files changed, and that would move this PR outside of math topic. We may better live with it for now.

@illwieckz
Copy link
Member Author

illwieckz commented Jul 22, 2021

So, I updated my workspace to build Unvanquished to make it easier to test various compilers and build options easily.

And I tested both master and qmath branch with GCC 9 and Clang 12, each with or without LTO.

I then ran the builds on those hardware:

  • CPU: Intel Core 2 Duo L7500
    GPU: Intel GMA965
  • CPU: AMD FX-9590
    GPU: AMD R9 390X

As usual, this is the scene I benchmark with r_vboVertexSkinning 0:

  • +devmap plat23 +delay 100f setviewpos 1893 1920 0 0 0.

The purpose is to test the performance gain on hardware requiring r_vboVertexSkinning 0.

So I have very bad news, while I can firmly reproduce the gain of such reorganization on Intel hardware known to need optimizations, this only happens on Clang 12 build without LTO on this specific hardware.

All the other uses case, GCC with or without LTO, Clang12 on other hardware, etc. get a huge slow down instead, a 20% performance loss on GCC with or without LTO on both hardware.

compiler vm branch L7500 FX-9590
gcc9 nexe master 15 48
gcc9+lto nexe master 15 55
gcc9 nexe qmath 12 38
gcc9+lto nexe qmath 12 43
clang12 nexe master 14 56
clang12+lto nexe master 15 59
clang12 nexe qmath 16 54
clang12+lto nexe qmath 15 52

What matters the most about performances is LTO, I remind some people told me LTO were not likely to produce big changes in our case and I unfortunately trusted that so I usually did not enable it to speed-up linkage. In fact LTO has huge impact. See also Unvanquished/release-scripts#12 for more details.

So we better do classic functions in .cpp files and rely heavily on LTO than inlining functions.

Our release builds are known to use LTO since 0.52.

So I'm closing this PR with the current shape, it has to be entirely redone from scratch, though some things may still be good to do, like moving everything from various files in q_math-named files.

@illwieckz illwieckz closed this Jul 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants