-
Notifications
You must be signed in to change notification settings - Fork 288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable better vectorization for generic convolution #692
base: master
Are you sure you want to change the base?
Conversation
Break the single dependence chain into two parallel sub-chains. Provides 2-4% performance uplift as measured on modern ARM systems when using the generic codepath.
Thanks for providing this information and sharing this optimization. I do have some questions though. You mention this change is an improvement for "modern ARM systems". Did you do testing on other systems and architectures too? Can you provide some numbers? I guess the intrinsics parts are stripped for the benchmark, so performance of this bit of code is important on x86-64 too? Also, can you explain how you tested this? By default FLAC is configured to compile with the GCC options |
Sure thing. Since we are running within the SPEC CPU harness, all files are built with the same compiler switches (that is one of the requirements for the benchmark). We used Here are the measured runtimes in seconds. one-sum is the original baseline, two-sum is with the new code. You are right, intrinsics are stripped so we are comparing apples to apples. It does help x86 as well.
After seeing your comment above, we rebuilt with the FLAC default compiler options added, and the results were the same. Those switches don't benefit us here: |
Just to be clear, you added those options to the options you mentioned above? The thing is, I've tried GCC's LTO in the past, and without exception it produced slower builds. So perhaps this optimization only works with LTO, because I am seeing no gains at all. My best guess is that without LTO, these functions are very well optimized, and with LTO, they are less so (because it results in quite a large binary), but your change drives the compiler to optimize a little more. |
No, we replaced our flags with your flags. I confirmed that the gains are not coming from LTO by removing -flto and rerunning; the asm produced is isolated to one file. Which machine are you testing on? My 7 year old x86 desktop did not show the gains I listed above. |
I've tested with an Intel Xeon E-2224G and an Intel Core i5 7200U. Results for the latter are graphed here. No difference in speed, at the cost of a bigger binary. Compile with GCC 13.2.0. Just to be sure, are you using 16-bit, 24-bit input or something else entirely? |
That's a cool graph! Can you explain the axes? The title says "CPU-time vs compression", so I thought that means "Y-axis vs X-axis", but the axes are labeled differently (Compression is on Y-axis, and CPU-time is not plotted but CPU-usage is?). Is this from one run or twelve runs? If you share your input and cmdlines, I can work to produce the same data on my side and send it over. It's good that you reproduced my Intel results :-) The desktop machine I mentioned above an i7-8809G. The reason behind the lack of improvement is that the cores in older x86 machines don't have as many execution units as the newer ones. Hence they cannot take advantage of the increased instruction level parallelism that we are exposing with the two-sum patch. My inputs are 32-bit, WAV files. If it helps, I have placed them on a dropbox here. I have full rights to the audio content, I produced these myself, and I am giving them away freely for any use. |
Hi @ktmf01! I took your changes from !700 and !702, applied them on top of the changes here, and observed some more performance on my dev machine (Ampere Altra). I ran a couple of times to make sure. My |
I want to share that the SPEC CPU development committee has been testing with this edit for a month, without any issues across a variety of hardware and compilers (AMD AOCC, Intel ICC, NVHPC, IBM OpenXL, MSVC, LLVM, GCC) and they are all stable and verify with the same results. And we see performance gains on some platforms. What are the next steps on your side? Anything else I can help with? |
To be honest, I'm not yet convinced. I think that a gain of 1.7% with non-standard compiler options, non-standard compression settings and somewhat uncommon hardware is not something worth integrating. Also, I found a system where this patch results in slower execution. I tested on a BCM2835 (Raspberry Pi Zero W) with options -8ep and standard compiler settings, and I got an 1.2% decrease in execution speed. |
flac is being considered for the next version of the SPEC CPU benchmark suite, currently dubbed CPUv8. As such, we are interested in the generic (non-intrinsic) code paths that can run on all current and future architectures. And as part of the intense scrutiny that benchmarks undergo, we saw a performance improvement opportunity in the generic code path in
libFLAC/lpc.c
and would like to share it with the community.In the loop for computing residual from qlp coefficients, if we break the single dependence chain into two parallel sub-chains, we allow vectorization instructions to be interleaved during program execution. This provides 2-4% performance uplift as measured on modern ARM systems due to increased instruction level parallelism. Our two benchmark workloads are encoding podcasts using
-7ep --replay-gain
and-8p
. We did try breaking the chain down further into four sub-chains, but didn't see any additional gains.