-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Audio: Simplify volume ramp functions #6636
Conversation
singalsu
commented
Nov 18, 2022
•
edited
Loading
edited
@singalsu whats the MCPS improvement ? |
Currently tiny, only a fraction of MCPS from about +3 MCPS when ramp is executed. I wonder if the cache trash impact from this function to hot code part in volume shadows all improvement. Experiment with doubled update period 125 us -> 250 us could be worth it. I was maybe too critical when I tried to hear zipper effect i volume adjust. Opinion from a larger group could be useful. |
For ramp, I reported a ramp disfunctional bug to sof kernel team, Rander is working on it: |
966c00b
to
d7eaaaa
Compare
@lgirdwood @btian1 Thanks for reviewing. Please check this new version also. I took today another shot to this and got some better 0.5 MCPS saving during ramp. The linear ramp still adds 2.5 MCPS to normal volume load, almost double. Windows fade probably more, I haven't measured. |
If it was direct table lookup, it removes totally ramp customization and sample rate independence. I'm not changing features in this patch. It's another discussion. What is 0.9, windows fade ramp MCPS delta to volume MCPS? yes, windows ramp cost is 0.9 mcps, I tested. |
When 2.5 mcps happened, did you add ZC function? if added, then we can try to optimize it. |
The 2.5 MCPS is with linear. The high cost is surprising and I think there must be some cache performance issue. I will test by inlining also the ramp math (remove function call pointer). If the optimize brings help, the rest to replace arithmetic with HiFi SIMD is simple. |
d7eaaaa
to
db79e50
Compare
I think the above per 1 ms trace computed MCPS is not very suitable for low loads. But maybe the observation of 1.5x load during ramp is more reliable. Below is the per 1024 ms output performance counters based MCPS (trace value x 400/38400). If average is 2.3 or 2.6 MCPS then during ramp the load would be 3.5 or 3.9 MCPS. |
*/ | ||
static void volume_ramp(struct processing_module *mod) | ||
static inline void volume_ramp(struct vol_data *cd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and inline
shouldn't be needed here either, the compiler decides by itself when to inline static functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The load increased by 0.4 MCPS when I removed the inline from volume_ramp(), volume_linear_ramp(), and volume_windows_fade_ramp(). Seems it's beneficial with xt-xcc version RG-2017.8-linux that I use for TGL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, interesting, I'd literary expect the compiler to do that for us automatically. Since the gain isn't huge, to confirm that there's really a difference - could you perhaps just build the firmware with and without inline
and check that the resulting image changed (in size)? At least volume_windows_fade_ramp()
really shouldn't need inline
since you take its address:
cd->ramp_func = &volume_windows_fade_ramp;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so we dont have time to figure out why we get the XCC speed improvement with inline here, lets just mark this with an inline comment so that its obvious.
cd->channels = sink_c->stream.channels; | ||
cd->sample_rate = sink_c->stream.rate; | ||
cd->sample_rate_inv = (int32_t)((int64_t)1000 * INT32_MAX / sink_c->stream.rate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 1000LL
would have the same effect and looks prettier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
With this PR, can we say we lower down volume module 20% performance? perviously, I downed 5%, so if this PR can save 10%, left 5-10% can be saved by more advanced HIFI4 architecture. |
Maybe more safe to say this reduces by 14-16% the worst-case. Maybe 15% for single number. There's some trace print overhead included so it could be more. Also I tried as quick test to change the linear ramp multiplication to shift as fast approximation of multiply but the saving was minimal. So I didn't include any intrinsic multiplications. What might work additionally could be if we could change for selected components like volume the xcc optimization to -O3 and enable HiFi instructions for C. Now the optimization is -O2 for entire SOF. |
db79e50
to
465fb1f
Compare
@singalsu looks like there are some build failures. |
5eea43d
to
96f7557
Compare
@singalsu stiil some build failure
|
96f7557
to
db5ff50
Compare
@singalsu can you check CI, thanks ! |
db5ff50
to
7915400
Compare
7915400
to
e18b3e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, lets juts add the inline comment and we are good.
*/ | ||
static void volume_ramp(struct processing_module *mod) | ||
static inline void volume_ramp(struct vol_data *cd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so we dont have time to figure out why we get the XCC speed improvement with inline here, lets just mark this with an inline comment so that its obvious.
This patch saves about 15% of processing time or about 0.9 MCPS during volume ramp. The volume_ramp() functions are inlined to volume_process() function. The other use of function in idle is replaced by volume_ramp_check(). In ramp calculation the division in ramp time calculation is converted to faster multiplication with inverse of sample rate. The unnecessary checks for vol greater than vol_max or vol_min are eliminated by ensuring that tvolume[] is limited to vol_min and vol_max. The clear of ramp_coef[] at ramp finish is unnecessary. It's internal for linear ramp and is not used by the other ramp type. The other optimization is to pass to ramp math functions struct vol_data pointer instead of processing_module pointer to direct access to needed state variables. Signed-off-by: Seppo Ingalsuo <[email protected]>
e18b3e4
to
956941f
Compare
I just added the comment. Let's see if it passes now the CI. |
CI still failed, and title still have don't merge, please change title accordingly. |
The fail with TGLU_RVP_NOCODEC_IPC4ZPH seems to be a check-suspend-resume failure. I doubt it's related. The Internal Intel CI check that earlier failed has passed now. |
@andrula-song @kv2019i pls review |
yes, same failure on my PR as well. |
One system PM failure in https://sof-ci.01.org/sofpr/PR6636/build4370/devicetest/index.html , not related to volume ramp. Proceeding with merge. |