Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Audio: Simplify volume ramp functions #6636

Merged
merged 1 commit into from
Mar 9, 2023

Conversation

singalsu
Copy link
Collaborator

@singalsu singalsu commented Nov 18, 2022

This patch saves about 15% of processing time or about 0.9 MCPS
during volume ramp.

The volume_ramp() functions are inlined to volume_process() function. The
other use of function in idle is replaced by volume_ramp_check().

In ramp calculation the division in ramp time calculation is converted
to faster multiplication with inverse of sample rate.

The unnecessary checks for vol greater than vol_max or vol_min are
eliminated by ensuring that tvolume[] is limited to vol_min and vol_max.

The clear of ramp_coef[] at ramp finish is unnecessary. It's
internal for linear ramp and is not used by the other ramp type.

The other optimization is to pass to ramp math functions struct
vol_data pointer instead of processing_module pointer to direct
access to needed state variables.

Signed-off-by: Seppo Ingalsuo <[email protected]>

@lgirdwood
Copy link
Member

@singalsu whats the MCPS improvement ?

@singalsu
Copy link
Collaborator Author

@singalsu whats the MCPS improvement ?

Currently tiny, only a fraction of MCPS from about +3 MCPS when ramp is executed. I wonder if the cache trash impact from this function to hot code part in volume shadows all improvement. Experiment with doubled update period 125 us -> 250 us could be worth it. I was maybe too critical when I tried to hear zipper effect i volume adjust. Opinion from a larger group could be useful.

@btian1
Copy link
Contributor

btian1 commented Nov 25, 2022

For ramp, I reported a ramp disfunctional bug to sof kernel team, Rander is working on it:
thesofproject/linux#4026
I hard coded ramp in local and test performance with real time, it is around 0.9 for window ramp.

@singalsu singalsu changed the title [DRAFT][Do not review] Audio: Simplify volume ramp functions Audio: Simplify volume ramp functions Nov 25, 2022
@singalsu
Copy link
Collaborator Author

@lgirdwood @btian1 Thanks for reviewing. Please check this new version also.

I took today another shot to this and got some better 0.5 MCPS saving during ramp. The linear ramp still adds 2.5 MCPS to normal volume load, almost double. Windows fade probably more, I haven't measured.

@singalsu
Copy link
Collaborator Author

singalsu commented Nov 25, 2022

For ramp, I reported a ramp disfunctional bug to sof kernel team, Rander is working on it: thesofproject/linux#4026 I hard coded ramp in local and test performance with real time, it is around 0.9 for window ramp.

If it was direct table lookup, it removes totally ramp customization and sample rate independence. I'm not changing features in this patch. It's another discussion.

What is 0.9, windows fade ramp MCPS delta to volume MCPS?

yes, windows ramp cost is 0.9 mcps, I tested.

@btian1
Copy link
Contributor

btian1 commented Nov 28, 2022

When 2.5 mcps happened, did you add ZC function? if added, then we can try to optimize it.
Also linear itself should not cause so big mcps cost.

@singalsu
Copy link
Collaborator Author

singalsu commented Nov 28, 2022

When 2.5 mcps happened, did you add ZC function? if added, then we can try to optimize it. Also linear itself should not cause so big mcps cost.

The 2.5 MCPS is with linear. The high cost is surprising and I think there must be some cache performance issue. I will test by inlining also the ramp math (remove function call pointer). If the optimize brings help, the rest to replace arithmetic with HiFi SIMD is simple.

@singalsu
Copy link
Collaborator Author

I in-lined the ramp math too, now the saving is about 0.9 MCPS. Below is a figure of two PGAs 1.2 and 30.67 from s32 playback to hda-generic headset. Upper pictures are before, lower are after. The higher part of the MCPS plot is 20 ms ramp, the lower part is the normal operation after ramp.

Screenshot from 2022-11-28 17-38-05

@singalsu
Copy link
Collaborator Author

singalsu commented Nov 28, 2022

I think the above per 1 ms trace computed MCPS is not very suitable for low loads. But maybe the observation of 1.5x load during ramp is more reliable. Below is the per 1024 ms output performance counters based MCPS (trace value x 400/38400). If average is 2.3 or 2.6 MCPS then during ramp the load would be 3.5 or 3.9 MCPS.

Screenshot from 2022-11-28 18-20-23

*/
static void volume_ramp(struct processing_module *mod)
static inline void volume_ramp(struct vol_data *cd)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and inline shouldn't be needed here either, the compiler decides by itself when to inline static functions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The load increased by 0.4 MCPS when I removed the inline from volume_ramp(), volume_linear_ramp(), and volume_windows_fade_ramp(). Seems it's beneficial with xt-xcc version RG-2017.8-linux that I use for TGL.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, interesting, I'd literary expect the compiler to do that for us automatically. Since the gain isn't huge, to confirm that there's really a difference - could you perhaps just build the firmware with and without inline and check that the resulting image changed (in size)? At least volume_windows_fade_ramp() really shouldn't need inline since you take its address:

	cd->ramp_func =  &volume_windows_fade_ramp;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so we dont have time to figure out why we get the XCC speed improvement with inline here, lets just mark this with an inline comment so that its obvious.

cd->channels = sink_c->stream.channels;
cd->sample_rate = sink_c->stream.rate;
cd->sample_rate_inv = (int32_t)((int64_t)1000 * INT32_MAX / sink_c->stream.rate);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 1000LL would have the same effect and looks prettier

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@btian1
Copy link
Contributor

btian1 commented Nov 30, 2022

With this PR, can we say we lower down volume module 20% performance? perviously, I downed 5%, so if this PR can save 10%, left 5-10% can be saved by more advanced HIFI4 architecture.

@singalsu
Copy link
Collaborator Author

singalsu commented Nov 30, 2022

With this PR, can we say we lower down volume module 20% performance? perviously, I downed 5%, so if this PR can save 10%, left 5-10% can be saved by more advanced HIFI4 architecture.

Maybe more safe to say this reduces by 14-16% the worst-case. Maybe 15% for single number. There's some trace print overhead included so it could be more.

Also I tried as quick test to change the linear ramp multiplication to shift as fast approximation of multiply but the saving was minimal. So I didn't include any intrinsic multiplications. What might work additionally could be if we could change for selected components like volume the xcc optimization to -O3 and enable HiFi instructions for C. Now the optimization is -O2 for entire SOF.

@lgirdwood
Copy link
Member

@singalsu looks like there are some build failures.

@singalsu singalsu force-pushed the simplify_volume_ramp branch 2 times, most recently from 5eea43d to 96f7557 Compare December 2, 2022 16:11
@lgirdwood
Copy link
Member

@singalsu stiil some build failure

[ 70%] Building C object CMakeFiles/sof.dir/src/audio/module_adapter/module/volume/volume.c.o
[ 70%] Building C object CMakeFiles/sof.dir/src/audio/module_adapter/module/volume/volume.c.o
staging/sof-tplg: directory
staging/sof-tplg
|-- sof-adl-sdw-max98373-rt5682.tplg
|-- sof-adl-max98360a-nau8825.tplg
|-- sof-adl-max98360a-rt5682-4ch.tplg
|-- sof-adl-max98360a-rt5682.tplg
|-- sof-adl-rt1015-nau8825.tplg
|-- sof-adl-max98390-rt5682.tplg
|-- sof-tgl-sdw-max98373-rt5682.tplg
|-- sof-tgl-max98357a-rt5682-pdm1-drceq.tplg
|-- sof-adl-max98360a-rt5682-2way.tplg
├── ...
..
mkdir -p staging/tools
cd /home/sof/work/sof.git/installer/../installer-builds/build_tools && \
  cp -p ctl/sof-ctl  logger/sof-logger  probes/sof-probes /home/sof/work/sof.git/installer/staging/tools
[ 70%] Building C object CMakeFiles/sof.dir/src/audio/copier/copier.c.o
[ 70%] Building C object CMakeFiles/sof.dir/src/audio/copier/copier.c.o
/home/sof/work/sof.git/src/audio/module_adapter/module/volume/volume.c: In function 'volume_ramp':
/home/sof/work/sof.git/src/audio/module_adapter/module/volume/volume.c:267:10: error: variable 'ramp_time' set but not used [-Werror=unused-but-set-variable]
  267 |  int32_t ramp_time;
      |          ^~~~~~~~~
/home/sof/work/sof.git/src/audio/module_adapter/module/volume/volume.c: In function 'volume_ramp':
/home/sof/work/sof.git/src/audio/module_adapter/module/volume/volume.c:267:10: error: variable 'ramp_time' set but not used [-Werror=unused-but-set-variable]
  267 |  int32_t ramp_time;
      |          ^~~~~~~~~
cc1: all warnings being treated as errors
make[4]: *** [CMakeFiles/sof.dir/build.make:941: CMakeFiles/sof.dir/src/audio/module_adapter/module/volume/volume.c.o] Error 1
make[4]: *** Waiting for unfinished jobs....
cc1: all warnings being treated as errors
make[4]: *** [CMakeFiles/sof.dir/build.make:941: CMakeFiles/sof.dir/src/audio/module_adapter/module/volume/volume.c.o] Error 1
make[4]: *** Waiting for unfinished jobs....
Scanning dependencies of target bootloader_dump

@lgirdwood
Copy link
Member

@singalsu can you check CI, thanks !

@lgirdwood
Copy link
Member

@singalsu still a failure - best to ping @wszypelt or @lrudyX

@singalsu
Copy link
Collaborator Author

@singalsu still a failure - best to ping @wszypelt or @lrudyX

Thanks, yes there's a gain test failure. I'll find out what exactly fails.

@singalsu singalsu changed the title Audio: Simplify volume ramp functions [Don't merge] Audio: Simplify volume ramp functions Jan 26, 2023
Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, lets juts add the inline comment and we are good.

*/
static void volume_ramp(struct processing_module *mod)
static inline void volume_ramp(struct vol_data *cd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so we dont have time to figure out why we get the XCC speed improvement with inline here, lets just mark this with an inline comment so that its obvious.

This patch saves about 15% of processing time or about 0.9 MCPS
during volume ramp.

The volume_ramp() functions are inlined to volume_process() function.
The other use of function in idle is replaced by volume_ramp_check().

In ramp calculation the division in ramp time calculation is
converted to faster multiplication with inverse of sample rate.

The unnecessary checks for vol greater than vol_max or vol_min are
eliminated by ensuring that tvolume[] is limited to vol_min and vol_max.

The clear of ramp_coef[] at ramp finish is unnecessary. It's
internal for linear ramp and is not used by the other ramp type.

The other optimization is to pass to ramp math functions struct
vol_data pointer instead of processing_module pointer to direct
access to needed state variables.

Signed-off-by: Seppo Ingalsuo <[email protected]>
@singalsu
Copy link
Collaborator Author

singalsu commented Mar 7, 2023

LGTM, lets juts add the inline comment and we are good.

I just added the comment. Let's see if it passes now the CI.

@btian1
Copy link
Contributor

btian1 commented Mar 8, 2023

CI still failed, and title still have don't merge, please change title accordingly.

@singalsu singalsu changed the title [Don't merge] Audio: Simplify volume ramp functions Audio: Simplify volume ramp functions Mar 8, 2023
@singalsu
Copy link
Collaborator Author

singalsu commented Mar 8, 2023

CI still failed, and title still have don't merge, please change title accordingly.

The fail with TGLU_RVP_NOCODEC_IPC4ZPH seems to be a check-suspend-resume failure. I doubt it's related. The Internal Intel CI check that earlier failed has passed now.

@lgirdwood
Copy link
Member

@andrula-song @kv2019i pls review

@btian1
Copy link
Contributor

btian1 commented Mar 9, 2023

yes, same failure on my PR as well.

@kv2019i
Copy link
Collaborator

kv2019i commented Mar 9, 2023

One system PM failure in https://sof-ci.01.org/sofpr/PR6636/build4370/devicetest/index.html , not related to volume ramp. Proceeding with merge.

@kv2019i kv2019i merged commit 84ecf46 into thesofproject:main Mar 9, 2023
@singalsu singalsu deleted the simplify_volume_ramp branch March 9, 2023 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants