model_utils.yield_stdev is very slow #315

Closed
lhenkelm opened this issue Jan 27, 2022 · 16 comments · Fixed by #316
Labels
performance Related to performance

Comments

@lhenkelm (Contributor)

First, thank you for providing and maintaining this nice package! I get a lot of utility out of it, in particular the model_utils module.

I use cabinetry.model_utils.yield_stdev in version 0.4.0 to get post-fit uncertainties for plots, and when I profiled my fitting code I noticed that most of the time is actually spent preparing the plots after the fits are done.
Specifically, the post-fit plotting script calls yield_stdev 8 times, and since every call takes roughly 17s, the script runs for about 160s in total, most of which is spent in yield_stdev:

[profiler flame graph: the x-axis is cumulative time spent in a function and in all functions it calls in turn, which are placed lower on the y-axis]

The functions taking up all the time down-stack in cabinetry.model_utils.yield_stdev are a mix of numpy code and the numpy hooks provided by awkward-array.
These are individually quite fast, e.g. __array_ufunc__ calls average about 6 ms each,
but they are called very many times from yield_stdev (202 512 times in the case of __array_ufunc__).

From a first glance at the implementation it seems the nested loop here
is a likely candidate for the bottleneck: my model has O(100) parameters, and most of them are at least a little bit correlated,
so the number of non-zero off-diagonal elements in the correlation matrix is about 12k.

Is it feasible to optimize yield_stdev? E.g. to replace the nested loop over 6k parameter pairs with an equivalent operation over larger arrays, so that the looping is left to numpy?
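
To illustrate the kind of replacement I have in mind, here is a minimal, self-contained sketch with made-up shapes and stand-in arrays (not cabinetry's actual code): the explicit pair loop and a single einsum contraction give the same off-diagonal contribution.

```python
import numpy as np

n_par, n_bins = 100, 50
rng = np.random.default_rng(0)
sym_unc = rng.normal(size=(n_par, n_bins))         # stand-in per-parameter yield variations
corr = np.corrcoef(rng.normal(size=(n_par, 200)))  # stand-in correlation matrix

# nested-loop version (schematically what the current implementation does)
total_var_loop = np.zeros(n_bins)
for i in range(n_par):
    for j in range(n_par):
        if i == j:
            continue
        total_var_loop += corr[i, j] * sym_unc[i] * sym_unc[j]

# vectorized version: zero out the diagonal and contract both parameter axes at once
off_diag_corr = corr - np.diag(np.diag(corr))
total_var_vec = np.einsum("ij,ib,jb->b", off_diag_corr, sym_unc, sym_unc)

assert np.allclose(total_var_loop, total_var_vec)
```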

@alexander-held added the performance label Jan 27, 2022
@alexander-held (Member)

Hi @lhenkelm, thanks for investigating and bringing this up! I have a few thoughts below.

The implemented method uses linear error propagation, but this is not the only possibility. Especially for models with many parameters, a sampling-based approach may be more suitable, as briefly described in #221. We could implement that, and it may be significantly faster for the setup you are describing.
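
As a rough sketch of that idea (illustration only, not an actual implementation; the helper name and the assumption of the numpy backend are mine): draw parameter points from the post-fit multivariate Gaussian and take the spread of the corresponding model predictions.

```python
import numpy as np

def sampled_yield_stdev(model, best_fit, covariance, n_samples=5000, seed=0):
    """Sketch: yield uncertainty from sampling the post-fit parameter distribution."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(best_fit, covariance, size=n_samples)
    # evaluate the model prediction for every sampled parameter point
    yields = np.asarray(
        [model.expected_data(pars, include_auxdata=False) for pars in samples]
    )
    # spread across samples, per bin
    return np.std(yields, axis=0, ddof=1)
```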

The usage of awkward-array in this code is strictly speaking not needed; I added it to retain the channel split, which simplifies debugging. I assumed this would not introduce any relevant overhead, but we could try removing it in case the overhead does matter.

What does your post-fit plotting script look like exactly? Results from yield_stdev should be cached, so if you call it multiple times for the same model and fit result, it should not repeat the expensive calculation. Are you plotting different models / fit results?

From a first glance at the implementation it seems the nested loop here is a likely candidate for the bottleneck

How does the performance change when removing this loop completely? I would have expected that the pyhf calls to expected_data also end up contributing significantly, and those could probably be batched if they are a bottleneck.

Is it feasible to optimize yield_stdev? E.g. to replace the nested loop over 6k parameter pairs with an equivalent operation over larger arrays, so that the looping is left to numpy?

I cannot spontaneously think of a way to get rid of the loop completely, but it should be possible to pre-calculate sym_unc_i for all i (same for j). As a first step, sym_unc_i can also be moved outside of the j loop, which may already have a measurable impact. As indicated by the comment in the code, it is also possible in principle to skip cases where the correlation is below some threshold. This is not currently implemented, as it seemed a bit dangerous to me (two very large effects with small correlation could matter more than small effects with large correlation), but a conservative setting could also speed things up considerably.

@lhenkelm (Contributor, Author)

Hi @alexander-held,

The implemented method uses linear error propagation, but this is not the only possibility. Especially for models with many parameters, a sampling-based approach may be more suitable, as briefly described in #221. We could implement that, and it may be significantly faster for the setup you are describing.

The alternatives look really interesting. I'll have a go at swapping them in for yield_stdev to see how the timings compare in my case.

The usage of awkward-array in this code is strictly speaking not needed; I added it to retain the channel split, which simplifies debugging. I assumed this would not introduce any relevant overhead, but we could try removing it in case the overhead does matter.

I am not sure that the awkward arrays are the issue. Since the number of in-interpreter calls to the awkward array methods is so large, it could be that an implementation with pure numpy arrays is still similarly slow.

What does your post-fit plotting script look like exactly? Results from yield_stdev should be cached, so if you call it multiple times for the same model and fit result, it should not repeat the expensive calculation. Are you plotting different models / fit results?

Oh, that is a very good point, I missed that this is already cached! The 8 calls are all for the same model and data.
However, the model instance is newly constructed each time (in the same way, from the same workspace),
so the cached result is never found, since the hashes do not compare equal:

>>> pdf
<pyhf.pdf.Model object at 0x7f6e8f31c280>
>>> other
<pyhf.pdf.Model object at 0x7f6e8f336e50>
>>> pdf.spec == other.spec
True
>>> pdf == other
False
>>> hash(pdf) == hash(other)
False
>>> pdf in {other : 1}
False

(the code calling yield_stdev is here)
So if the caching logic can be adapted to recognize these as equivalent, it would already give a roughly x8 speedup.
Is it safe to cache based on pyhf.pdf.Model.spec rather than the full object?
Otherwise pyhf.pdf.Model would need to define a specialized __hash__ (and __eq__, for consistency).

From a first glance at the implementation it seems the nested loop here is a likely candidate for the bottleneck

How does the performance change when removing this loop completely? I would have expected that the pyhf calls to expected_data also end up contributing significantly, and those could probably be batched if they are a bottleneck.

I'll do the experiment of skipping the loop and let you know. Regarding the expected_data calls, they are at about 400 µs each. That still accumulates to ~12 s over the whole script, due to migrad and hesse calling it so often in the fit.

Is it feasible to optimize yield_stdev? E.g. to replace the nested loop over 6k parameter pairs with an equivalent operation over larger arrays, so that the looping is left to numpy?

I cannot spontaneously think of a way to get rid of the loop completely, but it should be possible to pre-calculate sym_unc_i for all i (same for j). As a first step, sym_unc_i can also be moved outside of the j loop, which may already have a measurable impact. As indicated by the comment in the code, it is also possible in principle to skip cases where the correlation is below some threshold. This is not currently implemented, as it seemed a bit dangerous to me (two very large effects with small correlation could matter more than small effects with large correlation), but a conservative setting could also speed things up considerably.

I agree that skipping correlations requires careful thought. But I can see how far I get with pre-computing/masking and so on.

Thanks a lot for all of the suggestions! I will see how they work out and let you know.

@alexander-held (Member) commented Jan 28, 2022

Is it safe to cache based on pyhf.pdf.Model.spec rather than the full object?

I think this is almost safe; the only thing I can think of are the interpolation codes, which are not contained in the specification. I think we could cache with a combination of spec + these codes. Thanks a lot for bringing this up, I had not thought of the complication that arises when the model is re-built.
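
Something along these lines could work as a cache key (just a sketch; that the interpolation codes are reachable via model.config.modifier_settings is an assumption, and the final implementation may look different):

```python
import json

def _model_key(model):
    # sketch of a cache key based on the model specification plus the
    # interpolation codes (assumption: exposed via model.config.modifier_settings)
    spec_str = json.dumps(model.spec, sort_keys=True)
    settings_str = json.dumps(model.config.modifier_settings, sort_keys=True)
    return (spec_str, settings_str)
```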

I am benchmarking another setup I have, and pre-calculating the data for sym_unc_i and sym_unc_j once outside the loop brings the off-diagonal contribution from 184 seconds down to 80 seconds. Together with the factor 8 from caching, that would presumably lead to a speed-up of roughly a factor 20 for your setup. Moving just sym_unc_i outside the loop takes 136 seconds for me, so these calculations seem to dominate. The diagonal contribution is less than a second, and it takes 4 seconds to set up all yields initially.

I am going to try whether I can optimize it further, but it seems in any case there is a lot of performance to be gained here.

edit: It should be possible to do this via matrix multiplication, it's just a matter of setting it up correctly with numpy and awkward. I am investigating.

@lhenkelm (Contributor, Author)

Thank you for the update! I had a go at re-running the profiling with some of the above suggestions monkey-patched in to replace yield_stdev:

| patch | cumulative time | speedup rel. to 0.4.0 |
| --- | --- | --- |
| skipping the off-diagonal loop | 3-4s | x40 |
| caching based on spec and interpcodes | 35s | x4 |
| random sampling (50k toys) | 184s | - |
| iminuit jacobian | 7s | x20 |

I put the exact implementations I used in this gist.
My comments in no particular order:

  • the random sampling and iminuit jacobian versions are overly optimistic since I did not include any code to compute the uncertainty on the total channel yield. A proper implementation of these would necessarily be a little slower due to this.
  • making the caching treat models more like a value type gives a 4x improvement, which is great. The reason it is not x8 is that I overlooked (sorry!) that half of the 8 calls in my use case are for pre-fit plots and the other half for post-fit, so two un-cached calls are necessary, not one. Still, the changed caching would be great to have IMO.
  • simply skipping the correlation loop gives an enormous x40 speed increase, confirming your conclusion.
  • the iminuit propagate function is really fast in my case (I would guess in the ballpark of what optimizing the correlation loop could achieve?).
  • the random sampling was very slow for me, even slower than the 0.4.0 implementation. This seems to be due to the now very large number of calls (about 400k) to pyhf's expected_data. If there were a way to batch these, that might be different though. Also, simpler models with fewer expensive modifiers (i.e. no/few interpolators or similar) could maybe gain much more from a sampling approach.

I am going to hold off on trying my hand at optimizing the loop for now, since you are already investigating that.
But let me know if there is any info I could provide that would help!

@alexander-held (Member)

See #316 for an update: thanks to help from @agoose77, the off-diagonal calculation can be fully vectorized. In my setup this results now in a runtime of 0.4 s for that part (compared to ~180 s previously). The bottleneck is now actually the initial setup of all yields for all variations. There is probably some further room for optimization there.

the iminuit propagate function is really fast in my case (I would guess in the ballpark of what optimizing the correlation loop could achieve?).

I had noticed this as well and need to compare more closely. cabinetry approximates the derivative just from the slope of the up and down variations, while iminuit uses a proper finite-difference method, I think. I would expect iminuit to be slightly slower for that reason, assuming an optimized cabinetry implementation.
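
Schematically, the slope-based propagation is something like the following (a sketch with a hypothetical helper and input shapes, not cabinetry's actual code):

```python
import numpy as np

def propagate_linear(yields_up, yields_down, uncertainty, correlation):
    """Sketch: linear propagation using the slope of the up/down variations.

    yields_up / yields_down have shape (n_par, n_bins): model yields with
    parameter i shifted up/down by its post-fit uncertainty.
    """
    jacobian = (yields_up - yields_down) / (2.0 * uncertainty[:, np.newaxis])  # dy_b / dp_i
    covariance = correlation * np.outer(uncertainty, uncertainty)              # parameter covariance
    yield_covariance = jacobian.T @ covariance @ jacobian                      # J^T Sigma J
    return np.sqrt(np.diag(yield_covariance))                                  # per-bin standard deviation
```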

If there were a way to batch these

There is, but I believe it would involve re-creating the model (scikit-hep/pyhf#545). That could probably be done and I would expect that to drastically speed up this approach.

@lhenkelm (Contributor, Author) commented Feb 1, 2022

See #316 for an update: thanks to help from @agoose77, the off-diagonal calculation can be fully vectorized. In my setup this results now in a runtime of 0.4 s for that part (compared to ~180 s previously).

I am very glad to see this nice optimization. Thank you @agoose77 and @alexander-held!

If there were a way to batch these

There is, but I believe it would involve re-creating the model (scikit-hep/pyhf#545). That could probably be done and I would expect that to drastically speed up this approach.

I now tried a batched version of random sampling. It took 118s for a use case that would take 137s with 0.4.0. So it is faster, but not by as much as I thought. But I also found that as I increase the number of toys, the interpolator call in normsys takes more and more time (more so than the rest of the expected_data call stack). So it seems that random sampling is at the moment limited by the scaling of the pyhf modifier/interpolator implementation. (Or maybe I missed something.)
The batched random sampling implementation I tried is here.
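
In rough outline, the batched version does something like this (a simplified sketch, not the exact gist code; in particular, that the batched expected_data accepts the sampled parameter array in exactly this form is an assumption on my part):

```python
import numpy as np
import pyhf

def batched_sampled_stdev(spec, best_fit, covariance, n_samples=25_000, seed=0):
    """Sketch: sampling-based yield uncertainty with a batched pyhf model."""
    # re-create the model with a batch dimension so that a single expected_data
    # call evaluates all sampled parameter points at once (cf. scikit-hep/pyhf#545)
    batched_model = pyhf.Model(spec, batch_size=n_samples)
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(best_fit, covariance, size=n_samples)
    yields = np.asarray(batched_model.expected_data(samples, include_auxdata=False))
    return np.std(yields, axis=0, ddof=1)
```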

@lhenkelm (Contributor, Author) commented Feb 3, 2022

The bottleneck is now actually the initial setup of all yields for all variations. There is probably some further room for optimization there.

Maybe you already looked into it @alexander-held (if so, sorry for the noise), but I had a little bit of time and I tried out also vectorizing the initial setup of the yield variations (including batching the pyhf Model evaluation).
It does give some more improvement: my example went from 0.15s per call when using the perf/yield-uncertainty-vectorization version to 0.1s per call. That could be halved if the input model is already set up with the right batch size, so the function does not need to re-create it. (Though I think in most cases one might not want to bother)
Here is the implementation: gist.

BTW, how do you feel about caching based on spec + interpcodes instead of the model directly? I'd be happy to submit a PR.

@alexander-held (Member)

I now tried a batched version of random sampling. It took 118s

How many samples are you considering for this? I do not have a good feeling for how many are needed, but thought that one may get away with a fairly small number.

it seems that random sampling is at the moment limited by the scaling of the pyhf modifier/interpolator implementation

Is the bottleneck within pyhf itself, or further down in the backends doing the actual calculation? If there is a lot of overhead from pyhf, we should follow that up.

I tried out also vectorizing the initial setup of the yield variations
It does give some more improvement: my example went from 0.15s per call when using the perf/yield-uncertainty-vectorization version to 0.1s per call

Nice! We can try this for a few other setups, but it might very well be that even with model re-creation this is worth it in general as an additional optimization.

BTW, how do you feel about caching based on spec + interpcodes instead of the model directly? I'd be happy to submit a PR.

Yes please, a PR would be very welcome! I still need to understand the difference I saw after the refactor in #316, but we can include other improvements in separate PRs. The change to spec + interpcode makes perfect sense to me, I don't see a reason to stick with the current model-based caching.

@lhenkelm (Contributor, Author) commented Feb 3, 2022

How many samples are you considering for this? I do not have a good feeling for how many are needed, but thought that one may get away with a fairly small number.

For the timings with random sampling above I used 50k samples. When batching, I split that into 2 batches of 25k samples each, to stay within RAM.
It may be that a smaller number of samples is fine. I imagine the smallest usable number may vary depending on the model and the structure of the correlations.

Is the bottleneck within pyhf itself, or further down in the backends doing the actual calculation? If there is a lot of overhead from pyhf, we should follow that up.

That is a good question. I only tried numpy here, and I never looked into how normsys is implemented, so I don't have an answer.
All of these functions are called from several different places, so it would take some work to disambiguate what is happening.
It would probably be better to just profile some pyhf.Model.expected_data calls for a variety of models/modifiers to understand this.
But, at least for my purposes, the main thing is to get the linear propagation method to a usable speed when iterating through versions of fits and plots, and #316 already does that.

@alexander-held (Member) commented Feb 8, 2022

I found out why the tests in #316 failed for the per-channel uncertainties: there is a bug in the old implementation. The bug has been around since the original implementation in #189 as far as I can tell.

  if (
      labels[i_par][0:10] == "staterror_"
      and labels[j_par][0:10] == "staterror_"
  ):
      continue  # two different staterrors are orthogonal, no contribution

This was a performance optimization, introduced before per-channel uncertainties were calculated. The argument that the cross-term vanishes for two staterror modifiers holds when calculating per-bin uncertainties: two different staterror parameters act on disjoint bins, so their product vanishes in every bin. It is no longer true when calculating per-channel uncertainties, since the yields are first summed over the bins of a channel and the cross-term of the summed effects is generally non-zero.
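
A tiny numeric illustration with hypothetical numbers: two staterror-like parameters acting on different bins of the same two-bin channel, with a small correlation between them.

```python
import numpy as np

u1 = np.array([5.0, 0.0])  # yield effect of parameter 1 (only bin 1)
u2 = np.array([0.0, 3.0])  # yield effect of parameter 2 (only bin 2)
rho = 0.2                  # small correlation between the two parameters

per_bin_cross_term = rho * u1 * u2                   # array([0., 0.]): skipping is fine per bin
per_channel_cross_term = rho * u1.sum() * u2.sum()   # 3.0: does not vanish for the channel sum
```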

I will implement a fix for this first for clarity (tracked in #323), and then replace it all with the new implementation (which matches the old results after this fix).

I expect that the impact of this is generally small, since typically staterror modifiers do not have a huge impact and are not strongly correlated.

@alexander-held (Member)

I had a little bit of time and I tried out also vectorizing the initial setup of the yield variations (including batching the pyhf Model evaluation).
It does give some more improvement: my example went from 0.15s per call when using the perf/yield-uncertainty-vectorization version to 0.1s per call. That could be halved if the input model is already set up with the right batch size, so the function does not need to re-create it. (Though I think in most cases one might not want to bother)
Here is the implementation: gist.

I'd propose to close this issue via #316 and track this part in a new issue as an additional improvement. I can open a new issue for that once #316 is done. I quite like the solution

  pars_square = np.tile(parameters.copy().astype(float), (model.config.npars, 1))

  up_variations_ak   = _prepare_yield_variations(model, pars_square + np.diagflat(uncertainty))
  down_variations_ak = _prepare_yield_variations(model, pars_square - np.diagflat(uncertainty))

for parameter variations. All the vectorization (also #316) can potentially make the code quite a bit harder to read, and I think the comments you have in the gist are very helpful. I'll try to incorporate this information into #316 as well.

@lhenkelm (Contributor, Author) commented Feb 9, 2022

I'd propose to close this issue via #316 and track this part in a new issue as an additional improvement. I can open a new issue for that once #316 is done.

I like that plan. Once #316 is in, the title of this issue will no longer be so appropriate, after all.

I quite like the solution

  pars_square = np.tile(parameters.copy().astype(float), (model.config.npars, 1))

  up_variations_ak   = _prepare_yield_variations(model, pars_square + np.diagflat(uncertainty))
  down_variations_ak = _prepare_yield_variations(model, pars_square - np.diagflat(uncertainty))

for parameter variations. All the vectorization (also #316) can potentially make the code quite a bit harder to read, and I think the comments you have in the gist are very helpful. I'll try to incorporate this information into #316 as well.

Thanks! I am glad you can see some use for it :)

@agoose77 (Contributor) commented Feb 9, 2022

  pars_square = np.tile(parameters.copy().astype(float), (model.config.npars, 1))

  up_variations_ak   = _prepare_yield_variations(model, pars_square + np.diagflat(uncertainty))
  down_variations_ak = _prepare_yield_variations(model, pars_square - np.diagflat(uncertainty))

I think you can drop the parameters.copy here, because np.tile will make a copy unless you have a 0-dim array?

@lhenkelm (Contributor, Author) commented Feb 9, 2022

  pars_square = np.tile(parameters.copy().astype(float), (model.config.npars, 1))

  up_variations_ak   = _prepare_yield_variations(model, pars_square + np.diagflat(uncertainty))
  down_variations_ak = _prepare_yield_variations(model, pars_square - np.diagflat(uncertainty))

I think you can drop the parameters.copy here, because np.tile will make a copy unless you have a 0-dim array?

You're probably right about np.tile (is this promised somewhere in the numpy docs, or is it just an effect of the implementation?). np.ndarray.astype also copies, and in any reasonably recent version this can be made explicit to readers with parameters.astype(float, copy=True), so the parameters.copy is redundant in either case.
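
A quick check of the copy behavior (illustration only, not a guarantee from the documentation):

```python
import numpy as np

parameters = np.arange(5.0)
print(np.shares_memory(parameters, np.tile(parameters, (3, 1))))  # False: tile returns a new array here
print(np.shares_memory(parameters, parameters.astype(float)))     # False: astype copies by default
```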

@alexander-held (Member)

The batching of model calls for yield variations can be tracked in #325.

@alexander-held (Member) commented May 31, 2023

#408 adds another significant gain in performance (and decrease in memory use) by switching from using awkward to pure numpy. New updates will be tracked in #409.
