
Model hashing #1762

Open · 1 task done
alexander-held opened this issue Feb 4, 2022 · 5 comments

Labels
feat/enhancement New feature or request · needs-triage Needs a maintainer to categorize and assign

Comments

@alexander-held (Member)

Summary

It could be convenient to be able to hash a pyhf.Model so that models can be compared to each other. For an example, see @lhenkelm's comment scikit-hep/cabinetry#315 (comment) and scikit-hep/cabinetry#322, where this is used in a model-specific cache.

Additional Information

As far as I am aware, the model specification + interpcodes is all that is required to uniquely identify a model. If there is additional information, it would be great to know!
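
For illustration, something like the following is the kind of usage I have in mind. This is only a rough sketch: the cache dict and the cached_compute helper are made up here, and it hashes the model spec only, which is exactly the interpcodes question above.

```python
import pyhf
from pyhf import utils

# hypothetical sketch: cache expensive per-model results keyed by a model-level
# hash, here built from the model spec only (interpcodes are not included)
_cache = {}


def cached_compute(model: pyhf.Model, compute):
    key = utils.digest(model.spec)
    if key not in _cache:
        _cache[key] = compute(model)
    return _cache[key]
```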

Code of Conduct

  • I agree to follow the Code of Conduct
@kratsg (Contributor) commented Feb 4, 2022

The measurement definition is what uniquely identifies that model, FYI: a workspace is unique (and has a hash via pyhf digest), and the combination of the channel specification + a measurement is what makes the model unique.

Specifically,

pyhf digest workspace.json

should always give you a hash that uniquely identifies that workspace. The problem is that two workspaces could be practically identical but still hash differently due to things like floating-point differences... (I think sorting is done by default)
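
For reference, doing the same from Python would look roughly like this (a sketch; it assumes utils.digest accepts any JSON-serializable object and hashes its key-sorted serialization):

```python
import json

from pyhf import utils

# load the workspace JSON and hash it, mirroring `pyhf digest workspace.json`
with open("workspace.json") as spec_file:
    workspace_spec = json.load(spec_file)

# sha256 hex digest of the (key-sorted) JSON serialization of the workspace
print(utils.digest(workspace_spec, algorithm="sha256"))
```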

@alexander-held (Member, Author) commented Feb 4, 2022

The pyhf digest approach does not include interpcodes, right?

The reason for only hashing the model is that in this use case (yield uncertainty calculation) the measurement does not matter: two models are "equal" if they provide the same expected_data for the same input parameters. Whether those parameters come from a measurement or are made up does not matter for the calculation. Does the pyhf digest have a model contribution that could be reused?

edit: Yes, looks like utils.digest could be used for the model. That would leave the interpcodes out though I think.
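
Something along these lines is what I mean (a sketch; assumes pyhf >= 0.6.3 for simplemodels.uncorrelated_background, with illustrative yields):

```python
import pyhf
from pyhf import utils

# build a model and hash only its specification; the interpolation codes
# configured at model construction time do not enter this digest
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]
)
spec_digest = utils.digest(model.spec)
print(spec_digest)
```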

@kratsg (Contributor) commented Feb 4, 2022

two models are "equal" if they provide the same expected_data for the same input parameters.

this isn't true, because a measurement can contain overrides, different values for lumi, different bounds, etc. Arguably though, I think lumi is the only special one that is set through a measurement object which can change expected data; the rest is pretty fixed.

edit: Yes, looks like utils.digest could be used for the model. That would leave the interpcodes out though I think.

yeah, this was written to be somewhat generic, and yes, the interpcodes are left out. We need to incorporate the part of the HiFa specification that has interpcodes, so that they're treated as first-class citizens rather than as an implementation configuration (which is how they're treated right now).
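
In the meantime, a possible workaround is to hash the channel specification together with the modifier settings passed at model construction. A sketch with an illustrative spec and the code4/code4p interpcodes (assumes utils.digest accepts any JSON-serializable object):

```python
from pyhf import Model, utils

# minimal channel specification with a normsys modifier (illustrative numbers)
spec = {
    "channels": [
        {
            "name": "ch",
            "samples": [
                {
                    "name": "signal",
                    "data": [5.0],
                    "modifiers": [
                        {"name": "mu", "type": "normfactor", "data": None},
                        {"name": "sys", "type": "normsys", "data": {"hi": 1.1, "lo": 0.9}},
                    ],
                }
            ],
        }
    ]
}

# interpolation codes are currently an implementation setting passed at construction
modifier_settings = {
    "normsys": {"interpcode": "code4"},
    "histosys": {"interpcode": "code4p"},
}
model = Model(spec, modifier_settings=modifier_settings)

# combine both ingredients into a single digest
model_digest = utils.digest({"spec": model.spec, "interpcodes": modifier_settings})
```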

@alexander-held (Member, Author)

I think lumi is the only special one that is set through a measurement object which can change expected data

I gave this a try and did not manage to create a setup where the lumi config has an impact on expected_data. Starting with the model in #1516, I first expected that auxdata might matter. My thought was that with auxdata=[2] and a parameter value of 2.2, samples would be scaled by 2.2/2 = 1.1. As far as I can tell, though, this is not the case: samples are always scaled by the parameter value directly, and the setup of the constraint term is irrelevant to expected_data.

I now think this makes sense: when samples are not scaled to luminosity and the auxdata is set to the luminosity instead, the best-fit result for the lumi modifier will be close to the luminosity and will scale the samples as desired (so not dividing by the auxdata is a feature, allowing the use of samples that have not already been scaled to luminosity). When just evaluating expected_data, the constraint term setup is irrelevant.
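
For reference, a minimal version of the check I tried (illustrative numbers rather than the exact model from #1516; assumes a recent pyhf where poi_name=None is accepted and the lumi constraint is configured via the parameters block of the spec):

```python
import pyhf

# one sample with a lumi modifier; auxdata and sigma for the constraint term
# are set via the parameter configuration (illustrative values)
spec = {
    "channels": [
        {
            "name": "ch",
            "samples": [
                {
                    "name": "sample",
                    "data": [10.0],
                    "modifiers": [{"name": "lumi", "type": "lumi", "data": None}],
                }
            ],
        }
    ],
    "parameters": [
        {
            "name": "lumi",
            "auxdata": [2.0],
            "sigmas": [0.1],
            "bounds": [[0.0, 5.0]],
            "inits": [2.0],
        }
    ],
}
model = pyhf.Model(spec, poi_name=None)

# the sample is scaled by the parameter value directly (10 * 2.2 = 22), not by
# parameter / auxdata, and the constraint term does not enter expected_data
print(model.expected_data([2.2], include_auxdata=False))
```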

Is there any other way in which the lumi configuration matters?

@alexander-held (Member, Author)

When comparing models and checking for equality via model spec + interpcodes, one thing that will of course be missed is measurement-config-related information like model.config.suggested_init(). Despite these things, which are relevant for fits, the "core" pieces (the pdf and the predicted distributions via logpdf and expected_data) are always going to be the same if spec + interpcodes match, I think.

I think there is a conceptual distinction between aspects of the model itself (pdf / predicted distributions) and information regarding the use of the model (measurement). They all live in pyhf.Model, but they matter for different tasks. The measurement information can also be overridden in fits, which is another difference: the model only provides suggestions (initial values, bounds) and data/auxdata, all of which can be changed easily without building a new model.
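
As an illustration of that last point, overriding the suggested settings per fit without rebuilding the model could look like this (a sketch with illustrative numbers):

```python
import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]
)
data = [53.0] + list(model.config.auxdata)

# the model only suggests these; they can be overridden for each fit
init_pars = model.config.suggested_init()
par_bounds = model.config.suggested_bounds()

# e.g. widen the POI bounds without touching the model itself
par_bounds[model.config.poi_index] = [0.0, 20.0]

best_fit = pyhf.infer.mle.fit(data, model, init_pars=init_pars, par_bounds=par_bounds)
```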
