Commit
Added reader and help functions to convert scikit-hep histogram to Tables
Showing 3 changed files with 641 additions and 0 deletions.
@@ -0,0 +1,374 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reading histograms\n",
"\n",
"For the newer Python-based frameworks, another common task is to translate a\n",
"histogram in the [`scikit-hep.hist`](https://hist.readthedocs.io/en/latest/)\n",
"package into the HEPData format. The functions in `hepdata_lib` will help\n",
"you with that, and this notebook demonstrates how.\n",
"\n",
"As explained in the [Getting started notebook](Getting_started.ipynb), a\n",
"`Submission` needs to exist or be created. Here, we'll just create one without\n",
"any additional information:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Welcome to JupyROOT 6.28/04\n"
]
}
],
"source": [
"from hepdata_lib import Submission\n",
"\n",
"submission = Submission()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common use case for `scikit-hep` histograms is to allow for after-the-fact\n",
"slicing and grouping of a main histogram. Let us first generate a mock\n",
"histogram of the kind that may appear in a typical analysis, together with a\n",
"common slicing routine (shown in the cells below)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Hist(\n",
"  StrCategory(['data', 'QCD', 'ttbar'], name='dataset'),\n",
"  IntCategory([-1, 0, 4, 5], name='flavor'),\n",
"  Regular(60, -3, 3, name='eta'),\n",
"  Regular(20, 0, 500, name='pt'),\n",
"  storage=Weight()) # Sum: WeightedSum(value=221221, variance=123802) (WeightedSum(value=256973, variance=143935) with flow)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import hist\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(seed=123_456_789)\n",
"\n",
"h = hist.Hist(\n",
"    hist.axis.StrCategory([\"data\", \"QCD\", \"ttbar\"], name=\"dataset\"),\n",
"    hist.axis.IntCategory([-1, 0, 4, 5], name=\"flavor\"),\n",
"    hist.axis.Regular(60, -3, +3, name=\"eta\"),\n",
"    hist.axis.Regular(20, 0, 500, name=\"pt\"),\n",
"    storage=hist.storage.Weight(),\n",
")\n",
"\n",
"h.fill( ## For mock data\n",
"    dataset=\"data\",\n",
"    flavor=-1,\n",
"    eta=rng.normal(0, 2.0, size=123_456),\n",
"    pt=rng.exponential(100, size=123_456),\n",
")\n",
"h.fill( ## For Mock QCD\n",
"    dataset=\"QCD\",\n",
"    flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.8, 0.15, 0.05]),\n",
"    eta=rng.normal(0.0, 2.0, size=1_000_000),\n",
"    pt=rng.exponential(100, size=1_000_000),\n",
"    weight=0.123456 * 2 * rng.random(size=1_000_000),\n",
")\n",
"h.fill( ## For mock ttbar\n",
"    dataset=\"ttbar\",\n",
"    flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.45, 0.1, 0.45]),\n",
"    eta=rng.normal(0.0, 1.5, size=1_000_000),\n",
"    pt=rng.exponential(200, size=1_000_000),\n",
"    weight=0.01 * 2 * rng.random(size=1_000_000),\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example of manual processing to a 1D array\n",
"\n",
"Let us create a simple slicing routine to get the various histograms of\n",
"interest, then use the most versatile function, the\n",
"`hepdata_lib.hist_utils.read_hist` method, to create arrays that are\n",
"compatible with variable creation. (In the slicing below, an imaginary literal\n",
"such as `1.4j` selects by axis coordinate rather than by bin index, and `sum`\n",
"sums over, and thereby removes, the selected axis or range.)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'pt': array([( 0., 25.), ( 25., 50.), ( 50., 75.), ( 75., 100.),\n",
" (100., 125.), (125., 150.), (150., 175.), (175., 200.),\n",
" (200., 225.), (225., 250.), (250., 275.), (275., 300.),\n",
" (300., 325.), (325., 350.), (350., 375.), (375., 400.),\n",
" (400., 425.), (425., 450.), (450., 475.), (475., 500.)],\n",
" dtype=[('f0', '<f4'), ('f1', '<f4')]), 'hist_value': array([27405., 21382., 16585., 12740., 10069., 7878., 6007., 4678.,\n",
" 3666., 2903., 2333., 1734., 1352., 1048., 851., 634.,\n",
" 485., 401., 294., 230.]), 'hist_variance': array([27405., 21382., 16585., 12740., 10069., 7878., 6007., 4678.,\n",
" 3666., 2903., 2333., 1734., 1352., 1048., 851., 634.,\n",
" 485., 401., 294., 230.])}\n"
]
}
],
"source": [
"from hepdata_lib.hist_utils import read_hist\n",
"\n",
"data_hist = h[dict(dataset='data', flavor=sum, eta=sum)]\n",
"fqcd_hist = h[dict(dataset='QCD', flavor=sum, eta=slice(1.4j,None,sum))] + h[dict(dataset='QCD', flavor=sum, eta=slice(None,-1.4j,sum))]\n",
"cqcd_hist = h[dict(dataset='QCD', flavor=sum, eta=slice(-1.4j, 1.4j,sum))]\n",
"tt_b_hist = h[dict(dataset='ttbar', flavor=4j, eta=sum)]\n",
"tt_l_hist = h[dict(dataset='ttbar', flavor=0j, eta=sum)]\n",
"\n",
"tab_data = read_hist(data_hist)\n",
"tab_fqcd = read_hist(fqcd_hist)\n",
"tab_cqcd = read_hist(cqcd_hist)\n",
"tab_tt_b = read_hist(tt_b_hist)\n",
"tab_tt_l = read_hist(tt_l_hist)\n",
"\n",
"print(tab_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each remaining axis will generate a corresponding array that can be used to\n",
"declare `Variable` instances. Notice that because the histogram was declared\n",
"with `storage=Weight()`, entries for `hist_value` (sum of weights) and\n",
"`hist_variance` (sum of squared weights) are presented to the user. This\n",
"information can then be used for uncertainty generation."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from hepdata_lib import Table, Variable, Uncertainty\n",
"import hist.intervals\n",
"\n",
"tab1d = Table(\"my table\")\n",
"\n",
"var = Variable(\"Jet pT\", is_independent=True, is_binned=True, units=\"GeV\")\n",
"var.values = tab_data[\"pt\"]\n",
"tab1d.add_variable(var)\n",
"\n",
"# Filling in the data entries\n",
"data_var = Variable(\n",
"    \"data\", is_independent=False, is_binned=False, units=\"Number of jets\"\n",
")\n",
"data_var.values = tab_data[\"hist_value\"]\n",
"tab1d.add_variable(data_var)\n",
"\n",
"# Fill an example MC entry with uncertainty\n",
"fqcd_var = Variable(\n",
"    \"QCD (MC) forward region\",\n",
"    is_independent=False,\n",
"    is_binned=False,\n",
"    units=\"Number of jets\",\n",
")\n",
"fqcd_var.values = tab_fqcd[\"hist_value\"]\n",
"fqcd_stat = Uncertainty(\"statistical uncertainty\", is_symmetric=False)\n",
"s, s2 = tab_fqcd[\"hist_value\"], tab_fqcd[\"hist_variance\"]\n",
"lo, up = hist.intervals.poisson_interval(s, s2)\n",
"lo, up = lo - s, up - s\n",
"fqcd_stat.values = zip(lo, up)\n",
"fqcd_var.add_uncertainty(fqcd_stat)\n",
"tab1d.add_variable(fqcd_var)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To improve the readability of uncertainty variable declarations, the `hist_utils`\n",
"module also provides a `hist_as_variable` method to easily declare variables.\n",
"Uncertainties should be provided as a dictionary, with the dictionary keys\n",
"being the desired names of the uncertainties, and the dictionary values defining\n",
"how each uncertainty is computed. A value can either be a string indicating how\n",
"the uncertainty should be calculated (`poisson_sym` or `poisson_asym`), a\n",
"floating point (pair) to indicate a flat uncertainty, a histogram (pair) of the\n",
"same format as the input histogram representing the uncertainty, or a numpy\n",
"array (pair) that is compatible with the final array output."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from hepdata_lib.hist_utils import hist_as_variable\n",
"\n",
"tt_b_var = hist_as_variable(\n",
"    \"ttbar b jets\",\n",
"    tt_b_hist,\n",
"    uncertainty={\n",
"        \"statistical\": \"poisson_sym\",\n",
"        \"flat symmetric uncertainty\": 0.1,\n",
"        \"flat asymmetric uncertainty\": (-0.02, +0.03),\n",
"        \"asymmetric uncertainty from histogram\": (\n",
"            tt_l_hist * (-0.02),\n",
"            tt_l_hist * 0.03,\n",
"        ),\n",
"    },\n",
")\n",
"\n",
"tab1d.add_variable(tt_b_var)\n"
]
},
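{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the numpy-array option mentioned above (the variable\n",
"names, the dictionary key, and the 5% scaling here are purely illustrative\n",
"and not part of the original example), an array with the same shape as the\n",
"`hist_value` output can be passed directly as a symmetric per-bin uncertainty:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: take 5% of each bin content as a symmetric per-bin uncertainty.\n",
"array_unc = 0.05 * read_hist(tt_b_hist)[\"hist_value\"]\n",
"\n",
"tt_b_var_array = hist_as_variable(\n",
"    \"ttbar b jets (array uncertainty)\",\n",
"    tt_b_hist,\n",
"    uncertainty={\"illustrative 5% uncertainty\": array_unc},\n",
")\n"
]
},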
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example using an N-dimensional histogram\n",
"\n",
"Notice that thanks to the flexibility of the scikit-hep histogram syntax, the\n",
"same approach can be used for histograms of arbitrary dimensions.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"tab2d = Table(\"my table 2d\")\n",
"\n",
"\n",
"data_hist = h[dict(dataset=\"data\", flavor=sum)]\n",
"qcd_hist = h[dict(dataset=\"QCD\", flavor=sum)]\n",
"tt_b_hist = h[dict(dataset=\"ttbar\", flavor=4j)]\n",
"tt_l_hist = h[dict(dataset=\"ttbar\", flavor=0j)]\n",
"\n",
"tab_data = read_hist(data_hist)\n",
"# tab_qcd = read_hist(qcd_hist)\n",
"# tab_tt_b = read_hist(tt_b_hist)\n",
"# tab_tt_l = read_hist(tt_l_hist)\n",
"\n",
"pt_var = Variable(\"Jet pT\", is_independent=True, is_binned=True, units=\"GeV\")\n",
"pt_var.values = tab_data[\"pt\"]\n",
"tab2d.add_variable(pt_var)\n",
"eta_var = Variable(\"Jet eta\", is_independent=True, is_binned=True)\n",
"eta_var.values = tab_data[\"eta\"]\n",
"tab2d.add_variable(eta_var)\n",
"\n",
"# Filling in the data entries\n",
"data_var = Variable(\n",
"    \"data\", is_independent=False, is_binned=False, units=\"Number of jets\"\n",
")\n",
"data_var.values = tab_data[\"hist_value\"]\n",
"tab2d.add_variable(data_var)\n",
"\n",
"## Fill QCD, just statistical uncertainty\n",
"qcd_var = hist_as_variable(\n",
"    \"QCD MC\", qcd_hist, uncertainty={\"statistical\": \"poisson_asym\"}\n",
")\n",
"tab2d.add_variable(qcd_var)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To further simplify the construction of tables from N-dimensional histograms, we\n",
"provide a `create_hist_base_table` function such that the axis variables are\n",
"automatically set up.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from hepdata_lib.hist_utils import create_hist_base_table\n",
"\n",
"tab2d_v2 = create_hist_base_table(\n",
"    \"my simplified 2d table\",\n",
"    data_hist,\n",
"    axes_rename={\"pt\": \"Jet pT\", \"eta\": \"Jet eta\"},\n",
"    axes_units={\"pt\": \"GeV\"},\n",
")\n",
"\n",
"tab2d_v2.add_variable(hist_as_variable(\"Data\", data_hist))\n",
"tab2d_v2.add_variable(\n",
"    hist_as_variable(\n",
"        \"ttbar l jets\", tt_l_hist, uncertainty={\"statistical\": \"poisson_sym\"}\n",
"    )\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outputting the submission\n",
"\n",
"Finally, we can add the tables to the submission and create the files. Please\n",
"refer to the [Getting started notebook](Getting_started.ipynb) for a complete\n",
"example."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"submission.add_table(tab1d)\n",
"submission.add_table(tab2d)\n",
"submission.add_table(tab2d_v2)\n",
"submission.create_files(\"example_output_hist\", remove_old=True)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "hepdata_lib development environment",
"language": "python",
"name": "hepdata_env"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}