Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converting from scikit-hep histogram to Table #243

Merged
merged 12 commits into from
Nov 15, 2023
3 changes: 3 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,9 @@ There are a few more examples available that can directly be run using the [bind
- [Reading TGraph and TGraphError from '.C' files](https://github.com/HEPData/hepdata_lib/blob/main/examples/read_c_file.ipynb)
[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/HEPData/hepdata_lib/main?filepath=examples/read_c_file.ipynb)
<br/><br/>
- [Preparing scikit-hep histograms](https://github.com/HEPData/hepdata_lib/blob/main/examples/reading_scikithep_histogram.ipynb)
[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/HEPData/hepdata_lib/main?filepath=examples/reading_scikihep_histogram.ipynb)
<br/><br/>

## External dependencies

Expand Down
381 changes: 381 additions & 0 deletions examples/reading_scikihep_histograms.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,381 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reading histograms\n",
"\n",
"For the new python-based frameworks, another common format would needs\n",
"translation are histogram in the\n",
"[`scikit-hep.hist`](https://hist.readthedocs.io/en/latest/). The functions in\n",
"the `hepdata_lib.hist_utils` will help you with that, and this notebook will\n",
"demonstrate how to do that.\n",
"\n",
"As explained in the [Getting started notebook](Getting_started.ipynb), a\n",
"`Submission` needs to exist or be created. Here, we'll just create one without\n",
"any additional information.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Welcome to JupyROOT 6.28/04\n"
]
}
],
"source": [
"from hepdata_lib import Submission\n",
"\n",
"submission = Submission()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The common use-case for `scikit-hep` histograms is to allow for after-the-fact\n",
"slicing and grouping from a primary histogram. Let us first generate a fake\n",
"histogram that may appear in common histograms, as well as a common slicing\n",
"routine\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Hist(\n",
" StrCategory(['data', 'QCD', 'ttbar'], name='dataset'),\n",
" IntCategory([-1, 0, 4, 5], name='flavor'),\n",
" Regular(60, -3, 3, name='eta'),\n",
" Regular(20, 0, 500, name='pt'),\n",
" storage=Weight()) # Sum: WeightedSum(value=221221, variance=123802) (WeightedSum(value=256973, variance=143935) with flow)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import hist\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(seed=123_456_789)\n",
"\n",
"h = hist.Hist(\n",
" hist.axis.StrCategory([\"data\", \"QCD\", \"ttbar\"], name=\"dataset\"),\n",
" hist.axis.IntCategory([-1, 0, 4, 5], name=\"flavor\"),\n",
" hist.axis.Regular(60, -3, +3, name=\"eta\"),\n",
" hist.axis.Regular(20, 0, 500, name=\"pt\"),\n",
" storage=hist.storage.Weight(),\n",
")\n",
"\n",
"h.fill( ## For mock data\n",
" dataset=\"data\",\n",
" flavor=-1,\n",
" eta=rng.normal(0, 2.0, size=123_456),\n",
" pt=rng.exponential(100, size=123_456),\n",
")\n",
"h.fill( ## For Mock QCD\n",
" dataset=\"QCD\",\n",
" flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.8, 0.15, 0.05]),\n",
" eta=rng.normal(0.0, 2.0, size=1_000_000),\n",
" pt=rng.exponential(100, size=1_000_000),\n",
" weight=0.123456 * 2 * rng.random(size=1_000_000),\n",
")\n",
"h.fill( ## For mock ttbar\n",
" dataset=\"ttbar\",\n",
" flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.45, 0.1, 0.45]),\n",
" eta=rng.normal(0.0, 1.5, size=1_000_000),\n",
" pt=rng.exponential(200, size=1_000_000),\n",
" weight=0.01 * 2 * rng.random(size=1_000_000),\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example of manual processing to 1D array\n",
"\n",
"Let use create a simple slicing routine to get the various histograms of\n",
"interest, then use the most general function, the\n",
"`hepdata_lib.hist_utils.read_hist` method, to create arrays that will be\n",
"compatible with `hepdata_lib.Variable` declaration.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'hist_value': array([27405., 21382., 16585., 12740., 10069., 7878., 6007., 4678.,\n",
" 3666., 2903., 2333., 1734., 1352., 1048., 851., 634.,\n",
" 485., 401., 294., 230.]),\n",
" 'hist_variance': array([27405., 21382., 16585., 12740., 10069., 7878., 6007., 4678.,\n",
" 3666., 2903., 2333., 1734., 1352., 1048., 851., 634.,\n",
" 485., 401., 294., 230.]),\n",
" 'pt': array([( 0., 25.), ( 25., 50.), ( 50., 75.), ( 75., 100.),\n",
" (100., 125.), (125., 150.), (150., 175.), (175., 200.),\n",
" (200., 225.), (225., 250.), (250., 275.), (275., 300.),\n",
" (300., 325.), (325., 350.), (350., 375.), (375., 400.),\n",
" (400., 425.), (425., 450.), (450., 475.), (475., 500.)],\n",
" dtype=[('f0', '<f4'), ('f1', '<f4')])}\n"
]
}
],
"source": [
"from hepdata_lib.hist_utils import read_hist\n",
"import pprint\n",
"\n",
"data_hist = h[dict(dataset=\"data\", flavor=sum, eta=sum)]\n",
"fqcd_hist = (\n",
" h[dict(dataset=\"QCD\", flavor=sum, eta=slice(1.4j, None, sum))]\n",
" + h[dict(dataset=\"QCD\", flavor=sum, eta=slice(None, -1.4j, sum))]\n",
")\n",
"cqcd_hist = h[dict(dataset=\"QCD\", flavor=sum, eta=slice(-1.4j, 1.4j, sum))]\n",
"tt_b_hist = h[dict(dataset=\"ttbar\", flavor=4j, eta=sum)]\n",
"tt_l_hist = h[dict(dataset=\"ttbar\", flavor=0j, eta=sum)]\n",
"\n",
"tab_data = read_hist(data_hist)\n",
"tab_fqcd = read_hist(fqcd_hist)\n",
"tab_cqcd = read_hist(cqcd_hist)\n",
"tab_tt_b = read_hist(tt_b_hist)\n",
"tab_tt_l = read_hist(tt_l_hist)\n",
"\n",
"pprint.pprint(tab_data)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All axes remaining will generate a corresponding array that can be used to\n",
"declare `Variable` instances. Because the histogram was declared with\n",
"`storage=Weight`, entries for `hist_value` (sum of weights) and `hist_variance`\n",
"(sum of weight-squared) will be presented to the user. This information can the\n",
"be used for the uncertainty generation. A simple example is shown below:\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from hepdata_lib import Table, Variable, Uncertainty\n",
"import hist.intervals\n",
"\n",
"tab1d = Table(\"my table\")\n",
"\n",
"var = Variable(\"Jet pT\", is_independent=True, is_binned=True, units=\"GeV\")\n",
"var.values = tab_data[\"pt\"]\n",
"tab1d.add_variable(var)\n",
"\n",
"# Filling in the data entries\n",
"data_var = Variable(\n",
" \"data\", is_independent=False, is_binned=False, units=\"Number of jets\"\n",
")\n",
"data_var.values = tab_data[\"hist_value\"]\n",
"tab1d.add_variable(data_var)\n",
"\n",
"# Fill and example MC entry with uncertainty\n",
"fqcd_var = Variable(\n",
" \"QCD (MC) forward region\",\n",
" is_independent=False,\n",
" is_binned=False,\n",
" units=\"Number of jets\",\n",
")\n",
"fqcd_var.values = tab_fqcd[\"hist_value\"]\n",
"fqcd_stat = Uncertainty(\"statistical uncertainty\", is_symmetric=False)\n",
"s, s2 = tab_fqcd[\"hist_value\"], tab_fqcd[\"hist_variance\"]\n",
"lo, up = hist.intervals.poisson_interval(s, s2)\n",
"lo, up = lo - s, up - s\n",
"fqcd_stat.values = zip(lo, up)\n",
"fqcd_var.add_uncertainty(fqcd_stat)\n",
"tab1d.add_variable(fqcd_var)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To improve the readability of uncertainty variable declaration, the `hist_utils`\n",
"package also provides, `hist_as_variable` method to easily declare variables.\n",
"Uncertainties should be provided using a dictionary, with the dictionary keys\n",
"being the desired name of the uncertainty, and the dictionary values being how\n",
"the uncertainties should be defined. This can either be a string indicating how\n",
"the uncertainty should be calculated (`poisson_sym` or `poisson_asym`), a\n",
"floating point (pair) to indicate flat, a histogram (pair) of the same format as\n",
"the input histogram representing the uncertainty, or a numpy array (pair) the is\n",
"compatible with the final array output.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from hepdata_lib.hist_utils import hist_as_variable\n",
"\n",
"tt_b_var = hist_as_variable(\n",
" \"ttbar b jets\",\n",
" tt_b_hist,\n",
" uncertainty={\n",
" \"statistical\": \"poisson_sym\",\n",
" \"flat symmetric uncertainty\": 0.1,\n",
" \"flat asymmetric uncertainty\": (-0.02, +0.03),\n",
" \"asymmetric uncertainty from histogram\": (\n",
" tt_l_hist * (-0.02),\n",
" tt_l_hist * 0.03,\n",
" ),\n",
" },\n",
")\n",
"\n",
"tab1d.add_variable(tt_b_var)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example using N-dimensional histogram\n",
"\n",
"Because of the flexibility of the scikit-hep histogram syntax, the same\n",
"functions can be used for the histograms of arbitrary dimensions.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"tab2d = Table(\"my table 2d\")\n",
"\n",
"\n",
"data_hist = h[dict(dataset=\"data\", flavor=sum)]\n",
"qcd_hist = h[dict(dataset=\"QCD\", flavor=sum)]\n",
"tt_b_hist = h[dict(dataset=\"ttbar\", flavor=4j)]\n",
"tt_l_hist = h[dict(dataset=\"ttbar\", flavor=0j)]\n",
"\n",
"tab_data = read_hist(data_hist)\n",
"# tab_qcd = read_hist(qcd_hist) # No-longer required to be declared!\n",
"# tab_tt_b = read_hist(tt_b_hist)\n",
"# tab_tt_l = read_hist(tt_l_hist)\n",
"\n",
"pt_var = Variable(\"Jet pT\", is_independent=True, is_binned=True, units=\"GeV\")\n",
"pt_var.values = tab_data[\"pt\"]\n",
"tab2d.add_variable(pt_var)\n",
"eta_var = Variable(\"Jet eta\", is_independent=True, is_binned=True)\n",
"eta_var.values = tab_data[\"eta\"]\n",
"tab2d.add_variable(eta_var)\n",
"\n",
"# Filling in the data entries\n",
"data_var = Variable(\n",
" \"data\", is_independent=False, is_binned=False, units=\"Number of jets\"\n",
")\n",
"data_var.values = tab_data[\"hist_value\"]\n",
"tab2d.add_variable(data_var)\n",
"\n",
"## Fill QCD, just statistical uncertainty\n",
"qcd_var = hist_as_variable(\n",
" \"QCD MC\", qcd_hist, uncertainty={\"statistical\": \"poisson_asym\"}\n",
")\n",
"tab2d.add_variable(qcd_var)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To further simply the construction of table from N-dimensional histograms, we\n",
"provide a `create_hist_base_table` function such that the axes variables are\n",
"automatically set up directly with table declaration.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from hepdata_lib.hist_utils import create_hist_base_table\n",
"\n",
"tab2d_v2 = create_hist_base_table(\n",
" \"my simplifed 2d table\",\n",
" data_hist,\n",
" axes_rename={\"pt\": \"Jet pT\", \"eta\": \"Jet eta\"},\n",
" axes_units={\"pt\": \"GeV\"},\n",
")\n",
"\n",
"tab2d_v2.add_variable(hist_as_variable(\"Data\", data_hist))\n",
"tab2d_v2.add_variable(\n",
" hist_as_variable(\n",
" \"ttbar l jets\", tt_l_hist, uncertainty={\"statistical\": \"poisson_sym\"}\n",
" )\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outputting the submission\n",
"\n",
"Finally, we can add the table to the sumission and create the required files.\n",
"Please refer to the [Getting started notebook](Getting_started.ipynb) for a\n",
"complete example.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"submission.add_table(tab1d)\n",
"submission.add_table(tab2d)\n",
"submission.add_table(tab2d_v2)\n",
"submission.create_files(\"example_output_hist\", remove_old=True)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "hepdata_lib development environment",
"language": "python",
"name": "hepdata_env"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Loading