Skip to content

Commit

Permalink
Added reader and help functions to convert scikit-hep histogram to Ta…
Browse files Browse the repository at this point in the history
…bles
  • Loading branch information
yimuchen committed Oct 27, 2023
1 parent 8dcf6a1 commit 3425353
Show file tree
Hide file tree
Showing 3 changed files with 641 additions and 0 deletions.
374 changes: 374 additions & 0 deletions examples/reading_scikihep_histograms.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,374 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reading histograms\n",
"\n",
"For the new python-based frameworks, another common task would be to translate\n",
"histogram in the [`scikit-hep.hist`](https://hist.readthedocs.io/en/latest/)\n",
"package into the HEPData format. The functions in the `hepdata_lib` will help\n",
"you with that, and this notebook will demonstrate how to do that.\n",
"\n",
"As explained in the [Getting started notebook](Getting_started.ipynb), a\n",
"`Submission` needs to exist or be created. Here, we'll just create one without\n",
"any additional information:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Welcome to JupyROOT 6.28/04\n"
]
}
],
"source": [
"from hepdata_lib import Submission\n",
"\n",
"submission = Submission()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The common use-case `scikit-hep` histograms is to allow for after-the-fact\n",
"slicing and grouping from a main histogram. Let us first generate a fake\n",
"histogram that may appear in common histograms, as well as a common slicing\n",
"routine"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Hist(\n",
" StrCategory(['data', 'QCD', 'ttbar'], name='dataset'),\n",
" IntCategory([-1, 0, 4, 5], name='flavor'),\n",
" Regular(60, -3, 3, name='eta'),\n",
" Regular(20, 0, 500, name='pt'),\n",
" storage=Weight()) # Sum: WeightedSum(value=221221, variance=123802) (WeightedSum(value=256973, variance=143935) with flow)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import hist\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(seed=123_456_789)\n",
"\n",
"h = hist.Hist(\n",
" hist.axis.StrCategory([\"data\", \"QCD\", \"ttbar\"], name=\"dataset\"),\n",
" hist.axis.IntCategory([-1, 0, 4, 5], name=\"flavor\"),\n",
" hist.axis.Regular(60, -3, +3, name=\"eta\"),\n",
" hist.axis.Regular(20, 0, 500, name=\"pt\"),\n",
" storage=hist.storage.Weight(),\n",
")\n",
"\n",
"h.fill( ## For mock data\n",
" dataset=\"data\",\n",
" flavor=-1,\n",
" eta=rng.normal(0, 2.0, size=123_456),\n",
" pt=rng.exponential(100, size=123_456),\n",
")\n",
"h.fill( ## For Mock QCD\n",
" dataset=\"QCD\",\n",
" flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.8, 0.15, 0.05]),\n",
" eta=rng.normal(0.0, 2.0, size=1_000_000),\n",
" pt=rng.exponential(100, size=1_000_000),\n",
" weight=0.123456 * 2 * rng.random(size=1_000_000),\n",
")\n",
"h.fill( ## For mock ttbar\n",
" dataset=\"ttbar\",\n",
" flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.45, 0.1, 0.45]),\n",
" eta=rng.normal(0.0, 1.5, size=1_000_000),\n",
" pt=rng.exponential(200, size=1_000_000),\n",
" weight=0.01 * 2 * rng.random(size=1_000_000),\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example of manual processing to 1D array\n",
"\n",
"Let use create a simple slicing routine to get the various histograms of\n",
"interest, then use the most versatile function, the\n",
"`hepdata_lib.hist_utils.read_hist` method, to create arrays that will be\n",
"compatible with variable creation."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'pt': array([( 0., 25.), ( 25., 50.), ( 50., 75.), ( 75., 100.),\n",
" (100., 125.), (125., 150.), (150., 175.), (175., 200.),\n",
" (200., 225.), (225., 250.), (250., 275.), (275., 300.),\n",
" (300., 325.), (325., 350.), (350., 375.), (375., 400.),\n",
" (400., 425.), (425., 450.), (450., 475.), (475., 500.)],\n",
" dtype=[('f0', '<f4'), ('f1', '<f4')]), 'hist_value': array([27405., 21382., 16585., 12740., 10069., 7878., 6007., 4678.,\n",
" 3666., 2903., 2333., 1734., 1352., 1048., 851., 634.,\n",
" 485., 401., 294., 230.]), 'hist_variance': array([27405., 21382., 16585., 12740., 10069., 7878., 6007., 4678.,\n",
" 3666., 2903., 2333., 1734., 1352., 1048., 851., 634.,\n",
" 485., 401., 294., 230.])}\n"
]
}
],
"source": [
"from hepdata_lib.hist_utils import read_hist\n",
"\n",
"data_hist = h[dict(dataset='data', flavor=sum, eta=sum)]\n",
"fqcd_hist = h[dict(dataset='QCD', flavor=sum, eta=slice(1.4j,None,sum))] + h[dict(dataset='QCD', flavor=sum, eta=slice(None,-1.4j,sum))]\n",
"cqcd_hist = h[dict(dataset='QCD', flavor=sum, eta=slice(-1.4j, 1.4j,sum))]\n",
"tt_b_hist = h[dict(dataset='ttbar', flavor=4j, eta=sum)]\n",
"tt_l_hist = h[dict(dataset='ttbar', flavor=0j, eta=sum)]\n",
"\n",
"tab_data = read_hist(data_hist)\n",
"tab_fqcd = read_hist(fqcd_hist)\n",
"tab_cqcd = read_hist(cqcd_hist)\n",
"tab_tt_b = read_hist(tt_b_hist)\n",
"tab_tt_l = read_hist(tt_l_hist)\n",
"\n",
"print(tab_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All axes remaining will generate a corresponding array that can be used to\n",
"declare `Variable` instances. Notice that because the histogram was declared\n",
"with `storage=Weight`, entries for `hist_value` (sum of weights) and\n",
"`hist_variance` (sum of weight-squared) will be presented to the user. This\n",
"information can the be used for the uncertainty generation."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from hepdata_lib import Table, Variable, Uncertainty\n",
"import hist.intervals\n",
"\n",
"tab1d = Table(\"my table\")\n",
"\n",
"var = Variable(\"Jet pT\", is_independent=True, is_binned=True, units=\"GeV\")\n",
"var.values = tab_data[\"pt\"]\n",
"tab1d.add_variable(var)\n",
"\n",
"# Filling in the data entries\n",
"data_var = Variable(\n",
" \"data\", is_independent=False, is_binned=False, units=\"Number of jets\"\n",
")\n",
"data_var.values = tab_data[\"hist_value\"]\n",
"tab1d.add_variable(data_var)\n",
"\n",
"# Fill and example MC entry with uncertainty\n",
"fqcd_var = Variable(\n",
" \"QCD (MC) forward region\",\n",
" is_independent=False,\n",
" is_binned=False,\n",
" units=\"Number of jets\",\n",
")\n",
"fqcd_var.values = tab_fqcd[\"hist_value\"]\n",
"fqcd_stat = Uncertainty(\"statistical uncertainty\", is_symmetric=False)\n",
"s, s2 = tab_fqcd[\"hist_value\"], tab_fqcd[\"hist_variance\"]\n",
"lo, up = hist.intervals.poisson_interval(s, s2)\n",
"lo, up = lo - s, up - s\n",
"fqcd_stat.values = zip(lo, up)\n",
"fqcd_var.add_uncertainty(fqcd_stat)\n",
"tab1d.add_variable(fqcd_var)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To improve the readability of uncertainty variable declaration, the `hist_utils`\n",
"package also provides, `hist_as_variable` method to easily declare variables.\n",
"Uncertainties should be provided using a dictionary, with the dictionary keys\n",
"being the desired name of the uncertainty, and the dictionary values being how\n",
"the uncertainties should be defined. This can either be a string indicating how\n",
"the uncertainty should be calculated (`poisson_sym` or `poisson_asym`), a\n",
"floating point (pair) to indicate flat, a histogram (pair) of the same format as\n",
"the input histogram representing the uncertainty, or a numpy array (pair) the is\n",
"compatible with the final array output."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from hepdata_lib.hist_utils import hist_as_variable\n",
"\n",
"tt_b_var = hist_as_variable(\n",
" \"ttbar b jets\",\n",
" tt_b_hist,\n",
" uncertainty={\n",
" \"statistical\": \"poisson_sym\",\n",
" \"flat symmetric uncertainty\": 0.1,\n",
" \"flat asymmetric uncertainty\": (-0.02, +0.03),\n",
" \"asymmetric uncertainty from histogram\": (\n",
" tt_l_hist * (-0.02),\n",
" tt_l_hist * 0.03,\n",
" ),\n",
" },\n",
")\n",
"\n",
"tab1d.add_variable(tt_b_var)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example using N-dimensional histogram \n",
"\n",
"Notice that because the flexibility of the scikit-hep histogram syntax, the same\n",
"syntax can be used for the histograms of arbitrary dimensions\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"tab2d = Table(\"my table 2d\")\n",
"\n",
"\n",
"data_hist = h[dict(dataset=\"data\", flavor=sum)]\n",
"qcd_hist = h[dict(dataset=\"QCD\", flavor=sum)]\n",
"tt_b_hist = h[dict(dataset=\"ttbar\", flavor=4j)]\n",
"tt_l_hist = h[dict(dataset=\"ttbar\", flavor=0j)]\n",
"\n",
"tab_data = read_hist(data_hist)\n",
"# tab_qcd = read_hist(qcd_hist)\n",
"# tab_tt_b = read_hist(tt_b_hist)\n",
"# tab_tt_l = read_hist(tt_l_hist)\n",
"\n",
"pt_var = Variable(\"Jet pT\", is_independent=True, is_binned=True, units=\"GeV\")\n",
"pt_var.values = tab_data[\"pt\"]\n",
"tab2d.add_variable(pt_var)\n",
"eta_var = Variable(\"Jet eta\", is_independent=True, is_binned=True)\n",
"eta_var.values = tab_data[\"eta\"]\n",
"tab2d.add_variable(eta_var)\n",
"\n",
"# Filling in the data entries\n",
"data_var = Variable(\n",
" \"data\", is_independent=False, is_binned=False, units=\"Number of jets\"\n",
")\n",
"data_var.values = tab_data[\"hist_value\"]\n",
"tab2d.add_variable(data_var)\n",
"\n",
"## Fill QCD, just statistical uncertainty\n",
"qcd_var = hist_as_variable(\n",
" \"QCD MC\", qcd_hist, uncertainty={\"statistical\": \"poisson_asym\"}\n",
")\n",
"tab2d.add_variable(qcd_var)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To further simply the construction of table from N-dimensional histograms, we\n",
"provide a `create_hist_base_table` function such that the axes variables are\n",
"automatically setup.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from hepdata_lib.hist_utils import create_hist_base_table\n",
"\n",
"tab2d_v2 = create_hist_base_table(\n",
" \"my simplifed 2d table\",\n",
" data_hist,\n",
" axes_rename={\"pt\": \"Jet pT\", \"eta\": \"Jet eta\"},\n",
" axes_units={\"pt\": \"GeV\"},\n",
")\n",
"\n",
"tab2d_v2.add_variable(hist_as_variable(\"Data\", data_hist))\n",
"tab2d_v2.add_variable(\n",
" hist_as_variable(\n",
" \"ttbar l jets\", tt_l_hist, uncertainty={\"statistical\": \"poisson_sym\"}\n",
" )\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outputting the submission\n",
"\n",
"Finally, we can add the table to the sumission and create the files. Please\n",
"refer to the [Getting started notebook](Getting_started.ipynb) for a complete\n",
"example."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"submission.add_table(tab1d)\n",
"submission.add_table(tab2d)\n",
"submission.add_table(tab2d_v2)\n",
"submission.create_files(\"example_output_hist\", remove_old=True)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "hepdata_lib development environment",
"language": "python",
"name": "hepdata_env"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Loading

0 comments on commit 3425353

Please sign in to comment.