Added reader and help functions to convert scikit-hep histogram to Ta…

…bles
HEPData · Oct 27, 2023 · 3425353 · 3425353
1 parent 8dcf6a1
commit 3425353
Show file tree

Hide file tree

Showing 3 changed files with 641 additions and 0 deletions.
diff --git a/examples/reading_scikihep_histograms.ipynb b/examples/reading_scikihep_histograms.ipynb
@@ -0,0 +1,374 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Reading histograms\n",
+    "\n",
+    "For the new python-based frameworks, another common task would be to translate\n",
+    "histogram in the [`scikit-hep.hist`](https://hist.readthedocs.io/en/latest/)\n",
+    "package into the HEPData format. The functions in the `hepdata_lib` will help\n",
+    "you with that, and this notebook will demonstrate how to do that.\n",
+    "\n",
+    "As explained in the [Getting started notebook](Getting_started.ipynb), a\n",
+    "`Submission` needs to exist or be created. Here, we'll just create one without\n",
+    "any additional information:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Welcome to JupyROOT 6.28/04\n"
+     ]
+    }
+   ],
+   "source": [
+    "from hepdata_lib import Submission\n",
+    "\n",
+    "submission = Submission()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The common use-case `scikit-hep` histograms is to allow for after-the-fact\n",
+    "slicing and grouping from a main histogram. Let us first generate a fake\n",
+    "histogram that may appear in common histograms, as well as a common slicing\n",
+    "routine"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Hist(\n",
+       "  StrCategory(['data', 'QCD', 'ttbar'], name='dataset'),\n",
+       "  IntCategory([-1, 0, 4, 5], name='flavor'),\n",
+       "  Regular(60, -3, 3, name='eta'),\n",
+       "  Regular(20, 0, 500, name='pt'),\n",
+       "  storage=Weight()) # Sum: WeightedSum(value=221221, variance=123802) (WeightedSum(value=256973, variance=143935) with flow)"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import hist\n",
+    "import numpy as np\n",
+    "\n",
+    "rng = np.random.default_rng(seed=123_456_789)\n",
+    "\n",
+    "h = hist.Hist(\n",
+    "    hist.axis.StrCategory([\"data\", \"QCD\", \"ttbar\"], name=\"dataset\"),\n",
+    "    hist.axis.IntCategory([-1, 0, 4, 5], name=\"flavor\"),\n",
+    "    hist.axis.Regular(60, -3, +3, name=\"eta\"),\n",
+    "    hist.axis.Regular(20, 0, 500, name=\"pt\"),\n",
+    "    storage=hist.storage.Weight(),\n",
+    ")\n",
+    "\n",
+    "h.fill(  ## For mock data\n",
+    "    dataset=\"data\",\n",
+    "    flavor=-1,\n",
+    "    eta=rng.normal(0, 2.0, size=123_456),\n",
+    "    pt=rng.exponential(100, size=123_456),\n",
+    ")\n",
+    "h.fill(  ## For Mock QCD\n",
+    "    dataset=\"QCD\",\n",
+    "    flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.8, 0.15, 0.05]),\n",
+    "    eta=rng.normal(0.0, 2.0, size=1_000_000),\n",
+    "    pt=rng.exponential(100, size=1_000_000),\n",
+    "    weight=0.123456 * 2 * rng.random(size=1_000_000),\n",
+    ")\n",
+    "h.fill(  ## For mock ttbar\n",
+    "    dataset=\"ttbar\",\n",
+    "    flavor=rng.choice([0, 4, 5], size=1_000_000, p=[0.45, 0.1, 0.45]),\n",
+    "    eta=rng.normal(0.0, 1.5, size=1_000_000),\n",
+    "    pt=rng.exponential(200, size=1_000_000),\n",
+    "    weight=0.01 * 2 * rng.random(size=1_000_000),\n",
+    ")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example of manual processing to 1D array\n",
+    "\n",
+    "Let use create a simple slicing routine to get the various histograms of\n",
+    "interest, then use the most versatile function, the\n",
+    "`hepdata_lib.hist_utils.read_hist` method, to create arrays that will be\n",
+    "compatible with variable creation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'pt': array([(  0.,  25.), ( 25.,  50.), ( 50.,  75.), ( 75., 100.),\n",
+      "       (100., 125.), (125., 150.), (150., 175.), (175., 200.),\n",
+      "       (200., 225.), (225., 250.), (250., 275.), (275., 300.),\n",
+      "       (300., 325.), (325., 350.), (350., 375.), (375., 400.),\n",
+      "       (400., 425.), (425., 450.), (450., 475.), (475., 500.)],\n",
+      "      dtype=[('f0', '<f4'), ('f1', '<f4')]), 'hist_value': array([27405., 21382., 16585., 12740., 10069.,  7878.,  6007.,  4678.,\n",
+      "        3666.,  2903.,  2333.,  1734.,  1352.,  1048.,   851.,   634.,\n",
+      "         485.,   401.,   294.,   230.]), 'hist_variance': array([27405., 21382., 16585., 12740., 10069.,  7878.,  6007.,  4678.,\n",
+      "        3666.,  2903.,  2333.,  1734.,  1352.,  1048.,   851.,   634.,\n",
+      "         485.,   401.,   294.,   230.])}\n"
+     ]
+    }
+   ],
+   "source": [
+    "from hepdata_lib.hist_utils import read_hist\n",
+    "\n",
+    "data_hist = h[dict(dataset='data', flavor=sum, eta=sum)]\n",
+    "fqcd_hist = h[dict(dataset='QCD', flavor=sum, eta=slice(1.4j,None,sum))] +  h[dict(dataset='QCD', flavor=sum, eta=slice(None,-1.4j,sum))]\n",
+    "cqcd_hist = h[dict(dataset='QCD', flavor=sum, eta=slice(-1.4j, 1.4j,sum))]\n",
+    "tt_b_hist = h[dict(dataset='ttbar', flavor=4j, eta=sum)]\n",
+    "tt_l_hist = h[dict(dataset='ttbar', flavor=0j, eta=sum)]\n",
+    "\n",
+    "tab_data = read_hist(data_hist)\n",
+    "tab_fqcd = read_hist(fqcd_hist)\n",
+    "tab_cqcd = read_hist(cqcd_hist)\n",
+    "tab_tt_b = read_hist(tt_b_hist)\n",
+    "tab_tt_l = read_hist(tt_l_hist)\n",
+    "\n",
+    "print(tab_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "All axes remaining will generate a corresponding array that can be used to\n",
+    "declare `Variable` instances. Notice that because the histogram was declared\n",
+    "with `storage=Weight`, entries for `hist_value` (sum of weights) and\n",
+    "`hist_variance` (sum of weight-squared) will be presented to the user. This\n",
+    "information can the be used for the uncertainty generation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from hepdata_lib import Table, Variable, Uncertainty\n",
+    "import hist.intervals\n",
+    "\n",
+    "tab1d = Table(\"my table\")\n",
+    "\n",
+    "var = Variable(\"Jet pT\", is_independent=True, is_binned=True, units=\"GeV\")\n",
+    "var.values = tab_data[\"pt\"]\n",
+    "tab1d.add_variable(var)\n",
+    "\n",
+    "# Filling in the data entries\n",
+    "data_var = Variable(\n",
+    "    \"data\", is_independent=False, is_binned=False, units=\"Number of jets\"\n",
+    ")\n",
+    "data_var.values = tab_data[\"hist_value\"]\n",
+    "tab1d.add_variable(data_var)\n",
+    "\n",
+    "# Fill and example MC entry with uncertainty\n",
+    "fqcd_var = Variable(\n",
+    "    \"QCD (MC) forward region\",\n",
+    "    is_independent=False,\n",
+    "    is_binned=False,\n",
+    "    units=\"Number of jets\",\n",
+    ")\n",
+    "fqcd_var.values = tab_fqcd[\"hist_value\"]\n",
+    "fqcd_stat = Uncertainty(\"statistical uncertainty\", is_symmetric=False)\n",
+    "s, s2 = tab_fqcd[\"hist_value\"], tab_fqcd[\"hist_variance\"]\n",
+    "lo, up = hist.intervals.poisson_interval(s, s2)\n",
+    "lo, up = lo - s, up - s\n",
+    "fqcd_stat.values = zip(lo, up)\n",
+    "fqcd_var.add_uncertainty(fqcd_stat)\n",
+    "tab1d.add_variable(fqcd_var)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To improve the readability of uncertainty variable declaration, the `hist_utils`\n",
+    "package also provides, `hist_as_variable` method to easily declare variables.\n",
+    "Uncertainties should be provided using a dictionary, with the dictionary keys\n",
+    "being the desired name of the uncertainty, and the dictionary values being how\n",
+    "the uncertainties should be defined. This can either be a string indicating how\n",
+    "the uncertainty should be calculated (`poisson_sym` or `poisson_asym`), a\n",
+    "floating point (pair) to indicate flat, a histogram (pair) of the same format as\n",
+    "the input histogram representing the uncertainty, or a numpy array (pair) the is\n",
+    "compatible with the final array output."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from hepdata_lib.hist_utils import hist_as_variable\n",
+    "\n",
+    "tt_b_var = hist_as_variable(\n",
+    "    \"ttbar b jets\",\n",
+    "    tt_b_hist,\n",
+    "    uncertainty={\n",
+    "        \"statistical\": \"poisson_sym\",\n",
+    "        \"flat symmetric uncertainty\": 0.1,\n",
+    "        \"flat asymmetric uncertainty\": (-0.02, +0.03),\n",
+    "        \"asymmetric uncertainty from histogram\": (\n",
+    "            tt_l_hist * (-0.02),\n",
+    "            tt_l_hist * 0.03,\n",
+    "        ),\n",
+    "    },\n",
+    ")\n",
+    "\n",
+    "tab1d.add_variable(tt_b_var)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example using N-dimensional histogram \n",
+    "\n",
+    "Notice that because the flexibility of the scikit-hep histogram syntax, the same\n",
+    "syntax can be used for the histograms of arbitrary dimensions\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tab2d = Table(\"my table 2d\")\n",
+    "\n",
+    "\n",
+    "data_hist = h[dict(dataset=\"data\", flavor=sum)]\n",
+    "qcd_hist = h[dict(dataset=\"QCD\", flavor=sum)]\n",
+    "tt_b_hist = h[dict(dataset=\"ttbar\", flavor=4j)]\n",
+    "tt_l_hist = h[dict(dataset=\"ttbar\", flavor=0j)]\n",
+    "\n",
+    "tab_data = read_hist(data_hist)\n",
+    "# tab_qcd = read_hist(qcd_hist)\n",
+    "# tab_tt_b = read_hist(tt_b_hist)\n",
+    "# tab_tt_l = read_hist(tt_l_hist)\n",
+    "\n",
+    "pt_var = Variable(\"Jet pT\", is_independent=True, is_binned=True, units=\"GeV\")\n",
+    "pt_var.values = tab_data[\"pt\"]\n",
+    "tab2d.add_variable(pt_var)\n",
+    "eta_var = Variable(\"Jet eta\", is_independent=True, is_binned=True)\n",
+    "eta_var.values = tab_data[\"eta\"]\n",
+    "tab2d.add_variable(eta_var)\n",
+    "\n",
+    "# Filling in the data entries\n",
+    "data_var = Variable(\n",
+    "    \"data\", is_independent=False, is_binned=False, units=\"Number of jets\"\n",
+    ")\n",
+    "data_var.values = tab_data[\"hist_value\"]\n",
+    "tab2d.add_variable(data_var)\n",
+    "\n",
+    "## Fill QCD, just statistical uncertainty\n",
+    "qcd_var = hist_as_variable(\n",
+    "    \"QCD MC\", qcd_hist, uncertainty={\"statistical\": \"poisson_asym\"}\n",
+    ")\n",
+    "tab2d.add_variable(qcd_var)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To further simply the construction of table from N-dimensional histograms, we\n",
+    "provide a `create_hist_base_table` function such that the axes variables are\n",
+    "automatically setup.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from hepdata_lib.hist_utils import create_hist_base_table\n",
+    "\n",
+    "tab2d_v2 = create_hist_base_table(\n",
+    "    \"my simplifed 2d table\",\n",
+    "    data_hist,\n",
+    "    axes_rename={\"pt\": \"Jet pT\", \"eta\": \"Jet eta\"},\n",
+    "    axes_units={\"pt\": \"GeV\"},\n",
+    ")\n",
+    "\n",
+    "tab2d_v2.add_variable(hist_as_variable(\"Data\", data_hist))\n",
+    "tab2d_v2.add_variable(\n",
+    "    hist_as_variable(\n",
+    "        \"ttbar l jets\", tt_l_hist, uncertainty={\"statistical\": \"poisson_sym\"}\n",
+    "    )\n",
+    ")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Outputting the submission\n",
+    "\n",
+    "Finally, we can add the table to the sumission and create the files. Please\n",
+    "refer to the [Getting started notebook](Getting_started.ipynb) for a complete\n",
+    "example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "submission.add_table(tab1d)\n",
+    "submission.add_table(tab2d)\n",
+    "submission.add_table(tab2d_v2)\n",
+    "submission.create_files(\"example_output_hist\", remove_old=True)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "hepdata_lib development environment",
+   "language": "python",
+   "name": "hepdata_env"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}