# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py:percent
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.14.4
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---
# %% [markdown]
# # Foundry
#
# Foundry is a package for forging interpretable predictive modeling pipelines with a sklearn-style API. It includes:
#
# - A `Glm` class with a [Pytorch](https://pytorch.org) backend. This class is highly extensible, supporting (almost) any distribution in pytorch's [distributions](https://pytorch.org/docs/stable/distributions.html) module.
# - A `preprocessing` module that includes helpful classes like `DataFrameTransformer` and `InteractionFeatures`.
# - An `evaluation` module with tools for interpreting any sklearn-API model via `MarginalEffects`.
#
# You should use Foundry to augment your workflows if any of the following are true:
#
# - You are attempting to model a target that is 'weird': for example, highly skewed data, binomial count-data, censored or truncated data, etc.
# - You need some help battling some annoying aspects of feature-engineering: for example, you want an expressive way of specifying interaction-terms in your model; or perhaps you just want consistent support for getting feature-names despite being stuck on python 3.7.
# - You want to interpret your model: for example, perform statistical inference on its parameters, or understand the direction and functional-form of its predictors.
#
# ## Getting Started
#
# `foundry` can be installed with pip:
#
# ```bash
# pip install git+https://github.com/strongio/foundry.git#egg=foundry
# ```
#
# Let's walk through a quick example:
# %%
# data:
from foundry.data import get_click_data
# preprocessing:
from foundry.preprocessing import DataFrameTransformer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PowerTransformer
from sklearn.pipeline import make_pipeline
# glm:
from foundry.glm import Glm
# evaluation:
from foundry.evaluation import MarginalEffects
# %% [markdown]
# Here's a dataset of user pageviews and clicks for a domain with lots of pages:
# %%
df_train, df_val = get_click_data()
df_train
# %% [markdown]
# We'd like to build a model that lets us predict future click-rates for different pages (page_id), page-attributes (e.g. market), and user-attributes (e.g. platform), and also _learn_ about each of these features -- e.g. perform statistical inference on model-coefficients ("are users with missing user-agent data significantly worse than average?").
#
# Unfortunately, these data don't fit nicely into the typical regression/classification divide: each observation captures _counts_ of clicks and _counts_ of pageviews. Our target is the click-rate (clicks/views) and our sample-weight is the pageviews.
#
# One workaround would be to expand our dataset so that each row indicates `is_click` (True/False) -- then we could use a standard classification algorithm:
# %%
df_train_expanded, df_val_expanded = get_click_data(expanded=True)
df_train_expanded
# %% [markdown]
# But this is hugely inefficient: our dataset of ~400K rows explodes to almost 8MM rows.
#
# Within `foundry`, we have the `Glm`, which supports binomial data directly:
# %%
Glm('binomial', penalty=10_000)
# %% [markdown]
# Let's set up a sklearn model pipeline using this Glm. We'll use `foundry`'s `DataFrameTransformer` to support passing feature-names to the Glm (newer versions of sklearn support this via the `set_output()` API).
# %%
preproc = DataFrameTransformer([
    (
        'one_hot',
        make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder()),
        ['attributed_source', 'user_agent_platform', 'page_id', 'page_market']
    ),
    (
        'power',
        PowerTransformer(),
        ['page_feat1', 'page_feat2', 'page_feat3']
    )
])
# %%
glm = make_pipeline(
    preproc,
    Glm('binomial', penalty=1_000)
).fit(
    X=df_train,
    y={
        'value': df_train['num_clicks'],
        'total_count': df_train['num_views']
    },
)
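# %% [markdown]
# Since the pipeline follows the sklearn API, we can sanity-check it against the held-out data. A minimal sketch, assuming `predict()` returns the predicted click-rate for each row:
# %%
predicted_click_rates = glm.predict(X=df_val)
predicted_click_rates[:5]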
# %% [markdown]
# By default, the `Glm` will estimate not just the parameters of our model, but also the uncertainty associated with them. We can access a dataframe of these with the `coef_dataframe_` attribute:
# %%
df_coefs = glm[-1].coef_dataframe_
df_coefs
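# %% [markdown]
# Since the table includes standard-errors, we can sketch a rough answer to the inference question above ("are users with missing user-agent data significantly worse than average?"). This is a minimal sketch assuming approximately-normal estimates; `scipy` supplies the normal CDF:
# %%
from scipy import stats

# two-sided p-values from the estimate/se columns of coef_dataframe_:
df_coefs['z'] = df_coefs['estimate'] / df_coefs['se']
df_coefs['p_value'] = 2 * stats.norm.sf(df_coefs['z'].abs())
df_coefs[df_coefs['name'].str.contains('user_agent_platform')]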
# %% [markdown]
# Using this, it's easy to plot our model-coefficients:
# %%
df_coefs[['param', 'trans', 'term']] = df_coefs['name'].str.split('__', n=2, expand=True)
df_coefs[df_coefs['name'].str.contains('page_feat')].plot('term', 'estimate', kind='bar', yerr='se')
df_coefs[df_coefs['name'].str.contains('user_agent_platform')].plot('term', 'estimate', kind='bar', yerr='se')
# %% [markdown]
# Model-coefficients are limited because they only give us a single number, and for non-linear models (like our binomial GLM) this doesn't tell the whole story. For example, how could we translate the importance of `page_feat3` into understandable terms? This only gets more difficult if our model includes interaction-terms.
#
# To aid in this, there is `MarginalEffects`, a tool for plotting our model-predictions as a function of each predictor:
# %%
glm_me = MarginalEffects(glm)
glm_me.fit(
    X=df_val_expanded,
    y=df_val_expanded['is_click'],
    vary_features=['page_feat3']
).plot()
# %% [markdown]
# Here we see how this predictor's impact on click-rates varies due to floor effects.
#
# As a bonus, we plotted the actual values alongside the predictions, and we can see potential room for improvement in our model: it looks like very high values of this predictor have especially high click-rates, so an extra step in feature-engineering that captures this discontinuity may be warranted.
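# %% [markdown]
# As a sketch of what that extra feature-engineering step might look like -- the cutoff below is hypothetical, and in practice you'd read it off the plot above -- we could add an indicator for very high values of `page_feat3`:
# %%
from sklearn.preprocessing import FunctionTransformer

preproc2 = DataFrameTransformer([
    # ... the 'one_hot' and 'power' transformers from above ...
    (
        'high_page_feat3',
        # hypothetical cutoff: flags the high-valued region where click-rates jump
        FunctionTransformer(lambda df: (df > 2.0).astype('float')),
        ['page_feat3']
    )
])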