Using a log transformation for numeric targets and features is straightforward, and comes with several benefits. For example, it can help with heteroscedasticity, which is when the variance of the target is not constant across the range of the predictions1 (demonstrated below). It can also help keep predictions positive after transformation, allow for interpretability gains, and more. One issue with logging is that it is not a linear transformation; this can help capture nonlinear feature-target relationships, but it can also make some post-modeling transformations less straightforward. Also, if you have a lot of zeros, ‘log plus one’ transformations are not going to be enough to help you overcome that hurdle2. Logging also won’t help much when the variables in question have few distinct values, like ordinal variables, which we’ll discuss later in Section 10.2.3.
+
Using a log transformation for numeric targets and features is straightforward, and comes with several benefits. For example, it can help with heteroscedasticity, which is when the variance of the target is not constant across the range of the predictions1 (demonstrated below). It can also help keep predictions positive after transformation, allow for interpretability gains, and more. One issue with logging is that it is not a linear transformation; this can help capture nonlinear feature-target relationships, but it can also make some post-modeling transformations less straightforward. Also, if you have a lot of zeros, ‘log plus one’ transformations are not going to be enough to help you overcome that hurdle2. Logging also won’t help much when the variables in question have few distinct values, like ordinal variables, which we’ll discuss later in Section 11.2.3.
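To make the heteroscedasticity point concrete, here is a minimal sketch in Python with simulated data (the variable names and setup are ours, not from the text): when the noise grows with the mean, the spread of the raw target increases across the range of x, while the spread of the logged target stays roughly constant.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
x = rng.uniform(1, 10, n)

# multiplicative noise, so the variance of y grows with its mean
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.5, n))

df = pd.DataFrame({'x': x, 'y': y, 'log_y': np.log(y)})

# compare the spread of the raw vs. logged target across the range of x
print(
    df.assign(x_bin = pd.qcut(df['x'], 4))
      .groupby('x_bin', observed = True)[['y', 'log_y']]
      .std()
)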
@@ -1043,7 +1060,7 @@
-Figure 10.1: Log Transformation and Heteroscedasticity
+Figure 11.1: Log Transformation and Heteroscedasticity
@@ -1066,14 +1083,14 @@
-
-
10.2.2 Categorical variables
-
Despite their ubiquity in data, we can’t analyze raw text information as it is. Character strings, and labeled features like factors, must be converted to a numeric representation before we can analyze them. For categorical features, we can use something called effects coding to test for specific types of group differences. Far and away the most common type is called dummy coding or one-hot encoding3, which we visited previously in Section 2.7.2. In these situations we create columns for each category, and the value of the column is 1 if the observation is in that category, and 0 otherwise. Here is a one-hot encoded version of the season feature that was demonstrated previously.
+
+
11.2.2 Categorical variables
+
Despite their ubiquity in data, we can’t analyze raw text information as it is. Character strings, and labeled features like factors, must be converted to a numeric representation before we can analyze them. For categorical features, we can use something called effects coding to test for specific types of group differences. Far and away the most common type is called dummy coding or one-hot encoding3, which we visited previously in Section 3.5.2. In these situations we create columns for each category, and the value of the column is 1 if the observation is in that category, and 0 otherwise. Here is a one-hot encoded version of the season feature that was demonstrated previously.
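As a quick illustration (using a made-up season column rather than the book's actual data), pandas can produce the one-hot encoding in a single call:

import pandas as pd

df = pd.DataFrame({'season': ['Fall', 'Winter', 'Spring', 'Summer', 'Fall']})

# one column per category: 1 if the observation is in that category, 0 otherwise
season_onehot = pd.get_dummies(df['season'], prefix = 'season', dtype = int)
print(season_onehot)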
-
-
10.2.2.2 Multiclass targets
+
+
11.2.2.2 Multiclass targets
We’ve talked about and demonstrated models with binary targets, but what about when there are more than two classes? In statistical settings, we can use a multinomial regression, which is a generalization of (binomial) logistic regression to more than two classes via the multinomial distribution. Depending on the tool, you may have to use a multivariate target of the counts, though most commonly they would be zeros and ones for a classification model, which then is just a one-hot encoded target. The following table demonstrates how this might look.
-Table 10.3: Multinomial Data Example
+Table 11.3: Multinomial Data Example
@@ -2113,10 +2130,10 @@
With Bayesian tools, it’s common to use the categorical distribution, which is a different generalization of the Bernoulli distribution to more than two classes. Unlike the multinomial, it is not a count distribution, but an actual distribution over categories.
In the machine learning context, we can use a variety of models we’d use for binary classification. How the model is actually implemented will depend on the tool, but one of the more popular methods is to use one-vs-all or one-vs-one strategies, where you treat each class as the target in a binary classification problem. In the first case of one vs. all, you would have a model for each class that predicts whether an observation is in that class or not. In the second case, you would have a model for each pair of classes. You should generally be careful with either approach if interpretation is important, as it can make the feature effects very difficult to understand. As an example, we can’t expect feature X to have the same effect on the target in a model for class A vs B, as it does in a model for class A vs. (B & C) or A & C. As such, it can be misleading when the models are conducted as if the categories are independent.
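As a rough sketch of the one-vs-all (one-vs-rest) strategy, here is a scikit-learn example on simulated data (our own setup, not the text's): it simply fits one binary logistic regression per class and combines the predicted probabilities.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(
    n_samples = 500, n_features = 5, n_informative = 3,
    n_classes = 3, random_state = 42
)

# one binary model per class: class k vs. everything else
model = OneVsRestClassifier(LogisticRegression()).fit(X, y)

print(len(model.estimators_))      # 3 underlying binary classifiers
print(model.predict_proba(X[:3]))  # per-class predicted probabilities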
-
Regardless of the context, interpretation is now spread across multiple target outputs, and so it can be difficult to understand the overall effect of a feature on the target. Even in the statistical setting, you now have coefficients that regard relative effects for one class versus a reference group, and so they cannot tell you a general effect of a feature on the target. This is where tools like marginal effects and SHAP can be useful (Chapter 3).
+
Regardless of the context, interpretation is now spread across multiple target outputs, and so it can be difficult to understand the overall effect of a feature on the target. Even in the statistical setting, you now have coefficients that regard relative effects for one class versus a reference group, and so they cannot tell you a general effect of a feature on the target. This is where tools like marginal effects and SHAP can be useful (Chapter 4).
-
-
10.2.2.3 Multilabel targets
+
+
11.2.2.3 Multilabel targets
Multilabel targets are a bit more complicated, and are not as common as multiclass targets. In this case, each observation can have multiple labels. For example, if we wanted to predict genre based on the movie review data, we could choose to allow a movie to be both a comedy and action film, a sci-fi horror, or a romantic comedy. In this setting, labels are not mutually exclusive. If there are not too many unique label settings, we can treat the target as we would other multiclass targets, but if there are many, we might need to use a different model to go about things more efficiently.
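To show what a multilabel target looks like in practice, here is a small hypothetical genre example (not the actual review data) using scikit-learn's MultiLabelBinarizer, which turns label sets into the kind of binary indicator matrix that multilabel models expect:

from sklearn.preprocessing import MultiLabelBinarizer

# each movie can carry more than one (non-mutually-exclusive) genre label
genres = [
    ['comedy', 'action'],
    ['sci-fi', 'horror'],
    ['romance', 'comedy'],
    ['action'],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)  # one indicator column per label

print(mlb.classes_)
print(Y)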
@@ -2136,34 +2153,34 @@
-
-
10.2.3 Ordinal variables
+
+
11.2.3 Ordinal variables
So far in our discussion of categorical data, it’s assumed to have no order. But it’s quite common to have labels like “low”, “medium”, and “high”, or “very bad”, “bad”, “neutral”, “good”, “very good”, or simply a few numbers, like ratings from 1 to 5. Ordinal data is categorical data that has a known ordering, but which still has arbitrary labels. Let us repeat that: ordinal data is categorical data.
-
-
10.2.3.1 Ordinal features
+
+
11.2.3.1 Ordinal features
The simplest way to treat ordinal features is as if they were numeric. If you do this, then you’re just pretending that it’s not categorical, but in practice this is usually fine for features. Most of the transformations we mentioned previously aren’t going to be as useful, but you can still use them if you want. For example, logging ratings 1-5 isn’t going to do anything for you model-wise, but it technically doesn’t hurt anything. But you should know that typical statistics like means and standard deviations don’t really make sense for ordinal data, so the main reason for treating them as numeric is for modeling convenience.
If you choose to treat an ordinal feature as categorical, you can ignore the ordering and do the same as you would with categorical data. This would allow for some nonlinearity since the category means will be whatever they need to be. There are some specific techniques for coding ordinal data for use in linear models, but they are not commonly used, and they generally aren’t going to help model performance or feature interpretation, so we do not recommend them. You could, however, use the old-school effects coding that you would incorporate in traditional ANOVA models, but again, you’d need a good reason to do so.
The take home message for ordinal features is generally simple. Treat them as you would numeric features or non-ordered categorical features. Either is fine.
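Here is a small sketch of the two options for an ordinal feature (the ratings column is hypothetical): map the ordered labels to numbers and treat the feature as numeric, or ignore the ordering and one-hot encode it as plain categories.

import pandas as pd

df = pd.DataFrame({'rating': ['low', 'medium', 'high', 'medium', 'low']})

# option 1: treat as numeric via an explicit ordered mapping
rating_map = {'low': 1, 'medium': 2, 'high': 3}
df['rating_num'] = df['rating'].map(rating_map)

# option 2: treat as unordered categories via one-hot encoding
rating_onehot = pd.get_dummies(df['rating'], prefix = 'rating', dtype = int)

print(df)
print(rating_onehot)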
-
-
10.2.3.2 Ordinal targets
+
+
11.2.3.2 Ordinal targets
Ordinal targets can be trickier to deal with. If you treat them as numeric, you’re assuming that the difference between 1 and 2 is the same as the difference between 2 and 3, and so on. This is probably not true. If you treat them as categorical and use standard models for that setting, you’re assuming that there is no connection between categories. So what should you do?
There are a number of ways to model ordinal targets, but probably the most common is the proportional odds model. This model can be seen as a generalization of the logistic regression model; it is very similar to it, and actually identical if you only have two categories. It basically is a model of 2 vs. 1, 3 vs. (2, 1), 4 vs. (3, 2, 1), etc. But other models beyond proportional odds are also possible, and your results could return something that gives coefficients for the model for the 1-2 category change, the 2-3 category change, and so on.
Ordinality of a categorical outcome is largely ignored in machine learning applications. The outcome is either treated as numeric or multiclass classification. This is not necessarily a bad thing, especially if prediction is the primary goal. But if you need a categorical prediction, treating the target as numeric means you have to make an arbitrary choice to classify the predictions. And if you treat it as multiclass, you’re ignoring the ordinality of the target, which may not work as well in terms of performance.
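As a sketch of the proportional odds model described above, here is a minimal example on simulated data using statsmodels' OrderedModel (one of several tools that could be used; the data-generating setup is entirely ours):

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(42)
n = 500
x = rng.normal(size = n)

# a latent score cut into five ordered categories (0 through 4)
latent = 1.0 * x + rng.logistic(size = n)
y = np.digitize(latent, bins = [-2, -0.5, 0.5, 2])

# one coefficient for x, plus estimated cutpoints between adjacent categories
model = OrderedModel(y, pd.DataFrame({'x': x}), distr = 'logit').fit(method = 'bfgs')
print(model.params)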
-
-
10.2.3.3 Rank data
+
+
11.2.3.3 Rank data
Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector of 1-6 with numeric labels could be something like [2, 1, 1, 3, 4, 2], whereas rank data would be [2, 1, 3, 4, 5, 6], each value being unique (unless you allowed for ties). For example, in sports, a ranking problem would involve predicting the actual order of finish of the runners. Assuming you have a modeling tool that actually handles this situation, the objective will be different from other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or its simpler variant, the Bradley-Terry model). In machine learning, you might use so-called learning to rank methods, like the RankNet and LambdaRank algorithms, and other variants for deep learning models.
-
-
10.3 Missing Data
+
+
11.3 Missing Data
-
-
10.6 Time Series
+
+
11.6 Time Series
@@ -2788,17 +2805,17 @@
-Figure 10.7: Time Series Data
+Figure 11.7: Time Series Data
Time series data is any data that incorporates values over a period of time. This could be something like a state’s population over years, or the max temperature of an area over days. Time series data is very common in data science, and there are a number of ways to model such data.
-
-
10.6.1 Time-based targets
-
As in other settings, the most common approach when the target is some value that varies over time is to use a linear model of some kind. While the target varies over time, the features may be time-varying or not. There are traditional autoregressive models that use the target’s past values as features, for example, autoregressive integrated moving average (ARIMA) models. Others can incorporate historical information in other ways, such as is done in Bayesian methods applied to marketing data or in reinforcement learning (Section 9.3). Still others can get quite complex, such as recurrent neural networks and their generalizations that form the backbone of modern AI models. Lately, transformer-based models have looked promising.
-
Longitudinal data8 is a special case of time series data, where the target is a function of time, but it is typically grouped in some fashion. An example is a model for school performance for students over several semesters, where values are clustered within students over time. In this case, you can use some sort of time series regression, but you can also use a mixed model (Section 6.3), where you model the target as a function of time, but also include a random effect for the grouping variable, in this case, students. This is a very common approach in many domains, and can be very effective in terms of performance as well. It can also be used for time series data that is not longitudinal, where the random effects are based on autoregressive covariance matrices. In this case, an ARIMA component is added to the linear model as a random effect to account for the time series nature of the data. This is fairly common in Bayesian contexts.
+
+
11.6.1 Time-based targets
+
As in other settings, the most common approach when the target is some value that varies over time is to use a linear model of some kind. While the target varies over time, the features may be time-varying or not. There are traditional autoregressive models that use the target’s past values as features, for example, autoregressive integrated moving average (ARIMA) models. Others can incorporate historical information in other ways, such as is done in Bayesian methods applied to marketing data or in reinforcement learning (Section 10.3). Still others can get quite complex, such as recurrent neural networks and their generalizations that form the backbone of modern AI models. Lately, transformer-based models have looked promising.
+
Longitudinal data8 is a special case of time series data, where the target is a function of time, but it is typically grouped in some fashion. An example is a model for school performance for students over several semesters, where values are clustered within students over time. In this case, you can use some sort of time series regression, but you can also use a mixed model (Section 7.3), where you model the target as a function of time, but also include a random effect for the grouping variable, in this case, students. This is a very common approach in many domains, and can be very effective in terms of performance as well. It can also be used for time series data that is not longitudinal, where the random effects are based on autoregressive covariance matrices. In this case, an ARIMA component is added to the linear model as a random effect to account for the time series nature of the data. This is fairly common in Bayesian contexts.
In general lots of models can be found specific to time series data, and the choice of model will depend on the data, the questions we want to ask, and the goals we have.
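As one concrete example from this family, here is a minimal ARIMA fit on a simulated autoregressive series using statsmodels (the package choice and data are our own, just to show the mechanics):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
n = 200

# simulate an AR(1) series: each value depends on the previous value plus noise
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# ARIMA(1, 0, 0) is a simple autoregressive model of order 1
model = ARIMA(y, order = (1, 0, 0)).fit()

print(model.params)               # constant, AR coefficient, noise variance
print(model.forecast(steps = 5))  # predictions for the next five time points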
@@ -2817,9 +2834,9 @@
-
-
10.6.2 Time-based features
-
When it comes to time-series features, we can apply time-specific transformations. One technique is the fourier transform, which can be used to decompose a time series into its component frequencies, much like how we use PCA (Section 9.2). This can be useful for identifying periodicity in the data, which can be used as a feature in a model.
+
+
11.6.2 Time-based features
+
When it comes to time-series features, we can apply time-specific transformations. One technique is the fourier transform, which can be used to decompose a time series into its component frequencies, much like how we use PCA (Section 10.2). This can be useful for identifying periodicity in the data, which can be used as a feature in a model.
In marketing contexts, some perform adstocking with features. This method models the delayed effect of features over time, such that they may have their most important impact immediately, but still can impact the present target from past values. For example, a marketing campaign might have the most significant impact immediately after it’s launched, but it can still influence the target variable at later time points. Adstocking helps capture this delayed effect without having to include multiple lagged features in the model. That said, including lagged features is also an option. In this case, you would have a feature for the current time point (t), the same feature for the previous time point (t-1), the feature for the time point before that (t-2), and so on.
If you have the year as a feature, you can use it as a numeric feature or as a categorical feature. If you treat it as numeric, you need to consider what a zero means. In a linear model, the intercept usually represents the outcome when all features are zero. But with a feature like year, a zero year isn’t meaningful in most contexts. To solve this, you can shift the values so that the earliest time point, like the first year in your data, becomes zero. This way, the intercept in your model will represent the outcome for this first time point, which is more meaningful. The same goes if you are using months or days as a numeric feature. It doesn’t really matter which year/month/day is zero, just that zero refers to one of the actual time points observed.
Dates and/or times can be a bit trickier. Often you can just split dates out into year, month, day, etc., and proceed as discussed. In other cases you’d want to track the time period to assess possible seasonal effects. You can use something like a cyclic approach (e.g. cyclic spline or sine/cosine transformation) to get at yearly or within-day seasonal effects. As mentioned, a fourier transform can also be used to decompose the time series into its component frequencies for use as model features. Time components like hours, minutes, and seconds can often be dealt with in similar ways, but you will more often deal with the periodicity in the data. For example, if you are looking at hourly data, you may want to consider the 24-hour cycle.
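Here is a small sketch of these kinds of time-based features (the data and column names are made up): lagged copies of a series, plus a sine/cosine encoding of month so that December and January end up adjacent.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods = 24, freq = 'MS'),
    'visits': rng.poisson(100, 24),
})

# lagged features: values at t-1 and t-2
df['visits_lag1'] = df['visits'].shift(1)
df['visits_lag2'] = df['visits'].shift(2)

# cyclic (sine/cosine) encoding of the month for yearly seasonality
month = df['date'].dt.month
df['month_sin'] = np.sin(2 * np.pi * month / 12)
df['month_cos'] = np.cos(2 * np.pi * month / 12)

print(df.head())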
@@ -2839,8 +2856,8 @@
-
10.6.2.1 Covariance structures
+
+
11.6.2.1 Covariance structures
In many cases you’ll have features that vary over time but are not a time-oriented feature like year or month. For example, you might have a feature that is the number of people who visited a website over days. This is a time-varying feature, but it’s not a time metric in and of itself.
In general, we’d like to account for the time-dependent correlations in our data, and the main way to do so is to posit a covariance structure that accounts for this in some fashion. This helps us understand how data points are related to each other over time, and requires us to estimate the correlations between observations. As a starting point, consider linear regression. In a standard linear regression model, we assume that the samples are independent of one another, with a constant variance and no covariance.
Instead, we can also use something like a mixed model, where we include a random effect for each group and estimate the variance attributable to the grouping effect. By default, this ultimately assumes a constant correlation from time point to time point, but many tools allow you to specify a more complex covariance structure. A common method is to use an autoregressive covariance structure, which allows correlations between observations further apart in time to lessen. In this sense the covariance comes in as an added random effect, rather than being a model in and of itself as with ARIMA. Many such approaches to covariance structures are special cases of gaussian processes, which are a very general technique to model time series, spatial, and other types of data.
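To make the autoregressive idea concrete, here is the implied AR(1) correlation matrix for a handful of time points, where the correlation between two observations decays with their distance in time (the value of rho is just for illustration):

import numpy as np

rho = 0.8      # correlation between adjacent time points
n_times = 5

# AR(1) correlation: corr(t_i, t_j) = rho ** |i - j|
idx = np.arange(n_times)
ar1_corr = rho ** np.abs(idx[:, None] - idx[None, :])

print(np.round(ar1_corr, 3))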
@@ -2852,7 +2869,7 @@
-Figure 10.8: AR (1) Covariance Structure Visualized
+Figure 11.8: AR (1) Covariance Structure Visualized
@@ -2861,23 +2878,23 @@
-
10.7 Spatial Data
+
+
11.7 Spatial Data
-
We visited spatial data in a discussion on non-tabular data (Section 9.4.1), but here we want to talk about it from a modeling perspective, especially within the tabular domain. Say you have a target that is a function of location, such as the proportion of people voting a certain way in a county, or the number of crimes in a city. You can use a spatial regression model, where the target is a function of location among other features that may or may not be spatially oriented. Two approaches already discussed may be applied in the case of having continuous spatial features, such as latitude and longitude, or discrete features like county. For the continuous case, we could use a GAM (Section 6.4), where we use a smooth interaction of latitude and longitude. For the discrete setting, we can use a mixed model, where we include a random effect for county.
+
We visited spatial data in a discussion on non-tabular data (Section 10.4.1), but here we want to talk about it from a modeling perspective, especially within the tabular domain. Say you have a target that is a function of location, such as the proportion of people voting a certain way in a county, or the number of crimes in a city. You can use a spatial regression model, where the target is a function of location among other features that may or may not be spatially oriented. Two approaches already discussed may be applied in the case of having continuous spatial features, such as latitude and longitude, or discrete features like county. For the continuous case, we could use a GAM (Section 7.4), where we use a smooth interaction of latitude and longitude. For the discrete setting, we can use a mixed model, where we include a random effect for county.
There are other traditional techniques for spatial regression, especially in the continuous spatial domain, such as using a spatial lag, where we incorporate information about the neighborhood of an observation’s location into the model (e.g. a weighted mean of neighboring values, as in the visualization above based on code from Walker (2023)). Such models fall under names such as CAR (conditional autoregressive), SAR (spatial autoregressive), BYM, kriging, and so on. These models can be very effective, but are in general different forms of random effects models very similar to those used for time-based settings, and likewise can be seen as special cases of gaussian processes.
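As a toy illustration of the spatial lag idea only (hypothetical coordinates and values, and simple inverse-distance weights rather than any particular CAR/SAR implementation), each location's lag is a weighted mean of the other locations' values:

import numpy as np

rng = np.random.default_rng(42)
n = 10
coords = rng.uniform(0, 1, size = (n, 2))  # made-up locations
values = rng.normal(size = n)              # observed value at each location

# pairwise distances; a location is not its own neighbor
dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis = -1)
np.fill_diagonal(dists, np.inf)

# row-normalized inverse-distance weights
weights = 1 / dists
weights = weights / weights.sum(axis = 1, keepdims = True)

# spatial lag: weighted mean of neighboring values
spatial_lag = weights @ values
print(np.round(spatial_lag, 3))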
-
-
10.8 Multivariate Targets
+
+
11.8 Multivariate Targets
Often you will encounter settings where the target is not a single value, but a vector of values. This is often called a multivariate target in statistical settings, or just the norm for deep learning. For example, you might be interested in predicting the number of people who will buy a product, the number of people who will click on an ad, and the number of people who will sign up for a newsletter. This is a common setting in marketing, where you are interested in multiple outcomes. The main idea is that there is a relationship among the targets, and you want to take this into account.
One model example we’ve already seen is the case where we have more than two categories for the target. Some default approaches may take that input and just do one-vs-all for each category, but this kind of misses the point. Others will simultaneously model the multiple targets in some way. On the other hand, it can be difficult to interpret results with multiple targets. Because of this, you’ll often see results presented in terms of the respective targets anyway, and often even ignoring parameters specifically associated with such a model9.
In deep learning contexts, the multivariate setting is ubiquitous. For example, if you want to classify the content of an image, you might have to predict something like different species of animals, or different types of car models. In natural language processing, you might want to predict the probability of different words in a sentence. In some cases, there are even multiple kinds of targets considered simultaneously! It can get very complex, but often in these settings prediction performance far outweighs the need to interpret specific parameters, and so it’s a good fit.
@@ -2885,8 +2902,8 @@
-
-
10.9 Latent Variables
+
+
11.9 Latent Variables
-Figure 10.10: Latent Variable Model (Bifactor)
+Figure 11.10: Latent Variable Model (Bifactor)
Latent variables are a fundamental aspect of modeling, and simply put, are variables that are not directly observed, but are inferred from other variables. Here are some examples of what might be called latent variables:
-
The linear combination of features in a linear regression model is a latent variable, but usually we only think of it as such before the link transformation in GLMs (Chapter 5).
-
The error term in any model is a latent variable representing all the unknown/unobserved/unmodeled factors that influence the target (Equation 2.3).
The linear combination of features in a linear regression model is a latent variable, but usually we only think of it as such before the link transformation in GLMs (Chapter 6).
+
The error term in any model is a latent variable representing all the unknown/unobserved/unmodeled factors that influence the target (Equation 3.3).
The factor scores in a factor analysis model or structural equation (visualization above).
-
The true target underlying the censored values (Section 10.5).
-
The clusters in cluster analysis/mixture models (Section 9.2.1.1).
-
The random effects in a mixed model (Section 6.3).
+
The true target underlying the censored values (Section 11.5).
+
The clusters in cluster analysis/mixture models (Section 10.2.1.1).
+
The random effects in a mixed model (Section 7.3).
The hidden states in a hidden Markov model.
-
The hidden layers in a deep learning model (Section 8.7).
+
The hidden layers in a deep learning model (Section 9.7).
It’s easy to see from such a list that latent variables are very common in modeling, so it’s good to get comfortable with the concept. Whether they’re appropriate to your specific situation will depend on a variety of factors, but they can be very useful in many settings, if not a required part of the modeling approach.
-
-
10.10 Data Augmentation
+
+
11.10 Data Augmentation
Data augmentation is a technique where you artificially increase the size of your dataset by creating new data points based on the existing data. This is a common technique in deep learning for computer vision, where you might rotate, flip, or crop images to create new training data. This can help improve the performance of your model, especially when you have a small dataset. Techniques are also available for text.
-
In the tabular domain, data augmentation is less common, but still possible. You’ll see it most commonly applied with class-imbalance settings (Section 10.4), where you might create new data points for the minority class to balance the dataset. This can be done by randomly sampling from the existing data points, or by creating new data points based on the existing data points. For the latter, SMOTE and many variants of it are quite common.
+
In the tabular domain, data augmentation is less common, but still possible. You’ll see it most commonly applied with class-imbalance settings (Section 11.4), where you might create new data points for the minority class to balance the dataset. This can be done by randomly sampling from the existing data points, or by creating new data points based on the existing data points. For the latter, SMOTE and many variants of it are quite common.
Unfortunately for tabular data, these techniques are not nearly as successful as augmentation for computer vision or natural language processing, nor consistently so. Part of the issue is that tabular data is very noisy and fraught with measurement error, so in a sense, such techniques are just adding noise to the modeling process10. Downsampling the majority class can potentially throw away useful information. Simple random upsampling of the minority class can potentially lead to an overconfident model that still doesn’t generalize well. In the end, the best approach is to get more and/or better data, but hopefully more successful methods will be developed in the future.
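For reference, here is roughly what SMOTE-style upsampling looks like with the imbalanced-learn package (assuming it is installed; the imbalanced data here are simulated):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# an imbalanced binary problem with roughly a 10% minority class
X, y = make_classification(n_samples = 1000, weights = [0.9, 0.1], random_state = 42)
print(Counter(y))

# SMOTE creates synthetic minority-class points by interpolating between nearest neighbors
X_res, y_res = SMOTE(random_state = 42).fit_resample(X, y)
print(Counter(y_res))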
-
-
10.11 Wrapping Up
+
+
11.11 Wrapping Up
There’s a lot going on with data before you ever get to modeling, and which will affect every aspect of your modeling approach. This chapter outlines common data types, issues, and associated modeling aspects, but in the end, you’ll always have to make decisions based on your specific situation, and they will often not be easy ones. These are only some of the things to consider, so be ready for surprises, and be ready to learn from them!
-
-
10.11.1 The common thread
+
+
11.11.1 The common thread
Many of the transformations and missing data techniques could possibly be applied in many modeling settings. Likewise, you may find yourself dealing with different target variable issues like imbalance or censoring, and deal with temporal, spatial or other structures, in a variety of models. The key is to understand the data, the target, and the features, and to make the best decisions you can based on that understanding.
-
-
10.11.2 Choose your own adventure
+
+
11.11.2 Choose your own adventure
Consider revisiting a model covered in the other parts of this book in light of the data issues discussed here. For example, how might you deal with class imbalance for a boosted tree model? How would you deal with spatial structure in a neural network? How would you deal with a multivariate target in a time series model?
-
-
10.11.3 Additional resources
+
+
11.11.3 Additional resources
Here are some additional resources to help you learn more about the topics covered in this chapter.
The World Happiness Report is a survey of the state of global happiness that ranks countries by how ‘happy’ their citizens perceive themselves to be. You can also find additional details in their supplemental documentation. Our 2018 data is from what was originally reported at that time (figure 2.2) and it also contains a life ladder score from the most recent survey, which is similar and very highly correlated.
The dataset contains the following columns:
@@ -1457,30 +1474,30 @@
-Table A.2: World Happiness Report Dataset (All Years)
+Table D.2: World Happiness Report Dataset (All Years)
-
+
@@ -2016,30 +2033,30 @@
-Table A.3: World Happiness Report Dataset (2018)
+Table D.3: World Happiness Report Dataset (2018)
-
+
@@ -2596,8 +2613,8 @@
-
A.3 Heart Disease UCI
+
+
D.3 Heart Disease UCI
This classic dataset comes from the UCI ML repository. We took a version from Kaggle, and features and target were renamed to be more intelligible. Here is a brief description from UCI:
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
A very simple data set with a count target variable available for an exercise in the GLM chapter. Also good if you want to try your hand at zero-inflated models. The background is that state wildlife biologists want to model how many fish are being caught by fishermen at a state park.
nofish: We’ve never seen this explained. Originally 0 and 1, 0 is equivalent to livebait == ‘yes’, so it may be whether the primary motivation of the camping trip is for fishing or not.
@@ -3679,30 +3696,30 @@
-Table A.5: Fish Dataset
+Table D.5: Fish Dataset
-
+
@@ -4163,26 +4180,26 @@
-
+
@@ -5137,13 +5154,13 @@
diff --git a/docs/estimation.html b/docs/estimation.html
index 4c64812..11dc853 100644
--- a/docs/estimation.html
+++ b/docs/estimation.html
@@ -7,7 +7,7 @@
-4 How Did We Get Here? – [Models Demystified]{.smallcaps}
+5 How Did We Get Here? – [Models Demystified]{.smallcaps}
@@ -1059,8 +1076,8 @@
-
-
4.2.1 Other Setup
+
+
5.2.1 Other Setup
For the R examples, after the above nothing beyond base R is needed. For Python examples, the following should be enough to get you through the examples:
import pandas as pd
@@ -1076,8 +1093,8 @@
-
-
4.3 Starting Out by Guessing
+
+
5.3 Starting Out by Guessing
So, we’ll start with a model in which we predict a country’s level of happiness by its life expectancy, the idea being that if you can expect to live longer, you’re probably in a country with better health care, higher incomes, and other important stuff. We’ll stick with our simple linear regression model as well.
As a starting point we can just guess what the parameter should be, but how would we know what to guess? How would we know which guesses are better than others? Let’s try a couple and see what happens. Let’s say that we think all countries start at the bottom on the happiness scale (around 3), but life expectancy makes a big impact- for every standard deviation of life expectancy we go up a whole point on happiness1. We can plug this into the model and see what we get:
\[
@@ -1088,15 +1105,15 @@
How do we know which is better? Let’s find out!
-
-
4.4 Prediction Error
+
+
5.4 Prediction Error
We’ve seen that a key component to model assessment involves comparing the predictions from the model to the actual values of the target. This difference is known as the prediction error, or residuals in more statistical contexts. We can express this as:
This prediction error tells us how far off our model prediction is from the observed target values, and it gives us a way to compare models. With our measure of prediction error, we can calculate a total error for all observations/predictions (Section 3.2), or similarly, the average error. If one model or parameter set has less total or average error, we can say it’s a better model than one that has more (Section 3.2.3). Ideally we’d like to choose a model with the least error, but we’ll see that this is not always possible2.
+
This prediction error tells us how far off our model prediction is from the observed target values, and it gives us a way to compare models. With our measure of prediction error, we can calculate a total error for all observations/predictions (Section 4.2), or similarly, the average error. If one model or parameter set has less total or average error, we can say it’s a better model than one that has more (Section 4.2.3). Ideally we’d like to choose a model with the least error, but we’ll see that this is not always possible2.
However, if we just take the average of our errors from a linear regression model, you’ll see that it is roughly zero! This is by design for many common models, where we will even explicitly write the formula for the error as coming from a normal distribution with a mean of zero. So, to get a meaningful error metric, we need to use the squared error value or the absolute value. These also allow errors of similar value above and below the observed value to cost the same3. We’ll use squared error here, and we’ll calculate the mean of the squared errors for all our predictions.
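In code, the comparison is only a few lines (the observed values and predictions here are made up solely to show the calculation):

import numpy as np

y = np.array([4.2, 5.1, 6.0, 5.5])            # observed target values
predictions = np.array([4.0, 5.4, 5.8, 5.9])  # model predictions

error = y - predictions
print(np.mean(error))       # positive and negative errors partially cancel
print(np.mean(error ** 2))  # mean squared error avoids the cancellation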
Odds ratios might be more interpretable to some, but since they are ratios of ratios, people have historically had a hard time with those as well. Furthermore, doubling the odds is not the same as doubling the probability, so we’re left doing some mental calisthenics to interpret them. Odds ratios are often used in academic settings, but in practice, they are not as common as you might think. The take-home message is that we can interpret our result in a linear or nonlinear space, but it can be a bit difficult2. Our own preference is to stick with predicted probabilities, but it’s good to know how to interpret odds ratios.
+
Odds ratios might be more interpretable to some, but since they are ratios of ratios, people have historically had a hard time with those as well. As shown in Table 6.1, knowledge of the baseline rate is required for a good understanding of them. Furthermore, doubling the odds is not the same as doubling the probability, so we’re left doing some mental calisthenics to interpret them. Odds ratios are often used in academic settings, but in practice elsewhere, they are not as common. The take-home message is that we can interpret our result in terms of odds (ratios of probabilities), log-odds (linear space), or as probabilities (nonlinear space), but it can take a little more effort than our linear regression setting1. Our own preference is to stick with predicted probabilities, but it’s good to have familiarity with odds ratios, since they are often reported in academic papers and media reports.
-
-
5.3.3 A logistic regression model
-
For our model let’s return to our movie review data, but now we are going to use rating_good as our target. Before we get to modeling, see if you can find out the frequency of ‘good’ and ‘bad’ reviews, and the probability of getting a ‘good’ review. We will use word_count and gender as our features.
-
+
+
6.3.3 A logistic regression model
+
Now let’s get our hands dirty and do a classification model using logistic regression. For our model let’s return to the movie review data, but now we’ll use the binary rating_good (‘good’ vs. ‘bad’) as our target. Before we get to modeling, see if you can find out the frequency of ‘good’ and ‘bad’ reviews, and the probability of getting a ‘good’ review. We examine the relationship of word_count and gender features with the likelihood of getting a good rating.
df_reviews =read_csv('https://tinyurl.com/moviereviewsdata')
-
-# for the by-hand option later
-X = df_reviews |>
-select(word_count, male = gender) |>
-mutate(male =ifelse(male =='male', 1, 0)) |>
-as.matrix()
-
-y = df_reviews$rating_good
+
df_reviews = read_csv('https://tinyurl.com/moviereviewsdata')
+
+# for the by-hand option later
+X = df_reviews |>
+    select(word_count, male = gender) |>
+    mutate(male = ifelse(male == 'male', 1, 0)) |>
+    as.matrix()
+
+y = df_reviews$rating_good
-
+
-
import pandas as pd
-import numpy as np
-
-df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
-
-# for the by-hand option later
-X = (
- df_reviews[['word_count', 'gender']]
- .rename(columns = {'gender': 'male'})
- .assign(male = np.where(df_reviews[['gender']] =='male', 1, 0))
-)
-
-y = df_reviews['rating_good']
+
import pandas as pd
+import numpy as np
+
+df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
+
+# for the by-hand option later
+X = (
+    df_reviews[['word_count', 'gender']]
+    .rename(columns = {'gender': 'male'})
+    .assign(male = np.where(df_reviews['gender'] == 'male', 1, 0))
+)
+
+y = df_reviews['rating_good']
For an initial logistic regression model, we can use standard and common functions in our chosen language. Running a logistic regression model requires the specification of the family, but that’s pretty much the only difference compared to our previous linear regression. The default link function for the binomial distribution is the ‘logit’ link, so we don’t have to specify it explicitly.
import statsmodels.api as sm
-import statsmodels.formula.api as smf
-
-model_logistic = smf.glm(
-'rating_good ~ word_count + gender',
- data = df_reviews,
- family = sm.families.Binomial()
-).fit()
-
-model_logistic.summary()
+
import statsmodels.api as sm
+import statsmodels.formula.api as smf
+
+model_logistic = smf.glm(
+    'rating_good ~ word_count + gender',
+    data = df_reviews,
+    family = sm.families.Binomial()
+).fit()
+
+model_logistic.summary()
Generalized Linear Model Regression Results
@@ -1253,13 +1312,13 @@
Date:
-
Thu, 15 Aug 2024
+
Sun, 18 Aug 2024
Deviance:
1257.4
Time:
-
20:48:45
+
18:50:09
Pearson chi2:
1.02e+03
@@ -1336,43 +1395,43 @@
-
The binomial distribution is a count distribution. The logistic regression model is used to model binary outcomes, but we can use the binomial distribution because the binary setting is a special case of the binomial distribution where the number of trials is 1, and the number of successes can only be 0 or 1. In this case, we can also use the Bernoulli distribution, which does not require the number of trials, since, when the number of trials is 1 the factorial part of Equation 5.1 drops out.
+
As noted, the binomial distribution is a count distribution. For a binary outcome, we can only have a 0 or 1 outcome for each ‘trial’, and the ‘size’ or ‘n’ for the binomial distribution is 1. In this case, we can also use the Bernoulli distribution (\(\textrm{Bern}(p)\)). This does not require the number of trials, since, when the number of trials is 1 the factorial part of Equation 6.1 drops out.
Many coming from a non-statistical background are not aware that their logistic model can actually handle count and/or proportional outcomes.
-
-
5.3.4 Interpretation and visualization
-
We need to know what those results mean. The coefficients that we get from our model are in log odds, but as we demonstrated we can exponentiate them to get the odds ratio. Interpreting log odds is difficult, but we can at least get a feeling for them directionally. A log odds of 0 (odds ratio of 1) would indicate no relationship between the feature and target. A positive log odds would indicate that an increase in the feature will increase the log odds of moving from ‘bad’ to ‘good’, whereas a negative log odds would indicate that a decrease in the feature will decrease the log odds of moving from ‘bad’ to ‘good’. On the log odds scale, the coefficients are symmetric as well, such that, e.g., a +1 coefficient denotes a similar increase in the log odds as a -1 coefficient denotes a decrease.
+
+
6.3.4 Interpretation and visualization
+
If our modeling goal is not just producing predictions, we need to know what those results mean. The coefficients that we get from our model are in log odds, but as we demonstrated we can exponentiate them to get the odds ratio. Interpreting log odds is difficult at best, but we can at least get a feeling for them directionally. A log odds of 0 (odds ratio of 1) would indicate no relationship between the feature and target. A positive log odds would indicate that an increase in the feature will increase the log odds of moving from ‘bad’ to ‘good’, whereas a negative log odds would indicate that a decrease in the feature will decrease the log odds of moving from ‘bad’ to ‘good’. On the log odds scale, the coefficients are symmetric as well, such that, e.g., a +1 coefficient denotes a similar increase in the log odds as a -1 coefficient denotes a decrease.
-Table 5.2: Raw Coefficients and Odds Ratios for a Logistic Regression
+Table 6.2: Raw Coefficients and Odds Ratios for a Logistic Regression
-
+
@@ -1824,23 +1883,23 @@
-
+
-Figure 5.4: Model Predictions for Word Count Feature
+Figure 6.4: Model Predictions for Word Count Feature
-
In Figure 5.4, we can see a clear negative relationship between the number of words in a review and the probability of being considered a ‘good’ movie. As we get over 20 words, the predicted probability of being a ‘good’ movie is less than .2. We also see the increase in the chance for a good rating with males vs. females, but our model results suggest this is not a statistically significant difference.
+
In Figure 6.4, we can see a clear negative relationship between the number of words in a review and the probability of being considered a ‘good’ movie. As we get over 20 words, the predicted probability of being a ‘good’ movie is less than .2. We also see the increase in the chance for a good rating with males vs. females, but our model results suggest this is not a statistically significant difference.
@@ -1849,13 +1908,13 @@
-Figure 5.5: Model Predictions for Gender
+Figure 6.5: Model Predictions for Gender
-
In the end, whether you think these differences are practically significant is up to you. And you’ll still need to do the standard model exploration to further understand the model (Chapter 3 has lots of detail on this). But this is a good start.
+
In the end, whether you think these differences are practically significant is up to you. And you’ll still need to do the standard model exploration to further understand the model (Chapter 4 has lots of detail on this). But this is a good start.
@@ -1868,19 +1927,19 @@
-
Logistic regression does not have an \(R^2\) value in the way that a linear regression model does. Instead, there are pseudo-\(R^2\) values, but they are not the same as the \(R^2\) value that you are used to seeing Computing (2023).
+
Logistic regression does not have an \(R^2\) value in the way that a linear regression model does. Instead, there are pseudo-\(R^2\) values, but they are not the same as the \(R^2\) value that you are used to seeing (UCLA Advanced Research Computing (2023)).
-
-
5.4 Poisson Regression
+
+
6.4 Poisson Regression
Poisson regression also belongs to the class of generalized linear models, and is used specifically when you have a count variable as your target. After logistic regression for binary outcomes, Poisson regression is probably the next most common type of generalized linear model you will encounter. Unlike continuous targets, a count starts at 0 and can only be a whole number. Often it is naturally skewed as well, so we’d like a model well-suited to this situation. Unlike the binomial, there is no concept of number of trials, just the count of events.
-
-
5.4.1 The Poisson distribution
-
The Poisson distribution is very similar to the binomial distribution, because the binomial is also a count distribution, and in fact generalizes the poisson3. The Poisson has a single parameter noted as \(\lambda\), which makes it the simplest model setting we’ve seen so far4. Conceptually, this rate parameter is going to estimate the expected number of events during a time interval. This can be accidents in a year, pieces produced in a day, or hits during the course of a baseball season.
-
Let’s see what the particular distribution might look like for different rates. We can see that for low count values, the distribution is skewed to the right, but note how the distribution becomes more symmetric and bell-shaped as the rate increases5. You might also be able to tell that the variance increases along with the mean, and in fact, the variance is equal to the mean for the Poisson distribution.
+
+
6.4.1 The Poisson distribution
+
The Poisson distribution is very similar to the binomial distribution, because the binomial is also a count distribution, and in fact generalizes the poisson2. The Poisson has a single parameter noted as \(\lambda\), which makes it the simplest model setting we’ve seen so far3. Conceptually, this rate parameter is going to estimate the expected number of events during a time interval. This can be accidents in a year, pieces produced in a day, or hits during the course of a baseball season.
+
Let’s see what the particular distribution might look like for different rates. We can see that for low count values, the distribution is skewed to the right, but note how the distribution becomes more symmetric and bell-shaped as the rate increases4. You might also be able to tell that the variance increases along with the mean, and in fact, the variance is equal to the mean for the Poisson distribution.
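We can check the mean-equals-variance property with a quick simulation (a sketch of ours, not from the text):

import numpy as np
from scipy import stats

for rate in [1, 5, 20]:
    draws = stats.poisson(mu = rate).rvs(size = 100_000, random_state = 42)
    # for a Poisson, the sample mean and variance should both be close to the rate
    print(rate, round(draws.mean(), 2), round(draws.var(), 2))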
@@ -1889,7 +1948,7 @@
-Figure 5.6: Poisson Distributions for Different Rates
+Figure 6.6: Poisson Distributions for Different Rates
-Figure 5.7: Distribution of the Personal Pronouns Seen Across Reviews
+Figure 6.7: Distribution of the Personal Pronouns Seen Across Reviews
-
-
5.4.2 A Poisson regression model
+
+
6.4.2 A Poisson regression model
-
Recall that every GLM distribution has a default link function. The Poisson distribution uses a log link function:
-
\[y \sim \textrm{Poisson}(\lambda)\]
+
Recall that GLM specific distributions have a default link function. The Poisson distribution uses a log link function:
+
\[y^* \sim \textrm{Poisson}(\lambda)\]
\[\text{log}(\lambda) = \alpha + X\beta\]
-
Using the log link keeps the outcome non-negative when we use the inverse of it. For model fitting with standard functions, all we have to do is switch the family from ‘binomial’ to ‘poisson’. As the default link is the ‘log’, so we don’t have to specify it explicitly6. We can run the model and get the results as we did before, but we keep our presentation here to just the coefficients.
+
Using the log link keeps the outcome non-negative when we use the inverse of it. For model fitting with standard functions, all we have to do is switch the family from ‘binomial’ to ‘poisson’. As the default link is the ‘log’, we don’t have to specify it explicitly5.
+
In this model we’ll predict the number of personal pronouns used in a review. We’ll use word count and gender as our features like we did with the logistic model.
model_poisson = smf.glm(
- formula ='poss_pronoun ~ word_count + gender',
- data = df_reviews,
- family = sm.families.Poisson()
-).fit()
-
-model_poisson.summary()
-
-np.exp(model_poisson.params)
+
model_poisson = smf.glm(
+    formula = 'poss_pronoun ~ word_count + gender',
+    data = df_reviews,
+    family = sm.families.Poisson()
+).fit()
+
+model_poisson.summary()
+
+np.exp(model_poisson.params)
-
-
5.4.3 Interpretation and visualization
+
+
6.4.3 Interpretation and visualization
Like with logistic, we can exponentiate the coefficients to get what’s now referred to as the rate ratio. This is the ratio of the rate of the outcome occurring for a one unit increase in the feature.
-Table 5.3: Rate Ratios for a Poisson Regression
+Table 6.3: Rate Ratios for a Poisson Regression
-
+
@@ -2483,7 +2543,7 @@
-Figure 5.8: Poisson Model Predictions for Word Count Feature
+Figure 6.8: Poisson Model Predictions for Word Count Feature
@@ -2502,109 +2562,109 @@
-
Did you notice that both our effects for word count in the logistic (Figure 5.4) and Poisson (Figure 5.8) models were not exactly the straightest of lines? Once we’re on the probability and count scales, we’re not going to see the same linear relationships that we might expect from a basic linear model due to the transformation. If we plot the effect on the log-odds or log-count scale, we’re back to straight lines. This is a first taste in how the linear model can be used to get at nonlinear relationships, which are of the focus of Chapter 6.
+
You’ll note again that our effects for word count in the logistic (Figure 6.4) and Poisson (Figure 6.8) models were not exactly the straightest of lines. Once we’re on the probability and count scales, we’re not going to see the same linear relationships that we might expect from a linear regression model due to the transformation. If we plot the effect on the log-odds or log-count scale, we’re back to straight lines, as demonstrated with the logistic model. This is a first taste in how the linear model can be used to get at nonlinear relationships depending on the scale we focus on. More explicit nonlinear relationships are the focus of Chapter 7.
-
-
5.5 How Did We Get Here?
-
If we really want to demystify the modeling process, let’s create our own function to estimate the coefficients. We can use maximum likelihood estimation to estimate the parameters of our model, which is the approach used by standard package functions. Feel free to skip this part if you only wanted the basics, but for even more information on maximum likelihood estimation, see Section 4.7 where we take a deeper dive into the topic and with a similar function. The following code is a simple version of what goes on behind the scenes with ‘glm’ type functions.
+
+
6.5 How Did We Get Here?
+
If we really want to demystify the modeling process, let’s create our own function to estimate the coefficients. We can use maximum likelihood estimation to estimate the parameters of our model, which is the approach used by standard package functions. Feel free to skip this part if you only wanted the basics, but for even more information on maximum likelihood estimation, see Section 5.7 where we take a deeper dive into the topic and with a similar function. The following code is a simple version of what goes on behind the scenes with ‘glm’ type functions.
glm_simple =function(par, X, y, family ='binomial') {
-# add an column for the intercept
- X =cbind(1, X)
-
-# Calculate the linear predictor
- mu = X %*% par # %*% is matrix multiplication
-
-# get the likelihood for the binomial or poisson distribution
-if (family =='binomial') {
-# Convert to a probability ('logit' link/inverse)
- p =1/ (1+exp(-mu))
- L =dbinom(y, size =1, prob = p, log =TRUE)
- }
-elseif (family =='poisson') {
-# Convert to a count ('log' link/inverse)
- p =exp(mu)
- L =dpois(y, lambda = p, log =TRUE)
- }
-
-# return the negative sum of the log-likelihood (for minimization)
- value =-sum(L)
-
-return(value)
-}
+
glm_simple = function(par, X, y, family = 'binomial') {
+    # add a column for the intercept
+    X = cbind(1, X)
+
+    # Calculate the linear predictor
+    mu = X %*% par  # %*% is matrix multiplication
+
+    # get the likelihood for the binomial or poisson distribution
+    if (family == 'binomial') {
+        # Convert to a probability ('logit' link/inverse)
+        p = 1 / (1 + exp(-mu))
+        L = dbinom(y, size = 1, prob = p, log = TRUE)
+    }
+    else if (family == 'poisson') {
+        # Convert to a count ('log' link/inverse)
+        p = exp(mu)
+        L = dpois(y, lambda = p, log = TRUE)
+    }
+
+    # return the negative sum of the log-likelihood (for minimization)
+    value = -sum(L)
+
+    return(value)
+}
-
+
-
from scipy.stats import poisson, binom
-
-def glm_simple(par, X, y, family ='binomial'):
-# add an column for the intercept
- X = np.column_stack((np.ones(X.shape[0]), X))
-
-# Calculate the linear predictor
- mu = X @ par # @ is matrix multiplication
-
-# get the likelihood for the binomial or poisson distribution
-if family =='binomial':
- p =1/ (1+ np.exp(-mu))
- L = binom.logpmf(y, 1, p)
-elif family =='poisson':
- lambda_ = np.exp(mu)
- L = poisson.logpmf(y, lambda_)
-
-# return the negative sum of the log-likelihood (for minimization)
- value =-np.sum(L)
-
-return value
+
from scipy.stats import poisson, binom
+
+def glm_simple(par, X, y, family = 'binomial'):
+    # add a column for the intercept
+    X = np.column_stack((np.ones(X.shape[0]), X))
+
+    # Calculate the linear predictor
+    mu = X @ par  # @ is matrix multiplication
+
+    # get the likelihood for the binomial or poisson distribution
+    if family == 'binomial':
+        # Convert to a probability ('logit' link/inverse)
+        p = 1 / (1 + np.exp(-mu))
+        L = binom.logpmf(y, 1, p)
+    elif family == 'poisson':
+        # Convert to a count ('log' link/inverse)
+        lambda_ = np.exp(mu)
+        L = poisson.logpmf(y, lambda_)
+
+    # return the negative sum of the log-likelihood (for minimization)
+    value = -np.sum(L)
+
+    return value
Now that we have our objective function, we can fit our models, starting with the logistic model. We will use the optim function in R and the minimize function in Python.
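For the Python side, the fitting step might look like the following, reusing the glm_simple function defined above with scipy's minimize (the zero starting values are an arbitrary choice of ours, and y is assumed to already be coded 0/1):

import numpy as np
from scipy.optimize import minimize

# starting guesses: one intercept plus one coefficient per feature
init = np.zeros(X.shape[1] + 1)

fit_logistic = minimize(
    glm_simple,
    init,
    args = (np.asarray(X, dtype = float), np.asarray(y), 'binomial'),
    method = 'BFGS'
)

print(fit_logistic.x)  # estimated intercept and coefficients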
-Table 5.4: Comparison of Coefficients
+Table 6.4: Comparison of Coefficients
-
+
@@ -3091,34 +3151,34 @@
Similarly, we can also use our function to estimate the coefficients for the poisson model. Just like the GLM function we might normally use, we can change the family option to specify the distribution we want to use.
-Table 5.5: Comparison of Coefficients
+Table 6.5: Comparison of Coefficients
-
+
@@ -3605,9 +3665,9 @@
This goes to show that just a little knowledge of the underlying mechanics can go a long way in understanding how many models work.
-
-
5.6 Wrapping Up
-
So at this point you have standard linear regression with the normal distribution for continuous targets, logistic regression for binary/proportional ones via the binomial distribution, and Poisson regression for counts. These models combine to provide much of what you need for starting out in the modeling world, and all serve well as baseline models for comparison when using more complex methods (Section 8.4). However, what we’ve seen is just a tiny slice of the potential universe of distributions that you could use. Here is a brief list of some that are still in the GLM family proper and others that can be useful7:
+
+
6.6 Wrapping Up
+
So at this point you have standard linear regression with the normal distribution for continuous targets, logistic regression for binary/proportional ones via the binomial distribution, and Poisson regression for counts. These models combine to provide much of what you need for starting out in the modeling world, and all serve well as baseline models for comparison when using more complex methods (Section 8.4). However, what we’ve seen is just a tiny slice of the potential universe of distributions that you could use. Here is a brief list of some that are still in the GLM family proper and others that can be useful6:
Other Core GLM (available in standard functions):
Gamma: For continuous, positive targets that are skewed.
@@ -3616,7 +3676,7 @@
Others (some fairly common):
Beta: For continuous targets that are bounded between 0 and 1.
-
Log-Normal: For continuous targets that are skewed. Essentially what you get with linear regression and logging the target8.
+
Log-Normal: For continuous targets that are skewed. Essentially what you get with linear regression and logging the target7.
Tweedie: Generalizes several core GLM family distributions.
In the ballpark:
@@ -3628,16 +3688,16 @@
Quasi *: For example quasipoisson. These ‘quasi-likelihoods’ served a need at one point that is now better served by other methods.
You’ll typically need separate packages to fit some of these, but most often the tools keep to a similar functional approach. The main thing is to know that certain distributions might fit your data a bit better than others, that you can use the same basic framework and mindset to fit them, and that doing so might get you a little closer to the answer you seek about your data!
-
-
5.6.1 The common thread
+
+
6.6.1 The common thread
GLMs extend your standard linear model as a powerful tool for modeling a wide range of data types. They are a great way to get started with more complex models, and even allow us to use linear models in a not so linear way. It’s best to think of GLMs more broadly than the strict statistical definition, and consider many models like ordinal regression, ranking models, survival analysis, and more as part of the same extension.
-
-
5.6.2 Choose your own adventure
-
At this point you have a pretty good sense of what linear models have to offer, but there’s even more! You can start to look at more complex models that build on these ideas, like mixed models, generalized additive models and more in Chapter 6. You can also feel confident heading into the world of machine learning (Chapter 7), where you’ll find additional ways to think about your modeling approach.
+
+
6.6.2 Choose your own adventure
+
At this point you have a pretty good sense of what linear models have to offer, but there’s even more! You can start to look at more complex models that build on these ideas, like mixed models, generalized additive models and more in Chapter 7. You can also feel confident heading into the world of machine learning (Chapter 8), where you’ll find additional ways to think about your modeling approach.
-
-
5.6.3 Additional resources
+
+
6.6.3 Additional resources
If you are itching for a textbook, there isn’t any shortage of them out there and you can essentially take your pick. Most purely statistical treatments are going to be a bit dated at this point, but they are still accurate and may be worth your time.
Generalized Linear Models (McCullagh 2019) is a classic text on the subject, but it is a bit dense and not for the faint of heart. You could even go back to Nelder and Wedderburn (1972), a very early treatment.
@@ -3652,16 +3712,13 @@
-
-
5.7 Exercise
+
+
6.7 Exercise
Use the fish data (Section A.4) to conduct a Poisson regression and see how well you can predict the number of fish caught based on the other variables, like how many people were on the trip, how many children were present, whether live bait was used, and so on.
If you would prefer to try a logistic regression, change the count to just 0 and 1 for whether any fish were caught, and see how well you can predict that.
Dobson, Annette J., and Adrian G. Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. New York: Chapman & Hall/CRC. https://doi.org/10.1201/9781315182780.
@@ -3680,19 +3737,21 @@
Nelder, J. A., and R. W. M. Wedderburn. 1972. “Generalized Linear Models.” Journal of the Royal Statistical Society, Series A (General) 135 (3): 370–84. https://doi.org/10.2307/2344614.
The y in the formula is more properly expressed as \(y | X, \theta\), where X is the matrix of features and \(\theta\) the parameters estimated by the model. We’ll keep it simple here.↩︎
-
For more on interpreting odds ratios, see this article.↩︎
-
If your binomial setting has a very large number of trials relative to the number of successes, which amounts to very small proportions \(p\), you would find that the binomial distribution would converge to the Poisson distribution.↩︎
-
Neither the binomial nor the Poisson have a variance parameter to estimate, as the variance is determined by the mean. This is in contrast to the normal distribution, where the variance is a separate parameter. For the Poisson, the variance is equal to the mean, and for the binomial, the variance is equal to \(n*p*(1-p)\). The Poisson assumption of equal variance rarely holds up in practice, so people often use the negative binomial distribution instead.↩︎
-
From a modeling perspective, for large mean counts you can just go back to using the normal distribution if you prefer, without losing much predictively and with a gain in interpretability.↩︎
-
It is not uncommon in many disciplines to use different link functions for logistic models, but the log link is always used for Poisson models.↩︎
-
There is not strict agreement about what qualifies for being in the GLM family.↩︎
For more on interpreting odds ratios, see this article.↩︎
+
If your binomial setting has a very large number of trials relative to the number of successes, which amounts to very small proportions \(p\), you would find that the binomial distribution would converge to the Poisson distribution.↩︎
+
Neither the binomial nor the Poisson have a variance parameter to estimate, as the variance is determined by the mean. This is in contrast to the normal distribution, where the variance is an estimated parameter. For the Poisson, the variance is equal to the mean, and for the binomial, the variance is equal to \(n*p*(1-p)\). The Poisson assumption of equal variance rarely holds up in practice, so people often use the negative binomial distribution instead.↩︎
+
From a modeling perspective, for large mean counts you can just go back to using the normal distribution if you prefer, without losing much predictively and with a gain in interpretability.↩︎
+
It is not uncommon in many disciplines to use different link functions for logistic models, but the log link is always used for Poisson models.↩︎
+
There is not strict agreement about what qualifies for being in the GLM family.↩︎
diff --git a/docs/knowing_models.html b/docs/knowing_models.html
index 586af50..0708ccc 100644
--- a/docs/knowing_models.html
+++ b/docs/knowing_models.html
@@ -7,7 +7,7 @@
-3 Knowing Your Model – [Models Demystified]{.smallcaps}
+4 Knowing Your Model – [Models Demystified]{.smallcaps}
@@ -1009,10 +1026,10 @@
-
-
3.2.1 Regression metrics
+
+
4.2.1 Regression metrics
The primary goal of our endeavor is to come up with a predictive model. The closer our model predictions are to the observed target values, the better our model is performing. As we saw in the above table, when we have a numeric target there are quite a few metrics that help us understand prediction-target correspondence, so let’s look at some of those.
-
But before we create a model to get us started, we are going to read in our data and then create two different splits within our data: a training set and a test set. In other words, we are going to partition our data so that we can train a model and then see how well that model performs with data it hasn’t seen1. For more on this process and the reasons why we do it, see Section 7.4 and Section 7.6. For now, we just need to know that assessing prediction error on the test set will give us a better estimate of our metric of choice.
+
But before we create a model to get us started, we are going to read in our data and then create two different splits within our data: a training set and a test set. In other words, we are going to partition our data so that we can train a model and then see how well that model performs with data it hasn’t seen1. For more on this process and the reasons why we do it, see Section 8.4 and Section 8.6. For now, we just need to know that assessing prediction error on the test set will give us a better estimate of our metric of choice.
@@ -1137,17 +1154,17 @@
-Figure 3.1: Observed vs. Predicted Ratings
+Figure 4.1: Observed vs. Predicted Ratings
Obviously, our points do not make a perfect line on the left, which would indicate perfect prediction. Also, the distribution of our values suggests we’re over predicting on the lower end and under predicting on the higher end of the target’s range. More generally, we’re not capturing the range of the observed values very well. But we’d like to determine how far off we are in a general sense. There are a number of metrics that can be used to measure this. We’ll go through a few of them here. In each case, we’ll demystify the calculations to make sure we understand what’s going on.
-
-
3.2.1.1 R-squared
+
+
4.2.1.1 R-squared
Anyone that has done linear regression has come across the \(R^2\) value. It is a measure of how well the model explains the variance in the target. One way to calculate it is as follows:
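\[
R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]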
where \(y_i\) is the observed value, \(\hat{y}_i\) is the predicted value, and \(\bar{y}\) is the mean of the observed values. The \(R^2\) value is basically a proportion of how much variance in the target (the denominator) is attributable to the model’s predictions (numerator). When applied to the training or observed data, it is a value between 0 and 1, with 1 indicating that the model explains all of the variance in the target.
More simply, \(R^2\) is the squared correlation of our predicted values and the target. In that sense it can almost always be useful as a descriptive measure, just like we use means and standard deviations in exploratory data analysis. However, it is not so great at telling us about predictive quality. Why? Take the predictions from our rating model, and add 10 to them, or make them all negative. In both cases your predictions would be ridiculous, but your \(R^2\) will be the same. Another problem is that for training data, \(R^2\) will always increase as you add more features to your model, whether they are useful or pure noise! This is why we use other metrics to assess predictive quality2.
@@ -1211,12 +1228,12 @@
-
3.2.1.2 Mean squared error
+
+
4.2.1.2 Mean squared error
One of the most common performance metrics for numeric targets is the mean squared error (MSE) and its square root, root mean squared error (RMSE). The MSE is the average of the squared differences between the predicted and actual values. It is calculated as follows:
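\[
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]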
MSE penalizes large errors more: since errors are squared, the larger the error, the larger the penalty. As mentioned, the root mean squared error (RMSE) is just the square root of the MSE, so it likewise penalizes large errors, but if you want a metric that is in the same units as the original target data, RMSE is the metric for you. It is calculated as follows:
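\[
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\]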
The mean absolute error (MAE) is the average of the absolute differences between the predicted and observed values. It is a great metric when all you really want to know is how far off your predictions are from the observed values, and it is not as sensitive to large errors as the MSE.
The mean absolute percentage error (MAPE) is the average of the absolute differences between the predicted and observed values, expressed as a percentage of the observed values. It is calculated as follows:
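\[
\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|
\]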
In the end, it won’t hurt to look at a few of these metrics to get a better idea of how well your model is performing. You will always be using these metrics to compare different models, so use a few of them to get a better sense of how your models perform relative to one another. Does adding a feature help drive down RMSE, indicating that the feature helps to reduce large errors? In other words, does adding complexity to your model provide a big reduction in error? If adding features doesn’t help reduce error, do you really need to include them in your model‽
-
-
3.2.2 Classification metrics
+
+
4.2.2 Classification metrics
Whenever we are classifying outcomes, we don’t have the same ability to compare a predicted score to an observed score – instead, we are going to use the predicted probability of an outcome, establish a cut-point for that probability, convert everything below that cut-point to 0, and then convert everything at or above that cut-point to 1. We can then compare a table of predicted versus target classes, typically called a confusion matrix3.
Let’s start with a model to predict whether a review is “good” or “bad”. We will use the same training and testing data that we created above. Explore the summary output if desired (not shown), but we will focus on the predictions and metrics.
@@ -1435,8 +1452,8 @@
-
3.2.2.1 Confusion matrix
+
+
4.2.2.1 Confusion matrix
The confusion matrix is a table that shows the number of correct and incorrect predictions made by the model. It’s easy enough to get one from scratch, but we recommend using a function that will give you a nice table, and possibly all of the metrics you need along with it. To get us started, we can use a package function that will take our predictions and observed target as input to create the basic table.
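As a minimal sketch of the kind of package function meant here, using scikit-learn's confusion_matrix with toy labels (the label values below are made up and stand in for the observed and predicted classes from a real model):

from sklearn.metrics import confusion_matrix

observed  = ['good', 'bad', 'good', 'good', 'bad', 'good']        # toy observed classes
predicted = ['good', 'bad', 'bad',  'good', 'good', 'good']       # toy predicted classes

cm = confusion_matrix(observed, predicted, labels = ['good', 'bad'])
print(cm)                                                         # rows = observed, columns = predicted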
-Table 3.2: Example of a Confusion Matrix
+Table 4.2: Example of a Confusion Matrix
@@ -1945,8 +1962,8 @@
-
3.2.2.2 Accuracy
+
+
4.2.2.2 Accuracy
Accuracy’s allure lies in its simplicity, and in the fact that we use it for so many things in our everyday affairs. Accuracy is the proportion of correct predictions made by the model. Of all the metrics used to assess the quality of classification, accuracy is the easiest to cheat. If you have any class imbalance (i.e., one class within the target has far more observations than the other), you can get a high accuracy by simply predicting the majority class all of the time! To get around the false sense of confidence that accuracy alone can promote, we can look at a few other metrics.
@@ -1965,20 +1982,20 @@
-
-
3.2.2.3 Sensitivity/Recall/True positive rate
+
+
4.2.2.3 Sensitivity/Recall/True positive rate
Sensitivity, also known as recall or the true positive rate, is the proportion of observed positives that are correctly predicted by the model. If your focus is on the positive class above all else, sensitivity is the metric for you.
-
-
3.2.2.4 Specificity/True negative rate
+
+
4.2.2.4 Specificity/True negative rate
Specificity, also known as the true negative rate, is the proportion of observed negatives that are correctly predicted as such. If you want to know how well your model will work with the negative class, specificity is a great metric.
-
-
3.2.2.5 Precision/Positive predictive value
+
+
4.2.2.5 Precision/Positive predictive value
Precision is the proportion of positive predictions that are correct, and is often a key metric in many business use cases. Though similar to sensitivity, precision focuses on positive predictions, while sensitivity focuses on observed positive cases.
-
-
3.2.2.6 Negative predictive value
+
+
4.2.2.6 Negative predictive value
The negative predictive value is the proportion of negative predictions that are correct, and is the complement to precision.
Let’s see how we’d do this ourselves. We’ll create a basic confusion matrix then extract the values to create the metrics we need.
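A minimal from-scratch sketch with toy 0/1 arrays (1 is the positive class); the values are made up for illustration rather than taken from the reviews model:

import numpy as np

y_obs  = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])                 # toy observed classes
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])                 # toy predicted classes

TP = np.sum((y_pred == 1) & (y_obs == 1))
TN = np.sum((y_pred == 0) & (y_obs == 0))
FP = np.sum((y_pred == 1) & (y_obs == 0))
FN = np.sum((y_pred == 0) & (y_obs == 1))

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)                                      # recall / true positive rate
specificity = TN / (TN + FP)                                      # true negative rate
precision   = TP / (TP + FP)                                      # positive predictive value
npv         = TN / (TN + FN)                                      # negative predictive value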
Earlier when we obtained the predicted class, and subsequently all the metrics based on it, we used a predicted probability value of 0.5 as a cutoff for a ‘good’ vs. a ‘bad’ rating, and this is usually the default if we don’t specify it explicitly. Assuming that this is the best for a given situation is actually a bold assumption on our part, and we should probably make sure that the cut-off value we choose is going to offer us the best result given the modeling context.
But what is the ‘best’ result? That’s going to depend on the situation. If we are predicting whether a patient has a disease, we might want to minimize false negatives, since if we miss the diagnosis, the patient could be in serious trouble. Meanwhile if we are predicting whether a transaction is fraudulent, we might want to minimize false positives, since if we flag a transaction as fraudulent when it isn’t, we could be causing a lot of trouble for the customer. In other words, we might want to maximize sensitivity or specificity, respectively.
Whatever we decide, we ultimately are just shifting the metrics around relative to one another. As an easy example, if we were to classify all of our observations as ‘good’, we would have a sensitivity of 1 because all good ratings would be classified correctly. However, our positive predictive value would not be 1, and we’d have a specificity of 0. No matter which cutpoint we choose, we are going to have to make a tradeoff between these metrics.
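To see the tradeoff in action, here is a minimal sketch that sweeps a few cut-points over simulated predicted probabilities (everything here is made up for illustration); as the cut-point rises, sensitivity tends to fall while specificity rises:

import numpy as np

rng   = np.random.default_rng(123)
p_hat = rng.uniform(size = 200)                                   # toy predicted probabilities
y_obs = rng.binomial(1, p_hat)                                    # toy observed classes

for cut in [0.3, 0.5, 0.7]:
    y_pred      = (p_hat >= cut).astype(int)
    sensitivity = np.mean(y_pred[y_obs == 1] == 1)
    specificity = np.mean(y_pred[y_obs == 0] == 0)
    print(f'cutoff {cut}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}')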
@@ -2609,7 +2626,7 @@
-Figure 3.2: ROC curve and AUC value
+Figure 4.2: ROC curve and AUC value
@@ -2674,8 +2691,8 @@
-
3.2.3 Model selection & comparison
+
+
4.2.3 Model selection & comparison
Another important way to understand our model is by looking at how it compares to other models in terms of performance, however we choose to define that. One common way we can do this is by comparing models based on the metric(s) of our choice, for example, with RMSE or AUC. Let’s see this in action for our regression model. Here we will compare three models: one with three features, our original model, and the three feature model with interactions with genre. Our goal will be to see how these perform on the test set based on RMSE.
-Table 3.4: RMSE for Different Models
+Table 4.4: RMSE for Different Models
@@ -3241,11 +3258,11 @@
Chapter 4. In other cases, we are selecting models through the process of cross-validation (Section 7.6), but the idea is largely the same in that we are comparing our current parameter estimates to other possibilities. We are always doing model selection and comparison, and as such we’ll be demonstrating these often.
+
Another thing to consider is that even with a single model, the model fitting procedure is always comparing a model with the current parameter estimates, or more generally the current objective function value, to a previous one with other parameter estimates or objective function value. In this case, our goal is model selection, or how we choose the best result from a single model. While this is an automatic process here, the details of how this actually happens are the focus of Chapter 5. In other cases, we are selecting models through the process of cross-validation (Section 8.6), but the idea is largely the same in that we are comparing our current parameter estimates to other possibilities. We are always doing model selection and comparison, and as such we’ll be demonstrating these often.
-
-
3.2.4 Model visualization
-
A key method for understanding how our model is performing is through visualization. You’ll recall that we started out way back by looking at the predicted values against the observed values to see if there was any correspondence (Figure 2.3), but another key way to understand our model is to look at the residuals, or errors in prediction, which again is the difference in our prediction versus the observed value. Here are a couple plots that can help us understand our model:
+
+
4.2.4 Model visualization
+
A key method for understanding how our model is performing is through visualization. You’ll recall that we started out way back by looking at the predicted values against the observed values to see if there was any correspondence (Figure 3.2), but another key way to understand our model is to look at the residuals, or errors in prediction, which again is the difference in our prediction versus the observed value. Here are a couple plots that can help us understand our model:
Residuals vs. Fitted: This type of plot shows predicted values vs. the residuals (or some variant of the residuals, like their square root). If you see a pattern, that potentially means your model is not capturing something in the data. For example, if you see a funnel shape, that would suggest that you are systematically having worse predictions for some part of the data. For some plots, patterns may suggest an underlying nonlinear relationship in the data is yet to be uncovered. For our main regression model, we don’t see any patterns which would indicate that the model has a notable prediction issue of some kind.
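A minimal sketch of such a plot with matplotlib, using toy fitted values and residuals in place of a real model's output:

import numpy as np
import matplotlib.pyplot as plt

rng       = np.random.default_rng(6)
fitted    = rng.normal(3, 0.5, 500)                               # toy fitted values
residuals = rng.normal(0, 0.4, 500)                               # toy residuals (no pattern expected)

plt.scatter(fitted, residuals, alpha = 0.3)
plt.axhline(0, linestyle = '--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()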
@@ -3257,7 +3274,7 @@
-Figure 3.3: Residuals vs. Fitted Plot
+Figure 4.3: Residuals vs. Fitted Plot
@@ -3274,7 +3291,7 @@
-Figure 3.4: MSE Loss over 25 epochs in an MLP model
+Figure 4.4: MSE Loss over 25 epochs in an MLP model
@@ -3291,7 +3308,7 @@
-Figure 3.5: Posterior Predictive Check for a regression model
+Figure 4.5: Posterior Predictive Check for a regression model
@@ -3375,17 +3392,17 @@
-
-
3.3 Understanding the Features
+
+
4.3 Understanding the Features
Assuming our model is adequate, let’s now turn our attention to the features. There’s a lot to unpack here, so let’s get started!
-
-
3.3.1 Basic model parameters
-
We saw in the linear model chapter (2.6.1) that we can get a lot out of the basic output from standard linear models. Our starting point should be the coefficients or weights, which can give us a sense of the direction and magnitude of the relationship between the feature and the target given their respective scales. We can also look at the standard errors and confidence intervals to get a sense of the uncertainty in those estimates. Here is a basic summary of the coefficients for our regression model on the training data.
+
+
4.3.1 Basic model parameters
+
We saw in the linear model chapter (3.4.1) that we can get a lot out of the basic output from standard linear models. Our starting point should be the coefficients or weights, which can give us a sense of the direction and magnitude of the relationship between the feature and the target given their respective scales. We can also look at the standard errors and confidence intervals to get a sense of the uncertainty in those estimates. Here is a basic summary of the coefficients for our regression model on the training data.
-Table 3.5: Coefficients for Our Regression Model
+Table 4.5: Coefficients for Our Regression Model
@@ -3955,8 +3972,8 @@
-
3.3.2 Feature contributions
+
+
4.3.2 Feature contributions
We can also look at the contribution of a feature to the model’s explanatory power, namely through its predictions. To start our discussion, we don’t want to lean too heavily on the phrase feature importance yet, because as we’ll see later, trying to rank features by an importance metric is difficult at best, and a misguided endeavor at worst. We can however look at the feature contribution to the model’s predictions, and we can come to a conclusion about whether we think a feature is practically important, but we just need to be careful about how we do it.
Truly understanding feature contribution is a bit more complicated than just looking at the coefficient if using any model that isn’t a linear regression, and there are many ways to go about it. We know we can’t compare raw coefficients across features, because they are on different scales. But even when we put them on the same scale, it may be very easy for some features to move, e.g., one standard deviation, and very hard for others. Binary features can only be on or off, while numeric features can move around more, but numeric features may also be highly skewed. We also can’t use statistical significance based on p-values, because they reflect sample size as much or more than effect size.
So what are we to do? What you need to know to get started looking at a feature’s contribution includes the following:
@@ -3969,8 +3986,8 @@
-
3.3.3 Marginal effects
+
+
4.3.3 Marginal effects
One way to understand the contribution of a feature to the model is to look at the marginal effect of the feature, which conceptually attempts to boil a feature effect down to something simple. Unfortunately, not everyone means the same thing when they use this term and it can be a bit confusing. Marginal effects typically refer to a partial derivative of the target with respect to the feature. Oh no! Math! However, as an example, this becomes very simple for standard linear models with no interactions and all linear effects as in linear regression. The derivative of the target with respect to the feature is just the coefficient itself! But for more complicated models, even just a classification model like our logistic regression, we need to do a bit more work to get the marginal effect, or other so-called average effects. Let’s think about a couple common versions:
Average slope, Average Marginal Effect
@@ -3978,8 +3995,8 @@
Marginal Means (for categorical features)
Counterfactuals and other predictions at key feature values
-
-
3.3.3.1 Marginal effects at the mean
+
+
4.3.3.1 Marginal effects at the mean
First let’s think about an average slope. This is the average of the slopes across the feature’s values or values of another feature it interacts with. But let’s just look at the effect of word count first. A good question is, how do we visualize that? Here are two plots; both are useful, neither is inherently wrong, and yet they each tell us something different. The first plot shows the predicted probability of a good review as word count changes, with all other features at their mean (or mode for categorical). The second plot shows what is called a partial dependence plot, which shows the average predicted probability of a good review as word count changes. In both cases we make predictions with imputed values: the left plot imputes the other features to be their mean or mode, while the right plot leaves the other features at their actual values, and then, using a range of values for word count, gets a prediction as if every observation had that value for word count. We then plot the average of the predictions for each value in the range.
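A minimal sketch of those two strategies, with a simulated logistic model standing in for the review model (the data, feature names, and coefficients below are made up for illustration):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df  = pd.DataFrame({
    'word_count': rng.poisson(10, 500),
    'age':        rng.normal(40, 10, 500),
})
df['good'] = rng.binomial(1, 1 / (1 + np.exp(-(-2 + 0.15 * df['word_count'] + 0.01 * df['age']))))

model = smf.logit('good ~ word_count + age', data = df).fit()
grid  = np.linspace(df['word_count'].min(), df['word_count'].max(), 25)

# marginal effect at the mean: hold the other feature(s) at their mean (or mode)
at_mean = pd.DataFrame({'word_count': grid, 'age': df['age'].mean()})
mem     = model.predict(at_mean)

# partial dependence: give *every* observation each grid value, predict, then average
pdp = [model.predict(df.assign(word_count = g)).mean() for g in grid]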
@@ -3989,7 +4006,7 @@
-Figure 3.6: Marginal Effect at the Mean vs. Partial Dependence Plot
+Figure 4.6: Marginal Effect at the Mean vs. Partial Dependence Plot
@@ -3997,8 +4014,8 @@
-
3.3.3.2 Average marginal effects
+
+
4.3.3.2 Average marginal effects
Let’s say we want to boil our understanding of the effect to a single number. In this case, the coefficient is fine if we’re dealing with an entirely linear model. In this classification case, the raw coefficient tells us what we need to know, but on the log odds scale, which is not very intuitive for most folks. We can understand the probability scale, but this means things get nonlinear. As an example, a .1 to .2 change in the probability is doubling it, while a .8 to .9 change is a 12.5% increase in the probability. But is there any way we can stick with probabilities and get a single value to understand the change in the probability of a good review as word count changes by 1 unit?
Yes! We can look at what’s called the average marginal effect of word count. This is the average slope of the predicted probability of a good review as word count changes. It is a bit more complicated than just looking at the coefficient, but it’s also more intuitive. How do we get it? By a neat little trick where we predict the target at two feature values: one where the observed value is shifted up by a very small amount, and one where it is shifted down by the same amount. Then we take the difference in those predictions relative to the size of the shift. This amounts to taking the derivative of the target with respect to the feature.
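A minimal sketch of that trick with a small simulated logistic model (the data and names are made up); the shift size h is arbitrary, just very small:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df  = pd.DataFrame({'word_count': rng.poisson(10, 500)})
df['good'] = rng.binomial(1, 1 / (1 + np.exp(-(-1.5 + 0.15 * df['word_count']))))
model = smf.logit('good ~ word_count', data = df).fit()

h    = 1e-4                                                       # a very small shift
p_lo = model.predict(df.assign(word_count = df['word_count'] - h))
p_hi = model.predict(df.assign(word_count = df['word_count'] + h))
ame  = np.mean((p_hi - p_lo) / (2 * h))                           # average slope on the probability scale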
@@ -4071,8 +4088,8 @@
-
3.3.3.3 Marginal means
+
+
4.3.3.3 Marginal means
Marginal means are just about getting an average prediction for the levels of categorical features. As an example, we can get the average predicted probability of a good review for each level of the genre feature, and then compare them. To do this we just have to make predictions as if every observation had a certain value for genre, and then average the predictions. This is also the exact same approach that produced the PDP for word count we saw earlier.
-Table 3.6: Average Predicted Probability of a Good Review by Genre
+Table 4.6: Average Predicted Probability of a Good Review by Genre
@@ -4621,8 +4638,8 @@
-
-
3.3.4 Counterfactual predictions
+
+
4.3.4 Counterfactual predictions
The nice thing about having a model is that we can make predictions for any set of feature values we want to explore. This is a great way to understand the contribution of a feature to the model. We can make predictions for a range of feature values, and then compare the predictions to see how much the feature contributes to the model. Counterfactual predictions allow us to ask “what if?” questions of our model, and see how it responds. As an example, we can get a prediction as if every review was made for a drama, and then see what we’d expect if every review pertained to a comedy. This is a very powerful approach, and often utilized in causal inference, but it’s also a great way to understand the contribution of a feature to a model in general.
Consider an experimental setting where we have lots of control over how the data is produced for different scenarios. Ideally we’d be able to look at the same instances when everything about them was identical, except that in one case the instance was part of the control group, and in another, part of the treatment group. Unfortunately, not only is it impossible to have everything be identical, but it’s also impossible to have the same instance be in two experimental group settings at the same time. Counterfactual predictions are the next best thing though, because once we have a model, we can predict an observation as if it were in the treatment group, and then as if it were in the control group. If we do this for all observations, we can get a sense of the average treatment effect, one of the main points of interest in causal inference.
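A minimal sketch of those two counterfactual predictions with a simulated treatment/control setup (the data, names, and effect size are made up for illustration):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
df  = pd.DataFrame({
    'treatment': rng.binomial(1, 0.5, 1000),
    'age':       rng.normal(40, 10, 1000),
})
df['y'] = 1 + 0.5 * df['treatment'] + 0.02 * df['age'] + rng.normal(0, 1, 1000)

model = smf.ols('y ~ treatment + age', data = df).fit()

pred_treated = model.predict(df.assign(treatment = 1))            # everyone 'as if' treated
pred_control = model.predict(df.assign(treatment = 0))            # everyone 'as if' control
ate = np.mean(pred_treated - pred_control)                        # average treatment effect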
@@ -4729,7 +4746,7 @@
-Table 3.7: Predictions for Happiness Score for Russia and the US with Switched Freedom and GDP
+Table 4.7: Predictions for Happiness Score for Russia and the US with Switched Freedom and GDP
@@ -5207,8 +5224,8 @@
-
3.3.5 SHAP values
+
+
4.3.5 SHAP values
As we’ve suggested, most models are more complicated than can be explained by a simple coefficient, e.g. nonlinear effects in generalized additive models, or there may not even be feature-specific coefficients available, like gradient boosting models, or we may even have many parameters associated with a feature, as in deep learning. Such models typically won’t come with statistical output like standard errors and confidence intervals either. But we’ll still have some tricks up our sleeve to help us figure things out!
A very common interpretation tool is called a SHAP value. SHAP stands for SHapley Additive exPlanations, and it provides a means to understand how much each feature contributes to a specific prediction. It’s based on a concept from game theory called the Shapley value, which is a way to understand how much each player contributes to the outcome of a game. For our modeling context, SHAP values break down a prediction to show the impact of each feature. The reason we bring it up here is that it has a nice intuition in the linear model case, and seeing it in that context is a good way to get a sense of how it works. Furthermore, it builds on what we’ve been talking about with our various prediction approaches.
While the actual computations behind the scenes can be tedious, the basic idea is relatively straightforward: for a given prediction at a specific observation with set feature values, we can calculate the difference between the prediction at that observation and the average prediction for the model as a whole. We can break this down for each feature, and see how much each contributes to the difference. This provides the local effect of the feature. The SHAP approach also has the benefit of being applicable to any model, whether a simple linear model or a deep learning one. Very cool! To demonstrate, we’ll use the simple model from our model comparison demo, but keep the features on the raw scale.
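For the linear regression case, the intuition can be written down directly: a feature's SHAP value at an observation works out to its coefficient times the feature's deviation from its mean, and those pieces sum to the gap between that observation's prediction and the average prediction. A minimal sketch with simulated data (the names and values are made up, not the book's model):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df  = pd.DataFrame({
    'word_count': rng.poisson(10, 500),
    'age':        rng.normal(40, 10, 500),
})
df['rating'] = 2 + 0.1 * df['word_count'] + 0.02 * df['age'] + rng.normal(0, 0.5, 500)

model = smf.ols('rating ~ word_count + age', data = df).fit()
obs   = df.iloc[[0]]                                              # one observation of interest

shap_values = {
    col: model.params[col] * (obs[col].iloc[0] - df[col].mean())
    for col in ['word_count', 'age']
}
gap = model.predict(obs).iloc[0] - model.predict(df).mean()
print(shap_values, gap)                                           # the SHAP values sum to the gap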
@@ -5409,6 +5426,8 @@
-Table 3.8: SHAP Value Comparison
+Table 4.8: SHAP Value Comparison
We’ve previously seen how we can get plots for predictions in different ways with what’s called a partial dependence plot (Figure 3.6). A PDP shows the average prediction across a feature's values, which is in fact what we were just doing to calculate our SHAP value, and for the linear case, the PDP has a direct correspondence to SHAP. As we saw, the SHAP value is the difference between the average prediction and the point on the PDP for a feature at a specific feature value. With regard to the PDP, this is the difference between the point on the PDP and the average prediction for the model at that feature value, shown in the red line below.
+
+
4.3.6 Related visualizations
+
We’ve previously seen how we can get plots for predictions in different ways with what’s called a partial dependence plot (Figure 4.6). A PDP shows the average prediction across a feature's values, which is in fact what we were just doing to calculate our SHAP value, and for the linear case, the PDP has a direct correspondence to SHAP. As we saw, the SHAP value is the difference between the average prediction and the point on the PDP for a feature at a specific feature value. With regard to the PDP, this is the difference between the point on the PDP and the average prediction for the model at that feature value, shown in the red line below.
-Figure 3.8: PDP, ICE, and ALE Plots
+Figure 4.8: PDP, ICE, and ALE Plots
@@ -5985,8 +6006,8 @@
-
3.3.7 Global assessment of feature importance
+
+
4.3.7 Global assessment of feature importance
How important is a feature? It’s a common question, and one that is often asked of models, but the answer ranges from ‘it depends’ to ‘it doesn’t matter’. Let’s start with some hard facts:
There is no single definition of importance for any given model.
@@ -6011,18 +6032,18 @@
-Figure 3.9: Two plots showing the importance of understanding feature interactions and non-linear relationships
+Figure 4.9: Two plots showing the importance of understanding feature interactions and non-linear relationships
All this is to say, as we get into measures of feature importance, we need to be very careful about how we interpret and use them!
-
-
3.3.7.1 Example: Feature Importance for Linear Regression
+
+
4.3.7.1 Example: Feature Importance for Linear Regression
To show just how difficult measuring feature importance is, we only have to stick with our simple linear regression. Think again about \(R^2\): it tells us the proportion of the target's variance explained by our features. An ideal measure of importance would be able to tell us how much each feature contributes to that proportion, or in other words, one that decomposes \(R^2\) into the relative contributions of each feature. One of the most common measures of importance in linear models is the standardized coefficient we have demonstrated previously. You know what it doesn’t do? It doesn’t decompose \(R^2\) into relative feature contributions. Even the more complicated SHAP approach will not do this.
The easiest situation we could hope for with regard to feature importance is the basic linear regression model we’ve been using. Everything is linear, with no interactions or other things going on, as in our demonstration model. And yet there are many logical ways to determine feature importance, and some even break down \(R^2\) into relative contributions, but they won’t necessarily agree with each other in ranking or relative differences. Even if you can get a measure of statistical difference for whatever metric you choose, it’s often the case that the ‘top’ features will not be statistically different from the others. So what do we do? We’ll show a few methods here, but the main point is that there is no single answer, and it’s important to understand what you’re trying to do with the information.
-
Let’s start things off by using one of our previous linear regression models with several features, but which has no interactions or other complexity (Section 2.7.1). It’s just a model with simple linear relationships and nothing else. We even remove categorical features to avoid having to aggregate group effects. In short, it doesn’t get any easier than this!
+
Let’s start things off by using one of our previous linear regression models with several features, but which has no interactions or other complexity (Section 3.5.1). It’s just a model with simple linear relationships and nothing else. We even remove categorical features to avoid having to aggregate group effects. In short, it doesn’t get any easier than this!
-Figure 3.11: Feature importance by various methods
+Figure 4.11: Feature importance by various methods
@@ -6108,24 +6129,24 @@
-
3.3.8 Feature metrics for classification
+
+
4.3.8 Feature metrics for classification
All of what’s been demonstrated for feature metrics applies to classification models. Counterfactual predictions, average marginal effects, SHAP, and permutation-based methods for feature importance would be done in the exact same way. The only real difference is the reference target - our results would be in terms of probabilities and using a different loss metric to determine importance, that sort of thing. But in general, everything we’ve talked about holds for classification settings as well.
-
-
3.4 Wrapping Up
+
+
4.4 Wrapping Up
It is easy to get caught up in the excitement of creating a model and then using it to make predictions. It is also easy to get caught up in the excitement of seeing a model perform well on a test set. It is much harder to take a step back and ask yourself, “Is this model really doing what I want it to do?” You should always be looking at which features are pulling the most weight in your model and how predictions are being made. It takes a lot of work to trust what a model is telling you.
-
-
3.4.1 The common thread
+
+
4.4.1 The common thread
Much of what you’ve seen in this chapter can be applied to any model. From linear regression to deep learning, we often use similar metrics to help select and compare models. In most model scenarios, but not all, we are interested in feature-level interpretation. This can take a statistical focus, or use tools that are more model-agnostic like SHAP. In almost every modeling scenario, we are usually interested in how the model is making its predictions, and how we can use that information to make better decisions.
-
-
3.4.2 Choose your own adventure
-
If you haven’t already, feel free to take your linear models further in Chapter 6 and Chapter 5, where you’ll see how to handle different distributions for your target, add interactions, nonlinear effects, and more. Otherwise, you’ve got enough at this point to try your hand at the Chapter 7 section, where you can dive into machine learning!
+
+
4.4.2 Choose your own adventure
+
If you haven’t already, feel free to take your linear models further in Chapter 7 and Chapter 6, where you’ll see how to handle different distributions for your target, add interactions, nonlinear effects, and more. Otherwise, you’ve got enough at this point to try your hand at the Chapter 8 section, where you can dive into machine learning!
-
-
3.4.3 Additional resources
+
+
4.4.3 Additional resources
If this chapter has piqued your curiosity, we would encourage you to check out the following resources.
Even though we did not use the mlr3 package in this chapter, the Evaluation and Benchmarking chapter of the associated book, Applied Machine Learning Using mlr3 in R, offers a great conceptual take on model metrics and evaluation.
For a more Pythonic look at model evaluation, we would highly recommend going through the scikit-learn documentation on Model Evaluation. It has you absolutely covered with code examples and concepts.
@@ -6134,8 +6155,8 @@
The marginal effects zoo, written by the marginaleffects package author, is your go-to for getting started with marginal effects, but we also recommend the excellent blog post by Andrew Heiss as a very nifty overview and demonstration.
-
-
3.5 Exercise
+
+
4.5 Exercise
Use the world happiness data set to create a model of your choosing that predicts happiness score. Create a simpler or more complex model for comparison. Interpret the model as a whole, then compare it to the simpler/more complex model. Choose the ‘best’ model, and justify your reason for doing so. Given that model, explore the predictions and interpret the features. Can you comfortably claim which is the most important? Why or why not?
So what is the effect of children in the home? Or a particular genre, for that matter? This is a problematic question, because the effect of one feature depends on the setting of the other feature. We can say what the effect of a feature is on average across the settings of the other features. This is called the average marginal effect2, something we’ve talked about in Section 3.3.3.2. We can compute this by averaging the effect of a feature across the values of the other features.
+
+
7.2.1 Average effects
+
So what is the effect of children in the home? Or a particular genre, for that matter? This is a problematic question, because the effect of one feature depends on the setting of the other feature. We can say what the effect of a feature is on average across the settings of the other features. This is called the average marginal effect2, something we’ve talked about in Section 4.3.3.2. We can compute this by averaging the effect of a feature across the values of the other features.
-Table 6.2: Average Marginal Effects of Children in the Home
+Table 7.2: Average Marginal Effects of Children in the Home
-
+
@@ -1544,12 +1561,12 @@
Section 3.3.5), attempt to boil down the effect of a feature to a single number, but this is difficult even in the simpler GLM settings, and can be downright misleading in more complex settings like our interaction model. Here we see the average coefficient for children in the home is 0.15, but we saw in Table 6.1 that this is slightly larger than what we would estimate in the non-interaction model, and we saw in Figure 6.1 it’s actually near zero (flat) for some genres. So what is the average effect really telling us? Consider a more serious case of drug effects across demographic groups, where the effect of the drug is much stronger for some groups than others. Would you want your doctor to prescribe you a drug based on the average effect across all groups, or the specific group to which you belong?
+
So-called marginal effects, and related approaches such as SHAP values (see Section 4.3.5), attempt to boil down the effect of a feature to a single number, but this is difficult even in the simpler GLM settings, and can be downright misleading in more complex settings like our interaction model. Here we see the average coefficient for children in the home is 0.15, but we saw in Table 7.1 that this is slightly larger than what we would estimate in the non-interaction model, and we saw in Figure 7.1 it’s actually near zero (flat) for some genres. So what is the average effect really telling us? Consider a more serious case of drug effects across demographic groups, where the effect of the drug is much stronger for some groups than others. Would you want your doctor to prescribe you a drug based on the average effect across all groups, or the specific group to which you belong?
When dealing with interactions in a model, it’s best to consider how a feature’s effect changes based on the values of other features it interacts with. Visualizing these effects can help us understand how the relationships change. It’s also helpful to consider what the predicted outcome is at important feature values, and how this changes with different feature values. This is the approach we’ve used with interactions, and it’s a good strategy overall.
-
-
6.2.2 ANOVA
-
A common method for summarizing categorical effects in linear models is through analysis of variance or ANOVA, something we briefly mentioned in our chapter introducing the linear model Section 2.7.2.1. ANOVA breaks down the variance in a target attributable to different features or their related effects such as interactions. It’s a bit beyond the scope here to get into all the details, but we demonstrate it here.
+
+
7.2.2 ANOVA
+
A common method for summarizing categorical effects in linear models is through analysis of variance or ANOVA, something we briefly mentioned in our chapter introducing the linear model Section 3.5.2.1. ANOVA breaks down the variance in a target attributable to different features or their related effects such as interactions. It’s a bit beyond the scope here to get into all the details, but we demonstrate it here.
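A minimal sketch of such a breakdown, using statsmodels' anova_lm on simulated data with a categorical feature and an interaction (the data and names are made up for illustration, not the book's reviews data):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(7)
df  = pd.DataFrame({
    'genre':  rng.choice(['Comedy', 'Drama'], 400),
    'length': rng.normal(110, 20, 400),
})
df['rating'] = 3 + 0.01 * df['length'] + (df['genre'] == 'Drama') * 0.3 + rng.normal(0, 0.5, 400)

fit = smf.ols('rating ~ length * genre', data = df).fit()
print(anova_lm(fit, typ = 2))                                     # variance attributed to each term and the interaction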
-Table 6.3: ANOVA table for interaction model
+Table 7.3: ANOVA Table for an Interaction Model
-
+
@@ -2067,11 +2084,11 @@
It’s worth noting that ANOVA is often confused with being a model itself. When people use it this way, it is just a linear regression with only categorical features, something that is usually only seen within strict experimental designs that do not have interactions with continuous features. It’s pretty difficult to think of a linear regression setting where no continuous features would be of interest, but back when people were doing this stuff by hand, they just categorized everything to enable doing an ANOVA, which was tedious arithmetic but manageable. It’s a bit of a historical artifact, but might be useful for exploratory purposes. Other approaches are a little more general or not confined to nested models – ones that can be seen as subsets of another. An example is using AIC or the other methods employed in machine learning.
-
-
6.3 Mixed Models
-
-
6.3.1 Knowing your data
-
As much fun as modeling is, knowing your data is far more important. You can throw any model you want at your data, from simple to fancy, but you can count on disappointment if you don’t fundamentally know the structure that lies within your data. Let’s take a look at the following visualizations. In Figure 6.2, we see a positive relationship between the length of the movie and ratings.
+
+
7.3 Mixed Models
+
+
7.3.1 Knowing your data
+
As much fun as modeling is, knowing your data is far more important. You can throw any model you want at your data, from simple to fancy, but you can count on disappointment if you don’t fundamentally know the structure that lies within your data. Let’s take a look at the following visualizations. In Figure 7.2, we see a positive relationship between the length of the movie and ratings.
@@ -2080,7 +2097,7 @@
-Figure 6.2: Linear relationship between length of movie and rating.
+Figure 7.2: Linear Relationship Between Length of Movie and Rating
@@ -2095,13 +2112,13 @@
-Figure 6.3: Genre Effects on Length and Rating
+Figure 7.3: Genre Effects on Length and Rating
-
A very quick examination of Figure 6.3 might suggest that the rating varies by genre, and that the relationship between length and rating varies significantly over the different genres. The group means in the right panel show variability across genre. In addition, on the left panel, some genres show a strong positive relationship, some show less of a positive relationship, a couple even show a negative relationship, and one even looks flat. We can also see that they would have different intercepts. This is a very important thing to know about your data! If we had just run a model with length as a feature and nothing else, we would have missed this important information.
+
A very quick examination of Figure 7.3 might suggest that the rating varies by genre, and that the relationship between length and rating varies significantly over the different genres. The group means in the right panel show variability across genre. In addition, on the left panel, some genres show a strong positive relationship, some show less of a positive relationship, a couple even show a negative relationship, and one even looks flat. We can also see that they would have different intercepts. This is a very important thing to know about your data! If we had just run a model with length as a feature and nothing else, we would have missed this important information.
Now consider something a bit more complicated. Here is a plot of the relationship between the length of a movie and the rating, but across release year. Again we might think there is notable variability in the effect across years, as some slopes are positive, some very strongly positive, and some are even negative. How can we account for this?
@@ -2111,15 +2128,15 @@
-Figure 6.4: Release Year Effects on Length and Rating
+Figure 7.4: Release Year Effects on Length and Rating
-
-
6.3.2 Overview of mixed models
+
+
7.3.2 Overview of mixed models
What we’ve just seen might initially bring to mind an interaction effect, and that’s the right way to think about it! A mixed model can be used to incorporate that type of relationship, which we can think of as a group interaction, into our model without much hassle and with some additional explainability. But it’s actually a quite flexible class of models that can also allow for more complicated, related types of effects.
Before going too much further, the term mixed model is as vanilla as we can possibly make it, but you might have heard of different names such as hierarchical linear models, or multilevel models, or maybe mixed-effects models tossed around before. Maybe you’ve even been exposed to ideas like random effects or random slopes. These are in fact all instances of what we’re calling a mixed model.
What makes a model a mixed model? The mixed model is characterized by the idea that a model can have fixed effects and random effects. Fortunately, you’ve already encountered fixed effects – those are the features that we have been using in all of our models so far! We are assuming a single true parameter (e.g. coefficient/weight) for each of those features to estimate, and that parameter is fixed.
A simple random intercept and slope is just the start. As an example, we can let the intercepts and slopes correlate, and we could have multiple grouping factors contribute, as well as allowing other features and even interactions themselves to vary by group! This is where mixed models can get quite complex, but the basic idea is still the same: we are allowing parameters to vary across groups, and we are estimating the variance of those parameters.
-
-
6.3.3 Using a mixed model
+
+
7.3.3 Using a mixed model
One of the key advantages of a mixed model is that we can use it when the observations within a group are not independent. This is a very common situation in many fields, and it’s a good idea to consider this when you have grouped data. As an example we’ll use the happiness data for all available years, and we’ll consider the country as a grouping variable. In this case, observations within a country are likely to be more similar to each other than to observations from other countries. This is a classic example of when to use a mixed model. This is also a case where we wouldn’t just throw in country as a feature like any other factor, since there are 164 countries in the data.
In general, to use mixed models we have to specify a group effect for the categorical feature of focus, but that’s the primary difference from our previous approaches used for linear or generalized linear models. For our example, we’ll look at a model with a random intercept for country, and one that adds a random slope for the yearly trend across countries. This means that we are allowing the intercepts and slopes to vary across countries, and the intercepts and slopes can correlate with one another. Furthermore, by recoding year to start at zero, the intercept will represent the happiness score at the start of the data. In addition, to see a more reasonable effect, we also divide the yearly trend by 10, so the coefficient provides the change in happiness score per decade.
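A minimal sketch with statsmodels' mixedlm, using simulated country-by-year data in place of the real happiness data (the column names, group count, and values are all made up); the lme4-style equivalent would be roughly happiness_score ~ decade + (1 + decade | country):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng  = np.random.default_rng(4)
rows = []
for c in [f'country_{i}' for i in range(30)]:
    a = rng.normal(5.5, 0.7)                                      # country-specific intercept
    b = rng.normal(0.2, 0.3)                                      # country-specific decade trend
    for year in range(2008, 2023):
        decade = (year - 2008) / 10
        rows.append({'country': c, 'decade': decade,
                     'happiness_score': a + b * decade + rng.normal(0, 0.3)})
happiness = pd.DataFrame(rows)

model_ran_slope = smf.mixedlm(
    'happiness_score ~ decade',
    data       = happiness,
    groups     = 'country',
    re_formula = '~decade',                                       # random intercept and slope, allowed to correlate
).fit()
print(model_ran_slope.summary())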
@@ -2204,35 +2221,35 @@
-
Table 6.4 shows some typical output from a mixed model, focusing on the random slope model. The fixed effect part (Fixed) is your basic GLM result and interpreted the same way. Nothing new there, and we can see a slight positive decade trend in happiness, though maybe not a strong one. But the random effects (Random) are where the action is! We can see the standard deviation of the random effects, i.e., the intercepts and slopes. We can also see the residual (observation level) standard deviation, which is conceptually the same as what you saw with standard linear regression. We can also see the correlation between the random intercepts and random slopes. Depending on your tool, the default may be in terms of variances and covariances rather than standard deviations and correlations, but otherwise the same.
+
Table 7.4 shows some typical output from a mixed model, focusing on the random slope model. The fixed effect part (Fixed) is your basic GLM result and interpreted the same way. Nothing new there, and we can see a slight positive decade trend in happiness, though maybe not a strong one. But the random effects (Random) are where the action is! We can see the standard deviation of the random effects, i.e., the intercepts and slopes. We can also see the residual (observation level) standard deviation, which is conceptually the same as what you saw with standard linear regression. We can also see the correlation between the random intercepts and random slopes. Depending on your tool, the default may be in terms of variances and covariances rather than standard deviations and correlations, but otherwise the same.
-Table 6.4: Mixed model results
+Table 7.4: Mixed Model Results
-
+
@@ -2745,7 +2762,7 @@
-Figure 6.5: Estimated random effects
+Figure 7.5: Estimated Random Effects
-Figure 6.6: Random effects for mixed model
+Figure 7.6: Random Effects for Mixed Model
@@ -2816,32 +2833,32 @@
-
-
6.3.4 Mixed model summary
+
+
7.3.4 Mixed model summary
For a model with just one feature, we certainly had a lot to talk about! And this is just a glimpse of what mixed models have to offer, and the approach can be even richer than what we’ve just seen. But you might be asking- Why don’t I just put genre into the model like other categorical features? In the case of genre for movie reviews where there are few groups, that’s okay. But doing that with features with a lot of levels would typically result in estimation issues due to having so many parameters to estimate. In general mixed models provide several advantages for the data scientist:
Any coefficient can be allowed to vary by groups, including other random effects. It actually is just an interaction in the end as far as the linear predictor and conceptual model is concerned.
-
The group-specific effects are penalized, which shrinks them toward the overall mean, and makes this a different approach from just adding a ‘mere interaction’. This helps avoid overfitting, and that penalty is related to the variance estimate of the random effect. In other words, you can think of it as running a penalized linear model where the penalty is applied to the group-specific effects (see Section 4.8).
+
The group-specific effects are penalized, which shrinks them toward the overall mean, and makes this a different approach from just adding a ‘mere interaction’. This helps avoid overfitting, and that penalty is related to the variance estimate of the random effect. In other words, you can think of it as running a penalized linear model where the penalty is applied to the group-specific effects (see Section 5.8).
Also unlike standard interaction approaches, we can estimate the covariance of the random effects, and specify different covariance structures for observations within groups.
Because of the way they are estimated, mixed models can account for lack of independence of observations7, which is a common issue in many datasets. This is especially important when you have repeated measures, or when you have a hierarchical structure in your data, such as students within schools, or patients within hospitals.
Standard modeling approaches can estimate these difficult models very efficiently, even with thousands of groups.
-
The group effects are like a very simplified embedding (Section 10.2.2), where we have taken a categorical feature and turned it into a numeric one, like those shown in Figure 6.5. This may help you understand other embedding techniques that are used in other places like deep learning if you think of this as the simplest embedding approach.
-
When you start to think about random effects and/or distributions for effects, you’re already thinking like a Bayesian (Section 4.11.2), who is always thinking about the distributions for various effects. Mixed models are the perfect segue from standard linear model estimation to Bayesian estimation, where everything is random.
+
The group effects are like a very simplified embedding (Section 10.2.2), where we have taken a categorical feature and turned it into a numeric one, like those shown in Figure 7.5. This may help you understand other embedding techniques that are used in other places like deep learning if you think of this as the simplest embedding approach.
+
When you start to think about random effects and/or distributions for effects, you’re already thinking like a Bayesian (Section 5.11.2), who is always thinking about the distributions for various effects. Mixed models are the perfect segue from standard linear model estimation to Bayesian estimation, where everything is random.
The random effect is akin to a latent variable of ‘unspecified group causes’. This is a very powerful idea that can be used in many different ways, but importantly, you might want to start thinking about how you can figure out what those ‘unspecified’ causes may be!
Group effects will almost always improve your model’s performance relative to not having them, especially if you weren’t including those groups in your model because of how many groups there were.
In short, mixed models are a fun way to add interpretive color to your model, while also getting several additional benefits that help you understand your data!
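To make this concrete, a minimal random intercept model for the reviews data might look like the following sketch in Python, assuming statsmodels and a genre column in the data.
import pandas as pd
import statsmodels.formula.api as smf

df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')

# random intercept by genre ('genre' column assumed); word_count stays a 'fixed' effect
model_mixed = smf.mixedlm('rating ~ word_count', data = df_reviews, groups = 'genre')
result_mixed = model_mixed.fit()

result_mixed.summary()         # fixed effects and variance components
result_mixed.random_effects    # the shrunken group-specific effects
The random_effects attribute holds exactly the penalized, group-specific deviations described in the list above.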
-
-
6.4 Generalized Additive Models
+
+
7.4 Generalized Additive Models
-
-
6.4.1 When straight lines aren’t enough
+
+
7.4.1 When straight lines aren’t enough
Fitting a line through your data isn’t always going to be the best approach. Not every relationship is linear and not every relationship is monotonic. Sometimes you need to be able to model a relationship that has a fair amount of nonlinearity, one that can appear as a slight curve, a wave, or any other type of wiggle that you can imagine.
In other words, we can go from the straight line here:
@@ -2852,7 +2869,7 @@
-Figure 6.7: A standard linear effect
+Figure 7.7: A Standard Linear Relationship
@@ -2867,13 +2884,13 @@
-Figure 6.8: A curvilinear effect
+Figure 7.8: A Curvilinear Relationship
-
That curved line in Figure 6.8 is called a spline. It is created by a feature and expanding it to multiple columns, each of which is a function of the original feature. We then fit a model to that data as usual. What this ultimately means is that we can use a linear model to fit a curve through the data. While this might not give us the same tidy explanation that a typical line would offer, we will certainly get better predictions if it’s appropriate, and a better understanding of the reality and complexity of the true relationship. But often it’s useful for exploratory purposes, and visualization tools like ggplot, plotly8 and others make it easy. Here is some demonstration code (result not shown).
+
That curved line in Figure 7.8 is called a spline. It is created by taking a feature and expanding it into multiple columns, each of which is a function of the original feature. We then fit a model to that data as usual. What this ultimately means is that we can use a linear model to fit a curve through the data. While this might not give us the same tidy explanation that a typical line would offer, we will certainly get better predictions if it’s appropriate, and a better understanding of the reality and complexity of the true relationship. But often it’s useful for exploratory purposes, and visualization tools like ggplot, plotly8 and others make it easy. Here is some demonstration code (result not shown).
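As one possible sketch of such a demonstration, plotly express can add a smooth trend to a scatterplot in a single call; here we assume the happiness data is already loaded, the column names are placeholders, and a loess smooth stands in for the spline.
import plotly.express as px

# df_happiness_2018 assumed already loaded; column names are placeholders
fig = px.scatter(
    df_happiness_2018,
    x = 'life_expectancy',
    y = 'happiness_score',
    trendline = 'lowess',   # a loess smooth stands in for the spline here
)
fig.show()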
In this case, each \(X_j\) is a matrix of the feature and its basis expansion, and the \(\beta_j\) are the coefficients for each of those basis expansion columns. But a specific X could also just be a single feature and its coefficient, to model a linear relationship.
The nice thing is that you don’t have to worry about the details of the basis expansion; the package you choose will take care of that for you. You’ll have different options, and often the default is fine, but sometimes you’ll want to adjust the technique and how ‘wiggly’ you want the curve to be.
-
-
6.4.2 A standard GAM
-
Now that you have some background, let’s give this a shot! In most respects, we can use the same sort of approach as we did with our other linear model examples. For our example here, we’ll use the model that was depicted in Figure 6.8, which looks at the relationship between life expectancy and happiness score from the world happiness data (2018). Results are simplified in the subsequent table.
+
+
7.4.2 A standard GAM
+
Now that you have some background, let’s give this a shot! In most respects, we can use the same sort of approach as we did with our other linear model examples. For our example here, we’ll use the model that was depicted in Figure 7.8, which looks at the relationship between life expectancy and happiness score from the world happiness data (2018). Results are simplified in the subsequent table.
smoother = bs, data = df_happiness_2018)
-
+gam_happiness_result = gam_happiness.fit()
+gam_happiness_result.summary()
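For reference, a fuller sketch of this kind of model with statsmodels might look like the following; the column name is a placeholder and the basis settings are only illustrative.
from statsmodels.gam.api import GLMGam, BSplines

# B-spline basis expansion for the smooth term ('life_expectancy' is a placeholder name)
bs = BSplines(df_happiness_2018[['life_expectancy']], df = [6], degree = [3])

# the parametric part is just an intercept; the smooth term enters via `smoother`
gam_happiness = GLMGam.from_formula(
    'happiness_score ~ 1',
    smoother = bs,
    data = df_happiness_2018,
)

gam_happiness_result = gam_happiness.fit()
gam_happiness_result.summary()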
@@ -2974,30 +2991,30 @@
-Table 6.5: GAM model output
+Table 7.5: GAM Model Results
-
+
@@ -3495,7 +3512,7 @@
-Figure 6.9: Visualizing a GAM
+Figure 7.9: Visualizing a GAM
@@ -3521,15 +3538,15 @@
-
-
6.5 Quantile Regression
+
+
7.5 Quantile Regression
People generally understand the concept of the arithmetic mean, or ‘average’. You see it sometime during elementary school, it gets tossed around in daily language, and it is statistically important. After all, so many distributions depend on it! Why, though, do we feel so tied to it from a regression modeling perspective? Yes, it has handy features, but it can limit the relationships we can otherwise model effectively. Here we’ll show you what to do when the mean betrays you, and trust us, the mean will betray you at some point!
-
-
6.5.1 When the mean breaks down
+
+
7.5.1 When the mean breaks down
In a perfect data world, we like to assume the mean is equal to the middle observation of the data: the median. But that only holds when things are symmetric, and usually our data comes loaded with challenges. Skewness and even just a few extreme scores in your data may cause a rift between the median and the mean.
Let’s say we take the integers 1 through 10 and find the mean.
\[\frac{1+2+3+4+5+6+7+8+9+10}{10} = 5.5\]
@@ -3547,7 +3564,7 @@
-Figure 6.10: Linear line without extreme scores
+Figure 7.10: Linear Relationship without Extreme Scores
@@ -3562,7 +3579,7 @@
-Figure 6.11: Linear line with extreme scores
+Figure 7.11: Linear Relationship with Extreme Scores
@@ -3577,7 +3594,7 @@
-Figure 6.12: Line lines with and without extreme scores
+Figure 7.12: Linear Relationships with and without Extreme Scores
@@ -3588,7 +3605,7 @@
A better answer to this challenge might be to try a median-based approach instead. This is where a model like quantile regression becomes handy. Formally, the objective function for the model can be expressed as:
With quantile regression, we are given an extra parameter for the model: \(\tau\) or tau. It’s a number between 0 and 1 representing the desired quantile (e.g., 0.5 for the median). The objective function also treats positive residuals differently from negative residuals. If the residual is positive, then we multiply it by the tau value. If the residual is negative, then we multiply it by the tau value minus 1.
To demonstrate this type of model, let’s use our movie reviews data. Let’s say that we are curious about the relationship between the word_count variable and the rating variable to keep things simple. To make it even more straightforward, we will use the standardized (scaled) version of the variable. In our default approach, we will start with a median regression, in other words, a quantile associated with a tau of .5.
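A median regression along those lines is nearly a one-liner with statsmodels; the name of the scaled word count column here is an assumption.
import statsmodels.formula.api as smf

# q (tau) = .5 gives a median regression; 'word_count_sc' is a placeholder
# name for the standardized word count
model_median = smf.quantreg('rating ~ word_count_sc', data = df_reviews)
result_median = model_median.fit(q = 0.5)

result_median.summary()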
@@ -3617,7 +3634,7 @@
-Table 6.6: Quantile regression model output
+Table 7.6: Quantile Regression Model Results
-Table 6.7: Quantile regression model output
+Table 7.7: Quantile Regression Model Results
@@ -4648,8 +4665,8 @@
-
-
6.5.2 The quantile loss function
+
+
7.5.2 The quantile loss function
Given how relatively simple the objective function is, let’s demystify this model further by creating our own quantile regression model. We’ll start by creating a loss function that we can use to fit our model.
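A minimal sketch of such a loss function, assuming numpy and a model matrix X whose first column is the intercept, might look like this.
import numpy as np

def quantile_loss(par, X, y, tau = 0.5):
    y_hat = X @ par                # linear predictor
    residual = y - y_hat
    # positive residuals are weighted by tau, negative residuals by (tau - 1)
    loss = np.where(residual >= 0, tau * residual, (tau - 1) * residual)
    return np.sum(loss)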
This code is just the embodiment of Equation 6.1. Compared to some of our other approaches, we add an argument for tau, but otherwise proceed very similarly. We calculate the residuals, and then we calculate the loss based on the residuals.
+
This code is just the embodiment of Equation 7.1. Compared to some of our other approaches, we add an argument for tau, but otherwise proceed very similarly. We calculate the residuals, and then we calculate the loss based on the residuals.
-
-
6.5.3 Fitting a quantile regression model
+
+
7.5.3 Fitting a quantile regression model
Now that we have our data and our loss function, we can fit the model almost exactly like our standard linear model. Again, note the difference here with our tau value, which we’ve set to .5 to represent the median.
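Continuing that sketch, we can hand the loss to a general-purpose optimizer; the scaled word count column name is again an assumption, and quantile_loss is the function sketched above.
import numpy as np
from scipy.optimize import minimize

# model matrix: intercept plus the (assumed) standardized word count column
X = np.column_stack([np.ones(len(df_reviews)), df_reviews['word_count_sc']])
y = df_reviews['rating'].to_numpy()

fit_median = minimize(
    quantile_loss,                 # the loss function sketched above
    x0 = np.zeros(X.shape[1]),
    args = (X, y, 0.5),            # tau = .5 targets the median
    method = 'Nelder-Mead',        # derivative-free; suits the non-smooth loss
)

fit_median.x   # intercept and coefficient estimates
If the sketch is on track, these estimates should land close to what a dedicated quantile regression routine reports.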
-Table 6.8: Comparison of quantile regression models
+Table 7.8: Comparison of Quantile Regression Models
@@ -5232,26 +5249,26 @@
-
-
6.6 Wrapping Up
+
+
7.6 Wrapping Up
The standard linear model is useful across many different data situations. It does, unfortunately, have some issues when data becomes a little bit more “real”. When you have extreme scores or relationships that a standard model might miss, you don’t need to abandon your linear model in favor of something more exotic. Instead, you might just need to think about how you are actually fitting the line through your data.
-
-
6.6.1 The common thread
+
+
7.6.1 The common thread
The models discussed in this chapter are all still linear models, but they add flexibility in how they model the relationship between the features and the target, bringing a nonlinear aspect to an otherwise linear approach. Furthermore, with tools like mixed models, GAMs, and quantile regression, we generalize our GLMs to handle even more complex data settings.
-
-
6.6.2 Choose your own adventure
-
No matter how much we cover in this book, there is always more to learn. Hopefully you’ve got a good grip on linear models and related topics, so feel free to try out some machine learning in Chapter 7!
+
+
7.6.2 Choose your own adventure
+
No matter how much we cover in this book, there is always more to learn. Hopefully you’ve got a good grip on linear models and related topics, so feel free to try out some machine learning in Chapter 8!
-
-
6.6.3 Additional resources
+
+
7.6.3 Additional resources
There is no shortage of great references for mixed effects models. If you are looking for an introduction to mixed models, we would recommend starting with yet another tutorial by one of your fearless authors! Michael Clark’s Mixed Models with R (2023) is a great introduction and is freely available. For a more comprehensive treatment, you can’t go wrong with Gelman & Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models (2006), and their more recent efforts in Regression and Other Stories (2020), which will soon have an added component for mixed models, Advanced Regression and Multilevel Models (2024).
If you want to dive more into the GAM world, we would recommend that you start with the Moving Beyond Linearity chapter in An Introduction to Statistical Learning (James et al. 2021). Not only do they have versions for both R and Python, but both have been made available online. If you want more after that, you can’t beat Simon Wood’s book, Generalized Additive Models: An Introduction with R (2017), or a more digestible treatment of the same content by one of your own humble authors (Clark 2022).
For absolute depth on quantile regression, we will happily point you to the OG of quantile regression, Roger Koenker. His book, Quantile Regression (2005), is a must-read for anyone wanting to dive deeper into quantile regression, or just play around with his R package quantreg. Galton, Edgeworth, Frisch, and Prospects for Quantile Regression in Econometrics is another of his. And finally, you might also consider Fahrmeir et al. (2021), which takes an even more generalized view of GLMs, GAMs, mixed models, quantile regression, and more (very underrated).
-
-
6.7 Exercise
+
+
7.7 Exercise
These models are so much fun, you should feel comfortable just swapping any feature(s) in and out of the models. For example, for the mixed model, try using GDP per capita or life expectancy instead of (just a) trend effect. For the GAM, try using several features with nonlinear effects and see what shakes out. For quantile regression, try a different feature like movie length using different quantiles.
@@ -5725,12 +5742,12 @@
<
diff --git a/docs/linear_model_extensions_files/figure-html/fig-gam-model-line-1.png b/docs/linear_model_extensions_files/figure-html/fig-gam-model-line-1.png
deleted file mode 100644
index 94148e1..0000000
Binary files a/docs/linear_model_extensions_files/figure-html/fig-gam-model-line-1.png and /dev/null differ
diff --git a/docs/linear_model_extensions_files/figure-html/fig-lambda_value_viz-1.png b/docs/linear_model_extensions_files/figure-html/fig-lambda_value_viz-1.png
deleted file mode 100644
index 5011127..0000000
Binary files a/docs/linear_model_extensions_files/figure-html/fig-lambda_value_viz-1.png and /dev/null differ
diff --git a/docs/linear_model_extensions_files/figure-html/fig-length-genre-rating-1.png b/docs/linear_model_extensions_files/figure-html/fig-length-genre-rating-1.png
index b03c214..f2536e9 100644
Binary files a/docs/linear_model_extensions_files/figure-html/fig-length-genre-rating-1.png and b/docs/linear_model_extensions_files/figure-html/fig-length-genre-rating-1.png differ
diff --git a/docs/linear_model_extensions_files/figure-html/fig-model-interaction-plot-2.png b/docs/linear_model_extensions_files/figure-html/fig-model-interaction-plot-2.png
deleted file mode 100644
index c8b613c..0000000
Binary files a/docs/linear_model_extensions_files/figure-html/fig-model-interaction-plot-2.png and /dev/null differ
diff --git a/docs/linear_model_extensions_files/figure-html/fig-r_lambda_1_viz-1.png b/docs/linear_model_extensions_files/figure-html/fig-r_lambda_1_viz-1.png
deleted file mode 100644
index 0a5e670..0000000
Binary files a/docs/linear_model_extensions_files/figure-html/fig-r_lambda_1_viz-1.png and /dev/null differ
diff --git a/docs/linear_model_extensions_files/figure-html/fig-random-effects-2-1.png b/docs/linear_model_extensions_files/figure-html/fig-random-effects-2-1.png
deleted file mode 100644
index 34d5c6a..0000000
Binary files a/docs/linear_model_extensions_files/figure-html/fig-random-effects-2-1.png and /dev/null differ
diff --git a/docs/linear_model_extensions_files/figure-html/fig-random-effects-cor-plot-1.png b/docs/linear_model_extensions_files/figure-html/fig-random-effects-cor-plot-1.png
index a4c3a2d..3720f28 100644
Binary files a/docs/linear_model_extensions_files/figure-html/fig-random-effects-cor-plot-1.png and b/docs/linear_model_extensions_files/figure-html/fig-random-effects-cor-plot-1.png differ
diff --git a/docs/linear_model_extensions_files/figure-html/fig-regular-linear-line-1.png b/docs/linear_model_extensions_files/figure-html/fig-regular-linear-line-1.png
deleted file mode 100644
index 2d357e7..0000000
Binary files a/docs/linear_model_extensions_files/figure-html/fig-regular-linear-line-1.png and /dev/null differ
diff --git a/docs/linear_model_extensions_files/figure-html/fig-spline_viz-1.png b/docs/linear_model_extensions_files/figure-html/fig-spline_viz-1.png
deleted file mode 100644
index 7d4c7b4..0000000
Binary files a/docs/linear_model_extensions_files/figure-html/fig-spline_viz-1.png and /dev/null differ
diff --git a/docs/linear_models.html b/docs/linear_models.html
index b91b35e..3a1699d 100644
--- a/docs/linear_models.html
+++ b/docs/linear_models.html
@@ -7,7 +7,7 @@
-2 The Foundation – [Models Demystified]{.smallcaps}
+3 The Foundation – [Models Demystified]{.smallcaps}
-
-Feature | Target
-independent variable | dependent variable
-predictor variable | response
-explanatory variable | outcome
-covariate | label
-x | y
-input | output
-right-hand side | left-hand side
-
Some of these terms actually suggest a particular type of relationship (e.g., a causal relationship, an experimental setting), but here we’ll typically avoid those terms if we can, since those connotations won’t apply. In the end, you may find us using any of these words to describe the relationships of interest so that you are comfortable with the terminology, but we’ll mostly stick with features and targets. In our opinion, these terms have the fewest hidden assumptions and implications, and just imply ‘features of the data’ and the ‘target’ we’re trying to explain or predict.
+
+
+
It is the chief characteristic of data science that it works. ― Isaac Asimov (paraphrased)
+
+
+
Now that you’re here, it’s time to dive in! We’ll start things off by covering the building block of all modeling, and a solid understanding here will provide you the basis for just about anything that comes after, no matter how complex it gets. The linear model is our starting point. At first glance, it may seem like a very simple model, and it is, relatively speaking. But it’s also quite powerful and flexible, able to take in different types of inputs, handle nonlinear relationships, temporal and spatial relations, clustering, and more. Linear models have a long history, with even the formal and scientific idea behind correlation and linear regression being well over a century old1! And in that time, the linear model is far and away the most used model out there. But before we start talking about the linear model, we need to talk about what a model is in general.
+
+
3.1 Key Ideas
+
To get us started, we can pose a few concepts key to understanding linear models. We’ll cover each of these as we go along.
+
+
The linear model is a foundation on which one can build an understanding for all models.
+
Prediction is fundamental to assessing and using a model.
+
Interpretation: what does a model tell us?
+
+
Prediction underlies all interpretation
+
We can interpret a model at the feature level and as a whole
+
+
+
As we go along and cover these concepts, be sure that you feel you have the ‘gist’ of what we’re talking about. Almost everything that goes beyond linear models builds on what’s introduced here, so it’s important to have a firm grasp before climbing to new heights.
+
+
3.1.1 Why this matters
+
The basic linear model and how it comes about underpins so many other models, from the simplest t-test to the most complex neural network. There are many important aspects of it, but it provides a powerful foundation, and one that you’ll see in many different contexts. It’s also a model that is relatively easy to understand, so it’s a great place to start!
-
-
2.3.2 Expressing relationships
-
As noted, a model is a way of expressing a relationship between a set of features and a target, and one way of thinking about this is in terms of inputs and outputs. But how can we go from input to output?
-
Well, first off, we assume that the features and target are correlated, that there is some relationship between the feature x and target y. The output of a model will correspond to the target if they are correlated, and more closely match it with stronger correlation. If so, then we can ultimately use the features to predict the target. In the simplest setting, a correlation implies a relationship where x and y typically move up and down together (left plot) or they move in opposite directions where x goes up and y goes down (right plot).
-
-
-Figure 2.1: Correlation
-
In addition, the simplest correlation suggests a linear relationship, one that is adequately captured by a straight line. There are many types of correlation metrics, but the most common one, the Pearson correlation, is explicitly a measure of the linear relationship between two variables. It’s expressed as a number between -1 and 1, where 0 means there is no linear relationship. As we move closer to a 1.0 correlation value, we would see a tighter scatterplot like the one on the left, until it became a straight line. The same happens for the negative relationship as we get closer to a value of -1, like the plot on the right. If we have only one feature and target, the Pearson correlation reflects the exact result of the linear model we’d conduct in a more complicated fashion. But even with multiple features, we still stick to this notion of correlation to help us understand how the features account for the target’s variability, or why it behaves the way it does.
-
+
+
3.1.2 Good to know
+
We’re just starting out here, but we’re kind of assuming you’ve had some exposure to the idea of statistical or other models, even if only from an interpretation standpoint. We assume you have an understanding of basic stats like central tendency (e.g., a mean or median), variance, and correlation, stuff like that. And if you intend to get into the code examples, you’ll need a basic familiarity with Python or R.
+
-
-
2.4THE Linear Model
+
+
3.2 THE Linear Model
The linear model is perhaps the simplest functional model we can use to express a relationship between features and targets. And because of that, it’s possibly still the most common model used in practice, and it is the basis for many types of other models. So why don’t we do one now?
The following dataset has individual movie reviews containing the movie rating (1-5 stars scale), along with features pertaining to the review (e.g., word count), those that regard the reviewer (e.g., age) and features about the movie (e.g., genre, release year).
For our first linear model, we’ll keep things simple. Let’s predict the rating from the length of the review in terms of word count. We’ll use the lm() function in R and the ols() function in Python2 to fit the model. Both functions take a formula as the first argument, which is a way of expressing the relationship between the features and target. The formula is expressed as y ~ x1 + x2 + ..., where y is the target name and x* are the feature names. We also need to specify what the data object is, typically a data frame.
# all data found on github repo
-df_reviews =read_csv('https://tinyurl.com/moviereviewsdata')
-
-model_lr_rating =lm(rating ~ word_count, data = df_reviews)
-
-summary(model_lr_rating)
+
# all data found on github repo
+df_reviews = read_csv('https://tinyurl.com/moviereviewsdata')
+
+model_lr_rating = lm(rating ~ word_count, data = df_reviews)
+
+summary(model_lr_rating)
Call:
@@ -1076,22 +548,22 @@
-
+
-
import numpy as np
-import pandas as pd
-import statsmodels.formula.api as smf
-
-# all data found on github repo
-df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
-
-model_lr_rating = smf.ols('rating ~ word_count', data = df_reviews).fit()
-
-model_lr_rating.summary(slim =True)
+
import numpy as np
+import pandas as pd
+import statsmodels.formula.api as smf
+
+# all data found on github repo
+df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
+
+model_lr_rating = smf.ols('rating ~ word_count', data = df_reviews).fit()
+
+model_lr_rating.summary(slim = True)
OLS Regression Results
@@ -1232,7 +704,7 @@
Getting more into the details, we’ll start with the fact that the linear model posits a linear combination of the features. This is an important concept to understand, but really, a linear combination is just a sum of the features, each of which has been multiplied by some specific value. That value is often called a coefficient, or possibly weight, depending on the context. The linear model is expressed as (math incoming!):
In this case, the function is just a sum, something so simple we do it all the time. In the linear model sense though, we’re actually saying a bit more. Another way to understand that equation is that y is a function of x. We don’t show any coefficients here, i.e. the bs in our initial equation (Equation 2.1), but technically it’s as if each coefficient was equal to a value of 1. In other words, for this simple linear model, we’re saying that each feature contributes in an identical fashion to the target.
+
In this case, the function is just a sum, something so simple we do it all the time. In the linear model sense though, we’re actually saying a bit more. Another way to understand that equation is that y is a function of x. We don’t show any coefficients here, i.e. the bs in our initial equation (Equation 3.1), but technically it’s as if each coefficient was equal to a value of 1. In other words, for this simple linear model, we’re saying that each feature contributes in an identical fashion to the target.
In practice, features will never contribute in the same ways, because they correlate with the target differently, or are on different scales. So if we want to relate some feature, \(x_1\), and some other feature, \(x_2\), to target \(y\), we probably would not assume that they both contribute in the same way from the beginning. We might give relatively more weight to \(x_1\) than \(x_2\). In the linear model, we express this by multiplying each feature by a different coefficient or weight. So the linear model is really just a sum of the features multiplied by their coefficients, i.e. a weighted sum. In fact, we’re saying that each feature contributes to the target in proportion to its coefficient. So if we have a feature \(x_1\) and a coefficient \(b_1\), then the contribution of \(x_1\) to the target is \(b_1 * x_1\). If we have a feature \(x_2\) and a coefficient \(b_2\), then the contribution of \(x_2\) to the target is \(b_2 * x_2\). And so on.
For our specific model, here is the mathematical representation:
\[
@@ -1263,56 +735,56 @@
Our target is the movie’s rating by a reviewer, and the feature is the word count
We map the feature to the target via the linear model, which provides an initial understanding of how the feature is related to the target. In this case, we start with a baseline of 3.49. This value makes sense only in the case of a rating with no review, and is what we would guess if the word count was 0. But we know there are reviews for every observation, so it’s not very meaningful as is. We’ll talk about ways to get a more meaningful intercept later, but for now, that is our starting point. Moving on, if we add a single word to the review, we expect the rating to decrease by 0.04 stars. So if we had a review that was 10 words long, i.e., the mean word count, we would predict a rating of 3.49 + 10*-0.04 = 3.1 stars.
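Just to see the arithmetic with the rounded values from the output:
intercept = 3.49        # rounded intercept from the output
b_word_count = -0.04    # rounded word count coefficient

intercept + b_word_count * 10   # predicted rating for a 10-word review, about 3.1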
-
-
2.4.1 The linear model visualized
-
We can also express the linear model as a graph, which can be a very useful way to think about models in a visual fashion, and as we see other models, can help us literally see how different models relate to one another and are actually very similar to one another. In the following, we have three features predicting a single target, so we have three ‘nodes’ for the features, and a single node for the target. The feature nodes are combined into a linear combination to produce the output of the model. In the context of linear regression, the output is often called the linear predictor. Each ‘edge’ signifies the connection of a feature to the output, and is labeled with the coefficient or weight. The connection between the output and the target is direct, without any additional change. We’ll return to this depiction a little bit later (Section 2.9), but for our standard linear model, we’re all set.
+
+
3.2.1 The linear model visualized
+
We can also express the linear model as a graph, which can be a very useful way to think about models in a visual fashion. As we come across other models, a visualization like this can help us see how different models relate to one another, and how similar they actually are. In the following, we have three features predicting a single target, so we have three ‘nodes’ for the features, and a single node for the target. The feature nodes are combined into a linear combination to produce the output of the model. In the context of linear regression, the output is often called the linear predictor. Each ‘edge’ signifies the connection of a feature to the output, and is labeled with the coefficient or weight. The connection between the output and the target is direct, without any additional change. We’ll return to this depiction a little bit later (Section 3.7), but for our standard linear model, we’re all set.
-Figure 2.2: A linear regression as a graphical model
+Figure 3.1: A linear regression as a graphical model
So at this point you have the basics of what a linear model is and how it works, and a couple ways to think about it, whether through programming, math, or just visually. But there is a lot more to it than that. Just getting the model is easy enough, but we need to be able to use it and understand the details better, so we’ll get into that now!
-
-
2.5 What Do We Do with a Model?
+
+
3.3 What Do We Do with a Model?
Once we have a working model, there are two primary ways we can use it. One way to use a model is to help us understand the relationships between the features and our outcome of interest. In this way, the focus can be said to be on explanation, or interpreting the model results. The other way to use a model is to make estimates about the target for specific observations, often ones we haven’t seen in our data. In this way the focus is on prediction. In practice, we often do both, but the focus is usually on one or the other. We’ll cover both in detail eventually, but let’s start with prediction.
-
-
2.5.1 Prediction
-
It may not seem like much at first, but a model is of no use if it can’t be used to make predictions about what we can expect in the world around us. Once our model has been fit to the data, we can obtain our predictions by plugging in values for the features that we are interested in, and, using the corresponding weights and other parameters that have been estimated, come to a guess about a specific observation. Let’s go back to our results, shown in the following table.
+
+
3.3.1 Prediction
+
It may not seem like much at first, but a model is of little use if it can’t be used to make predictions about what we can expect in the world around us. Once our model has been fit to the data, we can obtain our predictions by plugging in values for the features that we are interested in, and, using the corresponding weights and other parameters that have been estimated, come to a guess about a specific observation. Let’s go back to our results, shown in the following table.
-Table 2.2: Linear Model Output
+Table 3.1: Linear Model Output
-
+
@@ -1777,13 +1249,18 @@
When we’re talking about the predictions (or outputs) for a linear model, we usually will see this as the following mathematically:
What is \(\hat{y}\)? The hat over the \(y\) just means that it’s a predicted value of the model, i.e. the output, rather than the target value we actually observe in the data. Our first equations that just used \(y\) implicitly suggested that we would get a perfect rating value given the model, but that’s not the case. We can only get an estimate. The \(\hat{y}\) is also the linear predictor in our graphical version (Figure 2.2), which makes clear it is not the actual target, but a combination of the features that is related to the target.
-
To make our first equation (Equation 2.1) accurately reflect the relationship between the target and our features, we need to add what is usually referred to as an error term, \(\epsilon\), to account for the fact that our predictions will not be perfect3. So the full linear (regression) model is:
+\tag{3.2}\]
+
What is \(\hat{y}\)? The hat over the \(y\) just means that it’s a predicted value of the model, i.e. the output, rather than the target value we actually observe in the data. Our first equations that just used \(y\) implicitly suggested that we would get a perfect rating value given the model, but that’s not the case. We can only get an estimate. The \(\hat{y}\) is also the linear predictor in our graphical version (Figure 3.1), which makes clear it is not the actual target, but a combination of the features that is related to the target.
+
To make our first equation (Equation 3.1) accurately reflect the relationship between the target and our features, we need to add what is usually referred to as an error term, \(\epsilon\), to account for the fact that our predictions will not be perfect3. So the full linear (regression) model is:
The error term is a random variable that represents the difference between the actual value and the predicted value, which comes from the weighted combination of features. We can’t know what the error term is, but we can estimate it, just like we can the coefficients. We’ll talk more about that in the section on estimation (Chapter 4).
+\tag{3.3}\]
+
The error term is a random variable that represents the difference between the actual value and the predicted value, which comes from the weighted combination of features. We can’t know what the error term is, but we can estimate parameters associated with it, just like we can the coefficients. We’ll talk more about that in the section on estimation (Chapter 5).
This makes explicit that the target is assumed to be conditionally normally distributed with a mean based on the linear combination of the features, and a variance of \(\sigma^2\). What do we mean by conditionally? It means that the target is normally distributed given the features in question and the estimated model parameters. This is the standard assumption for linear regression, and it’s a good one to start with, but it’s not our only option. We’ll talk more about this in the section on assumptions (Section 3.6), and see what we might do differently in Chapter 6. We will also see that we can estimate the model parameters without any explicit reference to a probability distribution (Chapter 5).
@@ -1801,27 +1278,27 @@
-
-
2.5.2 What kinds of predictions can we get?
+
+
3.3.2 What kinds of predictions can we get?
What predictions we get depends on the type of model we are using. For the linear model we have at present we can get predictions for the target, which is a continuous variable. Very commonly, we also can get predictions for a categorical target, such as whether the rating is ‘good’ or ‘bad’. This simple breakdown pretty much covers everything, as we typically would be predicting a continuous numeric variable or a categorical variable, or more of them, like multiple continuous variables, or a target with multiple categories, or sequences of categories (e.g. words). In our case, we can get predictions for the rating, which is a number between 1 and 5. Had our target been a binary good vs. bad rating, our predictions would still be numeric in most cases or at least amenable to such, and usually expressed as a probability between 0 and 1, say, for the ‘good’ category. Higher probabilities would mean we’d more likely predict the movie is good. We then would convert that probability to a class of good or bad depending on a chosen probability cutoff. We’ll talk about how to get predictions for categorical targets later4.
We previously saw a prediction for a single observation where the word count was 10 words, but we can also get predictions for multiple observations at once. In fact, we can get predictions for all observations in our dataset. Besides that, we can also get predictions for observations that we don’t even have data for! Fun! The following shows how we can get predictions for all data, and for a single observation with a word count of 5.
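A minimal sketch of both, reusing the fitted model object from before:
import pandas as pd

# predictions for every observation in the data
predictions_all = model_lr_rating.predict()

# prediction for a single (hypothetical) 5-word review
model_lr_rating.predict(pd.DataFrame({'word_count': [5]}))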
-Figure 2.3: Predicted vs. Observed Ratings
+Figure 3.2: Predicted vs. Observed Ratings
@@ -1847,30 +1324,30 @@
-Table 2.3: Predictions for Specific Observations
+Table 3.2: Predictions for Specific Observations
-
+
@@ -2322,8 +1799,8 @@
-
2.5.3 Prediction error
+
+
3.3.3 Prediction error
As we have seen, predictions are not perfect, and an essential part of the modeling endeavor is to better understand these errors and why they occur. In addition, error assessment is the fundamental way in which we assess a model’s performance, and, by extension, compare that performance to other models. In general, prediction error is the difference between the actual value and the predicted value or some function of it, and in statistical models, is also often called the residual. We can look at these individually, or we can look at them in aggregate with a single metric.
Let’s start by looking at the residuals visually. Often the modeling package you use will have this as a default plotting method when doing a standard linear regression, so it’s wise to take advantage of it. We plot both the distribution of raw error scores and the cumulative distribution of absolute prediction error. Here we see a couple of things. First, the distribution is roughly normal, which is a good thing, since statistical linear regression assumes our error is normally distributed, and the prediction error serves as an estimate of that. Second, we see that the mean of the errors is zero, which is a consequence of linear regression, and the reason we look at other metrics when assessing model performance. We can also see that most of our predictions are within ±1 star rating.
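A quick sketch of the first panel of such a plot, using the residuals from the fitted model:
import matplotlib.pyplot as plt

residuals = model_lr_rating.resid   # observed minus predicted

plt.hist(residuals, bins = 30)
plt.xlabel('Prediction error')
plt.ylabel('Count')
plt.show()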
@@ -2334,41 +1811,41 @@
-Figure 2.4: Distribution of Prediction Errors
+Figure 3.3: Distribution of Prediction Errors
-
Of more practical concern is that we don’t see extreme values or clustering, which might indicate a failure on the part of the model to pick up certain segments of the data. It can still be a good idea to look at the extremes just in case we can pick up on some aspect of the data that we could potentially incorporate into the model. So looking at our worst prediction in absolute terms, we see the observation has a typical word count, and so our simple model will just predict a fairly typical rating. But the actual rating is 1, which is 2.1 away from our prediction, a very noticeable difference. Further data inspection may be required to figure out why this came about.
+
Of more practical concern is that we don’t see extreme values or clustering, which might indicate a failure on the part of the model to pick up certain segments of the data. It can still be a good idea to look at the extremes just in case we can pick up on some aspect of the data that we could potentially incorporate into the model. So looking at our worst prediction in absolute terms, we see the observation has a typical word count, and so our simple model will just predict a fairly typical rating. But the actual rating is 1, which is 2.1 away from our prediction, a very noticeable difference. Further data inspection may be required to figure out why this came about, and this is a process you should always be prepared to do when you’re working with models.
We can also look at the uncertainty of our predictions, which is a measure of how much we expect our predictions to vary. This is often expressed as an interval, a range of values that we expect our prediction to fall within, with a certain level of confidence. But! There are actually two types of intervals we can get. One is really about the mean prediction, or expected value we would get from the model at that observation. This is usually called a confidence interval. The other type of interval is based on the model’s ability to predict new data, and is often called a prediction interval. This interval is about the actual prediction we would get from the model for any value, whether it was data we had seen before or not. Because of this, the prediction interval is always wider than the confidence interval, and it’s the one we usually want to use when we’re making predictions about new data.
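A sketch of obtaining both intervals for a 5-word review with statsmodels:
import pandas as pd

pred = model_lr_rating.get_prediction(pd.DataFrame({'word_count': [5]}))

# mean_ci_* columns give the confidence interval for the expected value;
# obs_ci_* columns give the (wider) prediction interval for a new observation
pred.summary_frame(alpha = 0.05)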
-Table 2.5: Prediction Intervals for Specific Observations
+Table 3.4: Prediction Intervals for Specific Observations
-
+
@@ -3345,44 +2822,44 @@
As expected our prediction interval is wider than our confidence interval, and we can see that the prediction interval is quite wide- from a rating of 2.1 to 4.4. This is a consequence of the fact that we have a lot of uncertainty in our predictions for new observations and we can’t expect to get a very precise prediction from our model with only one feature. This is a common issue with many models, and one that having a better model can help remedy7.
-
So at this point you have the gist of prediction, prediction error, and uncertainty in a prediction, but there is still more to it! We’ll come back to global assessments of model error very shortly, and even more detail can be found in Chapter 3 where we dive deeper into our models, and Chapter 4, where we see how to estimate the parameters of our model by picking those that will reduce the prediction error the most. For now though, let’s move on to the other main use of models, which is to help us understand the relationships between the features and the target, or explanation.
+
So at this point you have the gist of prediction, prediction error, and uncertainty in a prediction, but there is still more to it! We’ll come back to global assessments of model error very shortly, and even more detail can be found in Chapter 4 where we dive deeper into our models, and Chapter 5, where we see how to estimate the parameters of our model by picking those that will reduce the prediction error the most. For now though, let’s move on to the other main use of models, which is to help us understand the relationships between the features and the target, or explanation.
-
-
2.6 How Do We Interpret the Model?
+
+
3.4 How Do We Interpret the Model?
When it comes to interpreting the results of our model, there are a lot of tools at our disposal, though many of the tools we can ultimately use will depend on the specifics of the model we have employed. In general though, we can group our approach to understanding results at the feature level and the model level. A feature level understanding regards the relationship between a single feature and the target. Beyond that, we also attempt comparisons of feature contributions to prediction, i.e., relative importance. Model level interpretation is focused on assessments of how well the model ‘fits’ the data, or more generally, predictive performance. We’ll start with the feature level, and then move on to the model level.
-
-
2.6.1 Feature level interpretation
+
+
3.4.1 Feature level interpretation
As mentioned, at the feature level, we are primarily concerned with the relationship between a single feature and the target. More specifically, we are interested in the direction and magnitude of the relationship, but in general, it all boils down to how a feature induces change in the target. For numeric features, we are curious about the change in the target given some amount of change in the feature values. It’s conceptually the same for categorical features, but often we like to express the change in terms of group mean differences or something similar, since the order of categories is not usually meaningful. An important aspect of feature level interpretation is the specific predictions we can get by holding the data at key feature values.
Let’s start with the basics by looking again at our coefficient table from the model output.
-Table 2.6: Linear Regression Coefficients
+Table 3.5: Linear Regression Coefficients
-
+
@@ -3867,30 +3344,30 @@
-Table 2.7: Linear Regression Statistical Output
+Table 3.6: Linear Regression Statistical Output
-
+
@@ -4357,30 +3834,30 @@
-Table 2.8: Linear Regression Confidence Intervals
+Table 3.7: Linear Regression Confidence Intervals
-
+
@@ -4866,57 +4343,57 @@
-
2.6.2 Model level interpretation
+
+
3.4.2 Model level interpretation
Thus far, we’ve focused on interpretation at the feature level. But knowing the interpretation of a feature doesn’t do you much good if the model itself is poor! In that case, we also need to assess the model as a whole, and as with the feature level, we can go about this in a few ways. Before getting too carried away with asking whether your model is any good or not, you always need to ask yourself: relative to what? Many models claim top performance under various circumstances, but are statistically indistinguishable from many other models. So we need to be careful about how we assess our model, and what we compare it to.
-
When we looked at the models previously Figure 2.3, we examined how well the predictions and target line up, and that gives us an initial feel for how well the model fits the data. Most model-level interpretation involves assessing and comparing model fit and variations on this theme. Here we show how easy it is to obtain such a plot.
+
When we looked at the models previously Figure 3.2, we examined how well the predictions and target line up, and that gives us an initial feel for how well the model fits the data. Most model-level interpretation involves assessing and comparing model fit and variations on this theme. Here we show how easy it is to obtain such a plot.
We can also get an overall assessment of the prediction error with a single metric. In the case of the linear model we’ve been looking at, we can express this as the sum or mean of our squared errors, the latter of which is a very commonly used modeling metric, the mean squared error (MSE), or its square root, the root mean squared error (RMSE)12. We’ll talk more about this and similar metrics in other chapters, but we can take a look at the RMSE for our model now.
If we look back at our results, we can see this expressed as part of the output or as an attribute of the model13. The RMSE is more interpretable, as it gives us a sense that our typical errors bounce around by about 0.59. Given that the rating is on a 1-5 scale, this maybe isn’t bad, but we could definitely hope to do better than getting within roughly half a point on this scale. We’ll talk about ways to improve this later.
# summary(model_lr_rating) # 'Residual standard error' is approx RMSE
-summary(model_lr_rating)$sigma # We can extract it directly
+
# summary(model_lr_rating) # 'Residual standard error' is approx RMSE
+summary(model_lr_rating)$sigma # We can extract it directly
[1] 0.5907
-
+
-
np.sqrt(model_lr_rating.scale) # RMSE
+
np.sqrt(model_lr_rating.scale) # RMSE
0.590728780660127
@@ -4924,32 +4401,32 @@
Figure 2.4). With either metric, the closer to zero the better, since as we get closer, we are reducing error.
-
We can also look at the R-squared (R2) value of the model. R2 is possibly the most popular measure of model performance with linear regression and linear models in general. Before squaring, it’s just the correlation of the values that we saw in the previous plot (Figure 2.3). When we square it, we can interpret it as a measure of how much of the variance in the target is explained by the model. In this case, our model shows the R2 is 0.12, which is not bad for a single feature model in this type of setting. We interpret the value as 12% of the target variance is explained by our model, and more specifically by the features in the model. In addition, we can also interpret R2 as 1 - the proportion of error variance in the target, which we can calculate as \(1 - \frac{\textrm{MSE}}{var(y)}\). In other words the complement of R2 is the proportion of the variance in the target that is not explained by the model. Either way, since 88% is not explained by the model, our result suggests there is plenty of work left to do!
+
Another metric we can use to assess model fit in this particular situation is the mean absolute error (MAE). MAE is similar to the mean squared error, but instead of squaring the errors, we just take the absolute value. Conceptually it attempts to get at the same idea, how much our predictions miss the target on average, and here the value is 0.46, which we actually showed in our residual plot (Figure 3.3). With either metric, the closer to zero the better, since as we get closer, we are reducing error.
+
We can also look at the R-squared (R2) value of the model. R2 is possibly the most popular measure of model performance with linear regression and linear models in general. Before squaring, it’s just the correlation of the values that we saw in the previous plot (Figure 3.2). When we square it, we can interpret it as a measure of how much of the variance in the target is explained by the model. In this case, our model shows the R2 is 0.12, which is not bad for a single feature model in this type of setting. We interpret the value as 12% of the target variance is explained by our model, and more specifically by the features in the model. In addition, we can also interpret R2 as 1 - the proportion of error variance in the target, which we can calculate as \(1 - \frac{\textrm{MSE}}{var(y)}\). In other words the complement of R2 is the proportion of the variance in the target that is not explained by the model. Either way, since 88% is not explained by the model, our result suggests there is plenty of work left to do!
Note also, that with R2 we get a sense of the variance shared between all features in the model and the target, however complex the model gets. As long as we use it descriptively as a simple correspondence assessment of our predictions and target, it’s a fine metric. For various reasons, it’s not a great metric for comparing models to each other, but again, as long as you don’t get carried away, it’s okay to use.
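A sketch of computing these metrics by hand from the fitted model and data:
import numpy as np

y = df_reviews['rating']
y_hat = model_lr_rating.fittedvalues

mse  = np.mean((y - y_hat)**2)
rmse = np.sqrt(mse)                  # roughly .59 here
mae  = np.mean(np.abs(y - y_hat))    # roughly .46 here
r2   = 1 - mse / np.var(y)           # roughly .12 here

rmse, mae, r2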
-
-
2.6.3 Prediction vs. explanation
+
+
3.4.3 Prediction vs. explanation
In your humble authors’ views, one can’t stress enough the importance of a model’s ability to predict the target. It can be a poor model, maybe because the data is not great, or perhaps we’re exploring a new area of research, but we’ll always be interested in how well a model fits the observed data, and in most situations, we’re just as much or even more interested in how well it predicts new data.
Even to this day, statistical significance is focused on a great deal, and much is made of models that have little to no predictive power at all. As strange as it may sound, you can read results in journal articles, news features, and business reports in many fields with hardly any mention of a model’s predictive capability. The focus is almost entirely on the explanation of the model, and usually the statistical significance of the features. In those settings, statistical significance is often used as a proxy for importance, which is rarely ever justified. As we’ve noted elsewhere, statistical significance is affected by other things besides the size of the coefficient, and without an understanding of the context of the features, in this case, like how long typical reviews are, what their range is, what the variability of ratings is, etc., the information it provides is extremely limited, and many would argue, not very useful. If we are very interested in the coefficient or weight value specifically, it is better to focus on the range of possible values, which is provided by the confidence interval, along with the predictions that come about based on that coefficient’s value. While a confidence interval is also a loaded description of a feature’s relationship to the target, we can use it in a very practical way as a range of possible values for that weight, and more importantly, think of possibilities rather than certainties.
Suffice it to say at this point that how much one focuses on prediction vs. explanation depends on the context and goals of the data endeavor. There are cases where predictive capability is of utmost importance, and we care less about explanatory details, but not to the point of ignoring it. For example, even with deep learning models for image classification, where the inputs are just RGB values, we’d still like to know what the (notably complex) model is picking up on, otherwise we may be classifying images based on something like image backgrounds (e.g. outdoors vs. indoors) instead of the objects of actual interest (dogs vs. cats). In some business or other organizational settings, we are very, or even mostly, interested in the coefficients/weights, which might indicate how to allocate monetary resources in some fashion. But if those weights come from a model with no predictive power, placing much importance on them may be a fruitless endeavor.
In the end we’ll need to balance our efforts to suit the task at hand. Prediction and explanation are both fundamental to the modeling endeavor.
-
-
2.7 Adding Complexity
+
+
3.5 Adding Complexity
We’ve seen how to fit a model with a single feature and interpret the results, and that helps us to get oriented to the general modeling process. However, we’ll always have more than one feature for a model except under some very specific circumstances, such as exploratory data analysis. So let’s see how we can implement a model with more features, one that makes more practical sense.
-
-
2.7.1 Multiple features
+
+
3.5.1 Multiple features
We can add more features to our model very simply. Using the standard functions we’ve already demonstrated, we just add them to the formula (both R and statsmodels) as follows.
-
'y ~ feature_1 + feature_2 + feature_3'
+
'y ~ feature_1 + feature_2 + feature_3'
In other cases where we use matrix inputs, additional features will just be the additional input columns, and nothing about the model code actually changes. We might have a lot of features, and even for relatively simple linear models this could be dozens in some scenarios. A compact depiction of our model uses matrix representation, which we’ll show in the callout below, but you can find more detail in the matrix overview Appendix B. For our purposes, all you really need to know is that this:
\[
y = X\beta + \epsilon \qquad \textrm{or} \qquad y = \alpha + X\beta + \epsilon
-\tag{2.4}\]
You will also see it depicted in a transposed fashion, such that \(y = \beta^\intercal X\), or \(f(x) = w^\intercal X + b\), with the latter formula typically seen when the context is machine learning. This is just a matter of preference, except that it may assume the data is formatted in a different way, or possibly they are talking about matrix/vector operations for a single observation. You’ll want to pay close attention to what the dimensions are15.
For the models considered here and almost all ‘tabular data’ scenarios, the data is stored in the fashion we’ve represented in this text, but you should be aware that other data settings will force you to think of multi-dimensional arrays16 instead of 2-d matrices, for example, with image processing. So it’s good to be flexible.
-
With that in mind, let’s get to our model! In what follows, we keep the word count, but now we add some aspects of the reviewer, such as age and the number of children in the household, and features related to the movie, like the release year, the length of the movie in minutes, and the total reviews received. We’ll use the same approach as before, and literally just add them as we depicted in our linear model formula (Equation 2.3) .
+
With that in mind, let’s get to our model! In what follows, we keep the word count, but now we add some aspects of the reviewer, such as age and the number of children in the household, and features related to the movie, like the release year, the length of the movie in minutes, and the total reviews received. We’ll use the same approach as before, and literally just add them as we depicted in our linear model formula (Equation 3.3).
17. Conceptually this means that the effect of word count is the effect of word count after we’ve accounted for the other features in the model. In this case, an increase of a single word results in a 0.03 drop in the rating, even after adjusting for the effect of other features. Looking at another feature, the addition of a child to the home is associated with a 0.1 increase in rating, accounting for the other features.
-
Thinking about prediction, how would we get a prediction for a movie rating with a review that is 12 words long, written in 2020, by a 30 year old with one child, for a movie that is 100 minutes long, released in 2015, with 10000 total reviews? Exactly the same as we did before (Section 2.5.2)! We just create a data frame with the values we want, and predict accordingly.
+
Thinking about prediction, how would we get a prediction for a movie rating with a review that is 12 words long, written in 2020, by a 30 year old with one child, for a movie that is 100 minutes long, released in 2015, with 10000 total reviews? Exactly the same as we did before (Section 3.3.2)! We just create a data frame with the values we want, and predict accordingly.
model_lr_cat = smf.ols(
- formula ="rating ~ word_count + season",
- data = df_reviews
-).fit()
-
-model_lr_cat.summary(slim =True)
+
model_lr_cat = smf.ols(
+ formula = "rating ~ word_count + season",
+ data = df_reviews
+).fit()
+
+model_lr_cat.summary(slim = True)
OLS Regression Results
@@ -6138,22 +5615,22 @@
-
2.7.2.1 Summarizing categorical features
+
+
3.5.2.1 Summarizing categorical features
When we have a lot of categories, it’s not practical to look at the coefficients for each one, and even when there aren’t that many, we often prefer to get a sense of the total effect of the feature. For standard linear models, we can break down the target variance explained by the model into the variance explained by each feature, and this is called the ANOVA, or analysis of variance. It is not without its issues18, but it’s a good way to get a sense of whether a categorical (or other) feature as a whole is statistically significant.
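A minimal sketch of that breakdown for the categorical model above, using statsmodels:
from statsmodels.stats.anova import anova_lm

# variance broken down by term, with an overall test for season as a whole
anova_lm(model_lr_cat)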