
Using a log transformation for numeric targets and features is straightforward, and comes with several benefits. For example, it can help with heteroscedasticity, which is when the variance of the target is not constant across the range of the predictions1 (demonstrated below). It can also help keep predictions positive after back-transformation, allows for interpretability gains, and more. One issue with logging is that it is not a linear transformation. This can help capture nonlinear feature-target relationships, but it can also make some post-modeling transformations less straightforward. Also, if you have a lot of zeros, ‘log plus one’ transformations are not going to be enough to help you overcome that hurdle2. Logging also won’t help much when the variables in question have few distinct values, like ordinal variables, which we’ll discuss later in Section 11.2.3.
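To make this concrete, here is a minimal sketch in Python using made-up, strictly positive values (the data and names are ours, just for illustration). A plain log works for positive values, log1p is the usual 'log plus one' variant, and predictions made on the log scale stay positive once exponentiated back:

```python
import numpy as np
import pandas as pd

# hypothetical skewed, strictly positive target (e.g. income-like values)
rng = np.random.default_rng(42)
df = pd.DataFrame({'y': rng.lognormal(mean=3, sigma=1, size=1000)})

df['y_log'] = np.log(df['y'])      # plain log for strictly positive values
df['y_log1p'] = np.log1p(df['y'])  # log(1 + y), sometimes used when zeros are present

# a prediction made on the log scale stays positive after back-transformation
pred_log = df['y_log'].mean()
pred_original_scale = np.exp(pred_log)
```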

Figure 11.1: Log Transformation and Heteroscedasticity

11.2.2 Categorical variables


Despite their ubiquity in data, we can’t analyze raw text information as it is. Character strings, and labeled features like factors, must be converted to a numeric representation before we can analyze them. For categorical features, we can use something called effects coding to test for specific types of group differences. Far and away the most common type is called dummy coding or one-hot encoding3, which we visited previously in Section 3.5.2. In these situations we create columns for each category, and the value of the column is 1 if the observation is in that category, and 0 otherwise. Here is a one-hot encoded version of the season feature that was demonstrated previously.

Table 11.2: One-hot Encoding
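As a quick illustration, pandas' get_dummies is one common way to produce this kind of encoding (the season values here are hypothetical, just mirroring the idea of the table above):

```python
import pandas as pd

# hypothetical 'season' feature
df = pd.DataFrame({'season': ['Fall', 'Winter', 'Spring', 'Summer', 'Fall']})

# one column per category: 1 if the observation is in that category, 0 otherwise
onehot = pd.get_dummies(df['season'], prefix='season', dtype=int)
print(pd.concat([df, onehot], axis=1))
```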

When we encode categories for statistical analysis, we can summarize their impact on the target variable with a single measure. For a model with only categorical features, we can use an ANOVA (Section 3.5.2.1) for this. But a similar approach can also be used for mixed models, splines, and other models. Techniques like SHAP also provide a way to summarize the total effect of a categorical feature.

11.2.2.1 Text embeddings

When it comes to other string representations like text, we use other methods to represent them numerically. One important way to encode text is through an embedding. This is a way of representing the text as a vector of numbers, at which point the numeric embedding feature is used in the model like anything else. The way to do this usually involves a model itself, one that learns the best way to represent the text or categories numerically. This is commonly used in deep learning and natural language processing in particular. However, embeddings can also be used as a preprocessing step in any modeling situation.

To understand how embeddings work, consider a one-hot encoded matrix for a categorical variable. This matrix then connects to a hidden layer of a neural network, and the weights of that layer are the embeddings for the categorical variable. While this isn’t the exact method used (there are more efficient methods that don’t require the actual matrix), the concept is the same. In addition, we normally don’t even use whole words. Instead, we break the text into smaller units called tokens, like characters or subwords, and then use embeddings for those units. Tokenization is used in many of the most successful models for natural language processing, including models like ChatGPT.
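Here is a small numpy sketch of that idea, with a made-up set of categories and random (untrained) weights, just to show the mechanics of how a one-hot input connected to a weight matrix yields an embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
categories = ['action', 'comedy', 'drama', 'horror']
n_categories, embed_dim = len(categories), 3

# one-hot representation of a single observation ('drama')
one_hot = np.zeros(n_categories)
one_hot[categories.index('drama')] = 1

# weights connecting the one-hot input to a hidden layer; in practice these
# are learned, here they are random just to show the mechanics
embedding_weights = rng.normal(size=(n_categories, embed_dim))

# multiplying the one-hot vector by the weight matrix simply selects a row,
# and that row is the embedding for the category
embedding = one_hot @ embedding_weights
print(np.allclose(embedding, embedding_weights[categories.index('drama')]))  # True
```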

Figure 11.2: Conceptual example of an embedding

11.2.2.2 Multiclass targets

We’ve talked about and demonstrated models with binary targets, but what about when there are more than two classes? In statistical settings, we can use a multinomial regression, which is a generalization of (binomial) logistic regression to more than two classes via the multinomial distribution. Depending on the tool, you may have to use a multivariate target of the counts, though most commonly they would be zeros and ones for a classification model, which then is just a one-hot encoded target. The following table demonstrates how this might look.

Table 11.3: Multinomial Data Example

With Bayesian tools, it’s common to use the categorical distribution, which is a different generalization of the Bernoulli distribution to more than two classes. Unlike the multinomial, it is not a count distribution, but an actual distribution over categories.

In the machine learning context, we can use a variety of models we’d use for binary classification. How the model is actually implemented will depend on the tool, but one of the more popular methods is to use one-vs-all or one-vs-one strategies, where you treat each class as the target in a binary classification problem. In the first case of one vs. all, you would have a model for each class that predicts whether an observation is in that class or not. In the second case, you would have a model for each pair of classes. You should generally be careful with either approach if interpretation is important, as it can make the feature effects very difficult to understand. As an example, we can’t expect feature X to have the same effect on the target in a model for class A vs B, as it does in a model for class A vs. (B & C) or A & C. As such, it can be misleading when the models are conducted as if the categories are independent.
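As a sketch of the one-vs-all idea, scikit-learn's OneVsRestClassifier wraps any binary classifier and fits one model per class; the simulated data here is hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# hypothetical three-class problem
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_classes=3, random_state=42)

# one binary logistic regression per class (that class vs. all others)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict_proba(X[:3]))  # one column of probabilities per class
```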

Regardless of the context, interpretation is now spread across multiple target outputs, and so it can be difficult to understand the overall effect of a feature on the target. Even in the statistical setting, you now have coefficients that describe relative effects for one class versus a reference group, and so they cannot tell you the general effect of a feature on the target. This is where tools like marginal effects and SHAP can be useful (Chapter 4).

11.2.2.3 Multilabel targets

Multilabel targets are a bit more complicated, and are not as common as multiclass targets. In this case, each observation can have multiple labels. For example, if we wanted to predict genre based on the movie review data, we could choose to allow a movie to be both a comedy and action film, a sci-fi horror, or a romantic comedy. In this setting, labels are not mutually exclusive. If there are not too many unique label combinations, we can treat the target as we would other multiclass targets, but if there are many, we might need a model designed to handle the labels more efficiently.

11.2.3 Ordinal variables

So far in our discussion of categorical data, it’s assumed to have no order. But it’s quite common to have labels like “low”, “medium”, and “high”, or “very bad”, “bad”, “neutral”, “good”, “very good”, or simply a few numbers, like ratings from 1 to 5. Ordinal data is categorical data that has a known ordering, but which still has arbitrary labels. Let us repeat that: ordinal data is categorical data.

11.2.3.1 Ordinal features

The simplest way to treat ordinal features is as if they were numeric. If you do this, then you’re just pretending that it’s not categorical, but in practice this is usually fine for features. Most of the transformations we mentioned previously aren’t going to be as useful, but you can still use them if you want. For example, logging ratings 1-5 isn’t going to do anything for you model-wise, but it technically doesn’t hurt anything. But you should know that typical statistics like means and standard deviations don’t really make sense for ordinal data, so the main reason for treating them as numeric is for modeling convenience.

If you choose to treat an ordinal feature as categorical, you can ignore the ordering and do the same as you would with categorical data. This would allow for some nonlinearity since the category means will be whatever they need to be. There are some specific techniques for coding ordinal data for use in linear models, but they are not commonly used, and they generally aren’t going to help model performance or interpretation of the feature, so we do not recommend them. You could, however, use the old-school effects coding you would incorporate in traditional ANOVA models, but again, you’d need a good reason to do so.

The take home message for ordinal features is generally simple. Treat them as you would numeric features or non-ordered categorical features. Either is fine.

11.2.3.2 Ordinal targets

Ordinal targets can be trickier to deal with. If you treat them as numeric, you’re assuming that the difference between 1 and 2 is the same as the difference between 2 and 3, and so on. This is probably not true. If you treat them as categorical and use standard models for that setting, you’re assuming that there is no connection between categories. So what should you do?

There are a number of ways to model ordinal targets, but probably the most common is the proportional odds model. This model can be seen as a generalization of the logistic regression model, and is very similar to it; it is actually identical if you only have two categories. It is basically a model of 2 or above vs. 1, 3 or above vs. (2, 1), 4 or above vs. (3, 2, 1), etc. But other models beyond proportional odds are also possible, and your results could return something that gives coefficients for the model for the 1-2 category change, the 2-3 category change, and so on.
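If you want to try this in Python, recent versions of statsmodels provide an OrderedModel for cumulative link models; here is a minimal sketch on simulated ratings (the data and names are hypothetical, and the exact fitting options may differ by version):

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# simulate an ordinal rating (1-5) driven by a single numeric feature
rng = np.random.default_rng(123)
x = rng.normal(size=500)
latent = 0.8 * x + rng.logistic(size=500)
rating = pd.cut(latent, bins=[-np.inf, -1, 0, 1, 2, np.inf], labels=[1, 2, 3, 4, 5])

df = pd.DataFrame({'rating': rating, 'x': x})

# proportional odds (cumulative logit) model
model = OrderedModel(df['rating'], df[['x']], distr='logit')
result = model.fit(method='bfgs', disp=False)
print(result.summary())
```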

Ordinality of a categorical outcome is largely ignored in machine learning applications. The outcome is either treated as numeric or multiclass classification. This is not necessarily a bad thing, especially if prediction is the primary goal. But if you need a categorical prediction, treating the target as numeric means you have to make an arbitrary choice to classify the predictions. And if you treat it as multiclass, you’re ignoring the ordinality of the target, which may not work as well in terms of performance.

11.2.3.3 Rank data

Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector of 1-6 with numeric labels could be something like [2, 1, 1, 3, 4, 2], whereas rank data would be [2, 1, 3, 4, 5, 6], each being unique (unless you allowed for ties). For example, in sports, a ranking problem might involve predicting the actual order of finish of the runners. Assuming you have a modeling tool that actually handles this situation, the objective will be different from other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or the simpler variant Bradley-Terry model). In machine learning, you might use so-called learning to rank methods, like the RankNet and LambdaRank algorithms, and other variants for deep learning models.

11.3 Missing Data

Table 11.4: A Table with Missing Values

Missing data is a common challenge in data science, and there are a number of ways to deal with it, usually by substituting, or imputing, a value for the missing one. Here we’ll provide an overview of common techniques to deal with missing data.

11.3.1 Complete case analysis

The first way to deal with missing data is the simplest - complete case analysis. Here we only use observations that have no missing data and drop the rest. Unfortunately, this can lead to a lot of lost data, and can lead to biased statistical results if the data is not missing completely at random. There are special cases of some models that by their nature can ignore the missingness under an assumption of missing at random, but even those models would likely benefit from some sort of imputation. If you don’t have much missing data though, dropping the missing data is fine for practical purposes4. How much is too much? Unfortunately that depends on the context, but if you have more than 10% missing, you should probably be looking at alternatives.

11.3.2 Single value imputation

With single value imputation, we replace missing values with the mean, median, mode, or some other typical value of the feature. This will rarely help your model for a variety of reasons. Consider a numeric feature that is 50% missing, and for which you replace the missing values with the mean. How good do you think that feature will be when at least half the values are identical? Whatever variance it normally would have and share with the target is probably reduced, and possibly dramatically. Furthermore, you’ve also attenuated correlations it has with the other features, which may mute other modeling issues that you would otherwise deal with in some way (e.g. collinearity), or cause you to miss out on interactions.

Single value imputation makes perfect sense if you know that the missingness should be a specific value, like a count feature where missing means a count of zero. If you don’t have much missing data, it’s unlikely this would have any real benefit over complete case analysis, except if it allows you to use all the other features that would otherwise be dropped. But then, why not just drop this feature and keep the others?
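For completeness, here is what simple mean imputation looks like with scikit-learn, on a tiny hypothetical data frame (median, most_frequent, or a constant are the other built-in strategies):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'x1': [1.0, 2.0, np.nan, 4.0],
                   'x2': [np.nan, 10.0, 12.0, 14.0]})

# replace missing values with the column mean
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```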

11.3.3 Model-based imputation

Model-based imputation is more complicated, but can be very effective. In essence, you run a model for complete cases in which the feature with missing values is now the target, and all the other features and primary target are used to predict it. You then use that model to predict the missing values, and use those predictions as the imputed values. After these predictions are made, you move on to the next feature and do the same. There are no restrictions on which model you use for which feature. If the other features in the imputation model also have missing data, you can use something like mean imputation to get more complete data if necessary as a first step, and then when their turn comes, impute those values.

Although the implication is that you would have one model per feature and then be done, you can do this iteratively for several rounds, such that the initial imputed values are then used in subsequent models to re-impute other features’ missing values. You can do this as many times as you want, but the returns will diminish. In this setting, we are assuming you’ll have a single value imputed for each missing one, but this approach is the precursor for our next method.
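One readily available implementation of this iterative, model-based approach is scikit-learn's IterativeImputer; the following is a minimal sketch on simulated data (names and settings are ours):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=['x1', 'x2', 'x3'])
X.loc[rng.random(200) < 0.2, 'x1'] = np.nan  # ~20% missing in x1

# each feature with missing values is modeled from the others, and the
# impute-then-refit cycle is repeated for up to max_iter rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
```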

11.3.4 Multiple imputation


Multiple imputation (MI) is a more complicated technique, but can be very useful in some situations, depending on what you’re willing to sacrifice for better uncertainty estimates versus a deeper dive into the model. The idea is that you create multiple imputed datasets, each of which is based on the predictive distribution of the model used in model-based imputation (See Section 4.2.4). Say we use a linear regression assuming a normal distribution to impute feature A. We would then draw repeatedly from the predictive distribution of that model to create multiple datasets with (randomly) imputed values for feature A.

Let’s say we do this 10 times, and we now have 10 imputed data sets, each with a now complete feature A. We now run our actual model on each of these datasets, and final model results are averaged in some way to get final parameter estimates. Doing so acknowledges that your single imputation methods have uncertainty in those imputed values, and that uncertainty is incorporated into the final model results.
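Here is a conceptual, hand-rolled sketch of that process on simulated data, using a normal predictive distribution for the imputation model and simply averaging the analysis-model coefficients across imputed datasets (dedicated MI tools handle additional details, such as uncertainty in the imputation model itself and pooling of standard errors):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)                       # feature A, which will have missingness
b = rng.normal(size=n)                       # another feature
y = 1 + 0.5 * a + 0.3 * b + rng.normal(scale=0.5, size=n)
a_obs = a.copy()
a_obs[rng.random(n) < 0.3] = np.nan          # ~30% missing in A

miss = np.isnan(a_obs)
Z = np.column_stack([b, y])                  # predictors for the imputation model
imp_model = LinearRegression().fit(Z[~miss], a_obs[~miss])
resid_sd = np.std(a_obs[~miss] - imp_model.predict(Z[~miss]))

coefs = []
for m in range(10):
    # draw imputations from the (approximate) predictive distribution
    a_imp = a_obs.copy()
    a_imp[miss] = imp_model.predict(Z[miss]) + rng.normal(scale=resid_sd, size=miss.sum())
    # fit the analysis model on the completed data and keep its coefficients
    fit = LinearRegression().fit(np.column_stack([a_imp, b]), y)
    coefs.append(fit.coef_)

print(np.mean(coefs, axis=0))  # pooled coefficient estimates across imputations
```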

MI can in theory handle any source of missingness and can be a very powerful technique. But it has some drawbacks that are often not mentioned. One is that you need a specified target distribution for all imputation models used, in order to generate random draws with appropriate uncertainty. Your final model presumably is also a probabilistic model with coefficients and variances you are trying to estimate and understand. MI probably isn’t going to help boosting or deep learning models that have native methods for dealing with missing values, or at least offer little if anything over single value imputation. In addition, if you have very large data and a complicated model, you could be waiting a long time, and as modeling is an iterative process itself, this can be rather tedious to work through. Finally, few data or post-model processing tools that you commonly use will work with MI results, especially visualization tools. So you will have to hope that whatever package you use for MI has what you need. As an example, you’d have to figure out how you’re going to impute interaction terms if you have them.

Practically speaking, MI takes a lot of effort to often come to the same conclusions you would have with a single imputation method, or possibly fewer conclusions for anything beyond GLM coefficients and their standard errors. But if you want your uncertainty estimate for those models to be better, MI can be an option.

11.3.5 Bayesian imputation

One final option is to run a Bayesian model where the missing values are treated as parameters to be estimated, and they would have priors just like other parameters as well. MI is basically a variant of Bayesian imputation that can be applied to the non-Bayesian model setting, so why not just use the actual Bayesian method? Some modeling packages can allow you to try this very easily, and it can be very effective. But it is also very computationally intensive, and can be very slow as you may be increasing the number of parameters to estimate dramatically. At least it would be more fun than standard MI!

11.4 Class Imbalance

Figure 11.3: Class Imbalance

11.4.1 Calibration issues in classification

Probability calibration is often an issue in classification problems, and is a bit more complicated than just class imbalance, but is often discussed in the same setting. Having calibrated probabilities refers to the situation where the predicted probabilities of the target match up well to the actual probabilities. For example, if you predict that 10% of people will default on their loans, and 10% of people actually do default on their loans, one would say your model is well calibrated. Conversely, if you predict that 10% of people will default on their loans, but 20% of people actually do default on their loans, your model is not so well-calibrated.

One way to assess calibration is to use a calibration curve, which is a plot of the predicted probabilities vs. the observed proportions. We bin the observations in some way, and then calculate the average predicted probability and the average observed proportion of the target in each bin. If the model is well-calibrated, the points should fall along the 45-degree line. If the model is not well-calibrated, the points will fall above or below the line. In the following, one model seems to align well with the observed proportions based on the chosen bins. The other model (dashed line) is not so well calibrated, and is overshooting with its predictions. For example, where the model’s average prediction is around 0.6, the actual proportion of defaults is only around 0.5.
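If you want to compute such a curve yourself, scikit-learn's calibration_curve returns the binned observed proportions and average predicted probabilities; here is a small sketch on simulated data:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hypothetical binary classification data
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# average observed proportion and average predicted probability within each bin
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)
for p_obs, p_hat in zip(prob_true, prob_pred):
    print(f'predicted {p_hat:.2f}  observed {p_obs:.2f}')
```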

Figure 11.4: Calibration Plot

All this is to say that each point in a calibration plot, along with the reference line itself, has some error bar around it, and the differences between models and the ‘best case scenario’ would need additional steps to suss out if we are interested in doing so in a statistically rigorous fashion. Some methods are available to calibrate probabilities, but they are not commonly implemented in practice, and often involve a model-based technique, with all of its own assumptions and limitations. It’s also not exactly clear that forcing your probabilities to be on the line is helping solve the actual modeling goal in any way6. But if you are interested, you can read more here.

11.5 Censoring and Truncation

Figure 11.5: Censoring for Time Until Death

Figure 11.6: Truncation

11.6 Time Series

Figure 11.7: Time Series Data

Time series data is any data that incorporates values over a period of time. This could be something like a state’s population over years, or the max temperature of an area over days. Time series data is very common in data science, and there are a number of ways to model such data.

11.6.1 Time-based targets


As in other settings, the most common approach when the target is some value that varies over time is to use a linear model of some kind. While the target varies over time, the features may be time-varying or not. There are traditional autoregressive models that use the target’s past values as features, for example, autoregressive integrated moving average (ARIMA) models. Others can incorporate historical information in other ways, such as is done with Bayesian methods applied to marketing data or in reinforcement learning (Section 10.3). Still others can get quite complex, such as recurrent neural networks and their generalizations that form the backbone of modern AI models. Lately transformer-based models have looked promising.


Longitudinal data8 is a special case of time series data, where the target is a function of time, but it is typically grouped in some fashion. An example is a model for school performance for students over several semesters, where values are clustered within students over time. In this case, you can use some sort of time series regression, but you can also use a mixed model (Section 7.3), where you model the target as a function of time, but also include a random effect for the grouping variable, in this case, students. This is a very common approach in many domains, and can be very effective in terms of performance as well. It can also be used for time series data that is not longitudinal, where the random effects are based on autoregressive covariance matrices. In this case, an ARIMA component is added to the linear model as a random effect to account for the time series nature of the data. This is fairly common in Bayesian contexts.
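A minimal sketch of that mixed model approach, using simulated student GPA data and statsmodels (the data and names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical GPA measured over several semesters for each student
rng = np.random.default_rng(7)
n_students, n_sem = 100, 6
student = np.repeat(np.arange(n_students), n_sem)
semester = np.tile(np.arange(n_sem), n_students)
gpa = (2.5 + 0.1 * semester
       + rng.normal(scale=0.3, size=n_students)[student]   # student-specific intercepts
       + rng.normal(scale=0.2, size=n_students * n_sem))   # observation-level noise
df = pd.DataFrame({'gpa': gpa, 'semester': semester, 'student': student})

# random intercept for each student; time enters as a fixed effect
result = smf.mixedlm('gpa ~ semester', data=df, groups=df['student']).fit()
print(result.summary())
```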

In general, many models specific to time series data can be found, and the choice of model will depend on the data, the questions we want to ask, and the goals we have.

11.6.2 Time-based features


When it comes to time-series features, we can apply time-specific transformations. One technique is the Fourier transform, which can be used to decompose a time series into its component frequencies, much like how we use PCA (Section 10.2). This can be useful for identifying periodicity in the data, which can be used as a feature in a model.

In marketing contexts, some perform adstocking with features. This method models the delayed effect of features over time, such that they may have their most important impact immediately, but still can impact the present target from past values. For example, a marketing campaign might have the most significant impact immediately after it’s launched, but it can still influence the target variable at later time points. Adstocking helps capture this delayed effect without having to include multiple lagged features in the model. That said, including lagged features is also an option. In this case, you would have a feature for the current time point (t), the same feature for the previous time point (t-1), the feature for the time point before that (t-2), and so on.
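A small sketch of both ideas with pandas (the spend values and decay rate are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'spend': [100, 0, 0, 50, 0, 0, 0, 80]})

# geometric adstock: current value plus a decayed carryover of the past effect
decay = 0.5
adstock = np.zeros(len(df))
for t in range(len(df)):
    adstock[t] = df['spend'].iloc[t] + (decay * adstock[t - 1] if t > 0 else 0)
df['spend_adstock'] = adstock

# alternatively, explicit lagged features
df['spend_lag1'] = df['spend'].shift(1)
df['spend_lag2'] = df['spend'].shift(2)
print(df)
```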

If you have the year as a feature, you can use it as a numeric feature or as a categorical feature. If you treat it as numeric, you need to consider what a zero means. In a linear model, the intercept usually represents the outcome when all features are zero. But with a feature like year, a zero year isn’t meaningful in most contexts. To solve this, you can shift the values so that the earliest time point, like the first year in your data, becomes zero. This way, the intercept in your model will represent the outcome for this first time point, which is more meaningful. The same goes if you are using months or days as a numeric feature. It doesn’t really matter which year/month/day is zero, just that zero refers to one of the actual time points observed.

Dates and/or times can be a bit trickier. Often you can just split dates out into year, month, day, etc., and proceed as discussed. In other cases you’d want to track the time period to assess possible seasonal effects. You can use something like a cyclic approach (e.g. cyclic spline or sine/cosine transformation) to get at yearly or within-day seasonal effects. As mentioned, a fourier transform can also be used to decompose the time series into its component frequencies for use as model features. Time components like hours, minutes, and seconds can often be dealt with in similar ways, but you will more often deal with the periodicity in the data. For example, if you are looking at hourly data, you may want to consider the 24-hour cycle.
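Here is one way such cyclic (sine/cosine) features might be created for daily data; the dates are hypothetical and the period (365) would change with the cycle you care about:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2024-01-01', periods=365, freq='D')})

# standard components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['dayofyear'] = df['date'].dt.dayofyear

# sine/cosine encoding so that day 365 and day 1 end up close to each other
df['doy_sin'] = np.sin(2 * np.pi * df['dayofyear'] / 365)
df['doy_cos'] = np.cos(2 * np.pi * df['dayofyear'] / 365)
```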


11.6.2.1 Covariance structures

In many cases you’ll have features that vary over time but are not a time-oriented feature like year or month. For example, you might have a feature that is the number of people who visited a website over days. This is a time-varying feature, but it’s not a time metric in and of itself.

In general, we’d like to account for the time-dependent correlations in our data, and the main way to do so is to posit a covariance structure that accounts for this in some fashion. This helps us understand how data points are related to each other over time, and requires us to estimate the correlations between observations. As a starting point, consider linear regression. In a standard linear regression model, we assume that the samples are independent of one another, with a constant variance and no covariance.

Instead, we can also use something like a mixed model, where we include a random effect for each group and estimate the variance attributable to the grouping effect. By default, this ultimately assumes a constant correlation from time point to time point, but many tools allow you to specify a more complex covariance structure. A common method is to use an autoregressive covariance structure that allows correlations further apart in time to lessen. In this sense the covariance comes in as an added random effect, rather than being a model in and of itself as with ARIMA. Many such approaches to covariance structures are special cases of Gaussian processes, which are a very general technique to model time series, spatial, and other types of data.
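To see what such an autoregressive structure implies, here is a tiny sketch of an AR(1) covariance matrix, where the correlation between two time points decays with their distance in time:

```python
import numpy as np

def ar1_covariance(n_times: int, rho: float, sigma: float = 1.0) -> np.ndarray:
    """Covariance matrix where correlation decays as rho**|i - j|."""
    idx = np.arange(n_times)
    return sigma**2 * rho ** np.abs(np.subtract.outer(idx, idx))

print(ar1_covariance(5, rho=0.7).round(2))
```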

Figure 11.8: AR(1) Covariance Structure Visualized

11.7 Spatial Data

Figure 11.9: Spatial Weighting Applied to the Dallas-Fort Worth Area Census Tracts

We visited spatial data in a discussion on non-tabular data (Section 10.4.1), but here we want to talk about it from a modeling perspective, especially within the tabular domain. Say you have a target that is a function of location, such as the proportion of people voting a certain way in a county, or the number of crimes in a city. You can use a spatial regression model, where the target is a function of location among other features that may or may not be spatially oriented. Two approaches already discussed may be applied in the case of having continuous spatial features, such as latitude and longitude, or discrete features like county. For the continuous case, we could use a GAM (Section 7.4), where we use a smooth interaction of latitude and longitude. For the discrete setting, we can use a mixed model, where we include a random effect for county.

There are other traditional techniques for spatial regression, especially in the continuous spatial domain, such as using a spatial lag, where we incorporate information about the neighborhood of an observation’s location into the model (e.g. a weighted mean of neighboring values, as in the visualization above based on code from Walker (2023)). Such models fall under names such as CAR (conditional autoregressive), SAR (spatial autoregressive), BYM, kriging, and so on. These models can be very effective, but are in general different forms of random effects models very similar to those used for time-based settings, and likewise can be seen as special cases of Gaussian processes.
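The spatial lag itself is simple to compute once you have a neighborhood (weights) matrix; here is a toy numpy sketch with a hypothetical adjacency structure:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=5)                       # value observed at each of 5 locations

# hypothetical symmetric adjacency matrix (who neighbors whom)
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

# row-standardize so each row sums to 1; the spatial lag is then a
# weighted mean of neighboring values
W_std = W / W.sum(axis=1, keepdims=True)
spatial_lag = W_std @ y
print(spatial_lag)
```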

11.8 Multivariate Targets

Often you will encounter settings where the target is not a single value, but a vector of values. This is often called a multivariate target in statistical settings, or just the norm for deep learning. For example, you might be interested in predicting the number of people who will buy a product, the number of people who will click on an ad, and the number of people who will sign up for a newsletter. This is a common setting in marketing, where you are interested in multiple outcomes. The main idea is that there is a relationship among the targets, and you want to take this into account.

One model example we’ve already seen is the case where we have more than two categories for the target. Some default approaches may take that input and just do a one-vs-all, for each category, but this kind of misses the point. Others will simultaneously model the multiple targets in some way. On the other hand, it can be difficult to interpret results with multiple targets. Because of this, you’ll often see results presented in terms of the respective targets anyway, and often even ignoring parameters specifically associated with such a model9.

In deep learning contexts, the multivariate setting is ubiquitous. For example, if you want to classify the content of an image, you might have to predict something like different species of animals, or different types of car models. In natural language processing, you might want to predict the probability of different words in a sentence. In some cases, there are even multiple kinds of targets considered simultaneously! It can get very complex, but often in these settings prediction performance far outweighs the need to interpret specific parameters, and so it’s a good fit.


11.9 Latent Variables

Figure 11.10: Latent Variable Model (Bifactor)

Latent variables are a fundamental aspect of modeling, and simply put, are variables that are not directly observed, but are inferred from other variables. Here are some examples of what might be called latent variables:

• The linear combination of features in a linear regression model is a latent variable, but usually we only think of it as such before the link transformation in GLMs (Chapter 6).
• The error term in any model is a latent variable representing all the unknown/unobserved/unmodeled factors that influence the target (Equation 3.3).
• The principal components in PCA (Chapter 10).
• The measurement error in any feature or target.
• The factor scores in a factor analysis model or structural equation (visualization above).
• The true target underlying the censored values (Section 11.5).
• The clusters in cluster analysis/mixture models (Section 10.2.1.1).
• The random effects in a mixed model (Section 7.3).
• The hidden states in a hidden Markov model.
• The hidden layers in a deep learning model (Section 9.7).

It’s easy to see from such a list that latent variables are very common in modeling, so it’s good to get comfortable with the concept. Whether they’re appropriate to your specific situation will depend on a variety of factors, but they can be very useful in many settings, if not a required part of the modeling approach.

11.10 Data Augmentation

Data augmentation is a technique where you artificially increase the size of your dataset by creating new data points based on the existing data. This is a common technique in deep learning for computer vision, where you might rotate, flip, or crop images to create new training data. This can help improve the performance of your model, especially when you have a small dataset. Techniques are also available for text.


In the tabular domain, data augmentation is less common, but still possible. You’ll see it most commonly applied with class-imbalance settings (Section 11.4), where you might create new data points for the minority class to balance the dataset. This can be done by randomly sampling from the existing data points, or by creating new data points based on the existing data points. For the latter, SMOTE and many variants of it are quite common.
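As a sketch, the imbalanced-learn package provides SMOTE and its variants; the simulated data here is hypothetical:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))                      # heavily imbalanced

# synthetic minority-class observations are created by interpolating
# between existing minority-class neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                  # classes now balanced
```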

Unfortunately for tabular data, these techniques are not nearly as successful as augmentation for computer vision or natural language processing, nor consistently so. Part of the issue is that tabular data is very noisy and fraught with measurement error, so in a sense, such techniques are just adding noise to the modeling process10. Downsampling the majority class can potentially throw away useful information. Simple random upsampling of the minority class can potentially lead to an overconfident model that still doesn’t generalize well. In the end, the best approach is to get more and/or better data, but hopefully more successful methods will be developed in the future.

11.11 Wrapping Up

There’s a lot going on with data before you ever get to modeling, and which will affect every aspect of your modeling approach. This chapter outlines common data types, issues, and associated modeling aspects, but in the end, you’ll always have to make decisions based on your specific situation, and they will often not be easy ones. These are only some of the things to consider, so be ready for surprises, and be ready to learn from them!

11.11.1 The common thread

Many of the transformations and missing data techniques could possibly be applied in many modeling settings. Likewise, you may find yourself dealing with different target variable issues like imbalance or censoring, and deal with temporal, spatial or other structures, in a variety of models. The key is to understand the data, the target, and the features, and to make the best decisions you can based on that understanding.

11.11.2 Choose your own adventure

Consider revisiting a model covered in the other parts of this book in light of the data issues discussed here. For example, how might you deal with class imbalance for a boosted tree model? How would you deal with spatial structure in a neural network? How would you deal with a multivariate target in a time series model?

11.11.3 Additional resources

Here are some additional resources to help you learn more about the topics covered in this chapter.

Transformations


Appendix D — Dataset Descriptions

D.2 World Happiness Report

The World Happiness Report is a survey of the state of global happiness that ranks countries by how ‘happy’ their citizens perceive themselves to be. You can also find additional details in their supplemental documentation. Our 2018 data is from what was originally reported at that time (figure 2.2) and it also contains a life ladder score from the most recent survey, which is similar and very highly correlated.

The dataset contains the following columns:

Table D.2: World Happiness Report Dataset (All Years)

Table D.3: World Happiness Report Dataset (2018)

    D.3 Heart Disease UCI

    This classic dataset comes from the UCI ML repository. We took a version from Kaggle, and features and target were renamed to be more intelligible. Here is a brief description from UCI:

    This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).

Table D.4: Heart Disease UCI Dataset

    D.4 Fish

    A very simple data set with a count target variable available for an exercise in the GLM chapter. Also good if you want to try your hand at zero-inflated models. The background is that state wildlife biologists want to model how many fish are being caught by fishermen at a state park.

    • nofish: We’ve never seen this explained. Originally 0 and 1, 0 is equivalent to livebait == ‘yes’, so it may be whether the primary motivation of the camping trip is for fishing or not.

Table D.5: Fish Dataset

5  How Did We Get Here?

      5.2.1 Other Setup

For the R examples, nothing beyond base R is needed after the above. For Python examples, the following should be enough to get you through the examples:

      import pandas as pd

      5.3 Starting Out by Guessing

So, we’ll start with a model in which we predict a country’s level of happiness by its life expectancy, where if you can expect to live longer, you’re probably in a country with better health care, higher incomes, and other important stuff. We’ll stick with our simple linear regression model as well.

As a starting point we can just guess what the parameter should be, but how would we know what to guess? How would we know which guesses are better than others? Let’s try a couple and see what happens. Let’s say that we think all countries start at the bottom on the happiness scale (around 3), but life expectancy makes a big impact: for every standard deviation of life expectancy we go up a whole point on happiness1. We can plug this into the model and see what we get:

\[
\text{happiness} = 3 + 1\cdot\text{life expectancy (standardized)}
\]

      How do we know which is better? Let’s find out!

      5.4 Prediction Error

      We’ve seen that a key component to model assessment involves comparing the predictions from the model to the actual values of the target. This difference is known as the prediction error, or residuals in more statistical contexts. We can express this as:

      \[ \epsilon = y - \hat{y} \] \[ \text{error} = \text{target} - \text{model-based guess} \]


      This prediction error tells us how far off our model prediction is from the observed target values, and it gives us a way to compare models. With our measure of prediction error, we can calculate a total error for all observations/predictions (Section 4.2), or similarly, the average error. If one model or parameter set has less total or average error, we can say it’s a better model than one that has more (Section 4.2.3). Ideally we’d like to choose a model with the least error, but we’ll see that this is not always possible2.

      However, if we just take the average of our errors from a linear regression model, you’ll see that it is roughly zero! This is by design for many common models, where we even will explicitly write the formula for the error as coming from a normal distribution with mean of zero. So, to get a meaningful error metric, we need to use the squared error value or the absolute value. These also allow errors of similar value above and below the observed value to cost the same3. We’ll use squared error here, and we’ll calculate the mean of the squared errors for all our predictions.
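You can verify this for yourself with a quick simulation; the data here is made up, but the pattern holds for any least squares fit with an intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3 + 1.0 * x + rng.normal(scale=0.5, size=200)

fit = LinearRegression().fit(x.reshape(-1, 1), y)
errors = y - fit.predict(x.reshape(-1, 1))

print(errors.mean())          # essentially zero by construction
print((errors**2).mean())     # mean squared error
print(np.abs(errors).mean())  # mean absolute error
```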

Table 5.2: Comparison of Error Metrics for Two Models

      5.5 Ordinary Least Squares

In a simple linear model, we often use the Ordinary Least Squares (OLS) method to estimate parameters. This method finds the coefficients that minimize the sum of the squared differences between the predicted and actual values4, which is what we just did in our previous example. The sum of the squared errors is also called the residual sum of squares (RSS), as opposed to the ‘total’ sums of squares (i.e. the variance of the target), and the part explained by the model (‘model’ or ‘explained’ sums of squares). We can express this as follows, where \(y_i\) is the observed value of the target for observation \(i\), and \(\hat{y_i}\) is the predicted value from the model.

\[
\text{Value} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2
\tag{5.1}\]

      It’s called ordinary least squares because there are other least squares methods - generalized least squares, weighted least squares, and others, but we don’t need to worry about that for now. The sum or mean of the squared errors is our objective value. The process of taking the predictions and observed target values as inputs, and returning this value as an output is our objective function. We can use this value to find the best parameters for a specific model, as well as compare models with different parameters.

      Now let’s calculate the OLS estimate for our model. We need our own function to do this, but it doesn’t take much. We need to map our inputs to our output, which are the model predictions. We then calculate the error, square it, and then average the squared errors to provide the mean squared error.
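Such a function might look like the following sketch (the simulated happiness and life expectancy values are stand-ins for the actual data used in the book):

```python
import numpy as np

def mse_objective(params, x, y):
    """Mean squared error for a simple linear model y = b0 + b1 * x."""
    b0, b1 = params
    y_hat = b0 + b1 * x
    return np.mean((y - y_hat)**2)

# stand-in data: standardized life expectancy and happiness
rng = np.random.default_rng(1)
life_exp_sd = rng.normal(size=100)
happiness = 5.5 + 0.9 * life_exp_sd + rng.normal(scale=0.5, size=100)

print(mse_objective([3.0, 1.0], life_exp_sd, happiness))  # our first guess
print(mse_objective([5.5, 0.9], life_exp_sd, happiness))  # a much better guess
```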

Figure 5.2: Results of parameter search

    5.6 Optimization

    Before we get into other objective functions, let’s think about a better way to find good parameters for our model. Rather than just guessing, we can use a more systematic approach, and thankfully, there are tools out there to help us. We just use a function like our OLS function, give it a starting point, and let the algorithms do the rest! These tools eventually arrive at a pretty good set of parameters, and are optimized for speed.

    Previously we created a set of guesses, and tried each one in a manner called a grid search, and it is a bit of a brute force approach to finding the best fitting model. You can maybe imagine a couple of unfortunate scenarios for this approach, such as having a very large number of parameters to search. Or it may be that our range of guesses doesn’t allow us to find the right set of parameters, or we specify a very large range, but the best fitting model is within a very narrow part of that, so it takes a long time to find. In any of these cases we waste a lot of time or may not find an optimal solution.

    In general, we can think of optimization as employing a smarter, more efficient way to find what you’re looking for. Here’s how it works:

Table 5.3: Comparison of Our Results to a Standard Function
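In Python, this kind of comparison can be reproduced by handing an objective function like the one above to a general-purpose optimizer; this sketch uses scipy with simulated stand-in data:

```python
import numpy as np
from scipy.optimize import minimize

def mse_objective(params, x, y):
    b0, b1 = params
    return np.mean((y - (b0 + b1 * x))**2)

# stand-in data
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 5.5 + 0.9 * x + rng.normal(scale=0.5, size=100)

result = minimize(mse_objective, x0=np.array([0.0, 0.0]), args=(x, y))
print(result.x)  # estimated intercept and slope, close to the OLS solution
```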

    5.7 Maximum Likelihood

    In our example, we’ve been minimizing the mean of the squared errors to find the best parameters for our model. But let’s think about this differently. Now we’d like you to think about the data generating process. Ignoring the model, imagine that each happiness value is generated by some random process, like drawing from a normal distribution. So, something like this would describe it mathematically:

\[
\text{happiness} \sim N(\text{mean}, \text{sd})
\]
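A minimal sketch of estimating such a model by maximum likelihood, where the mean of that normal distribution depends on life expectancy and we maximize the summed log density of the observed values (equivalently, minimize its negative); the data is simulated stand-in data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# stand-in data
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 5.5 + 0.9 * x + rng.normal(scale=0.5, size=100)

def neg_log_likelihood(params, x, y):
    b0, b1, log_sd = params
    mu = b0 + b1 * x
    # sum of log densities of the observed values under the assumed normal
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sd)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0, 0.0]), args=(x, y))
print(result.x[:2])         # intercept and slope, matching OLS for this model
print(np.exp(result.x[2]))  # estimated residual standard deviation
```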

Table 5.4: Comparison of Our Results to a Standard Function

To use a maximum likelihood approach for linear models, you can use functions like glm in R or GLM in Python, which is the reference used in the table above. We can also use different likelihoods corresponding to the binomial, Poisson, and other distributions. Still other packages would allow even more distributions for consideration. In general, we choose a distribution that we feel best reflects the data generating process. For binary targets for example, we typically would feel a Bernoulli or binomial distribution is appropriate. For count data, we might choose a Poisson or negative binomial distribution. For targets that fall between 0 and 1, we might go for a beta distribution. You can see some of these demonstrated in Chapter 6.

    There are many distributions to choose from, and the best one depends on your data. Sometimes, even if one distribution seems like a better fit, we might choose another one because it’s easier to use. Some distributions are special cases of others, or they might become more like a normal distribution under certain conditions. For example, the exponential distribution is a special case of the gamma distribution, and a t distribution with many degrees of freedom looks like a normal distribution. Here is a visualization of the relationships among some of the more common distributions (Wikipedia (2023)).

    @@ -3027,7 +3044,7 @@

    -Figure 4.3: Relationships Among Some Probability Distributions +Figure 5.3: Relationships Among Some Probability Distributions

    @@ -3070,8 +3087,8 @@

    -
    -

    4.7.1 Diving deeper

    +
    +

    5.7.1 Diving deeper

    Let’s think more about what’s going on here. It turns out that our objective function defines a ‘space’ or ‘surface’. You can imagine the process as searching for the lowest (or highest) point on a landscape, with each guess a point on that landscape. Let’s start to get a sense of this with the following visualization, which shows the likelihood for a single parameter. The data come from a variable with a true average of 5. As our guesses get closer to 5, the likelihood increases, and with more and more data, the final guess converges on the true value. Model estimation finds that maximum on the curve, and optimization algorithms are the means to find it.
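    As a rough sketch of the idea behind the figure below (the actual plotting code is not shown here), you can compute the log-likelihood over a grid of guesses yourself:

    set.seed(123)
    y = rnorm(100, mean = 5, sd = 1)    # data with a true average of 5

    # log-likelihood of the data at each guess for the mean (sd fixed at 1 for simplicity)
    guesses = seq(3, 7, by = 0.1)
    loglik  = sapply(guesses, function(m) sum(dnorm(y, mean = m, sd = 1, log = TRUE)))

    guesses[which.max(loglik)]    # the guess with the highest likelihood sits near 5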

    @@ -3081,7 +3098,7 @@

    -Figure 4.4: Likelihood Function for One Parameter +Figure 5.4: Likelihood Function for One Parameter

    @@ -3093,10 +3110,10 @@

    - +
    -Figure 4.5: Likelihood Surface for Happiness and Life Expectancy (interactive) +Figure 5.5: Likelihood Surface for Happiness and Life Expectancy (interactive)

    @@ -3110,7 +3127,7 @@

    -Figure 4.6: Optimization Path for Two Parameters +Figure 5.6: Optimization Path for Two Parameters
    @@ -3153,17 +3170,17 @@

-
-

4.8 Penalized Objectives

+
+

5.8 Penalized Objectives

-

One thing we may want to take into account with our models is their complexity, especially in the context of overfitting. We talk about this with machine learning also (Chapter 7), but the basic idea is that we can get too familiar with the data we have, and when we try to predict on new data the model hasn’t seen before, our performance suffers, or even gets worse than a simpler model. In other words, we are not generalizing well (Section 7.4).

+

One thing we may want to take into account with our models is their complexity, especially in the context of overfitting. We talk about this with machine learning also (Chapter 8), but the basic idea is that we can get too familiar with the data we have, and when we try to predict on new data the model hasn’t seen before, our performance suffers, or even gets worse than a simpler model. In other words, we are not generalizing well (Section 8.4).

One way to deal with this is to penalize the objective function value for complexity, or at least favor simpler models that might do as well. In some contexts this is called regularization, and in other contexts shrinkage, since the parameter estimates are typically shrunk toward some specific value (e.g., zero).

As a starting point, in our basic linear model we can add a penalty that is applied to the size of the coefficients. This is called ridge regression, or, more mathily, L2 regularization. We can write this formally as:

\[ \text{Value} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \tag{5.2}\]

-

The first part is the same as basic OLS (Equation 4.1), but the second part is the penalty for \(p\) features. The penalty is the sum of the squared coefficients multiplied by some value, which we call \(\lambda\). This is an additional model parameter that we typically want to estimate, e.g. through cross-validation. This kind of parameter is often called a hyperparameter, mostly just to distinguish it from those that may be of actual interest. For example, we could probably care less what the actual value for \(\lambda\) is, but we would still be interested in the coefficients.

-

In the end this is just a small change to ordinary least squares (OLS) regression (Equation 4.1), but it can make a big difference. It introduces some bias in the coefficients - recall that OLS is unbiased if assumptions are met - but it can help to reduce variance, which can help the model perform better on new data (Section 7.4.2). In other words, we are willing to accept some bias in order to get a model that generalizes better.


+

The first part is the same as basic OLS (Equation 5.1), but the second part is the penalty for \(p\) features. The penalty is the sum of the squared coefficients multiplied by some value, which we call \(\lambda\). This is an additional model parameter that we typically want to estimate, for example through cross-validation. This kind of parameter is often called a hyperparameter, mostly just to distinguish it from those that may be of actual interest. For example, we probably couldn’t care less what the actual value of \(\lambda\) is, but we would still be interested in the coefficients.

+

In the end this is just a small change to ordinary least squares (OLS) regression (Equation 5.1), but it can make a big difference. It introduces some bias in the coefficients - recall that OLS is unbiased if assumptions are met - but it reduces variance, which can help the model perform better on new data (Section 8.4.2). In other words, we are willing to accept some bias in order to get a model that generalizes better.

But let’s get to a code example to demystify this a bit! Here is an example of a function that calculates the ridge objective. To make things interesting, let’s add the other features we talked about regarding GDP per capita and perceptions of corruption.
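The chapter’s own function isn’t reproduced in this diff, but a minimal sketch of such a ridge objective might look like the following; the feature names (life_exp, gdp_pc, corrupt) and the df_happiness data frame are placeholder assumptions used for illustration.

ridge_objective = function(par, X, y, lambda = 0.1) {
    # par holds the intercept followed by the feature coefficients
    mu = cbind(1, X) %*% par

    # squared error loss plus the L2 penalty on the coefficients (intercept excluded)
    sum((y - mu)^2) + lambda * sum(par[-1]^2)
}

# assumed setup, mirroring the chapter's other examples:
# X = as.matrix(df_happiness[, c('life_exp', 'gdp_pc', 'corrupt')])
# optim(par = rep(0, 4), fn = ridge_objective, X = X, y = df_happiness$happiness)$par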

@@ -3239,7 +3256,7 @@

-Table 4.5: Comparison of Ridge Regression Results +Table 5.5: Comparison of Ridge Regression Results
@@ -3740,17 +3757,17 @@

Another very common penalized approach is to use the sum of the absolute value of the coefficients, which is called lasso regression or L1 regularization. An interesting property of the lasso is that in typical implementations, it will potentially zero out coefficients, which is the same as dropping the feature from the model altogether. This is a form of feature selection or variable selection. The true values are never zero, but if we want to use a ‘best subset’ of features, this is one way we could do so. We can write the lasso objective as follows. The chapter exercise asks you to implement this yourself.

\[ \text{Value} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{5.3}\]

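The chapter exercise asks you to implement this yourself, so treat the following as just one possible sketch; relative to the ridge objective sketched above, only the penalty line changes.

lasso_objective = function(par, X, y, lambda = 0.1) {
    mu = cbind(1, X) %*% par

    # the single change from ridge: absolute values of the coefficients instead of squares
    sum((y - mu)^2) + lambda * sum(abs(par[-1]))
}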

-
-

4.9 Classification

+
+

5.9 Classification

So far, we’ve been assuming a continuous target, but what if we have a categorical target? Now we have to learn a bunch of new stuff for that situation, right? Actually, no! When we want to model categorical targets, conceptually, nothing changes! We still have an objective function that maximizes or minimizes some goal, and we can use the same algorithms to estimate parameters. However, we need to think about how we can do this in a way that makes sense for the binary target.

-
-

4.9.1 Misclassification rate

+
+

5.9.1 Misclassification rate

A straightforward correspondence to MSE is a function that minimizes classification error, or by the same token, maximizes accuracy. In other words, we can think of the objective function as the proportion of incorrect classifications. This is called the misclassification rate.

\[ \text{Loss} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i \neq \hat{y_i}) \tag{5.4}\]


In the equation, \(y_i\) is the actual value of the target for observation \(i\), arbitrarily coded as 1 or 0, and \(\hat{y_i}\) is the predicted class from the model. The \(\mathbb{1}\) is an indicator function that returns 1 if the condition is true, and 0 otherwise. In other words, we are counting the number of times the predicted value is not equal to the actual value, and dividing by the number of observations. Very straightforward, so let’s do this ourselves!
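A minimal sketch of such an objective might look like this; the chapter’s own function may differ in its details, but the idea is the same.

misclassification_rate = function(par, X, y) {
    mu    = cbind(1, X) %*% par          # linear predictor
    p     = 1 / (1 + exp(-mu))           # convert to a probability (inverse logit)
    y_hat = ifelse(p > .5, 1, 0)         # threshold at .5 to get a predicted class

    mean(y != y_hat)                     # proportion of incorrect classifications
}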

@@ -3801,14 +3818,14 @@

-

Note that our function first adds a step to convert the linear predictor (called mu) to a probability. Once we have a probability, we use some threshold to convert it to a ‘class’. In this case, we use 0.5 as the threshold, but this could be different depending on the context, something we talk more about elsewhere (Section 3.2.2.7). We’ll leave it as an exercise for you to play around with this, as the next objective function is more commonly used. But at least you can see how easy it can be to switch to the classification case.

+

Note that our function first adds a step to convert the linear predictor (called mu) to a probability. Once we have a probability, we use some threshold to convert it to a ‘class’. In this case, we use 0.5 as the threshold, but this could be different depending on the context, something we talk more about elsewhere (Section 4.2.2.7). We’ll leave it as an exercise for you to play around with this, as the next objective function is more commonly used. But at least you can see how easy it can be to switch to the classification case.

-
-

4.9.2 Log loss

+
+

5.9.2 Log loss

Another approach is to use the log loss, sometimes called logistic loss or cross-entropy. In the binary case it is:

\[ \text{Loss} = -\sum_{i=1}^{n} y_i \log(\hat{y_i}) + (1 - y_i) \log(1 - \hat{y_i}) \tag{5.5}\]


Where \(y_i\) is the actual value of the target for observation \(i\), and \(\hat{y_i}\) is the predicted value from the model (essentially a probability). It turns out that this is the same as the log-likelihood used in a maximum likelihood approach for logistic regression, made negative so we can minimize it.

We typically prefer this objective function to classification error because it results in a smooth optimization surface, like in the visualization we showed before for maximum likelihood, which means it is differentiable in a mathematical sense. This is important because it allows us to use optimization algorithms that rely on derivatives in updating the parameter estimates. You don’t really need to get into that too much, but just know that a smoother objective function is something we prefer. Here’s some code to try out.
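Here is a hedged sketch of what that code might look like; it mirrors the misclassification function above but returns the (negative) Bernoulli log-likelihood instead.

log_loss = function(par, X, y) {
    mu = cbind(1, X) %*% par
    p  = 1 / (1 + exp(-mu))

    # negative log-likelihood of a Bernoulli outcome, i.e. the binary cross-entropy
    -sum(y * log(p) + (1 - y) * log(1 - p))
}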

@@ -3916,7 +3933,7 @@

-Table 4.6: Comparison of Log-Loss Results +Table 5.6: Comparison of Log-Loss Results
@@ -4400,8 +4417,8 @@

So, when it comes to classification, you should feel confident in what’s going on under the hood, just like you did with a numeric target. Too much is made of the distinction between ‘regression’ and ‘classification’ and it can be confusing to those starting out. In reality, classification just requires a slightly different way of thinking about the target. Conceptually it really is the same approach.

-
-

4.10 Optimization Algorithms

+
+

5.10 Optimization Algorithms

When it comes to optimization, there are a number of algorithms that have been developed over time. The main thing to keep in mind is that these are all just ways to find the best fitting parameters for a model. Some may be better suited for certain data tasks, or provide computational advantages, but often the choice of algorithm is not as important as many other modeling choices.

Here are some of the options available in R’s optim or scipy’s minimize function:

    @@ -4421,9 +4438,9 @@

    The choice of one method over another is usually based on factors like speed, memory use, or how well it works for certain models. In statistical contexts, many functions for generalized linear models use Newton’s method by default, but more complicated models may implement a different approach for better convergence. In machine learning, stochastic gradient descent is popular because it is efficient in large data settings and relatively easy to implement.

    In general, we can always try different methods to see which works best, but the results will usually be similar if the estimates converge. We’ll now demonstrate one of the most popular optimization methods used in machine learning - gradient descent - but know that there are many variants of it one might use.

    -
    -

    4.10.1 Gradient descent

    -

    One of the most popular approaches in optimization is called gradient descent. It uses the gradient of the function we’re trying to optimize to find the best parameters. We still use objective functions as before, and gradient descent is just a way to find that path along the objective surface. More formally, the gradient is the vector of partial derivatives of the objective function with respect to each parameter. That may not mean much to you, but the basic idea is that the gradient provides a direction that points in the direction of steepest increase in the function. So if we want to maximize the objective function, we can take a step in the direction of the gradient, and if we want to minimize it, we can take a step in the opposite direction of the gradient (use the negative gradient). The size of the step is called the learning rate, and, like our penalty parameter we saw with penalized regression, it is a hyperparameter that we can tune through cross-validation (Section 7.7). If the learning rate is too small, it will take a longer time to converge. If it’s too large, we might overshoot the objective and miss the best parameters. There are a number of variations on gradient descent that have been developed over time. Let’s see this in action with the world happiness model.

    +
    +

    5.10.1 Gradient descent

    +

    One of the most popular approaches in optimization is called gradient descent. It uses the gradient of the function we’re trying to optimize to find the best parameters. We still use objective functions as before, and gradient descent is just a way to find a path along the objective surface. More formally, the gradient is the vector of partial derivatives of the objective function with respect to each parameter. That may not mean much to you, but the basic idea is that the gradient points in the direction of steepest increase of the function. So if we want to maximize the objective function, we can take a step in the direction of the gradient, and if we want to minimize it, we can take a step in the opposite direction (the negative gradient). The size of the step is called the learning rate, and, like the penalty parameter we saw with penalized regression, it is a hyperparameter that we can tune through cross-validation (Section 8.7). If the learning rate is too small, it will take a long time to converge; if it’s too large, we might overshoot the objective and miss the best parameters. There are a number of variations on gradient descent that have been developed over time. Let’s see this in action with the world happiness model.
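    Before the results table, here is a stripped-down sketch of gradient descent for a least squares objective; the learning rate and iteration count are arbitrary placeholders, and the chapter’s own implementation has more moving parts.

    gradient_descent = function(X, y, learning_rate = 0.01, n_iter = 1000) {
        X    = cbind(1, X)              # add an intercept column
        beta = rep(0, ncol(X))          # start every parameter at zero

        for (i in 1:n_iter) {
            resid = y - X %*% beta
            grad  = -2 * t(X) %*% resid / nrow(X)    # gradient of the mean squared error
            beta  = beta - learning_rate * grad      # step opposite the gradient (downhill)
        }

        beta
    }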

    @@ -4534,7 +4551,7 @@

    -Table 4.7: Comparison of Gradient Descent Results +Table 5.7: Comparison of Gradient Descent Results
    @@ -5024,15 +5041,15 @@

    -Figure 4.7: Loss with Gradient Descent +Figure 5.7: Loss with Gradient Descent

    -
    -

    4.10.2 Stochastic gradient descent

    +
    +

    5.10.2 Stochastic gradient descent

    Stochastic gradient descent (SGD) is a version of gradient descent that uses a random sample of the data to estimate the gradient, instead of using all the data. This makes it less accurate in some ways, but it’s faster and can be parallelized. That speed is useful in machine learning when there’s a lot of data, which also often makes the discrepancy between standard GD and SGD small. As such, you will see variants of it incorporated in many models in deep learning, but it can be used with much simpler models as well.

    Let’s see this in action with the happiness model. The following is a conceptual version of the AdaGrad approach10, which is a variation of SGD that adjusts the learning rate for each parameter. We also add an option that averages the parameter estimates across iterations, a common way to improve the performance of SGD; it is not used by default, but it’s something you can experiment with. We use a ‘batch size’ of one, which is similar to a ‘streaming’ or ‘online’ version where we update the model with each observation. Since our data are alphabetically ordered, we shuffle the data first. We also use a stepsize_tau parameter, which adjusts the learning rate at early iterations; we set it to zero for now. The values for the learning rate and stepsize_tau are arbitrary, selected after some initial exploration, but you can play with them to see how they affect the results.
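    The chapter’s version adds AdaGrad-style per-parameter learning rates, optional averaging, and the stepsize_tau adjustment; the following is a much simpler sketch of the core idea, updating with one shuffled observation at a time.

    stochastic_gd = function(X, y, learning_rate = 0.01, n_epochs = 10) {
        X    = cbind(1, X)
        beta = rep(0, ncol(X))

        for (epoch in 1:n_epochs) {
            for (i in sample(nrow(X))) {                  # shuffle the order each pass
                resid = y[i] - sum(X[i, ] * beta)
                grad  = -2 * X[i, ] * resid               # gradient from a single observation
                beta  = beta - learning_rate * grad
            }
        }

        beta
    }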

    @@ -5216,7 +5233,7 @@

    -Table 4.8: Comparison of Stochastic Gradient Descent Results +Table 5.8: Comparison of Stochastic Gradient Descent Results
    @@ -5712,7 +5729,7 @@

    -Figure 4.8: Stochastic Gradient Descent Path +Figure 5.8: Stochastic Gradient Descent Path

    @@ -5725,7 +5742,7 @@

    -Table 4.9: Comparison of Optimization Results +Table 5.9: Comparison of Optimization Results
    @@ -6242,11 +6259,11 @@

    -

    4.11 Other Estimation Approaches

    +
    +

    5.11 Other Estimation Approaches

    Before leaving our estimation discussion, we should mention there are other approaches one could use, including variations on least squares, the method of moments, generalized estimating equations, robust estimation, and more. We’ve focused on the most common ones, but it’s good to be aware of others that might be more prevalent in some domains. There are two, however, that we want to discuss in a bit more detail given their widespread use: the bootstrap and Bayesian estimation.

    -
    -

    4.11.1 Bootstrap

    +
    +

    5.11.1 Bootstrap

    The bootstrap is a method where we create new data sets by randomly sampling the data from our original set, allowing the same data to be picked more than once. We then use these new data sets to estimate our model. We do this many times, collecting parameter estimates, predictions, or anything we want to calculate along the way. Ultimately, we end up with a distribution of all the things we calculated.

    These results give us a range of possible outcomes, which is useful for inference12, as we can use the distribution to calculate confidence intervals, prediction intervals, or intervals for any value we calculate. The average estimate is often the same as what the underlying model would produce, but the bootstrap provides a way to get a measure of uncertainty with fewer assumptions about how that distribution should take shape. The approach is very flexible, and it can be used with any model. Let’s see this in action with the happiness model.
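    The chapter’s own bootstrap code is not shown in this diff; a minimal sketch for a linear model might look like the following, with df_happiness and its columns as placeholder names.

    bootstrap_lm = function(data, formula, n_boot = 500) {
        estimates = replicate(n_boot, {
            idx = sample(nrow(data), replace = TRUE)      # resample rows with replacement
            coef(lm(formula, data = data[idx, ]))         # refit and keep the coefficients
        })

        t(estimates)                                      # one row per bootstrap sample
    }

    # assumed usage:
    # boots = bootstrap_lm(df_happiness, happiness ~ life_exp + gdp_pc + corrupt)
    # apply(boots, 2, quantile, probs = c(.025, .975))    # a 95% interval for each coefficient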

    @@ -6370,7 +6387,7 @@

    -Table 4.10: Bootstrap Parameter Estimates +Table 5.10: Bootstrap Parameter Estimates
    @@ -6881,7 +6898,7 @@


    -Figure 4.9: Bootstrap Distributions of Parameter Estimates +Figure 5.9: Bootstrap Distributions of Parameter Estimates
    @@ -6889,8 +6906,8 @@


    The bootstrap is often used for predictions and other metrics. However, it is computationally inefficient, and might not be suitable with large data sizes. It also may not estimate the appropriate uncertainty for some types of statistics (e.g. extreme values) or in some data contexts (e.g. correlated observations). Despite these limitations, the bootstrap method is a useful tool and can be used together with other methods to understand uncertainty in a model.

    -
    -

    4.11.2 Bayesian estimation

    +
    +

    5.11.2 Bayesian estimation

    The Bayesian approach to modeling is many things - a philosophical viewpoint, an entirely different way to think about probability, a different way to measure uncertainty, and on a practical level, just another way to get model parameter estimates. It can be as frustrating as it is fun to use, and one of the really nice things about Bayesian estimation is that it can handle model complexities that other approaches don’t handle well.

    The basis of Bayesian estimation is the likelihood, the same as with maximum likelihood, and everything we did there applies here. So you need a good grasp of maximum likelihood to understand the Bayesian approach. However, the Bayesian approach is different because it also lets us use our knowledge about the parameters through prior distributions. For example, we may think that the coefficients for a linear model come from a normal distribution centered on zero with some variance. That would serve as our prior. The combination of a prior distribution with the likelihood results in the posterior distribution, which is a distribution of possible parameter values.
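    To make that combination concrete, here is a toy grid approximation for a single mean parameter; this is only a sketch of the mechanics with made-up numbers, not how the chapter’s models are actually estimated.

    set.seed(42)
    y = rnorm(50, mean = 5, sd = 1)                      # toy data

    grid      = seq(0, 10, length.out = 200)             # candidate values for the mean
    log_prior = dnorm(grid, mean = 0, sd = 10, log = TRUE)    # a vague normal prior centered at zero
    log_lik   = sapply(grid, function(m) sum(dnorm(y, mean = m, sd = 1, log = TRUE)))

    log_post = log_prior + log_lik                       # prior times likelihood, on the log scale
    post     = exp(log_post - max(log_post))
    post     = post / sum(post)                          # normalize to a proper (discrete) posterior

    grid[which.max(post)]                                # posterior mode, close to the sample mean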

    @@ -6900,7 +6917,7 @@

    -Figure 4.10: Prior, likelihood, posterior and distributions +Figure 5.10: Prior, likelihood, posterior and distributions
    @@ -6913,7 +6930,7 @@

    -Figure 4.11: Posterior Distribution of Parameters +Figure 5.11: Posterior Distribution of Parameters
    @@ -6929,13 +6946,13 @@

    -Figure 4.12: Bayesian Chains for Life Expectancy Coefficient +Figure 5.12: Bayesian Chains for Life Expectancy Coefficient
    -

    When we are interested in making predictions, we can use the results to generate a distribution of possible predictions for each observation, which can be very useful when we want to quantify uncertainty for complex models. This is referred to as posterior predictive distribution, which is explored in non-bayesian context in Section 3.2.4. Here is a plot of several draws of predicted values against the true happiness scores.

    +

    When we are interested in making predictions, we can use the results to generate a distribution of possible predictions for each observation, which can be very useful when we want to quantify uncertainty for complex models. This is referred to as the posterior predictive distribution, which is explored in a non-Bayesian context in Section 4.2.4. Here is a plot of several draws of predicted values against the true happiness scores.

    @@ -6944,7 +6961,7 @@

    -Figure 4.13: Posterior Predictive Distribution of Happiness Values +Figure 5.13: Posterior Predictive Distribution of Happiness Values
    @@ -6955,7 +6972,7 @@

    -Table 4.11: Bayesian R2 +Table 5.11: Bayesian R2
    @@ -7440,12 +7457,12 @@

    -

    As we saw in Section 3.2.4, nothing is keeping you from doing ‘posterior predictive checks’ with other estimation approaches, and it’s a very good idea to do so. For example, in a GLM you have the beta estimates and the covariance matrix for them, and can simulate from a normal distribution with those estimates. It’s just more straightforward with the Bayesian approach, where packages will do it for you with little effort.

    +

    As we saw in Section 4.2.4, nothing is keeping you from doing ‘posterior predictive checks’ with other estimation approaches, and it’s a very good idea to do so. For example, in a GLM you have the beta estimates and the covariance matrix for them, and can simulate from a normal distribution with those estimates. It’s just more straightforward with the Bayesian approach, where packages will do it for you with little effort.
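    As a hedged sketch of that idea, for any fitted glm object (here generically called model_fit, a stand-in name) you can draw coefficient vectors from a multivariate normal and push them through the model:

    library(MASS)    # for mvrnorm

    # model_fit stands in for any fitted glm, e.g. a logistic regression
    sim_coefs = mvrnorm(1000, mu = coef(model_fit), Sigma = vcov(model_fit))

    # each row is one plausible coefficient vector; convert to predictions via the inverse link
    X_mat     = model.matrix(model_fit)
    sim_probs = plogis(X_mat %*% t(sim_coefs))    # rows = observations, columns = simulations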

    -
    -

    4.11.2.1 Additional Thoughts

    +
    +

    5.11.2.1 Additional Thoughts

    It turns out that any standard (frequentist) statistical model can be seen as a Bayesian one from a particular point of view. Here are a couple:

    • GLM and related models estimated via maximum likelihood: Bayesian estimation with a flat/uniform prior on the parameters.
    • @@ -7458,19 +7475,19 @@

    -
    -

    4.12 Wrapping Up

    +
    +

    5.12 Wrapping Up

    Wow, we covered a lot here! But this is the sort of stuff that can take you from just having some fun with data to doing that and also understanding how things actually happen. Just having the gist of how modeling is done ‘under the hood’ makes so many other things make sense, and can give you a lot of confidence, even in less familiar modeling domains.

    -
    -

    4.12.1 The common thread

    +
    +

    5.12.1 The common thread

    Simply put, the content in this chapter ties together any and every model you will ever undertake, from linear regression to reinforcement learning, computer vision, and large language models. Estimation and optimization are the core of any modeling process, and understanding them is key to understanding how models work.

    -
    -

    4.12.2 Choose your own adventure

    +
    +

    5.12.2 Choose your own adventure

    Seriously, after this chapter, you should feel fine with any of the others in this book, so dive in!

    -
    -

    4.12.3 Additional resources

    +
    +

    5.12.3 Additional resources

    OLS and Maximum Likelihood Estimation:

    For OLS and maximum likelihood estimation, there are so many resources out there, we recommend just taking a look and seeing which one suits you best. Practically any more technical statistical book will cover these topics in detail.

      @@ -7502,8 +7519,8 @@

    -
    -

    4.13 Exercise

    +
    +

    5.13 Exercise

    Try creating an objective function for a continuous target that uses the mean absolute error, and compare your estimated parameters to the previous results for ordinary least squares. Alternatively, use the ridge regression demonstration and change it to use the lasso approach (this would require altering just one line).

    @@ -7994,12 +8011,12 @@

    diff --git a/docs/generalized_linear_models.html b/docs/generalized_linear_models.html index fedfeaf..66cb3e1 100644 --- a/docs/generalized_linear_models.html +++ b/docs/generalized_linear_models.html @@ -7,7 +7,7 @@ -5  Generalized Linear Models – [Models Demystified]{.smallcaps} +6  Generalized Linear Models – [Models Demystified]{.smallcaps} @@ -1121,66 +1181,65 @@

    -Figure 5.3: Comparison of Probability and Odds Differences +Figure 6.3: Comparison of Probability and Odds Differences

    -

    Odds ratios might be more interpretable to some, but since they are ratios of ratios, people have historically had a hard time with those as well. Furthermore, doubling the odds is not the same as doubling the probability, so we’re left doing some mental calisthenics to interpret them. Odds ratios are often used in academic settings, but in practice, they are not as common as you might think. The take-home message is that we can interpret our result in a linear or nonlinear space, but it can be a bit difficult2. Our own preference is to stick with predicted probabilities, but it’s good to know how to interpret odds ratios.

    +

    Odds ratios might be more interpretable to some, but since they are ratios of ratios, people have historically had a hard time with those as well. As shown in Table 6.1, knowledge of the baseline rate is required for a good understanding of them. Furthermore, doubling the odds is not the same as doubling the probability, so we’re left doing some mental calisthenics to interpret them. Odds ratios are often used in academic settings, but in practice elsewhere, they are not as common. The take-home message is that we can interpret our result in terms of odds (ratios of probabilities), log odds (linear space), or probabilities (nonlinear space), but it can take a little more effort than in our linear regression setting1. Our own preference is to stick with predicted probabilities, but it’s good to have some familiarity with odds ratios, since they are often reported in academic papers and media reports.
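    A quick numeric illustration of why the baseline matters: the same odds ratio of 2 implies very different probability changes depending on where you start. This is just a sketch for intuition, not code from the text.

    p_to_odds = function(p) p / (1 - p)
    odds_to_p = function(odds) odds / (1 + odds)

    odds_to_p(p_to_odds(.50) * 2)    # doubling the odds at p = .50 gives p of about .67
    odds_to_p(p_to_odds(.10) * 2)    # the same odds ratio at p = .10 only moves p to about .18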

    -
    -

    5.3.3 A logistic regression model

    -

    For our model let’s return to our movie review data, but now we are going to use rating_good as our target. Before we get to modeling, see if you can find out the frequency of ‘good’ and ‘bad’ reviews, and the probability of getting a ‘good’ review. We will use word_count and gender as our features.

    - +
    +

    6.3.3 A logistic regression model

    +

    Now let’s get our hands dirty and do a classification model using logistic regression. For our model let’s return to the movie review data, but now we’ll use the binary rating_good (‘good’ vs. ‘bad’) as our target. Before we get to modeling, see if you can find out the frequency of ‘good’ and ‘bad’ reviews, and the probability of getting a ‘good’ review. We examine the relationship of word_count and gender features with the likelihood of getting a good rating.

    - +
    -
    +
    -
    df_reviews = read_csv('https://tinyurl.com/moviereviewsdata')
    -
    -# for the by-hand option later
    -X = df_reviews |> 
    -    select(word_count, male = gender) |> 
    -    mutate(male = ifelse(male == 'male', 1, 0)) |> 
    -    as.matrix()
    -
    -y = df_reviews$rating_good
    +
    df_reviews = read_csv('https://tinyurl.com/moviereviewsdata')
    +
    +# for the by-hand option later
    +X = df_reviews |> 
    +    select(word_count, male = gender) |> 
    +    mutate(male = ifelse(male == 'male', 1, 0)) |> 
    +    as.matrix()
    +
    +y = df_reviews$rating_good
    -
    +
    -
    import pandas as pd
    -import numpy as np
    -
    -df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
    -
    -# for the by-hand option later
    -X = (
    -    df_reviews[['word_count', 'gender']]
    -    .rename(columns = {'gender': 'male'})
    -    .assign(male = np.where(df_reviews[['gender']] == 'male', 1, 0))
    -)
    -
    -y = df_reviews['rating_good']
    +
    import pandas as pd
    +import numpy as np
    +
    +df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
    +
    +# for the by-hand option later
    +X = (
    +    df_reviews[['word_count', 'gender']]
    +    .rename(columns = {'gender': 'male'})
    +    .assign(male = np.where(df_reviews[['gender']] == 'male', 1, 0))
    +)
    +
    +y = df_reviews['rating_good']

    For an initial logistic regression model, we can use standard and common functions in our chosen language. Running a logistic regression model requires the specification of the family, but that’s pretty much the only difference compared to our previous linear regression. The default link function for the binomial distribution is the ‘logit’ link, so we don’t have to specify it explicitly.

    - +
    -
    +
    -
    model_logistic = glm(
    -    rating_good ~ word_count + gender, 
    -    data = df_reviews,
    -    family = binomial
    -)
    -
    -summary(model_logistic)
    +
    model_logistic = glm(
    +    rating_good ~ word_count + gender, 
    +    data = df_reviews,
    +    family = binomial
    +)
    +
    +summary(model_logistic)
    
     Call:
    @@ -1205,18 +1264,18 @@ 

    -
    +
    -
    import statsmodels.api as sm
    -import statsmodels.formula.api as smf
    -
    -model_logistic = smf.glm(
    -    'rating_good ~ word_count + gender', 
    -    data = df_reviews,
    -    family = sm.families.Binomial()
    -).fit()
    -
    -model_logistic.summary()
    +
    import statsmodels.api as sm
    +import statsmodels.formula.api as smf
    +
    +model_logistic = smf.glm(
    +    'rating_good ~ word_count + gender', 
    +    data = df_reviews,
    +    family = sm.families.Binomial()
    +).fit()
    +
    +model_logistic.summary()
    @@ -1253,13 +1312,13 @@

    - + - + @@ -1336,43 +1395,43 @@

    -

    The binomial distribution is a count distribution. The logistic regression model is used to model binary outcomes, but we can use the binomial distribution because the binary setting is a special case of the binomial distribution where the number of trials is 1, and the number of successes can only be 0 or 1. In this case, we can also use the Bernoulli distribution, which does not require the number of trials, since, when the number of trials is 1 the factorial part of Equation 5.1 drops out.

    +

    As noted, the binomial distribution is a count distribution. For a binary outcome, we can only have a 0 or 1 outcome for each ‘trial’, and the ‘size’ or ‘n’ for the binomial distribution is 1. In this case, we can also use the Bernoulli distribution (\(\textrm{Bern}(p)\)). This does not require the number of trials, since, when the number of trials is 1 the factorial part of Equation 6.1 drops out.

    Many coming from a non-statistical background are not aware that their logistic model can actually handle count and/or proportional outcomes.

    -
    -

    5.3.4 Interpretation and visualization

    -

    We need to know what those results mean. The coefficients that we get from our model are in log odds, but as we demonstrated we can exponentiate them to get the odds ratio. Interpreting log odds is difficult, but we can at least get a feeling for them directionally. A log odds of 0 (odds ratio of 1) would indicate no relationship between the feature and target. A positive log odds would indicate that an increase in the feature will increase the log odds of moving from ‘bad’ to ‘good’, whereas a negative log odds would indicate that a decrease in the feature will decrease the log odds of moving from ‘bad’ to ‘good’. On the log odds scale, the coefficients are symmetric as well, such that, e.g., a +1 coefficient denotes a similar increase in the log odds as a -1 coefficient denotes a decrease.

    +
    +

    6.3.4 Interpretation and visualization

    +

    If our modeling goal is not just producing predictions, we need to know what those results mean. The coefficients that we get from our model are in log odds, but as we demonstrated, we can exponentiate them to get the odds ratio. Interpreting log odds is difficult at best, but we can at least get a feeling for them directionally. A log odds coefficient of 0 (odds ratio of 1) would indicate no relationship between the feature and target. A positive coefficient indicates that an increase in the feature will increase the log odds of moving from ‘bad’ to ‘good’, whereas a negative coefficient indicates that an increase in the feature will decrease those log odds. On the log odds scale, the coefficients are symmetric as well, such that, e.g., a +1 coefficient denotes a similar increase in the log odds as a -1 coefficient denotes a decrease.

    -Table 5.2: Raw Coefficients and Odds Ratios for a Logistic Regression +Table 6.2: Raw Coefficients and Odds Ratios for a Logistic Regression
    -
    +
    @@ -1824,23 +1883,23 @@

    - +
    -Figure 5.4: Model Predictions for Word Count Feature +Figure 6.4: Model Predictions for Word Count Feature

    -

    In Figure 5.4, we can see a clear negative relationship between the number of words in a review and the probability of being considered a ‘good’ movie. As we get over 20 words, the predicted probability of being a ‘good’ movie is less than .2. We also see the increase in the chance for a good rating with males vs. females, but our model results suggest this is not a statistically significant difference.

    +

    In Figure 6.4, we can see a clear negative relationship between the number of words in a review and the probability of being considered a ‘good’ movie. As we get over 20 words, the predicted probability of being a ‘good’ movie is less than .2. We also see the increase in the chance for a good rating with males vs. females, but our model results suggest this is not a statistically significant difference.

    @@ -1849,13 +1908,13 @@

    -Figure 5.5: Model Predictions for Gender +Figure 6.5: Model Predictions for Gender
    -

    In the end, whether you think these differences are practically significant is up to you. And you’ll still need to do the standard model exploration to further understand the model (Chapter 3 has lots of detail on this). But this is a good start.

    +

    In the end, whether you think these differences are practically significant is up to you. And you’ll still need to do the standard model exploration to further understand the model (Chapter 4 has lots of detail on this). But this is a good start.

    -
    -

    5.4 Poisson Regression

    +
    +

    6.4 Poisson Regression

    Poisson regression also belongs to the class of generalized linear models, and is used specifically when you have a count variable as your target. After logistic regression for binary outcomes, Poisson regression is probably the next most common type of generalized linear model you will encounter. Unlike continuous targets, a count starts at 0 and can only be a whole number. Often it is naturally skewed as well, so we’d like a model well-suited to this situation. Unlike the binomial, there is no concept of number of trials, just the count of events.

    -
    -

    5.4.1 The Poisson distribution

    -

    The Poisson distribution is very similar to the binomial distribution, because the binomial is also a count distribution, and in fact generalizes the poisson3. The Poisson has a single parameter noted as \(\lambda\), which makes it the simplest model setting we’ve seen so far4. Conceptually, this rate parameter is going to estimate the expected number of events during a time interval. This can be accidents in a year, pieces produced in a day, or hits during the course of a baseball season.

    -

    Let’s see what the particular distribution might look like for different rates. We can see that for low count values, the distribution is skewed to the right, but note how the distribution becomes more symmetric and bell-shaped as the rate increases5. You might also be able to tell that the variance increases along with the mean, and in fact, the variance is equal to the mean for the Poisson distribution.

    +
    +

    6.4.1 The Poisson distribution

    +

    The Poisson distribution is very similar to the binomial distribution, because the binomial is also a count distribution, and in fact generalizes the Poisson2. The Poisson has a single parameter noted as \(\lambda\), which makes it the simplest model setting we’ve seen so far3. Conceptually, this rate parameter estimates the expected number of events during a time interval. This could be accidents in a year, pieces produced in a day, or hits during the course of a baseball season.

    +

    Let’s see what the particular distribution might look like for different rates. We can see that for low count values, the distribution is skewed to the right, but note how the distribution becomes more symmetric and bell-shaped as the rate increases4. You might also be able to tell that the variance increases along with the mean, and in fact, the variance is equal to the mean for the Poisson distribution.
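    You can get a feel for this directly with dpois(); a quick sketch:

    x = 0:20

    # probability mass at a few different rates; higher rates look more symmetric
    round(dpois(x, lambda = 1), 3)
    round(dpois(x, lambda = 5), 3)
    round(dpois(x, lambda = 10), 3)

    # for the Poisson, the mean and variance are governed by the same parameter
    mean(rpois(1e5, lambda = 5))
    var(rpois(1e5, lambda = 5))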

    @@ -1889,7 +1948,7 @@

    -Figure 5.6: Poisson Distributions for Different Rates +Figure 6.6: Poisson Distributions for Different Rates
    @@ -1913,26 +1972,26 @@

    - +
    -
    +
    -
    df_reviews$poss_pronoun = stringr::str_count(
    -    df_reviews$review_text, 
    -    '\\bI\\b|\\bme\\b|\\b[Mm]y\\b|\\bmine\\b|\\bmyself\\b'
    -    )
    -
    -hist(df_reviews$poss_pronoun)
    +
    df_reviews$poss_pronoun = stringr::str_count(
    +    df_reviews$review_text, 
    +    '\\bI\\b|\\bme\\b|\\b[Mm]y\\b|\\bmine\\b|\\bmyself\\b'
    +    )
    +
    +hist(df_reviews$poss_pronoun)
    -
    +
    -
    df_reviews['poss_pronoun'] = (
    -    df_reviews['review_text']
    -    .str.count('\\bI\\b|\\bme\\b|\\b[Mm]y\\b|\\bmine\\b|\\bmyself\\b')
    -    )
    -
    -df_reviews['poss_pronoun'].hist()
    +
    df_reviews['poss_pronoun'] = (
    +    df_reviews['review_text']
    +    .str.count('\\bI\\b|\\bme\\b|\\b[Mm]y\\b|\\bmine\\b|\\bmyself\\b')
    +    )
    +
    +df_reviews['poss_pronoun'].hist()
    @@ -1945,83 +2004,84 @@

    -Figure 5.7: Distribution of the Personal Pronouns Seen Across Reviews +Figure 6.7: Distribution of the Personal Pronouns Seen Across Reviews
    -
    -

    5.4.2 A Poisson regression model

    +
    +

    6.4.2 A Poisson regression model

    -

    Recall that every GLM distribution has a default link function. The Poisson distribution uses a log link function:

    -

    \[y \sim \textrm{Poisson}(\lambda)\]

    +

    Recall that GLM specific distributions have a default link function. The Poisson distribution uses a log link function:

    +

    \[y^* \sim \textrm{Poisson}(\lambda)\]

    \[\text{log}(\lambda) = \alpha + X\beta\]

    -

    Using the log link keeps the outcome non-negative when we use the inverse of it. For model fitting with standard functions, all we have to do is switch the family from ‘binomial’ to ‘poisson’. As the default link is the ‘log’, so we don’t have to specify it explicitly6. We can run the model and get the results as we did before, but we keep our presentation here to just the coefficients.

    +

    Using the log link keeps the outcome non-negative when we use its inverse. For model fitting with standard functions, all we have to do is switch the family from ‘binomial’ to ‘poisson’. Since the default link is the ‘log’, we don’t have to specify it explicitly5.

    +

    In this model we’ll predict the number of personal pronouns used in a review. We’ll use word count and gender as our features like we did with the logistic model.

    - +
    -
    +
    -
    model_poisson = glm(
    -    poss_pronoun ~ word_count + gender,
    -    data = df_reviews,
    -    family = poisson
    -)
    -
    -summary(model_poisson)
    -
    -exp(model_poisson$coefficients)
    +
    model_poisson = glm(
    +    poss_pronoun ~ word_count + gender,
    +    data = df_reviews,
    +    family = poisson
    +)
    +
    +summary(model_poisson)
    +
    +exp(model_poisson$coefficients)
    -
    +
    -
    model_poisson = smf.glm(
    -    formula = 'poss_pronoun ~ word_count + gender',
    -    data = df_reviews,
    -    family = sm.families.Poisson()
    -).fit()
    -
    -model_poisson.summary()
    -
    -np.exp(model_poisson.params)
    +
    model_poisson = smf.glm(
    +    formula = 'poss_pronoun ~ word_count + gender',
    +    data = df_reviews,
    +    family = sm.families.Poisson()
    +).fit()
    +
    +model_poisson.summary()
    +
    +np.exp(model_poisson.params)
    -
    -

    5.4.3 Interpretation and visualization

    +
    +

    6.4.3 Interpretation and visualization

    As with logistic regression, we can exponentiate the coefficients to get what’s now referred to as the rate ratio. This is the multiplicative change in the expected rate of the outcome for a one unit increase in the feature.

    -Table 5.3: Rate Ratios for a Poisson Regression +Table 6.3: Rate Ratios for a Poisson Regression
    -
    +
    @@ -2483,7 +2543,7 @@

    -Figure 5.8: Poisson Model Predictions for Word Count Feature +Figure 6.8: Poisson Model Predictions for Word Count Feature
    @@ -2502,109 +2562,109 @@

    -

    Did you notice that both our effects for word count in the logistic (Figure 5.4) and Poisson (Figure 5.8) models were not exactly the straightest of lines? Once we’re on the probability and count scales, we’re not going to see the same linear relationships that we might expect from a basic linear model due to the transformation. If we plot the effect on the log-odds or log-count scale, we’re back to straight lines. This is a first taste in how the linear model can be used to get at nonlinear relationships, which are of the focus of Chapter 6.

    +

    You’ll note again that our effects for word count in the logistic (Figure 6.4) and Poisson (Figure 6.8) models were not exactly the straightest of lines. Once we’re on the probability and count scales, we’re not going to see the same linear relationships that we might expect from a linear regression model, due to the transformation. If we plot the effect on the log-odds or log-count scale, we’re back to straight lines, as demonstrated with the logistic model. This is a first taste of how the linear model can be used to get at nonlinear relationships depending on the scale we focus on. More explicit nonlinear relationships are the focus of Chapter 7.

    -
    -

    5.5 How Did We Get Here?

    -

    If we really want to demystify the modeling process, let’s create our own function to estimate the coefficients. We can use maximum likelihood estimation to estimate the parameters of our model, which is the approach used by standard package functions. Feel free to skip this part if you only wanted the basics, but for even more information on maximum likelihood estimation, see Section 4.7 where we take a deeper dive into the topic and with a similar function. The following code is a simple version of what goes on behind the scenes with ‘glm’ type functions.

    +
    +

    6.5 How Did We Get Here?

    +

    If we really want to demystify the modeling process, let’s create our own function to estimate the coefficients. We can use maximum likelihood estimation to estimate the parameters of our model, which is the approach used by standard package functions. Feel free to skip this part if you only wanted the basics, but for even more information on maximum likelihood estimation, see Section 5.7, where we take a deeper dive into the topic with a similar function. The following code is a simple version of what goes on behind the scenes with ‘glm’ type functions.

    - +
    -
    +
    -
    glm_simple = function(par, X, y, family = 'binomial') {
    -    # add an column for the intercept
    -    X = cbind(1, X)
    -
    -    # Calculate the linear predictor
    -    mu = X %*% par # %*% is matrix multiplication
    -
    -    # get the likelihood for the binomial or poisson distribution
    -    if (family == 'binomial') {
    -        # Convert to a probability ('logit' link/inverse)
    -        p = 1 / (1 + exp(-mu))
    -        L = dbinom(y, size = 1, prob = p, log = TRUE)
    -    }
    -    else if (family == 'poisson') {
    -        # Convert to a count ('log' link/inverse)
    -        p = exp(mu)
    -        L = dpois(y, lambda = p, log = TRUE)
    -    }
    -
    -    # return the negative sum of the log-likelihood (for minimization)
    -    value = -sum(L) 
    -
    -    return(value)
    -}
    +
    glm_simple = function(par, X, y, family = 'binomial') {
+    # add a column for the intercept
    +    X = cbind(1, X)
    +
    +    # Calculate the linear predictor
    +    mu = X %*% par # %*% is matrix multiplication
    +
    +    # get the likelihood for the binomial or poisson distribution
    +    if (family == 'binomial') {
    +        # Convert to a probability ('logit' link/inverse)
    +        p = 1 / (1 + exp(-mu))
    +        L = dbinom(y, size = 1, prob = p, log = TRUE)
    +    }
    +    else if (family == 'poisson') {
    +        # Convert to a count ('log' link/inverse)
    +        p = exp(mu)
    +        L = dpois(y, lambda = p, log = TRUE)
    +    }
    +
    +    # return the negative sum of the log-likelihood (for minimization)
    +    value = -sum(L) 
    +
    +    return(value)
    +}
    -
    +
    -
    from scipy.stats import poisson, binom
    -
    -def glm_simple(par, X, y, family = 'binomial'):
    -    # add an column for the intercept
    -    X = np.column_stack((np.ones(X.shape[0]), X))
    -
    -    # Calculate the linear predictor
    -    mu = X @ par  # @ is matrix multiplication
    -    
    -    # get the likelihood for the binomial or poisson distribution
    -    if family == 'binomial':
    -        p = 1 / (1 + np.exp(-mu))
    -        L = binom.logpmf(y, 1, p)
    -    elif family == 'poisson':
    -        lambda_ = np.exp(mu)
    -        L = poisson.logpmf(y, lambda_)
    -    
    -    # return the negative sum of the log-likelihood (for minimization)
    -    value = -np.sum(L)
    -    
    -    return value
    +
    from scipy.stats import poisson, binom
    +
    +def glm_simple(par, X, y, family = 'binomial'):
+    # add a column for the intercept
    +    X = np.column_stack((np.ones(X.shape[0]), X))
    +
    +    # Calculate the linear predictor
    +    mu = X @ par  # @ is matrix multiplication
    +    
    +    # get the likelihood for the binomial or poisson distribution
    +    if family == 'binomial':
    +        p = 1 / (1 + np.exp(-mu))
    +        L = binom.logpmf(y, 1, p)
    +    elif family == 'poisson':
    +        lambda_ = np.exp(mu)
    +        L = poisson.logpmf(y, lambda_)
    +    
    +    # return the negative sum of the log-likelihood (for minimization)
    +    value = -np.sum(L)
    +    
    +    return value

    Now that we have our objective function, we can fit our models, starting with the logistic model. We will use the optim function in R and the minimize function in Python.

    - +
    -
    +
    -
    init = rep(0, ncol(X) + 1)
    -
    -names(init) = c('intercept', 'b1', 'b2')
    -
    -fit_logistic = optim(
    -    par = init,
    -    fn = glm_simple,
    -    X = X,
    -    y = y,
    -    control = list(reltol = 1e-8)
    -)
    -
    -fit_logistic$par
    +
    init = rep(0, ncol(X) + 1)
    +
    +names(init) = c('intercept', 'b1', 'b2')
    +
    +fit_logistic = optim(
    +    par = init,
    +    fn = glm_simple,
    +    X = X,
    +    y = y,
    +    control = list(reltol = 1e-8)
    +)
    +
    +fit_logistic$par
    -
    +
    -
    import numpy as np
    -from scipy.optimize import minimize
    -
    -init = np.zeros(X.shape[1] + 1)
    -
    -fit_logistic = minimize(
    -    fun = glm_simple,
    -    x0 = init,
    -    args = (X, y),
    -    method = 'BFGS'
    -)
    -
    -fit_logistic['x']
    +
    import numpy as np
    +from scipy.optimize import minimize
    +
    +init = np.zeros(X.shape[1] + 1)
    +
    +fit_logistic = minimize(
    +    fun = glm_simple,
    +    x0 = init,
    +    args = (X, y),
    +    method = 'BFGS'
    +)
    +
    +fit_logistic['x']
    @@ -2614,30 +2674,30 @@

    -Table 5.4: Comparison of Coefficients +Table 6.4: Comparison of Coefficients
    -
    +
    @@ -3091,34 +3151,34 @@

    Similarly, we can also use our function to estimate the coefficients for the Poisson model. Just like the GLM function we might normally use, we can change the family option to specify the distribution we want to use.

    - +
    -
    +
    -
    fit_poisson = optim(
    -    par = c(0, 0, 0),
    -    fn = glm_simple,
    -    X = X,
    -    y = df_reviews$poss_pronoun,
    -    family = 'poisson'
    -)
    -
    -fit_poisson$par
    -
    -
    -
    +
    fit_poisson = optim(
    +    par = c(0, 0, 0),
    +    fn = glm_simple,
    +    X = X,
    +    y = df_reviews$poss_pronoun,
    +    family = 'poisson'
    +)
    +
    +fit_poisson$par
    +
    +
    +
    -
    fit_poisson = minimize(
    -    fun = glm_simple,
    -    x0 = init,
    -    args = (
    -        X, 
    -        df_reviews['poss_pronoun'], 
    -        'poisson'
    -    )
    -)
    -
    -fit_poisson['x']
    +
    fit_poisson = minimize(
    +    fun = glm_simple,
    +    x0 = init,
    +    args = (
    +        X, 
    +        df_reviews['poss_pronoun'], 
    +        'poisson'
    +    )
    +)
    +
    +fit_poisson['x']
    @@ -3128,30 +3188,30 @@

    -Table 5.5: Comparison of Coefficients +Table 6.5: Comparison of Coefficients
    -
    +
    @@ -3605,9 +3665,9 @@

    This goes to show that just a little knowledge of the underlying mechanics can go a long way in understanding how many models work.

    -
    -

    5.6 Wrapping Up

    -

    So at this point you have standard linear regression with the normal distribution for continuous targets, logistic regression for binary/proportional ones via the binomial distribution, and Poisson regression for counts. These models combine to provide much of what you need for starting out in the modeling world, and all serve well as baseline models for comparison when using more complex methods (Section 8.4). However, what we’ve seen is just a tiny slice of the potential universe of distributions that you could use. Here is a brief list of some that are still in the GLM family proper and others that can be useful7:

    +
    +

    6.6 Wrapping Up

    +

    So at this point you have standard linear regression with the normal distribution for continuous targets, logistic regression for binary/proportional ones via the binomial distribution, and Poisson regression for counts. These models combine to provide much of what you need for starting out in the modeling world, and all serve well as baseline models for comparison when using more complex methods (Section 8.4). However, what we’ve seen is just a tiny slice of the potential universe of distributions that you could use. Here is a brief list of some that are still in the GLM family proper and others that can be useful6:

    Other Core GLM (available in standard functions):

    • Gamma: For continuous, positive targets that are skewed.
    • @@ -3616,7 +3676,7 @@

      Others (some fairly common):

      • Beta: For continuous targets that are bounded between 0 and 1.
      • -
      • Log-Normal: For continuous targets that are skewed. Essentially what you get with linear regression and logging the target8.
      • +
      • Log-Normal: For continuous targets that are skewed. Essentially what you get with linear regression and logging the target7.
      • Tweedie: Generalizes several core GLM family distributions.

      In the ballpark:

      @@ -3628,16 +3688,16 @@

      Quasi *: For example, quasipoisson. These ‘quasi-likelihoods’ served a need at one point that is now better served by other methods.

    You’ll typically need separate packages to fit some of these, but most of the tools keep to a similar functional approach. The main thing is to know that certain distributions might fit your data a bit better than others, and that you can use the same basic framework and mindset to fit them, and maybe get a little closer to the answer you seek about your data!

    -
    -

    5.6.1 The common thread

    +
    +

    6.6.1 The common thread

    GLMs extend your standard linear model as a powerful tool for modeling a wide range of data types. They are a great way to get started with more complex models, and even allow us to use linear models in a not-so-linear way. It’s best to think of GLMs more broadly than the strict statistical definition, and to consider many models like ordinal regression, ranking models, survival analysis, and more as part of the same extension.

    -
    -

    5.6.2 Choose your own adventure

    -

    At this point you have a pretty good sense of what linear models have to offer, but there’s even more! You can start to look at more complex models that build on these ideas, like mixed models, generalized additive models and more in Chapter 6. You can also feel confident heading into the world of machine learning (Chapter 7), where you’ll find additional ways to think about your modeling approach.

    +
    +

    6.6.2 Choose your own adventure

    +

    At this point you have a pretty good sense of what linear models have to offer, but there’s even more! You can start to look at more complex models that build on these ideas, like mixed models, generalized additive models and more in Chapter 7. You can also feel confident heading into the world of machine learning (Chapter 8), where you’ll find additional ways to think about your modeling approach.

    -
    -

    5.6.3 Additional resources

    +
    +

    6.6.3 Additional resources

    If you are itching for a textbook, there is no shortage of them out there and you can essentially take your pick. Most purely statistical treatments are going to be a bit dated at this point, but they are still accurate and may well be worth your time.

    • Generalized Linear Models (McCullagh (2019)) is a classic text on the subject, though it is a bit dense and not for the faint of heart. Nelder and Wedderburn (1972) is a very early treatment.
    • @@ -3652,16 +3712,13 @@

    -
    -

    5.7 Exercise

    +
    +

    6.7 Exercise

    Use the fish data (Section A.4) to conduct a Poisson regression and see how well you can predict the number of fish caught based on the other variables like how many people were on the trip, how many children, whether live bait was used etc.

    If you would prefer to try a logistic regression, change the count to just 0 and 1 for whether any fish were caught, and see how well you can predict that.

    +
    +UCLA Advanced Research Computing. 2023. FAQ: What Are Pseudo R-Squareds?” https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/. +

      -
    1. The y in the formula is more properly expressed as \(y | X, \theta\), where X is the matrix of features and \(\theta\) the parameters estimated by the model. We’ll keep it simple here.↩︎

    2. For more on interpreting odds ratios, see this article.↩︎

    3. If your binomial setting has a very large number of trials relative to the number of successes, which amounts to very small proportions \(p\), you would find that the binomial distribution would converge to the Poisson distribution.↩︎

    4. Neither the binomial nor the Poisson have a variance parameter to estimate, as the variance is determined by the mean. This is in contrast to the normal distribution, where the variance is an estimated parameter. For the Poisson, the variance is equal to the mean, and for the binomial, the variance is equal to \(n*p*(1-p)\). The Poisson assumption of equal variance rarely holds up in practice, so people often use the negative binomial distribution instead.↩︎

    5. From a modeling perspective, for large mean counts you can just go back to using the normal distribution if you prefer, without losing much predictively and gaining a bit in interpretability.↩︎

    6. It is not uncommon in many disciplines to use different link functions for logistic models, but the log link is always used for Poisson models.↩︎

    7. There is not strict agreement about what qualifies for being in the GLM family.↩︎

    8. But there is a variance issue to consider.↩︎
Feature                  Target
independent variable     dependent variable
predictor variable       response
explanatory variable     outcome
covariate                label
x                        y
input                    output
right-hand side          left-hand side

Some of these terms actually suggest a particular type of relationship (e.g., a causal relationship, an experimental setting), but here we'll typically avoid those terms if we can, since those connotations won't apply. In the end, you may find us using any of these words to describe the relationships of interest so that you are comfortable with the terminology, but we'll mostly stick with features and targets. In our opinion, these terms carry the fewest hidden assumptions and implications, and simply imply 'features of the data' and the 'target' we're trying to explain or predict.


    It is the chief characteristic of data science that it works. ― Isaac Asimov (paraphrased)


Now that you're here, it's time to dive in! We'll start things off by covering the building block of all modeling, and a solid understanding here will provide you the basis for just about anything that comes after, no matter how complex it gets. The linear model is our starting point. At first glance, it may seem like a very simple model, and it is, relatively speaking. But it's also quite powerful and flexible, able to take in different types of inputs, handle nonlinear relationships, temporal and spatial relations, clustering, and more. Linear models have a long history, with even the formal and scientific idea behind correlation and linear regression being well over a century old1! And in that time, the linear model has been far and away the most used model out there. But before we start talking about the linear model, we need to talk about what a model is in general.


    3.1 Key Ideas


    To get us started, we can pose a few concepts key to understanding linear models. We’ll cover each of these as we go along.

• The linear model is a foundation on which one can build an understanding for all models.
• Prediction is fundamental to assessing and using a model.
• Interpretation: what does a model tell us?
  • Prediction underlies all interpretation
  • We can interpret a model at the feature level and as a whole

    As we go along and cover these concepts, be sure that you feel you have the ‘gist’ of what we’re talking about. Almost everything that goes beyond linear models builds on what’s introduced here, so it’s important to have a firm grasp before climbing to new heights.


    3.1.1 Why this matters


    The basic linear model and how it comes about underpins so many other models, from the simplest t-test to the most complex neural network. There are many important aspects of it, but it provides a powerful foundation, and one that you’ll see in many different contexts. It’s also a model that is relatively easy to understand, so it’s a great place to start!

    2.3.2 Expressing relationships


    As noted, a model is a way of expressing a relationship between a set of features and a target, and one way of thinking about this is in terms of inputs and outputs. But how can we go from input to output?


    Well, first off, we assume that the features and target are correlated, that there is some relationship between the feature x and target y. The output of a model will correspond to the target if they are correlated, and more closely match it with stronger correlation. If so, then we can ultimately use the features to predict the target. In the simplest setting, a correlation implies a relationship where x and y typically move up and down together (left plot) or they move in opposite directions where x goes up and y goes down (right plot).

Figure 2.1: Correlation

    In addition, the simplest correlation suggests a linear relationship, one that is adequately captured by a straight line. There are many types of correlation metrics, but the most common one, the Pearson correlation, is explicitly a measure of the linear relationship between two variables. It’s expressed as a number between -1 and 1, where 0 means there is no linear relationship. As we move closer to a 1.0 correlation value, we would see a tighter scatterplot like the one on the left, until it became a straight line. The same happens for the negative relationship as we get closer to a value of -1, like the plot on the right. If we have only one feature and target, the Pearson correlation reflects the exact result of the linear model we’d conduct in a more complicated fashion. But even with multiple features, we still stick to this notion of correlation to help us understand how the features account for the target’s variability, or why it behaves the way it does.
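To make that concrete, here is a quick sketch (not from the text) with two made-up variables, showing how the correlation and the slope from a single-feature linear model tell the same story.

set.seed(1)
x = rnorm(50)
y = 2 * x + rnorm(50)       # a made-up, positively correlated pair

cor(x, y)                   # a strong positive (linear) relationship
coef(lm(y ~ x))['x']        # the slope: cor(x, y) * sd(y) / sd(x)
cor(x, y)^2                 # the squared correlation (more on this measure later)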


    3.1.2 Good to know


    We’re just starting out here, but we’re kind of assuming you’ve had some exposure to the idea of statistical or other models, even if only from an interpretation standpoint. We assume you have an understanding of basic stats like central tendency (e.g., a mean or median), variance, and correlation, stuff like that. And if you intend to get into the code examples, you’ll need a basic familiarity with Python or R.


    3.2 THE Linear Model

    The linear model is perhaps the simplest functional model we can use to express a relationship between features and targets. And because of that, it’s possibly still the most common model used in practice, and it is the basis for many types of other models. So why don’t we do one now?

    The following dataset has individual movie reviews containing the movie rating (1-5 stars scale), along with features pertaining to the review (e.g., word count), those that regard the reviewer (e.g., age) and features about the movie (e.g., genre, release year).

    For our first linear model, we’ll keep things simple. Let’s predict the rating from the length of the review in terms of word count. We’ll use the lm() function in R and the ols() function in Python2 to fit the model. Both functions take a formula as the first argument, which is a way of expressing the relationship between the features and target. The formula is expressed as y ~ x1 + x2 + ..., where y is the target name and x* are the feature names. We also need to specify what the data object is, typically a data frame.

# R
library(tidyverse)   # assumed loaded throughout; provides read_csv() and other helpers

# all data found on github repo
df_reviews = read_csv('https://tinyurl.com/moviereviewsdata')

model_lr_rating = lm(rating ~ word_count, data = df_reviews)

summary(model_lr_rating)
    

# Python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# all data found on github repo
df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')

model_lr_rating = smf.ols('rating ~ word_count', data = df_reviews).fit()

model_lr_rating.summary(slim = True)

    Getting more into the details, we’ll start with the fact that the linear model posits a linear combination of the features. This is an important concept to understand, but really, a linear combination is just a sum of the features, each of which has been multiplied by some specific value. That value is often called a coefficient, or possibly weight, depending on the context. The linear model is expressed as (math incoming!):

\[
y = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n
\tag{3.1}\]

    • \(y\) is the target.
    • \(x_1, x_2, ... x_n\) are the features.
• \(b_0\) is the intercept, and \(b_1, b_2, ... b_n\) are the coefficients, or weights, for each feature.

      \[ y = x_1 + x_2 + ... + x_n \]


      In this case, the function is just a sum, something so simple we do it all the time. In the linear model sense though, we’re actually saying a bit more. Another way to understand that equation is that y is a function of x. We don’t show any coefficients here, i.e. the bs in our initial equation (Equation 3.1), but technically it’s as if each coefficient was equal to a value of 1. In other words, for this simple linear model, we’re saying that each feature contributes in an identical fashion to the target.

In practice, features will never contribute in the same ways, because they correlate with the target differently, or are on different scales. So if we want to relate some feature, \(x_1\), and some other feature, \(x_2\), to target \(y\), we probably would not assume that they both contribute in the same way from the beginning. We might give relatively more weight to \(x_1\) than \(x_2\). In the linear model, we express this by multiplying each feature by a different coefficient or weight, so the linear model is really just a weighted sum of the features. In other words, each feature contributes to the target in proportion to its coefficient: if we have a feature \(x_1\) and a coefficient \(b_1\), then the contribution of \(x_1\) to the target is \(b_1 \cdot x_1\), the contribution of \(x_2\) is \(b_2 \cdot x_2\), and so on.
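As a toy illustration (not from the text), here is that weighted sum computed by hand with made-up values.

x = c(5, 2)          # two made-up feature values
b = c(0.5, -1.2)     # their made-up coefficients (weights)
b0 = 3               # a made-up intercept
b0 + sum(b * x)      # the linear combination: 3 + 0.5*5 + (-1.2)*2 = 3.1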

      For our specific model, here is the mathematical representation:

\[
\textrm{rating} = b_0 + b_1 \cdot \textrm{word\_count}
\]

• Our target is the movie's rating by a reviewer, and the feature is the word count.
• We map the feature to the target via the linear model, which provides an initial understanding of how the feature is related to the target. In this case, we start with a baseline of 3.49. This value makes sense only in the case of a rating with no review, and is what we would guess if the word count was 0. But we know there are reviews for every observation, so it's not very meaningful as is. We'll talk about ways to get a more meaningful intercept later, but for now, that is our starting point. Moving on, if we add a single word to the review, we expect the rating to change by -0.04 stars. So if we had a review that was 10 words long, i.e., the mean word count, we would predict a rating of 3.49 + 10*-0.04 = 3.1 stars.

    3.2.1 The linear model visualized


    We can also express the linear model as a graph, which can be a very useful way to think about models in a visual fashion. As we come across models, a visualization like this can help us see both how different models relate to one another and are actually very similar to one another. In the following, we have three features predicting a single target, so we have three ‘nodes’ for the features, and a single node for the target. The feature nodes are combined into a linear combination to produce the output of the model. In the context of linear regression, the output is often called the linear predictor. Each ‘edge’ signifies the connection of a feature to the output, and is labeled with the coefficient or weight. The connection between the output and the target is direct, without any additional change. We’ll return to this depiction a little bit later (Section 3.7), but for our standard linear model, we’re all set.

Figure 3.1: A linear regression as a graphical model

    So at this point you have the basics of what a linear model is and how it works, and a couple ways to think about it, whether through programming, math, or just visually. But there is a lot more to it than that. Just getting the model is easy enough, but we need to be able to use it and understand the details better, so we’ll get into that now!


    3.3 What Do We Do with a Model?

    Once we have a working model, there are two primary ways we can use it. One way to use a model is to help us understand the relationships between the features and our outcome of interest. In this way, the focus can be said to be on explanation, or interpreting the model results. The other way to use a model is to make estimates about the target for specific observations, often ones we haven’t seen in our data. In this way the focus is on prediction. In practice, we often do both, but the focus is usually on one or the other. We’ll cover both in detail eventually, but let’s start with prediction.


    3.3.1 Prediction


    It may not seem like much at first, but a model is of little use if it can’t be used to make predictions about what we can expect in the world around us. Once our model has been fit to the data, we can obtain our predictions by plugging in values for the features that we are interested in, and, using the corresponding weights and other parameters that have been estimated, come to a guess about a specific observation. Let’s go back to our results, shown in the following table.

Table 3.1: Linear Model Output

When we're talking about the predictions (or outputs) for a linear model, we usually see it expressed mathematically as follows:

\[
\hat{y} = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n
\tag{3.2}\]


    What is \(\hat{y}\)? The hat over the \(y\) just means that it’s a predicted value of the model, i.e. the output, rather than the target value we actually observe in the data. Our first equations that just used \(y\) implicitly suggested that we would get a perfect rating value given the model, but that’s not the case. We can only get an estimate. The \(\hat{y}\) is also the linear predictor in our graphical version (Figure 3.1), which makes clear it is not the actual target, but a combination of the features that is related to the target.


    To make our first equation (Equation 3.1) accurately reflect the relationship between the target and our features, we need to add what is usually referred to as an error term, \(\epsilon\), to account for the fact that our predictions will not be perfect3. So the full linear (regression) model is:

\[
y = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n + \epsilon
\tag{3.3}\]


    The error term is a random variable that represents the difference between the actual value and the predicted value, which comes from the weighted combination of features. We can’t know what the error term is, but we can estimate parameters associated with it, just like we can the coefficients. We’ll talk more about that in the section on estimation (Chapter 5).


    Another way to write the model formally is:


\[y \mid X, \beta, \sigma \sim \textrm{Normal}(\mu, \sigma^2)\] \[\mu = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n \tag{3.4}\]


    or


    \[y|X,\beta, \sigma = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n + \epsilon\] \[\epsilon \sim Normal(0, \sigma^2) \tag{3.5}\]


This makes explicit that the target is assumed to be conditionally normally distributed, with a mean, \(\mu\), based on the linear combination of the features, and a variance of \(\sigma^2\). What do we mean by conditionally? It means that the target is normally distributed given the features in question and the estimated model parameters. This is the standard assumption for linear regression, and it's a good one to start with, but it's not our only option. We'll talk more about this in the section on assumptions (Section 3.6), and see what we might do differently in Chapter 6. We will also see that we can estimate the model parameters without any explicit reference to a probability distribution (Chapter 5).
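One way to internalize this formal statement is to simulate from it. The following is a small sketch (not from the text) with made-up coefficient and \(\sigma\) values; fitting a linear model to the simulated data roughly recovers them.

set.seed(42)
n  = 1000
x  = rnorm(n)
mu = 3.5 - 0.5 * x                  # the linear predictor (made-up coefficients)
y  = rnorm(n, mean = mu, sd = 0.6)  # conditionally normal target with sigma = 0.6

coef(lm(y ~ x))                     # roughly 3.5 and -0.5
summary(lm(y ~ x))$sigma            # roughly 0.6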


    3.3.2 What kinds of predictions can we get?

What predictions we get depends on the type of model we are using. For the linear model we have at present, we can get predictions for the target, which is a continuous variable. Very commonly, we also can get predictions for a categorical target, such as whether the rating is 'good' or 'bad'. This simple breakdown pretty much covers everything: we are typically predicting a continuous numeric variable or a categorical one, or several of them, like multiple continuous variables, a target with multiple categories, or sequences of categories (e.g., words). In our case, we can get predictions for the rating, which is a number between 1 and 5. Had our target been a binary good vs. bad rating, our predictions would still be numeric in most cases, or at least amenable to such, and usually expressed as a probability between 0 and 1, say, for the 'good' category. Higher probabilities would mean we'd more likely predict the movie is good. We then would convert that probability to a class of good or bad depending on a chosen probability cutoff. We'll talk about how to get predictions for categorical targets later4.
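As a tiny preview of that probability-to-class step (not from the text), with some made-up predicted probabilities and a 0.5 cutoff:

probs = c(0.20, 0.55, 0.80, 0.40)       # made-up predicted probabilities of 'good'
ifelse(probs >= 0.5, 'good', 'bad')     # the predicted classes at a 0.5 cutoff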

    We previously saw a prediction for a single observation where the word count was 10 words, but we can also get predictions for multiple observations at once. In fact, we can get predictions for all observations in our dataset. Besides that, we can also get predictions for observations that we don’t even have data for! Fun! The following shows how we can get predictions for all data, and for a single observation with a word count of 5.

# R
all_predictions = predict(model_lr_rating)

df_prediction = tibble(word_count = 5)
single_prediction = predict(model_lr_rating, newdata = df_prediction)

# Python
all_predictions = model_lr_rating.predict()

df_prediction = pd.DataFrame({'word_count': [5]})
single_prediction = model_lr_rating.predict(df_prediction)

Figure 3.2: Predicted vs. Observed Ratings
    @@ -1847,30 +1324,30 @@

    -Table 2.3: Predictions for Specific Observations +Table 3.2: Predictions for Specific Observations
    -
    +
    @@ -2322,8 +1799,8 @@

    -

    2.5.3 Prediction error

    +
    +

    3.3.3 Prediction error

    As we have seen, predictions are not perfect, and an essential part of the modeling endeavor is to better understand these errors and why they occur. In addition, error assessment is the fundamental way in which we assess a model’s performance, and, by extension, compare that performance to other models. In general, prediction error is the difference between the actual value and the predicted value or some function of it, and in statistical models, is also often called the residual. We can look at these individually, or we can look at them in aggregate with a single metric.

    Let’s start with looking at the residuals visually. Often the modeling package you use will have this as a default plotting method when doing a standard linear regression, so it’s wise to take advantage of it. We plot both the distribution of raw error scores and the cumulative distribution of absolute prediction error. Here we see a couple things. First, the distribution is roughly normal, which is a good thing, since statistical linear regression assumes our error is normally distributed, and the prediction error serves as an estimate of that. Second, we see that the mean of the errors is zero, which is a consequence of linear regression, and the reason we look at other metrics when assessing model performance. We can also see that most of our predictions are within ±1 star rating.
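If your package doesn't provide such a plot, the raw ingredients are easy to get yourself. Here is a minimal sketch (not from the text) using the model from above.

errors = df_reviews$rating - predict(model_lr_rating)  # same as resid(model_lr_rating)

hist(errors)        # roughly normal and centered at zero
mean(errors)        # essentially zero for linear regression
mean(abs(errors))   # the typical size of a miss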


Figure 3.3: Distribution of Prediction Errors


    Of more practical concern is that we don’t see extreme values or clustering, which might indicate a failure on the part of the model to pick up certain segments of the data. It can still be a good idea to look at the extremes just in case we can pick up on some aspect of the data that we could potentially incorporate into the model. So looking at our worst prediction in absolute terms, we see the observation has a typical word count, and so our simple model will just predict a fairly typical rating. But the actual rating is 1, which is 2.1 away from our prediction, a very noticeable difference. Further data inspection may be required to figure out why this came about, and this is a process you should always be prepared to do when you’re working with models.

Table 3.3: Worst Prediction


    3.3.4 Prediction Uncertainty

    We can also look at the uncertainty of our predictions, which is a measure of how much we expect our predictions to vary. This is often expressed as an interval range of values that we expect our prediction to fall within, with a certain level of confidence. But! There are actually two types of intervals we can get, one is really about the mean prediction, or expected value we would get from the model at that observation. This is usually called a confidence interval. The other type of interval is based on the model’s ability to predict new data, and is often called a prediction interval. This interval is about the actual prediction we would get from the model for any value, whether it was data we had seen before or not. Because of this, the prediction interval is always wider than the confidence interval, and it’s the one we usually want to use when we’re making predictions about new data.

    Here is how we can obtain these from our model.

# R
prediction_CI = predict(
    model_lr_rating,
    newdata = df_prediction,
    se.fit = TRUE,
    interval = "confidence"
)

prediction_PI = predict(
    model_lr_rating,
    newdata = df_prediction,
    se.fit = TRUE,
    interval = "prediction"
)

pred_intervals = bind_rows(
    as_tibble(prediction_CI$fit),
    as_tibble(prediction_PI$fit),
) |> mutate(
    interval = c('confidence', 'prediction'),
    type = c('mean', 'observation')
)

pred_intervals

# Python
pred_intervals = (
    model_lr_rating
    .get_prediction(df_prediction)
    .summary_frame(alpha = 0.05)
)

pd.DataFrame(pred_intervals)

Table 3.4: Prediction Intervals for Specific Observations

As expected, our prediction interval is wider than our confidence interval, and we can see that the prediction interval is quite wide, from a rating of 2.1 to 4.4. This is a consequence of the fact that we have a lot of uncertainty in our predictions for new observations, and we can't expect to get a very precise prediction from our model with only one feature. This is a common issue with many models, and one that having a better model can help remedy7.


    So at this point you have the gist of prediction, prediction error, and uncertainty in a prediction, but there is still more to it! We’ll come back to global assessments of model error very shortly, and even more detail can be found in Chapter 4 where we dive deeper into our models, and Chapter 5, where we see how to estimate the parameters of our model by picking those that will reduce the prediction error the most. For now though, let’s move on to the other main use of models, which is to help us understand the relationships between the features and the target, or explanation.



    3.4 How Do We Interpret the Model?

    When it comes to interpreting the results of our model, there are a lot of tools at our disposal, though many of the tools we can ultimately use will depend on the specifics of the model we have employed. In general though, we can group our approach to understanding results at the feature level and the model level. A feature level understanding regards the relationship between a single feature and the target. Beyond that, we also attempt comparisons of feature contributions to prediction, i.e., relative importance. Model level interpretation is focused on assessments of how well the model ‘fits’ the data, or more generally, predictive performance. We’ll start with the feature level, and then move on to the model level.



    3.4.1 Feature level interpretation

    As mentioned, at the feature level, we are primarily concerned with the relationship between a single feature and the target. More specifically, we are interested in the direction and magnitude of the relationship, but in general, it all boils down to how a feature induces change in the target. For numeric features, we are curious about the change in the target given some amount of change in the feature values. It’s conceptually the same for categorical features, but often we like to express the change in terms of group mean differences or something similar, since the order of categories is not usually meaningful. An important aspect of feature level interpretation is the specific predictions we can get by holding the data at key feature values.

    Let’s start with the basics by looking again at our coefficient table from the model output.

Table 3.5: Linear Regression Coefficients

Table 3.6: Linear Regression Statistical Output

Table 3.7: Linear Regression Confidence Intervals



    3.4.2 Model level interpretation

    Thus far, we’ve focused on interpretation at the feature level. But knowing the interpretation of a feature doesn’t do you much good if the model itself is poor! In that case, we also need to assess the model as a whole, and as with the feature level, we can go about this in a few ways. Before getting too carried away with asking whether your model is any good or not, you always need to ask yourself relative to what? Many models claim top performance under various circumstances, but which are statistically indistinguishable from many other models. So we need to be careful about how we assess our model, and what we compare it to.




    When we looked at the models previously Figure 3.2, we examined how well the predictions and target line up, and that gives us an initial feel for how well the model fits the data. Most model-level interpretation involves assessing and comparing model fit and variations on this theme. Here we show how easy it is to obtain such a plot.

# R
predictions = predict(model_lr_rating)
y = df_reviews$rating

ggplot(
    data = data.frame(y = y, predictions = predictions),
    aes(x = y, y = predictions)
) +
  geom_point() +
  labs(x = "Observed", y = "Predicted")

# Python
import matplotlib.pyplot as plt

predictions = model_lr_rating.predict()
y = df_reviews.rating

plt.scatter(y, predictions)


    3.4.2.1 Model Metrics

We can also get an overall assessment of the prediction error from a single metric. In the case of the linear model we've been looking at, we can express this in a single metric as the sum or mean of our squared errors, the latter of which is a very commonly used modeling metric, MSE or mean squared error, or its square root, RMSE or root mean squared error12. We'll talk more about this and similar metrics in other chapters, but we can take a look at the RMSE for our model now.

    If we look back at our results, we can see this expressed as the part of the output or as an attribute of the model13. The RMSE is more interpretable, as it gives us a sense that our typical errors bounce around by about 0.59. Given that the rating is on a 1-5 scale, this maybe isn’t bad, but we could definitely hope to do better than get within roughly half a point on this scale. We’ll talk about ways to improve this later.

# R
# summary(model_lr_rating) # 'Residual standard error' is approx RMSE
summary(model_lr_rating)$sigma   # We can extract it directly

[1] 0.5907

# Python
np.sqrt(model_lr_rating.scale)   # RMSE

0.590728780660127





    Another metric we can use to assess model fit in this particular situation is the mean absolute error (MAE). MAE is similar to the mean squared error, but instead of squaring the errors, we just take the absolute value. Conceptually it attempts to get at the same idea, how much our predictions miss the target on average, and here the value is 0.46, which we actually showed in our residual plot (Figure 3.3). With either metric, the closer to zero the better, since as we get closer, we are reducing error.


    We can also look at the R-squared (R2) value of the model. R2 is possibly the most popular measure of model performance with linear regression and linear models in general. Before squaring, it’s just the correlation of the values that we saw in the previous plot (Figure 3.2). When we square it, we can interpret it as a measure of how much of the variance in the target is explained by the model. In this case, our model shows the R2 is 0.12, which is not bad for a single feature model in this type of setting. We interpret the value as 12% of the target variance is explained by our model, and more specifically by the features in the model. In addition, we can also interpret R2 as 1 - the proportion of error variance in the target, which we can calculate as \(1 - \frac{\textrm{MSE}}{var(y)}\). In other words the complement of R2 is the proportion of the variance in the target that is not explained by the model. Either way, since 88% is not explained by the model, our result suggests there is plenty of work left to do!

    Note also, that with R2 we get a sense of the variance shared between all features in the model and the target, however complex the model gets. As long as we use it descriptively as a simple correspondence assessment of our predictions and target, it’s a fine metric. For various reasons, it’s not a great metric for comparing models to each other, but again, as long as you don’t get carried away, it’s okay to use.
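Neither MAE nor R2 requires anything special to compute. Here is a quick sketch (not from the text) that calculates both directly from the observed values and the predictions.

y     = df_reviews$rating
preds = predict(model_lr_rating)

mean(abs(y - preds))                           # MAE
1 - sum((y - preds)^2) / sum((y - mean(y))^2)  # R-squared as 1 - residual SS / total SS
cor(y, preds)^2                                # equivalently, the squared correlation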



    3.4.3 Prediction vs. explanation

    In your humble authors’ views, one can’t stress enough the importance of a model’s ability to predict the target. It can be a poor model, maybe because the data is not great, or perhaps we’re exploring a new area of research, but we’ll always be interested in how well a model fits the observed data, and in most situations, we’re just as much or even more interested in how well it predicts new data.

Even to this day, statistical significance is focused on a great deal, and much is made about models that have little to no predictive power at all. As strange as it may sound, you can read results in journal articles, news features, and business reports in many fields with hardly any mention of a model's predictive capability. The focus is almost entirely on the explanation of the model, and usually the statistical significance of the features. In those settings, statistical significance is often used as a proxy for importance, which is rarely ever justified. As we've noted elsewhere, statistical significance is affected by other things besides the size of the coefficient, and without an understanding of the context of the features, in this case, like how long typical reviews are, what their range is, what the variability of ratings is, etc., the information it provides is extremely limited, and many would argue, not very useful. If we are very interested in the coefficient or weight value specifically, it is better to focus on the range of possible values, which is provided by the confidence interval, along with the predictions that come about based on that coefficient's value. While a confidence interval is also a loaded description of a feature's relationship to the target, we can use it in a very practical way as a range of possible values for that weight, and more importantly, think of possibilities rather than certainties.

    Suffice it to say at this point that how much one focuses on prediction vs. explanation depends on the context and goals of the data endeavor. There are cases where predictive capability is of utmost importance, and we care less about explanatory details, but not to the point of ignoring it. For example, even with deep learning models for image classification, where the inputs are just RGB values, we’d still like to know what the (notably complex) model is picking up on, otherwise we may be classifying images based on something like image backgrounds (e.g. outdoors vs. indoors) instead of the objects of actual interest (dogs vs. cats). In some business or other organizational settings, we are very, or even mostly, interested in the coefficients/weights, which might indicate how to allocate monetary resources in some fashion. But if those weights come from a model with no predictive power, placing much importance on them may be a fruitless endeavor.

    In the end we’ll need to balance our efforts to suit the task at hand. Prediction and explanation are both fundamental to the modeling endeavor.



    3.5 Adding Complexity

    We’ve seen how to fit a model with a single feature and interpret the results, and that helps us to get oriented to the general modeling process. However, we’ll always have more than one feature for a model except under some very specific circumstances, such as exploratory data analysis. So let’s see how we can implement a model with more features and that makes more practical sense.



    3.5.1 Multiple features

    We can add more features to our model very simply. Using the standard functions we’ve already demonstrated, we just add them to the formula (both R and statsmodels) as follows.

'y ~ feature_1 + feature_2 + feature_3'

In other cases where we use matrix inputs, additional features will just be additional input columns, and nothing about the model code actually changes. We might have a lot of features, and even for relatively simple linear models this could be dozens in some scenarios. A compact depiction of our model uses matrix representation, which we'll show in the callout below, but you can find more detail in the matrix overview (Appendix B). For our purposes, all you really need to know is that this:

\[
y = X\beta + \epsilon \qquad \textrm{or} \qquad y = \alpha + X\beta + \epsilon
\tag{3.6}\]

    is the same as this:

\[
y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \epsilon
\]

\[
\textbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1p} \\
x_{21} & x_{22} & \dots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{np}
\end{bmatrix}
\]

\[
\boldsymbol{\beta} = \begin{bmatrix}
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_p
\end{bmatrix}
\]

\[
\bf{y = X\beta + \epsilon}
\tag{3.10}\]

You will also see it depicted in a transposed fashion, such that \(y = \beta^\intercal X\), or \(f(x) = w^\intercal X + b\), with the latter formula typically seen when the context is machine learning. This is just a matter of preference, except that it may assume the data is formatted in a different way, or possibly they are talking about matrix/vector operations for a single observation. You'll want to pay close attention to what the dimensions are15.

    For the models considered here and almost all ‘tabular data’ scenarios, the data is stored in the fashion we’ve represented in this text, but you should be aware that other data settings will force you to think of multi-dimensional arrays16 instead of 2-d matrices, for example, with image processing. So it’s good to be flexible.
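If you want to see the matrix form in action, here is a small sketch (not from the text) showing that multiplying the model matrix by the coefficient vector reproduces the model's predictions.

X    = model.matrix(model_lr_rating)   # one column of 1s (the intercept) plus word_count
beta = coef(model_lr_rating)           # the estimated coefficients

head(X %*% beta)                       # matches head(predict(model_lr_rating))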


    With that in mind, let’s get to our model! In what follows, we keep the word count, but now we add some aspects of the reviewer, such as age and the number of children in the household, and features related to the movie, like the release year, the length of the movie in minutes, and the total reviews received. We’ll use the same approach as before, and literally just add them as we depicted in our linear model formula (Equation 3.3) .

# R
model_lr_rating_extra = lm(
    rating ~
        word_count
        + age
        + review_year
        + release_year
        + length_minutes
        + children_in_home
        + total_reviews,
    data = df_reviews
)

summary(model_lr_rating_extra)

# Python
model_lr_rating_extra = smf.ols(
    formula = 'rating ~ word_count \
        + age \
        + review_year \
        + release_year \
        + length_minutes \
        + children_in_home \
        + total_reviews',
    data = df_reviews
).fit()

model_lr_rating_extra.summary(slim = True)


Conceptually this means that the effect of word count is the effect of word count after we've accounted for the other features in the model. In this case, an increase of a single word results in a -0.03 drop, even after adjusting for the effect of other features. Looking at another feature, the addition of a child to the home is associated with a 0.1 increase in rating, accounting for the other features.
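A quick way to see that 'holding the other features constant' interpretation is to compare predictions for two observations that are identical except for one more child in the home. This is just a sketch (not from the text), and the feature values here are made up.

base = tibble(
    word_count = 10, age = 30, review_year = 2020, release_year = 2015,
    length_minutes = 100, children_in_home = 0, total_reviews = 5000
)
plus_one_child = mutate(base, children_in_home = 1)

predict(model_lr_rating_extra, newdata = bind_rows(base, plus_one_child))
# the difference between the two predictions is exactly the children_in_home coefficient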


    Thinking about prediction, how would we get a prediction for a movie rating with a review that is 12 words long, written in 2020, by a 30 year old with one child, for a movie that is 100 minutes long, released in 2015, with 10000 total reviews? Exactly the same as we did before (Section 3.3.2)! We just create a data frame with the values we want, and predict accordingly.

# R
predict_observation = tibble(
    word_count = 12,
    age = 30,
    children_in_home = 1,
    review_year = 2020,
    release_year = 2015,
    length_minutes = 100,
    total_reviews = 10000
)

predict(
    model_lr_rating_extra,
    newdata = predict_observation
)

   1 
3.26 

# Python
predict_observation = pd.DataFrame(
    {
        'word_count': 12,
        'age': 30,
        'children_in_home': 1,
        'review_year': 2020,
        'release_year': 2015,
        'length_minutes': 100,
        'total_reviews': 10000
    },
    index = ['new_observation']
)

model_lr_rating_extra.predict(predict_observation)
    new_observation   3.260
     dtype: float64
    @@ -5363,8 +4840,8 @@

    -

    2.7.2 Categorical features

    +
    +

    3.5.2 Categorical features

    @@ -5373,30 +4850,30 @@

-Table 2.9: One-hot encoding of the season feature
+Table 3.8: One-hot encoding of the season feature
    -
    +
    @@ -5904,16 +5381,16 @@

    - +
    -
    +
    -
    model_lr_cat = lm(
    -    rating ~ word_count + season,
    -    data = df_reviews
    -)
    -
    -summary(model_lr_cat)
    +
    model_lr_cat = lm(
    +    rating ~ word_count + season,
    +    data = df_reviews
    +)
    +
    +summary(model_lr_cat)
    
     Call:
    @@ -5939,14 +5416,14 @@ 

    +
    -
    model_lr_cat = smf.ols(
    -    formula = "rating ~ word_count + season",
    -    data = df_reviews
    -).fit()
    -
    -model_lr_cat.summary(slim = True)
    +
    model_lr_cat = smf.ols(
    +    formula = "rating ~ word_count + season",
    +    data = df_reviews
    +).fit()
    +
    +model_lr_cat.summary(slim = True)

    OLS Regression Results
    @@ -6138,22 +5615,22 @@

    -

    2.7.2.1 Summarizing categorical features

    +
    +

    3.5.2.1 Summarizing categorical features

    When we have a lot of categories, it’s not practical to look at the coefficients for each one, and even when there aren’t that many, we often prefer to get a sense of the total effect of the feature. For standard linear models, we can break down the target variance explained by the model into the variance explained by each feature, and this is called the ANOVA, or analysis of variance. It is not without its issues18, but it’s a good way to get a sense of whether a categorical (or other) feature as a whole is statistically significant.

    - +
    -
    +
    -
    anova(model_lr_cat)
    +
    anova(model_lr_cat)
    -
    +
    -
    import statsmodels.api as sm
    -
    -sm.stats.anova_lm(model_lr_cat)
    +
    import statsmodels.api as sm
    +
    +sm.stats.anova_lm(model_lr_cat)
    @@ -6162,30 +5639,30 @@

-Table 2.10: ANOVA Table for Categorical Feature
+Table 3.9: ANOVA Table for Categorical Feature
    -
    +
    @@ -6651,9 +6128,9 @@

    19.

    -
    -

    2.7.2.2 Group predictions

    -

    A better approach to understanding categorical features for standard linear models is through what are called marginal effects, which can provide a kind of average prediction for each category while accounting for the other features in the model. Better still is to visualize these. It’s actually tricky to define ‘average’ when there are multiple features and interactions involved, so be careful, but we’d interpret the result similarly in those cases as best we can. In this case, we expect higher ratings for summer releases. We’ll return more to this concept in Section 3.3.3.

    +
    +

    3.5.2.2 Group predictions

    +

    A better approach to understanding categorical features for standard linear models is through what are called marginal effects, which can provide a kind of average prediction for each category while accounting for the other features in the model. Better still is to visualize these. It’s actually tricky to define ‘average’ when there are multiple features and interactions involved, so be careful, but we’d interpret the result similarly in those cases as best we can. In this case, we expect higher ratings for summer releases. We’ll return more to this concept in Section 4.3.3.
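If you don’t have a package handy to compute these, a rough do-it-yourself version is to set the categorical feature to each of its levels for every observation, average the model’s predictions, and compare the results. Here is a minimal sketch of that idea, assuming the df_reviews data and the model_lr_cat fit shown above are available (tools like the marginaleffects package handle this, and the associated uncertainty, more carefully).

# average prediction per season, holding the rest of the data as observed
import pandas as pd

season_levels = df_reviews['season'].unique()

marginal_preds = {}
for level in season_levels:
    # counterfactual data: everyone gets the same season, other features unchanged
    df_counterfactual = df_reviews.assign(season = level)
    marginal_preds[level] = model_lr_cat.predict(df_counterfactual).mean()

pd.Series(marginal_preds)  # e.g., we'd expect the summer value to be highest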

    @@ -6662,7 +6139,7 @@

-Figure 2.5: Marginal Effects of Season on Rating
+Figure 3.4: Marginal Effects of Season on Rating
    @@ -6670,13 +6147,13 @@

    -
    -

    2.7.3 Other model complexities

    +
    +

    3.5.3 Other model complexities

    There are a lot more fun things we can do while still employing a linear model. We can add interactions between features, account for non-linear relationships, and enhance the linear model we’ve seen to improve predictions. We’ll talk more about these types of techniques throughout the rest of Parts I and II.

    -
    -

    2.8 Assumptions and More

    +
    +

    3.6 Assumptions and More

    Every model you use has underlying assumptions which, if not met, could potentially result in incorrect inferences about the effects, performance, or predictive capabilities of the model. The standard linear regression model we’ve shown is no different, and it has a number of assumptions that must be met for it to be statistically valid. Briefly they are:

      @@ -6686,7 +6163,7 @@

• The features are not correlated with the error (prediction errors, unobserved causes)
• Your data observations are independent of each other
• The prediction errors are homoscedastic (don’t have large errors with certain predictions vs low with others)
• Normality of the errors (i.e. your prediction errors). Another way to put it is that your target variable is normally distributed conditional on the features.

    Things a linear regression model does not assume:

      @@ -6694,6 +6171,10 @@

    • For example, using categorical features is fine
• That the target is normally distributed
• The assumed distribution is conditional on the features
• That the relationship between the features and target is linear
• Interactions, polynomial terms, etc. are all fine
@@ -6717,8 +6198,8 @@

So basically, whether or not you meet the assumptions of your model doesn’t actually say much about whether the model is great or terrible. For the linear regression model, if you do meet those assumptions, your coefficient estimates are unbiased20, and in general, your statistical inferences are valid ones. If you don’t meet the assumptions, there are alternative versions of the linear model you could use that would potentially address the issues. For example, data that runs over a sequence of time (time series data) violates the independence assumption, since observations closer in time are more likely to be similar than those farther apart. Violation of this assumption will result in standard errors, and thus inferences, that are misleading.

      But we would use a time series or similar model instead to account for this. If normality is difficult to meet, you could assume a different data generating distribution. We’ll discuss some of these approaches explicitly in later chapters, but it’s also important to note that not meeting the assumptions for the baseline model may only mean you’ll prefer a different type of linear or other model to use in order to meet them.
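If you want a quick sense of whether some of these assumptions are plausible, plotting the residuals is usually the most informative first step. Below is a minimal sketch using the model_lr_rating_extra fit from earlier; formal tests exist, but we’d generally start with plots like these.

# residuals vs. fitted values (homoscedasticity) and a QQ plot (normality of errors)
import matplotlib.pyplot as plt
import statsmodels.api as sm

residuals = model_lr_rating_extra.resid
fitted    = model_lr_rating_extra.fittedvalues

fig, (ax_scatter, ax_qq) = plt.subplots(1, 2, figsize = (10, 4))

# the spread of residuals should look roughly constant across fitted values
ax_scatter.scatter(fitted, residuals, alpha = .5)
ax_scatter.axhline(0)
ax_scatter.set(xlabel = 'Fitted values', ylabel = 'Residuals')

# points should fall near the reference line if the errors are roughly normal
sm.qqplot(residuals, line = '45', fit = True, ax = ax_qq)

plt.tight_layout()
plt.show()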

      -
      -

      2.8.1 Assumptions with more complex models

      +
      +

      3.6.1 Assumptions with more complex models

Let’s say you’re running some XGBoost or a deep linear model and getting outstanding predictions. ‘Assumptions smumptions’, you say! And you might even be right! But if you want to talk confidently about feature contributions, or know something about the uncertainty in the predictions (which you’re assessing, right?), then you might want to know whether you’re meeting your assumptions. Some of them are:

      • You have enough data to make the model generalizable
@@ -6735,24 +6216,24 @@

        -

        2.9 Classification

        +
        +

        3.7 Classification

Up to this point we’ve been using a continuous, numeric target. But what about a categorical target? For example, what if we just had a binary target of whether a movie was good or bad? We will dive much more into classification models in our upcoming chapters, but it turns out that we can still formulate it as a linear model problem. The main difference is that we use a transformation of our linear combination of features, using what is sometimes called a link function, and we’ll need to use a different objective function rather than least squares, such as the binomial likelihood, to deal with the binary target. This also means we’ll move away from R² as a measure of model fit, and look at something else, like accuracy.

        -

        Graphically we can see it in the following way, which when compared with our linear model (Figure 2.2), doesn’t look much different. In what follows, we create our linear combination of features and put it through the sigmoid function, which is a common link function for binary targets21. The result is a probability, which we can then use to classify the observation as good or bad based on a chosen threshold. For example, we might say that any instance associated with a probability greater than or equal to 0.5 is classified as ‘good’, and less than that is classified as ‘bad’.

        +

        Graphically we can see it in the following way, which when compared with our linear model (Figure 3.1), doesn’t look much different. In what follows, we create our linear combination of features and put it through the sigmoid function, which is a common link function for binary targets21. The result is a probability, which we can then use to classify the observation as good or bad based on a chosen threshold. For example, we might say that any instance associated with a probability greater than or equal to 0.5 is classified as ‘good’, and less than that is classified as ‘bad’.
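To make the mechanics concrete, here is a toy sketch of that transformation: a linear combination of a couple of features pushed through the sigmoid, then thresholded at 0.5. The coefficients are made up purely for illustration, not taken from any model in this chapter.

import numpy as np

rng = np.random.default_rng(42)

word_count = rng.poisson(10, size = 5)           # hypothetical feature values
age        = rng.integers(18, 70, size = 5)

linear_predictor = -1 - 0.1 * word_count + 0.02 * age    # made-up coefficients
probability      = 1 / (1 + np.exp(-linear_predictor))   # sigmoid (inverse logit)
predicted_class  = np.where(probability >= .5, 'good', 'bad')

print(np.round(probability, 2), predicted_class)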

-Figure 2.6: A Linear Model with Transformation Can Be a Logistic Regression
+Figure 3.5: A Linear Model with Transformation Can Be a Logistic Regression
        -

        As soon as we move away from the standard linear model and use transformations of our linear predictor, simple coefficient interpretation becomes difficult, sometimes exceedingly so. We will explore more of these types of models and how to interpret them in later chapters (e.g. Chapter 5).

        +

        As soon as we move away from the standard linear model and use transformations of our linear predictor, simple coefficient interpretation becomes difficult, sometimes exceedingly so. We will explore more of these types of models and how to interpret them in later chapters (e.g. Chapter 6).

        -
        -

        2.10 More Linear Models

        +
        +

        3.8 More Linear Models

        Before we leave our humble linear model, let’s look at some others. Here is a brief overview of some of the more common ‘linear’ models you might encounter.

        Generalized Linear Models and related

        @@ -6785,25 +6266,25 @@

        All of these are explicitly linear models or can be framed as such, and most require only a tweak or two from what you’ve already seen - e.g. a different distribution, a different link function, penalizing the coefficients, etc. In other cases, we can bounce from one to another and even get similar results. For example we can reshape our multivariate outcome to be amenable to a mixed model approach and get the exact same results. We can potentially add a random effect to any model, and that random effect can be based on time, spatial or other considerations. The important thing to know is that the linear model is a very flexible tool that expands easily, and allows you to model most of the types of outcomes we are interested in. As such, it’s a very powerful approach to modeling.

        -
        -

        2.11 Wrapping Up

        +
        +

        3.9 Wrapping Up

        Linear models such as the linear regression demonstrated in this chapter are a very popular tool for data analysis, and for good reason. They are relatively easy to implement and they are very flexible. They can be used for prediction, explanation, and inference, and they can be used for a wide variety of data types. There are also many tools at our disposal to help us use and explore them. But they are not without their limitations, and you’ll want to have more in your toolbox than just the approach we’ve seen so far.

        -
        -

        2.11.1 The common thread

        +
        +

        3.9.1 The common thread

        In most of the chapters we want to highlight the connections between models you’ll encounter. Linear models are the starting point for modeling, and they can be used for a wide variety of data types and tasks. The linear regression with a single feature is identical to a simple correlation if the feature is numeric, a t-test if it is binary, and an ANOVA if it is categorical. We explored a more complex model with multiple features, and saw how to interpret the coefficients and make predictions. The creation of a combination of features to predict a target is the basis of all models, and as such the linear model we’ve just seen is the real starting point on your data science journey.

        -
        -

        2.11.2 Choose your own adventure

        +
        +

        3.9.2 Choose your own adventure

        Now that you’ve got the basics, where do you want to go?

        -
        -

        2.11.3 Additional resources

        +
        +

        3.9.3 Additional resources

        If you are interested in a deeper dive into the theory and assumptions behind linear models, you can check out more traditional statistical/econometric treatments such as:

        • Gelman, Hill, and Vehtari (2020)
@@ -6823,8 +6304,8 @@

But there are many, many books on statistical analysis, linear models, and linear regression specifically. Texts tend to get more mathy and theoretical the further back in time you go, compared to the mostly applied and code-based treatments of today. You will likely need to do a bit of exploration to find one you like best. We also recommend you check out the many statistics and modeling courses like those on Coursera, edX, and similar platforms, and the many tutorials and blog posts on the internet. Great demonstrations of specific topics can be found on YouTube, in blog posts, and elsewhere. Just start searching and you’ll find a lot of great resources!

        -
        -

        2.12 Exercise

        +
        +

        3.10 Exercise

Import some data. Stick with the movie reviews data if you want and just try out other features, or maybe try the world happiness 2018 data. You can find details about it in the appendix Section A.2, and you can download it here.

        • Fit a linear model, maybe keep it to three features or less
@@ -6895,7 +6376,7 @@

          A lot of statisticians and causal modeling folks get very hung up on the terminology here, but we’ll leave that to them as we’d like to get on with things. For our purposes, we’ll just say that we’re interested in the effect of a feature after we’ve accounted for the other features in the model.↩︎

        • There are many types of ANOVA, and different ways to calculate the variance values. One may notice the Python ANOVA result is different, even though the season coefficients and initial model is identical. R defaults with what is called Type II sums of squares, while the Python default uses Type I sums of squares. We won’t bore you with the details of their differences, and the astute modeler will not come to different conclusions because of this sort of thing, and you now have enough detail to look it up.↩︎

        • For those interested, for those features with one degree of freedom, all else being equal the F statistic here would just be the square of the t-statistic for the coefficients, and the p-value would be the same.↩︎

        • -
        • This means they are correct on average, not the true value. And if they were biased, this refers to statistical bias, and has nothing to do with the moral or ethical implications of the data, or whether the features themselves are biased in measurement. Culturally biased data is a different problem than statistical/prediction bias or measurement error, though they are not mutually exclusive. Statistical bias can more readily be tested, while other types of bias are more difficult to assess. Even statistical unbiasedness is not necessarily a goal, as we will see later Section 4.8.↩︎

        • +
        • This means they are correct on average, not the true value. And if they were biased, this refers to statistical bias, and has nothing to do with the moral or ethical implications of the data, or whether the features themselves are biased in measurement. Culturally biased data is a different problem than statistical/prediction bias or measurement error, though they are not mutually exclusive. Statistical bias can more readily be tested, while other types of bias are more difficult to assess. Even statistical unbiasedness is not necessarily a goal, as we will see later Section 5.8.↩︎

        • The sigmoid function in this case is the inverse logistic function, and the resulting statistical model is called logistic regression. In other contexts the model would not be a logistic regression, but this is still a very commonly used activation function. But many others could potentially be used e.g. using a normal instead of logistic distribution, resulting in the so-called probit model.↩︎

        @@ -7323,13 +6804,13 @@

diff --git a/docs/linear_models_files/figure-html/cat-feature-viz-r-1.png b/docs/linear_models_files/figure-html/cat-feature-viz-r-1.png
deleted file mode 100644
index 5095a40..0000000
Binary files a/docs/linear_models_files/figure-html/cat-feature-viz-r-1.png and /dev/null differ
diff --git a/docs/linear_models_files/figure-html/fig-corr-plot-1.png b/docs/linear_models_files/figure-html/fig-corr-plot-1.png
deleted file mode 100644
index 97cd80c..0000000
Binary files a/docs/linear_models_files/figure-html/fig-corr-plot-1.png and /dev/null differ
diff --git a/docs/linear_models_files/figure-html/fig-my-first-model-predictions-plot-1.png b/docs/linear_models_files/figure-html/fig-my-first-model-predictions-plot-1.png
index 472006f..771f461 100644
Binary files a/docs/linear_models_files/figure-html/fig-my-first-model-predictions-plot-1.png and b/docs/linear_models_files/figure-html/fig-my-first-model-predictions-plot-1.png differ
diff --git a/docs/linear_models_files/figure-html/fig-pp-scatter-1.png b/docs/linear_models_files/figure-html/fig-pp-scatter-1.png
deleted file mode 100644
index e9d7900..0000000
Binary files a/docs/linear_models_files/figure-html/fig-pp-scatter-1.png and /dev/null differ
diff --git a/docs/machine_learning.html b/docs/machine_learning.html
index f4c1c8d..2513a92 100644
--- a/docs/machine_learning.html
+++ b/docs/machine_learning.html
@@ -7,7 +7,7 @@
-7  Core Concepts in Machine Learning – [Models Demystified]{.smallcaps}
+8  Core Concepts in Machine Learning – [Models Demystified]{.smallcaps}
@@ -998,9 +1015,9 @@

        For specific types of tasks and models you might use something else, but the above will suffice to get you started with many common settings. Even when dealing with different types of targets, such as counts, proportions, etc., one can use an appropriate likelihood objective, which allows you to cover a bit more ground.

        -
        -

        7.3 Performance Metrics

        -

        When discussing how to understand our model (Section 3.2), we noted there are many performance metrics used in machine learning. Care should be taken to choose the appropriate one for your situation. Usually we have a standard set we might use for the type of predictive problem. For example, for numeric targets, we typically are interested in (R)MSE and MAE. For classification problems, many metrics are based on the confusion matrix, which is a table of the predicted classes versus the observed classes. From that we can calculate things like accuracy, precision, recall, AUROC, etc. (refer to Table 3.1).

        +
        +

        8.3 Performance Metrics

        +

        When discussing how to understand our model (Section 4.2), we noted there are many performance metrics used in machine learning. Care should be taken to choose the appropriate one for your situation. Usually we have a standard set we might use for the type of predictive problem. For example, for numeric targets, we typically are interested in (R)MSE and MAE. For classification problems, many metrics are based on the confusion matrix, which is a table of the predicted classes versus the observed classes. From that we can calculate things like accuracy, precision, recall, AUROC, etc. (refer to Table 4.1).

        As an example, and as a reason to get our first taste of machine learning, let’s get some metrics for a movie review model. Depending on the tool used, getting one type of metric should be as straightforward as most others if we’re using common metrics. As we start our journey into machine learning, we’ll show Python code first, as it’s the dominant tool. Here we’ll model the target in both numeric and binary form with corresponding metrics.
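As a hedged sketch of what pulling such metrics looks like with scikit-learn, here is a version with simulated data standing in for the review features, so the numbers are not the ones reported below.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    accuracy_score, precision_score, recall_score, roc_auc_score,
)

rng = np.random.default_rng(123)
X = rng.normal(size = (500, 3))                          # stand-in features
y_num = X @ [0.5, -0.25, 0.1] + rng.normal(size = 500)   # numeric target
y_bin = (y_num > 0).astype(int)                          # 'good' vs. 'bad' version

# numeric target: (R)MSE and MAE
pred_num = LinearRegression().fit(X, y_num).predict(X)
rmse = mean_squared_error(y_num, pred_num) ** 0.5
mae  = mean_absolute_error(y_num, pred_num)

# binary target: confusion-matrix-based metrics plus AUROC
model_bin  = LogisticRegression().fit(X, y_bin)
pred_class = model_bin.predict(X)
pred_prob  = model_bin.predict_proba(X)[:, 1]

print(f'RMSE {rmse:.2f}  MAE {mae:.2f}')
print(
    f'Accuracy {accuracy_score(y_bin, pred_class):.2f}',
    f'Precision {precision_score(y_bin, pred_class):.2f}',
    f'Recall {recall_score(y_bin, pred_class):.2f}',
    f'AUROC {roc_auc_score(y_bin, pred_prob):.2f}',
)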

        @@ -1114,30 +1131,30 @@

-Table 7.2: Example Metrics for Linear and Logistic Regression Models
+Table 8.2: Example Metrics for Linear and Logistic Regression Models
        -
        +
        @@ -1602,8 +1619,8 @@

        -
        -

        7.4 Generalization

        +
        +

        8.4 Generalization

        Getting metrics is easy enough, but how will we use them? One of the key differences separating ML from traditional statistical modeling approaches is the assessment of performance on unseen or future data, a concept commonly referred to as generalization. The basic idea is that we want to build a model that will perform well on new data, and not just the data we used to train the model. This is because ultimately data is ever evolving, and we don’t want to be beholden to a particular set of data we just happened to have at a particular time and context.

        But how do we do this? As a starting point, we can simply split (often called partitioning) our data into two sets, a training set and a test set, often called a holdout set. The test set is typically a smaller subset, say 25% of the original data, but this amount is arbitrary, and will reflect the data situation. We fit or train the model on the training set, and then use the model to make predictions on, or score, the test set. This general approach is also known as the holdout method. Consider a simple linear regression. We can fit the linear regression model on the training set, which provides us coefficients, etc. We can then use that model result to predict on the test set, and then compare the predictions to the observed target values in the test set. Here we demonstrate this with our simple linear model.
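Here is a minimal sketch of that holdout approach with scikit-learn; simulated data stands in for the review data used for the results that follow.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size = (1000, 3))
y = X @ [0.4, -0.2, 0.1] + rng.normal(scale = 0.5, size = 1000)

# hold out 25% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 42
)

model = LinearRegression().fit(X_train, y_train)

rmse_train = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
rmse_test  = mean_squared_error(y_test,  model.predict(X_test))  ** 0.5
print(f'train RMSE {rmse_train:.3f} vs. test RMSE {rmse_test:.3f}')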

        @@ -1694,7 +1711,7 @@

-Table 7.3: RMSE for Linear Regression Model on Train and Test Sets
+Table 8.3: RMSE for Linear Regression Model on Train and Test Sets
        @@ -2168,20 +2185,20 @@

        So there you have it, you just did some machine learning! And now we have a model that we can use to predict with any new data that comes along with ease. But as we’ll soon see, there are limitations to doing things this simply. But conceptually this is an important idea, and one we will continue to return to.

        -
        -

        7.4.1 Using metrics for model evaluation and selection

        -

        As we’ve seen elsewhere, there are many performance metrics to choose from to assess model performance, and the choice of metric depends on the type of problem (Section 3.2). It also turns out that assessing the metric on the data we used to train the model does not give us the best assessment of that metric. This is because the model will do better on the data it was trained on than on new data it wasn’t trained on, and we can generally always improve that metric in training by making the model more complex. However, in many modeling situations, this complexity comes at the expense of generalization. So what we really want to ultimately say about our model will regard performance on the test set with our chosen metric, and not the data we used to train the model. At that point, we can also compare multiple models to one another given their performance on the test set, and select the one that performs best.

        +
        +

        8.4.1 Using metrics for model evaluation and selection

        +

As we’ve seen elsewhere, there are many performance metrics to choose from to assess model performance, and the choice of metric depends on the type of problem (Section 4.2). It also turns out that assessing the metric on the data we used to train the model does not give us the best assessment of that metric. This is because the model will do better on the data it was trained on than on new data it wasn’t trained on, and we can generally always improve that metric in training by making the model more complex. However, in many modeling situations, this complexity comes at the expense of generalization. So what we ultimately want to say about our model concerns its performance on the test set with our chosen metric, not on the data we used to train the model. At that point, we can also compare multiple models to one another given their performance on the test set, and select the one that performs best.

In the previous section you can compare our results on the test vs. training set. Metrics are notably better on the training set on average, and that’s what we see here. But since we should be more interested in how well the model will do on new data, we use the test set to get a sense of that.

        -
        -

        7.4.2 Understanding test error and generalization

        +
        +

        8.4.2 Understanding test error and generalization

        This part gets into the weeds a bit. If you are not so inclined, skip to the summary of this section.

        In the following discussion, you can think of a standard linear model scenario, e.g. with squared-error loss function, and a data set where we split some of the observations in a random fashion into a training set, for initial model fitting, and a test set, which will be kept separate and independent, and used to measure generalization performance. We note training error as the average loss over all the training sets we could create in this process of random splitting. The test error is the average prediction error obtained when a model fitted on the training data is used to make predictions on the test data.

        -
        -

        7.4.2.1 Generalization in the classical regime

        +
        +

        8.4.2.1 Generalization in the classical regime

        So what result should we expect in this scenario? Let’s look at the following visualization inspired by Hastie, Tibshirani, and Friedman (2017).

        @@ -2190,7 +2207,7 @@

-Figure 7.1: Bias Variance Tradeoff
+Figure 8.1: Bias Variance Tradeoff

        @@ -2199,12 +2216,12 @@

        -

        7.4.2.2 Generalization in deep learning

        +
        +

        8.4.2.2 Generalization in deep learning

        It turns out that with lots of data and very complex models, or maybe even in most settings, our ‘classical’ understanding just described doesn’t hold up. In fact, it is possible to get a model that fits the training data perfectly, and yet ultimately still generalizes well to new data!

        -

        This phenomenon is encapsulated in the notion of double descent. The idea is that, with overly complex models such as those employed with deep learning, we can get to the point of interpolating the data exactly. But as we continue to increase the complexity of the model, we actually start to generalize better again, and visually this displays as a ‘double descent’ in terms of test error. We see an initial decrease in test error as the model gets better in general. After a while, it begins to rise as we would expect in the classical regime (Figure 7.1). Eventually it peaks at the point where we have as many parameters as data points. Beyond that however, as we get even more complex with our model, we can possibly see a decrease in test error again. Crazy!

        -

        We can demonstrate this on the classic mtcars dataset3, which has only 32 observations! We repeatedly trained a model to predict miles per gallon on only 10 of those observations, and assess test error on the rest. The model we used is a form of ridge regression, but we implemented splines for the car’s weight, horsepower, and displacement4, i.e. we GAMed it up (Section 6.4). We trained increasingly complex models, and in what follows we visualize the error as a function of model complexity.

        +

        This phenomenon is encapsulated in the notion of double descent. The idea is that, with overly complex models such as those employed with deep learning, we can get to the point of interpolating the data exactly. But as we continue to increase the complexity of the model, we actually start to generalize better again, and visually this displays as a ‘double descent’ in terms of test error. We see an initial decrease in test error as the model gets better in general. After a while, it begins to rise as we would expect in the classical regime (Figure 8.1). Eventually it peaks at the point where we have as many parameters as data points. Beyond that however, as we get even more complex with our model, we can possibly see a decrease in test error again. Crazy!

        +

We can demonstrate this on the classic mtcars dataset3, which has only 32 observations! We repeatedly trained a model to predict miles per gallon on only 10 of those observations, and assessed test error on the rest. The model we used is a form of ridge regression, but we implemented splines for the car’s weight, horsepower, and displacement4, i.e. we GAMed it up (Section 7.4). We trained increasingly complex models, and in what follows we visualize the error as a function of model complexity.

        On the left part of the visualization, we see that the test error dips as we get a better model. Our best test error is noted by the large gray dot. Eventually though, the test error rises as expected, even as training error gets better. Test error eventually hits a peak when the number of parameters equals the number of training observations. But then we keep going, and the test error starts to decrease again! By the end we have essentially perfect training prediction, and our test error is as good as it was with the simpler models. This is the double descent phenomenon with one of the simplest datasets around. Cool!

        @@ -2215,21 +2232,21 @@

-Figure 7.2: Double Descent on the classic mtcars dataset
+Figure 8.2: Double Descent on the classic mtcars dataset
        -
        -

        7.4.2.3 Generalization summary

        +
        +

        8.4.2.3 Generalization summary

        The take home point is this: our primary concern is generalization error. We can reduce this error by increasing model complexity, but this may eventually cause test error to increase. However, with enough data and model complexity, we can get to the point where we can fit the training data perfectly, and yet still generalize well to new data. In many standard or at least smaller data and model settings, you can maybe assume the classical regime holds. But when employing deep learning with massive data and billions of parameters, you can worry less about the model’s complexity. But no matter what, we should use tools to help make our model work better, and we prefer smaller and simpler models that can do as well as more complex ones, even if those ‘smaller’ models are still billions of parameters!

        -
        -

        7.5 Regularization

        +
        +

        8.5 Regularization

        We now are very aware that a key aspect of the machine learning approach is having our model to work well with new data. One way to improve generalization is through the use of regularization, which is a general approach to penalize complexity in a model, and is typically used to prevent overfitting. Overfitting occurs when a model fits the training data very well, but does not generalize well to new data. This usually happens when the model is too complex and starts fitting to random noise in the training data. We can also have the opposite problem, where the model is too simple to capture the patterns in the data, and this is known as underfitting5.

In the following demonstration, the first plot shows results from a model that is probably too complex for the data setting. The curve is very wiggly as it tries to capture as much of the data as possible, and is an example of overfitting. The second plot shows a straight line fit as we’d get from linear regression. It’s too simple for the underlying feature-target relationship, and is an example of underfitting. The third plot shows a model that is a better fit to the data, and is an example of a model that is complex enough to capture the nonlinear aspect of the data, but not so complex that it capitalizes on a lot of noise.

        @@ -2240,7 +2257,7 @@

-Figure 7.3: Overfitting and Underfitting
+Figure 8.3: Overfitting and Underfitting
        @@ -2251,7 +2268,7 @@

-Table 7.4: RMSE for each model on new data
+Table 8.4: RMSE for each model on new data
        @@ -2745,10 +2762,10 @@

        -

        A fairly simple example of regularization can be seen with a ridge regression model (Section 4.8), where we add a penalty term to the objective function. The penalty is a function of the size of the coefficients, and helps keep the model from getting too complex. It is also known as L2 regularization due to squaring the coefficients. Another type is the L1 penalty, used in the ‘lasso’ model, which is based on the absolute values of the coefficients. Yet another common approach combines the two, called elastic net. There we adjust the balance between the L1 and L2 penalties, and use cross-validation to find the best balance. L1 and/or L2 penalties are applied in many other models such as gradient boosting, neural networks, and others, and are a key aspect of machine learning.

        +

        A fairly simple example of regularization can be seen with a ridge regression model (Section 5.8), where we add a penalty term to the objective function. The penalty is a function of the size of the coefficients, and helps keep the model from getting too complex. It is also known as L2 regularization due to squaring the coefficients. Another type is the L1 penalty, used in the ‘lasso’ model, which is based on the absolute values of the coefficients. Yet another common approach combines the two, called elastic net. There we adjust the balance between the L1 and L2 penalties, and use cross-validation to find the best balance. L1 and/or L2 penalties are applied in many other models such as gradient boosting, neural networks, and others, and are a key aspect of machine learning.
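To see what the penalties actually do, here is a small sketch comparing coefficients from OLS, ridge, lasso, and elastic net fits in scikit-learn. The penalty strength (alpha) is set arbitrarily here rather than tuned by cross-validation, and the data is simulated.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size = (200, 5))
y = X @ [1.0, 0.5, 0.0, 0.0, -0.5] + rng.normal(size = 200)

models = {
    'ols':     LinearRegression(),
    'ridge':   Ridge(alpha = 10),                       # L2: squared coefficients
    'lasso':   Lasso(alpha = 0.1),                      # L1: absolute coefficients
    'elastic': ElasticNet(alpha = 0.1, l1_ratio = 0.5), # a mix of L1 and L2
}

for name, model in models.items():
    print(f'{name:8s}', np.round(model.fit(X, y).coef_, 2))

The ridge coefficients are shrunk toward zero, while the lasso penalty can set some of them exactly to zero.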

        Regularization is used in many modeling scenarios. Here is a quick rundown of some examples.

          -
        • GAMs use penalized regression for estimation of the coefficients for the basis functions (typically with L2). This keeps the ‘wiggly’ part of the GAM from getting too wiggly, as in the overfit model above (Figure 7.3). This shrinks the feature-target relationship toward a linear one.

        • +
        • GAMs use penalized regression for estimation of the coefficients for the basis functions (typically with L2). This keeps the ‘wiggly’ part of the GAM from getting too wiggly, as in the overfit model above (Figure 8.3). This shrinks the feature-target relationship toward a linear one.

        • Similarly, the variance estimate of a random effect in mixed models, e.g. for the intercept or slope, is inversely related to an L2 penalty on the effect estimates for that group effect. The more penalization applied, the less random effect variance, and the more the random effect is shrunk toward the overall mean7.

          @@ -2761,7 +2778,7 @@

-Figure 7.4: A neural net with dropout
+Figure 8.4: A neural net with dropout

        @@ -2784,8 +2801,8 @@

        -
        -

        7.6 Cross-validation

        +
        +

        8.6 Cross-validation

        So we’ve talked a lot about generalization, so now let’s think about some ways to go about a general process of selecting parameters for a model and assessing performance.

        We previously used a simple approach where we split the data into training and test sets, fitted the model on the training set, and then assessed performance on the test set. This is fine, but the test set error, or any other metric, has uncertainty. It would be slightly different with any training-test split we came up with.

        We’d also like to get better model assessment when searching the parameter space, because there are parameters for which we have no way of guessing the value beforehand, and we’ll need to try out different ones. An example would be the penalty parameter in lasso regression. In this case, we need to figure out the best parameters before assessing a final model’s performance.
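As a sketch of how k-fold cross-validation looks in code, here is a lasso model with a fixed penalty evaluated across five folds on simulated data; normally we’d also be searching over the penalty value, which is where tuning comes in.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size = (300, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size = 300)

cv = KFold(n_splits = 5, shuffle = True, random_state = 42)

# negative MSE is scikit-learn's 'higher is better' convention for this metric
scores = cross_val_score(
    Lasso(alpha = 0.1), X, y,
    cv = cv, scoring = 'neg_mean_squared_error'
)

print('per-fold RMSE:', np.round(np.sqrt(-scores), 2))
print('average RMSE: ', round(np.sqrt(-scores).mean(), 2))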

        @@ -2797,7 +2814,7 @@

-Figure 7.5: 3-fold Cross Validation
+Figure 8.5: 3-fold Cross Validation
        @@ -2873,8 +2890,8 @@

        -

        7.6.1 Methods of cross-validation

        +
        +

        8.6.1 Methods of cross-validation

        There are different approaches we can take for cross-validation that we may need for different data scenarios. Here are some of the more common ones.

        • Shuffled: Shuffling prior to splitting can help avoid data ordering having undue effects.
        • @@ -2905,7 +2922,7 @@

-Figure 7.6: A comparison of cross-validation strategies.
+Figure 8.6: A comparison of cross-validation strategies.
          @@ -2927,8 +2944,8 @@

        -
        -

        7.7 Tuning

        +
        +

        8.7 Tuning

One problem with the previous ridge logistic model we just used is that we set the penalty parameter to a fixed value. We can do better by searching over a range of values instead, and picking a ‘best’ value based on which model performs to our liking. This is generally known as hyperparameter tuning, or simply tuning. We can do this with k-fold cross-validation to assess the error for each of the penalty parameter values we try. We then select the value of the penalty parameter that gives the lowest average error. This is a form of model selection.

Another potential point of concern is that we are using the same data to both select the model and assess its performance. This is a form of the more general phenomenon of data leakage, and may result in an overly optimistic assessment of performance. One solution is to do as we’ve discussed before, which is to split the data into three parts: training, validation, and test. We use the training set(s) to fit the models, assess their performance on the validation set(s), and select the best model. Then finally we use the test set to assess the best model’s performance. So the validation approach is used to select the model, and the test set is used to assess that model’s performance. The following visualizations from the scikit-learn documentation illustrate the process.

        @@ -2954,7 +2971,7 @@

-Figure 7.7: A tuning workflow.
+Figure 8.7: A tuning workflow.
        @@ -2974,8 +2991,8 @@

        -
        -

        7.7.1 A tuning example

        +
        +

        8.7.1 A tuning example

        While this may start to sound complicated, it doesn’t have to be, as tools are available to make our generalization journey a lot easier. In the following we demonstrate this with the same ridge logistic regression model. The approach we use is called a grid search, where we explicitly step through potential values of the penalty parameter, fitting a model with the selected value through cross-validation. While we only look at one parameter here, for a given modeling approach we could construct a ‘grid’ of sets of parameter values to search over as well9. For each hyperparameter value, we are interested in the average accuracy score across the folds to assess the best performance. The final model can then be assessed on the test set10
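As a rough sketch of that workflow (not the exact data or settings behind the results shown below), a grid search with scikit-learn might look like the following, where C is the inverse penalty strength for an L2 (ridge) logistic regression and the grid values are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size = (500, 8))
y = (X[:, 0] - X[:, 1] + rng.normal(size = 500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}   # candidate penalty settings

grid_search = GridSearchCV(
    LogisticRegression(penalty = 'l2'),
    param_grid = param_grid,
    cv = 5,
    scoring = 'accuracy',
).fit(X_train, y_train)

print('best C:', grid_search.best_params_['C'])
print('CV accuracy:', round(grid_search.best_score_, 3))
print('test accuracy:', round(grid_search.score(X_test, y_test), 3))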

        @@ -3018,7 +3035,7 @@

-Table 7.5: Results of hyperparameter tuning
+Table 8.5: Results of hyperparameter tuning
        @@ -3085,7 +3102,7 @@

-Table 7.6: Results of hyperparameter tuning
+Table 8.6: Results of hyperparameter tuning
        @@ -3101,8 +3118,8 @@

        So there you have it. We searched the parameter space, chose the best set of parameters via k-fold cross validation, and got an assessment of generalization error. Neat!

        -
        -

        7.7.2 Parameter spaces

        +
        +

        8.7.2 Parameter spaces

        In the previous example, we used a grid search to search over a range of values for the penalty parameter. It is a quick and easy way to get started, but generally we want something that can search a better space of parameter values rather than a limited grid. It can also be computationally expensive with many hyperparameters, as we might have with boosting methods. We can do better by using more efficient approaches. For example, we can use a random search, where we randomly sample from the parameter space. This is generally faster than a grid search, and can be just as effective. Other methods are available that better explore the space and do so more efficiently.
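A possible random search version of the previous sketch samples penalty values from a distribution instead of stepping through a fixed grid; the log-uniform range and number of iterations here are illustrative.

from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    LogisticRegression(penalty = 'l2'),
    param_distributions = {'C': loguniform(1e-3, 1e3)},
    n_iter = 20,          # number of randomly sampled candidate values
    cv = 5,
    scoring = 'accuracy',
    random_state = 42,
)
# random_search.fit(X_train, y_train) then proceeds just like the grid search above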

        -
        -

        7.8 Pipelines

        +
        +

        8.8 Pipelines

        For production-level work, or just for reproducibility, it is often useful to create a pipeline for your modeling work. A pipeline is a series of steps that are performed in sequence. For example, we might want to perform the following steps:

        • Impute missing values
@@ -3224,20 +3241,20 @@

          Development and deployment of pipelines will depend on your specific use case, and can get notably complicated. Think of a case where your model is the culmination of features drawn from dozens of wildly different databases, and the model itself being a complex ensemble of models, each with their own hyperparameters. You can imagine the complexity of the pipeline that would be required to handle all of that, but it is possible. Even then the basic approach is the same, and pipelines are a great way to organize your modeling work.
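As a bare-bones sketch of the idea with scikit-learn, here is a small pipeline that imputes, standardizes, and then fits a model, all as a single object that can be cross-validated, tuned, or handed new data. The data and steps are stand-ins rather than the full pipeline described above.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size = (300, 4))
X[rng.random(X.shape) < 0.05] = np.nan          # sprinkle in some missing values
y = (rng.random(300) > 0.5).astype(int)         # arbitrary binary target

pipeline = Pipeline([
    ('impute', SimpleImputer(strategy = 'median')),
    ('scale',  StandardScaler()),
    ('model',  LogisticRegression()),
])

pipeline.fit(X, y)
print(pipeline.predict(X[:5]))   # new data would go through the same steps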

        -
        -

        7.9 Wrapping Up

        +
        +

        8.9 Wrapping Up

When machine learning began to take off, it seemed many in the field of statistics rested on their laurels, and often scoffed at these techniques that didn’t bother to test their assumptions11! ML was, after all, mostly just a rehash of statistics, right? But the machine learning community, which actually comprised both computer scientists and statisticians, was able to make great strides in predictive performance, and the application of machine learning in myriad domains continues to enable us to push the boundaries of what is possible. Statistical analysis wasn’t going to provide ChatGPT or self-driving cars, but it remains vitally important whenever we need to understand the uncertainty of our predictions, or when we need to make inferences about the data world. Eventually, the more general field of data science became the way people use traditional statistical analysis and machine learning to solve their data challenges. The best data scientists will be able to draw from both, use the best tool for the job, and as importantly, have fun with modeling!

        -
        -

        7.9.1 The common thread

        +
        +

        8.9.1 The common thread

If using a model like the lasso or ridge regression, machine learning is simply a different focus for modeling compared to what we see in traditional linear modeling contexts. You could even still do standard interpretation and statistical inference on the coefficient output. However, in traditional statistical applications of linear models, we rarely see cross-validation or hyperparameter tuning. It does occur in some contexts though, and definitely should be more common.

        As we will see though, the generality of machine learning’s approach allows us to use a wider variety of models than in standard linear model settings, and incorporates those that are not easily summarized from a statistical standpoint, such as boosting and deep learning models. The key is that any model, from linear regression to deep learning, can be used with the tools of machine learning.

        -
        -

        7.9.2 Choose your own adventure

        +
        +

        8.9.2 Choose your own adventure

        At this point you’re ready to dive in and run some common models used in machine learning for tabular data, so head to Chapter 8!

        -
        -

        7.9.3 Additional resources

        +
        +

        8.9.3 Additional resources

        If looking for a deeper dive into some of these topics, here are some resources to consider:

• A core ML text is The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman (2017)), which paved the way for modern ML.
@@ -3266,8 +3283,8 @@

        -
        -

        7.10 Exercise

        +
        +

        8.10 Exercise

        We did not run the pipeline above and think that doing so would be a good way for you to put your new skills to the test.

        1. Start by using the non-standardized features from the movie_reviews dataset.
@@ -3817,12 +3834,12 @@

diff --git a/docs/matrix_operations.html b/docs/matrix_operations.html
index ab0c28d..71d7028 100644
--- a/docs/matrix_operations.html
+++ b/docs/matrix_operations.html
@@ -7,7 +7,7 @@
-Appendix B — Matrix Operations – [Models Demystified]{.smallcaps}
+Appendix A — Matrix Operations – [Models Demystified]{.smallcaps}
@@ -1071,13 +1088,13 @@

          -

          8.5 Penalized Linear Models

          -

          So let’s get on with some models already! Let’s use the classic linear model as our starting point for ML. We show explicitly how to estimate models like lasso and ridge regression in Section 4.8. Those work well as a baseline, and so should be in your ML modeling toolbox.

          -
          -

          8.5.1 Elastic Net

          -

          Another common linear model approach is elastic net, which we also saw in Chapter 7. It combines two techniques: lasso and ridge regression. We will not show how to estimate elastic net by hand here, but all you have to know is that it combines the two penalties- one for lasso and one for ridge, along with a standard objective function for a numeric or categorical target. The relative proportion of the two penalties is controlled by a mixing parameter, and the optimal value for it is determined by cross-validation. So for example, you might end up with a 75% lasso penalty and 25% ridge penalty. In the end though, we’re just going to do a slightly fancier logistic regression!

          -

          Let’s apply this to the heart disease data. We are only doing simple cross-validation here to get a better performance assessment, but you are more than welcome to tune both the penalty parameter and the mixing ratio as we have demonstrated before (Section 7.7). We’ll revisit hyperparameter tuning towards the end of this chapter.

          +
          +

          9.5 Penalized Linear Models

          +

          So let’s get on with some models already! Let’s use the classic linear model as our starting point for ML. We show explicitly how to estimate models like lasso and ridge regression in Section 5.8. Those work well as a baseline, and so should be in your ML modeling toolbox.

          +
          +

          9.5.1 Elastic Net

          +

Another common linear model approach is elastic net, which we also saw in Chapter 8. It combines two techniques: lasso and ridge regression. We will not show how to estimate elastic net by hand here, but all you have to know is that it combines the two penalties, one for lasso and one for ridge, along with a standard objective function for a numeric or categorical target. The relative proportion of the two penalties is controlled by a mixing parameter, and the optimal value for it is determined by cross-validation. So for example, you might end up with a 75% lasso penalty and 25% ridge penalty. In the end though, we’re just going to do a slightly fancier logistic regression!

          +

          Let’s apply this to the heart disease data. We are only doing simple cross-validation here to get a better performance assessment, but you are more than welcome to tune both the penalty parameter and the mixing ratio as we have demonstrated before (Section 8.7). We’ll revisit hyperparameter tuning towards the end of this chapter.
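As a rough sketch of what that looks like with scikit-learn, where simulated data stands in for the heart disease target and the mixing ratio is fixed rather than tuned:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size = (400, 10))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size = 400) > 0).astype(int)

model_elastic = LogisticRegression(
    penalty  = 'elasticnet',
    solver   = 'saga',       # the solver that supports the elastic net penalty
    l1_ratio = 0.75,         # e.g., 75% lasso / 25% ridge
    C        = 1.0,          # overall (inverse) penalty strength
    max_iter = 5000,
)

scores = cross_val_score(model_elastic, X, y, cv = 5, scoring = 'accuracy')
print('CV accuracy per fold:', np.round(scores, 3), 'mean:', round(scores.mean(), 3))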

          @@ -1145,8 +1162,8 @@

          -

          8.5.2 Strengths & weaknesses

          +
          +

          9.5.2 Strengths & weaknesses

          Strengths

          • Intuitive approach. In the end, it’s still just a standard regression model you’re already familiar with.
@@ -1160,13 +1177,13 @@

            -

            8.5.3 Additional thoughts

            +
            +

            9.5.3 Additional thoughts

Using penalized regression is a very good default method in the tabular data setting, and is something to strongly consider for more interpretative model settings like determining causal effects. These approaches predict better on new data than their standard, non-regularized counterparts, so they provide a nice balance between interpretability and predictive power. However, in general they are not going to be as strong of a method as others typically used in the machine learning world, and may not even be competitive without a lot of feature engineering. If prediction is all you care about, you’ll likely want to try something else.

          -
          -

          8.6 Tree-based Models

          +
          +

          9.6 Tree-based Models

Let’s move beyond standard linear models and get into a notably different type of approach. Tree-based methods are a class of models that are very popular in machine learning, and for good reason: they work very well. To get a sense of how they work, consider the following classification example where we want to predict a binary target as ‘Yes’ or ‘No’.

          @@ -1191,8 +1208,8 @@

          For these models, the number of trees and learning rate play off of each other. Having more trees allows for a smaller rate3, which might improve the model but will take longer to train. However, it can lead to overfitting if other steps are not taken.

          The depth of each tree refers to how many levels we allow the model to branch out, and is a crucial parameter. It controls the complexity of each tree, and thus the complexity of the overall model- less depth helps to avoid overfitting, but if the depth is too shallow, you won’t be able to capture the nuances of the data. The minimum number of observations in each leaf is also important for similar reasons.

          It’s also generally a good idea to take a random sample of features for each tree (or possibly even each branch), to also help reduce overfitting, but it’s not obvious what proportion to take. The regularization parameters are typically less important in practice, but help reduce overfitting as in other modeling circumstances. As with hyperparameters in other model settings, you’ll use something like cross-validation to settle on final values.

          -
          -

          8.6.1 Example with LightGBM

          +
          +

          9.6.1 Example with LightGBM

          Here is an example of gradient boosting with the heart disease data. Although boosting methods are available in scikit-learn for Python, in general we recommend using the lightgbm or xgboost packages directly for boosting implementation, which have a sklearn API anyway (as demonstrated). Also, they both provide R and Python implementations of the package, making it easy to not lose your place when switching between languages. We’ll use lightgbm here, but xgboost is also a very good option 4.
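As a hedged sketch of the lightgbm scikit-learn API, with simulated stand-in data and hyperparameter values chosen for illustration rather than tuned:

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size = (500, 10))
y = (X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(size = 500) > 0).astype(int)

model_boost = LGBMClassifier(
    n_estimators      = 500,    # number of trees
    learning_rate     = 0.05,   # smaller rate paired with more trees
    max_depth         = 3,      # depth of each tree
    min_child_samples = 10,     # minimum observations per leaf
    colsample_bytree  = 0.8,    # random sample of features for each tree
    reg_lambda        = 1.0,    # L2 regularization
)

scores = cross_val_score(model_boost, X, y, cv = 5, scoring = 'accuracy')
print('CV accuracy:', round(scores.mean(), 3))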

          @@ -1273,8 +1290,8 @@

          -

          8.6.2 Strengths & weaknesses

          +
          +

          9.6.2 Strengths & weaknesses

Random forests and boosting methods, though not new, are still ‘state of the art’ in terms of performance on tabular data like the type we’ve been using for our demos here. As of this writing, you’ll find that it will usually take considerable effort to beat them, though many have tried, particularly with deep learning models.

          Strengths

            @@ -1291,26 +1308,26 @@

            -

            8.7 Deep Learning and Neural Networks

            +
            +

            9.7 Deep Learning and Neural Networks

-Figure 8.1: A neural network
+Figure 9.1: A neural network

            Deep learning has fundamentally transformed the world of data science, and the world itself. It has been used to solve problems in image detection, speech recognition, natural language processing, and more, from assisting with cancer diagnosis to summarizing entire novels. As of now, it is not a panacea for every problem, and is not always the best tool for the job, but it is an approach that should be in your toolbox. Here we’ll provide a brief overview of the key concepts behind neural networks, the underlying approach to deep learning, and then demonstrate how to implement a simple neural network to get things started.

            -
            -

            8.7.1 What is a neural network?

            +
            +

            9.7.1 What is a neural network?

Neural networks form the basis of deep learning models. They have actually been around a while, computationally and conceptually going back decades67. Like other models, they are computational tools that help us understand how to get outputs from inputs. However, they weren’t quickly adopted due to computing limitations, similar to the slow adoption of Bayesian methods. But neural networks, or deep learning more generally, have now become the go-to method for many problems.

            -
            -

            8.7.2 How do they work?

            -

            At its core, a neural network can be seen as a series of matrix multiplications and other operations to produce combinations of features, and ultimately a desired output. We’ve been talking about inputs and outputs since the beginning (Section 2.3.2), but neural networks like to put a lot more in between the inputs and outputs than we’ve seen with other models. However, the core operations are often no different than what we’ve done with a basic linear model, and sometimes even simpler! But the combinations of features they produce can represent many aspects of the data that are not easily captured by simpler models.

            +
            +

            9.7.2 How do they work?

            +

            At its core, a neural network can be seen as a series of matrix multiplications and other operations to produce combinations of features, and ultimately a desired output. We’ve been talking about inputs and outputs since the beginning (Section 2.3), but neural networks like to put a lot more in between the inputs and outputs than we’ve seen with other models. However, the core operations are often no different than what we’ve done with a basic linear model, and sometimes even simpler! But the combinations of features they produce can represent many aspects of the data that are not easily captured by simpler models.

One notable difference from models we’ve been seeing is that neural networks implement multiple combinations of features, where each combination is referred to as a hidden node or unit8. In a neural network, each feature has a weight, just like in a linear model. These features are multiplied by their weights and then added together. But we actually create multiple such combinations, as depicted in the ‘H’ or ‘hidden’ nodes in the following visualization.

            @@ -1318,11 +1335,11 @@

-Figure 8.2: The first hidden layer
+Figure 9.2: The first hidden layer

          -

          The next phase is where things can get more interesting. We take those hidden units and add in nonlinear transformations before moving deeper into the network. The transformations applied are typically referred to as activation functions9. So, the output of the current (typically linear) part is transformed in a way that allows the model to incorporate nonlinearities. While this might sound new, this is just like how we use link functions in generalized linear models (Section 5.2). Furthermore, these multiple combinations also allow us to incorporate interactions between features.

          +

          The next phase is where things can get more interesting. We take those hidden units and add in nonlinear transformations before moving deeper into the network. The transformations applied are typically referred to as activation functions9. So, the output of the current (typically linear) part is transformed in a way that allows the model to incorporate nonlinearities. While this might sound new, this is just like how we use link functions in generalized linear models (Section 6.2). Furthermore, these multiple combinations also allow us to incorporate interactions between features.

But we can go even further! We can add more layers, and more nodes in each layer, to create a deep neural network. We can also add components specific to certain types of processing, have some parts connected only to certain other parts, and more. The complexity really is only limited by our imagination, and computational power! This is what helps make neural networks so powerful - given enough nodes and layers they can potentially approximate any function. Ultimately though, the feature inputs become one or more outputs that can then be assessed in much the same way as for other models.
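To make the hidden layers and activations described above concrete, here is a minimal sketch of a forward pass in plain NumPy. The dimensions, random weights, and the ReLU activation are assumptions for illustration only (and biases are omitted); this is not the model fit later in the chapter.

import numpy as np

rng = np.random.default_rng(42)

X = rng.normal(size=(100, 4))      # 100 observations, 4 features
W1 = rng.normal(size=(4, 8))       # weights for 8 hidden nodes
W2 = rng.normal(size=(8, 8))       # weights for a second hidden layer
w_out = rng.normal(size=(8, 1))    # weights mapping to a single output

def relu(z):
    return np.maximum(z, 0)        # a common activation function

H1 = relu(X @ W1)                  # linear combinations of features, then a nonlinearity
H2 = relu(H1 @ W2)                 # same operation, one layer deeper
output = H2 @ w_out                # final combination; shape (100, 1)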

          @@ -1330,7 +1347,7 @@

          -Figure 8.3: A more complex neural network +Figure 9.3: A more complex neural network
          @@ -1338,14 +1355,14 @@

          -
          -

          8.7.3 Trying it out

          +
          +

          9.7.3 Trying it out

          For simplicity we’ll use similar tools as before. Our model is a multi-layer perceptron (MLP), which is a model like the one we’ve been depicting. It consists of multiple hidden layers of varying sizes, and we can incorporate activation functions as we see fit.

          Do know this would be considered a bare minimum approach for a neural network, and generally you’d need to do more. To begin with, you’d want to tune the architecture, or structure of hidden layers. For example, you might want to try more layers, as well as ‘wider’ layers, or more nodes per layer. Also, as noted in the data discussion, we’d usually want to use embeddings for categorical features as opposed to the one-hot approach used here (Section 10.2.2)11.

          For our example, we’ll use the data with one-hot encoded features. For our architecture, we’ll use three hidden layers with 200 nodes each. As noted, these and other settings are hyperparameters that you’d normally prefer to tune.
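As a rough sketch of what this might look like in code, here is scikit-learn’s MLPClassifier with three hidden layers of 200 nodes each. The use of scikit-learn, the variable names X_train and y_train, and the remaining settings are assumptions for illustration rather than the exact setup behind the chapter’s results.

from sklearn.neural_network import MLPClassifier

# assumed: X_train holds the one-hot encoded features, y_train the target
model_mlp = MLPClassifier(
    hidden_layer_sizes=(200, 200, 200),  # three hidden layers, 200 nodes each
    activation='relu',
    random_state=42,
)
model_mlp.fit(X_train, y_train)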

          @@ -1472,8 +1489,8 @@

          -

          8.7.4 Strengths & weaknesses

          +
          +

          9.7.4 Strengths & weaknesses

          Strengths

          • Good prediction generally.
          • @@ -1488,9 +1505,9 @@

            -

            8.8 A Tuned Example

            -

            We noted in the chapter on machine learning concepts that there are often multiple hyperparameters we are concerned with for a given model (Section 7.7). We had hyperparameters for each of the models in this chapter also. For the elastic net model, we might want to tune the penalty parameters and the mixing ratio. For the boosting method, we might want to tune the number of trees, the learning rate, the maximum depth of each tree, the minimum number of observations in each leaf, and the number of features to consider at each tree/split. And for the neural network, we might want to tune the number of hidden layers, the number of nodes in each layer, the learning rate, the batch size, the number of epochs, and the activation function. There is plenty to explore!

            +
            +

            9.8 A Tuned Example

            +

            We noted in the chapter on machine learning concepts that there are often multiple hyperparameters we are concerned with for a given model (Section 8.7). We had hyperparameters for each of the models in this chapter also. For the elastic net model, we might want to tune the penalty parameters and the mixing ratio. For the boosting method, we might want to tune the number of trees, the learning rate, the maximum depth of each tree, the minimum number of observations in each leaf, and the number of features to consider at each tree/split. And for the neural network, we might want to tune the number of hidden layers, the number of nodes in each layer, the learning rate, the batch size, the number of epochs, and the activation function. There is plenty to explore!

            Here is an example using the boosting model. We’ll tune the number of trees, the learning rate, the minimum number of observations in each leaf, and the maximum depth of each tree. We’ll use a randomized search across the parameter space to sample from the set of hyperparameters, rather than searching every possible combination as in a grid search. This is a good approach when you have a lot of hyperparameters to tune, and/or when you have a lot of data.
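A sketch of that randomized search, using scikit-learn’s RandomizedSearchCV around a lightgbm classifier, might look like the following. The parameter ranges, data names, number of iterations, and scoring metric are all assumptions for illustration, not the book’s exact settings.

from lightgbm import LGBMClassifier
from scipy.stats import randint, loguniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators':      randint(100, 1000),    # number of trees
    'learning_rate':     loguniform(1e-3, 0.3),
    'min_child_samples': randint(5, 50),         # minimum observations per leaf
    'max_depth':         randint(2, 10),
}

search = RandomizedSearchCV(
    LGBMClassifier(),
    param_distributions=param_distributions,
    n_iter=25,              # sample 25 settings rather than a full grid
    cv=5,
    scoring='roc_auc',
    random_state=42,
)
search.fit(X_train, y_train)   # assumed training data
print(search.best_params_)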

            @@ -1584,8 +1601,8 @@

            <

            Looks like we’ve done a lot better than guessing. Even if we don’t do better than our previous model, we should feel better that we’ve done our due diligence in trying to find the best set of underlying parameters, rather than just going with defaults or what seems to work best.

            -
            -

            8.9 Comparing Models

            +
            +

            9.9 Comparing Models

            We can tune all the models and compare them head to head. We first split the same data into training and test sets (20% test). Then with training data, we tuned each model over different settings:

            • Elastic net: penalty and mixing ratio
            • @@ -1601,7 +1618,7 @@

              -Figure 8.5: Cross-validation results for tuned models. +Figure 9.5: Cross-validation results for tuned models.
              @@ -1611,7 +1628,7 @@

              -Table 8.2 +Table 9.2
              @@ -1624,30 +1641,30 @@

              -Table 8.3: Metrics for tuned models on holdout data. +Table 9.3: Metrics for tuned models on holdout data.
              -
              +
              @@ -2133,11 +2150,11 @@

            -
            -

            8.10 Interpretation

            +
            +

            9.10 Interpretation

When it comes to machine learning, many of the models we use don’t have an easy interpretation in the way that coefficients in a linear model do. However, that doesn’t mean we can’t still figure out what’s going on. Let’s use the boosting model as an example.

            -
            -

            8.10.1 Feature Importance

            +
            +

            9.10.1 Feature Importance

            The default importance metric for a lightgbm model is the number of splits in which a feature is used across trees, and this will depend a lot on the chosen parameters of the best model. But there are other ways to think about what importance means that will be specific to a model, data setting, and ultimate goal of the modeling process. For this data and the model, depending on the settings, you might see that the most important features are age, cholesterol, and max heart rate.
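As a sketch, assuming the tuned model is the scikit-learn style LGBMClassifier (here called model_boost) and the training features are in a data frame X_train, the split-based importances can be pulled out like this:

import pandas as pd

importances = pd.Series(
    model_boost.feature_importances_,   # default importance_type is 'split'
    index=X_train.columns,
)
print(importances.sort_values(ascending=False).head(4))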

            @@ -2168,30 +2185,30 @@

            -Table 8.4: Top 4 features from a tuned LGBM model. +Table 9.4: Top 4 features from a tuned LGBM model.
            -
            +
            @@ -2641,7 +2658,7 @@

            Section 3.3.6) to see the effects of cholesterol and being male. From this we can see that males are expected to have a higher probability of heart disease, and that cholesterol has a positive relationship with heart disease, though this occurs mostly after midpoint for cholesterol (shown by vertical line). The plot shown is a prettier version of what you’d get with the following code, but the model predictions are the same.

            +

            Now let’s think about a visual display to aid our understanding. Here we show a partial dependence plot (Section 4.3.6) to see the effects of cholesterol and being male. From this we can see that males are expected to have a higher probability of heart disease, and that cholesterol has a positive relationship with heart disease, though this occurs mostly after midpoint for cholesterol (shown by vertical line). The plot shown is a prettier version of what you’d get with the following code, but the model predictions are the same.
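The code itself isn’t shown in this hunk. As a rough stand-in, a basic partial dependence plot can be produced with scikit-learn’s PartialDependenceDisplay; the model and data names, and the feature names 'cholesterol' and 'male', are assumptions for illustration rather than the book’s actual code.

from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    model_boost,                        # assumed fitted boosting model
    X_train,                            # assumed training features
    features=['cholesterol', 'male'],   # assumed column names
)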

            @@ -2687,15 +2704,15 @@

            -Figure 8.6: Partial dependence plot for cholesterol +Figure 9.6: Partial dependence plot for cholesterol

            -
            -

            8.11 Other ML Models for Tabular Data

            +
            +

            9.11 Other ML Models for Tabular Data

            When you research classical machine learning models for the kind of data we’ve been exploring, you’ll find a variety of methods. Popular approaches from the past include k-nearest neighbors regression, principal components regression, support vector machines (SVM), and more. You don’t see these used in practice as much though for several reasons:

            • Some, like k-nearest neighbors regression, generally don’t predict as well as other models.
            • @@ -2707,8 +2724,8 @@

              Chapter 9). As of this writing, the main research effort for new models for tabular data regards deep learning methods like large language models (LLMs). While typically used for text data, they can be adapted for tabular data as well. They are very powerful, but also computationally expensive. The issue is primarily whether a model can be devised that can consistently beat boosting and other approaches, and while it hasn’t happened yet, there is a good chance it will in the near future. For now, the best approach is to use the best model that works for your data, and to be open to new methods as they come along.

            -
            -

            8.12 Wrapping Up

            +
            +

            9.12 Wrapping Up

In this chapter we’ve covered a few common models that you can implement with much success in machine learning. You don’t really need much beyond these for tabular data unless your unique data situation somehow requires it. But a couple of things are worth mentioning before moving on…

            Feature engineering will typically pay off more in performance than the model choice.

            @@ -2720,17 +2737,17 @@

            The best model is simply the one that works best for your situation.

You’ll always get more payoff from coming up with better features for the model, and from using better data that’s been ‘fixed’ through good exploratory data analysis. Thinking harder about the problem means you’ll waste less time going down dead ends, and thinking more clearly about the question at hand typically leads you to better data for answering it. And finally, it’s good not to get stuck on one model, and to be willing to use whatever it takes to get things done efficiently.

            -
            -

            8.12.1 The common thread

            +
            +

            9.12.1 The common thread

When it comes to machine learning, you can use any model you like, including standard statistical models like those we’ve covered elsewhere. Both boosting and neural networks, like GAMs and related techniques, can be put under the common heading of basis function models. GAMs with certain types of smooth functions are approximations of Gaussian processes, and Gaussian processes are equivalent to a neural network with an infinitely wide hidden layer (Neal (1996)). Even the most complicated deep learning model typically has components that involve the same kinds of feature combinations and transformations that we use in far simpler models.

            -
            -

            8.12.2 Choose your own adventure

            +
            +

            9.12.2 Choose your own adventure

            If you haven’t had much exposure to statistical approaches we suggest heading to any chapter of Part I. Otherwise, consider an overview of more machine learning techniques (Chapter 9), data (Chapter 10), or causal modeling (Chapter 11).

            -
            -

            8.12.3 Additional resources

            -

            Additional resources include those mentioned in Section 7.9.3, but here are some more to consider:

            +
            +

            9.12.3 Additional resources

            +

            Additional resources include those mentioned in Section 8.9.3, but here are some more to consider:

• Google’s Decision Forests course
            • Interpretable ML (Molnar (2023))
            • @@ -2746,8 +2763,8 @@

              -

              8.13 Exercise

              +
              +

              9.13 Exercise

              Tune a model of your choice to predict whether a movie is good or bad with the movie review data. Use the categorical target, and use one-hot encoded features if needed. Make sure you use a good baseline model for comparison!

              @@ -3227,12 +3244,12 @@

diff --git a/docs/ml_common_models_files/figure-html/fig-benchmark-1.png b/docs/ml_common_models_files/figure-html/fig-benchmark-1.png deleted file mode 100644 index 30d47c7..0000000 Binary files a/docs/ml_common_models_files/figure-html/fig-benchmark-1.png and /dev/null differ

diff --git a/docs/ml_common_models_files/figure-html/pdp-r-plot-1.png b/docs/ml_common_models_files/figure-html/pdp-r-plot-1.png deleted file mode 100644 index e883944..0000000 Binary files a/docs/ml_common_models_files/figure-html/pdp-r-plot-1.png and /dev/null differ

diff --git a/docs/ml_more.html b/docs/ml_more.html index b8bb7a2..52fc177 100644 --- a/docs/ml_more.html +++ b/docs/ml_more.html @@ -7,7 +7,7 @@

-9  More Machine Learning – [Models Demystified]{.smallcaps} +10  More Machine Learning – [Models Demystified]{.smallcaps}

              2  Thinking About Models


Before we get to models and how they work, let’s think more about what we mean when talking about them. As we’ll see, there are different ways we can express models and ultimately use them, so let’s start by understanding what a model is and what it can do for us.


              2.1 What is a Model?


              At its core, a model is just an idea. It’s a way of thinking about the world, about how things work, how things change over time, how things are different from each other, and how they are similar. The underlying thread is that a model expresses relationships about things in the world around us. One can also think of a model as a tool, one that allows us to take in information, derive some meaning from it, and act on it in some way. Just like other ideas and tools, models have consequences in the real world, and they can be used wisely or foolishly.


              2.2 What Goes into a Model? What Comes Out?


              2.2.1 Features and targets


              In the context of a model, how we specify the nature of the relationship between various things depends on the context. In the interest of generality, we’ll refer to the target as what we want to explain, and features as those aspects of the data we will use to explain it. Because people come at data from a variety of contexts, they often use different terminology to mean the same thing. The table below shows some of the common terms used to refer to features and targets. Note that they can be mixed and matched, e.g. someone might refer to covariates and a response, or inputs and a label.

Table 2.1: Common Terms for Features and Targets

Feature                  Target
independent variable     dependent variable
predictor variable       response
explanatory variable     outcome
covariate                label
x                        y
input                    output
right-hand side          left-hand side

Some of these terms actually suggest a particular type of relationship (e.g., a causal relationship, an experimental setting), but here we’ll typically avoid those terms if we can, since those connotations may not apply to most situations. In the end you may find us using any of these words to describe the relationships of interest so that you are comfortable with the terminology, but for the most part we’ll stick with features and targets. In our opinion, these terms carry the fewest hidden assumptions and implications, and just imply ‘features of the data’ and the ‘target’ we’re trying to explain or predict.


    2.3 Expressing relationships


    As noted, a model is a way of expressing a relationship between a set of features and a target, and one way of thinking about this is in terms of inputs and outputs. But how can we go from input to output?


    Well, first off, we assume that the features and target are correlated, that there is some relationship between the feature x and target y. The output of a model will correspond to the target if they are correlated, and more closely match it with stronger correlation. If so, then we can ultimately use the features to predict the target. In the simplest setting, a correlation implies a relationship where x and y typically move up and down together (left plot) or they move in opposite directions where x goes up and y goes down (right plot).

Figure 2.1: Correlation

    Even with multiple features, or nonlinear feature-target relationships, where things are more difficult to interpret, we can stick to this general notion of correlation, or simply association, to help us understand how the features account for the target’s variability, or why it behaves the way it does.
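As a small illustration of these two patterns (simulated data, assuming NumPy), we can generate a positively and a negatively correlated pair and compute the correlations directly:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)

y_pos =  0.7 * x + rng.normal(size=500)   # x and y move up and down together
y_neg = -0.7 * x + rng.normal(size=500)   # x goes up, y goes down

print(np.corrcoef(x, y_pos)[0, 1])        # roughly  .5 to .6
print(np.corrcoef(x, y_neg)[0, 1])        # roughly -.5 to -.6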


    2.3.1 A mathematical expression of an idea


    Models are expressed through a particular language, math, but don’t let that worry you if you’re not so inclined. As a model is still just an idea at its core, the idea is the most important thing to understand about it. The math is just a formal way of expressing the idea in a manner that can be communicated and understood by others in a standard way, and math can help make the idea precise. Here is a generic model formula expressed in math:

\(y = f(X) + u\)

A generic model

    In words, this equation says we are trying to explain something \(y\), as a function \(f()\) of other things \(X\), but there is typically some aspect we don’t explain \(u\). This is the basic form of a model, and it’s essentially the same for linear regression, logistic regression, and even random forests and neural networks.


    But in everyday terms, we’re trying to understand everyday things, like how the amount of sleep relates to cognitive functioning, how the weather affects the number of people who visit a park, how much money to spend on advertising to increase sales, how to detect fraud, and so on. Any of these could form the basis of a model, as they stem from scientifically testable ideas, and they all express relationships between things we are interested in, possibly even with an implication of causal relations.


    2.3.2 Expressing models visually


Often it is useful to express models visually, as it can help us understand the relationships between things more easily. For example, we can express the relationship between a feature and target as in the previous Figure 2.1. A more formal way is with a graphical model, and the following is a generic representation of a linear model.

Figure 2.2: A linear model

This makes clear there is an output from the model that is created from the inputs (X). The ‘w’ values are weights, which can be different for each input, and the output is the weighted combination of these inputs. As we’ll see later, we’ll want to find a way to create the best correspondence between the outputs of the model and the target, which is the essence of fitting a model.
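As a tiny sketch of that weighted combination (made-up numbers, purely to mirror the figure):

import numpy as np

x = np.array([1.0, 2.0, 0.5])    # three inputs
w = np.array([0.2, -0.1, 0.7])   # one weight per input

output = x @ w                   # weighted combination of the inputs
print(output)                    # 0.2 - 0.2 + 0.35 = 0.35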


    2.3.3 Expressing models in code


    Applying models to data can be simple. For example, if you wanted to create a linear model to understand the relationship between sleep and cognitive functioning, you might express it in code as follows.

R:

lm(cognitive_functioning ~ sleep, data = df)

Python:

from statsmodels.formula.api import ols

model = ols('cognitive_functioning ~ sleep', data = df).fit()

The first part with the ~ is the model formula, which is how math comes into play to help us express relationships. Beyond that, we just specify where, for example, the numeric values for cognitive functioning and the amount of sleep are to be located. In this case, they are found in the same data frame called df, which may have been imported from a spreadsheet somewhere. Very easy, isn’t it? But that’s all it takes to express a straightforward idea. More conceptually, we’re saying that cognitive functioning is a linear function of sleep. You can probably already guess why R’s function is lm, and you’ll eventually also learn why statsmodels’ function is ols, but for now just know that both are doing the same thing.


    2.3.4 Models as implementations


In practice, models are implemented in a variety of ways, and the code above is just one way to express a model. For example, a linear model can be specified differently depending on the tool used, for instance as a simple linear regression, a penalized regression, or a mixed model. When we think of models as a specific implementation, we are thinking of something like glm or lmer in R, or LinearRegression or XGBoostClassifier in Python. It is with these functions that we will specify the formula, or the inputs and target, in some fashion. Afterwards, or in conjunction with this specification, we will fit the model to the data, which is the process of finding the best way to map the feature inputs to the target.
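For instance, a sketch of this implementation-centric view using scikit-learn, where X and y are assumed to already hold the features and target:

from sklearn.linear_model import LinearRegression

model = LinearRegression()       # choose a specific implementation
model.fit(X, y)                  # fit it: map the feature inputs to the target
predictions = model.predict(X)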


    2.4 Some Clarifications


You will sometimes see models referred to as a specific statistic, a specific aspect of the model, or a specific algorithm. This is often a source of confusion for those early on in their data science journey, because the terms don’t really refer to what the model represents. For example, a t-test is a statistical result, not a model in and of itself. Similarly, some refer to a ‘logit model’ or ‘probit model’, but these are link functions used in fitting what is in fact the same model. A ‘classifier’ tells you the task of the model, but not what the model is. ‘OLS’ is an estimation technique used for many types of models, not just a name for a linear regression model. And machine learning can potentially be used to fit any model; it is not a specific collection of models.


    All this is to say that it’s good to be clear about the model, and to try to keep it distinguished from specific aspects or implementations of it. Sometimes the nomenclature can’t help but get a little fuzzy, and that’s okay. Again though, at the core of a model is the idea that specifies the relationship between the features and target.


    2.5 Getting Ready for More


The goal of this book is to help you understand models in a practical way that makes clear what we’re trying to understand, but also how models produce the results we’re so interested in. We’ll be using a variety of models to help you understand the relationships between features and targets, how to use models to make predictions, and how to interpret the results. We’ll also show you how models are estimated, how to evaluate them, and how to choose the right one for the job. We hope you’ll come away with a better understanding of how models work, and how to use them in your own projects. So let’s get started!

\ No newline at end of file
diff --git a/docs/models_files/figure-html/fig-corr-plot-1.png b/docs/models_files/figure-html/fig-corr-plot-1.png new file mode 100644 index 0000000..ba67785 Binary files /dev/null and b/docs/models_files/figure-html/fig-corr-plot-1.png differ

diff --git a/docs/more_models.html b/docs/more_models.html index 803dde0..c368f97 100644 --- a/docs/more_models.html +++ b/docs/more_models.html @@ -7,7 +7,7 @@

-Appendix D — More Models – [Models Demystified]{.smallcaps} +Appendix C — More Models – [Models Demystified]{.smallcaps}