diff --git a/docs/causal.html b/docs/causal.html
index d16dda3..2afe79a 100644
--- a/docs/causal.html
+++ b/docs/causal.html
@@ -411,7 +411,6 @@

12.1 Key Ideas

@@ -431,20 +430,16 @@

12.1.2 Helpful context

This section is pretty high level, and we are not going to go into much detail here, so even some understanding of correlation and modeling would likely be enough.

-
-
- +
Figure 12.1: A Causal DAG
-
-
@@ -690,7 +685,7 @@

Meta-learners are used in machine learning contexts to assess potentially causal relationships between some treatment and outcome. The core model can be any kind you might want to use; what makes the approach a meta-learner is the extra steps taken to assess the causal relationship. The most common types of meta-learners are:

  • S-learner - single model for both groups; predict the (counterfactual) difference between when all observations are treated versus when none are, similar to our previous code demo.
  • -
  • T-learner - two models, one for each of the control and treatment groups; predict the values as if all observations are treated vs when all are control using both models, and take the difference.
  • +
  • T-learner - two models, one for each of the control and treatment groups; predict the values as if all observations are ‘treated’ versus when all are ‘control’ using both models, and take the difference.
  • X-learner - a more complicated modification to the T-learner also using a multi-step approach.

Some additional variants of these models exist, and they can be used in a variety of settings, not just uplift modeling. The key idea is to use the model to predict the potential outcomes of the treatment, and then to take the difference between the two predictions as the causal effect.
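As a minimal sketch of the S- and T-learner steps listed above, the following uses simulated data and a small hand-rolled least squares helper as the base model. The data, the `fit_ols` and `predict` helpers, and the true effect of 2 are all assumptions for illustration; any model could stand in as the base learner.

```python
import random


def fit_ols(rows, y):
    """Least squares coefficients from the normal equations (Gauss-Jordan)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * p for v, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(coefs, row):
    return sum(c * v for c, v in zip(coefs, row))


# Simulated data (hypothetical): the true treatment effect is 2
random.seed(42)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
t = [random.randint(0, 1) for _ in range(n)]
y = [1 + 0.5 * xi + 2 * ti + random.gauss(0, 1) for xi, ti in zip(x, t)]

# S-learner: one model with treatment as a feature, then predict for
# everyone as treated versus everyone as control and average the difference
s_coefs = fit_ols([[1, xi, ti] for xi, ti in zip(x, t)], y)
s_effect = sum(
    predict(s_coefs, [1, xi, 1]) - predict(s_coefs, [1, xi, 0]) for xi in x
) / n

# T-learner: one model per group, both used to predict for all observations
treated = [(xi, yi) for xi, ti, yi in zip(x, t, y) if ti == 1]
control = [(xi, yi) for xi, ti, yi in zip(x, t, y) if ti == 0]
coefs_1 = fit_ols([[1, xi] for xi, _ in treated], [yi for _, yi in treated])
coefs_0 = fit_ols([[1, xi] for xi, _ in control], [yi for _, yi in control])
t_effect = sum(
    predict(coefs_1, [1, xi]) - predict(coefs_0, [1, xi]) for xi in x
) / n

print(round(s_effect, 2), round(t_effect, 2))
```

With a linear base model and no interaction, the S-learner estimate is just the treatment coefficient. The two approaches differ more when the base model is flexible or the effect varies across observations.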

@@ -743,7 +738,7 @@

8. It may very well be, maybe the target concerns the rate of survival, where any increase is worthwhile. Or perhaps the data circumstances demand such interpretation, because it is extremely costly to obtain more. For more exploratory efforts however, this sort of result would likely not be enough to come to any strong conclusion even if explanation is the only goal.

+

If we are concerned solely with explanation, we now would want to ask ourselves first if we can trust our result based on the data, model, and various issues that went into producing it. If so, we can then see if the effect is large enough to be of interest, and if the result is useful in making decisions8. It may very well be. Maybe the target concerns the rate of survival, where any increase is worthwhile. Or perhaps the data circumstances demand such interpretation, because it is extremely costly to obtain more. For more exploratory efforts, however, this sort of result would likely not be enough to come to any strong conclusion, even if explanation is the only goal.

As another example, consider the world happiness data we’ve used in previous demonstrations. We want to explain the association of country level characteristics and the population’s happiness. We likely aren’t going to be as interested in predicting next year’s happiness score, but rather what attributes are correlated with a happy populace in general. In this election year (2024) in the U.S., we’d be interested in specific factors related to presidential elections, of which there are relatively very few data points. In these cases, explanation is the focus, and we may not even need a model at all to come to our conclusions.

So we can understand that in some settings we may be more interested in understanding the underlying mechanisms of the data, as with these examples, and in others we may be more interested in predictive performance, as in our demonstrations of machine learning. However, the distinction between prediction and explanation in the end is a bit problematic, not the least of which is that we often want to do both.

Although it’s often implied as such, prediction is not just what we do with new data. It is the very means by which we get any explanation of effects via coefficients, marginal effects, visualizations, and other model results. Additionally, where the focus is on predictive performance, if we can’t explain the results we get, we will typically feel dissatisfied, and may still question how well the model is actually doing.

@@ -754,7 +749,7 @@

12.7 Wrapping Up

@@ -767,7 +762,7 @@

12.7.2 Choose your own adventure

-

From here you might revisit some of the previous models and think about how you might use them to answer a causal question. You might also look into some of the other models we’ve mentioned here, and see how they are used in practice via the additional resources below.

+

From here you might revisit some of the previous models and think about how you might use them to answer a causal question. You might also look into some of the other models we’ve mentioned here, and see how they are used in practice via the additional resources.

12.7.3 Additional resources

@@ -814,7 +809,7 @@

Your authors have to admit some bias here. We’ve spent a lot of our past dealing with SEMs, and almost every application we saw had too little data and too little generalization, and was grossly overfit. Many SEM programs even added multiple ways to overfit the data even further, and it is difficult to trust the results reported in many papers that used them. But that’s not the fault of SEM in general. It can be a useful tool when used correctly, and it can help answer causal questions, but it is not a magic bullet, and it doesn’t make anyone look fancier by using it.↩︎

  • This is basically the S-Learner approach to meta-learning, which we’ll discuss in a bit. It is generally too weak↩︎

  • The G-computation approach and S-learners are essentially the same approach, but came about from different domain contexts.↩︎

  • -
  • This is a contrived example, but it is definitely something what you might see in the wild. The relationship is weak, and though statistically significant, the model can’t predict the target well at all. The statistical power is actually decent in this case, roughly 70%, but this is mainly because the sample size is so large and it is a very simple model setting.
    This is a common issue in many academic fields, and it’s why we always need to be careful about how we interpret our models. In practice, we would generally need to consider other factors, such as the cost of a false positive or false negative, or the cost of the data and running the model itself, to determine if the model is worth using.↩︎

  • +
  • This is a contrived example, but it is definitely something that you might see in the wild. The relationship is weak, and though statistically significant, the model can’t predict the target well at all. The statistical power is actually decent in this case, roughly 70%, but this is mainly because the sample size is so large and it is a very simple model setting.
    This is a common issue in many academic fields, and it’s why we always need to be careful about how we interpret our models. In practice, we would generally need to consider other factors, such as the cost of a false positive or false negative, or the cost of the data and running the model itself, to determine if the model is worth using.↩︎

  • Gentle reminder that making an assumption does not mean the assumption is correct, or even provable.↩︎

diff --git a/docs/causal_files/figure-html/fig-causal-dag-1.png b/docs/causal_files/figure-html/fig-causal-dag-1.png
index cd55c7f..2c0bb8d 100644
Binary files a/docs/causal_files/figure-html/fig-causal-dag-1.png and b/docs/causal_files/figure-html/fig-causal-dag-1.png differ
diff --git a/docs/danger_zone.html b/docs/danger_zone.html
index 7a3512d..a0ff5d8 100644
--- a/docs/danger_zone.html
+++ b/docs/danger_zone.html
@@ -730,7 +730,7 @@

    14.4.4 Big data isn’t always as big as you think

    -

    Consider a model setting with 100,000 samples. Is this large? Let’s say you have a rare outcome that occurs 1% of the time. This means you have 1000 samples where outcome label you’re interested in occurs. Now consider a categorical feature (A) that has four categories, and one of those categories is relatively small, say 5% of the data, or 5000 cases, and you want to interact it with another categorical feature (B), one whose categories are all equally distributed. Assuming no particular correlation between the two, you’d be down to ~1% of the data for the least category of A across the levels of B. Now if there is an actual interaction, some of those interaction cells may have only a dozen or so positive target values. Odds are pretty good that you don’t have enough data to make a reliable estimate of the interaction effect.

    +

    Consider a model setting with 100,000 samples. Is this large? Let’s say you have a rare outcome that occurs 1% of the time. This means you have 1000 samples where the outcome label you’re interested in is present. Now consider a categorical feature (A) that has four categories, and one of those categories is relatively small, say 5% of the data, or 5000 cases, and you want to interact it with another categorical feature (B), one whose categories are all equally distributed. Assuming no particular correlation between the two, you’d be down to ~1% of the data for the least category of A across the levels of B. Now if there is an actual interaction effect on the target, some of those interaction cells may have only a dozen or so positive target values. Odds are pretty good that you don’t have enough data to make a reliable estimate of the interaction effect.

    Oh wait, did you want to use cross-validation also? A simple random sample approach might result in some validation sets with no positive values at all! Don’t forget that you may have already split your 100,000 samples into training and test sets, so you have even less data to start with! The following table shows the final cell count for a dataset with these properties.

    The point is that it’s easy to forget that large data can get small very quickly due to class imbalance, interactions, etc. There is not much you can do about this, but you should not be surprised when these situations are not very revealing in terms of your model results.
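The arithmetic behind this example can be sketched in a few lines. The number of levels for feature B, the test share, and the number of folds are not stated in the text, so the values used here are assumptions for illustration only.

```python
n_total = 100_000
outcome_rate = 0.01      # rare outcome occurring 1% of the time
small_cat_share = 0.05   # smallest category of feature A

# Assumptions (not given in the text)
n_levels_b = 4           # feature B has 4 equally sized categories
test_share = 0.2         # 20% of the data held out as a test set
n_folds = 5              # 5-fold cross-validation on the training data

positives = n_total * outcome_rate              # about 1,000 positive cases
small_cat_cases = n_total * small_cat_share     # about 5,000 cases
cell_cases = small_cat_cases / n_levels_b       # about 1,250 per A-by-B cell
cell_positives = cell_cases * outcome_rate      # about a dozen positives

# After a train/test split and cross-validation, the counts shrink again
train_cell_positives = cell_positives * (1 - test_share)
validation_fold_positives = train_cell_positives / n_folds

print(f"Cell share of data: {cell_cases / n_total:.2%}")
print(f"Expected positives per cell: {cell_positives:.1f}")
print(f"Expected positives per cell in training: {train_cell_positives:.1f}")
print(f"Expected positives per cell per validation fold: {validation_fold_positives:.1f}")
```

These are expected counts under no association between the features and the target. Actual cells will vary around them, which is how some validation folds end up with no positive cases at all.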

diff --git a/docs/data.html b/docs/data.html
index 65f0622..54d0950 100644
--- a/docs/data.html
+++ b/docs/data.html
@@ -464,7 +464,7 @@

    13.1.2 Helpful context

    -

    We’re talking very generally about data here, so not much background is needed. The models mentioned are covered in other chapters, or build upon those, but we’re not doing any actual modeling here.

    +

    We’re talking very generally about data here, so not much background is needed. The models mentioned here are covered in other chapters, or build upon those, but we’re not doing any actual modeling here.

    @@ -488,26 +488,26 @@

    -
    +
    @@ -1017,7 +1017,7 @@

    -

    Using a log transformation for numeric targets and features is straightforward, and comes with several benefits. For example, it can help with heteroscedasticity, which is when the variance of the target is not constant across the range of the predictions2 (demonstrated below). It can also help to keep predictions positive after transformation, allows for interpretability gains, and more. One issue with logging is that it is not a linear transformation, which can help capture nonlinear feature-target relationships, but can also make some post-modeling transformations more less straightforward. Also if you have a lot of zeros, ‘log plus one’ transformations are not going to be enough to help you overcome that hurdle3. Logging also won’t help much when the variables in question have few distinct values, like ordinal variables, which we’ll discuss later in Section 13.2.3.

    +

    Using a log transformation for numeric targets and features is straightforward, and comes with several benefits. For example, it can help with heteroscedasticity, which is when the variance of the target is not constant across the range of the predictions2 (demonstrated below). It can also help to keep predictions positive after transformation, allows for interpretability gains, and more. One issue with logging is that it is not a linear transformation, which can help capture nonlinear feature-target relationships, but can also make some post-modeling transformations less straightforward. Also if you have a lot of zeros, ‘log plus one’ transformations are not going to be enough to help you overcome that hurdle3. Logging also won’t help much when the variables in question have few distinct values, like ordinal variables, which we’ll discuss later in Section 13.2.3.
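One way to see why post-modeling transformations get less straightforward is to back-transform an average. The following is a small sketch with made-up values: the ‘log plus one’ transformation round-trips exactly for individual values, but exponentiating the mean of the logged values does not recover the mean on the original scale.

```python
import math

# Hypothetical skewed, non-negative values that include a zero
y = [0.0, 1.0, 3.0, 10.0, 250.0]

# 'Log plus one' transformation and its inverse
log_y = [math.log1p(v) for v in y]
round_trip = [math.expm1(v) for v in log_y]

# Averaging on the log scale and back-transforming is not the same
# as averaging on the original scale
mean_y = sum(y) / len(y)
back_transformed_mean = math.expm1(sum(log_y) / len(log_y))

print(f"Mean on the original scale: {mean_y:.2f}")
print(f"Back-transformed mean of logs: {back_transformed_mean:.2f}")
```

The back-transformed value is closer to a geometric mean and will sit below the arithmetic mean for skewed data, so predictions from a model with a logged target need care when reported on the original scale.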

    @@ -2162,7 +2162,7 @@

    13.2.3.3 Rank data

    -

    Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector of 1-6 with numeric labels could be something like [2, 1, 1, 3, 4, 2], whereas rank data would be [2, 1, 3, 4, 5, 6], each being unique (unless you allowed for ties). For example, in sports, a ranking problem would regard predicting the actual finish of the runners. Assuming you have a modeling tool that actually handles this situation, the objective will be different from other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or the simpler variant Bradley-Terry model). In machine learning, you might use so-called learning to rank methods, like the RankNet and LambdaRank algorithms, and other variants for deep learning models.

    +

    Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector of 1-6 with numeric labels could be something like [2, 1, 1, 3, 4, 2], whereas rank data would be [2, 1, 3, 4, 5, 6], each being unique (unless you allow for ties). For example, in sports, a ranking problem would regard predicting the actual finish of the runners. Assuming you have a modeling tool that actually handles this situation, the objective will be different from other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or the simpler variant Bradley-Terry model). In machine learning, you might use so-called learning to rank methods, like the RankNet and LambdaRank algorithms, and other variants for deep learning models.

    @@ -2925,7 +2925,7 @@

    13.10 Data Augmentation

    Data augmentation is a technique where you artificially increase the size of your dataset by creating new data points based on the existing data. This is a common technique in deep learning for computer vision, where you might rotate, flip, or crop images to create new training data. This can help improve the performance of your model, especially when you have a small dataset. Techniques are also available for text.

    In the tabular domain, data augmentation is less common, but still possible. You’ll see it most commonly applied with class-imbalance settings (Section 13.4), where you might create new data points for the minority class to balance the dataset. This can be done by randomly sampling from the existing data points, or by creating new data points based on the existing data points. For the latter, SMOTE and many variants of it are quite common.

    -

    Unfortunately for tabular data, these techniques are not nearly as successful as augmentation for computer vision or natural language processing, nor consistently so. Part of the issue is that tabular data is very noisy and fraught with measurement error, so in a sense, such techniques are just adding noise to the modeling process11. Downsampling the majority class can potentially throw away usefu information. Simple random upsampling of the minority class can potentially lead to an overconfident model that still doesn’t generalize well. In the end, the best approach is to get more and/or better data, but hopefully more successful methods will be developed in the future.

    +

    Unfortunately for tabular data, these techniques are not nearly as successful as augmentation for computer vision or natural language processing, nor consistently so. Part of the issue is that tabular data is very noisy and fraught with measurement error, so in a sense, such techniques are just adding noise to the modeling process11. Downsampling the majority class can potentially throw away useful information. Simple random upsampling of the minority class can potentially lead to an overconfident model that still doesn’t generalize well. In the end, the best approach is to get more and/or better data, but hopefully more successful methods will be developed in the future.
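As a minimal sketch of simple random upsampling, the following balances a made-up dataset by resampling the minority class with replacement. The data and the 5% minority share are assumptions for illustration. The resampled rows are exact copies, which is one way to see why this adds no new information.

```python
import random

random.seed(1)

# Hypothetical data: (feature, label) pairs with a 5% minority class
data = [(random.gauss(0, 1), 0) for _ in range(950)]
data += [(random.gauss(1, 1), 1) for _ in range(50)]

majority = [row for row in data if row[1] == 0]
minority = [row for row in data if row[1] == 1]

# Keep the original minority rows and draw extra copies with replacement
# until the two classes are the same size
n_extra = len(majority) - len(minority)
upsampled_minority = minority + random.choices(minority, k=n_extra)
balanced = majority + upsampled_minority

print(len(majority), len(upsampled_minority))
print("Distinct minority rows after upsampling:", len(set(upsampled_minority)))
```

The classes are balanced, but the number of distinct minority observations is unchanged, so a model can become very confident about a small set of repeated points.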

diff --git a/docs/dataset_descriptions.html b/docs/dataset_descriptions.html
index 429c260..a6d9dd9 100644
--- a/docs/dataset_descriptions.html
+++ b/docs/dataset_descriptions.html
@@ -365,26 +365,26 @@

    -
    +
    @@ -849,26 +849,26 @@

    -
    +
    @@ -1408,26 +1408,26 @@

    -
    +
    @@ -1956,26 +1956,26 @@

    -
    +
    @@ -2534,26 +2534,26 @@

    -
    +
    @@ -3018,26 +3018,26 @@

    -
    +
    @@ -3553,26 +3553,26 @@

    -
    +
    @@ -4025,26 +4025,26 @@

    -
    +
diff --git a/docs/estimation.html b/docs/estimation.html
index ee83d96..93c040c 100644
--- a/docs/estimation.html
+++ b/docs/estimation.html
@@ -468,7 +468,7 @@

    6.1 Key Ideas

    -

    A few concepts we’ll keep using here are fundamental to understanding estimation and optimization. We’ll should note that we’re qualifying our present discussion of these topics to typical linear models and similar settings, but they are much more broad and general than presented here. We’ve seen some of this before, but we’ll be getting a bit cozier with the concepts now.

    +

    A few concepts we’ll keep using here are fundamental to understanding estimation and optimization. We should note that we’re qualifying our present discussion of these topics to typical linear models and similar settings, but they are much more broad and general than presented here. We’ve seen some of this before, but we’ll be getting a bit cozier with the concepts now.

    • Parameters are the values associated with a model that we have to estimate.
    • Estimation is the process of finding the parameters associated with a model.
    • @@ -524,26 +524,26 @@

      -
      +
      @@ -1011,7 +1011,7 @@

      -

      For our purposes here, and we’ll drop any rows with missing values, and we’ll use scaled features so that they have the same variance, which as noted in the data chapter, can help make estimation easier.

      +

      For our purposes here, we’ll drop any rows with missing values, and we’ll use scaled features so that they have the same variance, which, as noted in the data chapter, can help make estimation easier.

      @@ -1130,26 +1130,26 @@

      -
      +
      @@ -1865,7 +1865,7 @@

      -

      Optimization functions typically return multiple values, including the best parameters found, the value of the objective function at that point, and sometimes other information like the number of iterations it took to reach the returned value and whether or not the process converged. This can be quite a bit of stuff so we don’t show the output above, but we definitely encourage you to inspect it closely. The following table shows the estimated parameters and the objective value for our model, and we can compare it to the standard functions to see how we did.

      +

      Optimization functions typically return multiple values, including the best parameters found, the value of the objective function at that point, and sometimes other information like the number of iterations it took to reach the returned value and whether or not the process converged. This can be quite a bit of stuff so we don’t show the raw output, but we definitely encourage you to inspect it closely. The following table shows the estimated parameters and the objective value for our model, and we can compare it to the standard functions to see how we did.

      @@ -1874,26 +1874,26 @@

      -
      +
      @@ -2442,7 +2442,7 @@

      -

      With a guess for the parameters and an assumption about the data’s distribution, we can calculate the likelihood of each data point. We get a total likelihood for all observations, similar to how we added squared errors previously. But unlike errors, we want more likelihood, not less. In theory we’d multiply each likelihood, but in practice we sum the log of the likelihood, otherwise values would get too small for our computers to handle. We can also turn our problem into a minimization problem by supply the negative log-likelihood, and then minimizing that value, which many optimization algorithms are designed to do7.

      +

      With a guess for the parameters and an assumption about the data’s distribution, we can calculate the likelihood of each data point. We get a total likelihood for all observations, similar to how we added squared errors previously. But unlike errors, we want more likelihood, not less. In theory we’d multiply each likelihood, but in practice we sum the log of the likelihood, otherwise values would get too small for our computers to handle. We can also turn our problem into a minimization problem by calculating the negative log-likelihood, and then minimizing that value, which many optimization algorithms are designed to do7.
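To illustrate why we sum log-likelihoods rather than multiply likelihoods, here is a small sketch using simulated data and a normal distribution. The data, the parameter guesses, and the function names are assumptions for illustration and are not the functions used elsewhere in this chapter.

```python
import math
import random


def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def normal_logpdf(y, mu, sigma):
    return -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


# Simulated data (hypothetical): 2000 draws from a normal distribution
random.seed(123)
y = [random.gauss(5, 2) for _ in range(2000)]

# Multiplying thousands of small densities underflows to zero
likelihood = math.prod(normal_pdf(v, 5, 2) for v in y)

# Summing the logs stays on a scale the computer can handle
log_likelihood = sum(normal_logpdf(v, 5, 2) for v in y)


def neg_log_likelihood(mu, sigma):
    """Negative log-likelihood, so that lower is better for a minimizer."""
    return -sum(normal_logpdf(v, mu, sigma) for v in y)


print(likelihood)
print(round(log_likelihood, 1))
print(neg_log_likelihood(5, 2) < neg_log_likelihood(0, 2))
```

The product is exactly zero in floating point even though the true likelihood is positive, while the negative log-likelihood is finite and is smaller for the parameter guess closer to the values used to simulate the data.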

      The following is a function we can use to calculate the likelihood of the data given our parameters. The actual likelihood value isn’t easily interpretable, but it reflects the relative likelihood of the data given the parameters, so higher is generally better. Since many default optimization algorithms are designed to minimize, we’ll multiply the likelihood by -1 to turn it into a minimization problem, so lower is better in that case. The value can also be used to compare models with different parameter guesses8. We’ll hold off with our result

      @@ -2521,26 +2521,26 @@

      -
      +
      @@ -3075,7 +3075,7 @@

      - +
      Figure 6.5: Likelihood Surface for Happiness and Life Expectancy (interactive)
      @@ -4997,7 +4997,7 @@

      6.10.2 Stochastic gradient descent

      Stochastic gradient descent (SGD) is a version of gradient descent that uses a random sample of data to guess the gradient, instead of using all the data. This makes it less accurate in some ways, but it’s faster and can be parallelized. This speed is useful in machine learning when there’s a lot of data, which often makes the discrepancy between standard GD and SGD small. As such you will see variants of it incorporated in many models in deep learning, but it can be used with much simpler models as well.

      -

      Let’s see this in action with the happiness model. The following is a conceptual version of the AdaGrad approach10, which is a variation of SGD that adjusts the learning rate for each parameter. We will also add a variation that averages the parameter estimates across iterations, which is a common approach to improve the performance of SGD, but by default it is not used, just something you can play with. We are going to use a ‘batch size’ of one, which is similar to a ‘streaming’ or ‘online’ version where we update the model with each observation. Since our data are alphabetically ordered, we’ll shuffle the data first. We’ll also use a stepsize_tau parameter, which is a way to adjust the learning rate at early iterations. We’ll set it to zero for now, but you can play with it to see how it affects the results. The values for the learning rate and stepsize_tau are arbitrary, selected after some initial playing around, but you can play with them to see how they affect the results.

      +

      Let’s see this in action with the happiness model. The following is a conceptual version of the AdaGrad approach10, which is a variation of SGD that adjusts the learning rate for each parameter. We will also add a variation that averages the parameter estimates across iterations, which is a common approach to improve the performance of SGD. It is not used by default, but it is something you can play with. We are going to use a batch size of one, which is similar to a ‘streaming’ or ‘online’ version where we update the model with each observation. Since our data are alphabetically ordered, we’ll shuffle the data first. We’ll also use a stepsize_tau parameter, which is a way to adjust the learning rate at early iterations. We’ll set it to zero for now. The values for the learning rate and stepsize_tau are arbitrary, selected after some initial playing around, but you can play with them to see how they affect the results.
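For readers who want a self-contained version to experiment with, here is a rough sketch of AdaGrad-style SGD with a batch size of one for a linear model. It is not the function used for the happiness model: the simulated data, the function name `adagrad_sgd`, and the defaults are assumptions, and the parameter averaging variation is left out to keep it short.

```python
import math
import random


def adagrad_sgd(X, y, stepsize=0.5, stepsize_tau=0.0, eps=1e-8, epochs=25, seed=42):
    """AdaGrad-style SGD for squared error loss, one observation per update."""
    rng = random.Random(seed)
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    grad_sq_sum = [0.0] * k   # running sum of squared gradients per parameter
    order = list(range(n))

    for _ in range(epochs):
        rng.shuffle(order)    # shuffle so updates don't follow the data order
        for i in order:
            # Error for this observation, using the current parameter values
            error = sum(b * x for b, x in zip(beta, X[i])) - y[i]
            for j in range(k):
                grad = 2 * error * X[i][j]
                grad_sq_sum[j] += grad ** 2
                # Each parameter gets its own shrinking learning rate
                beta[j] -= stepsize * grad / (stepsize_tau + math.sqrt(grad_sq_sum[j] + eps))
    return beta


# Simulated data (hypothetical): intercept of 1 and slope of 2
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
X = [[1.0, xi] for xi in x]
y = [1 + 2 * xi + random.gauss(0, 0.5) for xi in x]

beta = adagrad_sgd(X, y)
print([round(b, 2) for b in beta])
```

Increasing `stepsize_tau` dampens the first few updates, which are otherwise close to a full `stepsize` in magnitude regardless of the size of the gradient.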

      @@ -6867,7 +6867,7 @@

      Example

      -

      Let’s do a simple example to show how this comes about. We’ll use a binomial model where we have penalty kicks taken for a soccer player, and we want to estimate the probability of the player making a goal, which we’ll call \(\theta\). For our prior distribution, we’ll use use a beta distribution that has a mean of 0.5, suggesting that we think this person would have about a 50% chance of converting the kick on average. For the likelihood, we’ll use a binomial distribution. We also use this in our GLM chapter (Equation 7.1), which, as we noted earlier, is akin to using the log loss (Section 6.9.2).

      +

      Let’s do a simple example to show how this comes about. We’ll use a binomial model where we have penalty kicks taken for a soccer player, and we want to estimate the probability of the player making a goal, which we’ll call \(\theta\). For our prior distribution, we’ll use a beta distribution that has a mean of 0.5, suggesting that we think this person would have about a 50% chance of converting the kick on average. For the likelihood, we’ll use a binomial distribution. We also use this in our GLM chapter (Equation 7.1), which, as we noted earlier, is akin to using the log loss (Section 6.9.2).

      We’ll then calculate the posterior distribution for the probability of making a shot, given our prior and the evidence at hand, i.e., the data.

      Let’s start with some data, and just like our other estimation approaches, we’ll have some guesses for \(\theta\), which represents the probability of making a goal. We’ll use a triangular prior, but you can change it to a uniform or beta prior if you like. We’ll then calculate the likelihood of the data given the parameter, and then the posterior distribution.
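The steps just described can be sketched as a simple grid approximation. The shot data and the grid of guesses below are made up for illustration, so treat this as one possible version of the calculation rather than the exact code behind the results in this section.

```python
from math import comb

# Hypothetical data: 1 = goal, 0 = miss
shots = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
n_goals, n_shots = sum(shots), len(shots)

# Guesses for theta, the probability of making a goal
theta = [i / 10 for i in range(1, 10)]

# Triangular prior that peaks at 0.5, normalized to sum to 1
prior_raw = [min(th, 1 - th) for th in theta]
prior_total = sum(prior_raw)
prior = [p / prior_total for p in prior_raw]

# Binomial likelihood of the data for each guess
likelihood = [
    comb(n_shots, n_goals) * th ** n_goals * (1 - th) ** (n_shots - n_goals)
    for th in theta
]

# Posterior is proportional to prior times likelihood
unnormalized = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

posterior_mean = sum(th * p for th, p in zip(theta, posterior))
posterior_mode = theta[posterior.index(max(posterior))]

for th, p, l, post in zip(theta, prior, likelihood, posterior):
    print(f"theta={th:.1f}  prior={p:.3f}  likelihood={l:.3f}  posterior={post:.3f}")
print(f"Posterior mean: {posterior_mean:.2f}")
```

With these made-up shots, the likelihood peaks at the observed proportion of goals, the prior peaks at 0.5, and the posterior lands in between.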

      @@ -6960,7 +6960,7 @@

      Example

      -

      Here is the table that puts all this together. Our prior distribution is centered around 0.5, and the likelihood is centered closer to 0.7. The posterior distribution is a combination of the two. It gives no weight to smaller values, or the max values. Our final estimate is is 0.6, which falls between the two. With more evidence in the form of data, our estimate will shift more and more towards what the likelihood would suggest. This is a simple example, but it shows how the Bayesian approach works, and this holds for more complex parameter estimation as well.

      +

      Here is the table that puts all this together. Our prior distribution is centered around 0.5, and the likelihood is centered closer to 0.7. The posterior distribution is a combination of the two. It gives no weight to smaller values, or the max values. Our final estimate is 0.6, which falls between the two. With more evidence in the form of data, our estimate will shift more and more towards what the likelihood would suggest. This is a simple example, but it shows how the Bayesian approach works, and this holds for more complex parameter estimation as well.

      @@ -7472,7 +7472,19 @@

      Example

      -

      :::{.callout-info ‘Priors as Regularization’} In the context of penalized estimation and machine learning, the prior distribution can be thought of as a form of regularization (See -Section 6.8 above, and -Section 9.5 later). In this context, the prior shrinks the estimate, pulling the parameter estimates towards it, just like the penalty parameter does in the penalized estimation methods. In fact, many penalized methods can be thought of as a Bayesian approach with a specific prior distribution. A specific example would be ridge regression, which can be thought of as a Bayesian approach with a normal prior distribution for the coefficients. :::

      +
      +
      +
      + +
      +
      +Priors as Regularization +
      +
      +
      +

      In the context of penalized estimation and machine learning, the prior distribution can be thought of as a form of regularization (See Section 6.8 above, and Section 9.5 later). In this context, the prior shrinks the estimate, pulling the parameter estimates towards it, just like the penalty parameter does in the penalized estimation methods. In fact, many penalized methods can be thought of as a Bayesian approach with a specific prior distribution. A specific example would be ridge regression, which can be thought of as a Bayesian approach with a normal prior distribution for the coefficients.
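To make the ridge connection concrete, here is a short derivation sketch. It assumes notation not used elsewhere in this section: a linear model with known error variance \(\sigma^2\), and independent normal priors with mean zero and variance \(\tau^2\) on each coefficient.

```latex
\begin{aligned}
y \mid X, \beta &\sim \mathcal{N}(X\beta, \sigma^2 I), \qquad \beta_j \sim \mathcal{N}(0, \tau^2) \\
-\log p(\beta \mid y, X) &= \frac{1}{2\sigma^2}\lVert y - X\beta \rVert^2 + \frac{1}{2\tau^2}\lVert \beta \rVert^2 + \text{const} \\
\hat{\beta}_{\text{MAP}} &= \arg\min_{\beta} \; \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{aligned}
```

The last line is the ridge objective, so under these assumptions a tighter prior (smaller \(\tau^2\)) corresponds to a larger penalty and more shrinkage toward zero.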

      +
      +

    Application

    @@ -7507,7 +7519,7 @@

    Application

    -

    When we are interested in making predictions, we can use the results to generate a distribution of possible predictions for each observation, which can be very useful when we want to quantify uncertainty for complex models. This is referred to as posterior predictive distribution, which is explored in non-bayesian context in Section 4.4. Here is a plot of several draws of predicted values against the true happiness scores.

    +

    When we are interested in making predictions, we can use the results to generate a distribution of possible predictions for each observation, which can be very useful when we want to quantify uncertainty for complex models. This is referred to as the posterior predictive distribution, which is explored in a non-Bayesian context in Section 4.4. Here is a plot of several draws of predicted values against the true happiness scores.

    @@ -8001,7 +8013,7 @@

    Application

    - -
    +

    As we saw in Section 4.4, nothing is keeping you from doing ‘posterior predictive checks’ with other estimation approaches, and it’s a very good idea to do so. For example, in a GLM you have the beta estimates and the covariance matrix for them, and can simulate from a normal distribution with those estimates. It’s just more straightforward with the Bayesian approach, where packages will do it for you with little effort.

    @@ -8135,11 +8147,11 @@

    Some disciplines seem to confuse models with estimation methods and link functions. It doesn’t really make sense, nor is it informative, to call something an OLS model or a logit model. Many models are estimated using a least squares objective function, even deep learning, and different types of models use a logit link, from logistic regression, to beta regression, to activation functions used in deep learning.↩︎

  • You may find that some packages will only minimize (or maximize) a function, even to the point of reporting nonsensical things like negative squared values, so you’ll need to take care when implementing your own metrics.↩︎

  • The actual probability of a specific value in this setting is 0, but the probability of a range of values is greater than 0. You can find out more about likelihoods and probabilities at the discussion here, but in general many traditional statistical texts will cover this also.↩︎

  • -
  • The negative log-likelihood is often what is reported in the model output.↩︎

  • +
  • The negative log-likelihood is often what is reported in the model output as well.↩︎

  • Those who have experience here will notice we aren’t putting a lower bound on sigma. You typically want to do this otherwise you may get nonsensical results by not keeping sigma positive. You can do this by setting a specific argument for an algorithm that uses boundaries, or more simply by exponentiating the parameter so that it can only be positive. In the latter case, you’ll have to exponentiate the final parameter estimate to get back to the correct scale. We leave this detail out of the code for now to keep things simple.↩︎

  • Linear regression will settle on a line that cuts through the means, and when standardizing all variables, the mean of the features and target are both zero, so the line goes through the origin.↩︎

  • MC does not recall exactly where his function originated, except that Murphy’s PML book was a key reference when he came up with it (Murphy (2012)).↩︎

  • -
  • You’d get better results by also standardizing the target. The initial shuffling that we did can help as well in case the data are ordered. When we’re dealing with larger data and repeated runs/epochs, shuffling allows the samples/batches to be more representative of the entire data set. Also, we had to ‘hand-tune’ our learning rate and stepsize, which is not ideal, and normally we would use cross-validation to find the best values.↩︎

  • +
  • You’d get better results by also standardizing the target. The initial shuffling that we did can help as well in case the data are ordered. When we’re dealing with larger data and repeated runs/epochs, shuffling allows the samples/batches to be more representative of the entire data set. Also, we had to ‘hand-tune’ our learning rate and step size, which is not ideal, and normally we would use cross-validation to find the best values.↩︎

  • We’re using inference here in the standard statistical/philosophical sense, not as a synonym for prediction or generalization, which is how it is often used in machine learning. We’re not exactly sure how that terminological muddling arose in ML, but be on the lookout for it.↩︎

  • Many people’s default interpretation of a standard confidence interval is incorrectly the actual interpretation of a Bayesian confidence interval. This is partly because the Bayesian interpretation of confidence intervals and p-values is how we tend to naturally think about those statistics. But that’s okay, everyone is in the same boat. We also think it’s fine if you want to call the Bayesian version a confidence interval.↩︎

  • We used the R package brms for these results.↩︎

  • diff --git a/docs/generalized_linear_models.html b/docs/generalized_linear_models.html index d6ee4b1..db418e2 100644 --- a/docs/generalized_linear_models.html +++ b/docs/generalized_linear_models.html @@ -473,7 +473,7 @@

    As we’ve seen, you will often have a binary variable that you might want to use as a target – it could be dead/alive, lose/win, quit/retain, etc. You might be tempted to use a linear regression, but you will quickly find that it’s not the best option in that setting. So let’s try something else.

    7.3.1 The binomial distribution

    -

    Logistic regression is differs from linear regression mostly because it is used with a binary target instead of a continuous one as with linear regression. We typically assume that the target follows a binomial distribution. Unlike the normal distribution,, which is characterized by its mean (\(\mu\)) and variance (\(\sigma^2\)), the binomial distribution is defined by the parameters: p (also commonly \(\pi\)) and a known value n. Here, p represents the probability of a specific event occurring (like flipping heads, winning a game, or defaulting on a loan), and n is the number of trials or attempts under consideration.

    +

    Logistic regression differs from linear regression mostly because it is used with a binary target instead of a continuous one as with linear regression. As a result, we typically assume that the target follows a binomial distribution. Unlike the normal distribution, which is characterized by its mean (\(\mu\)) and variance (\(\sigma^2\)), the binomial distribution is defined by the parameters: p (also commonly \(\pi\)) and a known value n. Here, p represents the probability of a specific event occurring (like flipping heads, winning a game, or defaulting on a loan), and n is the number of trials or attempts under consideration.

    It’s important to note that the binomial distribution, which is commonly employed in GLMs for logistic regression, doesn’t just describe the probability of a single event. It actually represents the distribution of the number of successful outcomes in n trials, which can be greater than 1. In other words, it’s a count distribution that tells us how many times we can expect the event to occur in a given number of trials.

    Let’s see how the binomial distribution looks with 100 trials and probabilities of ‘success’ at p = .25, .5, and .75:
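
    As a rough illustration of the setup just described, the following Python sketch computes the binomial probabilities directly from the formula, using only the standard library. The function name `binomial_pmf` and the printed summary are our own choices for illustration and are not taken from the book's code.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_trials = 100  # number of trials, as in the text

for p in (0.25, 0.5, 0.75):
    # Probabilities for every possible count of successes, 0 through n
    pmf = [binomial_pmf(k, n_trials, p) for k in range(n_trials + 1)]
    # The expected count is the probability-weighted average, which equals n * p
    expected = sum(k * prob for k, prob in enumerate(pmf))
    # The most likely count is the one with the highest probability
    most_likely = max(range(n_trials + 1), key=lambda k: pmf[k])
    print(f"p = {p}: expected count = {expected:.1f}, most likely count = {most_likely}")
```

    Each list `pmf` describes a distribution over counts from 0 to 100, which is the sense in which the binomial is a count distribution rather than a statement about a single event.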

    @@ -548,7 +548,7 @@

    \[p = \frac{\textrm{exp}(\alpha + X\beta)}{1 + \textrm{exp}(\alpha + X\beta)}\]

    or equivalently:

    \[p = \frac{1}{1 + \textrm{exp}(-(\alpha + X\beta))}\]

    -

    Whenever we get results for a logistic regression model, the default coefficients and predictions are almost always on the log odds scale. We usually exponentiate the coefficients them to get the odds ratio. For example, if we have a coefficient of .5, we would say that for every one unit increase in the feature, the odds of the target being a ‘success’ increase by a factor of exp(.5) = 1.6. And we can convert the predicted log odds to probabilities using the inverse-logit function.

    +

    Whenever we get results for a logistic regression model, the default coefficients and predictions are almost always on the log odds scale. We usually exponentiate the coefficients to get the odds ratio. For example, if we have a coefficient of .5, we would say that for every one unit increase in the feature, the odds of the target being a ‘success’ increase by a factor of exp(.5) = 1.6. And we can convert the predicted log odds to probabilities using the inverse-logit function.
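
    The conversions mentioned here can be sketched in a few lines of Python. The coefficient of .5 comes from the example above; the helper name `inverse_logit` and the predicted log odds values are hypothetical, chosen only for illustration.

```python
from math import exp


def inverse_logit(log_odds):
    """Convert a value on the log odds scale to a probability."""
    return 1 / (1 + exp(-log_odds))


coefficient = 0.5  # the example coefficient from the text (log odds scale)
odds_ratio = exp(coefficient)  # exponentiate to get the odds ratio
print(f"odds ratio: {odds_ratio:.2f}")

# Hypothetical predicted log odds, not taken from any fitted model
for log_odds in (-2.0, 0.0, 0.5, 2.0):
    print(f"log odds {log_odds:+.1f} -> probability {inverse_logit(log_odds):.3f}")
```

    A log odds of 0 corresponds to a probability of .5, and values of equal magnitude but opposite sign give probabilities that sum to 1.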

    7.3.2 Probability, odds, and log odds

    @@ -1153,11 +1153,11 @@

    -

    Odds ratios might be more interpretable to some, but since they are ratios of ratios, people have historically had a hard time with those as well. As shown in Table 7.1, knowledge of the baseline rate is required for a good understanding of them. Furthermore, doubling the odds is not the same as doubling the probability, so we’re left doing some mental calisthenics to interpret them. Odds ratios are often used in academic settings, but in practice elsewhere, they are not as common. The take-home message is that we can interpret our result in terms of odds (ratios of probabilities), log-odds (linear space), or as probabilities (nonlinear space), but it can take a little more effort than our linear regression setting1. Our own preference is to stick with predicted probabilities, but it’s good to have familiarity of odds ratios, since they are often reported in academic papers and media reports.

    +

    Odds ratios might be more interpretable to some, but since they are ratios of ratios, people have historically had a hard time with those as well. As shown in Table 7.1, knowledge of the baseline rate is required for a good understanding of them. Furthermore, doubling the odds is not the same as doubling the probability, so we’re left doing some mental calisthenics to interpret them. Odds ratios are often used in academic settings, but in practice elsewhere, they are not as common. The take-home message is that we can interpret our result in terms of odds (ratios of probabilities), log-odds (linear space), or as probabilities (nonlinear space), but it can take a little more effort than our linear regression setting1. Our own preference is to stick with predicted probabilities, but it’s good to have familiarity with odds ratios, since they are often reported in academic papers and media reports.
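
    To make the point about doubling the odds concrete, here is a small Python sketch. The baseline probabilities and the helper names `prob_to_odds` and `odds_to_prob` are assumptions made for illustration; they are not values from Table 7.1.

```python
def prob_to_odds(p):
    """Odds are the probability of the event over the probability of no event."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)


# Hypothetical baseline probabilities, chosen only for illustration
for baseline in (0.1, 0.5, 0.8):
    doubled = odds_to_prob(2 * prob_to_odds(baseline))
    print(
        f"baseline p = {baseline:.2f} -> p after doubling the odds = {doubled:.3f} "
        f"(doubling p would give {2 * baseline:.2f})"
    )
```

    The same odds ratio of 2 moves the probability by different amounts depending on the baseline, and at a baseline of .8 doubling the probability is not even possible, which is why the baseline rate matters for interpretation.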

    7.3.3 A logistic regression model

    -

    Now let’s get our hands dirty and do a classification model using logistic regression. For our model let’s return to the movie review data, but now we’ll use the binary rating_good (‘good’ vs. ‘bad’) as our target. Before we get to modeling, see if you can find out the frequency of ‘good’ and ‘bad’ reviews, and the probability of getting a ‘good’ review. We examine the relationship of word_count and gender features with the likelihood of getting a good rating.

    +

    Now let’s get our hands dirty and do a classification model using logistic regression. For our model, let’s return to the movie review data, but now we’ll use the binary rating_good (‘good’ vs. ‘bad’) as our target. Before we get to modeling, see if you can find out the frequency of ‘good’ and ‘bad’ reviews, and the probability of getting a ‘good’ review. We examine the relationship of word_count and gender features with the likelihood of getting a good rating.

    @@ -1266,13 +1266,13 @@

    Date: -Sun, 01 Sep 2024 +Mon, 02 Sep 2024 Deviance: 1257.4 Time: -18:56:28 +14:35:33 Pearson chi2: 1.02e+03

    @@ -3659,7 +3659,7 @@

    For more accessible fare that doesn’t skimp on core details either:

    • An Introduction to Generalized Linear Models is generally well regarded (Dobson and Barnett (2018)).
    • -
    • Generalized Linear Models is more accessible (Hardin and Hilbe (2018)).
    • +
    • Generalized Linear Models is another accessible text (Hardin and Hilbe (2018)).
    • Roback and Legler’s Beyond Multiple Linear Regression, available for free.
    • Applied Regression Analysis and Generalized Linear Models (Fox (2015))
    • Generalized Linear Models with Examples in R (Dunn and Smyth (2018))
    • diff --git a/docs/img/causal-dag.svg b/docs/img/causal-dag.svg new file mode 100644 index 0000000..d68a92a --- /dev/null +++ b/docs/img/causal-dag.svg @@ -0,0 +1,64 @@

    [New SVG figure: causal DAG with nodes target, v, w1, w2, x, z1, z2]

    diff --git a/docs/img/lm-extend-length_genre_rating.svg b/docs/img/lm-extend-length_genre_rating.svg new file mode 100644 index 0000000..1297e56 --- /dev/null +++ b/docs/img/lm-extend-length_genre_rating.svg @@ -0,0 +1,1312 @@

    [New SVG figure: tick labels 1 to 5 and 100 to 140; genre labels Action/Adventure, Comedy, Drama, Horror, Kids, Other, Romance, Sci-Fi]

    diff --git a/docs/img/lm-extend-random_effects.svg b/docs/img/lm-extend-random_effects.svg new file mode 100644 index 0000000..a24f12c --- /dev/null +++ b/docs/img/lm-extend-random_effects.svg @@ -0,0 +1,2078 @@

    [New SVG figure: "Estimated random effects for country"; panels Intercept and Decade Trend; axis label Country, with countries shown by three-letter code]

    diff --git a/docs/img/lm-extend-random_effects_cor_plot.svg b/docs/img/lm-extend-random_effects_cor_plot.svg new file mode 100644 index 0000000..4e265a6 --- /dev/null +++ b/docs/img/lm-extend-random_effects_cor_plot.svg @@ -0,0 +1,443 @@

    [New SVG figure: "Estimated random effects for country"; axis labels Intercept and Decade Trend; countries shown by three-letter code]

    diff --git a/docs/img/lm-my-first-model-predictions-plot.svg b/docs/img/lm-my-first-model-predictions-plot.svg new file mode 100644 index 0000000..6d8d8d5 --- /dev/null +++ b/docs/img/lm-my-first-model-predictions-plot.svg @@ -0,0 +1,1068 @@

    [New SVG figure: axis labels Predicted Rating and Observed Rating; note: "Points have been jittered for better visibility."]

    diff --git a/docs/index.html b/docs/index.html index e7f0ded..f290b28 100644 --- a/docs/index.html +++ b/docs/index.html @@ -362,11 +362,11 @@

      What Will You Get Out of

      Brief Prerequisites

      You’ll definitely want to have some familiarity with R or Python (both are used for examples), and some very basic knowledge of statistics will be helpful. We’ll try to explain things as we go, but we won’t be able to cover everything. If you’re looking for a good introduction to R, we recommend R for Data Science or the Python for Data Analysis book for Python. Beyond that, we’ll try to provide the context you need so that you can be comfortable trying things out.

      -

      Also, if you happen to be reading this book in print, you can find all the content at https://m-clark.github.io/book-of-models. There you’ll find all the code, figures, and other content that you can interact with more easily.

      +

      Also, if you happen to be reading this book in print, you can find the book in web form at https://m-clark.github.io/book-of-models. There you’ll find all the code, figures, and other content that you can interact with more easily, as well as the most up-to-date content, fixes, etc.

      About the Authors

      -

      Michael is a senior machine learning scientist for Strong Analytics. Prior to industry he honed his chops in academia, earning a PhD in Experimental Psychology before turning to data science full-time as a consultant. His models have been used in production across a variety of industries, and can be seen in dozens of publications across several disciplines. He has a passion for helping others learn difficult stuff, and has taught a variety of data science courses and workshops for people of all skill levels in many different contexts.

      +

      Michael is a senior machine learning scientist for Strong Analytics1. Prior to industry he honed his chops in academia, earning a PhD in Experimental Psychology before turning to data science full-time as a consultant. His models have been used in production across a variety of industries, and can be seen in dozens of publications across several disciplines. He has a passion for helping others learn difficult stuff, and has taught a variety of data science courses and workshops for people of all skill levels in many different contexts.

      He also maintains a blog, and has several posts and long-form documents on a variety of data science topics there. He lives in Ann Arbor, Michigan with his wife and his dog, where they all enjoy long walks around the neighborhood.

      @@ -388,6 +388,12 @@

      About the Authors

    +
    +
    +
      +
    1. By the time you’re reading this, Strong’s merger with OneSix should be complete (2025).↩︎

    2. +
    +