diff --git a/docs/causal.html b/docs/causal.html
index d16dda3..2afe79a 100644
--- a/docs/causal.html
+++ b/docs/causal.html
@@ -411,7 +411,6 @@
This section is pretty high level, and we are not going to go into much detail here, so even just some understanding of correlation and modeling would likely be enough.
-Some additional variants of these models exist, and they can be used in a variety of settings, not just uplift modeling. The key idea is to use the model to predict the potential outcomes of the treatment, and then to take the difference between the two predictions as the causal effect.
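To make that idea concrete, below is a minimal sketch of the two-prediction approach (an S-Learner style setup) using scikit-learn. The data, feature names, and choice of model are hypothetical, just to show the mechanics: fit one model with the treatment indicator included as a feature, then compare predictions with the treatment switched on versus off.

```python
# Minimal S-Learner style sketch with simulated (hypothetical) data.
# One model is fit with the treatment indicator as just another feature;
# the estimated effect is the difference between predictions made with
# the treatment set to 1 versus 0 for every observation.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'treatment': rng.integers(0, 2, size=n),
})
# simulate an outcome with a modest treatment effect of 0.25
df['y'] = 0.5 * df['x1'] + 0.25 * df['treatment'] + rng.normal(scale=0.5, size=n)

features = ['x1', 'x2', 'treatment']
model = GradientBoostingRegressor().fit(df[features], df['y'])

# predicted potential outcomes under treatment and under control
pred_treated = model.predict(df[features].assign(treatment=1))
pred_control = model.predict(df[features].assign(treatment=0))

uplift = pred_treated - pred_control    # per-observation estimated effect
print(uplift.mean())                    # rough average treatment effect estimate
```

Averaging the per-observation differences gives a rough estimate of the average treatment effect; other meta-learner variants build on this same template.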
@@ -743,7 +738,7 @@
If we are concerned solely with explanation, we now would want to ask ourselves first if we can trust our result based on the data, model, and various issues that went into producing it. If so, we can then see if the effect is large enough to be of interest, and whether the result is useful in making decisions8. It may very well be; maybe the target concerns the rate of survival, where any increase is worthwhile. Or perhaps the data circumstances demand such interpretation, because it is extremely costly to obtain more. For more exploratory efforts however, this sort of result would likely not be enough to come to any strong conclusion, even if explanation is the only goal.
As another example, consider the world happiness data we’ve used in previous demonstrations. We want to explain the association between country-level characteristics and the population’s happiness. We likely aren’t going to be as interested in predicting next year’s happiness score, but rather in what attributes are correlated with a happy populace in general. In this election year (2024) in the U.S., we’d be interested in specific factors related to presidential elections, for which there are relatively few data points. In these cases, explanation is the focus, and we may not even need a model at all to come to our conclusions.
So we can understand that in some settings we may be more interested in understanding the underlying mechanisms of the data, as with these examples, and in others we may be more interested in predictive performance, as in our demonstrations of machine learning. However, the distinction between prediction and explanation is ultimately a bit problematic, not least because we often want to do both.
Although it’s often framed that way, prediction is not just what we do with new data. It is the very means by which we get any explanation of effects via coefficients, marginal effects, visualizations, and other model results. Additionally, where the focus is on predictive performance, if we can’t explain the results we get, we will typically feel dissatisfied, and may still question how well the model is actually doing.
@@ -754,7 +749,7 @@
-From here you might revisit some of the previous models and think about how you might use them to answer a causal question. You might also look into some of the other models we’ve mentioned here, and see how they are used in practice via the additional resources below.
+From here you might revisit some of the previous models and think about how you might use them to answer a causal question. You might also look into some of the other models we’ve mentioned here, and see how they are used in practice via the additional resources.
Your authors have to admit some bias here. We’ve spent a lot of our past dealing with SEMs, and almost every application we saw had too little data, generalized poorly, and was grossly overfit. Many SEM programs even added multiple ways to overfit the data even further, and it is difficult to trust the results reported in many papers that used them. But that’s not the fault of SEM in general: it can be a useful tool when used correctly, and it can help answer causal questions, but it is not a magic bullet, and using it doesn’t make anyone look fancier.↩︎
This is basically the S-Learner approach to meta-learning, which we’ll discuss in a bit. It is generally too weak.↩︎
The G-computation approach and S-learners are essentially the same approach, but came about from different domain contexts.↩︎
-This is a contrived example, but it is definitely something what you might see in the wild. The relationship is weak, and though statistically significant, the model can’t predict the target well at all. The statistical power is actually decent in this case, roughly 70%, but this is mainly because the sample size is so large and it is a very simple model setting.
-This is a common issue in many academic fields, and it’s why we always need to be careful about how we interpret our models. In practice, we would generally need to consider other factors, such as the cost of a false positive or false negative, or the cost of the data and running the model itself, to determine if the model is worth using.↩︎
+This is a contrived example, but it is definitely something that you might see in the wild. The relationship is weak, and though statistically significant, the model can’t predict the target well at all. The statistical power is actually decent in this case, roughly 70%, but this is mainly because the sample size is so large and it is a very simple model setting.
+This is a common issue in many academic fields, and it’s why we always need to be careful about how we interpret our models. In practice, we would generally need to consider other factors, such as the cost of a false positive or false negative, or the cost of the data and running the model itself, to determine if the model is worth using.↩︎
Gentle reminder that making an assumption does not mean the assumption is correct, or even provable.↩︎
-Consider a model setting with 100,000 samples. Is this large? Let’s say you have a rare outcome that occurs 1% of the time. This means you have 1000 samples where outcome label you’re interested in occurs. Now consider a categorical feature (A) that has four categories, and one of those categories is relatively small, say 5% of the data, or 5000 cases, and you want to interact it with another categorical feature (B), one whose categories are all equally distributed. Assuming no particular correlation between the two, you’d be down to ~1% of the data for the least category of A across the levels of B. Now if there is an actual interaction, some of those interaction cells may have only a dozen or so positive target values. Odds are pretty good that you don’t have enough data to make a reliable estimate of the interaction effect.
+Consider a model setting with 100,000 samples. Is this large? Let’s say you have a rare outcome that occurs 1% of the time. This means you have 1000 samples where the outcome label you’re interested in is present. Now consider a categorical feature (A) that has four categories, and one of those categories is relatively small, say 5% of the data, or 5000 cases, and you want to interact it with another categorical feature (B), one whose categories are all equally distributed. Assuming no particular correlation between the two, you’d be down to ~1% of the data for the least category of A across the levels of B. Now if there is an actual interaction effect on the target, some of those interaction cells may have only a dozen or so positive target values. Odds are pretty good that you don’t have enough data to make a reliable estimate of the interaction effect.
Oh wait, did you want to use cross-validation also? A simple random sample approach might result in some validation sets with no positive values at all! Don’t forget that you may have already split your 100,000 samples into training and test sets, so you have even less data to start with! The following table shows the final cell count for a dataset with these properties.
The point is that it’s easy to forget that large data can get small very quickly due to class imbalance, interactions, etc. There is not much you can do about this, but you should not be surprised when these situations are not very revealing in terms of your model results.
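For readers who want to verify the shrinkage described above, here is the back-of-the-envelope arithmetic in a few lines of Python. It assumes feature B also has four equally sized levels and that A, B, and the outcome are independent of one another (our assumptions, added just to reproduce the rough numbers):

```python
# Back-of-the-envelope cell counts for the scenario above. Assumes feature B
# also has four equally sized levels and that A, B, and the outcome are
# independent of one another.
n = 100_000
positive_rate = 0.01        # rare outcome: 1% positive labels
smallest_A_share = 0.05     # smallest category of feature A
n_B_levels = 4              # levels of feature B

n_positive = n * positive_rate                   # 1,000 positive cases overall
n_smallest_A = n * smallest_A_share              # 5,000 cases in the small A category
per_cell = n_smallest_A / n_B_levels             # 1,250 cases per A x B cell (~1% of the data)
positives_per_cell = per_cell * positive_rate    # ~12 positive cases in such a cell

print(n_positive, n_smallest_A, per_cell, positives_per_cell)
# 1000.0 5000.0 1250.0 12.5
```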
-We’re talking very generally about data here, so not much background is needed. The models mentioned are covered in other chapters, or build upon those, but we’re not doing any actual modeling here.
+We’re talking very generally about data here, so not much background is needed. The models mentioned here are covered in other chapters, or build upon those, but we’re not doing any actual modeling here.
-Using a log transformation for numeric targets and features is straightforward, and comes with several benefits. For example, it can help with heteroscedasticity, which is when the variance of the target is not constant across the range of the predictions2 (demonstrated below). It can also help to keep predictions positive after transformation, allows for interpretability gains, and more. One issue with logging is that it is not a linear transformation, which can help capture nonlinear feature-target relationships, but can also make some post-modeling transformations more less straightforward. Also if you have a lot of zeros, ‘log plus one’ transformations are not going to be enough to help you overcome that hurdle3. Logging also won’t help much when the variables in question have few distinct values, like ordinal variables, which we’ll discuss later in Section 13.2.3.
+Using a log transformation for numeric targets and features is straightforward, and comes with several benefits. For example, it can help with heteroscedasticity, which is when the variance of the target is not constant across the range of the predictions2 (demonstrated below). It can also help to keep predictions positive after transformation, allows for interpretability gains, and more. One issue with logging is that it is not a linear transformation, which can help capture nonlinear feature-target relationships, but can also make some post-modeling transformations less straightforward. Also if you have a lot of zeros, ‘log plus one’ transformations are not going to be enough to help you overcome that hurdle3. Logging also won’t help much when the variables in question have few distinct values, like ordinal variables, which we’ll discuss later in Section 13.2.3.
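As a small illustration of the points above, here is a sketch with simulated data where multiplicative noise produces heteroscedasticity on the raw scale; the variable names and model choice are just for demonstration:

```python
# Simulated example: multiplicative noise makes the raw target heteroscedastic,
# while the logged target is well behaved. Variable names are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(123)
n = 1000
x = rng.uniform(0, 3, size=n)
y = np.exp(0.5 + 1.0 * x + rng.normal(scale=0.4, size=n))   # strictly positive, skewed

X = x.reshape(-1, 1)
raw_fit = LinearRegression().fit(X, y)           # residual spread grows with the fitted values
log_fit = LinearRegression().fit(X, np.log(y))   # roughly constant residual spread

# Back-transforming is where things get less straightforward: exp of the
# predicted log-target estimates the conditional median of y, not its mean.
pred_median_scale = np.exp(log_fit.predict(X))
```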
-Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector of 1-6 with numeric labels could be something like [2, 1, 1, 3, 4, 2], whereas rank data would be [2, 1, 3, 4, 5, 6], each being unique (unless you allowed for ties). For example, in sports, a ranking problem would regard predicting the actual finish of the runners. Assuming you have a modeling tool that actually handles this situation, the objective will be different from other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or the simpler variant Bradley-Terry model). In machine learning, you might use so-called learning to rank methods, like the RankNet and LambdaRank algorithms, and other variants for deep learning models.
+Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector of 1-6 with numeric labels could be something like [2, 1, 1, 3, 4, 2], whereas rank data would be [2, 1, 3, 4, 5, 6], each being unique (unless you allow for ties). For example, in sports, a ranking problem would regard predicting the actual finish of the runners. Assuming you have a modeling tool that actually handles this situation, the objective will be different from other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or the simpler variant Bradley-Terry model). In machine learning, you might use so-called learning to rank methods, like the RankNet and LambdaRank algorithms, and other variants for deep learning models.
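To see the distinction in code, here is a tiny sketch using the example vectors above (the race times are made up for illustration):

```python
# Ordinal labels versus a true ranking, using the example vectors above.
import numpy as np
from scipy.stats import rankdata

ordinal = np.array([2, 1, 1, 3, 4, 2])   # ordered categories, ties allowed
ranks = np.array([2, 1, 3, 4, 5, 6])     # a ranking: every observation uniquely ordered

# Turning an underlying score (made-up race times) into a ranking outcome:
race_times = np.array([12.1, 11.8, 12.4, 12.9, 13.0, 13.3])
finish_order = rankdata(race_times)      # array([2., 1., 3., 4., 5., 6.]) -- fastest gets rank 1
```

A learning-to-rank model would then try to predict something like finish_order from runner features, rather than treating the outcome as a set of ordered categories.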
@@ -2925,7 +2925,7 @@
Data augmentation is a technique where you artificially increase the size of your dataset by creating new data points based on the existing data. This is a common technique in deep learning for computer vision, where you might rotate, flip, or crop images to create new training data. This can help improve the performance of your model, especially when you have a small dataset. Techniques are also available for text.
In the tabular domain, data augmentation is less common, but still possible. You’ll see it most commonly applied in class-imbalance settings (Section 13.4), where you might create new data points for the minority class to balance the dataset. This can be done by randomly resampling the existing data points, or by creating synthetic data points based on the existing ones. For the latter, SMOTE and many variants of it are quite common.
-Unfortunately for tabular data, these techniques are not nearly as successful as augmentation for computer vision or natural language processing, nor consistently so. Part of the issue is that tabular data is very noisy and fraught with measurement error, so in a sense, such techniques are just adding noise to the modeling process11. Downsampling the majority class can potentially throw away usefu information. Simple random upsampling of the minority class can potentially lead to an overconfident model that still doesn’t generalize well. In the end, the best approach is to get more and/or better data, but hopefully more successful methods will be developed in the future.
+Unfortunately for tabular data, these techniques are not nearly as successful as augmentation for computer vision or natural language processing, nor consistently so. Part of the issue is that tabular data is very noisy and fraught with measurement error, so in a sense, such techniques are just adding noise to the modeling process11. Downsampling the majority class can potentially throw away useful information. Simple random upsampling of the minority class can potentially lead to an overconfident model that still doesn’t generalize well. In the end, the best approach is to get more and/or better data, but hopefully more successful methods will be developed in the future.
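As a quick illustration of the resampling approach, here is a sketch using simulated data and the imbalanced-learn package (assuming it is installed); SMOTE interpolates between existing minority-class points to create synthetic ones:

```python
# Sketch of SMOTE-style oversampling with simulated data, assuming the
# imbalanced-learn package is installed (pip install imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.95, 0.05], random_state=0
)
print(Counter(y))    # heavily imbalanced: roughly 95% / 5%

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))    # minority class synthetically upsampled to parity
```

Note that any such resampling should be applied only to the training split; resampling before cross-validation or the train/test split leaks information into the evaluation.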