Add VIF? #428
Sure, feel free to make a pull request (with tests and documentation). Though just like #415, it would make sense to add the empty function definition to StatsBase and StatsModels too.
@azev77 are you (or anyone else) working on this? If not, I'll take a swing at it. I just don't want to duplicate effort.
Not working on it.
After working on this, I learned that the solution mentioned at the beginning of the post is a little outdated and doesn't accurately take categorical variables into account. (No criticism meant; I apologize if I came off harsh.) There are more modern methods for calculating VIFs. See:
I'm not going to pretend to be more of an expert on the topic than them. So, I copied the code for the default method. @nalimilan, I have a couple of design questions.
Below is a screenshot of a linear model that does have a categorical variable and one that doesn't, along with the corresponding output. Please let me know what you think. Thank you.
Just to clarify: there is more work to be done with VIFs when it comes to mixed models, weighted linear models, and models with interactions in them. I'll keep working on the code to ensure that these other features are ready before a pull request is opened.

Just to clarify the output: the beginning of the documentation for `vif` in R's car package explains, "If all terms in an unweighted linear model have 1 df, then the usual variance-inflation factors are calculated. If any terms in an unweighted linear model have more than 1 df, then generalized variance-inflation factors (Fox and Monette, 1992) are calculated. These are interpretable as the inflation in size of the confidence ellipse or ellipsoid for the coefficients of the term in comparison with what would be obtained for orthogonal data."
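For anyone who wants the formula behind that quote: my reading of Fox and Monette (1992) is that the generalized VIF for a term occupying $\mathrm{df}$ columns of the model matrix is

$$\mathrm{GVIF} = \frac{\det(R_{11})\,\det(R_{22})}{\det(R)}$$

where $R_{11}$ is the correlation matrix of the term's columns, $R_{22}$ that of the remaining predictors, and $R$ the full predictor correlation matrix; $\mathrm{GVIF}^{1/(2\,\mathrm{df})}$ is the scaled form that is comparable across terms with different df. (This summary is mine, not from the thread.)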
Be careful. I don't think we can copy R code & keep the MIT license.
Thanks for the heads up. How much different does it have to be to not count as copying? Lolz. The R code just implements mathematical equations; mine also just implements the mathematical equations and outputs them in Julia. If you know of resources on what makes something "copied" or not, please let me know. I can read further into it.
R's `car` package is GPL-licensed.
Thanks for working on this @HiramTheHero. Unfortunately, @azev77 is right that we cannot include code that is derived from GPL code in GLM.jl; as the MIT license is less restrictive than the GPL, the whole package would have to become GPL. It's hard to define what counts as "derived" code, though. Unless the code is less than a handful of lines, to be on the safe side I would recommend writing to John Fox to ask him for permission to release your Julia code inspired by his R code under the MIT license (you can put me in CC, as I know him).
Better to add code for LMs and GLMs here, and code for mixed models in MixedModels.jl. Or, if the code can be written using only APIs defined in StatsAPI.jl, maybe it could live in StatsModels.jl. Anyway, we should add an empty definition of `vif` to StatsAPI.jl.
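For what it's worth, an "empty definition" here means a zero-method function that downstream packages extend; a minimal sketch (the docstring wording is mine, not from StatsAPI.jl):

```julia
# Hypothetical stub of the kind discussed above; the real definition would
# live in StatsAPI.jl so GLM.jl, MixedModels.jl, etc. can add methods to it.
"""
    vif(m)

Return the variance inflation factors of the (non-intercept) terms in
the fitted model `m`.
"""
function vif end
```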
We probably don't want GLM.jl to depend on PrettyTables.jl just for that. You could take inspiration from the code used to display coefficient tables.
Thank you @nalimilan for the response. I've sent an email to John Fox with your GitHub email address CC'd.
To calculate the VIF, I've used the
👍
I created two structs that hold the results of the
Accessing Results:
Just an update. We emailed John Fox, and he was very supportive of us integrating his VIF methods into the Julia ecosystem. It sounds like someone else has already built VIF functionality for MixedModels, and that creator wants to extend their method to GLM (JuliaStats/MixedModels.jl#661 (comment)). To avoid duplication of effort, I won't be putting more effort into this until the package maintainers decide what direction should be taken.
The implementation in MixedModelsExtras will actually already handle (some) models fit with GLM.jl:

```julia
@testset "GVIF and RegressionModel" begin
    duncan = rdataset("car", "Duncan")

    lm1 = lm(@formula(Prestige ~ 1 + Income + Education), duncan)
    @test termnames(lm1) == coefnames(lm1)
    vif_lm1 = vif(lm1)
    # values here taken from car
    @test isapprox(vif_lm1, [2.1049, 2.1049]; atol=1e-5)
    @test isapprox(vif_lm1, gvif(lm1))

    lm2 = lm(@formula(Prestige ~ 1 + Income + Education + Type), duncan)
    @test termnames(lm2) == ["(Intercept)", "Income", "Education", "Type"]
    @test isapprox(gvif(lm2), [2.209178, 5.297584, 5.098592]; atol=1e-5)
    @test isapprox(gvif(lm2; scale=true),
                   [1.486330, 2.301648, 1.502666]; atol=1e-5)
end
```
Might be worth moving the implementation to a more upstream package for better visibility. It didn't look like the implementation was using anything MixedModels-specific (but I might have missed it).
What is the status of this issue? Is there any help needed to get it moving?
@bkamins The code is written; there is just a little bit of discussion about what the API should look like and where various parts should live: JuliaStats/MixedModels.jl#661 (comment). Currently we have up to three functions: `vif`, `gvif`, and `termnames`.
cc @kleinschmidt if we put a |
My opinion is:
Any updates on this feature being implemented?
if @kleinschmidt is happy for StatsModels to own |
@HiramTheHero in the meantime, you should "just" be able to do

```julia
using MixedModelsExtras
vif(my_glm_model)
```

and things will probably work.
Just a side note for anyone following along with this thread. The following code works if your model has categorical variables and you're wanting scaled GVIFs (similar output to `car::vif` in R).
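The snippet itself didn't survive here, but based on the `gvif` keyword shown in the tests above, it was presumably along these lines (`my_model` is a placeholder, not a name from the thread):

```julia
using GLM, MixedModelsExtras

# Scaled generalized VIFs, i.e. GVIF^(1/(2*df)) -- the same quantity as
# the GVIF^(1/(2*Df)) column printed by car::vif in R.
gvif(my_model; scale=true)
```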
@kleinschmidt Is that OK for you?
I think I mentioned elsewhere that I'm very much against VIF being added to any packages, as it's a statistical footgun in waiting. I'll explain why here.

The VIF measures the degree to which two variables are confounded with each other. In other words, if the VIF between A and B is high, the variables are strongly correlated with each other, and either one could be the true explanation of the outcome variable. We cannot know, without additional data, whether A, B, or both are the real cause of the outcome variable Y.

The problem with this measure is that early in the history of econometrics, it was misinterpreted by the person deriving it as saying that, if we remove B from our regression, we will be able to measure the effect of A more accurately. This is not the case--pretending that a confounder doesn't exist will make your regression worse, not better.

Unfortunately, it is common to remove variables with a high VIF from a regression, because economists (and scientists in general) are often trained to think of their p-values and estimated standard errors as indicating their study's quality. Removing variables with a high VIF invalidates any calculated statistics or p-values, as you are now applying a post-hoc procedure, and (in the case of observational data) routinely breaks causal inference by removing variables that need to be included to avoid confounding. This procedure introduces a garden of forking paths, where the analysis researchers actually perform is strongly dependent on the results of their experiment, introducing substantial model-based variability that is not captured in the p-values or standard errors reported by GLM.jl.
We are not responsible for how users use or misuse a particular statistical quantity -- otherwise, I certainly wouldn't want to be responsible for any code that outputs a p-value! There are valid uses of VIF, and not all statistical education follows the pattern in econometrics. The typical application I've seen in my domains is as an indicator of whether individual coefficient standard errors are inflated or whether there are potential numerical issues. I had not seen the advice to remove terms with a high VIF, but rather discussions of orthogonalization, residualization, etc.
In that case, if the goal is to use it as a diagnostic, I think it makes sense to have the VIF as an internal function. However, I would avoid exporting it or including it in StatsAPI.jl, to avoid encouraging the common misunderstanding that users must "check" for collinearity; instead, GLM.jl should automatically center/orthogonalize variables when necessary (rather than pushing this problem onto the user).
I understand that in some cases, statistics can be used incorrectly or misinterpreted. My main concern is that from the statistical (not numerical) perspective, the VIF has very few valid uses (unlike the p-value), but many common invalid ones. I don't think we can bear all the responsibility for users' abuse of software, but we should still make an effort to make the software easy to use correctly. If we created a function called
As @palday noted, there are valid use cases for the VIF, and people expect to be able to compute it. Automatically orthogonalizing variables wouldn't replace this. (FWIW, in my field the BIC is also often abused to claim that a variable has no effect, yet I wouldn't suggest dropping it, as it's a classic indicator.)
I think that's a mostly correct interpretation (it's correct if you assume that the coefficient might be 0). This is roughly my point: there are correct and incorrect uses for BIC, or p-values, or most other measures, but most uses are at least 50% correct. People don't understand lots of statistics, but they at least know basics like:
i.e. even if they're misinterpreting them, they at least know whether the variable is good or bad, so it's not doing more harm than just not doing statistics at all. For a better analogy, imagine people routinely dropped variables that improved the BIC, i.e. they tried to select the model that assigned the lowest probability to the data.
No, even assuming that there is a clear definition of R^2 for a given model (which there isn't for mixed models). https://www.johnmyleswhite.com/notebook/2016/07/23/why-im-not-a-fan-of-r-squared/
I'm not a huge fan of R^2 either, and it's definitely easy to misinterpret. But at least this interpretation is somewhat accurate in some common situations. "High R^2 is better" is dramatically oversimplified and ignores a lot of other factors that can matter more to a model. Unfortunately, this misunderstanding causes people to make mistakes in how they interpret it. But it still has a kernel of truth at its core: lower variance means better predictions. Sometimes, people using R^2 are using it correctly (e.g. to compare linear regression models; as the blog post notes, the true model will always have a better R^2 than an incorrect model, given a big enough sample size). I don't think I've ever seen a correct use of VIF in a paper. As another example, imagine replacing all our p-values with a brand-new "rho-value" constructed to mislead.
Again, no! For one thing, R^2 captures a particular measure of variance (relative to a saturated model) and not the variance of the predictions. The Stein paradox and the bias-variance tradeoff speak to there being a tradeoff that needs to be made, but making that tradeoff should be informed by the broader context of the analysis. VIF is not a quantity constructed explicitly to be misleading, like your alternative rho-value or error-maximizing approach, but rather one that has valid applications and is included in software implemented by prominent applied statisticians (e.g. John Fox in R's `car`).

If you want to view it from a responsibility-to-users perspective, then users will compute this value somehow. Whether it's using RCall or some problematic code they copied and pasted from somewhere, they will find a way to get it computed. Maybe their boss or a reviewer or a regulatory agency demanded it, but they will find a way to compute it. The least we can do is make sure that they have the correctly computed value and not something that uses a wrong or numerically unstable formula. That's how we as software authors can support our users doing valid inference. It's up to them to use that value correctly. For that part, we can be educators (and I do also teach), but that is a different role than software author.

Finally, I want to note that the proposal is for defining a function to compute this value and not for showing it in default outputs. I would also be opposed to including VIF in the default output.
Again, I'm talking in the context of comparing simple linear regression models on the same dataset, in which case the scaling isn't important and there is no bias-variance tradeoff (as OLS is unbiased). But regardless, my point is that it's common to misunderstand R^2 partially, but complete reversals are rare. Few people try to choose models that actively minimize R^2.
Right, but my point is it's just as misleading as those examples. The intentions of the inventor don't matter; what matters is whether the function is misleading. If a function does something very different from what users expect it to do, there's a problem with the function.
FWIW, I am this user, about to copy-pasta a VIF formula from the wilds of the internet (not because I particularly want to, or understand the subject greatly, but simply because it is required by the course material I am engaged with), and I would greatly appreciate a vetted function conveniently located somewhere in the Julia/MLJ ecosystem instead.
There was a request for VIF on Discourse.
It's pretty simple to add, using the fact that the VIFs are the diagonal elements of the inverse of the correlation matrix.
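A minimal sketch of that identity (my own illustration, not code from the Discourse thread; `vif_from_design` and the assumption that `X` excludes the intercept column are mine):

```julia
using Statistics, LinearAlgebra

# VIFs as the diagonal of the inverse of the predictor correlation matrix.
# `X` is a numeric design matrix whose columns are the predictors,
# excluding the intercept column.
function vif_from_design(X::AbstractMatrix)
    R = cor(X)           # pairwise correlations of the columns of X
    return diag(inv(R))  # VIF_j is the j-th diagonal element of R⁻¹
end
```

Orthogonal predictors give VIFs of exactly 1, while strongly collinear columns blow the diagonal entries up.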
- Belsley, D. A., E. Kuh, and R. E. Welsch. *Regression Diagnostics*. Hoboken, NJ: John Wiley & Sons, 1980.
or