Propagation of missing values #496
Comments
Most software (at least that I've used) drops missing observations by default when fitting a regression model. However, those that do typically report how many observations were dropped, which I think is a critical piece of information that's missing from the output printed by StatsModels and other packages. IMO, the default display should include both the number of missing observations that were dropped and the number of observations that were actually used for modeling. Would adding that sufficiently address your concern here? |
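To make the suggestion concrete, here is a minimal sketch in plain Julia of the two counts such a display could report. The helper name `complete_case_counts` is hypothetical and is not part of StatsModels or GLM. It only assumes the model columns are available as equal-length vectors.

```julia
# Hypothetical helper (not part of StatsModels or GLM): count how many
# observations would be used and how many would be dropped when every row
# containing a missing value in any model column is removed.
function complete_case_counts(cols::AbstractVector...)
    n = length(first(cols))
    all(c -> length(c) == n, cols) ||
        throw(DimensionMismatch("columns must have equal length"))
    # A row is complete when none of its entries is missing.
    complete = [!any(c -> ismissing(c[i]), cols) for i in 1:n]
    nused = count(complete)
    return (nused = nused, ndropped = n - nused)
end

x = [1, 2, 3, missing]
y = [10, 20, 31, 41]
counts = complete_case_counts(x, y)
println("Observations used: ", counts.nused,
        "; dropped due to missing values: ", counts.ndropped)
```

A model's printed summary could carry both numbers, so a reader sees at a glance that the fit did not use every row.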
Without taking a position just yet, I'll expand on a few options I can see so far.
Automatic dropping
That has been my experience too, though not universally. Pandas and SQL both drop missings for summaries like
If this approach is taken, including #missing in the output would be useful.
Manual dropping
The popular Statistical Rethinking textbook reassures the user of its associated
taking a strong stance on that issue. To manually drop, instead of
using DataFrames, GLM
d = DataFrame(x=[1,2,3,missing], y=[10,20,31, 41]);
lm(@formula(y~x), d)
I would use
using DataFrames, GLM, StatsModels
d = DataFrame(x=[1,2,3,missing], y=[10,20,31, 41])
f = @formula(y~x)
lm(f, dropmissing(d, StatsModels.termvars(f)))
which is admittedly a bit less convenient since it requires
A boolean flag argument
A functional declarative approach?
Maybe there is a way to wrap |
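As one possible reading of the flag and wrapper ideas above, here is a sketch in plain Julia. Everything in it is an assumption for illustration: `fit_checked` and its `dropmissing` keyword are hypothetical, not an existing GLM or StatsModels API, and `ols` is a stand-in for a real model fit.

```julia
# Stand-in for a model fit: ordinary least-squares intercept and slope.
function ols(x, y)
    mx, my = sum(x) / length(x), sum(y) / length(y)
    slope = sum((x .- mx) .* (y .- my)) / sum((x .- mx) .^ 2)
    return (intercept = my - slope * mx, slope = slope)
end

# Hypothetical wrapper: propagate `missing` unless the caller opts in to
# dropping incomplete observations with `dropmissing = true`.
function fit_checked(fit, x, y; dropmissing::Bool = false)
    keep = .!(ismissing.(x) .| ismissing.(y))
    if !all(keep) && !dropmissing
        return missing  # behave like sum([1, 2, missing])
    end
    # Keep complete rows only and narrow the element types.
    xs = collect(skipmissing(x[keep]))
    ys = collect(skipmissing(y[keep]))
    return fit(xs, ys)
end

x = [1, 2, 3, missing]
y = [10, 20, 31, 41]

strict = fit_checked(ols, x, y)                       # propagates missing
relaxed = fit_checked(ols, x, y; dropmissing = true)  # fits on 3 rows
println("strict result is missing: ", ismissing(strict))
println("slope after dropping: ", relaxed.slope)
```

Taking the fitting function as the first argument keeps the missing-value policy separate from the model, so the same wrapper could sit in front of any fit that accepts complete vectors.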
SAS drops observations with missing values and reports the number dropped. I've never used Stata but a quick google suggests it drops automatically, though I don't know whether it reflects that in the results it displays. |
It's true that skipping missing values when fitting models isn't consistent with how we handle missing values elsewhere in the ecosystem. This is because GLM was written before missing values support was added to Julia, and it's a legacy from R. As @ararslan suggested, I think it would be good to print the number of observations dropped due to the presence of missing values. #339 should make this possible. We could take a stricter stance and throw an error in the presence of missing values, but that would be breaking so I'm not sure it's worth it, and we would have to go through a deprecation period where a warning would be thrown anyway. I think the API would have to be The first option isn't contradictory with the second one so we could start with it anyway. |
An argument for not automatically dropping missings can be made when the model is weighted. As of now lm(@formula(y~x), data=df, weights=aweights(df.w)) with missing values in either |
One of the things I like most about Julia is that it propagates missing values, encouraging me to think critically about how I handle them in my data. For instance, sum([1,2,missing]) evaluates to missing, not 3, which tells me I need to be careful and think about why there are missing values and how I should handle them. I might want to drop them, or impute values, or realize that my data cleaning functions are broken and I need to fix them before modeling.
In the case of GLM, missing values are dropped. I would rather the result be missing, as it creates a summary of the data just like sum. Then I won't have a false impression that I'm using complete data and I'll think more about the meaning of my operations.