Add support for rank deficient GeneralizedLinearModel
#340
Conversation
Force-pushed from 0097e71 to 7160916
Codecov Report

@@            Coverage Diff             @@
##           master     #340       +/-   ##
===========================================
- Coverage   81.08%   70.90%   -10.18%
===========================================
  Files           7        6        -1
  Lines         703      543      -160
===========================================
- Hits          570      385      -185
- Misses        133      158       +25

Continue to review the full report at Codecov.
As discussed in issue #273, the reason why we don't pivot by default is that it would produce inaccurate results by setting argument […]. But now what I use here is the […]. Maybe we could check the rank when constructing DensePredChol to decide whether to compute CholeskyPivoted or not, or just compute the CholeskyPivoted by default, because if the matrix is full rank, then we would have […].

Well... it's hard to modify GLM without affecting LM. Maybe we should keep the interface consistent and merge it first?
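A hypothetical sketch of that idea (my illustration, not the PR's code): compute the pivoted factorization first, then inspect the rank it reports to decide whether the cheaper plain `Cholesky` would suffice. The matrix and names here are made up for demonstration.

```julia
using LinearAlgebra

# Pivoted Cholesky of X'X; `check=false` keeps it from throwing on deficiency.
# (On recent Julia, `RowMaximum()` replaces the deprecated `Val(true)`.)
X = [1.0 2.0 3.0;
     4.0 5.0 9.0;
     7.0 8.0 15.0;
     2.0 1.0 3.0]        # x3 = x1 + x2, so X'X is rank deficient
XtX = X'X
F = cholesky(XtX, Val(true); check=false)
fullrank = rank(F) == size(XtX, 1)
# fullrank is false here; a full-rank matrix could fall back to cholesky(XtX)
```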
I would love to see this one merged for two downstream packages. Thanks!

Since the Project.toml is reverted, Travis CI may fail. It doesn't mean that the code has a bug.

Since we'll squash and merge anyway, you can either rebase on master or merge master into this branch and fix the conflict that way. Since this adds a feature (support for rank-deficient GLM), we should bump the minor version, right? Is there anything else holding up merging this? @andreasnoack can you take a quick look?

Bump @andreasnoack

Correct. (Note: this rule only applies for packages that are >= version 1.0. Since GLM is at version >= 1.0, you are correct: we bump the minor version for new features.)
It mostly looks good to me. Just a few minor comments.
data/rankdeficient_test.csv
Outdated
@@ -0,0 +1,1001 @@
x1,x2,y |
Better to simulate a dataset instead of checking in datasets.
Sorry, I don't quite understand... What do you mean by "simulate a dataset"? Generate a random dataset every time the test code runs?
yes, but if you set the random seed it will be the same every time.
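For instance (a sketch with illustrative column names, not the PR's actual test code), seeding the RNG makes the simulated data identical on every run:

```julia
using DataFrames, Random

# Seeded simulation: the same data on every test run.
rng = MersenneTwister(123)
df = DataFrame(x1 = randn(rng, 1_000))
df.x2 = 2 .* df.x1                  # linearly dependent column → rank deficiency
df.y  = df.x1 .+ randn(rng, 1_000)
```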
@@ -7,6 +7,11 @@ The effective coefficient vector, `p.scratchbeta`, is evaluated as `p.beta0 .+ f
and `out` is updated to `p.X * p.scratchbeta`
"""
function linpred!(out, p::LinPred, f::Real=1.)
for i in eachindex(p.delbeta) |
Please add a comment explaining why this is necessary
This one was never addressed
Project.toml
Outdated
@@ -37,3 +37,4 @@ StatsBase = "0.30, 0.31"
StatsFuns = "0.6, 0.7, 0.8"
StatsModels = "0.6"
julia = "1"
for i in eachindex(p.delbeta)
    if isnan(p.delbeta[i]) || isinf(p.delbeta[i])
        p.delbeta[i] = 0
    end
end
Just in case the compiler isn't able to hoist the field access:
delbeta = p.delbeta
@inbounds for i in eachindex(delbeta)
    if isnan(delbeta[i]) || isinf(delbeta[i])
        delbeta[i] = 0
    end
end
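If preferred, the same cleanup can also be written as a single broadcast over a hoisted local binding (a standalone sketch, not one of the reviewers' suggestions):

```julia
# Zero out non-finite entries in place; isfinite covers both NaN and Inf.
delbeta = [1.0, NaN, Inf, 2.0]
@. delbeta = ifelse(isfinite(delbeta), delbeta, 0.0)
# delbeta is now [1.0, 0.0, 0.0, 2.0]
```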
This comment still applies.
@jason-xuan Would you have the time to add the comment @andreasnoack requested? I can't do that myself, and it would be too bad if this PR were blocked just because of this detail. Thanks!

No problem, I'll try to finish this next week.
Force-pushed from f1049b8 to 5c5f2ef
Force-pushed from 983e100 to a1bec45
Codecov Report

Base: 84.12% // Head: 84.43% // Increases project coverage by +0.30%.

@@            Coverage Diff             @@
##           master     #340      +/-   ##
==========================================
+ Coverage   84.12%   84.43%   +0.30%
==========================================
  Files           7        6       -1
  Lines         819      790      -29
==========================================
- Hits          689      667      -22
+ Misses        130      123       -7

☔ View full report at Codecov.
@test isa(m1.model.pp.chol, CholeskyPivoted)
@test rank(m1.model.pp.chol) == 3
# Evaluated: 138626.46758072695 ≈ 138625.6633724341
# @test deviance(m1.model) ≈ 138625.6633724341
@DilumAluthge I'm trying to rebase it, and it seems to be almost done except for these two deviance numbers. Could you tell me where you got them?
I reverted all my code back to commit e8a922dd and got the same result:
Test Summary: | Pass Total
rankdeficient | 15 15
rankdeficient GLM: Test Failed at /home/xua/.julia/dev/GLM/test/runtests.jl:172
Expression: deviance(m1.model) ≈ 138625.6633724341
Evaluated: 138626.46758019557 ≈ 138625.6633724341
Stacktrace:
[1] top-level scope at /home/xua/.julia/dev/GLM/test/runtests.jl:172
[2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1119
[3] top-level scope at /home/xua/.julia/dev/GLM/test/runtests.jl:146
rankdeficient GLM: Test Failed at /home/xua/.julia/dev/GLM/test/runtests.jl:193
Expression: deviance(m2.model) ≈ 138615.90834086522
Evaluated: 138624.2610477953 ≈ 138615.90834086522
So would you check the benchmark data? Maybe it changed after the Julia update.
It's been so long, I don't remember where I got those original numbers.
I would remove those two tests.
OK, then this PR is ready to be merged. @andreasnoack
I'd rather uncomment these checks, but use the new value. That way if we break something in the future we'll notice it. Can you check these against e.g. R?
It would also be good to check the values of coefficients (here and below).
Which part here fixed the slow convergence?
@@ -469,6 +469,7 @@ function fit(::Type{M},
    y::AbstractVector{<:Real},
    d::UnivariateDistribution,
    l::Link = canonicallink(d);
allowrankdeficient::Bool = false, |
This has been renamed to dropcollinear for other methods AFAIK. EDIT: and it's true by default for linear models, so better do the same here.
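For comparison, this is how the keyword reads for linear models (a sketch assuming a GLM.jl version where `dropcollinear` exists and defaults to `true`; the data is made up):

```julia
using GLM, DataFrames, Random

rng = MersenneTwister(1)
df = DataFrame(x1 = randn(rng, 100))
df.x2 = 3 .* df.x1                       # exactly collinear with x1
df.y  = df.x1 .+ randn(rng, 100)

m = lm(@formula(y ~ x1 + x2), df)        # dropcollinear = true by default
# the pivoted Cholesky drops one collinear coefficient instead of erroring
```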
for i in eachindex(p.delbeta)
    if isnan(p.delbeta[i]) || isinf(p.delbeta[i])
        p.delbeta[i] = 0
    end
end
This comment still applies.
@test isa(m1.model.pp.chol, CholeskyPivoted)
@test rank(m1.model.pp.chol) == 3
# Evaluated: 138626.46758072695 ≈ 138625.6633724341
# @test deviance(m1.model) ≈ 138625.6633724341
I'd rather uncomment these checks, but use the new value. That way if we break something in the future we'll notice it. Can you check these against e.g. R?
It would also be good to check the values of coefficients (here and below).
# an example of rank deficiency caused by linearly dependent columns
num_rows = 100_000
dfrm = DataFrame()
dfrm[!, :x1] = randn(MersenneTwister(123), num_rows)
Better use StableRNGs for this.
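A sketch of that suggestion: `StableRNG` pins the random stream across Julia releases, whereas `MersenneTwister`'s stream is not guaranteed to stay identical between versions (the `:x2` column here is my illustrative addition).

```julia
using StableRNGs, DataFrames

num_rows = 100_000
dfrm = DataFrame()
dfrm[!, :x1] = randn(StableRNG(123), num_rows)  # stable across Julia versions
dfrm[!, :x2] = 2 .* dfrm.x1                     # keeps the rank deficiency
```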
GeneralizedLinearModel
@@ -157,11 +162,32 @@ function delbeta!(p::DensePredChol{T,<:Cholesky}, r::Vector{T}, wt::Vector{T}) where T<:BlasReal
end

function delbeta!(p::DensePredChol{T,<:CholeskyPivoted}, r::Vector{T}, wt::Vector{T}) where T<:BlasReal
    cf = cholfactors(p.chol)
@andreasnoack It permutes delbeta rather than doing ldiv directly, which is what fixed the slow convergence issue.
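The idea can be sketched on a small standalone system (my illustration, not the PR's exact code): with a pivoted Cholesky satisfying `A[piv, piv] == U'U`, you solve in the permuted basis and undo the pivot when scattering the solution back.

```julia
using LinearAlgebra

X = [1.0 2.0; 3.0 1.0; 0.5 4.0]
A = X'X                              # full rank here, for a checkable example
r = [1.0, 2.0]

F = cholesky(A, Val(true); check=false)
piv = F.p                            # pivot order: A[piv, piv] == F.U'F.U
δ = similar(r)
δ[piv] = F.U \ (F.U' \ r[piv])       # solve the permuted system, scatter back
# δ agrees with A \ r
```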
@andreasnoack More comments?
@jason-xuan Do you plan to finish this? It would be too bad to let this be forgotten!

Sure! I'll see if I can figure out how to get a test case from R next week, because I'm new to it.

If you prefer, you can give me a Julia example and I'll run it in R for you. The syntax is quite close, though, so it shouldn't be too hard.
[test/runtests.jl](https://github.com/jason-xuan/GLM.jl/blob/a1bec45accb4a0a94dd2b891e2e45e844d379396/test/runtests.jl#L174)
Force-pushed from a1bec45 to 2eb96e5
@nalimilan I calculated the deviance with RCall and got the following results:

julia> @rput dfrm;

R> fit <- glm(y~1+x1+x2+x3, data=dfrm, family=binomial())
R> deviance(fit)
[1] 138626.5

Compared with […]
R is just omitting the remaining digits. You can do […]
This was implemented in #488
This pull request supersedes #314 because I don't have write access to either repo.
In #314, an issue about slower convergence was discussed. This PR solves that issue and adds a test case for it.
cc: @Nosferican @nalimilan @jiahao @dmbates @andreasnoack @DilumAluthge