
Add support for rank deficient GeneralizedLinearModel #340

Closed

Conversation

jason-xuan

This pull request supersedes #314 because I don't have write access to either repo.

In #314 an issue about slower convergence was discussed. This PR fixes that issue and adds a test case for it.

cc: @Nosferican @nalimilan @jiahao @dmbates @andreasnoack @DilumAluthge

@jason-xuan changed the title from "Da/allow rank deficient" to "solve issue about converges slower in #314" on Oct 29, 2019
@codecov-io commented Oct 31, 2019

Codecov Report

Merging #340 (f69d819) into master (2692f3c) will decrease coverage by 10.17%.
The diff coverage is 100.00%.

❗ Current head f69d819 differs from pull request most recent head 5c5f2ef. Consider uploading reports for the commit 5c5f2ef to get more accurate results

@@             Coverage Diff             @@
##           master     #340       +/-   ##
===========================================
- Coverage   81.08%   70.90%   -10.18%     
===========================================
  Files           7        6        -1     
  Lines         703      543      -160     
===========================================
- Hits          570      385      -185     
- Misses        133      158       +25     
Impacted Files Coverage Δ
src/glmfit.jl 75.86% <100.00%> (-0.59%) ⬇️
src/linpred.jl 61.05% <100.00%> (-9.66%) ⬇️
src/glmtools.jl 47.61% <0.00%> (-32.89%) ⬇️
src/lm.jl 72.22% <0.00%> (-21.12%) ⬇️
src/ftest.jl 97.95% <0.00%> (-0.51%) ⬇️
src/GLM.jl
src/negbinfit.jl 93.75% <0.00%> (+12.05%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

src/linpred.jl (outdated review thread, resolved)
@jason-xuan (Author) commented Nov 3, 2019

As discussed in issue #273, the reason we don't pivot by default is that it requires setting the argument check=false, which can produce inaccurate results.

But what I use here is the piv vector, which identifies the linearly independent columns, so the result should no longer suffer from that problem.

Maybe we could check the rank when constructing DensePredChol to decide whether to compute a CholeskyPivoted factorization, or just compute the CholeskyPivoted by default, because if the matrix is full rank then we would have piv = 1:rank. That would leave a cleaner interface for users.
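To illustrate the idea, a rough sketch with plain LinearAlgebra calls (not the PR's DensePredChol code; X is a made-up design matrix):

using LinearAlgebra

# Design matrix whose third column duplicates the second, so X'X is rank deficient.
X = [ones(6) collect(1.0:6.0) collect(1.0:6.0)]

# Pivoted Cholesky; check=false so the rank-deficient matrix does not throw.
# (Val(true) selects pivoting here; newer Julia spells it RowMaximum().)
F = cholesky(X'X, Val(true), check=false)

rank(F)   # 2: only two linearly independent columns
F.p       # pivot vector; F.p[1:rank(F)] indexes an independent subset of the columns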

Well... it's hard to modify GLM without affecting LM. Maybe we should keep the interface consistent and merge this first?

@AsafManela

I would love to see this one merged for two downstream packages. Thanks!

test/runtests.jl (outdated review thread, resolved)
.gitignore (outdated review thread, resolved)
Project.toml (outdated review thread, resolved)
@jason-xuan (Author)

Since Project.toml has been reverted, the Travis CI build may fail. That does not mean the code has a bug.

@kleinschmidt (Member)

Since we'll squash and merge anyway you can either rebase on master, or merge master into this branch and fix the conflict that way.

Since this adds a feature (support for rank-deficient GLM), we should bump the minor version right?

Is there anything else holding up merging this? @andreasnoack can you take a quick look?

@DilumAluthge (Contributor)

Bump @andreasnoack

@DilumAluthge (Contributor) commented Mar 13, 2020

Since this adds a feature (support for rank-deficient GLM), we should bump the minor version right?

Correct.

(Note: this rule only applies to packages at version >= 1.0. Since GLM is at version >= 1.0, you are correct: we bump the minor version for new features.)

@andreasnoack (Member) left a comment

It mostly looks good to me. Just a few minor comments.

@@ -0,0 +1,1001 @@
x1,x2,y
Member:

Better to simulate a dataset instead of checking in datasets.

Author:

Sorry, I don't quite understand... What do you mean by "simulate a dataset"? Generate a random dataset every time the test code runs?

Member:

yes, but if you set the random seed it will be the same every time.
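For example, a minimal sketch of a seeded simulation (column names and sizes here are illustrative, not the final test):

using DataFrames, Random

rng = MersenneTwister(123)                 # fixed seed, so the data are identical on every run
df = DataFrame(x1 = randn(rng, 1_000))
df.x2 = df.x1                              # duplicated column makes the design rank deficient
df.y = 1 .+ 2 .* df.x1 .+ randn(rng, 1_000)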

@@ -7,6 +7,11 @@ The effective coefficient vector, `p.scratchbeta`, is evaluated as `p.beta0 .+ f
and `out` is updated to `p.X * p.scratchbeta`
"""
function linpred!(out, p::LinPred, f::Real=1.)
for i in eachindex(p.delbeta)
Member:

Please add a comment explaining why this is necessary

Member:

This one was never addressed

Project.toml (outdated)
@@ -37,3 +37,4 @@ StatsBase = "0.30, 0.31"
StatsFuns = "0.6, 0.7, 0.8"
StatsModels = "0.6"
julia = "1"

Member:

Suggested change (the suggestion removes the added blank line)

Comment on lines +10 to +14
for i in eachindex(p.delbeta)
    if isnan(p.delbeta[i]) || isinf(p.delbeta[i])
        p.delbeta[i] = 0
    end
end
Member:

Just in case the compiler isn't able to hoist the field access:

Suggested change
- for i in eachindex(p.delbeta)
-     if isnan(p.delbeta[i]) || isinf(p.delbeta[i])
-         p.delbeta[i] = 0
-     end
- end
+ delbeta = p.delbeta
+ @inbounds for i in eachindex(delbeta)
+     if isnan(delbeta[i]) || isinf(delbeta[i])
+         delbeta[i] = 0
+     end
+ end

Member:

This comment still applies.

@nalimilan (Member)

@jason-xuan Would you have the time to add the comment @andreasnoack requested? I can't do that myself, and it would be too bad for this PR to be blocked just because of this detail. Thanks!

@jason-xuan (Author)

@jason-xuan Would you have the time to add the comment @andreasnoack requested? I can't do that myself, and it would be too bad for this PR to be blocked just because of this detail. Thanks!

No problem, I'll try to finish this next week

@jason-xuan force-pushed the da/allow-rank-deficient branch 2 times, most recently from 983e100 to a1bec45 on September 14, 2021 03:32
@codecov-commenter commented Sep 14, 2021

Codecov Report

Base: 84.12% // Head: 84.43% // Increases project coverage by +0.30% 🎉

Coverage data is based on head (a1bec45) compared to base (fa74d14).
Patch coverage: 96.00% of modified lines in pull request are covered.

❗ Current head a1bec45 differs from pull request most recent head 2eb96e5. Consider uploading reports for the commit 2eb96e5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #340      +/-   ##
==========================================
+ Coverage   84.12%   84.43%   +0.30%     
==========================================
  Files           7        6       -1     
  Lines         819      790      -29     
==========================================
- Hits          689      667      -22     
+ Misses        130      123       -7     
Impacted Files Coverage Δ
src/linpred.jl 79.06% <95.83%> (+1.75%) ⬆️
src/glmfit.jl 80.14% <100.00%> (+1.54%) ⬆️
src/glmtools.jl 81.14% <0.00%> (-1.56%) ⬇️
src/ftest.jl 98.52% <0.00%> (-1.48%) ⬇️
src/GLM.jl
src/lm.jl 96.69% <0.00%> (+0.63%) ⬆️


@test isa(m1.model.pp.chol, CholeskyPivoted)
@test rank(m1.model.pp.chol) == 3
# Evaluated: 138626.46758072695 ≈ 138625.6633724341
# @test deviance(m1.model) ≈ 138625.6633724341
Author:

@DilumAluthge I'm trying to rebase it, and it seems to be almost done except for these two deviance numbers. Could you tell me where you got them?

Author:

I reverted all my code back to commit @e8a922dd, and got the same result:

Test Summary: | Pass  Total
rankdeficient |   15     15
rankdeficient GLM: Test Failed at /home/xua/.julia/dev/GLM/test/runtests.jl:172
  Expression: deviance(m1.model) ≈ 138625.6633724341
   Evaluated: 138626.46758019557 ≈ 138625.6633724341
Stacktrace:
 [1] top-level scope at /home/xua/.julia/dev/GLM/test/runtests.jl:172
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1119
 [3] top-level scope at /home/xua/.julia/dev/GLM/test/runtests.jl:146
rankdeficient GLM: Test Failed at /home/xua/.julia/dev/GLM/test/runtests.jl:193
  Expression: deviance(m2.model) ≈ 138615.90834086522
   Evaluated: 138624.2610477953 ≈ 138615.90834086522

So could you check the benchmark data? Maybe it changed after a Julia update.

Contributor:

It's been so long, I don't remember where I got those original numbers.

Contributor:

I would remove those two tests.

@jason-xuan (Author) commented Sep 15, 2021

OK, then this PR is ready to be merged. @andreasnoack

Member:

I'd rather uncomment these checks, but use the new value. That way if we break something in the future we'll notice it. Can you check these against e.g. R?

It would also be good to check the values of coefficients (here and below).

@andreasnoack (Member)

Which part here fixed the slow convergence?

@@ -469,6 +469,7 @@ function fit(::Type{M},
y::AbstractVector{<:Real},
d::UnivariateDistribution,
l::Link = canonicallink(d);
allowrankdeficient::Bool = false,
@nalimilan (Member) commented Sep 28, 2021

This has been renamed to dropcollinear for other methods AFAIK. EDIT: and it's true by default for linear models, so better to do the same here.
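For comparison, a sketch of what that keyword looks like on the linear-model side (illustrative data; assumes a GLM version in which lm accepts dropcollinear):

using GLM, DataFrames, Random

rng = MersenneTwister(1)
df = DataFrame(x1 = randn(rng, 100))
df.x2 = 2 .* df.x1                         # perfectly collinear with x1
df.y = df.x1 .+ randn(rng, 100)

# dropcollinear=true (the lm default) pivots the redundant column away instead of erroring
m = lm(@formula(y ~ x1 + x2), df, dropcollinear=true)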

Comment on lines +10 to +14
for i in eachindex(p.delbeta)
    if isnan(p.delbeta[i]) || isinf(p.delbeta[i])
        p.delbeta[i] = 0
    end
end
Member:

This comment still applies.


# an example of rank deficiency caused by linearly dependent columns
num_rows = 100_000
dfrm = DataFrame()
dfrm[!, :x1] = randn(MersenneTwister(123), num_rows)
Member:

Better use StableRNGs for this.
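For instance, a sketch of the same column drawn from a stable generator (illustrative; StableRNGs streams are documented not to change across Julia releases, unlike MersenneTwister):

using DataFrames, StableRNGs

rng = StableRNG(123)
num_rows = 100_000
dfrm = DataFrame()
dfrm[!, :x1] = randn(rng, num_rows)        # reproducible on any Julia version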

@nalimilan changed the title from "solve issue about converges slower in #314" to "Add support for rank deficient GeneralizedLinearModel" on Sep 28, 2021
@@ -157,11 +162,32 @@ function delbeta!(p::DensePredChol{T,<:Cholesky}, r::Vector{T}, wt::Vector{T}) w
end

function delbeta!(p::DensePredChol{T,<:CholeskyPivoted}, r::Vector{T}, wt::Vector{T}) where T<:BlasReal
cf = cholfactors(p.chol)
Author:

@andreasnoack It's permuting delbeta, rather than doing the ldiv directly, that fixed the slow convergence issue.
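Roughly, the idea is as follows (a standalone sketch with placeholder X and r, not the actual delbeta! method): the pivoted factorization solves the permuted normal equations, so the rank-restricted solution has to be scattered back through ch.p rather than used in place.

using LinearAlgebra

X = [ones(5) collect(1.0:5.0) 2 .* collect(1.0:5.0)]   # third column collinear with the second
r = randn(5)

ch  = cholesky(X'X, Val(true), check=false)            # pivoted Cholesky (RowMaximum() on newer Julia)
rnk = rank(ch)
piv = ch.p

b   = X'r                                              # right-hand side of the normal equations
L11 = LowerTriangular(ch.L[1:rnk, 1:rnk])              # leading rank-by-rank block of the pivoted factor

delbeta = zeros(size(X, 2))
# solve on the pivoted block, then map the solution back through the permutation;
# coefficients of the dropped (dependent) columns stay at zero
delbeta[piv[1:rnk]] = L11' \ (L11 \ b[piv[1:rnk]])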

Member:

@andreasnoack More comments?

@nalimilan (Member)

@jason-xuan Do you plan to finish this? It would be too bad to let this be forgotten!

@jason-xuan (Author)

Sure! I'll see if I can figure out how to get a test case from R next week, since I'm new to it.

@nalimilan (Member)

If you prefer you can give me a Julia example and I'll run it in R for you. Though the syntax is quite close so it shouldn't be too hard.

@jason-xuan (Author)

If you prefer you can give me a Julia example and I'll run it in R for you. Though the syntax is quite close so it shouldn't be too hard.

[test/runtests.jl](https://github.com/jason-xuan/GLM.jl/blob/a1bec45accb4a0a94dd2b891e2e45e844d379396/test/runtests.jl#L174)
That's the result of uncommenting the check; I tried, but I couldn't reproduce the result from R myself.

@jason-xuan (Author) commented Mar 25, 2022

@nalimilan I calculated the deviance with RCall and got the following results:

julia> @rput dfrm;

R> fit <- glm(y~1+x1+x2+x3,data=dfrm,family=binomial())

R> deviance(fit)
[1] 138626.5

Compared with deviance(m1.model) = 138626.46758072695, that looks like a rounded version of my result. Can it be considered OK? Is there a way to get a more precise result from R?

@nalimilan (Member)

R is just omitting the remaining digits. You can do print(deviance(fit), digits=10) to see more of them. But probably the result is the same (for practical purposes).
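For reference, the whole check can be driven from Julia with RCall (a sketch assuming the same dfrm as in the session above):

using RCall

@rput dfrm
R"""
fit <- glm(y ~ 1 + x1 + x2 + x3, data = dfrm, family = binomial())
print(deviance(fit), digits = 10)
"""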

@andreasnoack (Member)

This was implemented in #488
