Add support for rank deficient GeneralizedLinearModel
#340
Conversation
Force-pushed from 0097e71 to 7160916
Codecov Report

@@            Coverage Diff             @@
##           master     #340       +/-   ##
===========================================
- Coverage   81.08%   70.90%   -10.18%
===========================================
  Files           7        6        -1
  Lines         703      543      -160
===========================================
- Hits          570      385      -185
- Misses        133      158       +25

Continue to review the full report at Codecov.
As discussed in issue #273, the reason why we don't pivot by default is that it would produce inaccurate results by setting argument […]. But now what I use here is the […]. Maybe we could check the rank when constructing DensePredChol to decide whether to compute CholeskyPivoted or not, or just compute the CholeskyPivoted by default, because if the matrix is full rank, then we would have […].

Well... it's hard to modify GLM without affecting LM. Maybe we should keep the interface consistent and merge it first?
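A hypothetical sketch of that idea (my illustration, not the PR's code): compute the pivoted factorization first, then inspect the rank it reports to decide whether the cheaper plain `Cholesky` would suffice. The matrix and names here are made up for demonstration.

```julia
using LinearAlgebra

# Pivoted Cholesky of X'X; `check=false` keeps it from throwing on deficiency.
# (On recent Julia, `RowMaximum()` replaces the deprecated `Val(true)`.)
X = [1.0 2.0 3.0;
     4.0 5.0 9.0;
     7.0 8.0 15.0;
     2.0 1.0 3.0]        # x3 = x1 + x2, so X'X is rank deficient
XtX = X'X
F = cholesky(XtX, Val(true); check=false)
fullrank = rank(F) == size(XtX, 1)
# fullrank is false here; a full-rank matrix could fall back to cholesky(XtX)
```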
I would love to see this one merged for two downstream packages. Thanks!

Since the Project.toml is reverted, Travis CI may fail. It doesn't mean that the code has a bug.

Since we'll squash and merge anyway, you can either rebase on master or merge master into this branch and fix the conflict that way. Since this adds a feature (support for rank-deficient GLM), we should bump the minor version, right? Is there anything else holding up merging this? @andreasnoack can you take a quick look?

Bump @andreasnoack

Correct. (Note: this rule only applies for packages that are >= version 1.0. Since GLM is at version >= 1.0, you are correct: we bump the minor version for new features.)
It mostly looks good to me. Just a few minor comments.
data/rankdeficient_test.csv
Outdated
@@ -0,0 +1,1001 @@
x1,x2,y |
Better to simulate a dataset instead of checking in datasets.
Sorry, I don't quite understand... What do you mean by "simulate a dataset"? Generate a random dataset every time the test code runs?
yes, but if you set the random seed it will be the same every time.
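For instance (a sketch with illustrative column names, not the PR's actual test code), seeding the RNG makes the simulated data identical on every run:

```julia
using DataFrames, Random

# Seeded simulation: the same data on every test run.
rng = MersenneTwister(123)
df = DataFrame(x1 = randn(rng, 1_000))
df.x2 = 2 .* df.x1                  # linearly dependent column → rank deficiency
df.y  = df.x1 .+ randn(rng, 1_000)
```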
@@ -7,6 +7,11 @@ The effective coefficient vector, `p.scratchbeta`, is evaluated as `p.beta0 .+ f
and `out` is updated to `p.X * p.scratchbeta`
"""
function linpred!(out, p::LinPred, f::Real=1.)
for i in eachindex(p.delbeta) |
Please add a comment explaining why this is necessary
This one was never addressed
Project.toml
Outdated
@@ -37,3 +37,4 @@ StatsBase = "0.30, 0.31"
StatsFuns = "0.6, 0.7, 0.8"
StatsModels = "0.6"
julia = "1"
for i in eachindex(p.delbeta)
    if isnan(p.delbeta[i]) || isinf(p.delbeta[i])
        p.delbeta[i] = 0
    end
end
Just in case the compiler isn't able to hoist the field access:
delbeta = p.delbeta
@inbounds for i in eachindex(delbeta)
    if isnan(delbeta[i]) || isinf(delbeta[i])
        delbeta[i] = 0
    end
end
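If preferred, the same cleanup can also be written as a single broadcast over a hoisted local binding (a standalone sketch, not one of the reviewers' suggestions):

```julia
# Zero out non-finite entries in place; isfinite covers both NaN and Inf.
delbeta = [1.0, NaN, Inf, 2.0]
@. delbeta = ifelse(isfinite(delbeta), delbeta, 0.0)
# delbeta is now [1.0, 0.0, 0.0, 2.0]
```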
This comment still applies.
@jason-xuan Would you have the time to add the comment @andreasnoack requested? I can't do that myself, and it would be too bad if this PR were blocked just because of this detail. Thanks!

No problem, I'll try to finish this next week.
Force-pushed from f1049b8 to 5c5f2ef
Force-pushed from 983e100 to a1bec45
Codecov Report

Base: 84.12% // Head: 84.43% // Increases project coverage by +0.30%.

@@            Coverage Diff             @@
##           master     #340      +/-   ##
==========================================
+ Coverage   84.12%   84.43%   +0.30%
==========================================
  Files           7        6       -1
  Lines         819      790      -29
==========================================
- Hits          689      667      -22
+ Misses        130      123       -7

☔ View full report at Codecov.
@test isa(m1.model.pp.chol, CholeskyPivoted)
@test rank(m1.model.pp.chol) == 3
# Evaluated: 138626.46758072695 ≈ 138625.6633724341
# @test deviance(m1.model) ≈ 138625.6633724341
@DilumAluthge I'm trying to rebase it, and it seems to be almost done except for these two deviance numbers. Could you tell me where you got them?
I reverted all my code back to commit e8a922dd and got the same result:
Test Summary: | Pass Total
rankdeficient | 15 15
rankdeficient GLM: Test Failed at /home/xua/.julia/dev/GLM/test/runtests.jl:172
Expression: deviance(m1.model) ≈ 138625.6633724341
Evaluated: 138626.46758019557 ≈ 138625.6633724341
Stacktrace:
[1] top-level scope at /home/xua/.julia/dev/GLM/test/runtests.jl:172
[2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1119
[3] top-level scope at /home/xua/.julia/dev/GLM/test/runtests.jl:146
rankdeficient GLM: Test Failed at /home/xua/.julia/dev/GLM/test/runtests.jl:193
Expression: deviance(m2.model) ≈ 138615.90834086522
Evaluated: 138624.2610477953 ≈ 138615.90834086522
So would you check the benchmark data? Maybe it changed after the Julia update.
It's been so long, I don't remember where I got those original numbers.
I would remove those two tests.
OK, then this PR is ready to be merged. @andreasnoack
I'd rather uncomment these checks, but use the new value. That way if we break something in the future we'll notice it. Can you check these against e.g. R?
It would also be good to check the values of coefficients (here and below).
Which part here fixed the slow convergence?
@@ -469,6 +469,7 @@ function fit(::Type{M},
    y::AbstractVector{<:Real},
    d::UnivariateDistribution,
    l::Link = canonicallink(d);
allowrankdeficient::Bool = false, |
This has been renamed to dropcollinear for other methods AFAIK. EDIT: and it's true by default for linear models, so better do the same here.
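For comparison, this is how the keyword reads for linear models (a sketch assuming a GLM.jl version where `dropcollinear` exists and defaults to `true`; the data is made up):

```julia
using GLM, DataFrames, Random

rng = MersenneTwister(1)
df = DataFrame(x1 = randn(rng, 100))
df.x2 = 3 .* df.x1                       # exactly collinear with x1
df.y  = df.x1 .+ randn(rng, 100)

m = lm(@formula(y ~ x1 + x2), df)        # dropcollinear = true by default
# the pivoted Cholesky drops one collinear coefficient instead of erroring
```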
for i in eachindex(p.delbeta)
    if isnan(p.delbeta[i]) || isinf(p.delbeta[i])
        p.delbeta[i] = 0
    end
end
This comment still applies.
@test isa(m1.model.pp.chol, CholeskyPivoted)
@test rank(m1.model.pp.chol) == 3
# Evaluated: 138626.46758072695 ≈ 138625.6633724341
# @test deviance(m1.model) ≈ 138625.6633724341
I'd rather uncomment these checks, but use the new value. That way if we break something in the future we'll notice it. Can you check these against e.g. R?
It would also be good to check the values of coefficients (here and below).
# an example of rank deficiency caused by linearly dependent columns
num_rows = 100_000
dfrm = DataFrame()
dfrm[!, :x1] = randn(MersenneTwister(123), num_rows)
Better use StableRNGs for this.
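A sketch of that suggestion: `StableRNG` pins the random stream across Julia releases, whereas `MersenneTwister`'s stream is not guaranteed to stay identical between versions (the `:x2` column here is my illustrative addition).

```julia
using StableRNGs, DataFrames

num_rows = 100_000
dfrm = DataFrame()
dfrm[!, :x1] = randn(StableRNG(123), num_rows)  # stable across Julia versions
dfrm[!, :x2] = 2 .* dfrm.x1                     # keeps the rank deficiency
```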
GeneralizedLinearModel
@@ -157,11 +162,32 @@ function delbeta!(p::DensePredChol{T,<:Cholesky}, r::Vector{T}, wt::Vector{T}) where T<:BlasReal
end

function delbeta!(p::DensePredChol{T,<:CholeskyPivoted}, r::Vector{T}, wt::Vector{T}) where T<:BlasReal
    cf = cholfactors(p.chol)
@andreasnoack It permutes delbeta rather than doing ldiv directly, which is what fixed the slow convergence issue.
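The idea can be sketched on a small standalone system (my illustration, not the PR's exact code): with a pivoted Cholesky satisfying `A[piv, piv] == U'U`, you solve in the permuted basis and undo the pivot when scattering the solution back.

```julia
using LinearAlgebra

X = [1.0 2.0; 3.0 1.0; 0.5 4.0]
A = X'X                              # full rank here, for a checkable example
r = [1.0, 2.0]

F = cholesky(A, Val(true); check=false)
piv = F.p                            # pivot order: A[piv, piv] == F.U'F.U
δ = similar(r)
δ[piv] = F.U \ (F.U' \ r[piv])       # solve the permuted system, scatter back
# δ agrees with A \ r
```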
@andreasnoack More comments?
@jason-xuan Do you plan to finish this? It would be too bad to let this be forgotten!

Sure! I'll see if I can figure out how to get a test case from R next week, because I'm new to it.

If you prefer, you can give me a Julia example and I'll run it in R for you. The syntax is quite close, though, so it shouldn't be too hard.
[test/runtests.jl](https://github.com/jason-xuan/GLM.jl/blob/a1bec45accb4a0a94dd2b891e2e45e844d379396/test/runtests.jl#L174)
Force-pushed from a1bec45 to 2eb96e5
@nalimilan I calculated the deviance with RCall and got the following results:

julia> @rput dfrm;

R> fit <- glm(y~1+x1+x2+x3, data=dfrm, family=binomial())
R> deviance(fit)
[1] 138626.5

Compared with […]
R is just omitting the remaining digits. You can do […]
This was implemented in #488
This pull request supersedes #314 because I don't have write access to either repo.
In #314, an issue about slower convergence was discussed. This PR solves that issue and adds a test case for it.
cc: @Nosferican @nalimilan @jiahao @dmbates @andreasnoack @DilumAluthge