Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No genlasso #595

Open
wants to merge 33 commits into
base: dev
Choose a base branch
from
Open

No genlasso #595

wants to merge 33 commits into from

Conversation

dajmcdon
Copy link
Contributor

Checklist

Please:

  • Make sure this PR is against "dev", not "main" (unless this is a release
    PR).
  • Request a review from one of the current main reviewers:
    brookslogan, nmdefries.
  • Makes sure to bump the version number in DESCRIPTION. Always increment
    the patch version number (the third number), unless you are making a
    release PR from dev to main, in which case increment the minor version
    number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    1.7.2, then write your changes under the 1.8 heading).
  • See DEVELOPMENT.md for more information on the development
    process.

Change explanations for reviewer

This completes the process of removing the genlasso dependency (because it required a C compiler).

  1. We replace the functionality with glmgen/trendfilter.
  2. But list this in Suggests and test for it if that method is requested.
  3. Also refactor growth_rate() to be a bit less clunky. It could (and often did) return vectors of unexpected lengths (due to NAs or duplicated x-values). This does not play nice with mutate() which was the main example use case.
  4. Add tests for the (many) edge cases of argument interaction.

Note that this does entail breaking changes.

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

@dshemetov
Copy link
Contributor

/preview-docs

Copy link

🚀 Deployed on https://679400526c295d36604a015e--epiprocess.netlify.app

@brookslogan
Copy link
Contributor

@dshemetov you got ahead of me, I was just wanting to test this out!! Very useful already.

@dajmcdon how compatible are genlasso and trendfilter supposed to be? There appear to be some changes from dev:
ss-tf-dev
to this PR:
ss-tf-pr
and trendfilter seems like it might be a bit under-regularized in this case.

@dajmcdon
Copy link
Contributor Author

The implementation is completely different. genlasso is a path algorithm while trendfilter uses ADMM with warm-starts (From Ramdas+Tibshirani 2016). So the lambda's visited are likely quite different. The default here is to use the CV minimizer (in both cases). So they shouldn't be too far off, modulo the possible lambdas that could be used. We could switch to the 1se rule if we want more regularization.

Copy link
Contributor

@brookslogan brookslogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumping some notes here. I plan to make at least some minor editing commits and maybe take another look / do actual testing when my mind's a bit more fresh.

R/growth_rate.R Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
R/growth_rate.R Outdated Show resolved Hide resolved
R/growth_rate.R Outdated Show resolved Hide resolved
R/growth_rate.R Show resolved Hide resolved
R/growth_rate.R Outdated Show resolved Hide resolved
R/growth_rate.R Outdated Show resolved Hide resolved
R/growth_rate.R Outdated Show resolved Hide resolved
R/growth_rate.R Outdated Show resolved Hide resolved
R/growth_rate.R Outdated Show resolved Hide resolved
@brookslogan brookslogan self-requested a review January 24, 2025 23:17
Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Installing trendfilter wasn't so bad after installing g++ on Ubuntu, so hopefully this alleviates installation issues, thanks!
  • Did some sanity checking of the code, as far as I could follow it (mostly the logic here, don't quite understand the external package parameters enough to comment on those)
  • Did some testing of the functions to make sure we're getting sensible values. Assuming those are fine, this looks good to me.

@brookslogan
Copy link
Contributor

brookslogan commented Jan 28, 2025

When I run tests locally, I get a lot of warnings like:

subscript out of bounds (index 29 >= vector size 29)

and at some point something like "partial match of tol = to tolerance".

[Ah, now I can see both again. Here's an example of one:

── Warning (test-growth_rate.R:25:3): new setup args and warnings are as expected ───
subscript out of bounds (index 29 >= vector size 29)
Backtrace:
     ▆
  1. ├─testthat::expect_length(...) at test-growth_rate.R:25:3
  2. │ └─testthat::quasi_label(enquo(object), arg = "object") at testthat/R/expect-length.R:18:3
  3. │   └─rlang::eval_bare(expr, quo_get_env(quo)) at testthat/R/quasi-label.R:45:3
  4. └─epiprocess::growth_rate(y = c(1:20, NA, 22:30), method = "trend_filter")
  5.   ├─stats::predict(obj, newx = x0, which_lambda = which_lambda) at epiprocess/R/growth_rate.R:255:9
  6.   └─trendfilter:::predict.cv_trendfilter(obj, newx = x0, which_lambda = which_lambda)
  7.     ├─stats::predict(object$full_fit, newx, which_lambda, ...)
  8.     └─trendfilter:::predict.trendfilter(...)
  9.       └─base::apply(...)
 10.         └─trendfilter (local) FUN(newX[, i], ...)
 11.           └─dspline::dspline_interp(th, object$k, object$x, newx)
 12.             └─dspline:::rcpp_dspline_interp(v, k, xd, x, implicit)

and an example of the other:

── Warning (test-growth_rate.R:187:3): trendfilter growth_rate implementation ───────
partial match of 'tol' to 'tolerance'
Backtrace:
    ▆
 1. ├─testthat::expect_length(...) at test-growth_rate.R:187:3
 2. │ └─testthat::quasi_label(enquo(object), arg = "object") at testthat/R/expect-length.R:18:3
 3. │   └─rlang::eval_bare(expr, quo_get_env(quo)) at testthat/R/quasi-label.R:45:3
 4. └─epiprocess::growth_rate(y = z, method = "trend_filter", params = growth_rate_params(nfolds = 3))
 5.   └─trendfilter::cv_trendfilter(...) at epiprocess/R/growth_rate.R:246:9
 6.     └─trendfilter::trendfilter(...)
 7.       └─trendfilter:::admm_lambda_seq(...)

]

At least some from these parts of my .Rprofile

options(
  warnPartialMatchArgs = TRUE,
  warnPartialMatchAttr = TRUE,
  warnPartialMatchDollar = TRUE,
  useFancyQuotes = FALSE
)

## seems to work on (currently installed as of time of writing) 4.0.5 though doesn't seem to be documented in ?Logic like in some later version
## thought this was only about [[ not [ though
Sys.setenv("_R_CHECK_LENGTH_1_LOGIC2_"=TRUE)
  • Are these epiprocess or upstream problems? Do they need addressed? The subscript out of bounds thing seems a bit fishy, the partial match more innocuous.

@brookslogan
Copy link
Contributor

brookslogan commented Jan 28, 2025

@dajmcdon I finished the rewords I was hoping to do + some of the minor tasks. There's still some lingering stuff that I'm hoping you'll be able to decide what to do with & handle.

(I've linted those overly long lines in another PR.)

@dajmcdon
Copy link
Contributor Author

dajmcdon commented Jan 29, 2025

OK. By my count, two outstanding issues (after the above commits)

  • The subscript out of bounds (index 29 >= vector size 29) warning. I can't reproduce this locally, but it happens in CI.
  • The question of "are the new results" too wiggly? I think this is up for debate. @lcbrooks what do you think about using lambda.1se as the default instead of the minimizer (it was the minimizer previously). I'm not sure what could account for the difference, though Ryan may know.

@dajmcdon
Copy link
Contributor Author

@brookslogan following up on the point about underregularization. I did a bit of investigating. The previous implementation used the cv minimizer, but capped the number of steps in the solution path at 1000. Because of this, lambda_min == min(lambda) and min(lambda) from genlasso is much larger than min(lambda) from trendfilter. The 1000 steps doesn't "go down the path" as far as in the other implementation, visiting only smoother solutions. So it effectively forces more smoothness due to other algorithmic choices.

However, the trendfilter implementation also produces a questionable level of smoothness. CV there also has the property that lambda_min == min(lambda). So ideally, we would consider even smaller levels of regularization!

I'm not sure how much more to beat on this. We could do some other default choice, but I think this is a larger problem (one I've encountered before) that we shouldn't necessarily handle here. We can put it on the tooling agenda for next time we meet with Ryan.

x <- cases_deaths_subset |>
  filter(geo_value == "pa" & time_value >= "2020-06-01") |>
  select(tv = time_value, cases = cases_7d_av) |>
  arrange(tv) |>
  mutate(tv = as.numeric(tv))

# using internal defaults from current epiprocess implementation
gen <- genlasso::trendfilter(x$cases, ord = 3L, maxsteps = 1000L)
cvgen <- genlasso::cv.trendfilter(gen, k = 3)

tf <- trendfilter::cv_trendfilter(x$cases, x$tv, nfolds = 3L) # chosen to match

# both minimizers are at the minimum lambda
plot(cvgen)
plot(tf)

# convert to growth rates using lambda_min
genf <- gen$beta[,cvgen$i.min]
tff <- predict(tf, which_lambda = "lambda_min")
dgenf <- diff(genf) / diff(x$tv)
dgenf <- c(dgenf, dgenf[length(dgenf)])
dtff <- diff(tff) / diff(x$tv)
dtff <- c(dtff, dtff[length(dtff)])

plot(dgenf / genf, ty = "l", col = 1)
lines(dtff / tff, col = 2)

@brookslogan
Copy link
Contributor

Nice catches! I do think the current plot looks pretty weird; part of that might be under-regularization; part might be the degree of the polynomial.

  • I could believe there is a rapid change around Dec & Jan that's oversmoothed in genlasso, but all these other little bumps seem suspect.
  • The small bumps may look particularly unnatural due to the degree of the spline, but we may sort of be constrained there by attempting apples-to-apples comparison with smoothing splines.

I think in the short term, given that upstream changes may need to be made, we should decide among:

A. tweaking this text to be more accurate:

In this particular example, the trend filtering estimates of growth rate appear to be much more stable than those from the smoothing spline, and also much more stable than the estimates from local relative changes and linear regressions.

B. maybe trying putting 1se as default, seeing if it makes things look better, and remembering to revisit if something changes upstream / the lambda range changes (which might not only change the "min" selection but also "1se"). Conceptually I like 1se's motivation as a default, but I'm afraid there will be even more weird situations than with min, & we're already hitting one with min. Rather than worry about them all now, we could just go with whatever makes this example look better & try to iterate if we/users encounter issues later.

C. both.

@dajmcdon
Copy link
Contributor Author

I'm fine with A, but against B (and C).

Based on the above investigation, B won't have an effect here (it gives the same result as lambda min). CV is not a great estimator for this data, at least implemented as leave 1 fold out and folds are set as every vth observation. The defaults should remain.

We should be thinking of the previously "nice smooth result" as nothing more than an artefact that conveniently looked how we wanted. Along the lines of running gradient descent for only a few iterations and saying, "yeah, I stopped way early, but I like the answer!".

)

# other spline args give output (correctness not checked)
z <- rnorm(30)
Copy link
Contributor

@brookslogan brookslogan Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we might be running into some nondeterminism in tests, here & in another similar line in this file. Suggest updating these to something like

z <- withr::with_rng_version("3.5.0", withr::with_seed(7164574, rnorm(30)))

but this doesn't seem to trigger the index bounds warning; I'm going to browse around some seeds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI hits the bounds warning even earlier, with the test on line 23. There's no randomness there, so maybe better to poke at that one.

Copy link
Contributor

@brookslogan brookslogan Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following one of these deterministic problematic lines down to the first problematic interp call, we arrive at this example:

dspline:::rcpp_dspline_interp(v = c(0.112401954925076, 0.224803909850205, 0.337205864775255, 
0.449607819700357, 0.562009774625446, 0.674411729550536, 0.786813684475624, 
0.899215639400714, 1.0116175943258, 1.12401954925089, 1.23642150417598, 
1.34882345910107, 1.46122541402616, 1.57362736895125, 1.68602932387634, 
1.79843127880143, 1.91083323372652, 2.02323518865161, 2.13563714357669, 
2.24803909850179, 2.47284300835196, 2.58524496327706, 2.69764691820214, 
2.81004887312723, 2.92245082805232, 3.03485278297741, 3.14725473790251, 
3.25965669282756, 3.37205864775269), k = 3L, xd = c(1, 2, 3, 
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
22, 23, 24, 25, 26, 27, 28, 29, 30), x = c(1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 
23, 24, 25, 26, 27, 28, 29, 30), implicit = TRUE)

(from something like

withr::with_options(list(warn = 2, warnPartialMatchDollar = FALSE), {
  trace(dspline:::rcpp_dspline_interp, tracer = quote({
    .GlobalEnv[["problematic_args"]] <- as.list(environment())
  }))
  try(growth_rate(y = c(1:20, NA, 22:30), method = "trend_filter"))
  untrace(dspline:::rcpp_dspline_interp)
  dput(problematic_args)
})

)

I haven't done any tracing in dspline Rcpp code to figure out what's going on.

Copy link
Contributor

@brookslogan brookslogan Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though the culprit seems likely to be how NAs are being omitted before being passed to the rcpp function, supported by:

Warning message:
In dspline:::rcpp_dspline_interp(v = c(0.112401954925076, 0.224803909850205,  :
  subscript out of bounds (index 29 >= vector size 29)

29 seems to be a 0-based index here, and

problematic_args$xd %>% length()
#> [1] 29
problematic_args$x %>% length()
#> [1] 30

suggests that it might be assuming xd and x have the same length and trying to index xd all the way through length(x)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my understanding of that function, as long as the xd contains the range of x there shouldn't be any problem. I can't see where the issue would be in the Cpp code. I'd suggest making an issue in dspline and tagging Ryan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Maybe I got mixed up about line numbers for the two warnings and there isn't a problem reproducing either warning. NA testing seems up above seems like much more of a prime suspect for the length check vs. some randomness down here being the an issue. On the fence about setting the RNG vs. not now; setting makes debugging easier, not setting makes detection more reliable in a sense, as we will be doing many draws over many test runs.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the process of filing issue upstream, I think I might have found a way to address: try implicit = FALSE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or alternatively by only querying non-design points with implicit = TRUE, then recombining with the design points manually. (Upstream issue here.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move genlasso to Suggests:/Enhances: Use glmgen in growth rate function when ready
3 participants