-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No genlasso #595
base: dev
Are you sure you want to change the base?
No genlasso #595
Conversation
…into no-genlasso
Merge branch 'dev' into no-genlasso # Conflicts: # DESCRIPTION
/preview-docs |
🚀 Deployed on https://679400526c295d36604a015e--epiprocess.netlify.app |
@dshemetov you got ahead of me, I was just wanting to test this out!! Very useful already. @dajmcdon how compatible are genlasso and trendfilter supposed to be? There appear to be some changes from dev: |
The implementation is completely different. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dumping some notes here. I plan to make at least some minor editing commits and maybe take another look / do actual testing when my mind's a bit more fresh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Installing
trendfilter
wasn't so bad after installing g++ on Ubuntu, so hopefully this alleviates installation issues, thanks! - Did some sanity checking of the code, as far as I could follow it (mostly the logic here, don't quite understand the external package parameters enough to comment on those)
- Did some testing of the functions to make sure we're getting sensible values. Assuming those are fine, this looks good to me.
When I run tests locally, I get a lot of warnings like:
and at some point something like "partial match of [Ah, now I can see both again. Here's an example of one:
and an example of the other:
] At least some from these parts of my .Rprofile options(
warnPartialMatchArgs = TRUE,
warnPartialMatchAttr = TRUE,
warnPartialMatchDollar = TRUE,
useFancyQuotes = FALSE
)
## seems to work on (currently installed as of time of writing) 4.0.5 though doesn't seem to be documented in ?Logic like in some later version
## thought this was only about [[ not [ though
Sys.setenv("_R_CHECK_LENGTH_1_LOGIC2_"=TRUE)
|
@dajmcdon I finished the rewords I was hoping to do + some of the minor tasks. There's still some lingering stuff that I'm hoping you'll be able to decide what to do with & handle. (I've linted those overly long lines in another PR.) |
OK. By my count, two outstanding issues (after the above commits)
|
@brookslogan following up on the point about underregularization. I did a bit of investigating. The previous implementation used the However, the I'm not sure how much more to beat on this. We could do some other default choice, but I think this is a larger problem (one I've encountered before) that we shouldn't necessarily handle here. We can put it on the tooling agenda for next time we meet with Ryan. x <- cases_deaths_subset |>
filter(geo_value == "pa" & time_value >= "2020-06-01") |>
select(tv = time_value, cases = cases_7d_av) |>
arrange(tv) |>
mutate(tv = as.numeric(tv))
# using internal defaults from current epiprocess implementation
gen <- genlasso::trendfilter(x$cases, ord = 3L, maxsteps = 1000L)
cvgen <- genlasso::cv.trendfilter(gen, k = 3)
tf <- trendfilter::cv_trendfilter(x$cases, x$tv, nfolds = 3L) # chosen to match
# both minimizers are at the minimum lambda
plot(cvgen)
plot(tf)
# convert to growth rates using lambda_min
genf <- gen$beta[,cvgen$i.min]
tff <- predict(tf, which_lambda = "lambda_min")
dgenf <- diff(genf) / diff(x$tv)
dgenf <- c(dgenf, dgenf[length(dgenf)])
dtff <- diff(tff) / diff(x$tv)
dtff <- c(dtff, dtff[length(dtff)])
plot(dgenf / genf, ty = "l", col = 1)
lines(dtff / tff, col = 2) |
Nice catches! I do think the current plot looks pretty weird; part of that might be under-regularization; part might be the degree of the polynomial.
I think in the short term, given that upstream changes may need to be made, we should decide among: A. tweaking this text to be more accurate:
B. maybe trying putting 1se as default, seeing if it makes things look better, and remembering to revisit if something changes upstream / the lambda range changes (which might not only change the "min" selection but also "1se"). Conceptually I like 1se's motivation as a default, but I'm afraid there will be even more weird situations than with min, & we're already hitting one with min. Rather than worry about them all now, we could just go with whatever makes this example look better & try to iterate if we/users encounter issues later. C. both. |
I'm fine with A, but against B (and C). Based on the above investigation, B won't have an effect here (it gives the same result as lambda min). CV is not a great estimator for this data, at least implemented as leave 1 fold out and folds are set as every vth observation. The defaults should remain. We should be thinking of the previously "nice smooth result" as nothing more than an artefact that conveniently looked how we wanted. Along the lines of running gradient descent for only a few iterations and saying, "yeah, I stopped way early, but I like the answer!". |
) | ||
|
||
# other spline args give output (correctness not checked) | ||
z <- rnorm(30) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think we might be running into some nondeterminism in tests, here & in another similar line in this file. Suggest updating these to something like
z <- withr::with_rng_version("3.5.0", withr::with_seed(7164574, rnorm(30)))
but this doesn't seem to trigger the index bounds warning; I'm going to browse around some seeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI hits the bounds warning even earlier, with the test on line 23. There's no randomness there, so maybe better to poke at that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following one of these deterministic problematic lines down to the first problematic interp call, we arrive at this example:
dspline:::rcpp_dspline_interp(v = c(0.112401954925076, 0.224803909850205, 0.337205864775255,
0.449607819700357, 0.562009774625446, 0.674411729550536, 0.786813684475624,
0.899215639400714, 1.0116175943258, 1.12401954925089, 1.23642150417598,
1.34882345910107, 1.46122541402616, 1.57362736895125, 1.68602932387634,
1.79843127880143, 1.91083323372652, 2.02323518865161, 2.13563714357669,
2.24803909850179, 2.47284300835196, 2.58524496327706, 2.69764691820214,
2.81004887312723, 2.92245082805232, 3.03485278297741, 3.14725473790251,
3.25965669282756, 3.37205864775269), k = 3L, xd = c(1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
22, 23, 24, 25, 26, 27, 28, 29, 30), x = c(1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30), implicit = TRUE)
(from something like
withr::with_options(list(warn = 2, warnPartialMatchDollar = FALSE), {
trace(dspline:::rcpp_dspline_interp, tracer = quote({
.GlobalEnv[["problematic_args"]] <- as.list(environment())
}))
try(growth_rate(y = c(1:20, NA, 22:30), method = "trend_filter"))
untrace(dspline:::rcpp_dspline_interp)
dput(problematic_args)
})
)
I haven't done any tracing in dspline Rcpp code to figure out what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though the culprit seems likely to be how NAs are being omitted before being passed to the rcpp function, supported by:
Warning message:
In dspline:::rcpp_dspline_interp(v = c(0.112401954925076, 0.224803909850205, :
subscript out of bounds (index 29 >= vector size 29)
29 seems to be a 0-based index here, and
problematic_args$xd %>% length()
#> [1] 29
problematic_args$x %>% length()
#> [1] 30
suggests that it might be assuming xd
and x
have the same length and trying to index xd
all the way through length(x)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my understanding of that function, as long as the xd
contains the range of x
there shouldn't be any problem. I can't see where the issue would be in the Cpp code. I'd suggest making an issue in dspline
and tagging Ryan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Maybe I got mixed up about line numbers for the two warnings and there isn't a problem reproducing either warning. NA testing seems up above seems like much more of a prime suspect for the length check vs. some randomness down here being the an issue. On the fence about setting the RNG vs. not now; setting makes debugging easier, not setting makes detection more reliable in a sense, as we will be doing many draws over many test runs.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the process of filing issue upstream, I think I might have found a way to address: try implicit = FALSE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or alternatively by only querying non-design points with implicit = TRUE
, then recombining with the design points manually. (Upstream issue here.)
Checklist
Please:
PR).
brookslogan, nmdefries.
DESCRIPTION
. Always incrementthe patch version number (the third number), unless you are making a
release PR from dev to main, in which case increment the minor version
number (the second number).
(backwards-incompatible changes to the documented interface) are noted.
Collect the changes under the next release number (e.g. if you are on
1.7.2, then write your changes under the 1.8 heading).
process.
Change explanations for reviewer
This completes the process of removing the
genlasso
dependency (because it required a C compiler).glmgen/trendfilter
.growth_rate()
to be a bit less clunky. It could (and often did) return vectors of unexpected lengths (due to NAs or duplicated x-values). This does not play nice withmutate()
which was the main example use case.Note that this does entail breaking changes.
Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch
glmgen
in growth rate function when ready #57