-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up aggregate data #440
Conversation
…s metadata endpoint in Hintr. Profiling the code revealed these Naomi functions were the slow parts. After optimising these they are many times faster on the test data and the MWI data however where we ultimately fail is DRC for which there is only a slight speed up because the majority of the time for it is spent calculating the other plot types like art_new_total, art_new_adult_m, etc. We have a solution for this that will probably get done in a PR after the plot refactor. For now, the main changes in this PR: * drop_geometry arg to aggregate functions to skip unnecessary work (these functions were trying to drop geometry after it had already been dropped so wasting a lot of time reading the files) * small test fix, we should be checking `check3` (defined in the lines just above the diff) and not `check1` * new algorithm for aggregating up data, it is just a for loop that works level by level to aggregate up the data * same new algorithm is also applied to calculating `missing_map`
Hintr PR hivtools/hintr#524 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, and hintr tests are passing with this branch so looking good. I think you can ignore that devl branch failure for some reason it is running really slowly but I don't think raising any real issues.
Really need to sort out this duckdb install stuff, not sure why it is not working from binary. (Maybe we'll have to run these tests inside docker again if I can't get that install working)
R/input-time-series.R
Outdated
##' | ||
##' @return Aggregated ART data containing columns area_id, area_name, | ||
##' area_level, area_level_label, parent_area_id, sex, age_group, time_period, | ||
##' year, quarter,calendar_quarter and art_current | ||
##' | ||
##' @export | ||
aggregate_art <- function(art, shape) { | ||
aggregate_art <- function(art, shape, drop_geometry = TRUE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love passing this through as an arg, does this definitely make it faster? Ran a quick benchmark and it looks like dropping the geometry itself is very fast.
mwi_file <- "tests/testthat/testdata/malawi.geojson"
mwi_sf <- sf::read_sf(mwi_file)
bench::mark(sf::read_sf(mwi_file), sf::read_sf(mwi_file) |> sf::st_drop_geometry(), sf::st_drop_geometry(mwi_sf), check = FALSE)
# A tibble: 3 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 sf::read_sf(mwi_file) 37.9ms 39.6ms 25.2 1.1MB 0 13 0 516ms <NULL> <Rprofmem> <bench_tm> <tibble>
2 sf::st_drop_geometry(sf::read_sf(mwi_file)) 42ms 44ms 22.7 1.1MB 0 12 0 529ms <NULL> <Rprofmem> <bench_tm> <tibble>
3 sf::st_drop_geometry(mwi_sf) 46µs 56.1µs 14715. 656B 2.14 6874 1 467ms <NULL> <Rprofmem [2 × 3]> <bench_tm> <tibble>
``
> these functions were trying to drop geometry after it had already been dropped so wasting a lot of time reading the files
Do you have a profile that shows that? Code looked like if we passed in an sf object it wouldn't be reading anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it does make a big difference to skip the drop_geometry
I wonder if we could instead check if the areas
has a geometry or not, and not attempt to remove it if it doesn't have it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah okay looking at it properly, we were passing in shape
to this function not areas
(areas
is result of reading the file and geometry dropped) so i can remove this arg and pass in areas
only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just added a check for if its a table then no point reading it again
@@ -152,7 +152,7 @@ test_that("ART data can be aggregated when avalible at different admin levels", | |||
dplyr::group_by(area_level_label) %>% | |||
dplyr::summarise(n = dplyr::n(), .groups = "drop") | |||
|
|||
expect_equal(check1$n, c(10, 10, 6)) | |||
expect_equal(check3$n, c(10, 10, 6)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I'd like @rtesra to check the data manipulation before we merge but all fine from my end.
This branch tries to (and ultimately fails) speed up the review inputs metadata endpoint in Hintr. Profiling the code revealed these Naomi functions were the slow parts. After optimising these they are many times faster on the test data and the MWI data however where we ultimately fail is DRC for which there is only a slight speed up because the majority of the time for it is spent calculating the other plot types like
art_new_total
,art_new_adult_m
, etc. We have a solution for this that will probably get done in a PR after the plot refactor. For now, the main changes in this PR:drop_geometry
arg to aggregate functions to skip unnecessary work (these functions were trying to drop geometry after it had already been dropped so wasting a lot of time reading the files)check3
(defined in the lines just above the diff) and notcheck1
missing_map