
Speed up aggregate data #440

Merged
merged 3 commits into from
Oct 2, 2024

Conversation

Contributor

@M-Kusumgar M-Kusumgar commented Sep 6, 2024

This branch tries (and ultimately fails) to speed up the review inputs metadata endpoint in Hintr. Profiling revealed that these Naomi functions were the slow parts. After optimisation they are many times faster on the test data and the MWI data. Where we ultimately fail is DRC, which sees only a slight speed-up because most of its time is spent calculating the other plot types (`art_new_total`, `art_new_adult_m`, etc.). We have a solution for that, which will probably land in a PR after the plot refactor. For now, the main changes in this PR:

  • `drop_geometry` arg to aggregate functions to skip unnecessary work (these functions were trying to drop geometry after it had already been dropped, wasting a lot of time re-reading the files)
  • small test fix: we should be checking `check3` (defined in the lines just above the diff), not `check1`
  • new algorithm for aggregating up data: a for loop that works level by level to aggregate up the data
  • the same new algorithm is also applied to calculating `missing_map`
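An illustrative sketch of the level-by-level aggregation idea (this is not the Naomi code; the function and column names here are hypothetical, and it assumes the input data sits entirely at the finest area level, with hierarchy columns `area_id`, `parent_area_id` and `area_level`):

```r
## Sketch only: repeatedly roll values up one level of the area hierarchy.
## Each pass re-labels rows with their parent area and sums within it.
aggregate_up <- function(data, hierarchy) {
  max_level <- max(hierarchy$area_level)
  parents <- hierarchy[, c("area_id", "parent_area_id")]
  out <- list(data)
  current <- data
  for (level in seq(max_level, 1)) {
    current <- merge(current, parents, by = "area_id")
    current$area_id <- current$parent_area_id
    current <- stats::aggregate(
      art_current ~ area_id + sex + age_group + time_period,
      data = current, FUN = sum)
    out[[length(out) + 1]] <- current
  }
  do.call(rbind, out)
}
```

Working one level at a time means each row is merged against the parent lookup once per level, rather than resolving every ancestor for every row.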

@M-Kusumgar M-Kusumgar changed the title This branch tries to (and ultimately fails) speed up the review input… Speed up aggregate data Sep 9, 2024
@r-ash
Contributor

r-ash commented Sep 10, 2024

Hintr PR hivtools/hintr#524

Contributor

@r-ash r-ash left a comment


This looks good to me, and hintr tests are passing with this branch, so looking good. I think you can ignore that devl branch failure; for some reason it is running really slowly, but I don't think it's raising any real issues.

Really need to sort out this duckdb install stuff; not sure why it is not working from binary. (Maybe we'll have to run these tests inside Docker again if I can't get that install working.)

##'
##' @return Aggregated ART data containing columns area_id, area_name,
##' area_level, area_level_label, parent_area_id, sex, age_group, time_period,
##' year, quarter, calendar_quarter and art_current
##'
##' @export
aggregate_art <- function(art, shape) {
aggregate_art <- function(art, shape, drop_geometry = TRUE) {
Contributor


I don't love passing this through as an arg; does this definitely make it faster? I ran a quick benchmark, and it looks like dropping the geometry itself is very fast.

```r
mwi_file <- "tests/testthat/testdata/malawi.geojson"
mwi_sf <- sf::read_sf(mwi_file)
bench::mark(
  sf::read_sf(mwi_file),
  sf::read_sf(mwi_file) |> sf::st_drop_geometry(),
  sf::st_drop_geometry(mwi_sf),
  check = FALSE
)
#> # A tibble: 3 × 13
#>   expression                                       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory             time       gc
#>   <bch:expr>                                  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>             <list>     <list>
#> 1 sf::read_sf(mwi_file)                         37.9ms   39.6ms      25.2     1.1MB     0       13     0      516ms <NULL> <Rprofmem>         <bench_tm> <tibble>
#> 2 sf::st_drop_geometry(sf::read_sf(mwi_file))     42ms     44ms      22.7     1.1MB     0       12     0      529ms <NULL> <Rprofmem>         <bench_tm> <tibble>
#> 3 sf::st_drop_geometry(mwi_sf)                    46µs   56.1µs   14715.       656B     2.14  6874     1      467ms <NULL> <Rprofmem [2 × 3]> <bench_tm> <tibble>
```

> these functions were trying to drop geometry after it had already been dropped so wasting a lot of time reading the files

Do you have a profile that shows that? The code looked like, if we passed in an sf object, it wouldn't be reading anything?

Contributor


If it does make a big difference to skip the `drop_geometry`, I wonder if we could instead check whether the areas object has a geometry, and not attempt to remove it if it doesn't?
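The suggestion above could look something like this (a hypothetical helper, not code from the PR): only drop geometry when the object actually carries it, so callers never need a `drop_geometry` argument.

```r
## Sketch of the suggested alternative: sf objects get their geometry
## dropped, anything else passes through untouched.
drop_geometry_if_present <- function(areas) {
  if (inherits(areas, "sf")) {
    sf::st_drop_geometry(areas)
  } else {
    areas
  }
}
```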

Contributor Author


Ah okay, looking at it properly: we were passing `shape` into this function, not `areas` (`areas` is the result of reading the file with geometry dropped), so I can remove this arg and pass in `areas` only.

Contributor Author


Just added a check: if it's already a table, there's no point reading it again.
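That check could be sketched like this (hypothetical helper name, not the actual diff): read from disk only when given a file path, and reuse the object when it is already a table.

```r
## Sketch: avoid re-reading the shape file when a data frame / sf
## object was already passed in.
read_areas <- function(shape) {
  if (is.character(shape)) {
    sf::read_sf(shape)
  } else {
    shape
  }
}
```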

@@ -152,7 +152,7 @@ test_that("ART data can be aggregated when avalible at different admin levels",
dplyr::group_by(area_level_label) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop")

expect_equal(check1$n, c(10, 10, 6))
expect_equal(check3$n, c(10, 10, 6))
Contributor


Good spot!

Contributor

@r-ash r-ash left a comment


This looks good to me. I'd like @rtesra to check the data manipulation before we merge, but it's all fine from my end.

@M-Kusumgar M-Kusumgar merged commit 3598dd6 into master Oct 2, 2024
7 of 8 checks passed