An overview of potential avenues for performance enhancement #139
fBedecarrats started this conversation in Ideas
-
Thanks for starting this discussion and for the valuable ideas for improving performance! Two additional points come to mind:
-
Hi!

# Reproducible example: GFW tree-cover and loss statistics for Brazil on a
# regular grid, using VRTs over the tiles downloaded by mapme.biodiversity.
library(sf)
library(terra)
library(exactextractr)
library(mapme.biodiversity)

# Municipality-level boundaries for Brazil (GADM level 2)
brazil <- st_as_sf(raster::getData("GADM", country = "BRA", level = 2))
brazil <- st_cast(brazil, "POLYGON")

data_dir <- "./brazil-data"
dir.create(data_dir, showWarnings = FALSE)

# Let mapme.biodiversity download the GFW treecover and lossyear tiles
aoi <- init_portfolio(brazil, 2000:2021, outdir = data_dir)
aoi <- get_resources(aoi, c("gfw_treecover", "gfw_lossyear"))

# Build virtual rasters (VRTs) spanning the downloaded tiles
treecover_files <- list.files(data_dir, pattern = "treecover", full.names = TRUE, recursive = TRUE)
lossyear_files <- list.files(data_dir, pattern = "lossyear", full.names = TRUE, recursive = TRUE)
treecover_vrt <- vrt(grep("\\.tif$", treecover_files, value = TRUE))
lossyear_vrt <- vrt(grep("\\.tif$", lossyear_files, value = TRUE))

# Regular 0.1-degree grid over Brazil as zones for the extraction
grid <- st_make_grid(brazil, cellsize = c(0.1, 0.1)) |> st_as_sf()
grid$ID <- seq_len(nrow(grid))

gfw <- c(treecover_vrt, lossyear_vrt)
names(gfw) <- c("treecover", "lossyear")

# Per grid cell: remaining tree-cover area and annual loss, 2000-2021
gfw_stats <- exact_extract(gfw, grid, function(data, cover) {
  # keep pixels above the canopy-cover threshold, weight area by coverage
  data <- data[data$treecover > cover, ]
  data$area <- data$area * data$coverage_fraction
  loss_sum <- by(data$area, data$lossyear, sum)
  result <- data.frame(
    year = 0:21,
    area = sum(data$area),
    loss = 0
  )
  year <- as.numeric(names(loss_sum))
  value <- as.numeric(loss_sum)
  result$loss[year + 1] <- value
  result$loss[1] <- 0 # lossyear == 0 means "no loss", not loss in year 2000
  # remaining tree-cover area after subtracting cumulative loss
  result$area <- result$area - cumsum(result$loss)
  result
}, cover = 30, include_area = TRUE, summarize_df = TRUE, append_cols = "ID")
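A possible follow-up to the snippet above (a minimal sketch, assuming the grid and gfw_stats objects created there): since the ID column is appended via append_cols, the per-cell results can be joined back to the grid geometries and aggregated, for example per year.

# Join the per-cell statistics back to the grid geometries via ID;
# merge() on an sf object keeps the geometry column
grid_stats <- merge(grid, gfw_stats, by = "ID")

# Example aggregation: remaining tree-cover area per year across all cells
# (area is in square meters, as returned by exactextractr's include_area)
area_by_year <- aggregate(area ~ year, data = st_drop_geometry(grid_stats), FUN = sum)
head(area_by_year)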
-
Background: Currently the package offers an intelligible and coherent syntax for geodata manipulation, state-of-the-art processing methods, and a well-documented, guided process that avoids common mistakes in data preparation. It is great.
Processing performance is already good, but it reaches its limits for small-scale analyses over large areas. From my attempts, I have the impression that it is suitable (i.e. not taking days to process) for analyzing average-sized areas of interest (>= 5 km²?) at continental scale, or small areas of interest at national/regional scale. However, the performance levels it achieves might not be sufficient to process small-scale areas of interest (<= 1 km²?) at continental or global scale.
Some use cases, e.g. the replication of Wolf et al. (2021) or the computation of statistics for all protected areas (PAs) in the world, could require (or benefit from) better performance.
I propose to dedicate this discussion thread to identifying possible avenues for enhancing processing performance.
As a prerequisite, a few actions could help us share a common language and understanding of what this is about:
Then, different complementary avenues could be explored to enhance performance:
Any ideas on other options to consider and/or comments on these?
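One avenue that may be worth sketching concretely is chunked parallel extraction, building on the grid/VRT approach from the reply above. This is a minimal sketch, not package functionality: the chunk size of 1000 cells, the use of parallel::mclapply (forking, so not available on Windows), and the number of cores are illustrative assumptions. The VRTs are rebuilt inside each worker from the tile paths, because terra SpatRaster objects hold external pointers that do not transfer reliably across processes, whereas character vectors of file paths do.

library(parallel)

# file paths (plain character vectors) are safe to pass to workers
tc_tifs <- grep("\\.tif$", treecover_files, value = TRUE)
ly_tifs <- grep("\\.tif$", lossyear_files, value = TRUE)

# split the grid rows into consecutive chunks of 1000 cells (illustrative)
chunks <- split(seq_len(nrow(grid)), ceiling(seq_len(nrow(grid)) / 1000))

gfw_stats_list <- mclapply(chunks, function(idx) {
  # rebuild the VRT stack inside the worker instead of passing SpatRasters
  gfw <- c(vrt(tc_tifs), vrt(ly_tifs))
  names(gfw) <- c("treecover", "lossyear")
  exact_extract(gfw, grid[idx, ], function(data, cover) {
    data <- data[data$treecover > cover, ]
    data$area <- data$area * data$coverage_fraction
    data.frame(treecover_area = sum(data$area))
  }, cover = 30, include_area = TRUE, summarize_df = TRUE,
  append_cols = "ID", progress = FALSE)
}, mc.cores = 4)

gfw_stats_parallel <- do.call(rbind, gfw_stats_list)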