consider changing data logic #174

Open
sgratzl opened this issue Sep 18, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@sgratzl
Member

sgratzl commented Sep 18, 2021

At the moment, the app loads all the data stored in multiple files (cases, deaths, hospitalizations, each for the US and for states), and one of the first steps is then to filter it again by targetVariable (cases, deaths, ...), scoreType, and location.

One option would be to load only the data that is actually needed and to split it into more files (targetVariable x score x location (nation or states)). This would reduce the initial loading time, and with https://shiny.rstudio.com/reference/shiny/1.6.0/bindCache.html Shiny could take care of caching the datasets.
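The split-and-load idea above can be sketched in a language-agnostic way. The following is a minimal Python illustration, not the app's R/Shiny code: the directory layout, the column names, the score name `wis`, the sample rows, and the helpers `write_split` and `load_scores` are all assumptions made up for this example, and `functools.lru_cache` only stands in for the role `bindCache` would play in Shiny.

```python
import csv
import tempfile
from functools import lru_cache
from pathlib import Path

# Hypothetical layout: <root>/<target_variable>/<score_type>/<geo>.csv
DATA_ROOT = Path(tempfile.mkdtemp())

# Made-up rows, only so there is something to split and load.
SAMPLE_ROWS = [
    {"target_variable": "cases", "score_type": "wis", "geo": "state", "location": "pa", "value": "1.5"},
    {"target_variable": "cases", "score_type": "wis", "geo": "state", "location": "ny", "value": "2.0"},
    {"target_variable": "cases", "score_type": "wis", "geo": "nation", "location": "us", "value": "9.0"},
    {"target_variable": "deaths", "score_type": "wis", "geo": "state", "location": "pa", "value": "0.3"},
]


def write_split(root, rows):
    """Write one CSV per (target_variable, score_type, geo) combination."""
    groups = {}
    for row in rows:
        key = (row["target_variable"], row["score_type"], row["geo"])
        groups.setdefault(key, []).append(row)
    for (target, score, geo), group in groups.items():
        path = root / target / score / f"{geo}.csv"
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)


@lru_cache(maxsize=None)
def load_scores(target, score, geo):
    """Read only the file for the requested combination; repeat calls hit the cache."""
    path = DATA_ROOT / target / score / f"{geo}.csv"
    with path.open(newline="") as f:
        return tuple(csv.DictReader(f))


write_split(DATA_ROOT, SAMPLE_ROWS)
```

Calling `load_scores("cases", "wis", "state")` reads a single small file the first time and returns the cached result afterwards, so no row-level filtering by target variable, score, or geo type is needed after loading.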

@krivard
Contributor

krivard commented Sep 22, 2021

@ryantibs this is a candidate improvement to complete before the October 5 meeting

@ryantibs
Member

Thanks @krivard. And sounds like a good idea, @sgratzl

What happened to the idea of reading data from disk instead of from the S3 bucket? I remember @nmdefries mentioning that it's actually slow to read from the S3 bucket. And then I suggested we just download a local clone of all the data each Monday that we can read from for the dashboard. Has that been implemented and does it lead to speed improvements?

@nmdefries
Collaborator

I've implemented caching for the score data in #169 that only loads the score data twice a week, after each pipeline run. Every other user besides the first one following a pipeline run will be reading the scores from memory. Releasing the caching change is waiting on another PR -- let me go poke some people.
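As a rough illustration of the behavior described above (load once after each pipeline run, then serve every later user from memory), here is a minimal Python sketch. It is not the code from #169: the class `ScoreCache`, its `loader` and `current_run_id` callbacks, and the run identifiers are all hypothetical names invented for this example.

```python
class ScoreCache:
    """Keep scores in memory and reload only when a new pipeline run is seen."""

    def __init__(self, loader, current_run_id):
        self._loader = loader                  # hypothetical: reads the score data
        self._current_run_id = current_run_id  # hypothetical: identifies the latest run
        self._cached_run_id = None
        self._scores = None
        self.load_count = 0                    # counts slow loads, for illustration

    def get(self):
        run_id = self._current_run_id()
        if self._scores is None or run_id != self._cached_run_id:
            # First request after a pipeline run pays the loading cost.
            self._scores = self._loader()
            self._cached_run_id = run_id
            self.load_count += 1
        # Every other request is answered from memory.
        return self._scores


# Made-up run marker standing in for "the pipeline has run again".
run = {"id": "run-1"}
cache = ScoreCache(
    loader=lambda: {"loaded_for": run["id"]},
    current_run_id=lambda: run["id"],
)

first = cache.get()    # loads
second = cache.get()   # served from memory
run["id"] = "run-2"    # a new pipeline run happens
third = cache.get()    # loads again
```

In this sketch only the first request after each change of the run marker triggers a load; how the real app detects a new pipeline run is not stated in this thread.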

@nmdefries
Collaborator

nmdefries commented Apr 17, 2023

This behavior has been added for target variables, but geo types are still combined because splitting them would provide only a small additional benefit. Most of the time we're handling state forecasts, both when a single state is requested and when we're summarizing across them; separating out US data only saves effort ~1/60 of the time.

A full fix here could be to load and store all target variable x forecaster x geo (x date?) combinations separately, so that we load minimal data at each step and select chunks by directory name (in a hierarchical file structure) or element name (in a list) instead of filtering data by row. That should be a lot faster.
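The difference between filtering by row and selecting a chunk by element name can be sketched as follows. This is a minimal Python illustration with a nested dict standing in for the named list; the column names, forecaster names, sample rows, and the helpers `filter_by_row` and `build_index` are assumptions made up for the example, not the app's data model.

```python
# Made-up rows; the column names are assumptions for illustration only.
ROWS = [
    {"target_variable": "cases", "forecaster": "forecaster_a", "geo_type": "state", "location": "pa"},
    {"target_variable": "cases", "forecaster": "forecaster_a", "geo_type": "state", "location": "ny"},
    {"target_variable": "cases", "forecaster": "forecaster_a", "geo_type": "nation", "location": "us"},
    {"target_variable": "cases", "forecaster": "forecaster_b", "geo_type": "state", "location": "pa"},
    {"target_variable": "deaths", "forecaster": "forecaster_a", "geo_type": "state", "location": "pa"},
]


def filter_by_row(rows, target, forecaster, geo):
    """Row filtering: every request scans all rows."""
    return [
        r for r in rows
        if r["target_variable"] == target
        and r["forecaster"] == forecaster
        and r["geo_type"] == geo
    ]


def build_index(rows):
    """Group rows once into index[target][forecaster][geo] -> list of rows."""
    index = {}
    for row in rows:
        by_forecaster = index.setdefault(row["target_variable"], {})
        by_geo = by_forecaster.setdefault(row["forecaster"], {})
        by_geo.setdefault(row["geo_type"], []).append(row)
    return index


INDEX = build_index(ROWS)

# Selecting by name is three key lookups, independent of the total row count.
chunk = INDEX["cases"]["forecaster_a"]["state"]
```

Both approaches return the same rows; the index pays the grouping cost once up front, after which each request is a few lookups rather than a scan over the full table. A hierarchical file structure would work the same way, with directory names in place of dict keys.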
