Update CBS processes - to store and load using S3 #17

Open
atalyaalon opened this issue Dec 18, 2021 · 4 comments
atalyaalon commented Dec 18, 2021

Remove the use of local storage in the CBS ETL for processed files, and use only S3 for the most up-to-date CBS data:

  • process-files - files from this process should be uploaded to S3, using the load_start_year parameter. The default should be current year - 1, hence in 2021 we'll load 2020 and 2021 if data exists (for example, in the first month of 2022 we won't have 2022 data yet, but we don't want the process to fail). See the sketch below.
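
A minimal sketch of that default-year logic, assuming a hypothetical helper name (`get_load_years` is illustrative, not the actual anyway-etl API):

```python
from datetime import datetime

def get_load_years(load_start_year=None):
    """Return the list of years to load, defaulting to current year - 1.

    E.g. in 2021 this yields [2020, 2021]; in January 2022 it yields
    [2021, 2022], and missing 2022 files should be skipped, not fail.
    """
    current_year = datetime.now().year
    if load_start_year is None:
        load_start_year = current_year - 1
    return list(range(load_start_year, current_year + 1))
```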

Hence the flow can be as follows:
The import-emails process can run every day to check for new emails and update the CBS files in S3. We can add a DB table that tracks the latest email ID / timestamp loaded for each provider code and year.
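
A minimal sketch of such a tracking table using SQLAlchemy (the table and column names are assumptions for illustration, not an existing schema):

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# Hypothetical table: one row per (provider_code, year) recording the
# latest email already loaded, so the daily run only fetches newer ones.
cbs_email_tracking = sa.Table(
    "cbs_email_tracking",
    metadata,
    sa.Column("provider_code", sa.Integer, primary_key=True),
    sa.Column("year", sa.Integer, primary_key=True),
    sa.Column("last_email_id", sa.String(128), nullable=False),
    sa.Column("last_email_timestamp", sa.DateTime, nullable=False),
)
```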

If one or more new emails are identified, process-files is triggered to save the data in S3, and then parsing and the subsequent data-loading processes are triggered using the minimum year from process-files, loading the data from S3 first (see https://github.com/hasadna/anyway/blob/dev/anyway/parsers/cbs/executor.py in the anyway repo).
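
A minimal sketch of the S3 side of this flow using boto3 (the bucket name and key layout are assumptions; the real anyway-etl configuration may differ):

```python
import os
import boto3

BUCKET = "anyway-cbs"  # hypothetical bucket name

def upload_processed_file(local_path, provider_code, year, filename):
    """Upload one processed CBS file under a per-provider/per-year prefix."""
    s3 = boto3.client("s3")
    key = f"cbs/processed/{provider_code}/{year}/{filename}"
    s3.upload_file(local_path, BUCKET, key)

def download_year(provider_code, year, dest_dir):
    """Download all processed files for one provider/year before parsing."""
    s3 = boto3.client("s3")
    os.makedirs(dest_dir, exist_ok=True)
    prefix = f"cbs/processed/{provider_code}/{year}/"
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            dest = os.path.join(dest_dir, os.path.basename(obj["Key"]))
            s3.download_file(BUCKET, obj["Key"], dest)
```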

Parsing and the subsequent data-loading processes can also be triggered by a separate ETL, unrelated to import emails (since the data is loaded from S3), using load_start_year (which can be any year).
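
A minimal sketch of such a standalone ETL entry point, reusing the helpers sketched above (the provider codes and local paths are illustrative):

```python
def run_parsing_etl(load_start_year=None):
    """Hypothetical standalone ETL: pull processed files from S3 and parse
    them, independently of the import-emails process."""
    for year in get_load_years(load_start_year):  # helper sketched above
        for provider_code in (1, 3):              # illustrative CBS provider codes
            download_year(provider_code, year,
                          dest_dir=f"/tmp/cbs/{provider_code}/{year}")
    # ...then run the anyway CBS parsing code on the downloaded files
```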

@OriHoch if you have another suggestion for the flow, let me know.
The important thing is that the consistent storage location for files after processing is S3 and not local storage (see here), and that this S3 repository can be trusted as the most up-to-date source for data loading.

OriHoch commented Dec 19, 2021

This greatly complicates the processes, slows them down, and makes local development harder.
Can you explain why this is needed?

OriHoch commented Dec 19, 2021

Regarding process-files: what you describe can be done with local files (I think it may already work that way; I need to check), so that should be in a separate issue; it's not related to S3.

OriHoch commented Dec 19, 2021

Let's keep this issue only about changing from local files to S3; everything else is unrelated and already works the way you describe.

OriHoch commented Dec 21, 2021

assigning to @atalyaalon to answer the previous comments

OriHoch assigned atalyaalon and unassigned OriHoch on Dec 21, 2021