Update CBS processes - to store and load using S3 #17

Open
atalyaalon opened this issue Dec 18, 2021 · 4 comments
atalyaalon commented Dec 18, 2021

Remove the use of local storage in the CBS ETL for processed files, and use only S3 for the most up-to-date CBS data:

  • process-files - files from this process should be uploaded to S3, using the load_start_year parameter. The default should be current year - 1, hence in 2021 we'll load 2020 and 2021 if data exists (for example, in the first month of 2022 we won't have 2022 data yet, but we don't want the process to fail). See the sketch below.
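
A minimal sketch of that default-year logic, assuming a hypothetical helper name (`get_load_years` is illustrative, not the actual anyway-etl API):

```python
from datetime import datetime

def get_load_years(load_start_year=None):
    """Return the list of years to load, defaulting to current year - 1.

    E.g. in 2021 this yields [2020, 2021]; in January 2022 it yields
    [2021, 2022], and missing 2022 files should be skipped, not fail.
    """
    current_year = datetime.now().year
    if load_start_year is None:
        load_start_year = current_year - 1
    return list(range(load_start_year, current_year + 1))
```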

Hence the flow can be as follows:
The import-emails process can run every day to check for new emails and update the CBS files in S3. We can add a DB table that tracks the latest email ID / timestamp loaded for each provider code and year.
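
A minimal sketch of such a tracking table using SQLAlchemy (the table and column names are assumptions for illustration, not an existing schema):

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# Hypothetical table: one row per (provider_code, year) recording the
# latest email already loaded, so the daily run only fetches newer ones.
cbs_email_tracking = sa.Table(
    "cbs_email_tracking",
    metadata,
    sa.Column("provider_code", sa.Integer, primary_key=True),
    sa.Column("year", sa.Integer, primary_key=True),
    sa.Column("last_email_id", sa.String(128), nullable=False),
    sa.Column("last_email_timestamp", sa.DateTime, nullable=False),
)
```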

If one or more new emails are identified, process-files is triggered to save the data in S3, and then parsing and the subsequent data-loading processes are triggered using the minimum year from process-files, loading the data from S3 first (see https://github.com/hasadna/anyway/blob/dev/anyway/parsers/cbs/executor.py in the anyway repo).
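
A minimal sketch of the S3 side of this flow using boto3 (the bucket name and key layout are assumptions; the real anyway-etl configuration may differ):

```python
import os
import boto3

BUCKET = "anyway-cbs"  # hypothetical bucket name

def upload_processed_file(local_path, provider_code, year, filename):
    """Upload one processed CBS file under a per-provider/per-year prefix."""
    s3 = boto3.client("s3")
    key = f"cbs/processed/{provider_code}/{year}/{filename}"
    s3.upload_file(local_path, BUCKET, key)

def download_year(provider_code, year, dest_dir):
    """Download all processed files for one provider/year before parsing."""
    s3 = boto3.client("s3")
    os.makedirs(dest_dir, exist_ok=True)
    prefix = f"cbs/processed/{provider_code}/{year}/"
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            dest = os.path.join(dest_dir, os.path.basename(obj["Key"]))
            s3.download_file(BUCKET, obj["Key"], dest)
```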

Parsing and the subsequent data-loading processes can also be triggered by a separate ETL, unrelated to import emails (since the data is loaded from S3), using load_start_year (which can be any year).
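
A minimal sketch of such a standalone ETL entry point, reusing the helpers sketched above (the provider codes and local paths are illustrative):

```python
def run_parsing_etl(load_start_year=None):
    """Hypothetical standalone ETL: pull processed files from S3 and parse
    them, independently of the import-emails process."""
    for year in get_load_years(load_start_year):  # helper sketched above
        for provider_code in (1, 3):              # illustrative CBS provider codes
            download_year(provider_code, year,
                          dest_dir=f"/tmp/cbs/{provider_code}/{year}")
    # ...then run the anyway CBS parsing code on the downloaded files
```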

@OriHoch if you have another suggestion for the flow, let me know.
The important thing is that the consistent storage location for files after processing is S3 and not local storage (see here), and that this S3 repository can be trusted as the most up-to-date source for data loading.

OriHoch commented Dec 19, 2021

This greatly complicates the processes, slows them down, and makes local development harder.
Can you explain why this is needed?

OriHoch commented Dec 19, 2021

Regarding process-files: what you describe can be done with local files (I think it may already work that way; I need to check), so that should be in a separate issue; it's not related to S3.

OriHoch commented Dec 19, 2021

Let's keep this issue only about changing from local files to S3; everything else is unrelated and already works the way you describe.

OriHoch commented Dec 21, 2021

assigning to @atalyaalon to answer the previous comments

OriHoch assigned atalyaalon and unassigned OriHoch on Dec 21, 2021