Remove the use of local storage in the CBS ETL for processed files - use only S3 as the most up-to-date source of CBS data:
process-files - files from this process should be uploaded to S3, using the load_start_year parameter. The default will be the current year minus 1, so in 2021 we will load 2020 and 2021 if the data exists (for example, in the first month of 2022 we won't have 2022 data yet, but we don't want the process to fail).
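A minimal sketch of that default, assuming the load_start_year parameter described above (the helper name and return shape are hypothetical, for illustration only):

```python
from datetime import datetime
from typing import List, Optional


def years_to_load(load_start_year: Optional[int] = None) -> List[int]:
    """Years whose processed CBS files should be uploaded to S3 (hypothetical helper)."""
    current_year = datetime.now().year
    if load_start_year is None:
        # Default to the previous year, so in early 2022, when no 2022 data
        # exists yet, we still load 2021 (and 2022 once it appears) without
        # failing the process.
        load_start_year = current_year - 1
    return list(range(load_start_year, current_year + 1))
```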
Hence the flow can be as follows:
import emails can run every day to check for new emails and update the CBS files in S3 - we can add a DB table that tracks the latest email id / timestamp loaded for each provider code and year (see the sketch after this list).
If one or more new emails are identified, process-files is triggered to save the data in S3, and then the parsing and subsequent data-loading processes (https://github.com/hasadna/anyway/blob/dev/anyway/parsers/cbs/executor.py in the anyway repo) are triggered using the minimum year from process-files, loading the data from S3 first.
The parsing and subsequent data-loading processes can also be triggered by a separate ETL, unrelated to import email (since the data is loaded from S3), using load_start_year (which can be any year).
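A rough sketch of how this flow could be wired together. The table name, the callables passed in, and their return shapes are all invented for illustration; only the overall structure follows this issue:

```python
from datetime import datetime

import sqlalchemy as sa
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class CBSEmailImportLog(Base):
    """Latest email imported per CBS provider code and year (hypothetical table)."""
    __tablename__ = "cbs_email_import_log"

    provider_code = sa.Column(sa.Integer, primary_key=True)
    year = sa.Column(sa.Integer, primary_key=True)
    latest_email_id = sa.Column(sa.String, nullable=False)
    latest_email_timestamp = sa.Column(sa.DateTime, nullable=False)


def daily_cbs_import(session, fetch_new_emails, process_files_to_s3, trigger_parsing):
    """Daily job: check for new CBS emails, refresh S3, then parse from S3.

    The three callables stand in for the real import-email, process-files and
    parsing steps; they are placeholders, not existing functions in the repo.
    """
    new_emails = fetch_new_emails(session)
    if not new_emails:
        return

    # process-files writes the extracted CBS files to S3 and reports which
    # (provider_code, year) combinations were refreshed, aligned with the emails.
    refreshed = process_files_to_s3(new_emails)

    for email, (provider_code, year) in zip(new_emails, refreshed):
        # Upsert the tracking row so the next run skips already-imported emails.
        session.merge(CBSEmailImportLog(
            provider_code=provider_code,
            year=year,
            latest_email_id=email.id,
            latest_email_timestamp=email.timestamp,
        ))
    session.commit()

    # Parsing and the follow-up loading read from S3 only, starting at the
    # minimum year that was just refreshed.
    trigger_parsing(load_start_year=min(year for _, year in refreshed))
```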
@OriHoch if you have another suggestion for the flow, let me know.
The important thing is that the consistent storage location for files after processing is S3 and not local storage (see here), and this S3 repository can be trusted as the most up-to-date source for data loading.
Regarding process-files: what you describe can be done with local files (I think it may already work that way; I need to check), so that should be a separate issue - it's not related to S3.