Data Archive Considerations
This page highlights some initial considerations around archiving data in the vAirify platform, including some very rough calculations to decide whether concentrating on archiving is worth the effort.
As the vAirify platform continues to gather data through daily runs of the ETL processes, the overall size of both the database and the pre-processed data texture volumes will continue to grow. Currently there is no upper limit on this growth.
The forecast ETL runs twice a day; the in-situ data ETL runs once an hour. For the purposes of this estimate, logs are ignored.
To get a rough estimate of how quickly our data stores are likely to grow, I first cleared down the three main database tables and all local data textures, then reran the ETLs to repopulate them. I then removed any data from the current and previous days, as these may not have represented complete datasets. In effect, the only data stored covered a 5-day period, with the following document counts:
| Date | Forecast documents | In situ documents | Data texture documents |
|---|---|---|---|
| 1st Aug | 12546 | 32708 | 42 |
| 2nd Aug | 12546 | 34019 | 42 |
| 3rd Aug | 12546 | 34793 | 42 |
| 4th Aug | 12546 | 34358 | 42 |
| 5th Aug | 12546 | 33880 | 42 |
According to MongoDB, the storage sizes of these databases were:
| Database | Storage size |
|---|---|
| forecast_data | 5.39 MB |
| in_situ_data | 12.88 MB |
| data_textures | 28.67 kB |
This translates to roughly 3.7 MB a day (about 18.3 MB in total across the 5-day period).
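The daily database growth figure can be reproduced from the storage table above. A minimal sketch, using the measured sizes as inputs and the 5-day window described earlier:

```python
# Per-database storage sizes (MB), as reported by MongoDB's dbStats output.
sizes_mb = {
    "forecast_data": 5.39,
    "in_situ_data": 12.88,
    "data_textures": 0.03,  # 28.67 kB, rounded up
}

DAYS_MEASURED = 5  # complete days of data retained after the clear-down

total_mb = sum(sizes_mb.values())
daily_mb = total_mb / DAYS_MEASURED
print(f"{total_mb:.2f} MB over {DAYS_MEASURED} days -> {daily_mb:.1f} MB/day")
# prints "18.30 MB over 5 days -> 3.7 MB/day"
```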
In addition to the database tables, we have the data textures themselves, stored separately on disk. On my local machine these took up 221 MB overall, or 22.1 MB a day.
Combining these gives a very rough estimate of 25.8 MB added daily by our processes.
Given that the Linux box has 200 GB of storage, if we were (very) cautious we could allocate 100 GB to data storage, which would take (100 × 1000) / 25.8 ≈ 3,876 days, or just over 10.5 years, to fill.
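Putting the rough numbers together, the back-of-the-envelope capacity calculation can be sketched as follows (the growth rates are the estimates above, not measured limits):

```python
DB_GROWTH_MB_PER_DAY = 3.7        # MongoDB collections (estimate above)
TEXTURE_GROWTH_MB_PER_DAY = 22.1  # data texture files on disk (estimate above)
ALLOCATED_GB = 100                # cautious half of the 200 GB Linux box

daily_mb = DB_GROWTH_MB_PER_DAY + TEXTURE_GROWTH_MB_PER_DAY
days_to_fill = ALLOCATED_GB * 1000 / daily_mb  # GB -> MB, then divide by daily rate

print(f"{daily_mb:.1f} MB/day -> {days_to_fill:.0f} days (~{days_to_fill / 365:.1f} years)")
# prints "25.8 MB/day -> 3876 days (~10.6 years)"
```

Note that this uses 1 GB = 1000 MB, matching the calculation in the text; at this level of precision the distinction from 1024-based units is immaterial.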
It should be noted that these calculations are VERY high level and rough.