Can we predict future winter average temperatures in the Northern Hemisphere one month in advance? Where are average temperatures more likely to be extreme? The challenge of seasonal forecasting is typically addressed with numerical simulations based on physics and empirical parametrization of sub-grid-cell processes. While widespread, this approach is computationally expensive and requires solid meteorological modeling knowledge. In contrast, we adopt here a purely statistical approach that is computationally cheap and relates temperature anomalies to spatial and temporal patterns of typical weather.
When starting this project, I had a few goals in mind:
- Set up a prototype for a winter hedge product, i.e. guess which meteorological stations will have maximal payouts in “Heating Degree Days” option-like weather certificates.
- Create a flexible technical environment that will serve as a testbed for several machine learning experiments.
- Illustrate how databases can efficiently be used in climate research.
The project consists of three phases:
- Data download and ingestion into MongoDB.
- Construction of the predictors.
- Seasonal prediction.
The project includes five directories, which are described below:
- pred: contains the Python classes necessary to download the station and reanalysis data.
- scripts: contains the scripts to be executed.
- env: contains the files necessary to create a virtual environment dedicated to this project.
- data: contains various files, e.g., configuration, plots, etc.
- dev: contains development files, e.g., Jupyter notebooks, etc.
- Per default, it is assumed that the user has access to a running MongoDB database service. Please review and modify the access configuration file at data/config.json. Access Control can be defined in Mongo shell following these instructions.
- All necessary Python packages can be installed in a pipenv virtual environment (venv). The Pipfile is located in env/Pipfile. In order to set up the venv:
  - Install pipenv.
  - Go to the 'env/' directory and execute pipenv install.
  - In order to activate the venv, execute pipenv shell from the env/ directory.
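As a rough illustration of how the access configuration might be consumed, the sketch below builds a MongoDB connection URI from a parsed configuration. The key names (host, port, user, password) are hypothetical: the actual layout of data/config.json is not shown here and may differ.

```python
import json
from urllib.parse import quote_plus


def mongo_uri(config: dict) -> str:
    """Build a MongoDB connection URI from a config dictionary.

    The keys used here ("host", "port", "user", "password") are assumptions
    for illustration; adapt them to the real content of data/config.json.
    """
    host = config.get("host", "localhost")
    port = int(config.get("port", 27017))
    user = config.get("user")
    password = config.get("password")
    if user and password:
        # Credentials must be percent-encoded inside a URI.
        return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"
    return f"mongodb://{host}:{port}/"


# Example with an in-memory config; in practice one would json.load() the file.
example = json.loads('{"host": "localhost", "port": 27017}')
print(mongo_uri(example))  # mongodb://localhost:27017/
```

The resulting string can then be handed to a MongoDB client such as pymongo.MongoClient.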
This work is based on the study of Wang et al. (2017). The authors have shown that autumn patterns of sea-ice concentration (SIC), stratospheric circulation (SC) and sea surface temperature (SST) are closely related to the winter North Atlantic Oscillation (NAO) index. Using linear regressions and Principal Component Analysis (PCA), I managed to reproduce the following central result of this study: principal component scores of SIC, SC and SST patterns explain roughly 57% of the variance of the average winter NAO index. Next, I have extended this methodology to the spatial scale of individual stations.
The following figures show the principal component patterns for sea-ice concentration (Figure 1, first loading), stratospheric circulation (Figure 2, second loading) and sea surface temperature (Figure 3, third loading). The combined amplitudes of these patterns are related to temperature anomalies in the Northern Hemisphere.
Figure 1: Leading principal component for sea-ice concentration (SIC) in autumn. This mode features patterns localized in the Barents and Kara Seas during the freezing season and explains 13.3% of SIC variability.
Figure 2: Second principal component of stratospheric circulation (Z70hPa). This mode exhibits a bipolar pattern over eastern Siberia and northern Canada and explains 9.8% of stratospheric circulation variability in autumn. Its positive phase is characterized by an eastward shift of the polar vortex.
Figure 3: Third principal component of sea surface temperature (SST). This mode shows a tri-polar pattern in the Northern Atlantic sector (a warm center in mid-latitudes and cold anomalies on the tropical and polar sides) and explains 5.2% of SST variability in autumn.
Quick summary of the datasets:
- Station dataset: monthly station measurements for average monthly temperature (i.e. our “ground truth”) come from the GHCN dataset. More details are given in the README_GHCN_MONTHLY.md file.
- Grid dataset: monthly sea-ice concentration, stratospheric circulation (Z70 hPa), sea surface temperature and other variables are provided by the ERA5T re-analysis dataset. More details are given in the README_ERA5T_MONTHLY.md file.
Assuming that a MongoDB instance is already running (per default locally) and that all required Python packages are installed (see Prerequisites), source and ingest (or update) both datasets by executing the following commands:
cd env/
pipenv shell
cd ..
python script/01_ghcn_monthly_feed.py
python script/02_era5T_feed.py
In MongoDB, the data is stored in the following four collections, spread across two databases:
Description | Database name | Collection name |
---|---|---|
GHCNM stations | GHCNM | stations |
GHCNM data | GHCNM | dat |
ERA5t grid | ERA5t | grid |
ERA5t data | ERA5t | dat |
For all details, check README_GHCN_MONTHLY.md
The GHCN database contains two collections: one recording the location and name of the stations, the other containing the time series of monthly average temperature (TAVG). A typical station document in MongoDB looks like this:
{'_id': ObjectId('...'),
'station_id': 12345,
'name': 'Zürich',
'loc': {'coordinates': [8.54, 47.38], 'type': 'Point'},
'country': 'Switzerland',
(…),
'wmo_id': 789}
A typical monthly station data document contains monthly observations and looks like:
{'_id': ObjectId('...'),
'station_id': 2345,
'variable': 'TAVG',
'year': 2017,
'1': 2.9,
'2': 1.7,
'3': 7.4,
(...)
'12': 2.1,
}
where "1", "2", "3", ..., "12" are the months for which average temperature (TAVG) is being reported.
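Since each document holds one year with the months as keys, a longer time series has to be reassembled from several documents. A minimal sketch, assuming documents shaped like the example above and assuming that a missing month simply appears as an absent key (the actual ingestion may encode gaps differently):

```python
from datetime import date


def to_time_series(docs):
    """Flatten yearly station documents into a sorted list of (date, value)."""
    series = []
    for doc in docs:
        for month in range(1, 13):
            value = doc.get(str(month))  # months are stored under keys "1".."12"
            if value is not None:
                series.append((date(doc["year"], month, 1), value))
    return sorted(series)


# Hypothetical sample documents, abbreviated to a few months.
docs = [
    {"station_id": 2345, "variable": "TAVG", "year": 2017, "1": 2.9, "2": 1.7},
    {"station_id": 2345, "variable": "TAVG", "year": 2016, "12": 2.1},
]
for day, tavg in to_time_series(docs):
    print(day.isoformat(), tavg)
```

Storing one document per station and year keeps the collection compact, at the cost of this small reshaping step whenever a continuous series is needed.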
For all details, check README_ERA5T_MONTHLY.md
The ERA5T database contains two collections: one containing the grid cell locations and a second collection containing the monthly time series. A typical grid document is spatially indexed and looks like this:
{
'_id': ObjectId('...'),
'id_grid': 1,
'loc': {
'coordinates': [-180.0, 90.0],
'type': 'Point'},
'lsm' : 0.0,
(...)
}
In addition to the location and the id_grid, several time-invariant parameters are also present, e.g., the land-sea mask (lsm). For a precise description of these parameters, see Table 1 at this link.
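Because the grid documents carry a GeoJSON point in loc, grid cells close to a given place can be selected with a geospatial filter. The sketch below only builds such a filter; the function name and the default search radius are assumptions, and the $near operator requires a 2dsphere index on loc.

```python
def nearest_grid_filter(lon, lat, max_distance_m=50_000):
    """Return a MongoDB filter selecting grid cells near (lon, lat).

    Uses the $near operator on the GeoJSON field 'loc', which requires a
    2dsphere index on that field. GeoJSON order is [longitude, latitude].
    """
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("lon must be in [-180, 180] and lat in [-90, 90]")
    return {
        "loc": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_distance_m,  # metres
            }
        }
    }


# Filter for grid cells within 50 km of Zürich (8.54 E, 47.38 N).
query = nearest_grid_filter(8.54, 47.38)
print(query)
```

With a MongoDB client, a filter of this kind would typically be passed to the find method of the grid collection, and results come back sorted from nearest to farthest.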
A typical re-analysis monthly data document is indexed on date and id_grid and looks like this:
{'_id': ObjectId('...'),
'date': datetime.datetime(1995, 1, 1, 0, 0),
'id_grid': 1,
'ci': 1.0,
'sp': 102342.02,
'sst': 271.46,
'z70': 168316.99}
The stored variables are:
Abbreviation | Variable name |
---|---|
ci | Sea-ice cover (0-1) |
sp | Surface pressure (Pa) |
sst | Sea surface temperature (K) |
z70 | Geopotential at 70 hPa (m²/s²) |
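The stored units are not always the most convenient ones for plotting or interpretation. Since z70 carries units of m²/s², it is a geopotential, and dividing it by standard gravity yields a height in metres. The helper functions below are hypothetical and not part of the repository:

```python
G0 = 9.80665  # standard gravity in m/s², used to convert geopotential to height


def kelvin_to_celsius(t_kelvin):
    """Convert a temperature from kelvin to degrees Celsius."""
    return t_kelvin - 273.15


def geopotential_to_height(phi):
    """Convert geopotential (m²/s²) to geopotential height (m)."""
    return phi / G0


# Values taken from the sample document above.
doc = {"sst": 271.46, "z70": 168316.99}
print(round(kelvin_to_celsius(doc["sst"]), 2))    # -1.69
print(round(geopotential_to_height(doc["z70"])))  # height in metres, roughly 17 km
```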
Important: You need to use ECMWF's API in order to download ERA5 data. Read the CDS API documentation for more details. In short, a (free) registration with Copernicus and the setup of ECMWF's API are required. Both steps are easily done. Read the README_ERA5T_MONTHLY.md file for more details.
We follow Wang et al. (2017) and perform a Principal Component Analysis (PCA) of several ERA-interim variables.
- era-int_pca_exploration.ipynb : exploration and visualization of the main modes of variability for SIC, SC, SST.
- winter_predictor.py : the class “Predictor” builds the predictor based on any ERA-interim variable.
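To make the PCA step concrete, here is a minimal sketch based on a singular value decomposition in NumPy. It is not the repository's Predictor class: the function name and the synthetic input are assumptions, and the area weighting of grid cells that is commonly applied on latitude-longitude grids is omitted for brevity.

```python
import numpy as np


def pca(anomalies, n_modes=3):
    """PCA of a (time, grid cell) matrix via SVD.

    Returns the PC scores (time, n_modes), the loadings (n_modes, grid cell)
    and the fraction of variance explained by each retained mode.
    """
    centered = anomalies - anomalies.mean(axis=0)  # remove the time mean per cell
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u[:, :n_modes] * s[:n_modes]
    return scores, vt[:n_modes], explained[:n_modes]


# Synthetic example: 40 autumns, 500 grid cells, one dominant spatial pattern.
rng = np.random.default_rng(0)
pattern = rng.normal(size=500)
amplitude = rng.normal(size=40)
field = 3.0 * np.outer(amplitude, pattern) + rng.normal(size=(40, 500))

scores, loadings, explained = pca(field)
print(scores.shape, loadings.shape)
print(np.round(explained, 3))
```

The scores are the time series that serve as regression covariates, while the loadings are the spatial patterns shown in the figures.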
The first step is to reproduce the central result of Wang et al. (2017) and predict the North Atlantic Oscillation (NAO) index using linear regression and feature selection.
Code:
- era-int_NAO_definition.ipynb : definition and calculation of NAO index time series.
- era-int_NAO_prediction.ipynb : lasso regression of NAO index based on all PC scores.
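The notebooks themselves are not reproduced here. As a rough sketch of what a lasso regression on PC scores does, the following implements plain coordinate descent in NumPy on synthetic data. The function name, the penalty value alpha and the data are assumptions for illustration; in practice a library implementation such as scikit-learn's Lasso would be the more natural choice.

```python
import numpy as np


def lasso_cd(X, y, alpha=0.1, n_iter=200):
    """Minimise (1/2n)·||y - Xb||² + alpha·||b||₁ by coordinate descent.

    X is assumed to have centered columns and y to be centered, so no
    intercept is fitted.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_norm = (X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the contribution of all features but j.
            r_j = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r_j / n
            # Soft-thresholding sets weak coefficients exactly to zero.
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_norm[j]
    return b


# Synthetic "PC scores": only the first two columns drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)
y -= y.mean()

coef = lasso_cd(X, y)
print(np.round(coef, 2))
```

The L1 penalty shrinks all coefficients and drops uninformative covariates entirely, which is what makes the lasso usable as a feature selection step when many PC scores are candidates.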
Next, we apply the same methodology to each individual station. We seek stations that are both predictable (i.e. high adjusted R² score) and expected to show large temperature anomalies in the next winter.
Code:
- winter_predictor.py : this is the main code of this project. It contains two classes: (i) a class “Predictor” that prepares covariate candidates for the regression and (ii) another class StationPrediction that performs the seasonal prediction at the station level.
- winter_predictor.ipynb : illustrates the use of the classes mentioned above.
- scan.py: this script scans all stations and spots those that both have high predictability (i.e. R²>0.5) and are expected to experience a large seasonal deviation.
- input.csv: contains the list of countries where the analysis is to be conducted.
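The selection logic of such a scan can be sketched as follows. This is not the content of scan.py: the record fields, the anomaly threshold of 1.5 °C and the sample values are hypothetical, and only the R² > 0.5 criterion comes from the description above.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R², penalising the number of predictors used in the fit."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


def select_stations(results, min_r2=0.5, min_abs_anomaly=1.5):
    """Keep stations that are predictable and have a large predicted anomaly."""
    selected = [
        r for r in results
        if r["r2"] > min_r2 and abs(r["anomaly"]) >= min_abs_anomaly
    ]
    # Largest absolute anomaly first.
    return sorted(selected, key=lambda r: abs(r["anomaly"]), reverse=True)


# Hypothetical per-station results (predicted winter anomaly in °C).
results = [
    {"station_id": 1, "r2": 0.62, "anomaly": -2.1},
    {"station_id": 2, "r2": 0.31, "anomaly": 3.0},   # not predictable enough
    {"station_id": 3, "r2": 0.55, "anomaly": 0.4},   # anomaly too small
    {"station_id": 4, "r2": 0.71, "anomaly": 2.6},
]
print([r["station_id"] for r in select_stations(results)])  # [4, 1]
print(round(adjusted_r2(0.62, n_samples=40, n_predictors=3), 3))
```

Requiring both conditions avoids acting on a large predicted anomaly at a station where the regression has little skill.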
- Latest status: first attempt to run the script 02_era.py.
- For comments: createRow: chosen to store data in one array. First attempt to run the script 02_era.py.
- Work on the download API function getFiles. On 2021-05-06, loading 2021-04z70hPa worked. Now go on with other variables, ci, etc. Check this link also.
- Predict PCA scores of next winter avg temp using PCA scores of ci, sp, sst, z70, etc.