Notebook: OpenAQ data overview

Sebastian Steinig edited this page May 16, 2024 · 6 revisions

File names

  • /notebooks/01_openaq-data-overview.ipynb
  • /notebooks/01_openaq-data-overview.py

Motivation

OpenAQ provides access to thousands of real-time air pollution measurement stations via a free API. For our target list of cities around the world, we want to explore the following questions:

  1. How many cities are covered by the OpenAQ database for individual pollutants?
  2. How homogeneous is the data across different stations for the same city?
  3. Will spurious measurements be a problem? How can we detect outliers?

Methods

  • fetch all available data for the last 7 days for each city from the OpenAQ API
  • save all data for stations providing any of ["o3", "no2", "so2", "pm10", "pm25"] within a radius of 25 km of each city (the API maximum)
  • plot global maps of cities with available data and time series for all stations for the last 7 days grouped by city
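
The station selection described above can be sketched locally as follows. This is a minimal illustration rather than the notebook's actual code: the notebook relies on the API's own radius search, whereas this sketch applies an equivalent great-circle filter to station metadata that has already been downloaded. The dictionary fields (`lat`, `lon`, `parameters`), the function names and the example coordinates are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_RADIUS_KM = 25.0  # search radius quoted in the methods above
POLLUTANTS = ["o3", "no2", "so2", "pm10", "pm25"]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def select_stations(city, stations, radius_km=MAX_RADIUS_KM):
    """Keep stations inside the radius that report at least one target pollutant."""
    selected = []
    for station in stations:
        distance = haversine_km(city["lat"], city["lon"], station["lat"], station["lon"])
        has_target = any(p in POLLUTANTS for p in station["parameters"])
        if distance <= radius_km and has_target:
            selected.append(station)
    return selected


# Hypothetical station metadata, for illustration only.
city = {"name": "Warsaw", "lat": 52.23, "lon": 21.01}
stations = [
    {"name": "near", "lat": 52.25, "lon": 21.05, "parameters": ["o3", "no2"]},
    {"name": "far", "lat": 52.60, "lon": 21.01, "parameters": ["pm10"]},
    {"name": "near_co_only", "lat": 52.22, "lon": 21.00, "parameters": ["co"]},
]
nearby = select_stations(city, stations)
print([s["name"] for s in nearby])
```

Only the first station passes both checks: the second lies roughly 41 km north of the city coordinates, and the third reports none of the five target pollutants.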

Results

Global coverage

  • no data for any pollutant at any point over the last week for 50 out of the 153 cities (some others only have a single or unreliable station)
  • best coverage for Europe, North America and Eastern Asia
  • average of $\sim$ 9 stations per location
  • most measurements for $PM_{2.5}$, fewest for $SO_{2}$ and $O_{3}$
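
Coverage numbers like those above can be derived by counting, for each pollutant, the distinct cities with at least one measurement. The sketch below assumes the fetched data has been flattened into records with hypothetical `city` and `parameter` keys; the records shown are made up and do not reproduce the notebook's results.

```python
POLLUTANTS = ["o3", "no2", "so2", "pm10", "pm25"]


def coverage_by_pollutant(records):
    """Count distinct cities with at least one record per target pollutant."""
    cities = {p: set() for p in POLLUTANTS}
    for rec in records:
        if rec["parameter"] in cities:  # ignore pollutants outside the target list
            cities[rec["parameter"]].add(rec["city"])
    return {p: len(names) for p, names in cities.items()}


# Made-up records, for illustration only.
records = [
    {"city": "Warsaw", "parameter": "o3"},
    {"city": "Warsaw", "parameter": "o3"},
    {"city": "Athens", "parameter": "o3"},
    {"city": "Lima", "parameter": "pm25"},
    {"city": "Lima", "parameter": "co"},
]
coverage = coverage_by_pollutant(records)
print(coverage)
```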

Local Variability

  • consistency between stations within the same city/region is highly variable (e.g. see below for a dominant diurnal ozone cycle in Warsaw, compared with a large, non-stationary spread in Athens):
  • air quality stations likely clustered around pollution hot spots, e.g. see $NO_{2}$ peaks around large streets (Hallein A10, Hallein B159) in Salzburg:
  • large spread across stations for all pollutants in Lima (see below)
  • e.g. for $PM_{10}$, the choice of station or averaging method makes the difference between AQI level 1 (<20) and level 6 (>150)
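
To illustrate how strongly the station choice can affect the reported level, the sketch below maps $PM_{10}$ concentrations to a six-level index. Only the outer thresholds (<20 for level 1, >150 for level 6) come from the text above. The intermediate band edges, the station names and the concentrations are assumptions chosen for the example.

```python
from bisect import bisect_right
from statistics import mean, median

# Band edges for PM10. Only 20 and 150 are taken from the text above;
# the edges in between are assumed for illustration.
PM10_BAND_EDGES = [20, 40, 50, 100, 150]


def pm10_level(concentration):
    """Return an index level from 1 to 6; values on an edge go to the higher level."""
    return bisect_right(PM10_BAND_EDGES, concentration) + 1


# Made-up simultaneous readings from three stations in one city.
readings = {"station_a": 12.0, "station_b": 48.0, "station_c": 180.0}

per_station = {name: pm10_level(value) for name, value in readings.items()}
level_from_median = pm10_level(median(readings.values()))
level_from_mean = pm10_level(mean(readings.values()))
print(per_station, level_from_median, level_from_mean)
```

With these values, individual stations span levels 1 to 6, while the median and mean across stations give two different intermediate levels, so any city-level estimate depends on the aggregation that is chosen.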

Data Quality

  • some outliers and unmasked missing values are easy to identify because they are reported as either 0 or -1000:
  • but short spikes of 2-3 data points are sometimes difficult to judge/interpret:
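
One possible screening approach for both cases is sketched below. It is not the method used in the notebook. Sentinel values are masked first, and the remaining points are compared against a rolling median scaled by the median absolute deviation (a Hampel-style filter). Treating every 0 as missing, the window size and the threshold are all assumptions that would need tuning per pollutant.

```python
from statistics import median

SENTINELS = {0.0, -1000.0}  # values the text above identifies as unmasked bad data


def mask_sentinels(values):
    """Replace sentinel values with None (assumes an exact 0 is never a valid reading)."""
    return [None if v is None or v in SENTINELS else v for v in values]


def flag_spikes(values, half_window=3, n_sigmas=3.0):
    """Return indices that deviate strongly from the local median."""
    flagged = []
    for i, v in enumerate(values):
        if v is None:
            continue
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = [x for x in values[lo:hi] if x is not None]
        if len(window) < 3:  # too few neighbours to judge
            continue
        med = median(window)
        mad = median(abs(x - med) for x in window)
        scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
        if scale > 0 and abs(v - med) > n_sigmas * scale:
            flagged.append(i)
    return flagged


# Made-up hourly series with two sentinels and one isolated spike.
series = [21.0, 23.0, 0.0, 22.0, 24.0, 95.0, 23.0, -1000.0, 22.0, 21.0]
masked = mask_sentinels(series)
spikes = flag_spikes(masked)
print(masked, spikes)
```

A filter of this kind handles isolated spikes well, but a run of 2-3 elevated points can only be flagged if the window is wide enough for the median to stay anchored to the normal values, and it cannot tell whether such a run is a sensor fault or a real pollution event.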

Conclusions

  1. no useful OpenAQ data for any pollutant for $\sim$ half of all target cities -> might need to consider additional data sources
  2. spurious data/outliers for most cities -> robust quality screening will be necessary to detect and remove these data
  3. large spread between sensors within some cities -> develop methods to generate city-level estimates
