Introduction

This dataset provides hourly timeseries, updated every day, of air quality forecasts at European observation sites that have been optimised using a Model Output Statistics (MOS) method.

The CAMS MOS uses a machine learning algorithm to improve the chemistry-transport CAMS European air quality Ensemble forecasts for 4 pollutants (O3, NO2, PM10, PM2.5) at the observation sites. It is automatically trained on predictive variables (predictors) over a learning period and delivers a 4-day forecast (0h to 96h).

Because it adjusts the raw forecasts of the regular chemistry-transport forecasting system, it belongs to the category of Model Output Statistics (MOS) products.

Description of the MOS method

Approaches

In the framework of a previous CAMS service (CAMS_63), several machine learning post-processing approaches (used as MOS methods) were tested in order to take stock of recent developments in machine learning applications.

Among the different configurations, several learning periods were tested, as well as different sets of predictors for each species and various performance indicators.

The selected MOS approach applies to the whole of Europe, meaning that a single statistical model is built from the CAMS Ensemble forecast, IFS meteorological data and observation data covering the whole modelling domain. The advantage of this global modelling approach is that only a very short time period is needed to gather enough data to train a robust model (good performance has been obtained with a short training period). To optimise performance, a new model is built every day with the most recent available data. Any change in the modelling system (upgrade of a member of the Ensemble, addition of new observation sites, ...) is thereby automatically and rapidly reflected in a new MOS model that produces the appropriate correction.

In order to validate the final configuration in terms of robustness, performance and computing time, assessments have been carried out and published in Bertrand et al., 2023: https://acp.copernicus.org/articles/23/5317/2023/

At this time, the learning period is set to 3 days.

Predictors

The MOS is trained, over the 3-day learning period, on hourly air quality observations and model data (both air quality and meteorological parameters), and predicts hourly concentrations.

The training is based on the relation between the predictors and the observations, which serve as the target used to fit a statistical model. This statistical model can then convert the same predictors into a concentration forecast, which is here our predictand.

Several sets of predictors have been investigated, and the results have shown that a limited set of predictors, including the concentrations from the CAMS Ensemble forecast, some predicted meteorological variables and recent observations, provides good performance.
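
To make the predictor/predictand relation concrete, the sketch below fits a generic regression model on synthetic data. It is only an illustration of the MOS principle: the model type (a scikit-learn gradient boosting regressor), the predictor set and all numbers are placeholders, not the operational CAMS configuration.

# Illustration of the MOS principle only: fit a statistical model that maps
# predictors (raw Ensemble concentration, meteorological variables,
# previous-day observation) to observed concentrations, then reuse it to
# correct new forecasts. Model choice and predictors are placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 3 * 24 * 200  # e.g. a 3-day learning period of hourly data at ~200 sites

# Synthetic predictors: raw forecast, 2 m temperature, wind speed, previous-day observation
X = np.column_stack([
    rng.gamma(2.0, 10.0, n),      # raw Ensemble concentration (µg m-3)
    rng.normal(285.0, 5.0, n),    # temperature at 2 m (K)
    rng.gamma(2.0, 2.0, n),       # wind speed (m s-1)
    rng.gamma(2.0, 10.0, n),      # observation from the previous day (µg m-3)
])
# Synthetic "observations" (the predictand) to fit against
y = 0.8 * X[:, 0] + 0.2 * X[:, 3] + rng.normal(0.0, 3.0, n)

mos = GradientBoostingRegressor().fit(X, y)  # train over the learning period
corrected = mos.predict(X[:24])              # apply to new predictor values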

Criteria for the observations used:

  • background observation sites, specific selection based on an objective classification (Categories 1 to 7 of the Joly and Peuch, 2012 classification, corresponding roughly to urban, suburban and rural background observation sites)
  • hourly observations
  • 75% availability rate of observations over the learning period

The MOS production takes place once a day from 6:30 UTC and produces forecasts for all available observation sites that meet the above criteria. The number of observation sites therefore varies with the species and the date.
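
As an illustration of the availability criterion, the sketch below counts valid hourly values per site over a 3-day window and keeps only sites with at least 75% coverage. The column names and data are hypothetical, not the operational CAMS data model.

# Illustration of the 75 % availability criterion (hypothetical column names).
import pandas as pd

# Hourly observations over a 3-day learning period: one row per valid value
obs = pd.DataFrame({
    'station_id': ['NL00014'] * 72 + ['NL00404'] * 40,
    'value': 10.0,
})

hours_in_learning_period = 3 * 24
coverage = obs.groupby('station_id')['value'].count() / hours_in_learning_period
selected_sites = coverage[coverage >= 0.75].index.tolist()
print(selected_sites)  # only NL00014 (72/72 hours) passes the 75 % threshold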

Data used for the MOS:

European air quality Ensemble forecast variables:
  • O3
  • NO2
  • PM10
  • PM2.5

IFS meteorological forecast parameters:
  • Temperature at 2 m
  • Relative humidity
  • Wind speed (zonal and meridional)
  • Boundary layer height

EEA air quality observations:
  • Measurements of the previous day

The MOS takes model data as input, but also observations of the previous day, acknowledging the importance of persistence in the forecast skill.

The European air quality Ensemble forecast at the observation site location is taken as input for the O3, NO2, PM10 and PM2.5 concentrations.

IFS is used for all weather predictors because of its spatial coverage over Europe and the high quality of its forecasts.

The model data are gridded, so a bilinear interpolation is performed to obtain the values at all observation sites.
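
The sketch below shows what such a bilinear interpolation from a regular latitude/longitude grid to a site location could look like, using a synthetic field and hypothetical site coordinates; the operational interpolation is not necessarily implemented this way.

# Bilinear interpolation of a gridded field to an observation site
# (synthetic grid and hypothetical site coordinates).
import numpy as np
from scipy.interpolate import RegularGridInterpolator

lats = np.arange(35.0, 72.0, 0.1)   # regular latitude axis
lons = np.arange(-25.0, 45.0, 0.1)  # regular longitude axis
field = np.random.default_rng(0).gamma(2.0, 10.0, (lats.size, lons.size))

interpolator = RegularGridInterpolator((lats, lons), field, method='linear')
site_lat, site_lon = 52.3, 4.9      # hypothetical observation site
value_at_site = interpolator([[site_lat, site_lon]])[0]
print(value_at_site)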

Data access

Data is available for download from the CAMS Atmosphere Data Store (ADS). CAMS ADS registered users can access the available data interactively through the CAMS European air quality forecast optimised at observation sites ADS download web interface and/or programmatically using the API as per instructions detailed here.

Data availability (HH:MM)

The processing takes place at 6:30 UTC and the delivery is guaranteed by 8:00 UTC on the ADS.

Spatial resolution

Timeseries are provided at individual observation sites.

Temporal frequency

The MOS model runs once a day from 6:30 UTC.

Data are available with a time resolution of 1 hour and a forecast period from step 0h to step 96h.

Data format

Data are available in CSV format with a semicolon separator. The files are split by date, country and species.

In the file, the observation sites are declared by their EIONET identifier.

An associated metadata file is available from the download form and gives information on the observation sites (coordinates, altitude, type of observation site as provided by the European Environment Agency, date_start, date_end).

Please also note that the location of some observation sites may change over time. When an observation site is relocated, a new line appears in the metadata file with the new coordinates. To date these coordinate changes, the date_start and date_end columns indicate the start and end dates for which the MOS was produced at these specific coordinates.
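
For example, the metadata row valid on a given date can be selected by filtering on these columns. The sketch below assumes the column names described above (and used in the example script further down); the metadata file name is only indicative.

# Select the metadata row valid on a given date for one site
# (file name indicative; columns as described above).
import pandas as pd

meta = pd.read_csv('station_list.csv', sep=';',
                   parse_dates=['date_start', 'date_end'])
date = pd.Timestamp('2024-01-27')
valid = meta[(meta.id == 'NL00014') &
             (meta.date_start <= date) &
             # empty date_end values parse as NaT (site still at these coordinates)
             (meta.date_end.isna() | (meta.date_end >= date))]
print(valid[['id', 'lat', 'lon', 'alt']])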

Product listings

Please note that not all species are available at all observation sites for all the timesteps.

Variable name | Units | Variable name in ADS | Note
Nitrogen dioxide | µg m-3 | nitrogen_dioxide | Data are available from 17-01-2024
Ozone | µg m-3 | ozone | Data are available from 17-01-2024
Particulate matter < 10 µm | µg m-3 | particulate_matter_10um | Data are available from 17-01-2024
Particulate matter < 2.5 µm | µg m-3 | particulate_matter_2.5um | Data are available from 17-01-2024

Validation reports

MOS production evaluation will be made available at station level and aggregated by country through an interactive visualisation platform.

Example visualisation code

See below an example of how to download the data using the API and plot the data for a station:

Demonstration code to download and plot air-quality point forecasts from the Atmosphere Data Store
"""Demonstration code to download and plot air-quality point forecasts from
   the CAMS Atmosphere Data Store"""
 
import os
import sys
import yaml
import json
import cdsapi
import zipfile
import hashlib
import pandas as pd
from math import ceil, sqrt
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
 
 
def main(cdsapirc_file=None):
 
    # The data to download
    request = {'variable': ['nitrogen_dioxide', 'ozone',
                            'particulate_matter_2.5um',
                            'particulate_matter_10um'],
               'country': 'netherlands',
               'type': ['raw', 'mos_optimised'],
               'leadtime_hour': ['0-23', '24-47'],
               'year': ['2024'],
               'month': ['01'],
               'day': ['27', '28', '29', '30', '31'],
               'include_station_metadata': 'yes',
               'format': 'zip'}
 
    # The stations to plot. If None, plot them all.
    #stations = None
    stations = ['NL00014']
 
    # Plotting preferences
    style = {
        'type': {
            'raw': {'color': 'tab:blue'},
            'mos': {'color': 'tab:orange'}
        },
        'leadtime_day': {
            0: {'linestyle': 'solid'},
            1: {'linestyle': 'dashed'}
        }
    }
 
    # Download the data. Use a filename that depends on the request so we
    # don't have to re-download if the data already exists.
    data_file = data_filename(request)
    if not os.path.exists(data_file):
        get_data(request, data_file, cdsapirc_file)
 
    # Read the data
    station_data, data = read_data(data_file, stations)
 
    # Make a plot for each station
    for station in data.station_id.unique():
 
        # Extract station metadata for just this site. If there's more than one
        # entry we take the latest one that's valid within this time period
        sdata = station_data.loc[
            (station_data.id == station) &
            (station_data.date_start <= data.datetime.iloc[-1]) &
            (station_data.date_end >= data.datetime.iloc[0])
        ]
        assert len(sdata), 'No metadata for site?'
        sdata = sdata.iloc[-1, :]
 
        # Extract air quality data for just this site
        adata = data[data.station_id == station]
 
        if len(adata) > 0:
            station_plot(sdata, adata, style)
        else:
            print('No data for ' + station)
 
 
def data_filename(request):
    """Return a data filename containing a hash which depends on the request"""
    hash = hashlib.md5()
    hash.update(json.dumps(request, sort_keys=True).encode())
    return 'data_' + hash.hexdigest() + '.zip'
 
 
def get_data(request, data_file, cdsapirc_file):
    """Download requested data from the ADS"""
 
    # Read the login credentials if provided
    if cdsapirc_file:
        with open(cdsapirc_file, 'r') as f:
            credentials = yaml.safe_load(f)
        kwargs = {'url': credentials['url'],
                  'key': credentials['key'],
                  'verify': credentials['url'].startswith('https://ads.')}
    else:
        kwargs = {}
 
    client = cdsapi.Client(**kwargs)
    client.retrieve(
        'cams-europe-air-quality-forecasts-optimised-at-observation-sites',
        request,
        data_file)
 
 
def read_data(data_file, stations):
    """Read the downloaded zip file and return the station metadata and
       concentration data as Pandas DataFrames"""
 
    data = {}
 
    # Loop over zip file contents
    with zipfile.ZipFile(data_file) as zip:
        for name in sorted(zip.namelist()):
            with zip.open(name) as f:
 
                if name.startswith('station_list'):
 
                    # Read the station metadata file
                    date_fmt = '%Y-%m-%d'
                    station_data = pd.read_csv(f, sep=';',
                                               keep_default_na=False)
 
                    # Remove metadata for stations we're not interested in
                    if stations:
                        station_data = station_data[
                            station_data.id.isin(stations)]
 
                    # Set missing end dates to a date far in the future
                    no_end = (station_data.date_end == '')
                    station_data.loc[no_end, 'date_end'] = '2099-01-01'
 
                    # Parse start and end dates into datetime objects
                    station_data.date_start = pd.to_datetime(
                        station_data.date_start,
                        format=date_fmt)
                    station_data.date_end = pd.to_datetime(
                        station_data.date_end,
                        format=date_fmt)
 
                else:
 
                    # Read the data file
                    df = pd.read_csv(f, sep=';', parse_dates=['datetime'],
                                     infer_datetime_format=True)
 
                    # Remove data for stations we're not interested in
                    if stations:
                        df = df[df.station_id.isin(stations)]
 
                    # Get the name of the column containing concentration so we
                    # can group data by raw/mos type
                    data_col = [c for c in df.columns if c.startswith('conc_')]
                    assert len(data_col) == 1
                    data_col = data_col[0]
                    if data_col not in data:
                        data[data_col] = []
 
                    data[data_col].append(df)
 
    # Merge the data frames
    merged_data = None
    for data_col in list(data.keys()):
 
        # Concatenate all times/countries/species
        x = pd.concat(data[data_col])
 
        # Merge raw and mos into combined records
        if merged_data is None:
            merged_data = x
        else:
            merged_data = merged_data.merge(x, how='outer', validate='1:1',
                                            on=[c for c in x.columns
                                                if c != data_col])
 
    return station_data, merged_data
 
 
def station_plot(station, data, style):
    """Make a series of plots for a station - one for each species"""
 
    # Species to plot
    allspecies = data.species.unique()
 
    # Create the figure
    nplotsx = ceil(sqrt(len(allspecies)))
    nplotsy = ceil(len(allspecies) / nplotsx)
    fig = plt.figure(figsize=(nplotsx*8, nplotsy*5))
    fig.subplots_adjust(hspace=0.35)
 
    # Plot each species
    for iplot, species in enumerate(allspecies):
        ax = plt.subplot(nplotsy, nplotsx, iplot + 1)
 
        # Extract data for just this species
        sdata = data[data.species == species]
 
        plot_species(station, sdata, style, ax)
 
    plt.show()
 
 
def plot_species(station, data, style, ax):
    """Make a plot for one species at one station"""
 
    allspecies = data.species.unique()
    assert len(allspecies) == 1
 
    # Plotting both raw and mos-corrected or just one?
    types = [t for t in ['raw', 'mos']
             if f'conc_{t}_micrograms_per_m3' in data.columns]
 
    # Plot different lines for each post-processing type and lead time day
    for type in types:
        lead_time_day = data.lead_time_hour // 24
        for day in lead_time_day.unique():
            dt = data[lead_time_day == day]
 
            ax.plot(dt.datetime, dt[f'conc_{type}_micrograms_per_m3'],
                    label=f'{type} forecast D+{day}',
                    **prefs(type, day, style))
 
    # Nicer date labels
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%H\n%a %d'))
 
    ax.set_ylabel(r'$\mu$g / m$^3$')
 
    date_range = 'from ' + ' to '.join(data.datetime.iat[i].strftime('%Y-%m-%d')
                                       for i in [0, -1])
    ax.set_title(
        ('{species} at {station} (lat={lat}, lon={lon}, altitude={alt}m)\n'
         '{dates}').format(species=allspecies[0],
                           station=station.id,
                           lat=station.lat,
                           lon=station.lon,
                           alt=station.alt,
                           dates=date_range))
 
    ax.legend()
 
 
def prefs(type, leadtime, style):
    """Return pyplot.plot() keyword arguments for given type and leadtime"""
 
    return {**style.get('type', {}).get(type, {}),
            **style.get('leadtime_day', {}).get(leadtime, {})}
 
 
if __name__ == '__main__':
 
    # The ADS credentials can be passed as an argument if they're not stored in
    # the default location
    cdsapirc_file = sys.argv[1] if len(sys.argv) > 1 else None
 
    main(cdsapirc_file=cdsapirc_file) 

Figure 1: Timeseries from 27-01-2024 to 01-02-2024 for nitrogen dioxide (top left), ozone (top right), particulate matter 10 µm (bottom left) and particulate matter 2.5 µm (bottom right) forecasts for station NL00014 in the Netherlands.

Guidelines

  • Users can select either 'raw' or 'MOS-optimised' daily air quality forecasts at European observation sites. The raw forecasts are the European air quality Ensemble forecasts interpolated to the observation site locations. The MOS-optimised forecasts are produced from the raw forecasts using a machine learning post-processing method belonging to the Model Output Statistics (MOS) category. Both types are provided in the same format.
  • Missing values may be present in the MOS product for some species and/or hours, due to the lack of observations available at the observation site for that species and those hours. Recent observations are needed to produce the MOS because they are used as a predictor; a minimal filtering sketch is given below this list.
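
For instance, rows with a missing MOS value can simply be filtered out before use. The sketch below assumes the column naming used in the example script above; the file name is only indicative.

# Drop rows where the MOS-optimised concentration is missing
# (file name indicative; column name as in the example script above).
import pandas as pd

df = pd.read_csv('mos_forecast.csv', sep=';', parse_dates=['datetime'])
df = df.dropna(subset=['conc_mos_micrograms_per_m3'])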

How to acknowledge, cite and refer to the data

All users of data uploaded on the Atmosphere Data Store (ADS) must provide clear and visible attribution to the Copernicus programme and are asked to cite and reference the dataset provider.

(1) Acknowledge according to the licence to use Copernicus Products.

(2) Cite each dataset used:

  • METEO FRANCE, Institut national de l'environnement industriel et des risques (Ineris), Aarhus University, Norwegian Meteorological Institute (MET Norway), Jülich Institut für Energie- und Klimaforschung (IEK), Institute of Environmental Protection – National Research Institute (IEP-NRI), Koninklijk Nederlands Meteorologisch Instituut (KNMI), Nederlandse Organisatie voor toegepast-natuurwetenschappelijk onderzoek (TNO), Swedish Meteorological and Hydrological Institute (SMHI) and Finnish Meteorological Institute (FMI) (2024): CAMS European Air Quality Forecast Optimised at observation sites. Copernicus Atmosphere Monitoring Service (CAMS) Atmosphere Data Store (ADS).  (Accessed on <DD-MMM-YYYY>), https://ads.atmosphere.copernicus.eu/cdsapp#!/dataset/cams-europe-air-quality-forecasts-optimised-at-observation-sites?tab=overview

(3) Throughout the content of your publication, the dataset used is referred to as Author (YYYY), i.e. METEO-FRANCE et al. (2024).

References

  • Bertrand, J.-M., Meleux, F., Ung, A., Descombes, G. and Colette, A. (2023): Technical note: Improving the European air quality forecast of the Copernicus Atmosphere Monitoring Service using machine learning techniques, Atmos. Chem. Phys., 23, 5317–5333, https://acp.copernicus.org/articles/23/5317/2023/
  • Joly, M. and Peuch, V.-H. (2012): Objective classification of air quality monitoring sites over Europe, Atmospheric Environment, 47, 111–123.

This document has been produced in the context of the Copernicus Atmosphere Monitoring Service (CAMS).

The activities leading to these results have been contracted by the European Centre for Medium-Range Weather Forecasts, operator of CAMS on behalf of the European Union (Delegation Agreement signed on 11/11/2014 and Contribution Agreement signed on 22/07/2021). All information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose.

The users thereof use the information at their sole risk and liability. For the avoidance of all doubt, the European Commission and the European Centre for Medium-Range Weather Forecasts have no liability in respect of this document, which merely represents the authors' view.