Introduction

The ERA5 family daily statistics catalogue entries provide post-processed daily aggregated data (for four statistics) derived from the ERA5 and ERA5-Land hourly data. These entries replace the ERA5 daily statistics application in the legacy Climate Data Store. This gives users a more consistent data-access experience and means that the returned data are more in line with other data files produced by the CDS.

The daily statistics are calculated as part of the retrieval and the data are not permanently archived. The daily statistics are calculated using the aggregate submodule of the earthkit-transforms Python package. However, to ensure that the calculated daily statistics represent the period requested (e.g. for accumulated variables), several additional steps are taken, as documented here.

How the daily statistics are calculated

This section describes in detail the process for calculating the daily statistics. Most users will only need the CDS webforms or the CDS API to access the data.

For a full technical demonstration of the calculation, please view the Jupyter Notebook below. The general workflow is:

  1. Initialise the request and map all the variables, including:
    1. The time zone selection is converted to a time_shift integer which represents a number of hours.
    2. The frequency selected is converted to an integer representing a number of hours.
  2. Loop over the requested variables (a minimal sketch of this mapping logic is given after this list):
    1. If the variable is an accumulated or mean-rate variable, we subtract one hour from the time_shift.
      1. For ERA5 single levels and ERA5 pressure levels, the accumulated and mean-rate variables represent the hour up to the time stamp; that is, data time-stamped YYYY/MM/DD 00:00 represents the accumulation/mean rate for the period 23:00 to 00:00 of the preceding date, YYYY/MM/DD-1.
    2. If the time_shift is greater than zero, the preceding day is added to the request.
      1. To ensure a fully sampled period, only the UTC time zone or time zones West of UTC (i.e. UTC-HH:00) can be retrieved for the first available day of data (01/01/1940 for ERA5 and 01/01/1950 for ERA5-Land).
    3. If the time_shift is less than zero, the following day is added to the request.
    4. The time steps required for the daily statistics calculation are created as a list and added to the request:
      • time steps list
        this_time: list[str] = [f"{i+(this_hour%frequency):02d}:00:00" for i in range(0, 24, frequency)]
    5. The data is requested and returned in GRIB format.
    6. The data is opened using xarray, with the args and kwargs used by the parent entry.
    7. The daily statistic is calculated using the following command:
      • daily statistic calculation
        daily_data = earthkit.transforms.aggregate.daily_reduce(data, how=HOW, time_shift={"hours": this_hour}, remove_partial_periods=True)
        # Where:
        # daily_mean: HOW="mean"
        # daily_max: HOW="max"
        # daily_min: HOW="min"
        # daily_sum: HOW="sum"
    8. The xarray object is written to netCDF.
  3. The netCDF file[s] are returned to the user
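
A minimal sketch of the request-mapping logic described in step 2 is given below. The helper name map_request and its signature are illustrative assumptions for this documentation, not the actual CDS backend code.

import datetime

def map_request(date: datetime.date, time_zone: str, frequency: int,
                accumulated: bool) -> dict:
    # Step 1.1: convert e.g. "utc+06:00" to a signed integer hour shift
    sign = 1 if "+" in time_zone else -1
    this_hour = sign * int(time_zone[4:6])

    # Step 2.1: accumulated/mean-rate variables are stamped at the end of
    # the hour, so shift back one hour to align the samples with the day
    if accumulated:
        this_hour -= 1

    # Steps 2.2 and 2.3: extend the request so the local day is fully covered
    dates = [date]
    if this_hour > 0:
        dates.insert(0, date - datetime.timedelta(days=1))  # preceding day
    elif this_hour < 0:
        dates.append(date + datetime.timedelta(days=1))     # following day

    # Step 2.4: the time steps needed at the chosen frequency (in hours)
    this_time = [f"{i + (this_hour % frequency):02d}:00:00"
                 for i in range(0, 24, frequency)]

    return {"date": [d.isoformat() for d in dates], "time": this_time}

# For example, a 6-hourly statistic for 2024-01-01 in time zone UTC+06:00:
# map_request(datetime.date(2024, 1, 1), "utc+06:00", 6, accumulated=False)
# returns {'date': ['2023-12-31', '2024-01-01'],
#          'time': ['00:00:00', '06:00:00', '12:00:00', '18:00:00']}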

Jupyter notebook demonstrating the calculation of the daily statistics

Daily statistics in the CDS

The following workflow demonstrates how to calculate the daily statistics from ERA5 data with earthkit.transforms. This is the methodology used by the derived daily statistics catalogue entries on the CDS.

import cdsapi
import xarray as xr
from earthkit.transforms.aggregate import temporal

Download some raw hourly data

Here we choose the ERA5 single-levels 2m temperature data. We have chosen a coarse grid, an area sub-selection and 6-hourly sampling to reduce the amount of data downloaded for the demonstration.

client = cdsapi.Client() 
dataset = "reanalysis-era5-single-levels"
request = {
    'product_type': ['reanalysis'],
    'variable': ['2m_temperature'],
    'date': '20240101/20240131',
    'time': ['00:00', '06:00', '12:00', '18:00'],
    'area': [60, -10, 50, 2],
    'grid': [1,1],
    'data_format': 'grib',
}
result_file = client.retrieve(dataset, request).download()
2024-09-10 15:52:51,773 INFO Request ID is cbe537cd-89ce-412d-9ea2-cd037046d979
2024-09-10 15:52:51,889 INFO status has been updated to accepted
2024-09-10 15:52:55,887 INFO status has been updated to successful

Open the result file with xarray

The time_dims are specified to be "valid_time", which is in line with the backend of the CADS post-processing and netCDF conversion.

ds = xr.open_dataset(
    result_file, time_dims=["valid_time"]
)
print(ds)
<xarray.Dataset> Size: 72kB
Dimensions:     (valid_time: 124, latitude: 11, longitude: 13)
Coordinates:
    number      int64 8B ...
  * valid_time  (valid_time) datetime64[ns] 992B 2024-01-01 ... 2024-01-31T18...
    surface     float64 8B ...
  * latitude    (latitude) float64 88B 60.0 59.0 58.0 57.0 ... 52.0 51.0 50.0
  * longitude   (longitude) float64 104B -10.0 -9.0 -8.0 -7.0 ... 0.0 1.0 2.0
Data variables:
    t2m         (valid_time, latitude, longitude) float32 71kB ...
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2024-09-10T15:52 GRIB to CDM+CF via cfgrib-0.9.1...

Calculate the daily statistic

Use the temporal module from earthkit.transforms.aggregate to calculate the daily statistic of interest. The API of earthkit.transforms.aggregate aims to be highly flexible, to suit the programming styles of as many users as possible. Here we provide a handful of examples, but we encourage users to explore the earthkit documentation for more examples:

https://earthkit-transforms.readthedocs.io/en/latest/

Daily mean
ds_daily_mean = temporal.daily_mean(ds)
print(ds_daily_mean)
<xarray.Dataset> Size: 18kB
Dimensions:     (valid_time: 31, latitude: 11, longitude: 13)
Coordinates:
    number      int64 8B 0
    surface     float64 8B 0.0
  * latitude    (latitude) float64 88B 60.0 59.0 58.0 57.0 ... 52.0 51.0 50.0
  * longitude   (longitude) float64 104B -10.0 -9.0 -8.0 -7.0 ... 0.0 1.0 2.0
  * valid_time  (valid_time) datetime64[ns] 248B 2024-01-01 ... 2024-01-31
Data variables:
    t2m         (valid_time, latitude, longitude) float32 18kB 281.4 ... 279.3
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2024-09-10T15:52 GRIB to CDM+CF via cfgrib-0.9.1...


Daily standard deviation
ds_daily_std = temporal.daily_std(ds)
print(ds_daily_std)
<xarray.Dataset> Size: 18kB
Dimensions:     (valid_time: 31, latitude: 11, longitude: 13)
Coordinates:
    number      int64 8B 0
    surface     float64 8B 0.0
  * latitude    (latitude) float64 88B 60.0 59.0 58.0 57.0 ... 52.0 51.0 50.0
  * longitude   (longitude) float64 104B -10.0 -9.0 -8.0 -7.0 ... 0.0 1.0 2.0
  * valid_time  (valid_time) datetime64[ns] 248B 2024-01-01 ... 2024-01-31
Data variables:
    t2m         (valid_time, latitude, longitude) float32 18kB 0.157 ... 1.934
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2024-09-10T15:52 GRIB to CDM+CF via cfgrib-0.9.1...

How to handle non-UTC time zones

To calculate the daily statistics for a non-UTC time zone, we use the time_shift kwarg to specify that we want to shift the time to match the requested time zone. The time_shift can be provided as a dictionary or as a pandas Timedelta; we use a dictionary for ease of reading. The example below, {"hours": 6}, is for the time zone UTC+06:00.

In addition, remove_partial_periods is set to True so that the returned result only contains values computed from complete period samples.

These arguments, along with all the other accepted arguments, are fully documented in the earthkit-transforms documentation:

https://earthkit-transforms.readthedocs.io/en/stable/_api/transforms/aggregate/temporal/index.html#transforms.aggregate.temporal.daily_mean

ds_daily_max = temporal.daily_max(
    ds, time_shift={"hours": 6}, remove_partial_periods=True
)
print(ds_daily_max)
<xarray.Dataset> Size: 18kB
Dimensions:     (valid_time: 30, latitude: 11, longitude: 13)
Coordinates:
    number      int64 8B 0
    surface     float64 8B 0.0
  * latitude    (latitude) float64 88B 60.0 59.0 58.0 57.0 ... 52.0 51.0 50.0
  * longitude   (longitude) float64 104B -10.0 -9.0 -8.0 -7.0 ... 0.0 1.0 2.0
  * valid_time  (valid_time) datetime64[ns] 240B 2024-01-02 ... 2024-01-31
Data variables:
    t2m         (valid_time, latitude, longitude) float32 17kB 282.0 ... 281.5
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2024-09-10T15:52 GRIB to CDM+CF via cfgrib-0.9.1...

Removing partial periods has resulted in the first day being lost from our initial data request: the first value of valid_time is now 2024-01-02. Similarly, if we had requested a negative time_shift (Westward of UTC), the final day would have been lost.

The derived daily catalogue entries adjust the data request to ensure that all days requested are included in the returned result file.
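
As a rough illustration (a sketch, not the backend code), the same effect can be achieved in this notebook by requesting one extra preceding day of hourly data, reusing the client, dataset and request objects defined above:

request_extended = dict(request, date="20231231/20240131")
result_extended = client.retrieve(dataset, request_extended).download()
ds_ext = xr.open_dataset(result_extended, time_dims=["valid_time"])

ds_daily_max_full = temporal.daily_max(
    ds_ext, time_shift={"hours": 6}, remove_partial_periods=True
)
# valid_time should now start at 2024-01-01: the extra day completes the
# first shifted period, and 2023-12-31 itself is dropped as partial.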

For the latest version please see here: Daily statistics

Data organisation and access

The following table provides an overview of which daily data are available.

Table 1: Datasets and parameter types for which daily statistics are available

                                                   Instantaneous parameters   Accumulated parameters
  ERA5 single levels (reanalysis and ensemble)     Yes                        Yes
  ERA5 pressure levels (reanalysis and ensemble)   Yes                        Yes
  ERA5-Land                                        Yes                        No (see "Accumulated variables for ERA5-Land" below)

The data are available from the Climate Data Store (CDS) webforms, or programmatically from the CDS API:

CDS API example for ERA5 single levels daily statistics
import cdsapi

dataset = "derived-era5-single-levels-daily-statistics"
request = {
    'product_type': 'reanalysis',
    'variable': ['10m_u_component_of_wind'],
    'year': '2024',
    'month': ['01'],
    'day': ['01'],
    'daily_statistic': 'daily_mean',
    'time_zone': 'utc+00:00',
    'frequency': '1_hourly'
}

client = cdsapi.Client()
client.retrieve(dataset, request).download()

Spatial grid

The ERA5 reanalysis atmospheric data resolution is 0.25° and the ERA5 ensemble atmospheric data resolution is 0.5°.

The ERA5 wave data resolution is 0.5° and the ERA5 ensemble wave data resolution is 1.0°.

The ERA5-Land data resolution is 0.1°.

Temporal frequency

Users can choose to calculate daily statistics from 1-hourly, 3-hourly or 6-hourly data.
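
In a CDS API request this sampling is selected with the frequency keyword. By analogy with the '1_hourly' value in the API example above, the other values are assumed to follow the same naming pattern:

request["frequency"] = "3_hourly"  # assumed values: "1_hourly", "3_hourly", "6_hourly"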

Data format

The post-processed daily statistics are provided in netCDF format only. The structure and naming conventions are the same as those of the hourly dataset used as input. The data are provided as one netCDF file per variable, and all files are archived in a single zip file for download.
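
For example, the returned zip can be unpacked and the per-variable netCDF files opened as follows (a sketch: the archive name era5_daily.zip is an assumption, and dataset and request are reused from the CDS API example above):

import glob
import zipfile

import cdsapi
import xarray as xr

client = cdsapi.Client()
client.retrieve(dataset, request).download("era5_daily.zip")

# The zip contains one netCDF file per requested variable
with zipfile.ZipFile("era5_daily.zip") as zf:
    zf.extractall("era5_daily")

datasets = [xr.open_dataset(path) for path in sorted(glob.glob("era5_daily/*.nc"))]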

Accumulated variables for ERA5-Land

Please note that ERA5-Land daily accumulated parameters are not available from the catalogue entry.

Please note that the convention for accumulations used in ERA5-Land differs from that used in ERA5. The accumulations in the short forecasts of ERA5-Land (with hourly steps from 01 to 24) are treated the same as those in ERA-Interim or ERA-Interim/Land, i.e. they are accumulated from the beginning of the forecast to the end of the forecast step. For example, runoff at day=D, step=12 will provide runoff accumulated from day=D, time=0 to day=D, time=12. The maximum accumulation is over 24 hours, i.e. from day=D, time=0 to day=D+1, time=0 (step=24).

  • HRES: accumulations are from 00 UTC to the hour ending at the forecast step
  • For the CDS time, or validity time, of 00 UTC, the accumulations are over the 24 hours ending at 00 UTC, i.e. the accumulation is for the previous day

The data time-stamped YYYY/MM/DD 00:00 represents the total daily accumulation for the date YYYY/MM/DD-1. Therefore:

  1. To calculate the daily accumulation for UTC, you just need to sample the ERA5-Land data at 00:00 and be aware that the data are representative of the day before the time stamp.
    1. Note, this is why we do not include these daily accumulations in the catalogue entry: they would in effect be the same data with a different time stamp, which could lead to confusion.
  2. To calculate the daily accumulation for non-UTC time zones, we must sample the data at two time steps and then carefully combine them so that they are associated with the correct day. This is quite complex, and we provide the following notebook as guidance (an illustrative sketch also follows below):

For the latest version please see here: Daily accumulation for ERA5-land
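
As a rough illustration of point 2 (a sketch under stated assumptions, not the linked notebook): for the time zone UTC+06:00, the local day D runs from 18:00 UTC on day D-1 to 18:00 UTC on day D, so the local daily total combines the tail of the previous UTC forecast day with the head of the current one. The dataset ds (hourly ERA5-Land total precipitation tp, sampled at 00:00 and 18:00 UTC on consecutive days) is an assumption.

# Full previous-UTC-day totals (stamped at 00:00) and 00-18 UTC partial totals
acc_00 = ds.tp.sel(valid_time=ds.valid_time.dt.hour == 0)
acc_18 = ds.tp.sel(valid_time=ds.valid_time.dt.hour == 18)

# Local day D for UTC+06:00 spans UTC (D-1) 18:00 -> (D) 18:00:
# tail = accumulation from (D-1) 18:00 to (D) 00:00
#      = full (D-1) total minus the (D-1) 00-18 UTC total
tail = acc_00.values[1:] - acc_18.values[:-1]
# head = accumulation from (D) 00:00 to (D) 18:00
head = acc_18.values[1:]

# Local-day accumulations (arrays assumed aligned and of equal length)
daily_total_local = head + tail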

Known Issues

  1. Issue with ERA5 family daily accumulated variables for non-UTC timezones now resolved


This document has been produced in the context of the Copernicus Climate Change Service (C3S).

The activities leading to these results have been contracted by the European Centre for Medium-Range Weather Forecasts, operator of C3S on behalf of the European Union (Delegation Agreement signed on 11/11/2014 and Contribution Agreement signed on 22/07/2021). All information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose.

The users thereof use the information at their sole risk and liability. For the avoidance of all doubt, the European Commission and the European Centre for Medium-Range Weather Forecasts have no liability in respect of this document, which merely represents the author's view.
