Contributors: Hendrik Boogaard (WAGENINGEN ENVIRONMENTAL RESEARCH), Gerald van der Grijn (METEOGROUP)

History of Modifications

Version | Date | Description of modification | Editor
2.2 | 11/2020 | Updated dataset temporal coverage due to dataset update | ECMWF

Acronyms

Acronym | Description or definition
AgERA5 | Daily surface meteorological data set for agronomic use, based on ERA5
CDS | Climate Data Store (of ECMWF)
CUS | Copernicus User Support
ECMWF | European Centre for Medium-Range Weather Forecasts
ECV | Essential Climate Variable
ERA5 | ECMWF Re-Analysis
HRES | High Resolution Forecast
JRC | Joint Research Centre of the European Commission
LT | Local Time
MARS | Monitoring Agricultural ResourceS
NN | Nearest Neighbour

1. Scope of the document

This document summarizes the characteristics of the AgERA5 data set concisely, focusing on: spatial and temporal extent and resolution; data formats, metadata and flags; description of variables; strengths and limitations; and usage dos and don'ts.

The AgERA5 dataset provides daily surface meteorological data for the period 1979 to present on a 0.1° grid. The service is based on the fifth generation of ECMWF atmospheric re-analyses of the global climate, better known as ERA5. AgERA5 'connects' users in the agricultural domain to the new ERA5 data set. It includes daily aggregates of agronomically relevant elements, tuned to local day definitions and adapted to the finer topography, finer land use pattern and finer land-sea delineation of the ECMWF HRES operational model. The elements cover temperature, precipitation, snow depth, humidity, cloud cover and radiation.

2. Executive summary

The AgERA5 dataset provides daily surface meteorological data for the period 1979 to present on a 0.1° grid. The service is based on the fifth generation of ECMWF atmospheric re-analyses of the global climate, better known as ERA5.

AgERA5 'connects' users in the agricultural domain to the new ERA5 data set. It includes daily aggregates of agronomically relevant variables, tuned to local day definitions and adapted to the finer topography, finer land use pattern and finer land-sea delineation of the ECMWF HRES operational model. The variables cover temperature, precipitation, snow depth, humidity, cloud cover and radiation.

3. Product description

The following text applies to AgERA5 version 1.0.

3.1. Introduction

Climate forcing data is used in analysis and agro-environmental modelling to study aspects of productivity and externalities of agriculture (e.g. Toreti et al., 2019; Glotter et al., 2016; De Wit et al., 2010). In this service we start from the hourly ECMWF ERA5 model data and convert the data into meaningful input for these analyses and models. This involves a large amount of data that needs to be processed. Acquisition and pre-processing of ERA5 data, both archive and near real-time (NRT) data, is a large and specialized job. It requires a heavy investment from users such as technical policymakers, information agencies, NGOs, commodity traders, agri-businesses and insurance providers. The complexity of the task and the effort required may even be a barrier to starting to use the data.

This service is based on the original hourly deterministic ECMWF ERA5 data, at surface level and available at a spatial resolution of about 30 km (~0.28125°). Data were aggregated to daily time steps and corrected towards a finer topography at a 0.1° spatial resolution. The daily aggregates follow a local time zone definition and include a number of major agronomic parameters. The correction to the 0.1° grid was realized by applying grid- and variable-specific regression equations to an ERA5 data set interpolated to the 0.1° grid. The equations were trained on operational ECMWF HRES model data at a 0.1° resolution. The final data set is referred to as AgERA5. AgERA5 will save potential users money and encourage businesses to use such a high-quality data set. It also avoids a possible proliferation of different data sets derived from the basic hourly ERA5 data set.

3.2. Geophysical product description

3.2.1. Generic bioclimatic indicators

AgERA5 includes 22 agronomically relevant variables (see Table 3.1).

Table 3.1: List of variables in the AgERA5 data set

Short name | Long name | Unit | Aggregation | AGROVOC URI
Cloud_Cover_Mean | Total cloud cover (00-00LT) | (0 - 1) | Mean |
Dew_Point_Temperature_2m_Mean | 2 meter dewpoint temperature (00-00LT) | K | Mean |
Preciptation_Flux | Total precipitation (00-00LT) | mm d-1 | Sum |
Preciptation_Rain_Duration_Fraction | Precipitation type duration - rain (00-00LT) | - | Count |
Preciptation_Solid_Duration_Fraction | Precipitation type duration - solid fraction (no hail) composed of: precipitation types freezing rain (3), snow (5), wet snow (6), mixture of rain and snow (7) and ice pellets (8) (00-00LT) | - | Count |
Relative_Humidity_2m_06h | Relative humidity at 06LT | % | - |
Relative_Humidity_2m_09h | Relative humidity at 09LT | % | - |
Relative_Humidity_2m_12h | Relative humidity at 12LT | % | - |
Relative_Humidity_2m_15h | Relative humidity at 15LT | % | - |
Relative_Humidity_2m_18h | Relative humidity at 18LT | % | - |
Snow_Thickness_LWE_Mean | Snow liquid water equivalent (00-00LT) | cm of liquid water equivalent | Mean |
Snow_Thickness_Mean | Snow depth (00-00LT) | cm snow | Mean |
Solar_Radiation_Flux | Surface solar radiation downwards (00-00LT) | J m-2 d-1 | Sum |
Temperature_Air_2m_Max_24h | Maximum air temperature at 2 meter (00-00LT) | K | Maximum |
Temperature_Air_2m_Max_Day_Time | Maximum air temperature at 2 meter (06-18LT) | K | Maximum |
Temperature_Air_2m_Mean_24h | 2 meter air temperature (00-00LT) | K | Mean |
Temperature_Air_2m_Mean_Day_Time | 2 meter air temperature (06-18LT) | K | Mean |
Temperature_Air_2m_Mean_Night_Time | 2 meter air temperature (18-06LT) | K | Mean |
Temperature_Air_2m_Min_24h | Minimum air temperature at 2 meter (00-00LT) | K | Minimum |
Temperature_Air_2m_Min_Night_Time | Minimum air temperature at 2 meter (18-06LT) | K | Minimum |
Vapour_Pressure_Mean | Vapour pressure (00-00LT) | hPa | Mean |
Wind_Speed_10m_Mean | 10 meter wind component (00-00LT) | m s-1 | Mean |

3.3. Product target requirements

DATA DESCRIPTION

Horizontal coverage | Global (on a regular latitude-longitude grid)
Temporal coverage | 1 January 1979 to present
Temporal resolution | Daily
File format | NetCDF 4, Climate and Forecast (CF) Metadata Convention v1.6
Data type | Grid
Horizontal resolution | 0.1° x 0.1°

Some remarks related to the quality of the AgERA5 data set:

  • The spatial resolution of the downloaded ERA5 was selected such that it is as close as possible to the native resolution of ERA5. Therefore the original ERA5 model level data (reanalysis-era5-complete; 0.28125°) was downloaded from the CDS instead of the interpolated version (interpolated to a 0.25° grid).
  • The applied aggregation zone definitions work very well with the local time zones of Western and Eastern Europe and for most of the North American continent. For Asia there is a shift of 2-3 hours between the actual local time and the definition used in our study. The only extreme mismatch of the local time definitions occurs east of the date line in zone E4. Fortunately, the affected areas (Pacific islands and the far western coast of Alaska) are not particularly significant from an agricultural perspective.
  • A location-, variable- and season-specific bias correction towards the HRES operational model was applied. This way the finer topography, finer land use pattern and finer land-sea delineation of the HRES operational model are largely included in the downscaled ERA5. In effect, the ERA5 data set was tuned to the detailed topography of the HRES operational model, which also leads to more consistent time series between ERA5 and the HRES operational model.
  • AgERA5 data are represented by the elevation model of the HRES operational model (period 2016-04-01 to 2018-03-31; HRES model cycles 41r2, 43r1 and 43r3) at a spatial resolution of 0.1 degree.

The quality of the bias correction has been documented in a separate document named: "C3S422Lot1.WEnR.DS2_Downscaling and bias correction v1.7.pdf". NetCDF files are available that describe the performance of the bias correction in terms of the following sample statistics: MAE, RMSE, R-squared and bias. The 17 files listed below cover the cloud cover, dew point temperature, relative humidity, solar radiation, temperature, vapour pressure and wind speed variables:

  • models_Cloud_Cover_Mean_Model2SelSeasonals_eval.nc
  • models_Dew_Point_Temperature_2m_Mean_Model2SelSeasonals_eval.nc
  • models_Relative_Humidity_2m_06h_Model2SelSeasonals_eval.nc
  • models_Relative_Humidity_2m_09h_Model2SelSeasonals_eval.nc
  • models_Relative_Humidity_2m_12h_Model2SelSeasonals_eval.nc
  • models_Relative_Humidity_2m_15h_Model2SelSeasonals_eval.nc
  • models_Relative_Humidity_2m_18h_Model2SelSeasonals_eval.nc
  • models_Solar_Radiation_Flux_Model2SelSeasonals_eval.nc
  • models_Temperature_Air_2m_Max_24h_Model2SelSeasonals_eval.nc
  • models_Temperature_Air_2m_Max_Day_Time_Model2SelSeasonals_eval.nc
  • models_Temperature_Air_2m_Mean_24h_Model2SelSeasonals_eval.nc
  • models_Temperature_Air_2m_Mean_Day_Time_Model2SelSeasonals_eval.nc
  • models_Temperature_Air_2m_Mean_Night_Time_Model2SelSeasonals_eval.nc
  • models_Temperature_Air_2m_Min_24h_Model2SelSeasonals_eval.nc
  • models_Temperature_Air_2m_Min_Night_Time_Model2SelSeasonals_eval.nc
  • models_Vapour_Pressure_Mean_Model2SelSeasonals_eval.nc
  • models_Wind_Speed_10m_Mean_Model2SelSeasonals_eval.nc
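
For reference, the four sample statistics named above can be computed as in the sketch below. The exact definitions used in the evaluation files are not given in this document, so the conventions here are assumptions: bias is taken as the mean of (corrected value minus reference value) and R-squared as the squared Pearson correlation. The function name and the sample values are hypothetical.

```python
import math


def evaluation_statistics(predicted, reference):
    """Return bias, MAE, RMSE and R-squared for two equally long series."""
    n = len(predicted)
    errors = [p - r for p, r in zip(predicted, reference)]

    bias = sum(errors) / n                            # mean signed error
    mae = sum(abs(e) for e in errors) / n             # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean squared error

    # Squared Pearson correlation (one common reading of "R-squared").
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    r_squared = cov * cov / (var_p * var_r)

    return {"bias": bias, "mae": mae, "rmse": rmse, "r_squared": r_squared}


# Hypothetical daily mean temperatures (K): corrected ERA5 versus HRES.
corrected = [281.0, 284.0, 286.0, 291.0]
hres = [280.0, 285.0, 286.0, 290.0]
stats = evaluation_statistics(corrected, hres)
print(stats)
```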

Overall, the temperature, humidity and wind speed variables benefit most from the correction. The MAE is reduced by 30% to 60% in the majority of cases. Grid points located in mountainous areas or along coasts and lakes improve most. This is not surprising, as these are the areas where the largest systematic differences between ERA5 and HRES can be expected. Not only are the relative improvements large, the absolute MAE values after the correction are also small. For example, the MAE for the 24h mean of the 2m temperature (2t_davg) is below 0.72 K for all continents, and below 0.51 K for 4 of the 6 continents.

For the solar radiation flux (ssrd_dsumdiff) the MAE improvement is solid and ranges between 2% and 14%, depending on the region and subset. The results for the element "24h mean cloud cover" (tcc_davg) are mixed. For most grid points the correction does not add any value. The MAE improvement for the majority of grid points (land and below 800m) is between -2% and +4%, and therefore near zero. Only for grid points above 800m can we observe a small but clear improvement (2% - 8%).

The following conclusions were drawn from the evaluation study:

  1. The selected bias correction method has its largest benefits in mountainous areas, along coastlines and at lakes.
  2. Seasonal correction on top of the simple bias correction further improves the accuracy of the derived correction equations.
  3. The approach works remarkably well for 3 out of the 4 groups of variables. The averaged relative reduction of MAE is between 30% and 60%. These are:
    1. Temperature parameters
    2. Humidity parameters
    3. Wind speed
  4. The correction models for solar radiation flux reach an MAE improvement of 2% to 14%.
  5. For cloud cover the correction has only a minor effect for most of the grid points. However, mountainous regions still benefit from the correction with an MAE improvement of 2%-8%.

The correction towards the HRES operational model is very relevant for users that do near real-time monitoring of growing conditions and agricultural production. Note that the final ERA5 product will become available with a time lag of one week, including the temporary ERA5 line. For monitoring systems like JRC's Monitoring Agricultural ResourceS (MARS) such a time lag is too large, and therefore data in such systems have to be completed with data from the HRES operational model. When combining data from two datasets with different resolutions, biases might be introduced that negatively affect the monitoring performance. This can be avoided by correcting ERA5 towards the HRES operational model. Similar reasoning applies to forecast products like the ENS forecasts (15/30 day ensemble forecasts). These products can also be downscaled and bias corrected towards the HRES operational model. This way, more or less consistent time series are obtained, linking reanalysis, HRES and ENS data around a common 'HRES' reference. Some remarks:

  • To improve the timeliness of the foreseen service, the preliminary ERA5 product, ERA5t, also needs to be processed. We hereby assume that the bias correction algorithms, which are based on ERA5 data, can also be applied to ERA5t data.
  • Specifically for users that need to link ERA5 to HRES for NRT monitoring purposes, the following issue is relevant. The merge with the HRES operational model would need an additional service relying on specific data contracts with ECMWF. In addition, the HRES operational model data must be processed in the same way as the ERA5 data (daily aggregation, possibly elevation corrections, etc.).
  • Note that the HRES model is constantly improving (improved model physics, increased spatial resolution, etc.). Therefore, with each additional HRES model upgrade, the established statistical relationship between ERA5 and HRES will become less valid. Over time, this may lead to jumps in the time series, as the bias correction is correcting for aspects that have changed in the HRES model. In such a case, users that link ERA5 to HRES need to be warned, and eventually the bias correction needs to be updated.

3.4. Product Gap analysis

Currently AgERA5 is available for the years 1979 to present, as the remaining ERA5 data from 1950 to 1978 were not available during the project.

AgERA5 was bias corrected towards the HRES operational model. It is assumed that the HRES operational model reflects reality best, as it is based on an advanced assimilation and spatialization scheme using many quality-controlled observations, satellite imagery, weather balloons, etc. However, the AgERA5 data set does not yet represent the agricultural regions within a 0.1 degree grid cell. This requires an extra elevation correction for temperature (average, minimum and maximum air temperature, using a lapse rate of 6.5 °C/km) and possibly for humidity. The elevation of agricultural regions could be defined as follows. For example, if a 0.1 degree grid cell has more than 5% arable land, the elevation could be calculated as the median of all DEM pixels within that grid cell that are under arable land. If there is less than 5% arable land within the grid cell, the elevation could equal the lower quartile of all DEM pixels in the complete grid cell.
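
The elevation rule and lapse-rate correction proposed above can be sketched as follows. This is an illustration only and not part of the AgERA5 processing chain: the function names, the input layout (one elevation and one arable-land flag per DEM pixel) and the sample numbers are assumptions. The 5% threshold, the median and lower-quartile rules and the 6.5 °C/km lapse rate are taken from the text.

```python
import statistics

LAPSE_RATE_K_PER_M = 6.5 / 1000.0  # 6.5 °C per km, as stated in the text


def agricultural_elevation(dem_m, is_arable, min_arable_fraction=0.05):
    """Representative elevation (m) of the agricultural area in one grid cell.

    dem_m     -- elevations of all DEM pixels inside the 0.1 degree cell
    is_arable -- one boolean per DEM pixel, True where the pixel is arable land
    """
    arable = [z for z, flag in zip(dem_m, is_arable) if flag]
    if len(arable) / len(dem_m) > min_arable_fraction:
        # More than 5% arable land: median elevation of the arable pixels.
        return statistics.median(arable)
    # Otherwise: lower quartile of all DEM pixels in the cell.
    return statistics.quantiles(dem_m, n=4)[0]


def lapse_rate_correction(temperature_k, model_elevation_m, target_elevation_m):
    """Shift a temperature from the model elevation to the target elevation."""
    return temperature_k - LAPSE_RATE_K_PER_M * (target_elevation_m - model_elevation_m)


# Hypothetical grid cell: ten DEM pixels, the four lowest are arable land.
dem = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
flags = [True, True, True, True, False, False, False, False, False, False]
target = agricultural_elevation(dem, flags)
print(target, lapse_rate_correction(290.0, 1000.0, target))
```

Because temperature decreases with height, moving from a model elevation of 1000 m to a lower agricultural elevation raises the temperature.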

1 Note that ERA5t will be made available to users within 7 days of real time. It is expected to start later in 2019.

4. Data usage information

4.1. Practical usage considerations

The AgERA5 data set is well suited to studying all aspects of productivity and externalities of agricultural production over the period 1979 to present. The available variables match the input needs of most crop growth models, such as CGMS-WOFOST, EPIC-BOKU, EPIC-IIASA, EPIC-TAMU, GEPIC, LPJ-GUESS, LPJmL, pAPSIM, pDSSAT, PEGASUS, PEPIC, PRYSBI2, etc.

As an example, CIMMYT used AgERA5 data for the following objectives:

  • Classification of environments. In breeding programs, environments (location-year-management combinations) are usually classified as Drought, Optimum or Random Drought, based mainly on water regime. However, in non-irrigated experiments, water availability depends on precipitation, which can differ from environment to environment even when all of them are called "Drought".
  • New traits. Crop stages can now be described not only by duration in days (days to heading, days to maturity) but also by the total radiation received in each period, degree days, and other traits that make use of daily data.
  • Understanding environmental effects. Several new environmental variables can be obtained from daily data, and their effects on response traits and genotype effects can be studied. For example, we can use the maximum temperature at different crop stages, not only the maximum temperature during the complete crop cycle.
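
As an illustration of a trait derived from daily data, degree days can be accumulated from daily minimum and maximum temperatures such as Temperature_Air_2m_Min_24h and Temperature_Air_2m_Max_24h. The sketch below uses the common averaging method with a base temperature. The 5 °C base, the function name and the sample values are assumptions for illustration and do not describe CIMMYT's actual procedure.

```python
KELVIN_OFFSET = 273.15


def growing_degree_days(tmin_k, tmax_k, base_c=5.0):
    """Accumulate degree days from daily minimum and maximum temperatures in K."""
    total = 0.0
    for tmin, tmax in zip(tmin_k, tmax_k):
        mean_c = (tmin + tmax) / 2.0 - KELVIN_OFFSET  # daily mean in °C
        total += max(0.0, mean_c - base_c)            # days below the base add nothing
    return total


# Three hypothetical days with daily means of 15 °C, 18 °C and 4 °C.
tmin = [283.15, 285.15, 275.15]
tmax = [293.15, 297.15, 279.15]
gdd = growing_degree_days(tmin, tmax)
print(round(gdd, 2))
```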

JRC-MARS considers AgERA5 a candidate to replace their ERA-Interim data set in support of their crop yield forecasting and monitoring activities.

4.2. Known Limitations of product

See section 3.4.

5. Known issues

  1. Background

    Several users have reported erroneous temperature values in the Tmin-24h variable, where selected grid cells reach unrealistic values of around 220 K (about -53 °C) in locations with otherwise high temperatures. Analysis of the spatial distribution showed that cells with erroneous values are often found in Western Australia, but they are not limited to that region and occur in other parts of the world as well (Figure 1).

    Figure 1: Maps of AgERA5 Tmin-24h with rogue values (black cells) for several regions in the world

    Further analysis of the occurrence of the rogue Tmin-24h values shows that the problem occurs quite often. Figure 2 shows the number of files per year in which such rogue values occur in the Tmin-24h variable. Note that files with rogue Tmin-24h values cannot be found by looking at the temperature extremes, because the low Tmin-24h values are still within the valid range. For example, an erroneous value of 220 K in Western Australia in summer cannot be distinguished from a valid value of 220 K that occurs in Eastern Siberia on the same day. Instead, Figure 2 was generated by computing the first-order spatial differences and applying a threshold.
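
The detection approach described above (first-order spatial differences plus a threshold) can be illustrated with a minimal sketch on a single two-dimensional field. The 30 K threshold, the function name and the toy grid are assumptions chosen for illustration; the threshold actually used to produce Figure 2 is not stated in this document.

```python
def find_rogue_cells(grid, threshold_k=30.0):
    """Return (row, col) of cells that differ from all direct neighbours by more than threshold_k.

    grid is a list of rows holding temperatures in K.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    rogue = []
    for r in range(n_rows):
        for c in range(n_cols):
            neighbours = [
                grid[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < n_rows and 0 <= cc < n_cols
            ]
            # First-order differences between the cell and its direct neighbours.
            diffs = [abs(grid[r][c] - value) for value in neighbours]
            # Flag the cell only if it is an outlier with respect to every neighbour.
            if diffs and all(d > threshold_k for d in diffs):
                rogue.append((r, c))
    return rogue


# Toy Tmin-24h field for a hot region with one wrapped-around value in the centre.
field = [
    [318.2, 318.5, 318.1],
    [318.4, 219.2, 318.6],
    [318.3, 318.7, 318.0],
]
print(find_rogue_cells(field))
```

Requiring a large difference with every neighbour keeps sharp but real gradients, such as a coastline, from being flagged on one side only.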

    Surprisingly, there are large differences between time periods: the problem hardly occurs in the period 1979-1999, occurs quite regularly in the period 2000-2020 and occurs often since 2021. These time periods coincide with the batches in which the AgERA5 archive has been processed. The origin of the differences is not entirely clear but could be related to different encodings of the original ERA5 input data.

     

    Figure 2: Number of Tmin-24h files per year where the problem of rogue values occurs

    Problem analysis

    To find the origin of the problem, we need to look closely at the processing chain used for AgERA5 and the structure of the ERA5 files used as input for AgERA5. First of all, a feature of the ERA5 input files is that a file does not contain the data for 00:00 to 24:00 UTC. For example, the ERA5 file containing air temperature data for 2024-01-01 contains data ranging from 2024-01-01T07:00:00 up to 2024-01-02T06:00:00. Therefore, the AgERA5 processing line first harmonizes all data files so that they contain the time slices for the period 00:00 to 24:00 UTC. For example, to harmonize the data for 2024-01-02 the processing line takes the files for 2024-01-01 and 2024-01-02, opens them jointly with xarray and takes the slice covering 2024-01-02T00:00:00 <= time < 2024-01-03T00:00:00. Analysis of what happens in the processing line at those rogue Tmin-24h values showed that the problem is generated at the step where two ERA5 files are joined (see Figure 3).
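
The harmonization step can be illustrated without any NetCDF files. The sketch below reproduces only the time bookkeeping: it builds the hourly time stamps held by two consecutive ERA5 input files (07:00 to 06:00 the next day, as in the example above) and selects the 24 hours of one UTC day. The function names are hypothetical; the real processing line does this with xarray on the opened files.

```python
from datetime import datetime, timedelta


def era5_file_times(file_date):
    """Hourly time stamps stored in the ERA5 input file labelled file_date."""
    start = file_date + timedelta(hours=7)
    return [start + timedelta(hours=h) for h in range(24)]


def utc_day_slice(day):
    """(time stamp, source file date) pairs with day 00:00 <= time < next day 00:00."""
    previous = day - timedelta(days=1)
    selected = []
    for file_date in (previous, day):
        for t in era5_file_times(file_date):
            if day <= t < day + timedelta(days=1):
                selected.append((t, file_date))
    return selected


day = datetime(2024, 1, 2)
pairs = utc_day_slice(day)
from_previous = sum(1 for _, source in pairs if source != day)
print(len(pairs), "hours selected,", from_previous, "of them from the previous file")
```

In this layout the join between the two input files falls between 06:00 and 07:00 UTC of the harmonized day.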

    Figure 3: Above: A time series of the original hourly ERA5 input data for variable MN2T (minimum temperature). Below: hourly ERA5 data clipped to the 0-24 UTC period. The grey area shows the process of taking 2 ERA5 files and combining them into one new file, during which the rogue Tmin-24h values are generated.

    A second point to understand is that Figure 3 shows the ERA5 data as floating point values in kelvin. However, that is not how the data is stored on disk. The raw data coming from the ERA5 processing chain is not stored as floating point values (32-bit single-precision floats) but as a C short datatype (a signed 16-bit integer), with an offset and a scaling factor attached to the variable as attributes. This can be seen by inspecting the data in Panoply. The AgERA5 Tmin-24h is derived from the ERA5 variable "mn2t", and Panoply shows its scale_factor and offset values (Figure 4).

     

    Figure 4: Encoding of the variable mn2t in ERA5 input files.

    Tools like Panoply and xarray handle this transparently in the background: they recognize the offset and scale_factor and convert back and forth. Moreover, the scale_factor and offset are highly optimized values: each ERA5 file has its own scale_factor and offset in order to maximize the precision for the given data range.
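
To make the packing concrete, the sketch below applies the usual convention for packed NetCDF data (unpacked value = packed value * scale_factor + offset) to a signed 16-bit integer and derives the range of temperatures that one scale_factor and offset pair can represent. The function names are illustrative; the two constants are the values used in the worked example later in this section.

```python
INT16_MIN, INT16_MAX = -32768, 32767

# Encoding of the first input file, as used in the worked example in this section.
SCALE_FACTOR = 0.0015210688706274414
ADD_OFFSET = 268.68162880225003


def unpack(packed, scale_factor=SCALE_FACTOR, add_offset=ADD_OFFSET):
    """Convert a stored 16-bit integer back to a physical value (K)."""
    return packed * scale_factor + add_offset


def representable_range(scale_factor=SCALE_FACTOR, add_offset=ADD_OFFSET):
    """Smallest and largest physical value that fit in a signed 16-bit integer."""
    return (
        unpack(INT16_MIN, scale_factor, add_offset),
        unpack(INT16_MAX, scale_factor, add_offset),
    )


low, high = representable_range()
print(f"representable range: {low:.3f} K to {high:.3f} K")
print("318.309113 K fits:", low <= 318.309113 <= high)
print("318.871307 K fits:", low <= 318.871307 <= high)
```

With these constants the upper limit is roughly 318.52 K, so a value from another file that lies only slightly above it can no longer be stored with this encoding.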

    The tricky part comes when the newly sliced dataset has to be saved into a new NetCDF file that combines data from 2024-01-01 and 2024-01-02 (Figure 3). Under the hood, xarray still knows that this data is represented by a C short with a scale_factor and offset. The question is which scale_factor and offset to apply: the one for 2024-01-01 or the one for 2024-01-02? Xarray applies the scale_factor and offset from the first file it opens, so 2024-01-01 in this case.

    The location in time where things go wrong with the variable "mn2t" is marked with the diamond on the red curve in Figure 5 below. It is the first time slice of the second input file (red line), and xarray applies the scale_factor and offset of the first input file (green line) when saving the new NetCDF file.

    Figure 5: As the top panel of Figure 3, but with the input value that turns rogue marked with a black diamond.

    The temperature value that turns rogue is the first data point on the red curve (Figure 5), whose actual value is 318.871307 K. We convert it to a 16-bit integer by inverting the scale and offset of the NetCDF file represented by the green curve (Figure 4):

    >>> import math
    >>> math.trunc((318.871307 - 268.68162880225003)/0.0015210688706274414)
    32996

    This is what happens with the last data point on the green curve, whose value is 318.309113 K:

    >>> math.trunc((318.309113 - 268.68162880225003)/0.0015210688706274414)
    32626

    But the maximum value that can be stored in a signed 16-bit integer is 32767. So the first data point on the red line (marked with the diamond) is too large to fit in the range represented by a 16-bit integer, because the scale_factor and offset are not representative of it. The last data point on the green line just fits, as it remains below 32767.

    Thus, the rogue Tmin-24h values come from an integer overflow. Unfortunately, integer overflow does not generate any errors, as the value simply rolls over to the negative side of the signed integer range (starting from -32768), so it is hard to detect. Fixing it in terms of software is relatively easy: we just have to force xarray to write NetCDF files with single-precision floats instead of 16-bit integers. This takes twice as much disk space, but these are only temporary files, so that does not matter. Fixing it in terms of data is trickier: a complete reprocessing of AgERA5 will be required.
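
The wrap-around itself can be reproduced in a few lines. The sketch below mimics the two's-complement wrap-around of a signed 16-bit integer in plain Python and unpacks the result again, using the scale_factor and offset from the worked example above. It is a simulation of the failure mode, not the code of the AgERA5 processing line, and the helper names are hypothetical.

```python
import math

# Encoding of the first input file, taken from the worked example above.
SCALE_FACTOR = 0.0015210688706274414
ADD_OFFSET = 268.68162880225003


def wrap_int16(value):
    """Two's-complement wrap-around of an integer into the signed 16-bit range."""
    return (value + 32768) % 65536 - 32768


def pack_and_unpack(temperature_k):
    """Pack a temperature with the first file's encoding, then read it back."""
    packed = math.trunc((temperature_k - ADD_OFFSET) / SCALE_FACTOR)
    stored = wrap_int16(packed)  # the integer that actually ends up on disk
    return stored, stored * SCALE_FACTOR + ADD_OFFSET


for value in (318.309113, 318.871307):
    stored, restored = pack_and_unpack(value)
    print(f"{value:.6f} K -> stored {stored:6d} -> read back as {restored:.3f} K")
```

The second value wraps to -32540 and is read back as roughly 219.19 K, which is the order of magnitude of the reported rogue values. In xarray, the output type is usually controlled through the `encoding` argument of `Dataset.to_netcdf`, for instance with `{"dtype": "float32"}` for the affected variable; whether an inherited scale_factor and offset also have to be cleared should be checked against the xarray version in use.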

    Since 10 March 2024, the processing line has been updated in order to avoid this problem. However, fixing the issue in the full AgERA5 dataset will require a reprocessing of the archive. We are currently investigating what the consequences are and whether a full reprocessing is achievable.

    The consequences of the erroneous values for the fitness for purpose of the AgERA5 dataset are small. The cells with erroneous values are mostly located in deserts and other extremely warm areas, which are usually not used for agriculture. Nevertheless, such errors are undesirable and should preferably be fixed.

    Impact on AgERA5 variables

    All the input variables that are taken from ERA5 are stored as 16-bit C short datatypes, and therefore the problem of integer overflow (and underflow!) could happen for any ERA5 input variable that is used for generating AgERA5. Nevertheless, the impact differs between AgERA5 variables. Below is an expert assessment of the impact on the different variables.

    Temperature

    • Temperature variables that take the min or max of a time slice can be affected directly, depending on whether the rogue value falls within the selection window (24h or day/night time).
    • Assuming a maximum difference of 90 K for a rogue temperature value, the temperature variables that are based on the mean are affected by at most ~4 K (90 K / 24 timesteps = 3.75 K) for 24h mean values or ~7.5 K (90 K / 12 timesteps = 7.5 K) for 12h day-time or night-time mean values.

    Precipitation

    • Precipitation is affected slightly because the sum of all 24 hourly values is taken. However, an overflow will turn a precipitation value for a single 1h time slice into near-zero precipitation. The impact of this will not be noticeable due to the variable and erratic nature of precipitation.

    Global radiation

    • Global radiation is affected slightly. However, an overflow will turn a radiation value for a single 1h time slice into a near-zero radiation. The impact of this will not be noticeable due to the natural variability of radiation.

    Wind speed

    • Wind speed is hardly affected because an overflow or underflow will generate a wind speed in the opposite direction but at similar magnitude. The wind speed in AgERA5 is computed as the square root of the sum of the squared wind speeds in the u and v directions. Therefore an overflow or underflow will not cause a large difference in daily mean wind speed.
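
The reasoning for wind speed can be checked numerically. The sketch below computes the speed from the u and v components as described above and shows that flipping the sign of one component leaves the speed unchanged. The sample values are arbitrary and the function name is illustrative.

```python
import math


def wind_speed(u, v):
    """Wind speed (m s-1) from the u and v wind components."""
    return math.hypot(u, v)


u, v = 6.0, -8.0
normal = wind_speed(u, v)
flipped = wind_speed(-u, v)  # u component with the opposite sign
print(normal, flipped)
```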

    Humidity and vapour pressure

    • Individual humidity values could be affected when they coincide with a particular slice that is affected by rogue temperature values. Given that humidity is constrained between 0 and 100%, the impact is limited.
    • Vapour pressure is computed as the mean of 24 timesteps and therefore the impact is limited.

    Snow thickness and LWE

    • Snow variables are calculated as the mean of 24 time slices and therefore the impact will be limited.

    Precipitation type

    • Precipitation type is based on a count of the different time steps and is therefore affected only to a limited degree.

This document has been produced in the context of the Copernicus Climate Change Service (C3S).

The activities leading to these results have been contracted by the European Centre for Medium-Range Weather Forecasts, operator of C3S on behalf of the European Union (Delegation Agreement signed on 11/11/2014 and Contribution Agreement signed on 22/07/2021). All information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose.

The users thereof use the information at their sole risk and liability. For the avoidance of all doubt, the European Commission and the European Centre for Medium-Range Weather Forecasts have no liability in respect of this document, which is merely representing the author's view.
