Introduction
The new version of the C3S netCDF encoding standards (C3S-0.3) is an evolution of the existing encoding standards (C3S-0.2). It aims to make them more generic and permissive, in the sense that some items can be characterized as mandatory or optional depending on operational need. In that way, the encoding standards can be easily applied to projects that are not yet operational.
An item is characterized as mandatory if it is used during the processing of the data in order to be stored in MARS. All the other items are characterized as optional. Data also adheres to the DRS (Data Reference Syntax) and Controlled Vocabulary standard for naming files and structured paths.
It's important to mention that the updated version of the encoding standards, like the previous ones, is constrained by the CF convention and by standards coming from SPECS and CMIP5/6. This means that some metadata is not strictly needed by the hosting environment but is mandatory to make the output compliant with those standards. As a result, the output achieves maximum interoperability, satisfying users' expectations of being able to extract data both efficiently and in a uniform way across all models.
An example of the relaxation of the existing standards is that disseminating the data interpolated to a common grid may, under some circumstances, not be mandatory, since in many cases it is project-dependent.
Additionally, ACDD has also been taken into account when defining the data discovery-related metadata.
Hence, the following links are valuable sources of information that have informed the definition of this proposal:
CF convention standard names tables
SPECS file content and format, data structure and metadata
CMIP6 Data Request: MIP variables search
Change List
Encoding Guide for netCDF files
File Formatting
The format of the output products should be netCDF, and conform to the CF metadata standards following the requirements below:
- The output files shall be written through the NetCDF API
- The NETCDF4_CLASSIC model shall be adopted
- The recommended compression level shall be deflate=6
- Shuffling shall be enabled (shuffle=True)
- Fletcher32=True is strongly recommended
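As an illustration only, the settings above could be expressed with the keyword names used by the netCDF4-python library. This is a minimal sketch, assuming that library is installed; the helper name `create_compressed_variable`, the variable name "tas" and the grid shape are hypothetical and not part of the standard.

```python
# Settings from the list above, using netCDF4-python keyword names.
FILE_FORMAT = "NETCDF4_CLASSIC"
COMPRESSION_KWARGS = {
    "zlib": True,        # enable deflate compression
    "complevel": 6,      # recommended deflate level
    "shuffle": True,     # shuffle filter enabled
    "fletcher32": True,  # checksum, strongly recommended
}


def create_compressed_variable(path, name="tas", shape=(180, 360)):
    """Create a file holding one compressed variable (hypothetical helper)."""
    # Third-party package, imported lazily so the constants above
    # remain usable without it.
    import netCDF4

    with netCDF4.Dataset(path, "w", format=FILE_FORMAT) as dataset:
        dataset.createDimension("lat", shape[0])
        dataset.createDimension("lon", shape[1])
        variable = dataset.createVariable(
            name, "f4", ("lat", "lon"), **COMPRESSION_KWARGS
        )
        # Report the filters actually applied to the variable.
        return variable.filters()
```

Keeping the settings in one dictionary makes it easier to apply the same encoding to every output variable.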
File Structure
The file structure shall be:
- Each netCDF4 file shall contain a single output variable (along with coordinate/grid variables, attributes and other metadata) from a single model and a single simulation (i.e., from a single ensemble member and a single start date)
- Recommended maximum file size of 4GB (to avoid I/O performance issues)
Each file shall be accompanied by a file containing a hash created with sha256sum.
Note: how to create hash files
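A hash file equivalent to the output of sha256sum can also be produced from Python. The sketch below is one possible approach, not a prescribed tool; the function name and the ".sha256" suffix for the accompanying file are assumptions, since the standard does not name the hash file here.

```python
import hashlib
from pathlib import Path


def write_sha256_file(data_path):
    """Write '<name>.sha256' next to the data file (suffix is an assumption).

    The content follows the 'digest, two spaces, file name' layout that
    sha256sum prints.
    """
    data_path = Path(data_path)
    digest = hashlib.sha256()
    with data_path.open("rb") as handle:
        # Read in 1 MiB chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    hash_path = data_path.with_name(data_path.name + ".sha256")
    hash_path.write_text(f"{digest.hexdigest()}  {data_path.name}\n")
    return hash_path
```

A file written this way should be verifiable with `sha256sum -c` when run from the directory that contains the data file.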
File Naming Conventions
The filenames of the products in the C3S seasonal forecast are constructed following the CMIP5/6 and SPECS DRS elements, as described below.
It's important to highlight that each file must contain only a single output field from a single simulation (i.e., a single run).
In addition, the output filename shall be constructed using a subset of metadata.
C3S Output Filename Conventions
The general filename format for output files generated within C3S shall follow the convention below. All the elements are separated by underscores (“_”) and must appear in the following order:
The Convention
<institute_id>_<model_id_tag>_<forecast_type>_<start_date_identifier>_<modeling_realm>_<frequency>_<level_type>_<variable_name>_<ensemble_member>.nc
Details:
<institute_id> is the institute id as it is defined in the controlled vocabulary of the global attributes;
<model_id_tag> as it is defined in the description of the "source" global attribute. For project-specific files, the model_id should start with the name of the project from the global attribute "project" (e.g. CERISE-SystemName-v20240101);
<forecast_type> as it is defined in the controlled vocabulary of the global attributes;
<start_date_identifier> is a string "SYYYYMMDDHH" that defines the start date of the forecast;
<modeling_realm> as it is defined in the controlled vocabulary of the global attributes;
<frequency> as it is defined in the controlled vocabulary of the global attributes;
<level_type> as it is defined in the controlled vocabulary of the global attributes;
<variable_name> is the short name of the variable inside the netCDF file;
<ensemble_member> is the 'realization' coordinate value inside the netCDF file;
.nc is the general netCDF suffix extension.
As stated above, it must be possible to rebuild the file name from the contents of the metadata. As a result, all the above attributes should be mandatory global attributes of the netCDF file (see below).
Examples of the above convention are:
lfpw_System8-v20210101_forecast_S2023030100_atmos_12hr_pressure_ta_r25i00p00.nc (contribution to the C3S operational service)
lfpw_CERISE-SystemName-v20210101_hindcast_S2010110100_land_day_soil_mrlsl_r01i00p00.nc (contribution to the CERISE project)
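The convention can be sketched as a pair of helper functions that build and parse a filename. This is an illustrative sketch only: the function names are hypothetical, and it assumes that no individual element contains an underscore, which holds for the two examples above.

```python
FILENAME_FIELDS = (
    "institute_id", "model_id_tag", "forecast_type", "start_date_identifier",
    "modeling_realm", "frequency", "level_type", "variable_name",
    "ensemble_member",
)


def build_filename(elements):
    """Join the nine elements, in the prescribed order, with underscores."""
    values = [elements[name] for name in FILENAME_FIELDS]
    for name, value in zip(FILENAME_FIELDS, values):
        if "_" in value:
            # Assumption of this sketch: "_" is reserved as the separator.
            raise ValueError(f"{name} must not contain an underscore: {value!r}")
    return "_".join(values) + ".nc"


def parse_filename(filename):
    """Split a filename back into its nine named elements."""
    if not filename.endswith(".nc"):
        raise ValueError("expected a '.nc' suffix")
    parts = filename[: -len(".nc")].split("_")
    if len(parts) != len(FILENAME_FIELDS):
        raise ValueError(f"expected {len(FILENAME_FIELDS)} elements, got {len(parts)}")
    return dict(zip(FILENAME_FIELDS, parts))


example = "lfpw_System8-v20210101_forecast_S2023030100_atmos_12hr_pressure_ta_r25i00p00.nc"
elements = parse_filename(example)
```

Parsing the first example above and rebuilding it with `build_filename(elements)` should return the same string, which is a simple round-trip check for a file name.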
Global attributes
The following properties are intended to provide information about where the data came from and what has been done to it. This information is mainly for the benefit of human readers and data discovery mechanisms. The global attribute values are all character strings. When an attribute appears both globally and as a variable attribute, it is the variable’s version which has precedence.
From version 0.3 of this encoding, the attributes have been categorised as mandatory or optional.
In addition, an attribute can be project-dependent and project-oriented. This means that an attribute can be defined in general as optional but be mandatory in the scope of a project through a controlled vocabulary. In other words, when a value is defined in a controlled vocabulary, this by definition makes that attribute mandatory in the scope of the project. The advantage of this approach is the flexibility of the encoding standards when they are used by a non-operational project.
The table below describes the minimum set of global attributes. Providers may define any additional attributes that add relevant information associated with the provider or the project and are thought to be useful. These additional attributes are allowed by the standards but are not controlled by them.
Attribute Name | Value | Required | Examples | Comment |
---|---|---|---|---|
Conventions | CF_convention_string C3S-0.3 [Other convention] ... | Mandatory | "CF-1.11 C3S-0.3" | Multiple conventions may be included (separated by blank spaces) |
title | Controlled vocabulary: "<short institution name> seasonal forecast model output prepared for C3S" For project use: "<short institution name> seasonal forecast model output prepared for CERISE project" CF: Free text ACDD (highly recommended) | Optional | "ECMWF seasonal forecast model output prepared for C3S" "DWD seasonal forecast model output prepared for CERISE project" | A short phrase or sentence describing the dataset. In many discovery systems, the title will be displayed in the results list from a search, and therefore should be human readable and reasonable to display in a list of such names. <short institution name> is the first element of the comma-separated list of values of the corresponding "institution" attribute |
references | Controlled vocabulary: URIs (such as a URL or DOI) for papers or other references. A valid doi is recommended CF: Free text | Optional | "doi:10.5194/gmd-8-1509-2015" | Published or web-based references that describe the data or methods used to produce it. For a research project which is still under development, the attribute is optional. |
source | String containing the version of the model, <model_id>. Additional information for an advanced description of the model is highly recommended. The following template should be followed in constructing the advanced string: "<model_id> : atmos: <model_name> (<technical_name>, <resolution_and_levels>); ocean: <model_name> (<technical_name>, <resolution_and_levels>); sea ice: <model_name> (<technical_name>); land: <model_name> (<technical_name>); coupler: <model_name> (<technical_name>)" Additional explanatory information may follow the required information. NOTE that for some models, it may not make much sense to include all these components. The first portion of the string, "model_id", should be built using the following template: "project-model_name-vYYYYMMDD", where YYYYMMDD is the release date of that version of the model (the date when it was first used). The "project" element is used only for projects; for the C3S operational service it is empty. | Mandatory | "System8-v20210101:atmos ARPEGEv6.4.2(cy37t1,Tl359L137); ocean NEMOv3.6 (ORCA025 L75); sea-ice GELATOv6; land surface SURFEXv8.0; coupler OASIS MCT v3.0; river routing CTRIP" "cerise-SystemName-v20240101:atmos ARPEGEv6.4.2(cy37t1,Tl359L137); ocean NEMOv3.6 (ORCA025 L75); sea-ice GELATOv6; land surface SURFEXv8.0; coupler OASIS MCT v3.0; river routing CTRIP" | The method of production of the original data. If it was model-generated, source should name the model and its version, as specifically as is useful. It is a character string fully identifying the model and version used to generate the output. It should include information concerning the component models. Note that information about changes in the individual components with respect to the "official" releases should be included (e.g. a different bathymetry). The "source" attribute should include as much information as possible, not just to identify the model but to brief the user about it. For project-specific files the model_id should provide information about the project. |
institute_id | Controlled Vocabulary: "ecmf" for ECMWF | Mandatory | "edzw" | Standardized 4-character identifier of the institution that produced the data; NOTE all the values come from abbreviations of WMO/GRIB "originating centre" table, except CMCC (not available there) |
institution | Controlled Vocabulary: "ECMWF, European Centre for Medium-Range Weather Forecasts, Reading, United Kingdom" "Met Office, Exeter, United Kingdom" "Météo-France, Toulouse, France" "DWD, Deutscher Wetterdienst, Offenbach, Germany" "CMCC, Centro Euro-Mediterraneo sui Cambiamenti Climatici, Bologna, Italy" "NCEP National Centres for Environmental Prediction" "JMA Japan Meteorological Agency" "ECCC, Environment and Climate Change Canada, Montreal, QC, Canada" "BOM, Australian Bureau of Meteorology, Melbourne, Australia" CF: Free text | Optional (highly recommended) | "Météo-France, Toulouse, France" | Specifies where the original data was produced. The name of the institution principally responsible for originating this data. NOTE: The first element of the comma-separated list of values will be used as a shortened version of this attribute in some of the other global attributes ('summary', 'title') |
contact | Controlled Vocabulary: Copernicus User Support URI should be used CF: Free text | Optional | "http://copernicus-support.ecmwf.int" optional for projects: "https://www.cerise-project.eu/" | |
project | Controlled Vocabulary: "C3S Seasonal Forecast" or "<project>" should be used CF: Free text | Mandatory | "C3S Seasonal Forecast" "CERISE" | The attribute "project" is always mandatory, however, the value depends on the operational service or the project. |
creation_date | SPECS: YYYY-MM-DDThh:mm:ss<zone> ISO 8601:2004 extended format | Mandatory | "2011-06-24T02:53:46Z" | The date on which this version of the data was created. Modification of values implies a new version, hence this would be assigned the date of the most recent values modification. Metadata changes are not considered when assigning the creation_date. NOTE: ACDD 1.3 names this attribute "date_created". The name "creation_date" has been used following the SPECS convention. |
comment | Free text | Optional | | Miscellaneous information about the data, not captured elsewhere. |
forecast_type | Controlled Vocabulary "forecast" or "hindcast" or "analysis" | Mandatory | "forecast" | To identify the type of data |
modeling_realm | Controlled Vocabulary "atmos", "ocean", "land", "landIce", "seaIce", "aerosol", "atmosChem", "ocnBgchem" | Mandatory | "seaIce" | A string that indicates the high-level modelling component that is particularly relevant to the variable encoded |
frequency | Controlled Vocabulary "mon", "day", "12hr", "6hr", "3hr", "fix" | Mandatory | "day" | A string indicating the interval between individual time-samples. Value depends on the variable (see "global attributes" column in variables tables) |
level_type | Controlled Vocabulary "surface", "pressure", "soil", "ocean2d" | Mandatory | "pressure" | A string indicating the type of the level where the variable comes from. Value depends on the variable (see "global attributes" column in variables tables) |
history | Controlled Vocabulary Empty string | Optional | "" | To avoid this attribute being polluted by the usual netCDF tools, it must be set to an empty string. |
commit | timestamp + URL of a commit in a version control repository | Optional | "2017-04-01T13:48:25Z https://git.ecmwf.int/projects/C3SS/repos/ecmf/System4_v20111101" | This attribute is intended to keep track of the tools/scripts used to post-process the data output from the model. Ideally it should contain the link to a repository containing the specific set of tools and scripts needed to reproduce the same data from the model output. It is highly desirable to have that traceability information. As a surrogate, when the previous is not feasible, it should include the timestamp followed by a URL pointing to the C3S documentation repository of the corresponding model version (properly labelled with the <model_id> introduced in the "source" attribute) |
summary | Controlled Vocabulary: ACDD (highly recommended) | Optional | "Seasonal Forecast data produced by DWD as its contribution to the seasonal forecast activity of the Copernicus Climate Change Service (C3S). The data has global coverage with a 1-degree horizontal resolution and spans for around 6 months since the start date" Optional for projects: "Seasonal Forecast data produced by CMCC as its contribution to the CERISE project. The data has global coverage with a 1-degree horizontal resolution and spans for around 4 months since the start date" | A short paragraph describing the dataset <short institution name> is the first element of the comma-separated list of values of the corresponding "institution" attribute |
keywords | Fixed string "Seasonal Forecasts, C3S, ECMWF, Copernicus, Climate Change, Climate Services, Earth Science Services, Environmental Advisories, Climate Advisories" ACDD (highly recommended) | Optional | | A comma-separated list of key words and phrases. NOTE: This attribute is likely to be modified in the future, once the contents of the Thesaurus for CDS faceting have been defined |
forecast_reference_time | SPECS: YYYY-MM-DDThh:mm:ssZ NOTE: This is ISO 8601:2004 extended format, but time zone is required to be UTC | Mandatory | "2011-06-01T00:00:00Z" | Time of the analysis from which the forecast was made. For "forecast_type"="analysis" this global attribute must be removed |
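The mandatory/optional split and the controlled vocabularies in the table above lend themselves to an automated check. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the attributes are assumed to be available as a plain dictionary of strings, and only the vocabularies that the table lists in full are checked.

```python
MANDATORY_ATTRIBUTES = {
    "Conventions", "source", "institute_id", "project", "creation_date",
    "forecast_type", "modeling_realm", "frequency", "level_type",
}

CONTROLLED_VOCABULARY = {
    "forecast_type": {"forecast", "hindcast", "analysis"},
    "modeling_realm": {"atmos", "ocean", "land", "landIce", "seaIce",
                       "aerosol", "atmosChem", "ocnBgchem"},
    "frequency": {"mon", "day", "12hr", "6hr", "3hr", "fix"},
    "level_type": {"surface", "pressure", "soil", "ocean2d"},
}


def check_global_attributes(attrs):
    """Return a list of problems found in a dict of global attributes."""
    problems = []
    for name in sorted(MANDATORY_ATTRIBUTES.difference(attrs)):
        problems.append(f"missing mandatory attribute: {name}")
    for name, allowed in CONTROLLED_VOCABULARY.items():
        if name in attrs and attrs[name] not in allowed:
            problems.append(f"{name}={attrs[name]!r} is not in the controlled vocabulary")
    # forecast_reference_time is mandatory, except for analysis data,
    # where it must be removed.
    is_analysis = attrs.get("forecast_type") == "analysis"
    has_reference_time = "forecast_reference_time" in attrs
    if is_analysis and has_reference_time:
        problems.append("forecast_reference_time must be removed for analysis data")
    if not is_analysis and not has_reference_time:
        problems.append("missing mandatory attribute: forecast_reference_time")
    return problems
```

An empty list means that no problem was detected by this sketch; it does not by itself prove compliance with the full standard.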
Spatial Coordinates
The table below describes all the requirements for the spatial coordinates.
The usage of a spatial coordinate depends on the data variable and is described in the variables section. This section describes how each spatial coordinate should be encoded.
Type (CMIP5) | Coordinate Name (CMIP5) | Dimension Names (CMIP5) | Axis | standard_name | long_name (CMIP5) | units (CF canonical units) | positive | valid_min (CMIP5) | valid_max (CMIP5) | bounds | Note |
---|---|---|---|---|---|---|---|---|---|---|---|
double | lat | lat | Y | latitude | latitude | degrees_north | N/A | -90. | 90. | lat_bnds | [-89.5, -88.5, ..., -0.5, 0.5, ..., 89.5] |
double | lon | lon | X | longitude | longitude | degrees_east | N/A | 0. | 360. | lon_bnds | [0.5, 1.5, ..., 358.5, 359.5] |
double | plev | plev | Z | air_pressure | pressure | Pa | down | N/A | N/A | N/A | [1000., 925., 850., 700., 500., 400., 300., 200., 100., 50., 30., 10.] |
double | depth | depth (for soil levels) None (scalar auxiliary coordinate for ocean variables) | Z | depth | depth | m | down | N/A | N/A | depth_bnds | depth=300; depth_bnds=[0,300] |
double | height | (scalar auxiliary coordinate) | Z | height | height | m | up | CMIP5: 2mtemp: 1. 10mu/v: 1. | CMIP5: 2mtemp: 10. 10mu/v: 30. | N/A | e.g. ~2 m standard surface air temperature and surface humidity height or ~10 m standard wind speed height |
double | sigma_theta | None (scalar auxiliary coordinate) | N/A | sea_water_sigma_theta | Sigma-theta of Sea Water | kg m-3 | N/A | N/A | N/A | N/A | |
double | temperature | None (scalar auxiliary coordinate) | N/A | sea_water_potential_temperature | Isotherm Temperature | degC CF: canonical units are K | N/A | N/A | N/A | N/A | |
Note about the horizontal coordinates: The regridding procedure to provide the data in the 1-degree grid must take into account that the full definition of the grid cells is given by the cell boundaries (lat_bnds, lon_bnds).
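The relationship between the 1-degree cell centres and their boundaries can be written out explicitly. The sketch below is illustrative; the helper name `one_degree_axis` is hypothetical, and it simply assumes cells that extend half a degree on either side of each centre, as the lat and lon values in the table imply.

```python
def one_degree_axis(first_centre, count):
    """Return (centres, bounds) for a regular 1-degree axis."""
    centres = [first_centre + index for index in range(count)]
    # Each cell extends half a degree on either side of its centre.
    bounds = [(centre - 0.5, centre + 0.5) for centre in centres]
    return centres, bounds


lat, lat_bnds = one_degree_axis(-89.5, 180)  # -89.5 ... 89.5
lon, lon_bnds = one_degree_axis(0.5, 360)    # 0.5 ... 359.5
```

With these definitions the outermost latitude bounds reach exactly -90 and 90, and the longitude bounds cover 0 to 360 without gaps or overlaps.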
Discrete Axes
The table below describes all the requirements for the discrete axes.
Type | Coordinate Name | Dimension Names | Axis | standard_name | long_name | units | bounds | Controlled vocabulary | Note |
---|---|---|---|---|---|---|---|---|---|
char | vegetation_type | vtype | N/A | area_type | N/A | N/A | N/A | The labelled axis is used to identify the vegetation type. The names should be chosen from the list of CF area types | |
C3S: string | realization | str31=31 | E | realization | realization | 1 | N/A | Members are not a physical quantity. Realization is a discrete coordinate and the members are its categorical values (ordered or non-ordered ones). SPECS approach: rXXiYYpZZ. In the current version, the realization coordinate variable doesn't comply with the CF conventions. In future revisions the realization variable will become a discrete axis like the vegetation type |
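The rXXiYYpZZ labels used for the realization coordinate can be generated and parsed with a few lines of code. This is a sketch only: the function names are hypothetical, the two-digit width is inferred from the examples in this document (e.g. r25i00p00), and reading "i" and "p" as initialization and physics indices follows the CMIP5 usage, which is an assumption here.

```python
import re

# Two-digit fields, as in the examples r25i00p00 and r01i00p00.
REALIZATION_PATTERN = re.compile(r"r(\d{2})i(\d{2})p(\d{2})")


def format_realization(realization, initialization=0, physics=0):
    """Build a label such as 'r25i00p00' from three integer indices."""
    return f"r{realization:02d}i{initialization:02d}p{physics:02d}"


def parse_realization(label):
    """Return the (realization, initialization, physics) integers of a label."""
    match = REALIZATION_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"not an rXXiYYpZZ label: {label!r}")
    return tuple(int(group) for group in match.groups())
```

For example, `format_realization(25)` gives "r25i00p00" and `parse_realization("r01i00p00")` gives `(1, 0, 0)`.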
Time Coordinates
The table below describes the requirements for the Time Coordinates.
The usage of all three Time Coordinates described below is mandatory in the encoding standards. However, the units depend on the project and on the physical variable itself, and are associated with the temporal resolution (frequency) of each variable. A controlled vocabulary has been introduced to reflect the dependency of the time coordinate encoding on the project.
Type | Coordinate Name | Dimension Names | Axis | standard_name | long_name | calendar | units | bounds | Notes |
---|---|---|---|---|---|---|---|---|---|
double | reftime | N/A | N/A | forecast_reference_time | "Start date of the forecast" | gregorian | UDUNITS time units | N/A | In SPECS it is only given as a "global_attribute". Only for forecast/hindcast. |
double | leadtime | leadtime | N/A | forecast_period | "Time elapsed since the start of the forecast" | N/A | SPECS: days C3S: requested units can be relaxed to equivalent time units | leadtime_bnds | The interval of time between the forecast reference time and the valid time. Boundaries are not needed when this time coordinate is used for instantaneous values (note that "time:point" is used as cell_method in those cases). When boundaries are required, the value of the coordinate must be in the centre of the corresponding time cell boundaries. Only for forecast/hindcast. |
double | time | leadtime (for forecast/hindcast) time (for analysis) | N/A | time | "Verification time of the forecast" or "Valid time" for analysis data | gregorian | SPECS: "days since 1850-01-01" C3S: requested units can be relaxed to equivalent time units | time_bnds | Time for which the forecast/analysis is valid. Boundaries are not needed when this time coordinate is used for instantaneous values (note that "time:point" is used as cell_method in those cases). |
Note: Definitions for "leadtime" and "time" have been taken from SPECS. The introduction of "reftime" as a variable has been adapted from SPECS global attribute description for the forecast reference time.
Note: Even though there are different requested time steps among the variables (6h, 12h, 24h), just one set of time axes has been defined, as that would be enough when applying the requirement of "one variable per file"
warning
In the forecast and hindcast data, "leadtime" has been selected as a dimension (instead of "time") for both "time" and "leadtime". That means "leadtime" is the coordinate and "time" is an auxiliary coordinate. The main difference between "leadtime" and "time" is that "time" is a time stamp representing the valid time of the forecast, while "leadtime" is the interval of time between the forecast reference time and the valid time.
- This diverges from SPECS (where "time" was the name of the dimension and the coordinate, and "leadtime" was an auxiliary coordinate)
- This choice has been made because
  - both reftime and leadtime are the relevant (let's say "orthogonal") coordinates coming from the relationship time = reftime + leadtime
  - it has some advantages when merging netCDF files ("leadtime" can be easily shared by different variables in a merged file, while "time" cannot)
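The relationship time = reftime + leadtime can be illustrated with standard date arithmetic. The sketch below assumes leadtime expressed in days and the SPECS unit "days since 1850-01-01" for time; the function names are hypothetical. Python's datetime uses the proleptic Gregorian calendar, which matches the "gregorian" calendar for the modern dates used here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1850, 1, 1, tzinfo=timezone.utc)


def valid_times(reftime, leadtimes_in_days):
    """Apply time = reftime + leadtime to each lead time."""
    return [reftime + timedelta(days=lead) for lead in leadtimes_in_days]


def days_since_1850(moment):
    """Encode a time stamp as fractional days since 1850-01-01."""
    return (moment - EPOCH) / timedelta(days=1)


reftime = datetime(2011, 6, 1, tzinfo=timezone.utc)
leadtime = [0.5, 1.0, 1.5]  # days since the start of the forecast
time = valid_times(reftime, leadtime)
```

The same `leadtime` list would apply unchanged to any other start date, whereas the `time` values change with every start date. This is why "leadtime" is easier to share between variables in a merged file.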
Cell boundaries
The table below describes the requirements for the Cell Boundaries in accordance with section 7.1 Cell Boundaries of CF convention.
Following the same approach as for Spatial and Time Coordinates, a controlled vocabulary has been introduced to provide encoding standards that are relevant only to a specific project.
Info
To represent cells we add the attribute bounds to the appropriate coordinate variable(s). The value of bounds is the name of the variable that contains the vertices of the cell boundaries. We refer to this type of variable as a "boundary variable." A boundary variable will have one more dimension than its associated coordinate or auxiliary coordinate variable. The additional dimension should be the most rapidly varying one, and its size is the maximum number of cell vertices. Since a boundary variable is considered to be part of a coordinate variable’s metadata, it is not necessary to provide it with attributes such as long_name and units.
Bounds Name | Dimensions | Note |
---|---|---|
time_bnds | leadtime, bnds | |
leadtime_bnds | leadtime, bnds | |
lat_bnds | lat, bnds | For C3S: Values (1x1deg grid) prescribed: |
lon_bnds | lon, bnds | For C3S: Values (1x1deg grid) prescribed: [0., 1.], [1., 2.], ... [359., 360.] |
depth_bnds | depth, bnds | (for soil layers) Should define the full vertical extent of the soil model layers. |
depth_bnds | bnds | (scalar auxiliary coordinate for ocean variables) Values prescribed (depth=300) [0,300] |
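When time boundaries are present, the Time Coordinates table requires the coordinate value to lie at the centre of its cell. A small check of that rule might look like the sketch below; the function names and the tolerance are hypothetical choices, and the daily-mean cells are only an example.

```python
def cell_centres(bounds):
    """Return the midpoint of every (lower, upper) pair."""
    return [(lower + upper) / 2 for lower, upper in bounds]


def is_centred(values, bounds, tolerance=1e-9):
    """Check that each coordinate value sits at the centre of its cell."""
    if len(values) != len(bounds):
        return False
    return all(
        abs(value - centre) <= tolerance
        for value, centre in zip(values, cell_centres(bounds))
    )


# Example: three daily-mean cells, with leadtime expressed in days.
leadtime_bnds = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
leadtime = cell_centres(leadtime_bnds)
```

Here `leadtime` evaluates to `[0.5, 1.5, 2.5]`, so `is_centred(leadtime, leadtime_bnds)` holds, while values placed at the start of each cell would fail the check.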
Grid mapping
Grid mapping follows section 5.6 Grid Mappings and Projections of the CF convention (see quote below).
When the coordinate variables for a horizontal grid are longitude and latitude, a grid mapping variable with "grid_mapping_name" of "latitude_longitude" may be used to specify the ellipsoid and prime meridian.
Following that, it has been decided to include, as mandatory, in this encoding guide the following variable
|
Appendices
Appendix II. Extension of the C3S encoding standards for analysis data