Contributors: Leopold Haimberger, Federico Ambrogi, Michael Blaschek, Ulrich Voggenberger, Susanna Winkelbauer

Issued by: UNIVIE / Leopold Haimberger

Issued Date: 09/07/2025

Official reference number service contract: 2019/C3S_311c_Lot2_UNIVIE/SC1 and 2021/C3S2_311_Lot2_CNR-IMAA

History of modifications


Version | Date       | Description of modifications | Chapters / Sections
1.0     | 09/07/2025 | First version                | Whole document

Product change log

Dataset version | Date       | Description            | Changes as compared to previous version
1.1.0           | 09/07/2025 | Initial public version | -

Summary

The present document details the methods used in the generation of the In situ Comprehensive Upper-Air Observation Network dataset (CUON).

It describes the data processing workflow developed to produce a consistent and homogenised upper-air observational dataset, disseminated through the Copernicus Climate Data Store (CDS). The process ensures that disparate data sources are integrated into a single, high-quality dataset suitable for climate monitoring and reanalysis applications. The workflow consists of five primary stages: inventory creation, data harmonisation, data merging, variable enhancement, and homogeneity adjustment with uncertainty estimation.

The CUON dataset generation follows the path outlined in this schematic:

[Figure: schematic of the CUON dataset generation workflow]

These steps are described hereafter in more detail.

Detailed documentation of algorithm components

Creation of a Common Station Inventory

The processing begins with the construction of a comprehensive station inventory. Observational datasets from a variety of archives are collected, accompanied by metadata and gridded ERA5 reanalysis data to support later stages of quality control and analysis. The following sources provide input to the CUON dataset: 

All known stations are assigned unique identifiers (/meta/inventory_comparison_2/code/analyze_inventory_functions.py). For stations without official WIGOS identifiers (commonly referred to as orphan stations), surrogate IDs are generated to ensure consistent referencing. In addition, mobile platforms, such as those from ships or otherwise movable launch sites, are identified and marked accordingly to distinguish them from fixed-location stations.

The CUON dataset includes data from more than 5300 upper-air observing stations, identified through the following databases:

A known issue is that station coordinates may change over time, owing to station relocations or other historical reasons. We therefore allow a 30 km displacement threshold: a station record is considered valid if 99% of all its latitude-longitude pairs fall within this threshold. The coordinates are then compared against known station inventories: OSCAR, IGRA2, WBAN, and CHUAN. We have additionally included the inventories WMO World Records, SCHROEDER, AMMA, and HARA, where the last two come directly from the documentation of the respective datasets. This documentation exists only as PDFs of the accompanying scientific papers, so the metadata had to be digitised manually.
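A minimal sketch of this coordinate check, assuming the reported positions of one station are available as arrays of latitudes and longitudes and taking the median position as reference (the names and the choice of reference point are illustrative, not taken from the CUON code):

    import numpy as np

    EARTH_RADIUS_KM = 6371.0

    def great_circle_km(lat1, lon1, lat2, lon2):
        # Haversine distance in kilometres; inputs in degrees.
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
        a = (np.sin((lat2 - lat1) / 2.0) ** 2
             + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
        return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

    def station_position_valid(lats, lons, max_km=30.0, fraction=0.99):
        # Accept the station if 99% of its reported positions lie
        # within max_km of the (median) reference position.
        lats = np.asarray(lats, float)
        lons = np.asarray(lons, float)
        dist = great_circle_km(lats, lons, np.median(lats), np.median(lons))
        return np.quantile(dist, fraction) <= max_km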

Based on these inventories, WIGOS-like IDs were created, mimicking the structure of real WIGOS IDs. Below we report the list of currently implemented pseudo-WIGOS IDs, which will be replaced once proper governance for coining permanent WIGOS IDs is in place:

Relevant metadata entries from accompanying tables are incorporated into the CSV files to ensure completeness and traceability (/meta/inventory_comparison_2/code/make_station_configuration.py).
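For illustration, a WIGOS-style identifier consists of four hyphen-separated blocks (series, issuer, issue number, local identifier), and a pseudo-WIGOS ID for an orphan station can be coined by filling these blocks from the source inventory. The issuer value in this sketch is hypothetical, not one of the actual CUON allocations listed above:

    def make_pseudo_wigos_id(issuer, local_id, series=0, issue_number=0):
        # Assemble <series>-<issuer>-<issue number>-<local identifier>,
        # mimicking the structure of a real WIGOS station identifier.
        if not local_id:
            raise ValueError("local identifier must not be empty")
        return f"{series}-{issuer}-{issue_number}-{local_id}"

    # e.g. make_pseudo_wigos_id(20999, "47110") -> "0-20999-0-47110"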

The resulting inventory is enhanced with available metadata, including geographical coordinates, station type, and operational history. This enhanced station configuration list is stored in a central CSV file. The latest version is available at:

CUON_station_configuration_extended.csv

Multiple station configuration files are generated, one for each input dataset; together they serve as the foundational reference for all subsequent data handling. Refer to the Appendix in the PUG for the archived list of station inventories per CDS dataset version.

Harmonisation of Observational Data

In the harmonisation step, observational data from all sources are translated into a common data model and format. Data originally provided in various native formats are transformed into netCDF files with a unified structure. Within these files, all observations are systematically organised by date, report identifier, observed variable, and pressure level (/public/harvest/code_cop2/harvest_convert_to_netCDF_yearSplit.py).
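The sketch below illustrates the idea of this step under simplified assumptions: a handful of decoded records are sorted by date, report identifier, observed variable, and pressure level, then written to a netCDF file. Column names loosely follow the CDM observations table; the variable code and file names are placeholders:

    import pandas as pd
    import xarray as xr

    # Hypothetical records decoded from one native-format source file.
    records = pd.DataFrame({
        "report_id": [2, 1, 1, 1],
        "date_time": pd.to_datetime(["1979-01-01 12:00"] + ["1979-01-01 00:00"] * 3),
        "observed_variable": [85, 85, 85, 85],                   # placeholder CDM code
        "z_coordinate": [100000.0, 100000.0, 50000.0, 10000.0],  # pressure (Pa)
        "observation_value": [272.0, 271.2, 243.1, 215.4],
    })

    # Impose the unified ordering and write one harmonised yearly file.
    records = records.sort_values(
        ["date_time", "report_id", "observed_variable", "z_coordinate"],
        ascending=[True, True, True, False]).reset_index(drop=True)
    xr.Dataset.from_dataframe(records).to_netcdf("station_1979_harmonised.nc")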

Given their distinct nature and reporting patterns, orphan stations and mobile stations are processed using dedicated workflows that operate in parallel with the fixed-station processing.

The resulting files are fully compatible with the Common Data Model.

Merging of Data from Multiple Sources

To ensure consistency and avoid duplication, a merging process is applied whenever multiple sources provide data for the same station. For each overlapping observation, the algorithm selects the record with the most complete vertical structure—i.e., the highest top level and the greatest number of levels (/public/merge/merging_cdm_netCDF_yearSplit_SEP2023.py).
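A minimal sketch of this selection rule, assuming each candidate record carries its list of pressure levels in Pa (the structure and names are illustrative):

    def select_best_record(candidates):
        # Prefer the highest top level (lowest minimum pressure);
        # break ties with the greatest number of levels.
        return max(candidates,
                   key=lambda rec: (-min(rec["pressure_levels"]),
                                    len(rec["pressure_levels"])))

    a = {"source": "A", "pressure_levels": [100000.0, 50000.0, 30000.0]}
    b = {"source": "B", "pressure_levels": [100000.0, 50000.0, 30000.0, 1000.0]}
    select_best_record([a, b])["source"]   # -> "B" (reaches 10 hPa)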

Following the merge step, a single netCDF file remains for each station identifier, containing the most comprehensive data. As with earlier steps, orphan and mobile stations are treated separately, maintaining consistency and accounting for their unique data characteristics.

Enhancement of Variables and Model-Level Coverage

To complete the station records, missing mandatory pressure levels are filled through interpolation. Variables such as humidity and wind components are calculated where they are not directly available, using transformations based on related observed quantities. This step ensures that all key meteorological variables are represented across all time periods and pressure levels (/public/resort/convert_faster_with_recarray_plus_fb_year.py).
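As an illustration of these two operations, the sketch below interpolates a profile onto the mandatory pressure levels (linearly in log-pressure, one common choice; the method actually used is documented in the ATBDs) and derives wind components from speed and direction:

    import numpy as np

    MANDATORY_HPA = np.array([1000, 925, 850, 700, 500, 400, 300, 250,
                              200, 150, 100, 70, 50, 30, 20, 10], float)

    def fill_mandatory_levels(p_hpa, values):
        # Interpolate linearly in log-pressure; mandatory levels outside
        # the observed range are left as NaN (no extrapolation).
        order = np.argsort(p_hpa)                  # ascending pressure
        xp = np.log(np.asarray(p_hpa, float)[order])
        fp = np.asarray(values, float)[order]
        return np.interp(np.log(MANDATORY_HPA), xp, fp,
                         left=np.nan, right=np.nan)

    def wind_components(speed, direction_deg):
        # u, v from speed and meteorological direction (wind blows FROM).
        rad = np.radians(direction_deg)
        return -speed * np.sin(rad), -speed * np.cos(rad)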

Additional metadata, such as the balloon trajectories (Voggenberger et al. 2024) and the platform type, are added in this step. Offline background departures are calculated by comparing the updated observation data with the gridded ERA5 fields; these departures are later used to calculate the homogeneity adjustments.
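Conceptually, a background departure is simply the observation minus the ERA5 background evaluated at the observed level. A minimal per-profile sketch, assuming the ERA5 profile is already collocated in space and time and sorted by increasing pressure:

    import numpy as np

    def background_departures(obs_p, obs_v, era5_p, era5_v):
        # Observation minus ERA5 background interpolated (in log-pressure)
        # to the observed levels; era5_p must be in increasing order.
        bg = np.interp(np.log(obs_p), np.log(era5_p), era5_v)
        return np.asarray(obs_v, float) - bg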

After the enhancement procedures, individual yearly files are concatenated into continuous, station-wise time series. These comprehensive datasets form the basis for homogenisation procedures. Further details on the derivation and interpolation methods are provided in accompanying Algorithm Theoretical Basis documents.
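The concatenation itself can be pictured as below (the file pattern and the name of the record dimension are placeholders):

    import glob
    import xarray as xr

    files = sorted(glob.glob("0-20000-0-11035_*_enhanced.nc"))   # yearly files
    full = xr.concat([xr.open_dataset(f) for f in files], dim="time")
    full.to_netcdf("0-20000-0-11035_full_record.nc")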

Homogeneity Adjustments and Uncertainty Estimates

In the final step (unless the processing is done via the near-real-time updating script), homogeneity adjustments are applied to account for temporal inconsistencies in the observational record.

These adjustments are essential to address biases introduced by changes in instrumentation, observation techniques, or station relocations. Dedicated procedures are used for temperature, humidity, and wind, each described in separate ATBDs:

In parallel, uncertainty estimates are calculated using the Desroziers method. This statistical technique allows for objective quantification of observational error characteristics based on innovation statistics.
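In essence, the Desroziers et al. (2005) diagnostic estimates the observation-error variance as the mean product of analysis departures (o - a) and background departures (o - b) over many ascents. A minimal sketch for one variable and level:

    import numpy as np

    def desroziers_sigma_o(d_bg, d_an):
        # d_bg = y - H(x_b), d_an = y - H(x_a); per Desroziers et al.
        # (2005), E[d_an * d_bg] approximates the obs-error variance.
        d_bg = np.asarray(d_bg, float)
        d_an = np.asarray(d_an, float)
        ok = np.isfinite(d_bg) & np.isfinite(d_an)
        var = np.mean(d_an[ok] * d_bg[ok])
        return np.sqrt(var) if var > 0 else np.nan  # guard vs. sampling noise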

This document has been produced in the context of the Copernicus Climate Change Service (C3S).

The activities leading to these results have been contracted by the European Centre for Medium-Range Weather Forecasts, operator of C3S on behalf of the European Union (Delegation Agreement signed on 11/11/2014 and Contribution Agreement signed on 22/07/2021). All information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose.

The users thereof use the information at their sole risk and liability. For the avoidance of all doubt, the European Commission and the European Centre for Medium-Range Weather Forecasts have no liability in respect of this document, which is merely representing the author's view.
