Contributors: F. Ambrogi, M. Blaschek, L. Haimberger, U. Voggenberger

Issued by: UNIVIE / Leopold Haimberger

Issued Date: 30/06/2022

Ref: DC3S311c_Lot2.1.5.2_Product User Guide

Official reference number of service contract: 2019/C3S_311c_Lot2_UNIVIE/SC1 and 2021/C3S2_311_Lot2_CNR-IMAA


History of modifications


Version | Date       | Description of modifications                       | Chapters / Sections
1.0     | 23/06/2022 | First draft                                        | Whole document
2.0     | 14/09/2024 | Major update to CUON v.3                           | Sections 4-6
3.0     | 20/06/2025 | Update for public release of dataset version 1.1.0 | Whole document

Acronyms

AMMA: African Monsoon Multidisciplinary Analysis

ATBD: Algorithm Theoretical Basis Document

BUFR: Binary Universal Form for the Representation of meteorological data

CDM: Common Data Model

CSV: Comma Separated Values (a table interchange format)

ECMWF: European Centre for Medium-Range Weather Forecasts

ECV: Essential Climate Variable

ERA5: 5th Generation European Reanalysis

GIUB: Geographisches Institut Universität Bern

HARA: Historical Arctic Rawinsonde Archive

IGRA2: Integrated Global Radiosonde Archive, version 2

MAESTRO: Mesoscale organisation of tropical convection (field campaign)

NCAR: National Center for Atmospheric Research

NCEP: US National Centers for Environmental Prediction

NetCDF: Network Common Data Format

ODB: ECMWF internal representation of observation and observation feedback data

WOUDC: World Ozone and Ultraviolet Radiation Data Centre

Product description

The In situ Comprehensive Upper-Air Observation Network (CUON) provides the most comprehensive harmonised public collection of conventional radiosonde and pilot balloon data available to date. It contains measurements of the ECVs temperature, humidity (relative, specific, dew point, dew point depression) and wind (speed and direction, u and v components). The network consists of more than 5400 individual upper-air observing stations at fixed locations and ca. 9000 mobile platforms. While the majority of records contain only a limited number of observations, around 1000 observing stations provide data spanning several decades. Besides the measured values and the most basic metadata (position, time), CUON provides observation minus background and observation minus analysis departure values computed by ERA5 during assimilation. CUON also contains homogeneity adjustments for temperature as well as for several wind and humidity variables, which can be applied to the raw observations, together with observation error estimates derived from the ERA5 departure statistics back to 1940. Metadata on the instrumentation type is also provided. All the information adheres to the Common Data Model for observations (https://github.com/ecmwf-projects/cdm-obs/blob/master/) and, more specifically, to its subset of variables, the CDM-OBS core.
Data are provided in binary (netCDF) as well as in compressed CSV format, both of which can be read with standard Python packages (e.g. xarray, pandas or h5py). Requests can be created via the CDS web forms or with simple Python scripts. A few example scripts that demonstrate basic usage are given below.

With version 1.1.0, the database contains records up to 31 December 2024.

Updates are planned yearly, with the aim of moving to monthly updates in the future. The following sections describe the data sources and coverage. Sections 7-11 describe the content of the retrieved data and give examples of how to use it for data analysis.




Figure 1: Location of fixed-position station records that contain at least one ascent. Notes: (1) "ECMWF" is the union of several data collections (see section 5, identifiers "ERA5 1", "ERA5 2", "ERA5 1759" and "ERA5 1761"); (2) station records may consist of ascents from different sources, and colours are plotted over the previous colours in the order listed in the legend.

Data and Metadata Sources

The following data sources were used to build the CUON "merged" database. References describing the
sources and the data can be found in the reference section of the download tab.

Name | Description
ERA5_1 | ERA5 analysis input and feedback (since 1979). Contains observations as well as departure statistics and instrumentation metadata.
ERA5_2 | ERA5 analysis input and feedback (1940-1978). Contains observations as well as departure statistics and instrumentation metadata.
ERA5_1759 | Analysis input for ERA5 received from NCAR (MIT tapes, used in ERA-40). Superseded by ERA5_2 except for years before 1940.
ERA5_1761 | Analysis input for ERA5 received from NCAR (MIT tapes, used in ERA-40). Superseded by ERA5_2 except for years before 1940.
AMMA | African Monsoon Multidisciplinary Analysis (2006)
BUFR | BUFR data from the ERA-40 collection (1958-1978)
HIGH RES BUFR | BUFR data in 1-2 second resolution (2014 onward)
HARA | Historical Arctic Rawinsonde Archive (1948-1996)
IGRA2 | Integrated Global Radiosonde Archive (IGRA) version 2 database from NOAA NCEI (1901 to present)
MAESTRO | Field campaign "Mesoscale organisation of tropical convection" near the Cape Verde islands (August-September 2024)
NCAR | National Center for Atmospheric Research upper-air database, NCAR catalog ds370.1 (1920 to present)
NPSOUND | NCEP/NCAR Arctic Marine Rawinsonde Archive (1954-1989)
SHIPSOUND | Daily Arctic Ocean Rawinsonde Data from Soviet Drifting Ice Stations (1976-1996)
WOUDC | Ozone sondes including temperature and humidity profiles (1962-2021)
Pangaea | Public radiosonde intercomparison data collected by UBERN and RIHMI for this service. See Imfeld et al. 2021, https://doi.pangaea.de/10.1594/PANGAEA.926532


As can be seen from Fig. 2, most of the data served by CUON come from the ERA5_1, ERA5_2 and IGRA2 collections. The ERA5 observation data collections are accessible only to ECMWF member-state users via the MARS archive, due to restrictions on certain observation types (other than radiosondes). The collection of radiosonde intercomparison data, both single radiosonde launches and aggregated data, described in detail by Imfeld et al. (2021), is part of CUON but is not currently served via the CDS.


The data come from the sources in various formats. As a first step, all source data have been aligned with the Common Data Model and then saved in netCDF4 format. From these source data, a merged data set was created.

The merging serves three purposes:

  • Concatenate pieces of the same station record found in different sources. Quite often, station records come in pieces, sometimes with different identifiers. Identifying those pieces can be challenging, and a consistent and comprehensive station inventory is essential.
  • Identify and remove duplicates. The data sources noted above show a high degree of redundancy. Users usually do not want to choose a data source themselves and rely on the data provider to make appropriate choices. For the present data sets the following rules are applied:
    • Select the record whose measurements reach highest into the atmosphere. If multiple such sources are available, the next rules are applied.
    • If ERA5_2 or ERA5_1 data are available, use these data, since for those background departure feedback information is also present.
    • Otherwise, use the record from the data source with the largest amount of valid data.
    • If the records from different sources have the same size, preference is given to IGRA version 2.
    • For finding duplicates, certain thresholds have to be used: the launch time difference must be less than 2 hours and the spherical distance between the coordinates of the observation stations must be less than 0.3 deg.
    • The station ID of the merged record is not necessarily the same as the station ID in the chosen data set. We first try to match the station against the radiosonde station IDs of the WMO OSCAR database. If a match is found, the WIGOS ID of the station record is also clear. If no match is found, we try to find the WMO ID of the station and coin a temporary WIGOS-formatted ID starting with 0-20000-0-WMOID. For a few stations, other temporary IDs, WIGOS-formatted but not globally valid, had to be generated. These IDs will be superseded once a Copernicus facility able to coin permanent WIGOS IDs is in operation. Station latitudes/longitudes are normally kept as reported for each ascent, since this allows station relocations to be tracked. Only where the latitude/longitude is of poor quality (e.g. a lost minus sign in the latitude) is it corrected.
  • Verify the consistency of latitude and longitude for records with the same station ID. If there are significant discrepancies in location, such as jumps greater than 1 degree (sometimes exceeding 10 degrees), these records are separated into distinct entries. This step applies to approximately 200 station IDs. Special consideration is given to quasi-stationary weather ship stations, identified by WMO IDs starting with "99", as they represent a unique case and may require different handling.
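The duplicate test described above can be sketched as follows. The 2-hour and 0.3-degree thresholds come from the text; the function and its signature are illustrative, not the operational merging code:

```python
from datetime import datetime, timedelta
import math

def is_duplicate_candidate(t1, t2, lat1, lon1, lat2, lon2,
                           max_dt=timedelta(hours=2), max_deg=0.3):
    """Flag two ascents as possible duplicates: launch times must differ by
    less than 2 hours and the great-circle separation by less than 0.3 deg."""
    if abs(t1 - t2) >= max_dt:
        return False
    # spherical (great-circle) angular distance in degrees
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_sep = (math.sin(p1) * math.sin(p2)
               + math.cos(p1) * math.cos(p2) * math.cos(dlon))
    sep = math.degrees(math.acos(min(1.0, cos_sep)))
    return sep < max_deg

# The same station reported twice, 30 minutes apart and ~0.06 deg offset:
print(is_duplicate_candidate(datetime(1960, 1, 1, 0, 0),
                             datetime(1960, 1, 1, 0, 30),
                             48.25, 16.36, 48.20, 16.30))   # True
```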

Metadata on the sensors used has been extracted from a database described by S. Schroeder (https://ams.confex.com/ams/pdfpapers/54950.pdf) up to the year 2013, if no other instrumentation metadata was available. Other instrumentation metadata is retrieved directly from the above sources where it was reported, using the WMO Manual on Codes (https://library.wmo.int/doc_num.php?explnum_id=10235). More details on the specific codes used are given in the Appendix.

Data Processing

The original merged data set contains a total of 5408 merged netCDF files for fixed-position stations and 9634 for mobile stations. The total number of records exceeds 50 million.

 

 


Figure 2: Stacked distribution of the record counts per data source for the entire data set version 1. It also includes the non-public data.



To improve the usability of the dataset, additional processing steps are performed.

First, the original merged records are augmented with additional derived climate variables. For example, an ascent, especially if it has never been assimilated, may contain only temperature, dew point, wind speed and wind direction information. In this case, relative humidity, specific humidity, dew point depression and the u and v components of the wind are calculated and added to the records. This avoids unnecessary data gaps for certain parameters. If mandatory pressure levels are missing in the data, those pressure levels are interpolated linearly in log(p), provided the vertical resolution and nearby measurements permit this.
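The conversions involved can be sketched with standard meteorological formulas (a minimal illustration, not necessarily the exact code used by the service):

```python
import numpy as np

def wind_components(speed, direction_deg):
    """Derive eastward (u) and northward (v) wind components from
    speed (m/s) and meteorological direction (degrees the wind blows FROM)."""
    rad = np.radians(direction_deg)
    u = -speed * np.sin(rad)
    v = -speed * np.cos(rad)
    return u, v

def dew_point_depression(t, td):
    """Dew point depression (K) from temperature and dew point (K)."""
    return t - td

# A 10 m/s wind from the west (270 deg) is purely eastward:
u, v = wind_components(10.0, 270.0)
print(round(u, 2), round(v, 2))   # 10.0 0.0
```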

In the subsequent step, the balloon drift is calculated based on the wind observations, with the method described in detail by Voggenberger et al. (2024). This approach enables the updating of positions for each observation. Due to the balloon drift, the columns latitude/longitude differ from latitude|header_table and longitude|header_table: the former contain the position of the observation, the latter the position of the balloon launch. Users can therefore decide whether to assume a vertical or a slanted ascent.

Some effort went into making available the time of the actual launch of the radiosonde or pilot balloons. The column record_timestamp contains the nominal time of the ascent (often given at full hours in TEMP messages), whereas the column report_timestamp contains the actual launch time. They typically differ by less than an hour. If the actual launch time was not available, it was estimated from periods where it was available, assuming the launch practice did not change. If the profile originates from modern high-resolution BUFR records, launch time and nominal time are identical.
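Once a file has been downloaded, the offset between the two timestamp columns can be inspected directly; a sketch with a toy dataframe standing in for real CUON output:

```python
import pandas as pd

# Toy data standing in for a downloaded CUON dataframe
df = pd.DataFrame({
    "record_timestamp": pd.to_datetime(["2010-07-01 03:00", "2010-07-01 15:00"]),
    "report_timestamp": pd.to_datetime(["2010-07-01 02:30", "2010-07-01 14:15"]),
})
# Positive offset: the balloon was launched before the nominal report time
offset_minutes = (df["record_timestamp"]
                  - df["report_timestamp"]).dt.total_seconds() / 60
print(offset_minutes.tolist())   # [30.0, 45.0]
```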

The launch time may differ substantially from the actual time of an observation, since a typical ascent lasts about 2 hours. The time since launch is not specified in most historical reports and is also not given by CUON. If needed, however, it can be reconstructed relatively accurately from the geopotential height of the balloon, assuming a constant ascent speed of 5 m/s (e.g. Voggenberger et al. 2024).
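Following this rule of thumb, the elapsed time at each level can be estimated from the geopotential (a sketch assuming the constant 5 m/s ascent rate mentioned above):

```python
G0 = 9.80665          # standard gravity (m/s^2)
ASCENT_SPEED = 5.0    # assumed constant balloon ascent rate (m/s)

def seconds_since_launch(geopotential, launch_geopotential=0.0):
    """Estimate time since launch (s) from geopotential (J/kg)."""
    height_gained = (geopotential - launch_geopotential) / G0   # metres
    return height_gained / ASCENT_SPEED

# A balloon 5500 m above the launch site has been flying for ~18 minutes:
print(round(seconds_since_launch(5500 * G0) / 60))   # 18
```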

Furthermore, background and analysis departures, which are typically needed for ascents not assimilated by ERA5, are calculated offline by interpolating gridded data to the ascent location and time. This procedure uses the previously calculated drift information to minimise the spatial distance between observation and background field. The CDM variable "conversion_flag" in the "observations_table" tracks whether the observation value was already available in the source data or was calculated during merging.

The sequence of the previous steps results in the full version of the merged CUON dataset, and constitutes the starting point for additional processing, namely adding bias adjustments and observation uncertainty estimates. The calculated bias estimates and the uncertainty estimates are stored in the CDM tables "advanced_homogenisation" and "advanced_uncertainty", respectively.

For a detailed description of how these are calculated, we refer to the ATBDs that accompany this document.
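The uncertainty estimates are based on the Desroziers et al. (2005) diagnostic, which in its simplest scalar form estimates the observation error variance as the mean product of analysis and background departures. A minimal sketch with synthetic departures (not the operational code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sigma_o, sigma_b = 1.0, 2.0          # true observation / background errors
truth = np.zeros(n)
obs = truth + rng.normal(0.0, sigma_o, n)
background = truth + rng.normal(0.0, sigma_b, n)

# A toy "analysis" combining obs and background with optimal scalar weights
w = sigma_b**2 / (sigma_o**2 + sigma_b**2)
analysis = background + w * (obs - background)

fg_depar = obs - background          # observation minus background
an_depar = obs - analysis            # observation minus analysis
# Desroziers diagnostic: E[an_depar * fg_depar] recovers sigma_o**2
print(round(float(np.mean(an_depar * fg_depar)), 2))   # close to 1.0
```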


Access and Examples

Users can retrieve CUON data with requests generated by the web form or by using the CDS API.

The following keywords are valid for the CDS API and the CDS:

Identifier | All possible values | Explanation
year | [YYYY, ..., YYYY], String | Years
month | [MM, ..., MM], String | Months
day | [DD, ..., DD], String | Days
area | [upper, left, lower, right], Integer or Float | Boundaries of the lat/lon rectangle used to select stations
variable | ["air_dewpoint", "air_temperature", "dew_point_depression", "eastward_wind_speed", "geopotential_height", "northward_wind_speed", "relative_humidity", "specific_humidity", "wind_from_direction", "wind_speed"], String | Meteorological variables and metadata
format | ['netcdf', 'csv'], String | Output format. "netcdf" returns a single netCDF file; "csv" returns a single CSV file


For data retrieval, run the code provided in the API request section at the bottom of the web form. This will only work with valid credentials set in the .cdsapirc file or the shell environment. The returned output file (netCDF or CSV) contains tables with the columns listed below. Some columns contain numeric code values defined in the CDM-OBS data model (see the list of tables at https://github.com/ecmwf-projects/cdm-obs/blob/master/tables/).


Identifier | Description | Type
observation_id

CUON observation identifier (YYYYnnnnnnnnnnnnnnnn), n starting at 0 for each year, unique for each observation

String (fixed length; byte strings in netCDF for efficiency, strings in CSV)

report_id

CUON report ID (YYYYnnnnnnnnnnnnnnnn), n starting at 0 for each year, unique for each ascent

String

data_policy_licence

Always set to 0 in CUON, meaning "Open Data in public domain and freely available (no cost and unrestricted)"; see https://github.com/ecmwf-projects/cdm-obs/blob/master/tables/data_policy_licence.dat

Integer

z_coordinate

Value of the vertical coordinate in Pa. Data originally in z coordinates were converted to pressure according to the ICAO standard atmosphere, see https://datalab24.com/earth_astronomy/international_standard_atmosphere

Float
z_coordinate_type

always set to 1 in CUON, indicating pressure coordinates (see https://github.com/ecmwf-projects/cdm-obs/blob/master/tables/z_coordinate_type.dat)


Int
observed_variable	

valid values are "air_dewpoint", "air_temperature", "dew_point_depression", "eastward_wind_speed", "geopotential_height", "northward_wind_speed", "relative_humidity", "specific_humidity", "wind_from_direction", "wind_speed"

String
observation_value	

The actual numerical value of the observation.

Float
units	

K for temperature values, J/kg for geopotential, m/s for wind speed values, deg (from north in anticyclonic direction) for wind direction, kg/kg for specific humidity, 1 for  relative humidity

String
sensor_id	

4-digit code according to https://gitlab.phaidra.org/ulrichv94/CUON/-/blob/master/CUON/public/merge/sensor_configuration_all.csv. Most codes are from S. Schroeder's vapor database; from 2013 onwards, WMO codes (near the end of the document) are used. The WMO codes are defined in https://library.wmo.int/records/item/35625-manual-on-codes-volume-i-2-international-codes

String
source_id	

the IDs listed in section 6 of this document, e.g. era5_1

String
station_name	

Local name of the station or ship. If no name was available, the nearest larger city with >10000 inhabitants was taken. 

String
platform_type	

0 (land station), 2 (ship) or 8 (mobile land) See https://github.com/ecmwf-projects/cdm-obs/blob/master/tables/platform_type.dat

Int
primary_station_id	

WIGOS style ID. Stations starting with 0-20000-0 or 0-20001-0 and IDs found in the OSCAR database https://oscar.wmo.int/surface/#/ are approved IDs. IDs starting with 0-20[1-9]???-0- are temporary CUON-specific IDs of stations/platforms which need to be included into the OSCAR system at a later stage

String
height_of_station_above_sea_level	

Altitude of station in m

Int
report_timestamp	

The time of the actual launch of the balloon in seconds precision. For data encoded in modern high-resolution BUFR it is identical with record_timestamp. For TEMP and PILOT messages this time is often not available or was not reported. In those cases the time was estimated from the difference between record_timestamp and launch time when it was available (typically between 30 and 45 minutes, sometimes 2 hours) or as 30 minutes if launch time information was missing completely. 

Int64 (seconds since 1900-01-01)
record_timestamp	

The nominal report time given in TEMP/PILOT messages, usually at the full hour. For high-resolution BUFR it is given as the actual launch time in seconds precision.

Int64 (seconds since 1900-01-01)
city	

Next larger city (>100000) near the station. Not given for ship platforms. 

String
an_depar@body

departure of observation_value from ERA5 analysed state interpolated to observation location. This value was calculated during assimilation, assuming straight vertical ascent of the balloon. Value available only for temperature, relative humidity, eastward_wind_speed and northward_wind_speed. For temperature the observations were bias-adjusted before calculating the departures, for all other variables no bias adjustments were applied in ERA5

Float
fg_depar@body

departure of observation_value from ERA5 background state interpolated to observation location. The background is the 4D-VAR trajectory calculated before the variational minimisation, started twice daily (6GMT and 18GMT)  (see Hersbach et al. 2020). This value was calculated during assimilation, assuming straight vertical ascent of the balloon. Value available only for temperature, relative humidity, eastward_wind_speed and northward_wind_speed. Note that for temperature the observations were bias-adjusted before calculating the departures, for all other variables no bias adjustments were applied in ERA5. Therefore jumps that appear in fg_depar@offline time series may not appear in the fg_depar@body timeseries (see section 9). 

Float
fg_depar@offline

departure of observation_value from ERA5 background state interpolated to observation location. The background is the 4D-VAR trajectory calculated before the variational minimisation, started twice daily (6 GMT and 18 GMT) (see Hersbach et al. 2020). This value was calculated offline by linear interpolation in time and in the vertical, and cubic horizontal interpolation, using the 3-hourly archived values from the initial 4D-Var trajectory on standard pressure levels. Values were calculated for all variables listed in the row observed_variable, but only at the 16 standard pressure levels where an observation was available. The balloon drift was taken into account when calculating these departures. Departures before 1940 were calculated as the difference between the observations and the 20th Century Reanalysis v3 (Slivinski et al. 2019).

Float
uncertainty_type1

Set to 1 for CUON, indicating random observation uncertainties (see https://github.com/ecmwf-projects/cdm-obs/blob/master/tables/uncertainty_type.dat). They were calculated from an_depar@body and fg_depar@body on standard pressure levels using the method of Desroziers et al. (2005), with a 30-day averaging window. Since they were calculated from the output of a data assimilation system, the uncertainty contains both measurement and representation errors.

Int
uncertainty_units1

Same as for observation_values

Str
uncertainty_value1

The actual values of the uncertainty estimates, available on standard pressure levels for variables where fg_depar@body and an_depar@body are available, and for records with sufficiently many values of fg_depar@body and an_depar@body. As such, the values are not available before 1940, for very short records, or for records where the data source was not ERA5.

Float
quality_flag

Indicates the result of quality control. See the CDM-Obs code table https://github.com/ecmwf-projects/cdm-obs/blob/master/tables/quality_flag.dat. Range checks were applied to the data (only values that passed are published in this dataset), but no advanced QC was applied. Always set to 2 (not checked) in CUON.

Int
longitude

The longitude of the actual observation, including the balloon drift estimate. It is identical to the launch location only for stations very near the Poles (latitude beyond 89 deg), e.g. the Amundsen-Scott station

Float
latitude

The latitude of the actual observation, including the balloon drift estimate. It is identical to the launch location only for stations very near the Poles (latitude beyond 89 deg), e.g. the Amundsen-Scott station

Float
longitude|header_table

The longitude of the location where the balloon was launched.

Float
latitude|header_table

The latitude of the location where the balloon was launched. 

Float
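The z-to-pressure conversion mentioned in the z_coordinate row follows the ICAO standard atmosphere. For the troposphere it can be sketched as follows (standard constants; a simplified illustration, not the service's exact implementation):

```python
# ICAO standard atmosphere, troposphere only (z below 11 km).
# Sea-level pressure (Pa), sea-level temperature (K), lapse rate (K/m),
# gravity (m/s^2) and the specific gas constant of dry air (J/(kg K)).
P0, T0, L, G, R = 101325.0, 288.15, 0.0065, 9.80665, 287.05

def pressure_from_height(z):
    """Pressure (Pa) at height z (m) in the ICAO standard troposphere."""
    return P0 * (1.0 - L * z / T0) ** (G / (R * L))

print(round(pressure_from_height(5574.0)))   # close to 50000 Pa (500 hPa level)
```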


In its current version, CUON allows only subarea selection and time windows of at least one day. This means that downloads often contain more data than actually needed, and some additional processing is usually required to extract the desired information. Data also come as "ragged" arrays, i.e. profiles of different ascents may have different lengths and may contain missing values. This makes them more challenging to use than gridded data.

There are many ways to deal with this. Here we show examples of reading the downloaded file into a pandas dataframe using Python. Once in dataframe format, unnecessary information can be dropped using the pandas selection tools. Other tools such as duckdb also allow sub-selections using SQL-like syntax (an example is provided later).


Let's start with a simple request. Note that we specify a target file name, which is not strictly necessary but convenient.

import cdsapi

dataset = "insitu-comprehensive-upper-air-observation-network"
request = {
    'variable': ['air_temperature', 'air_dewpoint'],
    'year': ['2010'],
    'month': ['07'],
    'day': ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'],
    'format': 'netcdf',
    'area': [50, 15, 45, 20],
}

target='download.nc'
client = cdsapi.Client()
client.retrieve(dataset, request, target)

Executing this code yields output like the following (the result remains cached on the CDS for some time afterwards):


2025-05-06 11:39:57,754 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
2025-05-06 11:39:57,755 WARNING [2024-06-16T00:00:00] CDS API syntax is changed and some keys or parameter names may have also changed. To avoid requests failing, please use the "Show API request code" tool on the dataset Download Form to check you are using the correct syntax for your API request.
2025-05-06 11:39:58,829 INFO Request ID is ee08852f-d416-4ca1-bf2a-51417a874ad6
2025-05-06 11:39:59,020 INFO status has been updated to accepted
2025-05-06 11:40:00,207 INFO status has been updated to running
2025-05-06 11:40:02,401 INFO status has been updated to successful
'download.nc'


Executing the code above creates a netCDF file, download.nc, which can be read with various tools or Python modules. For example, one can use the Python packages xarray and pandas to look at the data. The code below prints the first row of each column.

import os
import xarray as xr
import pandas as pd

df = xr.open_dataset(target).to_dataframe()
print('size:', os.path.getsize(target)//1024, 'kB')
for d in df.columns:
    print(f'{d: >35}\t{df[d][0]}')


size: 822 kB
                     observation_id	b'20100000000000060865'
                          report_id	b'20100000000000000176'
                data_policy_licence	0
                       z_coordinate	4410.0
                  z_coordinate_type	1
                  observed_variable	b'air_temperature'
                  observation_value	219.10000610351562
                              units	b'5'
                          sensor_id	b'79'
                          source_id	b'era5_1'
                        report_type	-2147483648
                       station_name	b'GRAZ-THALERHOF-FLUGHAFEN'
                       station_type	1
                      platform_type	0
                 primary_station_id	b'0-20000-0-11240'
  height_of_station_above_sea_level	378.79998779296875
                   report_timestamp	2010-07-01 02:30:00
                    report_duration	-2147483648
                   record_timestamp	2010-07-01 03:00:00
                       secondary_id	b'11239,11240,11241,5682,AUM0001'
                               city	b'Graz'
                      an_depar@body	-0.19339817762374878
                      fg_depar@body	-0.44553908705711365
                   fg_depar@offline	nan
                  uncertainty_type1	1
                 uncertainty_units1	b'5'
                 uncertainty_value1	nan
                          longitude	15.556254386901855
                           latitude	46.903221130371094
             longitude|header_table	15.430000305175781
              latitude|header_table	47.0

Note that strings are returned as byte strings. They can be converted to unicode strings using the decode function, e.g. df['units'].str.decode('utf-8'). If the data are read as CSV, string variables are automatically decoded.
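A generic pandas idiom decodes all byte-string columns at once (illustrated on a toy dataframe; not CUON-specific code):

```python
import pandas as pd

# Toy frame with byte-string columns, as returned by the netCDF route
df = pd.DataFrame({"units": [b"5", b"5"], "city": [b"Graz", b"Wien"],
                   "observation_value": [219.1, 220.4]})
for col in df.columns:
    # decode only object columns that actually hold bytes
    if df[col].dtype == object and isinstance(df[col].iloc[0], bytes):
        df[col] = df[col].str.decode("utf-8")
print(df["city"].tolist())   # ['Graz', 'Wien']
```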

The data will always include the columns above, which are also available in other upper-air data sets. This helps achieve interoperability between different observation datasets.


If 'format': 'netcdf' is replaced by 'format': 'csv', the data will be provided in zipped CSV format.
Executing the retrieval code again with the different format creates a zipped CSV file, download.zip.

import cdsapi

dataset = "insitu-comprehensive-upper-air-observation-network"
request = {
    'variable': ['air_temperature', 'air_dewpoint'],
    'year': ['2010'],
    'month': ['07'],
    'day': [f'{d:0>2}' for d in range(1, 32)],
    'format': 'csv',
    'area': [50, 15, 45, 20],
}

target='download.zip'
client = cdsapi.Client()
client.retrieve(dataset, request, target)

This file is much larger (by a factor of ~10) than the netCDF file. It can be read with various tools or Python modules. pandas can read the content of the zip file directly with one line of code, but let's first read the zip file as text; this helps to find the number of header lines.


import zipfile

with zipfile.ZipFile(target) as z:
    for n in z.namelist():
        with z.open(n) as f:
            l = 0                      # count the '#' header lines
            h = next(f).decode()
            while len(h) < 2 or '#' in h:
                print(h, end='')
                h = next(f).decode()
                l += 1

df = pd.read_csv(target, header=l)     # row l holds the real CSV header

The header explains the contents of the file:

########################################################################################
# This file contains data retrieved from the CDS https://cds.climate.copernicus.eu/cdsapp#!/dataset/insitu-comprehensive-upper-air-observation-network
# This is a C3S product under the following licences:
# 20180314_Copernicus_License_V1.1
# This is a CSV file following the CDS convention cdm-obs
# Data source: CUON
# Time extent: 20100701 - 20100731
# Geographic area (minlat/maxlat/minlon/maxlon): 45.17/49.92/15.19/19.87
# Variables selected and units:
# air_temperature [5]
# air_dewpoint [5]
# Uncertainty legend:
# uncertainty_value1 Random uncertainty
########################################################################################

It is important to consider that the data are stored in the CDS object store as flat netCDF4 files, where each file contains data from one day and from a certain longitudinal slice. The appropriate time and longitudinal slices are chosen automatically. Currently only rectangular lat/lon boxes and days can be queried. It is not possible to query for primary_ids or a specific launch time using the CDSAPI.

One can nevertheless select a certain station by specifying the bounding box as tightly as possible around the station coordinates. This works well for fixed stations. For ship platforms, a considerably larger domain is likely needed to catch all reports within a certain time interval.
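For example, a tight box around a single fixed station can be constructed from its nominal coordinates (the coordinates and margin below are illustrative):

```python
# Tight [upper, left, lower, right] box around a single fixed station.
# Station coordinates are illustrative; look them up e.g. in OSCAR.
station_lat, station_lon = 48.25, 16.36
margin = 0.1   # degrees; generous enough for reported-position jitter

area = [round(station_lat + margin, 2), round(station_lon - margin, 2),
        round(station_lat - margin, 2), round(station_lon + margin, 2)]
print(area)   # [48.35, 16.26, 48.15, 16.46]
```

The resulting list can be passed directly as the 'area' keyword of a CDS API request.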

To get only a specific ascent, or data from just one primary_station_id, it will often be necessary to run a more fine-grained query on the downloaded dataframe. This can be done using e.g. the xarray, pandas or duckdb packages.

For the files from the above queries we can get the included stations (primary_station_id values) as follows:

import numpy as np
print(np.unique(df['primary_station_id']))


['0-20000-0-11240' '0-20000-0-11747' '0-20000-0-12843' '0-20000-0-14240'
 '0-20001-0-11035']

This can be used to query e.g. for station Vienna (11035) using duckdb:

import duckdb

print(duckdb.query("select observed_variable, observation_value from df \
where record_timestamp == '2010-07-01 12:00:00' \
and primary_station_id == '0-20001-0-11035' \
and z_coordinate == 50000"))
which yields the temperature and dewpoint values at the 500 hPa level at the midday nominal launch time:

┌───────────────────┬───────────────────┐
│ observed_variable │ observation_value │
│      varchar      │      double       │
├───────────────────┼───────────────────┤
│ air_temperature   │             258.9 │
│ air_dewpoint      │             235.9 │
└───────────────────┴───────────────────┘

Alternatively one can do a similar query using pandas syntax:

df[(df.record_timestamp=='2010-07-01 12:00:00') & (df.primary_station_id=='0-20001-0-11035') & (df.z_coordinate==50000)].filter(['z_coordinate','observed_variable','observation_value'])

With this filter pattern, a rudimentary T-logp plot for one ascent can be generated with the following code:

tlogp=df[(df.record_timestamp=='2010-07-01 12:00:00') & (df.primary_station_id=='0-20001-0-11035')].\
filter(['z_coordinate','observed_variable','observation_value'])
%matplotlib inline
import matplotlib.pyplot as plt
for p in 'air_temperature', 'air_dewpoint':
    plt.semilogy(tlogp.observation_value[tlogp.observed_variable==p],
                 tlogp.z_coordinate[tlogp.observed_variable==p], label=p)
plt.ylim(100000,1000)
plt.ylabel('Pa')
plt.xlabel('K')
plt.title('0-20001-0-11035 '+'2010-07-01 12:00:00')
plt.legend()


Figure 3: Temperature and dewpoint at station Vienna on 1 July 2010, 12 GMT, using the code above (left panel) and using metpy (right panel). See https://gitlab.phaidra.org/ulrichv94/CUON/-/blob/master/CUON/public/testing/PUG_1.1.ipynb for details on reproducing the second plot.

Fancier displays such as skew-T log-p diagrams can be plotted using e.g. metpy. This package expects aligned numpy arrays for humidity and temperature observations. Since the columns of CUON output are organised as densely packed arrays, avoiding all-NaN rows, one may have to expand the humidity variables, which do not reach as high up as the temperature values, and fill the gaps with NaNs.
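One way to obtain such aligned arrays is to pivot the long-format dataframe so that each variable becomes a column indexed by pressure, with NaNs filling the levels where a variable is missing (a sketch on a toy profile):

```python
import pandas as pd

# Toy long-format profile: the dewpoint stops lower down than temperature
prof = pd.DataFrame({
    "z_coordinate": [85000, 50000, 30000, 85000, 50000],
    "observed_variable": ["air_temperature"] * 3 + ["air_dewpoint"] * 2,
    "observation_value": [280.0, 258.9, 228.0, 270.0, 235.9],
})
# One column per variable, indexed by pressure, NaN where data are missing
wide = prof.pivot_table(index="z_coordinate", columns="observed_variable",
                        values="observation_value").sort_index(ascending=False)
print(wide)
# air_dewpoint is NaN at 30000 Pa, where only temperature was reported
```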


Time series at standard pressure levels can be made e.g. as follows:

request = {
    'variable': ['air_temperature'],
    'year': [f'{m}' for m in range(2000,2010)],
    'month': [f'{m:0>2}' for m in range(1,13)],
    'day': [f'{d:0>2}' for d in range(1, 32)],
    'format': 'netcdf',
    'area': [50, 15, 45, 20],
}

target='download.nc'
client.retrieve(dataset, request, target)

df=xr.open_dataset(target).to_dataframe()
ts=df[(df.primary_station_id==b'0-20001-0-11035')&(df.observed_variable==b'air_temperature')&\
(df.z_coordinate==50000)].\
filter(['record_timestamp','observation_value'])
print(len(df),len(ts))
plt.plot(ts.record_timestamp,ts.observation_value)
plt.ylabel('K')
plt.title('Temperature at 500 hPa for station Vienna')

Figure 4: Example retrieval of a multi-year temperature observation time series from station Vienna.  

Please be aware that retrievals of time series at this stage (June 2025) take some time (~5 min) and may fail due to memory limitations if more than five years are retrieved. In this case it may be necessary to split requests into shorter time slices and then concatenate the series.
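One workaround is to split the request by year and concatenate the results afterwards. A sketch under the assumption that dataset and client are set up as in the examples above (the retrieval loop itself is commented out, since it requires CDS access):

```python
import pandas as pd

def yearly_requests(base_request, years):
    """Split a multi-year CDS request into one request per year."""
    return [{**base_request, 'year': [str(y)]} for y in years]

base = {
    'variable': ['air_temperature'],
    'month': [f'{m:0>2}' for m in range(1, 13)],
    'day': [f'{d:0>2}' for d in range(1, 32)],
    'format': 'netcdf',
    'area': [50, 15, 45, 20],
}
requests = yearly_requests(base, range(2000, 2010))

# Retrieve each slice and concatenate (needs cdsapi and xarray, not run here):
# frames = []
# for i, req in enumerate(requests):
#     client.retrieve(dataset, req, f'slice_{i}.nc')
#     frames.append(xr.open_dataset(f'slice_{i}.nc').to_dataframe())
# df = pd.concat(frames).sort_values('record_timestamp')
```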

How to use additional metadata

CUON records contain sensor metadata collected offline by various researchers in the past (e.g. Gaffen, 1994; Schroeder, 2013). For several years now, this information has also been reported in standard TEMP and BUFR messages, following the WMO Manual on Codes, Volume I.2: https://library.wmo.int/records/item/35625-manual-on-codes-volume-i-2-international-codes. In CUON, the instrument codes of Schroeder (2013) and of the WMO have been harmonised so that metadata from early years as well as from recent ascents can be provided in a single data column (sensor_id). This column can be used to filter stations using certain radiosonde types. 

As an example, a map of ascents using Vaisala RS41 sensors on 1 December 2024 is plotted below, illustrating the value of the available instrumentation metadata. Such retrievals currently take some time (about 20 minutes), but are expected to become faster in the future. 

request = {
    'variable': ['air_temperature', 'air_dewpoint'],
    'year': ['2024'],
    'month': ['12'],
    'day': ['01'],
    'format': 'netcdf',
}

mtarget='download1.nc'
client.retrieve(dataset, request, mtarget)

import numpy as np
import cartopy.crs as ccrs
import cartopy.feature as cfeature

df=xr.open_dataset(mtarget).to_dataframe()
ax = plt.axes([0, 0, 1, 1], projection=ccrs.PlateCarree())
ax.add_feature(cfeature.OCEAN, zorder=0)
ax.coastlines()
# One entry per station (unique launch longitudes)
_,idx=np.unique(df['longitude|header_table'],return_index=True)
# These sensor_id codes denote flavours of the Vaisala RS41
idy=np.where(np.isin(df['sensor_id'][idx],(b'123',b'124',b'125',b'141',b'142')))[0]
longitudes = list(df['longitude|header_table'][idx[idy]])
latitudes = list(df['latitude|header_table'][idx[idy]])
plt.scatter(longitudes, latitudes, s=40, alpha=1, edgecolor='k')
x=plt.title('2024-12-01 number of stations with Vaisala RS41 sensors: '+str(len(idy)))

Figure 5: Location of radiosonde stations using flavours of the Vaisala RS41 radiosonde on 1 December 2024.


CUON contains not only instrumentation metadata but also information about balloon drift. While radiosonde ascents have been assumed to be vertical in many operational applications in the past, the collocation errors caused by this assumption can no longer be tolerated given the current accuracy of models and data assimilation systems. The code below generates a figure which gives an impression of the drift to be expected in midlatitudes (in this case at station Vienna).

request = {
    'variable': ['air_temperature', 'air_dewpoint'],
    'year': [f'{m}' for m in range(1980,1983)],
    'month': [f'{m:0>2}' for m in range(1,13)],
    'day': [f'{d:0>2}' for d in range(1, 32)],
    'format': 'netcdf',
    'area': [50, 15, 45, 20],
}

tstarget='tsdownload.nc'
client.retrieve(dataset, request, tstarget)

df=xr.open_dataset(tstarget).to_dataframe()

ax = plt.axes([0, 0, 1, 1], projection=ccrs.PlateCarree())
ax.set_extent([12,22,46,51])
ax.add_feature(cfeature.OCEAN, zorder=0)
ax.add_feature(cfeature.BORDERS, linestyle='-', alpha=1)
ax.coastlines()
gl = ax.gridlines(crs=ccrs.PlateCarree(), draw_labels=True,
                  linewidth=2, color='gray', alpha=0.5, linestyle='--')

# All rows from station Vienna
idx=np.where(df['primary_station_id']==b'0-20001-0-11035')[0]
# Of these, the positions reported at the 300 hPa level
idy=np.where(df['z_coordinate'].values[idx]==30000)[0]
plt.scatter(df['longitude'].values[idx], df['latitude'].values[idx], s=10, alpha=1,
                color='r')
plt.scatter(df['longitude'].values[idx[idy]], df['latitude'].values[idx[idy]], s=10, alpha=1,
                color='y')
# Launch site position
plt.scatter(df['longitude'].values[idx][-2:], df['latitude'].values[idx][-2:], s=40, alpha=1,
                color='k')
x=plt.title('All Positions of radiosondes launched from Vienna 1980-1982')



Figure 6: Drift of balloons at station Vienna during the period 1980-1982 (ca. 2000 ascents). Red: all positions reached; yellow: positions reported at the 300 hPa level; black: position of the launch site. 

How to work with background departures and bias adjustments

Bias adjustments are available for all temperature, humidity and wind variables except wind speed. They are retrieved automatically together with the observation values and other columns. Owing to the method used for calculating them, they are usually set to zero if the available time series for a record is shorter than six months. Breakpoints and bias adjustments were calculated through analysis of the variable fg_depar@offline in the case of wind (see the corresponding ATBD document). To obtain adjusted observation time series, one needs to subtract them from the observation_value column. Users should do so when calculating statistics such as trends from observation time series, when comparing observations e.g. with satellite radiances, or when calculating gridded observation products. 
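Applied to a retrieved DataFrame, the subtraction is a one-liner. In the sketch below, the adjustment column name bias_estimate is a placeholder for illustration; the actual name should be taken from the columns present in your retrieval:

```python
import pandas as pd

# Synthetic example; the adjustment column name in real CUON output may differ
df = pd.DataFrame({
    'observation_value': [258.9, 259.4, 258.1],
    'bias_estimate': [0.0, 0.5, 0.5],   # hypothetical adjustment column
})

# Adjusted series: subtract the adjustment from the raw observations
df['observation_adjusted'] = df['observation_value'] - df['bias_estimate']
```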


The effect of the adjustments is often small compared to the overall variance of the observation values and is thus not easy to see. It becomes more visible when subtracting the adjustments from the fg_depar@offline column, which contains obs minus background differences. Figure 7 below shows the effect of adjustments at a station in the tropical Pacific. A break in the time series of fg_depar@body is clearly visible in 1995. A change in instrumentation is the likely reason, as indicated by the available metadata, where "AH" means a change from US radiosonde types ("ZKB" is VIZ-B) to Vaisala RS80. After adjustment, the standard deviation of the departures from ERA5 is clearly reduced, both for temperature at 50 hPa and for relative humidity at 200 hPa. Note that balloon drift has been taken into account when calculating fg_depar@body.  

Figure 7: Time series of fg_depar@offline without adjustment (blue) and with adjustment (fg_depar adj), for radiosonde station 91408 (Koror, in the tropical Pacific). Left: relative humidity at 200 hPa; right: temperature at 50 hPa. The shift in 1995 is clearly visible. For details on the calculation of the adjustments, see the ATBD on temperature and humidity adjustments (to be published soon). 


For expert users: please note that it makes no sense to subtract the adjustments from fg_depar@body and an_depar@body for temperature, since these were calculated AFTER subtracting the bias adjustments used when assimilating ERA5. The situation is different for the humidity and wind adjustments, since humidity and wind were not adjusted when assimilating ERA5. 

How to retrieve uncertainty estimates

Observation uncertainty estimates have been calculated, for records where the data source was ERA5_1 or ERA5_2 (about 76% of records), from the archived fg_depar@body and an_depar@body values, using the Desroziers et al. (2005) method with an averaging interval of 30 days. Note that these estimates contain both measurement and representation errors, and that the assimilation process was assumed to be bias-free. This assumption is justifiable for most of the temperature and wind data; for humidity data it is questionable, since these are known to have relatively large unadjusted biases. Still, we consider these estimates a quite useful tool to show the improvement of the in situ upper-air observation input over time. 
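The Desroziers (2005) diagnostic estimates the observation error as sigma_o ≈ sqrt(mean[(o-b)(o-a)]), i.e. from the product of paired background and analysis departures. A minimal sketch of the calculation (synthetic departures; CUON applies this over 30-day windows of fg_depar@body and an_depar@body):

```python
import numpy as np

def desroziers_sigma_o(fg_depar, an_depar):
    """Observation error estimate from paired background (o-b) and
    analysis (o-a) departures, following Desroziers et al. (2005)."""
    fg = np.asarray(fg_depar, dtype=float)
    an = np.asarray(an_depar, dtype=float)
    ok = np.isfinite(fg) & np.isfinite(an)   # skip missing pairs
    return np.sqrt(np.mean(fg[ok] * an[ok]))

# Synthetic consistency check: in the linear scalar case,
# o-a = sigma_o^2 / (sigma_o^2 + sigma_b^2) * (o-b),
# so the diagnostic should recover sigma_o.
rng = np.random.default_rng(42)
sigma_o, sigma_b = 1.0, 0.8
fg = rng.normal(0.0, np.sqrt(sigma_o**2 + sigma_b**2), 3000)
an = fg * sigma_o**2 / (sigma_o**2 + sigma_b**2)
est = desroziers_sigma_o(fg, an)   # close to sigma_o = 1.0
```

Note that for small samples the product mean can become negative, which is one reason a 30-day averaging interval is used.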

# tstarget is the output file already retrieved for the timeseries plot at the end of section 7

df=xr.open_dataset(tstarget).to_dataframe()
ts=df[(df.primary_station_id==b'0-20001-0-11035')&(df.observed_variable==b'air_temperature')&\
(df.z_coordinate==50000)].\
filter(['record_timestamp','uncertainty_value1'])


plt.plot(ts.record_timestamp,ts.uncertainty_value1,'.')
plt.ylabel('K')
plt.title('Temperature observation uncertainty estimate at 500 hPa for station Vienna')

Figure 8: Time series of the Desroziers temperature observation uncertainty estimate at 500 hPa for station Vienna. 


How to request the intercomparison data

CUON contains ascents from many international radiosonde intercomparisons, such as the one in Vacoas, Mauritius, in 2010. This information will be released at a later stage. Please look for updates of this document in the forthcoming months. 

User support

The service team provides tier-2 support with a latency of about one week. Queries should be submitted as JIRA requests via the CDS; they will then be directed to us. We will also try to maintain an FAQ.

Acknowledgements

The authors wish to thank the C3S team and collaborators for their continuous support, particularly Paul Poli, Markel Garcia-Diez and Edward Comyn-Platt.

References (other than data sources)

Desroziers, G., Berre, L., Chapnik, B. and Poli, P. (2005), Diagnosis of observation, background and analysis-error statistics in observation space. Q.J.R. Meteorol. Soc., 131: 3385-3396. https://doi.org/10.1256/qj.05.108

Hersbach, H. et al, 2020: The ERA5 global reanalysis. https://doi.org/10.1002/qj.3803

Imfeld, N., Haimberger, L., Sterin, A., Brugnara, Y., and Brönnimann, S.: Intercomparisons, Error Assessments, and Technical Information on Historical Upper-Air Measurements, Earth Syst. Sci. Data, https://doi.org/10.5194/essd-2020-395, 2021.

Slivinski, L. C., Compo, G. P., Whitaker, J. S., Sardeshmukh, P. D., Giese, B. S., McColl, C., Allan, R., Yin, X., Vose, R., Titchner, H., Kennedy, J., Spencer, L. J., Ashcroft, L., Brönnimann, S., Brunet, M., Camuffo, D., Cornes, R., Cram, T. A., Crouthamel, R., Domínguez-Castro, F., Freeman, J. E., Gergis, J., Hawkins, E., Jones, P. D., Jourdain, S., Kaplan, A., Kubota, H., Le Blancq, F., Lee, T., Lorrey, A., Luterbacher, J., Maugeri, M., Mock, C. J., Moore, G. K., Przybylak, R., Pudmenzky, C., Reason, C., Slonosky, V. C., Smith, C., Tinz, B., Trewin, B., Valente, M. A., Wang, X. L., Wilkinson, C., Wood, K. and Wyszyński, P. (2019), Towards a more reliable historical reanalysis: Improvements for version 3 of the Twentieth Century Reanalysis system. Q. J. R. Meteorol. Soc. https://doi.org/10.1002/qj.3598

Soci, C., Hersbach, H., Simmons, A., Poli, P., Bell, B., Berrisford, P., et al. (2024) The ERA5 global reanalysis from 1940 to 2022. Quarterly Journal of the Royal Meteorological Society, 150(764), 4014–4048. Available from: https://doi.org/10.1002/qj.4803

Voggenberger, U., Haimberger, L., Ambrogi, F. and Poli, P. (2024), Balloon drift estimation and improved position estimates for radiosondes, Geosci. Model Dev., https://doi.org/10.5194/gmd-17-3783-2024

 Appendix

As noted above, the comprehensive in situ upper-air data set relies on a station inventory. It also contains a few special tables or variables which are either extensions to the CDM, like the ERA5 observation feedback table, or for which the CDM-OBS only gives table definitions and the data provider has to fill in the content. Examples are the header_table and sensor_configuration tables.

1. Radiosonde type codes


Instrumentation metadata is essential for assessing the uncertainty of observations. Instrument type codes can be retrieved using the optional variable 'type', but in order to interpret the codes, a sensor_configuration table, which resolves the codes, is needed. The upper-air data set distinguishes about 3000 instrumentation codes. Most of them come from S. Schroeder's vapor data base (Schroeder, 2008). From 2013 onwards, WMO BUFR codes (as described in the WMO Manual on Codes https://library.wmo.int/records/item/35625-manual-on-codes-volume-i-2-international-codes) are used. The whole table cannot be downloaded via the CDS API, but it is available at https://gitlab.phaidra.org/ulrichv94/CUON/-/blob/master/CUON/public/merge/sensor_configuration_all.csv, where the columns sensor_id and comments contain the relevant information. A few example rows are given below. 


...

" Z51 VIZ model 1463-510 US Army ML-659(V)1/TMQ-4A 403 MHz LoRAN-C {1983}^ <F=T403DL,P=Ac,T=TR,U=C,N=L>",
" Z52 VIZ model 1464-510 US Army ML-659(V)2/TMQ-4A 403 MHz Omega {1983}^ <F=T403DL,P=Ac,T=TR,U=C,N=O>",
" Z53 VIZ model 1460-510 US Army ML-659(V)3/TMQ-4A 1680 MHz RDF {1983}^ <F=T1680DL,P=Ac,T=TR,U=C,W=Ra>",
...


The sensor_id code given in the output from the CDS API matches one row of this table. 
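The lookup can be done by merging retrieved data with the table, for instance as follows. A few synthetic rows stand in here for the downloaded CSV; the real file has the columns sensor_id and comments, as noted above:

```python
import io
import pandas as pd

# Stand-in for sensor_configuration_all.csv (real file: see URL above)
csv = io.StringIO(
    "sensor_id,comments\n"
    "123,Vaisala RS41 example entry\n"
    "Z51,VIZ model 1463-510 example entry\n"
)
sensor_conf = pd.read_csv(csv, dtype=str)

# Codes as they might appear in retrieved observation data
obs = pd.DataFrame({'sensor_id': ['123', 'Z51', '123']})

# Resolve each code to its description via a left join
resolved = obs.merge(sensor_conf, on='sensor_id', how='left')
```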

2. Station inventory

Table A. Links to comma-separated value station inventory files featuring station availability.

CDS dataset version    Link
1.1.0                  Link


This document has been produced in the context of the Copernicus Climate Change Service (C3S).

The activities leading to these results have been contracted by the European Centre for Medium-Range Weather Forecasts, operator of C3S on behalf of the European Union (Delegation Agreement signed on 11/11/2014 and Contribution Agreement signed on 22/07/2021). All information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose.

The users thereof use the information at their sole risk and liability. For the avoidance of all doubt, the European Commission and the European Centre for Medium-Range Weather Forecasts have no liability in respect of this document, which merely represents the author's view.
