Contributors: Swapan Mallick (SMHI), Jelena Bojarova (SMHI), Harald Schyberg (MET Norway), Per Dahlgren (MET Norway), Eivind Støylen (MET Norway), Xiaohua Yang (DMI), Kasper Tølløse (DMI)

Issued by: SMHI/Swapan Mallick

Date: 23/1/2026

Ref: C3S2_D361a.1.4.1_UncertaintyEstimation_v4

Official reference number service contract: 2022/C3S2_361a_METNorway/SC1 


Introduction

Reanalysis is a method for reconstructing past atmospheric states by integrating historical observations with a weather forecasting model. The Copernicus Arctic Regional Reanalysis (CARRA) is a high-resolution climate data product that assimilates an extensive time series of observations using the HARMONIE model and its 3D-Var data assimilation system to provide the best possible estimate of the atmospheric state. An important requirement for the CARRA reanalysis is the computation of ensemble-based uncertainties for essential climate variables. Numerical models inherently contain various uncertainties and are often run in ensemble mode to enhance forecast accuracy and evaluate uncertainty. During CARRA1, the CARRA team developed an approach that utilizes a limited number of high-resolution ensembles generated over several short time intervals, in conjunction with the derivation of background error statistics (Bojarova et al., 2020). This methodology offers a relatively straightforward estimation of uncertainty for key prognostic variables, employing a scaling method that compares ensemble spread with observation error variances at observation locations. The resulting uncertainty estimates are static; however, they do vary with height, season, and between the CARRA-West and CARRA-East domains. The tables listing the fields for which uncertainties are provided, as well as the data themselves, can be obtained from the Copernicus Arctic Regional Reanalysis (CARRA): Data User Guide ("What are the uncertainties of the data fields?") and the Copernicus Arctic Regional Reanalysis (CARRA): Known Issues and Uncertainty Information documentation pages.

For CARRA2, we face a challenge regarding the requirement for a high-resolution, extended reanalysis dataset in terms of both domain and potentially time range, while simultaneously providing uncertainty information when an ensemble system is computationally not feasible. Our proposed approach considers the experiences gained during CARRA1 and draws inspiration from the generation of time-varying uncertainty information as described in Olesen et al. (2013), where deterministic regional-scale information is supplemented with uncertainty estimation using global ensembles for projecting regional climate change. In this work, we again utilize the ensemble dataset generated in connection with the derivation of background error statistics, which is driven by the ERA5-EDA with ensemble components. This dataset is employed to produce a coincident ensemble in the limited-area model by introducing perturbations using the "BRAND" field perturbation approach (refer to section 2.2.2 of Yang et al., 2021, and the CARRA1 system documentation for details on the BRAND approach). The objective is to establish an empirical relationship, in the form of nonlinear regression or a machine learning approach, to predict the high-resolution regional spread of Essential Climate Variables (ECVs) (a scalar for each ECV) using both the high-resolution deterministic CARRA2 and the low-resolution ERA5 reanalysis EDA components as predictors (the full fields, employing an implicit multivariate approach). 
By taking the high-resolution CARRA2 spread as a proxy for uncertainty, we will be able to predict the CARRA2 uncertainty using the information available during the reanalysis production, even when a corresponding high-resolution ensemble is not available. The statistical model will be trained on this collected dataset to predict the high-resolution spread in the space of ECVs (as a proxy for uncertainty), which will be used to estimate the reanalysis uncertainty for the entire reanalysis period (including periods outside the time slices used for background error statistics derivation). This method is expected to capture variations in uncertainties with height, the presence of orography, observation network density, and other factors. The method computes the uncertainty in model space: we first calculate the ensemble uncertainty in terms of the standard deviation (SD) and the ensemble mean from CARRA2 and ERA5-EDA. This approach also has the potential to provide weather situation-dependent uncertainties and will offer a more detailed description of the actual variations in uncertainty across space and time than was achievable with the method used in CARRA1.

Following the Introduction, section 2 of this report outlines the specifics of the ERA5 ensemble and CARRA2 ensemble datasets utilized, along with the selected parameters. Section 3 elaborates on the application of machine learning techniques for uncertainty quantification (UQ), covering the proposed methodology involving the Denoising Diffusion Probabilistic Model as well as the training and validation processes for the machine learning approach. Section 4 summarizes the practical implementation and the outputs provided to users. Supplementary figures illustrating the performance of the Denoising Diffusion Probabilistic Model are presented in Appendix 1. Data preparation, the training procedure, and the validation of results are described in detail in Appendix 2. Technical details on the implementation of the proposed methodology, including instructions on how to install the UQ system and the model output details, are provided in Appendix 3.

Data Set

ERA5-EDA ensemble data structure

To train the machine learning model, one of our main data sources is the ECMWF ERA5 reanalysis (Hersbach et al., 2020). The ERA5 reanalysis datasets are generated by continuously integrating observations using 4D-Var data assimilation with the Integrated Forecasting System (IFS) model cycle CY41R2, at 31 km horizontal resolution with 137 hybrid sigma/pressure model levels in the vertical. The ERA5 dataset includes a ten-member ensemble (EDA) that has a lower spatial and temporal resolution (approximately 60 km horizontally and 3-hourly) than the original ERA5 product (around 30 km horizontally and hourly). This lower-resolution analysis dataset is used for estimating uncertainty in ERA5. More details on the ERA5-EDA component can be found in ERA5: Data Documentation (accessible via ERA5: data documentation), which offers a comprehensive overview of the various products and lists all available geophysical parameters. The ERA5-EDA can be downloaded directly from MARS or through the Copernicus Climate Data Store (CDS).

CARRA2 ensemble data structure

A time series of four sets of analysis, 3-hour, and 6-hour forecast files was generated using the HARMONIE-AROME ensemble prediction system environment applied for creating the CARRA2-EDA. The simulation periods were selected to cover all four seasons of 2022: 1 January - 15 February, 1 March - 15 April, 1 June - 15 July, and 1 September - 15 October. To generate ensemble forecasts, the boundary conditions were taken from ERA5-EDA. Initial conditions are generated by the HARMONIE-AROME model using the BRAND ensemble generation methodology, which allows the uncertainty to be sampled in model space using climatological structure functions. For all experiments, ensemble forecasts with 10 members were produced at a spatial resolution of 2.5 km with 65 vertical levels. Figure 1 shows the spatial distribution of 2-meter temperature (in Kelvin) over the CARRA2 domain for the control member (CNTL), together with the 2-meter temperature differences between CNTL and each ensemble member, for the analysis valid on January 20, 2022, at 06:00 UTC. For the control member, the model was initialized using the 3D-Var data assimilation method. For the winter and summer periods, only conventional observations are assimilated, such as SYNOP, SHIP, BUOY, and AIRCRAFT data. For the spring and autumn experiments, satellite data are assimilated in addition to conventional observations (offline tests, not shown here, indicate that the results are not very sensitive to whether satellite data are used). All available observations were included within a +/- 3-hour window for each assimilation cycle. Initial conditions for the nine perturbed ensemble members were obtained by adding a perturbation to the unperturbed control analysis. Figure 2 shows a schematic representation of the HARMONIE-AROME ensemble data assimilation and forecasting system.

All CARRA2 simulated raw output files for each ensemble member are stored in HARMONIE internal (FA) format. During the ML data pre-processing phase, the FA files are first converted to NetCDF format using the epygram software (https://umr-cnrm.github.io/EPyGrAM-doc/#). During the conversion from FA to NetCDF, each ensemble member and parameter are stored separately. For instance, when dealing with 2-meter temperature and 10 ensemble members, we generate 10 separate NetCDF files for each assimilation cycle. These files are subsequently utilized to calculate the standard deviation and ensemble mean, which are saved as image files (PNG format) for each parameter, serving as input for our ML model. To reduce computational costs, an alternative approach is being developed that utilizes the Zarr format (https://zarr.readthedocs.io) directly as input for the ML model. Efforts are currently underway to implement this method.
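As a toy illustration of this pre-processing step (the array shapes are assumptions; the real pipeline reads the NetCDF files produced by epygram), one plausible way to produce the PNG inputs is to reduce the stacked member fields to spread and mean and min-max scale them to 8-bit greyscale values, keeping the scaling bounds so the physical units can be recovered later:

```python
import numpy as np

def to_uint8(field):
    # Min-max scale a 2-D spread/mean field to 0-255 so it can be written
    # as a greyscale PNG; the (lo, hi) bounds must be stored alongside the
    # image to recover physical units later.
    lo, hi = float(field.min()), float(field.max())
    scaled = (field - lo) / (hi - lo + 1e-12)
    return np.round(scaled * 255).astype(np.uint8), (lo, hi)

rng = np.random.default_rng(0)
members = rng.random((10, 64, 64))   # 10 ensemble members on a toy grid

sd_img, sd_bounds = to_uint8(members.std(axis=0, ddof=1))   # ensemble spread
mean_img, mean_bounds = to_uint8(members.mean(axis=0))      # ensemble mean
```

A Zarr-based variant would skip the image encoding entirely and write the float fields as chunked arrays, which is the motivation for the alternative approach mentioned above.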


Figure 1: (a.) The spatial distribution of 2-meter temperature (in Kelvin) over the CARRA2 domain for the control member (CNTL). Additionally, (b.-j.) the 2-meter temperature differences between the Control (CNTL) and each ensemble member for the analysis time valid on January 20, 2022, at 06:00 UTC.
Figure 2: A schematic representation of the HARMONIE-AROME ensemble data assimilation and forecasting system. The black curve schematically illustrates the generation of ensemble forecasts.

Choice of parameters

Selecting the right parameters for uncertainty quantification (UQ) in machine learning (ML) models is vital for achieving both precision and efficiency. The choice of parameters, such as grid resolution, terrain characteristics, time intervals, and variable heights, greatly influences the accuracy of predictions and the feasibility of computations. Optimizing these parameters is therefore crucial for the computational affordability of the approach. We illustrate the uncertainty estimation methodology focusing on temperature fields first, and then extend it to other variables: the meridional and zonal wind components, specific humidity, and surface pressure. Our analysis utilizes a subset of data from the pan-Arctic region, with parameters selected for various pressure levels, including 500 hPa, 700 hPa, 900 hPa, and 950 hPa, as well as near the surface. Figure 3 illustrates the CARRA2 ensemble spread (standard deviation) of temperature (in Kelvin) at four levels (500, 700, 900, and 950 hPa) for the 10 ensemble members, valid at 06 UTC on January 20, 2022. The figure indicates that temperature uncertainty is significantly higher near the surface and decreases with altitude. In a preliminary study, it was observed that above 500 hPa the uncertainty is very low, approaching zero, across the entire CARRA2 domain. This observation is the reason for selecting 500 hPa as the highest level.

For our estimation, we chose to focus on a limited set of output variables, including 2-meter temperature and the 10-meter zonal and meridional wind. Our motivation is that, in the CARRA2 domain, near-surface variables such as 2-meter temperature and 10-meter zonal and meridional winds are essential for understanding near-surface weather and atmospheric dynamics. The 2-meter temperature is important for tracking Arctic warming, cold air outbreaks, and the surface energy balance. In polar regions, temperature inversions are common during winter and can trap cold air at the surface, significantly influencing the formation of sea ice. These inversions can enhance ice growth by maintaining lower surface temperatures. Additionally, they may reduce wind mixing, which affects the stability and thickness of the ice (Pavelsky et al., 2011). Conversely, summer temperature trends are crucial for assessing the impacts of ice melting on weather and climate. As temperatures rise in summer, sea ice begins to melt, and it is essential to understand how this melting influences weather patterns such as storm tracks and heatwaves. The 10-meter winds determine surface wind patterns, which influence sea ice movement, coastal storms, and polar lows. They are also crucial for moisture transport, affecting precipitation and snowfall distribution in the Arctic.

Furthermore, strong winds facilitate heat exchange between the ocean and atmosphere, impacting Arctic cyclones. Collectively, these variables offer vital insights into climate variability, extreme weather events, and long-term changes in the Arctic. 


Figure 3: CARRA2 ensemble spread (standard deviation) in terms of temperature (in Kelvin) at four different levels (500, 700, 900, and 950 hPa) for the 10 ensemble members valid on January 20, 2022, and for 06 UTC analysis cycles.
The ensemble spread (Figure 4) and ensemble mean (Figure 5) of 2-meter temperature (in Kelvin) are shown for the 10 ensemble members of the ERA5-EDA reanalysis data (left panel) and the CARRA2 ensemble members (right panel), valid on 20 January 2022 for all four analysis cycles. The figures effectively illustrate the flow-dependent uncertainty for both ERA5-EDA and CARRA2. Furthermore, it is well established that model orography significantly influences the uncertainty among the ensemble members, and its impact is clearly evident in both datasets. Notably, over Greenland, several regions of large spread are distinctly observable in both datasets. The mean 2-meter temperature flow between the ERA5-EDA and CARRA2 ensemble datasets is very similar, strongly supporting the underlying correlation between the lower-resolution (ERA5-EDA) and higher-resolution (CARRA2, Figure 5) datasets.



Figure 4: The ensemble spread of 2-meter temperature (in Kelvin) for the 10 ensemble members of the ERA5-EDA reanalysis data (left panel) and the CARRA2 ensemble members (right panel), valid on 20 January 2022 for all four analysis cycles.

Table 1: List of the input near surface parameters for uncertainty estimation in model space using the machine learning method for both ERA5 and CARRA2 ensembles.

| Variables | Level | CARRA2 (2.5 km) | ERA5 (ensemble) |
| --- | --- | --- | --- |
| 2 m Temperature (in Kelvin) | Near surface | Y = 2869; X = 2869 | Y = 114; X = 130 |
| Zonal Wind (10 m, u, in m/s) | Near surface | Y = 2869; X = 2869 | Y = 114; X = 130 |
| Meridional Wind (10 m, v, in m/s) | Near surface | Y = 2869; X = 2869 | Y = 114; X = 130 |
| Surface Pressure (Pa) | Surface | Y = 2869; X = 2869 | Y = 114; X = 130 |

Table 1 displays the input parameters near the surface for uncertainty estimation in model space using the ML approach for both the ERA5 ensemble and CARRA2 ensemble. Precipitation is excluded as a variable in the diffusion-based ML method because the approach requires gridded spread data among ensemble members. When there is no precipitation or a null value for any ensemble member, the spread becomes excessively high and unrealistic. As a result, the ML model is likely to perform poorly in most cases within the CARRA2 domain. 


Figure 5: Same as Figure 4, but for the ensemble mean of 2m-temperature (in Kelvin). The mean temperature flow between the ERA5-EDA and CARRA2 ensemble datasets is very similar, strongly supporting predictability and the underlying correlation between the lower resolution (ERA5-EDA) and higher-resolution (CARRA2) datasets.

We calculated the ensemble spread (standard deviation, SD) and ensemble mean across the ensemble members of ERA5-EDA and the CARRA2 ensemble separately and stored them in the form of images (PNG format).
The ensemble mean (μ) and ensemble standard deviation (SD) are determined using the formulas

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i ; \qquad SD = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i-\mu)^2}$$

where the ensemble mean μ is calculated as the average of all ensemble members, N is the number of ensemble members (in our case, N = 10), and xi refers to the value of the i-th ensemble member. The mean μ reflects the central tendency of the predictions. The SD quantifies the dispersion among the ensemble members, reflecting the level of uncertainty in the prediction. A larger SD indicates increased uncertainty, whereas a smaller SD implies greater confidence in the prediction.
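In code, these two statistics simply reduce the member dimension of the stacked ensemble. A minimal numpy sketch (the 10-member, 64×64 2-meter temperature ensemble here is synthetic toy data):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10                                          # ensemble size (CARRA2 / ERA5-EDA)
members = rng.normal(270.0, 1.5, (N, 64, 64))   # toy 2 m temperature members [K]

mu = members.mean(axis=0)          # ensemble mean at every grid point
sd = members.std(axis=0, ddof=1)   # ensemble spread: unbiased SD (divides by N-1)
```

Note `ddof=1`, matching the N-1 denominator in the formula above.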

Uncertainty Quantification: Machine Learning Approach

In recent years, Deep Neural Networks (DNNs) have gained prominence in weather and climate predictions, particularly in uncertainty assessment. Understanding uncertainty in reanalysis products is critical for scientific analysis and decision-making. While DNN approaches have demonstrated improvements over traditional statistical post-processing methods, their lack of interpretability remains a key challenge. This opacity undermines scientific transparency, trust, and the validation of models (Rasp et al., 2018). Without a clear understanding of the mechanisms by which these models generate predictions, researchers cannot fully assess their physical plausibility or identify potential sources of error. DNNs primarily capture statistical correlations rather than the physically consistent relationships essential for robust climate analysis and forecasting.

Various DNN architectures are tailored to specific tasks, and ongoing research aims to enhance their effectiveness for atmospheric and climate-related applications. Among the most promising advancements are diffusion-based machine learning models and Graph Neural Networks (GNNs), each possessing unique advantages. GNNs represent data through nodes and edges, rendering them particularly suitable for spatially structured yet irregular datasets, such as those derived from weather stations, topography-aware grids, or unstructured model meshes. They are adept at capturing spatial dependencies and temporal interactions, thereby serving as powerful tools for process modeling and forecasting. A recent advancement in the operational medium-range weather forecasting system at the European Centre for Medium-Range Weather Forecasts (ECMWF) is the Artificial Intelligence Forecasting System (AIFS, Lang et al., 2024), a data-driven model developed to enhance forecast accuracy. AIFS utilizes a GNN-based encoder-decoder architecture integrated with a sliding-window transformer processor. The model is trained on ECMWF's ERA5 reanalysis dataset as well as on the operational NWP system.

Furthermore, ECMWF has introduced GraphDOP, a machine learning model that relies exclusively on observational data to produce skillful medium-range weather forecasts (Alexe et al., 2024). Unlike traditional approaches, GraphDOP is trained and initialized directly from observational datasets, thereby providing an alternative forecasting methodology that does not depend on NWP outputs. Similar to AIFS, GraphDOP employs a GNN-based encoder-decoder architecture. However, GNNs generally exhibit limitations in uncertainty representation, and their spatial resolution is inherently constrained by the design and connectivity of the underlying graph.

Conversely, diffusion-based models excel in producing ensemble-like predictions and reconstructing uncertainty fields by learning flow-dependent uncertainty structures. Their probabilistic framework enables the generation of fine-scale, spatially coherent, and physically realistic fields while capturing complex, high-dimensional dependencies. Nonetheless, these models continue to face challenges in explicitly modeling temporal dynamics. Uncertainty quantification in diffusion-based machine learning methods depends on the model's architecture, the sources of uncertainty, and the information leveraged. Effectively utilizing DNNs for uncertainty estimation is further complicated by the need to capture error structures within reanalysis data. Recent advancements in DNN architectures, such as the Denoising Diffusion Probabilistic Model (DDPM), enable the processing of large datasets, thereby improving predictive performance. To address this challenge, we developed a diffusion-based machine learning framework (DDPM-ML) designed to learn from uncertainty and to reconstruct the uncertainty relationship between two ensemble datasets with differing spatial resolutions: the coarser-resolution ERA5-EDA and the higher-resolution CARRA2.

Denoising Diffusion Probabilistic Model (DDPM-ML)

Denoising Diffusion Probabilistic Models (DDPM-ML) are a class of generative model that has attracted considerable interest for its capability to produce high-quality images and model intricate data distributions (Jiaming et al., 2020; Wolleb et al., 2021; Nichol and Dhariwal, 2021; Yang et al., 2022; Watt and Mansfield, 2024; Andrae et al., 2024). DDPM-ML uses a step-by-step process to generate samples that follow the distribution from which the real data is drawn. When paired with Convolutional Neural Networks (CNNs; Yamashita et al., 2018), DDPM-ML takes advantage of CNNs' proficiency in effectively capturing spatial and hierarchical features, which enhances both the quality and efficiency of the generation process. DDPM-ML draws inspiration from non-equilibrium thermodynamics and Markovian diffusion processes. The fundamental concept of DDPM-ML involves gradually introducing noise to the data in a forward process, followed by training a neural network to reverse this process through a denoising function. The model operates within a probabilistic framework, in which data is incrementally degraded with Gaussian noise across several timesteps, and a trained model learns to progressively denoise it, ultimately reconstructing the original distribution.


Figure 6: A visual representation of a Diffusion Model; x0 represents true data observations (images), xT represents pure Gaussian noise, xt is an intermediate noisy version of x0, and βt is a predefined variance schedule. Each q(xt|xt-1) is modeled as a Gaussian distribution that uses the output of the previous state as its mean.

The forward process or the noise addition process follows a Markov chain:

$$q(x_t|x_{t-1}) = \mathcal{N}\left(x_t ; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\right)$$

where xt is the noisy version of the data at timestep t, and βt is a predefined variance schedule.
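Because the Gaussian steps compose, xt can also be sampled directly from x0 using the cumulative product ᾱt of the αs = 1 − βs. A minimal numpy sketch of this forward noising (linear schedule; the 64×64 standardised field is toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear variance schedule beta_t
alphas = 1.0 - betas
abar = np.cumprod(alphas)            # cumulative product alpha-bar_t

x0 = rng.standard_normal((64, 64))   # a standardised toy field

def q_sample(x0, t):
    # Closed form of the forward chain:
    # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps

x_half = q_sample(x0, 500)    # partially noised field
x_end = q_sample(x0, T - 1)   # essentially pure Gaussian noise
```

By the final step ᾱT is tiny, so the sample is dominated by the noise term, matching the xT of Figure 6.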
The reverse process (denoising to recover the original data) is modeled using a neural network parameterized by θ:

$$p_\theta (x_{t-1} | x_t) = \mathcal{N}\left(x_{t-1} ; \mu_\theta (x_t , t), \Sigma_\theta (x_t, t)\right)$$

where the mean μθ and variance Σθ are learned using deep learning techniques.
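A sketch of one reverse step under the common ε-parameterization, where the network predicts the added noise and μθ follows in closed form (a zero-returning placeholder stands in for the trained network here, and Σθ is fixed to βtI for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

def eps_model(x_t, t):
    # Placeholder for the trained noise predictor eps_theta(x_t, t);
    # in the real system this would be a conditional U-Net.
    return np.zeros_like(x_t)

def p_sample(x_t, t):
    # One reverse (denoising) step: mu_theta is recovered from the
    # predicted noise, with the variance fixed to sigma_t^2 = beta_t.
    eps = eps_model(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

x = rng.standard_normal((64, 64))    # start from pure noise x_T
for t in reversed(range(T)):
    x = p_sample(x, t)
```

With a trained ε-predictor in place of the placeholder, the loop converts pure noise into a sample from the learned data distribution.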

DDPM-ML combined with CNNs has revolutionized generative modeling, resulting in significant advancements in image generation, restoration, and enhancement (Shang et al., 2024). CNNs play a crucial role in DDPM-ML, particularly during the denoising process, where they progressively refine noisy inputs into cleaner outputs at each stage of the diffusion. Their strength lies in capturing local dependencies and spatial structures within images, making them highly effective for image-related tasks. A visual representation of a Diffusion Model is presented in Figure 6, where x0 represents true images and xT represents pure Gaussian noise. Noise is introduced using four different schedules: sigmoid, cosine, linear, and quadratic, as illustrated in Figure 7. Figure 8 further demonstrates both the forward and reverse (denoising) processes using a single image. The reverse process remains approximately Gaussian if the diffusion steps are sufficiently small. Therefore, we utilize 1000 or more diffusion steps for both the forward and reverse processes to ensure effective denoising and enhanced model performance. These additional steps help to preserve fine image details while maintaining stability in the reconstruction process.
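The four variance schedules can be generated as follows (a sketch; the exact β range and the sigmoid steepness are assumptions, following common DDPM practice with β between 1e-4 and 0.02 and the cosine schedule of Nichol and Dhariwal, 2021):

```python
import numpy as np

T = 1000
t = np.linspace(0.0, 1.0, T)
beta_min, beta_max = 1e-4, 0.02   # assumed range, standard in the DDPM literature

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

schedules = {
    "linear": beta_min + t * (beta_max - beta_min),
    "quadratic": (np.sqrt(beta_min) + t * (np.sqrt(beta_max) - np.sqrt(beta_min))) ** 2,
    "sigmoid": beta_min + sigmoid(12.0 * t - 6.0) * (beta_max - beta_min),
}

# Cosine schedule: betas are derived from a cosine-shaped cumulative alpha-bar
s = 0.008
steps = np.arange(T + 1)
abar = np.cos(((steps / T) + s) / (1.0 + s) * np.pi / 2.0) ** 2
abar = abar / abar[0]
schedules["cosine"] = np.clip(1.0 - abar[1:] / abar[:-1], 0.0, 0.999)
```

The choice of schedule controls how quickly the signal is destroyed in Figure 7: the cosine schedule adds noise more gradually at early steps than the linear one.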


Figure 7: A visual representation of the forward diffusion process. Here, x represents the true data (e.g., images), x1000 represents pure Gaussian noise, and x200, x400, x800, and x1000 are the noisy versions of x after noise is added at steps t = 200, 400, 800, and 1000. The noise is applied using four different schedules: sigmoid, cosine, linear, and quadratic.




Figure 8: A visual representation of a Variational Diffusion Model; x0 represents true data (images), xT represents Gaussian noise, and xt is an intermediate noisy version of x0. Each q(xt|xt-1) is modeled as a Gaussian distribution that uses the output of the previous state as its mean. A parameterizable prediction model is used to estimate the noise ε. The reverse process will also be (approximately) Gaussian if the diffusion steps are small enough. For that reason, we use 1000 or more diffusion steps (t) in both the forward and the reverse process.

The U-Net architecture (Ronneberger et al., 2015), commonly used for denoising, features an encoder-decoder design with skip connections. The encoder uses CNN layers to extract high-level features from noisy images, while the decoder reconstructs clean images through transposed convolutions. Skip connections are vital for maintaining fine details and spatial information, resulting in high-quality images. Techniques like dilated convolutions, attention mechanisms, and hierarchical structures further enhance the ability of DDPM-ML to produce intricate details and textures, making it effective for super-resolution, image inpainting, and medical imaging. However, despite these remarkable capabilities, DDPM-ML encounters several challenges that limit its broader use. The sequential nature of the denoising process renders DDPM-ML computationally intensive. Training DDPM-ML necessitates careful adjustment of noise schedules and network parameters, complicating the optimization process. Furthermore, handling high-resolution images requires significant memory, which can restrict use on limited resources. These challenges highlight the necessity for additional research aimed at improving the efficiency and scalability of DDPM-ML. A well-structured model architecture can deliver strong performance with moderate image sizes. Many image processing models typically utilize input sizes such as 224×224 or 256×256 pixels to maintain efficiency while retaining important details. Convolutional layers are effective at extracting features from smaller images, enabling deep networks to process information in a hierarchical manner. Furthermore, even minor changes in input image size can greatly affect training time, highlighting the significance of selecting an appropriate resolution.
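The encoder-decoder data flow with skip connections can be illustrated with a shape-level toy in plain numpy (mean-pooling and nearest-neighbour upsampling stand in for the learned convolutions, and averaging stands in for the concatenation-plus-convolution fusion of a real U-Net):

```python
import numpy as np

def pool2(x):
    # 2x2 mean-pooling: halves the spatial dimensions (encoder downsampling)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(x):
    # Nearest-neighbour upsampling: doubles the spatial dimensions (decoder)
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_unet_pass(x, depth=3):
    # Encoder: keep each resolution's feature map for its skip connection
    skips = []
    for _ in range(depth):
        skips.append(x)
        x = pool2(x)
    # Decoder: upsample and fuse with the matching skip connection
    for skip in reversed(skips):
        x = 0.5 * (up2(x) + skip)
    return x

field = np.random.default_rng(0).random((64, 64))
out = toy_unet_pass(field)
```

The skips carry fine spatial detail around the bottleneck, which is why the decoder output retains the input resolution rather than only the coarse bottleneck information.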

Addressing the limitations and expanding the capabilities of DDPM-ML in the CARRA2 domain, which contains a substantial number of grid points (2869×2869), presents a challenge when using the images directly. To address this issue, the 2869×2869 pixel dimensions of CARRA2 can be reduced directly to the 114×130 grid of ERA5-EDA. Instead of manually downscaling the CARRA2 spatial resolution from its original grid size to the lower resolution of ERA5-EDA (Figure 9), we employ a U-Net-based super-resolution architecture that is fully integrated into the model's training process. This methodology facilitates dynamic, data-driven resampling within the super-resolution framework itself, thereby maintaining fine-scale spatial details that are typically lost through traditional downscaling techniques. The proposed model encompasses all necessary components for both the training phase and the operational implementation of diffusion-based super-resolution systems conditioned on low-resolution ERA5-EDA input fields. The principal advantage of this approach is its capacity to generate high-resolution outputs while minimizing the loss of information. Additional methodological specifics concerning the downscaling strategy can be found in Appendix A3.3, which describes the DDPM-ML implementation.
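For reference, the "manual" downscaling that the integrated approach replaces amounts to averaging each coarse ERA5-EDA cell over the block of CARRA2 pixels it covers. A sketch (assuming a uniform index mapping and ignoring projection differences between the two grids):

```python
import numpy as np

def coarsen(field, ny, nx):
    # Average each coarse cell over the block of high-resolution pixels
    # it covers (blocks are nearly, but not exactly, equal in size).
    h, w = field.shape
    ri = np.linspace(0, h, ny + 1).astype(int)
    ci = np.linspace(0, w, nx + 1).astype(int)
    out = np.empty((ny, nx))
    for i in range(ny):
        for j in range(nx):
            out[i, j] = field[ri[i]:ri[i + 1], ci[j]:ci[j + 1]].mean()
    return out

rng = np.random.default_rng(0)
hi_res = rng.random((2869, 2869))    # CARRA2-sized field
lo_res = coarsen(hi_res, 114, 130)   # ERA5-EDA-sized field
```

Block averaging inevitably discards sub-block variability, which is precisely the fine-scale information the integrated U-Net-based resampling is designed to preserve.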


Figure 9: A visual depiction illustrating the conversion of various pixel sizes from the original grid size of (2869 x 2869) to (256 x 256) and to (64 x 64) pixel sizes for the uncertainty images of CARRA2 (shown in the top three panels) and ERA5-EDA (shown in the bottom three panels).

Reducing the number of diffusion steps without compromising image quality could significantly enhance computational efficiency. Integrating transformer-based architectures may further enhance DDPM-ML by capturing complex patterns through global context modeling. Additionally, researchers are exploring learned noise schedules and adversarial training to streamline the training process and reduce memory requirements. Addressing these challenges will make DDPM-ML more practical and accessible, bridging probabilistic modeling with deep learning-based feature extraction and expanding their role in generative modeling.
Semantic segmentation assigns a class to each pixel in an image, providing a more detailed analysis than object detection, which relies on bounding boxes. This technique is widely utilized in fields such as medical imaging to identify organs or tumors, as well as in satellite analysis to classify land types, water bodies, and urban areas (Wolleb et al., 2021).

Proposed methodology using diffusion model

This work proposes the use of the DDPM-ML deep learning approach for uncertainty estimation, which capitalizes on ensemble information, including the spread (standard deviation) and the ensemble mean, to explore the potential relationship between the CARRA2 and ERA5 ensemble datasets. To achieve this, we propose a novel semantic segmentation method based on diffusion models (Wolleb et al., 2021). DDPM-ML utilizes Convolutional Neural Networks (CNNs) to learn the process of removing noise from data in multiple steps. The CNN effectively captures spatial patterns, making DDPM-ML well-suited for generating high-quality CARRA2 and ERA5-EDA data by progressively refining them through learned noise-reversing transformations. The pixel-wise segmentation-based DDPM-ML method generates uncertainty maps for the segmentation mask from the standard deviation (SD) and ensemble mean at the pixel level from the ERA5 ensemble. Each pixel is assigned to a predefined class, which corresponds to the uncertainty (SD) derived from the CARRA2 ensemble members during the training of the DDPM-ML model. A widely used approach in image segmentation involves applying a U-Net model to predict segmentation masks for the ERA5 uncertainty map. Additionally, by refining the training and sampling methods, diffusion models can effectively segment structures in uncertain images. To generate a segmentation specific to an image, the model is trained on ground truth segmentation data while using the image as a reference throughout both the training and sampling processes. Due to the stochastic nature of the sampling process, multiple segmentation masks can be generated, facilitating the computation of pixel-wise uncertainty maps. This also allows for an implicit ensemble of segmentations, thereby enhancing overall segmentation performance. Our approach with segmentation models guarantees pixel-wise accurate segmentations while delivering detailed uncertainty maps.
Our network utilizes the PyTorch Lightning framework for the uncertainty quantification model. PyTorch Lightning serves as a high-level interface for the PyTorch library, enabling distributed training on Graphics processing units (GPUs) and tensor processing units (TPUs), automatic mixed-precision, and various optimizations, among other features.

Training and Validation details

To generate image-specific segmentation, we will train the model using ground truth segmentation while incorporating the image as a prior during both the training phase and each step of the sampling process. By employing a stochastic sampling approach, we will generate a distribution of segmentation masks, allowing for a more robust and diverse representation of the segmentation space. This method enables the computation of pixel-wise uncertainty maps, providing valuable insights into the model's confidence across different regions within the CARRA2 domain. Additionally, it facilitates an implicit ensemble of segmentations, thereby enhancing overall segmentation performance. By capturing multiple plausible segmentations, the model becomes more resilient to noise and variability in the input data, ultimately improving its generalization and reliability in CARRA2 uncertainty applications.
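The pixel-wise uncertainty map described above is obtained by drawing several segmentation masks from the stochastic sampler and computing per-pixel statistics. A minimal numpy sketch follows; the reverse-diffusion sampler is mocked with random draws, and `pixelwise_uncertainty` is a hypothetical helper, not part of the operational code:

```python
import numpy as np

def pixelwise_uncertainty(samples: np.ndarray):
    """Given N stochastically sampled fields of shape (N, H, W),
    return the per-pixel ensemble mean and standard deviation."""
    mean = samples.mean(axis=0)        # implicit-ensemble estimate
    sd = samples.std(axis=0, ddof=0)   # pixel-wise uncertainty map
    return mean, sd

# Mock sampler: stands in for repeated reverse-diffusion runs.
rng = np.random.default_rng(0)
samples = rng.normal(loc=270.0, scale=1.5, size=(10, 64, 64))  # e.g. T2m in K
mean_map, sd_map = pixelwise_uncertainty(samples)
print(mean_map.shape, sd_map.shape)  # (64, 64) (64, 64)
```

Each additional sampling run enlarges the implicit ensemble, so the per-pixel standard deviation becomes a progressively more stable estimate of the model's confidence.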

A schematic representation of the training flow of the Diffusion Model is shown in Figure 10. In supervised diffusion, the input data comprises the ERA5-EDA standard deviation and ensemble mean for each variable separately (e.g. SD and ensemble mean 2-meter temperature). The model is conditioned on specific labels or attributes, such as class labels and the standard deviation from the CARRA2 ensemble, to guide the generation process. Although the ML model begins its work in the uncertainty space derived from both the ERA5 and CARRA2 ensembles, during training, it establishes a mapping between the uncertainty of ERA5 and that of the CARRA2 ensemble. The trained model and its weighting functions are subsequently stored. Another schematic representation (Figure 11) illustrates the application of a pre-trained diffusion model in machine learning for inference applications.


Figure 10: Schematic representation of the training data flow of the Diffusion model. In supervised diffusion, the model is conditioned on specific labels or attributes (class labels) to guide the generation process. During the ML training period, the model learns a relationship (mapping) between the ERA5 uncertainty and the CARRA2 ensemble uncertainty. Uncertainty is quantified in the form of ensemble spread, measured through the standard deviation (SD). We store the model and the weighting functions.

Figure 11: Schematic representation of utilizing a pre-trained diffusion model in ML for outcome prediction or generation.

To explore the significance of DDPM-ML, we utilized the code from Wolleb et al. (2021). They implemented the diffusion model in a supervised manner on the BRATS2020 dataset to generate uncertainty maps for brain tumor segmentation. Additionally, we developed a toy model and trained it in an unsupervised fashion using 100 random images of fruit, without providing any labels or prior knowledge about the objects. Remarkably, the model was able to identify primary segmentations present in the input data, successfully detecting the shapes of apples and bananas. The results of this unsupervised toy model are illustrated in Supplementary Figure S1.

In DDPM-ML, parameters such as noise schedules, learning rates, and network architectures play a crucial role in determining the quality of generated samples and convergence stability. Proper tuning of these parameters ensures effective denoising and realistic data generation.

Practical Implementation and output to users

Thanks to its exceptionally high grid resolution, the CARRA2 reanalysis provides a climate reanalysis of higher fidelity than coarser-resolution products, in particular with respect to orographic effects. Due to resource constraints, the reanalysis is conducted only in deterministic mode, meaning that no uncertainty information is directly available in the form of a perturbed ensemble, as is the case with the ERA5 dataset, which is accompanied by a 10-member ensemble at half the resolution (Hersbach et al., 2020).

In this investigation, we explored an innovative approach to measuring the uncertainty associated with the CARRA2 datasets by linking ensembles from ERA5 (62 km, 10 members) with mini-ensembles from CARRA2. The latter were generated from month-long ensemble runs linked to the derivation of background error covariances. Although the amount of high-resolution CARRA2 ensemble data is limited, the preliminary investigation discussed above has shown promising indications that, through machine learning, a connection between the ERA5 and CARRA2 ensembles can be established (see Figures 4 and 5 and the related discussions). This connection can be utilized to project ERA5 ensemble information onto the CARRA2 dataset, providing a proxy for uncertainty information. This may serve as a valuable reference for potential users of the CARRA2 dataset, even though a comprehensive uncertainty quantification based on the CARRA2 ensemble is not yet available.

Upon successfully modeling uncertainty information with an AI model using the aforementioned approach, various forms of uncertainty quantification related to the deterministic CARRA2 reanalysis can be generated. In principle, based on the ML model's outcomes, a flow-dependent uncertainty distribution map for selected ECVs can be created for each deterministic CARRA2 reanalysis time slot by downscaling the corresponding ERA5 ensemble data and the deterministic CARRA2 field to the CARRA2 high-resolution orography using the AI-based methodology described above. Even though achieving this for the entire duration of the 40-year CARRA2 reanalysis is challenging due to logistical constraints, we provide a one-year demonstration of the product. For the demonstration, we generate uncertainty estimates for the full year of 2019 for the 2-m temperature spread, expressed as standard deviations. These uncertainty estimates are provided in NetCDF format, along with the corresponding plots in PNG format, for the entire pan-Arctic domain. We believe that, in the absence of direct EPS datasets, this will be a useful and meaningful step in providing uncertainty information to users of CARRA2.



Figure 12: The climatology for the year 2022 of uncertainty estimation products generated using unsupervised learning with ERA5-EDA data. The left panel displays the ensemble mean of the 2-meter temperature valid on June 1, 2022, at 00 UTC, which was utilized for training the machine learning model. The right side features a six-panel display of samples generated by the machine learning model during the training process.

Figure 12 presents the climatology for 2022 of uncertainty estimation products generated using unsupervised learning with ERA5-EDA data. The uncertainty samples produced by the machine learning model during training are visualized in this figure. Notably, most of the samples generated by the model show less variability over polar regions, while uncertainty is greater over land areas, particularly over Greenland and the northern parts of Canada. It is important to emphasize that this sample was created using a limited amount of coarser-resolution data from the ERA5-EDA dataset, demonstrating both the effectiveness and elegance of the methodology. During the training period, the diffusion model effectively captures the uncertainty over the CARRA2 domain. This brief demonstration highlights how machine learning models can uncover hidden relationships between ensembles for uncertainty quantification.


Figure 13: Demonstration of flow-dependent uncertainty estimation products for a selection of Essential Climate Variables (ECVs) (right two columns). Uncertainty quantification was carried out during the validation phases of the DDPM-ML model for the 2-meter temperature. The first column shows the uncertainty estimated from the ERA5-EDA input data, while the second column illustrates the uncertainty derived from the CARRA2 ensemble. The third column represents the minimum uncertainty associated with the lowest weighted MSE, and the fourth column displays the maximum uncertainty among members with weighted MSE below 0.2, as generated by the DDPM-ML model. All data correspond to January 7, 2022, and include four analysis times: 00, 06, 12, and 18 UTC.

Uncertainty quantification is more pronounced in surface variables (Table 1), primarily due to the influence of surface orographic effects. Figure 13 presents a demonstration of the products that may be available to users; for each selected ECV shown in Table 1, the deterministic CARRA2 prediction is accompanied by an uncertainty distribution in the form of the DDPM-ML predicted standard deviation field (right two columns). The example shows the uncertainty estimates for four time periods valid on 7 January 2022. However, since the uncertainty distributions depend on the flow, these estimates will be available every six hours throughout the selected demonstration period. In addition, the model can reconstruct the physical field of the reference variable across the domain, closely matching the corresponding CARRA2 fields (Appendix 2), further validating the proposed methodology. Additional results, along with detailed descriptions used to cross-validate the model, are provided in Appendix 2. A comprehensive description of the large dataset is presented in Section A2.1 (Appendix 2). We also explain the organization of the full dataset used for both training and validation of the diffusion model in Section A2.2 (Appendix 2). Further validation results are included in Section A2.3 (Appendix 2). Additionally, the limitations, challenges, and future perspectives are discussed in the Summary and Conclusion section (A2.4, Appendix 2). Finally, a technical report and details regarding the installation of the DDPM-ML model for CARRA2 on the ECMWF high-performance computing system (ATOS) are provided in Appendix 3.

In summary, we have developed and validated a diffusion-based machine learning framework (DDPM-ML) aimed at learning from uncertainty and reconstructing the uncertainty relationship between two ensemble datasets of differing resolutions: the coarser-resolution ERA5-EDA and the higher-resolution CARRA2. The proposed model successfully generates high-resolution gridded uncertainty fields that accurately capture flow-dependent and orography-related uncertainty patterns, exhibiting strong spatial coherence. These findings substantiate the robustness and reliability of the DDPM-ML approach in effectively representing realistic uncertainty structures.

The links to the 2019 demonstration uncertainty estimate dataset, along with short guidance on its practical use, are found here: Copernicus pan-Arctic Regional Reanalysis (CARRA2): User guide for the uncertainty estimation demonstration dataset.

References

Alexe, M., Boucher, E., Lean, P., Pinnington, E., Laloyaux, P., McNally, A., et al. (2024). GraphDOP: Towards skilful data-driven medium-range weather forecasts learnt and initialised directly from observations. https://doi.org/10.48550/arXiv.2412.15687

Andrae, M., Landelius, T., Oskarsson, J., & Lindsten, F. (2024). Continuous Ensemble Weather Forecasting with Diffusion Models. https://arxiv.org/abs/2410.05431

Bojarova, J. et al. (2020). Uncertainty estimation method. C3S deliverable report C3S_D322_Lot2.1.1.2-202002. Copernicus Arctic Regional Reanalysis (CARRA): Uncertainty estimation method

Hersbach, H., Bell, B., Berrisford, P., et al. (2020). The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146, 1999–2049. https://doi.org/10.1002/qj.3803

Sasaki, H., Willcocks, C. G., & Breckon, T. P. (2021). UNIT-DDPM: Unpaired image translation with denoising diffusion probabilistic models. arXiv preprint, https://arxiv.org/abs/2104.05358

Song, J., Meng, C., & Ermon, S. (2020). Denoising diffusion implicit models. arXiv preprint, https://arxiv.org/abs/2010.02502

Lang, S., Alexe, M., Chantry, M., Dramsch, J., Pinault, F., Raoult, B., Clare, M. C. A., Lessig, C., Maier‑Gerber, M., Magnusson, L., Ben Bouallègue, Z., Prieto Nemesio, A., Dueben, P. D., Brown, A., Pappenberger, F., & Rabier, F. (2024, May). AIFS – ECMWF’s data‑driven forecasting system [Preprint]. https://arxiv.org/pdf/2406.01465

Mitros, J., & Mac Namee, B. (2019). On the validity of Bayesian neural networks for uncertainty estimation. arXiv preprint, https://arxiv.org/abs/1912.01530

Nichol, A., & Dhariwal, P. (2021). Improved denoising diffusion probabilistic models. arXiv preprint, https://arxiv.org/abs/2102.09672

Olesen, M. et al. (2018). Robustness of high-resolution regional climate projections for Greenland: A method for uncertainty distillation. Climate Research, DOI:10.3354/cr01536

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, 234–241. Springer.

Pavelsky, T. M., Boé, J., Hall, A., & others. (2011). Atmospheric inversion strength over polar oceans in winter regulated by sea ice. Climate Dynamics, 36(5–6), 945–955. https://doi.org/10.1007/s00382-010-0756-8

Rasp, S., Pritchard, M. S., & Gentine, P. (2018). Deep learning to represent sub‑grid processes in climate models. Proceedings of the National Academy of Sciences of the United States of America, 115(39), 9684‑9689. https://doi.org/10.1073/pnas.1810286115

Shang, S., Shan, Z., Liu, G., & Zhang, J. (2024). ResDiff: Combining CNN and diffusion model for image super-resolution. arXiv preprint, https://doi.org/10.48550/arXiv.2303.08714

Watt, R. A., & Mansfield, L. A. (2024). Generative Diffusion-Based Downscaling for Climate. arXiv preprint, https://arxiv.org/abs/2404.17752

Wolleb, J., Sandkühler, R., Bieder, F., Valmaggia, P., & Cattin, P. C. (2021). Diffusion Models for Implicit Image Segmentation Ensembles. arXiv preprint, https://arxiv.org/abs/2112.03145

Yamashita, R., Nishio, M., Do, R. K. G., & Togashi, K. (2018). Convolutional neural networks: An overview and application in radiology. Insights into Imaging, 9, 611–629. https://doi.org/10.1007/s13244-018-0639-9

Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., & Yang, M. H. (2022). Diffusion Models: A Comprehensive Survey of Methods and Applications. arXiv preprint, https://doi.org/10.48550/arXiv.2209.00796

Yang, Xiaohua et al. (2020). C3S Arctic regional reanalysis – Full system documentation. C3S deliverable report C3S_D311_Lot2.1.2.2–201910. Copernicus Arctic Regional Reanalysis (CARRA): Full system documentation

Appendix 1: Supplementary Figures


Figure S1: Schematic representation of a toy model trained in an unsupervised manner using 100 random images of fruit, along with a few additional images, without providing any labels or prior knowledge about the objects. Remarkably, the model was able to identify the primary segmentations present in the input data, successfully detecting the shapes of apples and bananas (output).

Figure S2: Schematic representation of the training and sampling procedure of our method. At each step, anatomical information is incorporated by concatenating the brain MR images with the noisy data. The segmentation mask, along with examples of the generated mean and variance maps from 100 sampling runs, is presented. A diffusion model was employed in a supervised manner on the BRATS2020 dataset to produce uncertainty maps for brain tumor segmentation.

Appendix 2: Data preparation, training, validation results, summary and conclusion

This section outlines the data preparation process, the training methodology of the DDPM-ML model, and the validation procedures, along with the corresponding results for 2-meter temperature (K) and the u- and v-wind components (m/s). The Summary and Conclusions section provides a detailed discussion of the challenges and complexities associated with implementing this approach to generate uncertainty estimates for the CARRA2 dataset under limited computational resources. Furthermore, Appendix 3 includes the complete technical documentation and installation guidelines for deploying the DDPM-ML model for CARRA2 on the ECMWF ATOS system. This documentation covers the installation of the DDPM-ML code structure and the configuration of the Python environments. In addition, all relevant code and scripts are accessible via a GitHub repository (https://github.com/CARRA2/Uncertainty_Quantification/tree/Sep_2025).

The objective of this research effort was to develop a machine learning methodology for mapping and quantifying the uncertainty between two ensemble datasets, ERA5-EDA and the CARRA2 ensemble, within the context of limited resources and data availability. To achieve this, we present an innovative machine learning framework using the DDPM-ML method. As outlined in the main section, this methodology is specifically designed to accurately capture and characterize the uncertainty properties and distributional discrepancies between the two ensemble datasets, while functioning efficiently within the limitations of restricted resources. Given the constrained resources, stringent procedures were employed to prepare the extensive and complex input datasets, ensuring their suitability for both training and evaluation. The subsequent sections offer detailed descriptions of the data pre-processing protocols, the model training approach, and the validation methodology utilized. Furthermore, the validation results are comprehensively presented and analyzed to demonstrate the performance, accuracy, and robustness of the proposed DDPM-ML framework.

A2.1. Data Preparation

One of the major challenges in developing any machine learning model lies in the preparation of accurate and well-structured input data. This process is often time-consuming and computationally demanding, requiring meticulous handling to ensure that the data are suitable for both model training and validation. A schematic representation of the entire data preparation workflow, along with the training and validation processes of the proposed DDPM-ML model, is illustrated in Figure S3. For the CARRA2 ensemble dataset, the raw forecast output files (FA format) were processed through three key stages: (1) conversion from FA files to Network Common Data Form (NetCDF) format, (2) transformation from NetCDF to ZARR (https://zarr.dev) format, and (3) generation of corresponding image datasets. Each FA file is approximately 13 GB in size, and a total of 96 TB of data, covering a six-month simulation period, were processed through this workflow. After appropriate pre-processing and data reduction, approximately 1 TB of ZARR-formatted data and 25 GB of image-formatted data were retained as the final inputs for training the DDPM-ML model. Similarly, for the ERA5-EDA dataset, the downloaded GRIB (GRIdded Binary) files were initially converted to NetCDF format and subsequently transformed into ZARR and image formats using a parallel workflow. To ensure robust model development and performance evaluation, the available six-month ensemble dataset was divided into training and validation subsets. Approximately 95% of the data (covering the period from 16 January 2022 to 30 September 2022) were used for model training, while the remaining 5% (from 5 January to 15 January 2022) were reserved for model evaluation and validation.


Figure S3: Diagrammatic illustration of the input data preparation, model training, and evaluation processes.

A2.2. DDPM-ML Training

As described in Section 3.1, the DDPM-ML framework conceptualizes both the forward (noising) and reverse (denoising) processes as Markov chains. In the forward diffusion process, Gaussian noise is progressively added to the data in a stepwise linear manner over T = 4000 discrete time steps, as illustrated in Figure 7. This gradual corruption of data introduces stochasticity into the system, enabling the model to learn the underlying data distribution during the reverse denoising phase. However, this process is computationally demanding, since it necessitates performing all 4000 iterations for each input sample (e.g., per image), leading to a significant computational load. To investigate the impact of different noise scheduling strategies on model performance, four distinct noise schemes were evaluated: linear, cosine, quadratic, and sigmoid. Multiple experiments were conducted using each of these schedules to analyze their effect on the diffusion and reconstruction quality. The experimental outcomes revealed that all four noise schemes produced comparable results in terms of model accuracy and stability. Consequently, the linear noise schedule was selected for the final implementation, as it is both computationally efficient and widely adopted in existing DDPM-based studies (Figure 7).
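The four candidate noise schedules and the closed-form forward (noising) step can be sketched as follows. This is an illustrative numpy version written under common DDPM conventions (the cosine schedule follows Nichol & Dhariwal, 2021; the exact beta bounds and sigmoid scaling are assumptions), not the operational implementation:

```python
import numpy as np

T = 4000  # number of discrete diffusion steps, as used in DDPM-ML

def betas(schedule: str, T: int, lo: float = 1e-4, hi: float = 0.02):
    """Return the beta (noise-variance) sequence for a given schedule."""
    t = np.linspace(0.0, 1.0, T)
    if schedule == "linear":
        return lo + (hi - lo) * t
    if schedule == "quadratic":
        return (np.sqrt(lo) + (np.sqrt(hi) - np.sqrt(lo)) * t) ** 2
    if schedule == "sigmoid":
        return lo + (hi - lo) / (1.0 + np.exp(-(t * 12.0 - 6.0)))
    if schedule == "cosine":  # Nichol & Dhariwal (2021)
        s = 0.008
        f = np.cos((np.linspace(0, 1, T + 1) + s) / (1 + s) * np.pi / 2) ** 2
        return np.clip(1.0 - f[1:] / f[:-1], 0.0, 0.999)
    raise ValueError(schedule)

def q_sample(x0, t, alpha_bar, noise):
    """Closed-form forward process: draw x_t ~ q(x_t | x_0) in one step."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

b = betas("linear", T)              # the schedule chosen for DDPM-ML
alpha_bar = np.cumprod(1.0 - b)     # cumulative signal-retention factor
rng = np.random.default_rng(1)
x0 = rng.normal(size=(32, 32))      # mock normalized input field
x_noisy = q_sample(x0, 2000, alpha_bar, rng.normal(size=x0.shape))
```

Because `q_sample` is closed-form, any corruption level can be reached in a single step during training; the full 4000-step chain is only needed when sampling in reverse.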


Figure S4: Schematic illustration of the computation of the weighted MSE during the training process. At each diffusion step, the MSE is calculated between the model's predicted noise and the true Gaussian noise added at that stage. Specifically, mse_q0 represents the error when the data contain little or no noise, mse_q1 corresponds to the error at a low-noise or early diffusion stage, mse_q2 denotes the error at an intermediate noise level (moderate corruption), and mse_q3 indicates the error at a high noise level.

During the DDPM-ML training process, periodic checkpoints (model states) are saved to monitor model performance and track the gradual improvement in output quality. This strategy helps to identify the optimal training duration while ensuring efficient use of computational resources. At each training step, the weighted mean squared error (MSE) is calculated between the model's predicted noise and the true Gaussian noise added at that step (Figure S4). The MSE curve shows a sharp decrease within the first 1,000 steps, after which the error stabilizes and gradually converges. This behavior indicates that the model effectively learns to minimize noise prediction errors as training progresses. However, to produce high-quality outputs (samples), the model must be trained for up to 20,000 steps. In practice, the most accurate and stable results were typically achieved after around 10,000 steps. This extended training requirement is primarily due to the use of high-resolution, large-domain CARRA2 data, which demand longer training for proper convergence and accurate reconstruction. To preserve model progress, checkpoints are saved at 10,000, 12,000, 14,000 steps, and beyond. Each checkpoint file (e.g., model010000.pt) stores the trained model weights and biases, along with diffusion process hyperparameters used in both the forward and reverse processes. These files ensure reproducibility and allow further fine-tuning or analysis if required.
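The weighted-MSE diagnostic of Figure S4 can be reproduced schematically by binning the per-sample noise-prediction error into four noise-level quartiles (mse_q0 to mse_q3). The following numpy sketch mocks the noise predictor; `quartile_mse` is an illustrative helper, not the operational training code:

```python
import numpy as np

def quartile_mse(true_noise, pred_noise, t, T):
    """Bin per-sample MSE by diffusion step into four noise-level
    quartiles: mse_q0 (little/no noise) ... mse_q3 (high noise)."""
    per_sample = ((pred_noise - true_noise) ** 2).mean(axis=(1, 2))
    q = np.minimum(t * 4 // T, 3)  # quartile index 0..3 per sample
    return {f"mse_q{i}": float(per_sample[q == i].mean())
            for i in range(4) if np.any(q == i)}

T = 4000
rng = np.random.default_rng(2)
t = rng.integers(0, T, size=256)                  # random diffusion steps
eps = rng.normal(size=(256, 16, 16))              # true Gaussian noise
eps_hat = eps + 0.1 * rng.normal(size=eps.shape)  # mock model prediction
print(quartile_mse(eps, eps_hat, t, T))
```

Tracking these four quantities per checkpoint shows whether the model denoises equally well at all corruption levels, rather than only at the easy low-noise end.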


Figure S5: To evaluate model performance, uncertainty outputs were randomly generated from several trained models; these outputs are representative of the entire training period for 2-meter temperature. The panels illustrate the uncertainty derived from the ML models (referred to as sample or ensemble uncertainty quantification), mapped onto the CARRA2 grid resolution. A total of eight distinct ensembles (samples) were generated utilizing the outputs from all ML models.

For further evaluation of model performance during the training process, uncertainty outputs were randomly generated from several trained models, and these outputs are from the entire training period (Figure S5). The panels in Figure S5 illustrate the uncertainty estimated for 2-meter temperature by the DDPM-ML models, which are mapped onto the high-resolution CARRA2 grid. For this analysis, model outputs from 10,000 to 20,000 training steps were used, resulting in a total of eight samples, each randomly selected from 80 possible generated samples. Figure S5 clearly demonstrates that the machine learning model effectively identifies spatial regions with elevated uncertainty. The model successfully captures uncertainty patterns associated with orographic features, highlighting its capability to represent complex spatial variability. It should be noted that these samples illustrate the model's ability to generate high-resolution uncertainty maps that are representative of the entire training period. Furthermore, it is possible to produce multiple such samples (or ensembles) of uncertainty, randomly drawn from the full set of trained models.

A2.3. Validation Results

For validation (cross-validation) of DDPM-ML performance on uncertainty quantification, we use the same trained model together with the remaining 10 days of ERA5-EDA data and the CARRA2 deterministic forecast fields of 2-meter temperature from the reanalysis production. Note that for validation we did not use any uncertainty information (standard deviation, SD) from the CARRA2 ensemble. The trained model was tested on data outside the training period, specifically from 5 January 2022 to 15 January 2022. For 2-meter temperature, Figure S6 presents the best uncertainty map, generated from the grid-wise minimum RMSE and MSE values computed between the DDPM-ML outputs and the ERA5-EDA inputs. At each analysis time, 10 model realizations were produced, and the best-performing outputs were selected according to their minimum MSE values. The results clearly show that samples with errors below 0.2 represent more accurate predictions, while larger errors indicate outputs that deviate from reality. Importantly, the DDPM-ML model successfully demonstrates its capability to downscale the coarse-resolution ERA5-EDA uncertainty fields into high-resolution outputs consistent with the CARRA2 grid, effectively bridging the scale gap between the two ensemble datasets.


Figure S6: For the purpose of cross-validation of the DDPM-ML model, the trained model was evaluated using data from periods outside the training timeframe, specifically for the 2-meter temperature valid on January 6, 2022, at 00, 06, 12, and 18 UTC. The figure illustrates the optimal uncertainty map, which was generated based on the grid-wise minimum weighted RMSE and MSE values calculated between the DDPM-ML outputs and the ERA5-EDA input. At each analysis time, ten model realizations were produced, and the outputs demonstrating the best performance were selected according to their minimum MSE values.

Figure S7: Same as Figure S6, but for the actual field of 2-meter temperature valid on January 6, 2022, at 00, 06, 12, and 18 UTC, generated by the DDPM-ML model. It is important to note that an additional normalized similarity metric, defined as SCORE = 1/(1 + RMSE), has been incorporated to further evaluate the model's performance.

One of the key advantages of the DDPM-ML model is its ability to retain comprehensive information at each step during the training process. We leveraged this capability to extract the actual physical fields of various parameters, such as the 2-meter temperature, across the domain. This was possible because the model was trained using the ERA5-EDA mean field values and the CARRA2 deterministic forecasts as targets, allowing it to learn the true field representations from both datasets. Without this information, the DDPM-ML model would not be able to discern the origins of uncertainty or understand how these uncertainties are related to the actual 2-meter temperature flow. Figure S7 presents the actual fields derived from each DDPM-ML model output. The derivation method follows the same approach used for uncertainty quantification based on MSE and RMSE values (see Figure S6). Each model output exhibits spatial patterns closely resembling the 2-meter temperature field, with particularly clear representations when RMSE values are below 0.7 or MSE values are below 0.4. Furthermore, Figure S7 demonstrates that 9 out of 10 generated images closely match the 2-meter temperature flow (in K) across the domain for all four UTC times, supporting the DDPM-ML model's capability to reproduce realistic field structures. It is important to emphasize that the temperature fields produced by the ML model do not constitute forecasts. Instead, they represent synthesized information reflecting the uncertainty mapping from ERA5-EDA to CARRA2, along with the learned relationships between ERA5-EDA and the CARRA2 deterministic forecasts. This provides additional insight into the DDPM-ML model's performance, which will be discussed further in the following sections.
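The selection of the best realization by minimum MSE, together with the SCORE = 1/(1 + RMSE) similarity metric, can be illustrated with a small numpy sketch (field sizes and values are invented for the example; `select_best` is a hypothetical helper):

```python
import numpy as np

def score(rmse: float) -> float:
    """Normalized similarity metric: SCORE = 1 / (1 + RMSE)."""
    return 1.0 / (1.0 + rmse)

def select_best(reference, realizations):
    """Pick the realization with the lowest grid-wise MSE against the
    reference field, as done for the 10 samples per analysis time."""
    mse = ((realizations - reference) ** 2).mean(axis=(1, 2))
    best = int(np.argmin(mse))
    rmse = float(np.sqrt(mse[best]))
    return best, float(mse[best]), score(rmse)

rng = np.random.default_rng(3)
ref = rng.normal(270.0, 5.0, size=(48, 48))            # mock reference field (K)
members = ref + rng.normal(0.0, 1.0, size=(10, 48, 48))
members[4] = ref + rng.normal(0.0, 0.1, size=(48, 48))  # one accurate member
idx, best_mse, best_score = select_best(ref, members)
print(idx, round(best_mse, 3), round(best_score, 3))
```

SCORE maps RMSE onto (0, 1], so a perfect realization scores 1 and increasingly poor realizations approach 0, which makes realizations comparable across variables with different units.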


Figure S8: Quantification of flow-dependent uncertainty was carried out during the validation and testing phases for the 2-meter temperature. The first column shows the uncertainty estimated from the ERA5-EDA input data, while the second column illustrates the uncertainty derived from the CARRA2 ensemble. The third column represents the minimum uncertainty associated with the lowest MSE, and the fourth column displays the maximum uncertainty among members with an MSE below 0.2, as generated by the DDPM-ML model. All data correspond to January 6, 2022, and include four analysis times: 00, 06, 12, and 18 UTC.

Figure S9: The actual field of 2-meter temperature produced by the ML model is presented in the right column, corresponding to January 6, 2022, for all four analysis times (00, 06, 12, and 18 UTC). The left column displays the ensemble mean of ERA5-EDA, which serves as the input data. The middle column illustrates the temperature field simulated by CARRA2.

Figure S10: Same as Figure S8, but for the uncertainty in 2-meter temperature valid on January 11, 2022, at 00, 06, 12, and 18 UTC, generated by the DDPM-ML model.

Figure S11: Same as Figure S9, but for the actual field of 2-meter temperature valid on January 11, 2022, at 00, 06, 12, and 18 UTC, generated by the DDPP-ML model.


Figure S12: An illustration of the optimal uncertainty map, generated based on the grid-wise minimum weighted RMSE and weighted MSE values calculated between the DDPM-ML outputs and the ERA5-EDA input, for the u-wind (m/s) valid on January 6, 2022, at 00 and 12 UTC. At each analysis time, ten model realizations were produced, and the outputs demonstrating the best performance were selected according to their minimum MSE values.

Figure S13: Quantification of flow-dependent uncertainty was carried out during the validation and testing phases for the u-wind (m/s). The first column shows the uncertainty estimated from the ERA5-EDA input data, while the second column illustrates the uncertainty derived from the CARRA2 ensemble. The third column represents the minimum uncertainty associated with the lowest MSE, and the fourth column displays the maximum uncertainty among members with an MSE below 0.2, as generated by the DDPM-ML model. All data correspond to January 6, 2022, and include four analysis times: 00, 06, 12, and 18 UTC.

Figure S14: Same as Figure S12, but for the meridional wind component (v-wind).




Figure S15: Same as Figure S13, but for the meridional wind component (v-wind).

Figures S8 and S10 illustrate the uncertainty estimates generated by the DDPM-ML model during the validation period for 2-meter temperature on January 6 and January 11, 2022, across all analysis times. The high-resolution, flow-dependent uncertainty quantification was conducted utilizing two primary input sources: the ERA5-EDA uncertainty information, expressed as standard deviation, and the deterministic forecast fields from CARRA2, in conjunction with the pre-trained DDPM-ML model. By integrating these three components, the DDPM-ML model effectively reproduces uncertainty patterns that are comparable to those observed in the CARRA2 uncertainty, maintaining a similar spatial resolution. It is noteworthy that no prior uncertainty information from CARRA2 was incorporated during the validation phase; the DDPM-ML model inferred these uncertainties solely based on knowledge acquired during training. The model's estimated uncertainty range is defined by the minimum uncertainty corresponding to the lowest MSE and the maximum uncertainty among ensemble members with MSE values below 0.2. These bounds delineate the interval within which forecast uncertainty is expected to reside.

From a scientific perspective, uncertainty quantified at a single time point offers limited insight, whereas characterizing the uncertainty range provides a more comprehensive and informative metric. Consequently, the minimum and maximum uncertainties across ensemble members were computed to yield a more meaningful interpretation of uncertainty over the CARRA2 domain. 
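The selection rule behind these bounds can be sketched in a few lines. The helper below is an illustrative reconstruction, not an excerpt from the project code: the function name, array layout, and the 0.2 default threshold (taken from the description above) are assumptions for the sketch.

```python
import numpy as np

def uncertainty_bounds(uq_maps, mse_scores, mse_threshold=0.2):
    """Illustrative selection of the uncertainty range across ensemble members.

    uq_maps    : array (n_members, ny, nx) of per-member uncertainty fields
    mse_scores : array (n_members,) of each member's MSE against the reference
    Returns the map of the lowest-MSE member (lower bound) and the
    element-wise maximum over members whose MSE is below the threshold
    (upper bound), mirroring the selection rule described above.
    """
    uq_maps = np.asarray(uq_maps, dtype=float)
    mse_scores = np.asarray(mse_scores, dtype=float)
    lower = uq_maps[np.argmin(mse_scores)]       # member with the best MSE
    good = uq_maps[mse_scores < mse_threshold]   # acceptably accurate members
    upper = good.max(axis=0) if good.size else lower
    return lower, upper
```

The interval [lower, upper] then delineates, grid point by grid point, the range within which the forecast uncertainty is expected to reside.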

While the DDPM-ML model successfully captures flow-dependent uncertainty patterns analogous to those of the CARRA2 ensemble, it exhibits spatial variability in uncertainty magnitudes, with some regions showing higher and others lower values. Direct comparison of these magnitudes is complicated by differences in spatial resolution between the datasets. Ideally, the DDPM-ML uncertainty estimates should fall between those of ERA5-EDA and CARRA2, aligning more closely with CARRA2 but not identically. When the DDPM-ML uncertainty map closely resembles ERA5-EDA, this suggests an underestimation of uncertainty, whereas a near-identical match with CARRA2 indicates potential overestimation by the model. Comparable patterns are observed in the DDPM-ML-derived uncertainties for the u-wind (Figures S12 and S13) and v-wind (Figures S14 and S15) components, further corroborating the robustness of the proposed methodological framework. 

As previously discussed, a key advantage of the DDPM-ML model lies in its capacity to preserve detailed information throughout each stage of the training process. This characteristic was leveraged to extract the actual physical fields of 2-meter temperature. Figures S9 and S11 illustrate the 2-meter temperature physical fields obtained from the outputs of the DDPM-ML model. The spatial patterns exhibited by the model outputs at the analysis times closely correspond to those of the CARRA2 simulated values, thereby demonstrating the model's proficiency in reproducing 2-meter temperature fields valid at the same temporal instances. It is crucial to emphasize that the temperature fields produced by the machine learning model do not constitute forecasts. Instead, these synthesized fields represent uncertainty mappings from ERA5-EDA to CARRA2, as well as the learned associations between ERA5-EDA and the deterministic forecasts of CARRA2. This distinction offers additional insight into the performance of the DDPM-ML model. Furthermore, the figures imply that, with further optimization and an increased volume of training data, the DDPM-ML model holds potential for generating realistic forecast fields derived from uncertainty information, although this prospect extends beyond the scope of the current investigation.

A2.4. Summary and Conclusions 

In this research and development work, we designed a machine learning approach that learns from uncertainty and reconstructs the uncertainty relationship between two ensemble datasets, ERA5-EDA and CARRA2, using limited data sources. We developed, tested, and validated an innovative diffusion-based machine learning model that generates high-resolution gridded uncertainty fields, demonstrating a strong capability to capture flow-dependent uncertainties during the testing period. The results further show that the model effectively identifies uncertainty patterns associated with orographic features, confirming its ability to represent complex spatial variability. Overall, these results demonstrate the robustness and reliability of the DDPM-ML model in capturing realistic and spatially consistent uncertainty patterns. In addition to uncertainty quantification, the DDPM-ML model is also capable of reconstructing the actual physical field of the reference variable across the domain, which closely aligns with the CARRA2 field at corresponding times, further reinforcing the validity and strength of the methodology.

The implementation of the DDPM-ML approach to quantify uncertainty for the CARRA2 dataset presents two principal challenges: the limited availability of training data and the constraints imposed by computational resources. Despite these limitations, the DDPM-ML model trained using only a six-month ensemble dataset demonstrates a remarkable ability to reproduce high-resolution uncertainty structures across the CARRA2 domain. Expanding the training dataset, potentially by doubling or further increasing its duration, is expected to enhance model accuracy and reliability, as the performance of machine learning models heavily depends on the quality, diversity, and volume of input data. However, as noted in the preceding section, preparing large, high-quality datasets is a time-intensive and computationally demanding process that requires careful data curation and substantial computing capacity. Consequently, such an extension lies beyond the scope and timeline of the present study but represents a promising direction for future research.

A second major challenge lies in the limited computational resources available for model training, evaluation, and validation. All DDPM-ML experiments were conducted on the ECMWF ATOS supercomputing system under constrained resource allocations: each user could access at most five GPU nodes, each equipped with four GPUs. Model training and evaluation were typically performed using four GPUs in parallel on a single node, although this configuration required long queue times. Moreover, increasing the number of nodes (for example, from one to two) resulted in a proportional increase in waiting time, from approximately two hours to four hours, further emphasizing the computational limitations encountered during this study. Training the DDPM-ML model for a single parameter (e.g., 2-meter temperature) up to 20,000 steps requires approximately 10 hours of runtime. The evaluation and validation phases are more computationally intensive, as the total processing time depends on the number of samples generated from each model for both uncertainty quantification and field reconstruction. For example, performing uncertainty quantification and actual field derivation of 2-meter temperature using 10 trained models, each producing 20 samples, requires more than six hours of computation per analysis time.

It is important to emphasize that each saved DDPM-ML model checkpoint encapsulates rich and comprehensive information, enabling a wide range of subsequent analyses and experimental extensions through the modification of hyperparameter configurations. As demonstrated in the results, the DDPM-ML model is not only capable of quantifying uncertainty but can also reconstruct the actual physical field of the reference variable with high spatial fidelity. Furthermore, the model framework presents the potential to generate short-term forecast fields and even produce ensemble members of such forecasts (Andrae et al., 2024), thereby extending its applicability beyond pure uncertainty estimation. However, a detailed investigation of these capabilities would require substantially greater computational resources and a dedicated research effort. Consequently, such extensions are identified as promising directions for future work and new project proposals and are beyond the scope of the current work.

Appendix 3: Technical report and Installation of Denoising Diffusion Probabilistic Model Machine Learning for CARRA2 on the ECMWF ATOS System

A3.1. Introduction

This technical report presents a comprehensive overview of the procedures for input data preparation, processing, and installation of the probabilistic denoising diffusion model on the ECMWF ATOS supercomputer. It further details the methods for sample generation and evaluation of the Denoising Diffusion Probabilistic Model (DDPM-ML) within a machine learning context. The key point is that, to quantify uncertainty in the CARRA2 and ERA5 datasets, the DDPM-ML is applied in a supervised learning framework, wherein the uncertainty quantification derived from CARRA2 serves as the target for the model trained on ERA5 data.

The DDPM-ML framework conceptualizes both the forward (noising) and reverse (denoising) processes as Markov chains. As a generative model, it estimates pixel-wise probability density functions (PDFs), necessitating pixel-level information from the input data during training. In the forward process, Gaussian noise is incrementally introduced over T discrete steps; empirical evidence suggests that T = 4000 steps yields effective sampling results. During the reverse process, a U-Net architecture (Ronneberger et al., 2015) is trained to iteratively predict and remove noise at each step. This procedure is computationally intensive, as it requires executing all T steps (e.g., 4000 steps per image). The reverse process ultimately generates uncertainty samples corresponding to the CARRA2 dataset.
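The forward (noising) chain admits a closed-form sample at any step, which the sketch below illustrates in NumPy. The linear schedule endpoints (1e-4 to 0.02) are the common DDPM defaults and are assumptions here, not the exact CARRA2 configuration; only T = 4000 is taken from the text.

```python
import numpy as np

# Minimal sketch of the forward (noising) Markov chain with a linear beta
# schedule. After T steps the cumulative product of (1 - beta_t) is tiny,
# so x_T is essentially pure Gaussian noise.
T = 4000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule (assumed endpoints)
alphas_cumprod = np.cumprod(1.0 - betas)    # \bar{alpha}_t

def q_sample(x0, t, rng=None):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    rng = np.random.default_rng(0) if rng is None else rng
    a_bar = alphas_cumprod[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

x0 = np.zeros((4, 4))                       # toy "image" standing in for a field
x_late = q_sample(x0, T - 1)                # nearly pure Gaussian noise
```

The reverse process is the expensive part: a trained U-Net must be applied at each of the T steps to predict and subtract the noise, which is why sampling a single image requires 4000 network evaluations.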

A3.2. Python Environment

The DDPM-ML model is designed to operate within the ECMWF ATOS scripting framework and leverages PyTorch with GPU acceleration for training purposes. Prior to execution, create and configure a Conda environment with the necessary Python packages by following these steps:

module load conda
conda create --name ddenv_v2 python=3.10 -y
conda activate ddenv_v2
conda install -c nvidia -c pytorch -c conda-forge pytorch torchvision torchaudio pytorch-cuda=12.1 numpy Pillow scikit-learn matplotlib setproctitle pandas pandasql blobfile mpi4py xarray zarr fsspec netcdf4 h5netcdf

A3.3. DDPM-ML System

a. Input data preparation

A detailed description of the input data preparation for the DDPM-ML model is provided in the first section of Appendix 2. 
A detailed description of the input data preparation for the DDPM-ML model is provided in the first section of Appendix 2. All necessary scripts and job configurations are included in the Git repository: https://github.com/CARRA2/Uncertainty_Quantification/tree/Sep_2025

b. DDPM-ML Model

This script serves as the primary driver program for training diffusion models on image datasets within the CARRA2 domain, incorporating Uncertainty Quantification (UQ). It manages the end-to-end process including implementation instructions, data loading, and iterative training, while accommodating distributed training configurations through the torchrun tool and facilitating comprehensive output logging. The program dynamically constructs command-line arguments by integrating default settings from the model and diffusion components with user-defined parameters, thereby enabling flexible customization of key training hyperparameters such as device selection, learning rate, number of training steps, and microbatch size. 
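The merging of default and user-defined settings into command-line flags can be sketched as follows. This is a hypothetical illustration, not an excerpt from Train_Main.py; the option names follow Table 2, but the helper itself is an assumption.

```python
# Illustrative sketch of how a driver script can merge default model/diffusion
# settings with user overrides and emit command-line flags for the trainer.
def build_args(user_params):
    defaults = {
        "diffusion_steps": 4000,
        "image_size": 256,
        "noise_schedule": "linear",
        "lr": 1e-4,
        "batch_size": 8,
        "microbatch": 4,
        "class_cond": True,
    }
    merged = {**defaults, **user_params}        # user-defined values win
    return [f"--{k}={v}" for k, v in merged.items()]

# Example: override the learning rate and add the total step count.
argv = build_args({"lr": 2e-4, "steps": 20000})
```

Because the user dictionary is applied last, any hyperparameter (device, learning rate, step count, microbatch size) can be customized without touching the defaults.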

The Python scripts used in the implementation of the DDPM-ML model are listed below.

Table 1: List of Python Scripts and Job Files

| Category / Folder | File Name | Description / Purpose |
|---|---|---|
| Main Scripts | Train_Main.py | Main script for training the diffusion model. |
| Evaluation Scripts | evaluate.py | Script for evaluating model performance across datasets (for UQ and each variable). |
|  | evaluate_FIELD.py | Field-specific evaluation script, used for variable-based assessment (t2m, sp, u10, v10, …). |
| Job Submission Scripts | Run_Training.job | Job submission script for launching training on ATOS. |
|  | Run_evaluation.job | Job submission script for running evaluation tasks. |
| Source Folder: src_diffusion/ | diffusion_dist.py | Handles distributed training setup for parallel computation. |
|  | diffusion_fp16.py | Manages mixed-precision (FP16) computation for efficiency. |
|  | diffusion_gaussian.py | Implements Gaussian diffusion processes and noise modeling. |
|  | diffusion_train.py | Core training logic for the diffusion model. |
|  | image_datasets.py | Dataset loader and pre-processing utilities for image inputs. |
|  | logger.py | Logging utility for training and evaluation progress. |
|  | losses.py | Defines and computes the loss functions used during training. |
|  | nn.py | Neural network components and layer definitions. |
|  | resample.py | Implements resampling strategies in the diffusion process. |
|  | respace.py | Defines timestep spacing and schedule adjustment functions. |
|  | unet.py | Contains the U-Net model architecture used for diffusion and super-resolution tasks. |


The Python script Train_Main.py will be executed via the batch job Run_Training.job to perform the training process. The model configuration parameters are detailed in Table 2. To monitor error statistics, it is necessary to extract values from the log file, which is located in the same directory as the output files.
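Extracting error statistics from the training output can be done with a short parser over the progress file written alongside the checkpoints. The sketch below is illustrative: the column names are assumptions and should be checked against the actual header of progress.csv on ATOS.

```python
import csv
import io

# Illustrative parser for a progress.csv-style file: returns (step, loss)
# pairs suitable for plotting the training curve.
def read_loss_curve(text, step_col="step", loss_col="loss"):
    rows = csv.DictReader(io.StringIO(text))
    return [(int(r[step_col]), float(r[loss_col])) for r in rows]

# Toy example with assumed column names and made-up values.
sample = "step,loss\n0,0.95\n2000,0.31\n4000,0.18\n"
curve = read_loss_curve(sample)
```

A monotonically decreasing curve across checkpoints is the simplest sanity check that training is progressing as expected.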

Table 2: Summary of the model configuration parameters used for diffusion-based training and generation (evaluation).

| Parameter | Description | Value / Setting |
|---|---|---|
| --diffusion_steps | Number of diffusion and denoising iterations each image undergoes during training. | 4000 |
| --image_size | Maximum image dimension used during training. | 256 |
| --noise_schedule | Type of noise schedule applied; defines how noise levels change during diffusion. Can be modified during tuning. | linear |
| --lr | Learning rate used for model optimization. | 1e-4 |
| --batch_size | Number of images processed in each training batch. | 8 |
| --microbatch | Subdivision of the batch for memory efficiency; typically set according to available GPU memory. | 4 |
| --class_cond | Enables supervised learning by conditioning on class labels. | True |
| --steps | Total number of training iterations. | 20,000 |
|  | Model checkpoint saving frequency. | Every 2,000 steps |
|  | Empirical performance note: model accuracy tends to improve notably after this point (varies with dataset size). | ~10,000 steps |


It is important to highlight that the ERA5-EDA dataset possesses a coarser spatial resolution, characterized by grid dimensions of 130 × 114, whereas the CARRA2 dataset features a significantly finer resolution of 2880 × 2880 grid points. To reconcile these differences and generate outputs at the CARRA2 resolution, a specific approach was integrated within the training loop to facilitate appropriate sampling within the Super-Resolution Model, specifically the U-Net Model. This model encompasses essential components for both the training and deployment of diffusion-based super-resolution frameworks conditioned on low-resolution ERA5-EDA input maps. Key features of the model include advanced sampling techniques tailored for diffusion training, U-Net inspired architectural designs, incorporation of residual and attention mechanisms, cross-attention conditioning, as well as optional mixed-precision training capabilities to enhance computational efficiency.
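To make the resolution gap concrete, the sketch below replicates a coarse 130 × 114 field onto the 2880 × 2880 CARRA2 grid by nearest-neighbour indexing. This is only an illustration of the grid mismatch; the actual framework conditions the U-Net on the low-resolution maps through learned cross-attention rather than simple resampling.

```python
import numpy as np

# Map each fine-grid row/column back to its nearest coarse-grid cell,
# yielding a blocky nearest-neighbour upsampling of the coarse field.
def upsample_nearest(coarse, fine_shape):
    ny, nx = coarse.shape
    fy, fx = fine_shape
    iy = (np.arange(fy) * ny) // fy   # fine rows -> coarse rows
    ix = (np.arange(fx) * nx) // fx   # fine cols -> coarse cols
    return coarse[np.ix_(iy, ix)]

coarse = np.arange(130 * 114, dtype=float).reshape(130, 114)  # ERA5-EDA-sized toy field
fine = upsample_nearest(coarse, (2880, 2880))                 # CARRA2-sized output
```

Each coarse cell thus covers roughly a 22 × 25 block of fine-grid points, which is the scale gap the learned super-resolution model has to bridge.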

c. Training Output

During the training of diffusion models, checkpoints are systematically saved at predetermined intervals to facilitate the monitoring of model performance and the assessment of incremental advancements. The frequency of checkpoint saving is specified within the training configuration, thereby enabling the examination and validation of intermediate model states without necessitating a complete retraining from the initial state. The outputs produced by the model at various stages of training (for instance, at steps 0, 2000, 4000, 6000, 8000, and continuing up to 20,000) offer critical insights into the model's learning progression. These checkpoints document the model's gradual enhancement in reconstructing high-resolution outputs from noisy or low-resolution inputs. Typically, each subsequent checkpoint exhibits improved output quality and more precise spatial representations, reflecting the increasing stability and data-adaptiveness of the diffusion and denoising processes. This observed progressive refinement is instrumental in determining the optimal training duration, thereby achieving a balance between model accuracy and computational resource expenditure. 
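The checkpoint cadence described here, and the resulting file names listed in Table 3, can be reproduced with a trivial helper. This is an illustration of the naming scheme, not project code.

```python
# Enumerate the checkpoint steps for a run that saves every 2,000 steps,
# including the initial (step-0) state and the final model at 20,000 steps.
def checkpoint_steps(total_steps=20000, save_interval=2000):
    return list(range(0, total_steps + 1, save_interval))

# Zero-padded names matching the model0*.pt series in Table 3.
names = [f"model{s:06d}.pt" for s in checkpoint_steps()]
```

Eleven checkpoints result, from model000000.pt through model020000.pt, each of which can be evaluated independently to track the progressive refinement discussed above.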

Table 3: The key files generated during model training, including checkpoint files, log outputs, and progress tracking data.

| File Name | File Type | Description |
|---|---|---|
| log.txt | Log File | Contains training logs, including losses, metrics, and system information. |
| progress.csv | Progress File | Records performance metrics over training steps for plotting or analysis. |
| model000000.pt | Model Checkpoint | Initial model state before training begins. |
| model002000.pt | Model Checkpoint | Saved after 2,000 training steps. |
| model004000.pt | Model Checkpoint | Saved after 4,000 training steps. |
| model006000.pt | Model Checkpoint | Saved after 6,000 training steps. |
| model008000.pt | Model Checkpoint | Saved after 8,000 training steps. |
| model010000.pt | Model Checkpoint | Saved after 10,000 training steps. |
| model012000.pt | Model Checkpoint | Saved after 12,000 training steps. |
| model014000.pt | Model Checkpoint | Saved after 14,000 training steps. |
| model016000.pt | Model Checkpoint | Saved after 16,000 training steps. |
| model018000.pt | Model Checkpoint | Saved after 18,000 training steps. |
| model020000.pt | Model Checkpoint | Final trained model after 20,000 steps. |


d. Diffusion Sampling and Evaluation Overview

To sample and assess the output of the DDPM-ML training model, it is essential to run two principal Python scripts (evaluate.py and evaluate_FIELD.py) via the bash job script Run_evaluation.job. This script enables a comprehensive evaluation of diffusion-based CARRA2/ERA5 models on the ATOS system, which is managed by SLURM. It automates the dynamic linking of the required model checkpoints and the transfer of corresponding ERA5 and CARRA2 input images for each evaluation stage. The execution is parallelized across multiple GPUs using torchrun, thereby improving computational efficiency and reducing processing time. Additionally, users can modify parameters such as image size, batch size, and the number of samples to balance evaluation accuracy with computational resource utilization. 
The evaluation framework employs a two-stage process for each date under consideration: uncertainty quantification (Table 4) and evaluation of the reconstructed physical field (Table 5).
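The fan-out of one evaluation run per saved checkpoint can be sketched as below. This is a hypothetical illustration of what a Run_evaluation.job-style wrapper assembles; the script names follow Table 1, while the flags and paths are assumptions rather than the job script's actual contents.

```python
# Build one torchrun invocation per checkpoint so each saved model state is
# evaluated in parallel across the GPUs of a single node.
def evaluation_commands(checkpoints, script="evaluate.py", gpus=4):
    cmds = []
    for ckpt in checkpoints:
        cmds.append(
            f"torchrun --nproc_per_node={gpus} {script} --model_path={ckpt}"
        )
    return cmds

# Example: evaluate the two latest checkpoints of a training run.
cmds = evaluation_commands([f"model{s:06d}.pt" for s in (12000, 14000)])
```

Submitting these commands through SLURM, as the job script does, is what produces the per-checkpoint output images listed in Tables 4 and 5.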


Table 4: UQ Evaluation Output Files

| Category | Files |
|---|---|
| Evaluation Type | Uncertainty Quantification |
| Logs | rank_0.log, rank_1.log, rank_2.log, rank_3.log |
| Model Outputs (SD Evaluations) | UQ_ckpt_model012000.pt.png, UQ_ckpt_model014000.pt.png, UQ_ckpt_model016000.pt.png, UQ_ckpt_model018000.pt.png, UQ_ckpt_model020000.pt.png |
| Main Output Image | UQ.png |



Table 5: Field Evaluation Output Files

| Category | Files |
|---|---|
| Evaluation Type | Field Evaluation |
| Logs | rank_0.log, rank_1.log, rank_2.log, rank_3.log |
| Model Outputs (SD Evaluations) | FIELD_ckpt_model012000.pt.png, FIELD_ckpt_model014000.pt.png, FIELD_ckpt_model016000.pt.png, FIELD_ckpt_model018000.pt.png, FIELD_ckpt_model020000.pt.png |
| Main Output Image | TARGET_CARRA2.png |



Figure 2: Panel a) shows the uncertainty estimate associated with the higher-resolution uncertainty quantification, as depicted in the file UQ.png; panel b) presents the corresponding 2-meter temperature (K) field for a single UTC. More information is available in C3S2_D361a.1.4.1_UncertaintyEstimation_v1.

In summary, the final result of the higher resolution uncertainty quantification is presented in the file UQ.png. The file TARGET_CARRA2.png serves as a reference for comparison or evaluation against the actual field data.

A3.4. Github Link

The Git repository CARRA2/Uncertainty_Quantification is available at
https://github.com/CARRA2/Uncertainty_Quantification/tree/Sep_2025


This document has been produced in the context of the Copernicus Climate Change Service (C3S).

The activities leading to these results have been contracted by the European Centre for Medium-Range Weather Forecasts, operator of C3S on behalf of the European Union (Delegation Agreement signed on 11/11/2014 and Contribution Agreement signed on 22/07/2021). All information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose.

The users thereof use the information at their sole risk and liability. For the avoidance of all doubt, the European Commission and the European Centre for Medium-Range Weather Forecasts have no liability in respect of this document, which merely represents the author's view.