Contributors: Xiaohua Yang (DMI), Sebastian Pelt (DMI), Kristian Møller (DMI), Kristian Pagh Nielsen (DMI), Kasper Tølløse (DMI), Harald Schyberg (MET Norway), Per Dahlgren (MET Norway), Jostein Blyverket (MET Norway), Yurii Batrak (MET Norway), Åsmund Bakketun (MET Norway), Bolli Palmason (IMO), Swapan Malick (SMHI), Patrick Samuelsson (SMHI)
Issued by: DMI / Xiaohua Yang
Date: 24/04/2025
Ref: C3S2_D361a.1.1.2_Test_Evaluation_v3.docx
Official reference number service contract: 2022/C3S2_361a_METNorway/SC1
The reanalysis system of the Copernicus Arctic Regional ReAnalysis in Copernicus phase 2 (hereafter referred to as CARRA2) builds on the HARMONIE-AROME 46h1 reference system, with additional porting of reanalysis-related features from the Copernicus Arctic Regional ReAnalysis in Copernicus phase 1 (hereafter referred to as CARRA1). It is aimed at facilitating a pan-Arctic reanalysis for a much larger area than those covered in CARRA1 (Figure 1) and for a longer reanalysis period, while retaining a high resolution comparable to CARRA1 (2.5 km). The system takes advantage of many years of scientific advances achieved within the ACCORD/HARMONIE development community. At the same time, it has been necessary to make adaptations and optimizations in the system configuration to fit the reanalysis production to the available computing budget. The combination of these requirements necessitated a preparation phase in which implementation, testing and tuning of additional features beyond the reference CY46h1 configuration were conducted.
This test and evaluation report provides highlights from the verification of the first years of the CARRA2 reanalysis, as well as from impact and sensitivity tests on configuration and tuning options that led the way towards the final selected configuration. A companion full system documentation (Yang et al., 2025) describes the main elements of the CARRA2 system with references, in particular the extensions and adaptations made relative to the HARMONIE-AROME reference system.
Figure 1: The CARRA2 domain vs CARRA1 (CARRA-West and CARRA-East) domains.
After a development period in which successive preliminary system versions were tested, the final system configuration was achieved with the following main features:
For data assimilation the system includes:
A quality-assured, near-complete version of the CARRA2 system, ready for spin-up production, was finalized (tagged) in April 2024. The production of the CARRA2 reanalysis data set is carried out with independent streams executed in parallel, with the aim of achieving speedy delivery. The 40-year reanalysis period is split into eight 6-year time slices starting from 1 September of the years 1984, 1989, 1994, 1999, 2004, 2009, 2014 and 2019, respectively, each with a first 'warm-up' year to spin up soil properties. After completion of the 1-year spin-up runs for each of the eight CARRA2 production streams in July, some minor adaptations as well as implementation of the CARRA2 MARS archive routines were carried out. These changes did not compromise the meteorological quality of the production. On 3 September 2024 this work was completed, the final version of the system was tagged, and the CARRA2 production with longer forecasts and data storage (archiving in the MARS system) for future publication to users was started.
In the following sections we present test and evaluation results for the full reanalysis system as well as for individual system components. The latter tests were mostly done in the development phase using different versions of the provisional CARRA2 system; we consider these tests to be representative also of the final system. While assessment of the quality of reanalysis datasets may take many forms, this report concentrates on highlighting results from observation verification using the regular CARRA2 monitoring utility based on the HARMONIE MONITOR verification tool. For the assessment of overall system skill, the verification statistics against observations are intercompared with those of the ERA5 dataset. For some of the assessments where there is overlap with the CARRA1 domains (CARRA-East and CARRA-West), CARRA1 datasets are also included. The intercomparison between CARRA1 and CARRA2 in this report can be seen as additional material on top of the dedicated report on the same subject by Køltzow et al. (2025).
Throughout the configuration and preparation phase, and also during the reanalysis production, continuous quality assurance was conducted through case examinations, assimilation monitoring and validation against in-situ surface and upper-air measurements. While the report by Køltzow et al. (2025) concentrates on in-depth case studies as well as intercomparison with the CARRA1 and ERA5 datasets over the European Arctic regions, we focus here more on the intercomparison of verification statistics over the broader pan-Arctic region. For observation data, our regular quality assurance monitoring relies mainly on in-situ surface observations and upper-air radiosonde measurements, to a large extent the same data sources as used in the data assimilation of the reanalysis itself (with the exception of 10m wind speed observations over land, which are not assimilated).
When using assimilated observations as a reference in the statistics, it should be noted that the reference is not completely independent, in particular when verifying at analysis time and at sites whose observations were assimilated; the fidelity of the system may therefore be somewhat overestimated by such comparisons. In data assimilation, the fit of the analysis to the observations is partly determined by the relative weighting of background and observations selected in the assimilation scheme. This weighting is not tuned to fit each observation, but to give the best possible overall analysis also in unobserved regions and at unobserved times, helping to produce the best possible description of the atmospheric state both at analysis time and for the forecast (first guess) for the next analysis time.
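To illustrate the role of this weighting, consider the textbook single-observation case (a minimal sketch, not the CARRA2 OI/3D-Var code; the function and variable names are ours):

```python
# Illustrative scalar example of background/observation weighting: the analysis
# is pulled towards the observation in proportion to the assumed error variances.
def scalar_analysis(x_b, y_o, var_b, var_o):
    """Return analysis value and gain for background x_b, observation y_o,
    background error variance var_b and observation error variance var_o."""
    k = var_b / (var_b + var_o)      # gain: weight given to the observation
    x_a = x_b + k * (y_o - x_b)      # analysis = background + gain * innovation
    return x_a, k

# Example: background 2m temperature -12.0 C, observation -10.0 C, equal assumed
# error variances give k = 0.5 and an analysis of -11.0 C.
x_a, k = scalar_analysis(-12.0, -10.0, 1.0, 1.0)
```

With a larger assumed observation error variance the analysis stays closer to the background, which is why an analysis is not expected to fit every assimilated observation exactly.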
At the time of writing, the CARRA2 production has finished about one third of the reanalysis dataset. We have therefore chosen to select some highlights from the available time series to present verification reflecting different time slices, seasons and regions. An update of these verification intercomparisons is scheduled for the final phase of the project, when the full, continuous reanalysis time series is completed.
In the subsections of 2.1, we compare CARRA2 to the ERA5 dataset over the entire model domain: 2.1.1 presents time series of verification results for key weather parameters for the month of January in each of the 8 reanalysis time slices. In 2.1.2, we focus on the seasonal variability by showing the temperature time series for three additional months: April, July and October. Finally, 2.1.3 presents time series of observation fit for key parameters for a selected month in six different geographic regions. In section 2.2, we add the CARRA1 data set to the intercomparison for the overlapping European Arctic regions. In 2.3, we give an overall summary and discussion of all these time series verifications.
In Figures 2-5, daily averaged time series of observation fit for January in each of the eight CARRA2 reanalysis streams are shown for the analyzed output quantities: 2-metre temperature (T2m), 2-metre relative humidity (RH2m), 10-metre wind speed (W10m) and mean sea level pressure (MSLP), respectively. The averages are taken over all available in-situ observation stations within the CARRA2 domain (about 1800 stations in total) and over all analysis base times (8 times per day, every 3 hours).
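For reference, daily bias and standard deviation statistics of this kind can be computed from pooled model-minus-observation departures roughly as follows (a minimal sketch; the actual MONITOR tool handles station matching, quality flags and stratification, which are not shown here):

```python
import numpy as np
from collections import defaultdict

def daily_fit_statistics(records):
    """records: iterable of (date, model_value, obs_value) tuples for one
    parameter, pooled over all stations and analysis times of that day.
    Returns {date: (bias, std, n)} computed from model-minus-observation
    departures, as in the MONITOR-style time series."""
    per_day = defaultdict(list)
    for date, model, obs in records:
        per_day[date].append(model - obs)
    stats = {}
    for date, departures in per_day.items():
        dep = np.asarray(departures)
        stats[date] = (dep.mean(), dep.std(), dep.size)
    return stats
```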
Figure 2: Time series of daily standard deviation and bias errors for CARRA2 (green) vs ERA5 (magenta), validated against all land SYNOP stations within the CARRA2 domain, for 2m temperature (in Celsius) in January 1986 (upper left), 1991 (upper right), 1996 (second row left), 2001 (second row right), 2006 (third row left), 2011 (third row right), 2016 (bottom left) and 2021 (bottom right). The number of available observations used in the statistics is displayed by dashed lines (see the scale on the right-hand side of each frame).
Figure 2 shows that, for T2m (one of the most important weather and climate parameters), CARRA2 has in general a substantially smaller departure in terms of standard deviation error (STD) for all the time slices, roughly 0.5 degrees less on average. Overall, the bias values in CARRA2 are also lower than in ERA5, although the differences are relatively small and there are occasional exceptions. For the screen-level relative humidity at 2 metres (Figure 3), while the advantage of CARRA2 in STD error still holds for most of the periods, the bias in CARRA2 tends to fluctuate more, and for most of the periods CARRA2 has a relatively small but persistent dry bias of 2 to 3 percentage points.
From Figures 2 and 3, the advantage in grid resolution of the km-scale CARRA2 reanalysis over the synoptic-scale ERA5 dataset seems clear (from the standard deviation error point of view) for screen-level properties. High resolution is effective in improving reanalysis fidelity for quantities like temperature and humidity, which are sensitive to the surface heterogeneity that is better described by high-resolution models. On the other hand, this does not hold for the bias in temperature and humidity, for which the advantage is either minor (for temperature) or absent (for humidity, which instead shows a minor but systematic dry bias). From operational experience, biases in screen-level quantities are not strongly related to grid resolution. Rather, they typically reflect deficiencies and error compensations among a variety of contributing factors across the entire assimilation-forecast workflow, especially in the low-level physical representation (physiographic database, surface scheme and/or upper-air physical parameterization), or dependencies on tuning parameters chosen in operational applications.
It is worth noting that a higher degree of caution is needed when interpreting verification results on average bias involving humidity parameters, as there is usually more uncertainty about the quality assurance of the humidity measurements as well as about their representativity. Having said this, the dry-bias tendency shown here seems consistent with similar findings from the HIRLAM research community for the multi-layer SURFEX scheme introduced in CY46h1 over continental Europe (Patrick Samuelsson, SMHI, personal communication).
Figure 3: Same as Figure 2, but for RH2m in %.
For the 10m wind, as shown in Figure 4, the resolution advantage also seems clear in terms of STD errors for all time slices throughout the reanalysis period. This demonstrates the added value even more clearly than for T2m and RH2m, since these wind observations were not assimilated and form an independent reference. In comparison to the ERA5 wind, CARRA2 also tends to have slightly weaker winds. While CARRA2 verifies better against the observations for a majority of the periods, in which ERA5 tends to suffer from a positive wind bias, for some other periods, such as January 1997 (second from top, left panel in Figure 4), winds from both CARRA2 and ERA5 are too weak, and CARRA2 suffers more from a negative wind bias during the second half of that month. It should be noted again that screen-level wind measurements from land stations are not assimilated, due to representativeness issues related to land-surface complexity and heterogeneity; as such, the CARRA2 verification characteristics reflect not only grid resolution but also the physical parameterization.
Figure 4: Same as Figure 2, but for W10m in m/s.
Figure 5: Same as Figure 2, but for MSLP in hPa.
For mean sea level pressure (Figure 5), interestingly, the STD errors of CARRA2 and ERA5 are roughly comparable, whereas the bias is more in favor of CARRA2. As MSLP is a quantity dominated by large scales, kilometre-scale models like CARRA2 do not have an obvious advantage relative to the coarser model used for ERA5. As for the bias, while ERA5 shows a systematic negative bias over the entire CARRA2 pan-Arctic domain, CARRA2 has a smaller bias, although it tends to exhibit more jumpiness in the bias time series.
The results shown in 2.1.1 were for the month of January. In this section, we select the parameter T2m to examine possible seasonal characteristics for other representative months covering each season: April (Figure 6), July (Figure 7) and October (Figure 8). This is again done within each of the 8 reanalysis streams, covering evenly spread years within the reanalysis period as in the previous section.
Figure 6: Same as for Figure 2, but for April.
Figure 7: Same as for Figure 2, but for July.
Figure 8: Same as for Figure 2, but for October.
Considering Figures 6-8, the verification behavior seen in Figure 2 for the selected January month looks rather similar in all the reanalysis streams for the representative periods in the spring, summer and autumn seasons. These figures again demonstrate a significant added value of CARRA2 over ERA5 in the representation of surface characteristics in the pan-Arctic region. In terms of bias, the observation fits of CARRA2 and ERA5 are rather comparable for all seasons. Thus, this comparison reveals no obvious seasonal variation in the relative performance.
Overall, from the verification time series comparisons presented so far, we see that CARRA2 has a better fit to observations than ERA5 for the screen-level quantities T2m, RH2m and W10m, especially in the STD scores. For RH2m, on the other hand, CARRA2 shows a dry bias of roughly 2 percentage points throughout the comparisons. In summary, apart from this slightly dry tendency, CARRA2 generally performs better than ERA5 for the periods considered.
While the verification intercomparison so far focused on averaged time series for the full pan-Arctic domain, we now examine the verification characteristics for different geographic regions within the pan-Arctic domain for key weather parameters: T2m (2-metre temperature), RH2m (2-metre relative humidity) and W10m (10-metre wind speed). Two station lists have been selected to represent each continent: the "EWGLAM" list (a high-quality baseline station list selected by the European Working Group on Limited Area Modelling) and the "Greenland Ice Sheet" collection for Europe, the "Alaska" and "Canadian Mainland" stations for North America, and "Russian West" and "Northern Siberia" for Asia. As the CARRA1 data set also covers the "Greenland Ice Sheet" region, it is included in the intercomparison for that region. The selected period is one of the January months used in Figures 2-5, January 2011, chosen primarily for the relatively high number of available observations in some of the data-sparse regions.
Figure 9: Time series of daily averaged standard deviation and bias errors validated against land SYNOP stations for T2m in January 2011 over Europe (139 stations, upper left panel), Greenland Ice Sheet (17 stations, upper right panel), Alaska (16 stations, middle left), Canada mainland (32 stations, middle right), Russia West (54 stations, lower left) and North Siberia (37 stations, lower right). ERA5 data set in magenta, CARRA2 in green. For Greenland Ice Sheet, dataset from CARRA1 is also compared in cyan color.
Figure 10: Same as Figure 9, but for RH2m.
Figure 11: Same as Figure 9, but for W10m.
Compared to the verification results averaged over the full pan-Arctic model domain, shown in the lowest left panels of Figures 2, 3 and 4, the statistics for the different geographic regions shown in Figures 9-11 exhibit considerable variability between sub-regions and between parameters. For T2m and RH2m, while the scores for Europe, the Greenland Ice Sheet and Russia West are significantly in favor of the CARRA2 reanalysis, the relative advantage of CARRA2 over ERA5 appears less pronounced for some of the other regions (Canada mainland, Canadian Arctic Archipelago, Russian Siberia). For relative humidity, the dry-bias tendency of the CARRA2 dataset appears common to all regions. For wind speed the CARRA2 scores for most regions are favorable, with one major exception for the Greenland Ice Sheet stations (upper right panel in Figure 11). This, however, may not be of real concern: the wind measurements at the Greenland Ice Sheet stations are taken at 3m height instead of the 10m recommended by WMO, implying a significant negative bias in the measurement data when compared with model winds defined at 10m height. (The wind measurements by PROMICE and GCNET, the primary station networks performing weather observations on the Greenland ice sheet, are typically made at 3m height, as the conditions there do not allow standard 10m wind masts.)
As shown in Figure 1, CARRA2 covers an extensive pan-Arctic domain, fully embedding the two European Arctic domains covered by the CARRA1 reanalysis: CARRA-West for the Greenland-Iceland region and CARRA-East for northern Scandinavia, the Barents Sea and Svalbard. Below, we present verification intercomparisons for key parameters with all three datasets, CARRA1, CARRA2 and ERA5, over the overlapping CARRA-West and CARRA-East domains during January (Figure 12), April (Figure 13), July (Figure 14) and October (Figure 15) of 2011 (2011 is selected as one of the first completed full years, with a relatively high number of observations).
Figure 12: Time series of daily standard deviation and bias errors validated against land SYNOP stations in January 2011 over the CARRA-West (left panels) and CARRA-East (right panels) domains, for T2m (upper), RH2m (middle) and W10m (lower). CARRA2 data in green, CARRA1 in cyan, ERA5 in magenta.
Figure 13: Same as Figure 12, but for April.
Figure 14: Same as Figure 12, but for July.
Figure 15: Same as Figure 12, but for October.
From Figures 12-15 it seems clear that, in the big picture and common to all these periods, both of the kilometre-scale reanalyses, CARRA2 and CARRA1, show significant advantages over ERA5.
For the CARRA-West domain, the regional reanalyses show significant advantages in the standard deviation scores for temperature and humidity. For wind speed, the advantage is not obvious in STD but quite pronounced in bias for all seasons, where the ERA5 winds tend to be too weak. The verification scores for wind speed are affected not only by model resolution, but also by the physical parameterization.
For the CARRA-East domain, the relative advantages of the CARRA1 and CARRA2 reanalyses over ERA5 are pronounced in most STD scores, although no dataset has a significant advantage in bias. On the other hand, the differences between the verification scores of CARRA1 and CARRA2 are in general smaller than their differences from the ERA5 scores.
For the CARRA-West domain, the fit to observed wind speeds of the two regional reanalysis datasets is rather alike. While CARRA2 in general tends to have a higher STD in T2m than CARRA1, it is lower for RH2m except in July. For CARRA-East the score differences between CARRA1 and CARRA2 in T2m and W10m are in general minor. For RH2m, CARRA2 in general has a somewhat larger STD error than CARRA1. For the month of January, CARRA1 and CARRA2 both suffer large RH2m biases, but in opposite directions. All in all, both CARRA2 and CARRA1 have advantages over ERA5 in the screen-level parameters over the CARRA1 domains.
In the scientific literature, non-hydrostatic effects are perceived to become significant when the vertical acceleration becomes non-negligible in comparison to gravity and the vertical pressure gradient force. However, it is unclear how far the validity of the hydrostatic approximation holds as a function of model grid scale.
HARMONIE-AROME is a kilometre-scale model supporting both hydrostatic and non-hydrostatic dynamics configuration options. When used for operational forecasting at a grid resolution of 2.5 km or finer, the non-hydrostatic option is normally the default. In terms of computational cost, the non-hydrostatic option in HARMONIE-AROME is associated with two additional prognostic equations describing the non-hydrostatic pressure departure and divergence, and it is about 20% more costly than the hydrostatic option for the same time stepping. This significant impact on computational cost makes it natural to consider either a coarser horizontal resolution or, preferably, the use of the hydrostatic option. In this connection, the experiences documented during CARRA1 on the possible setup choices for a pan-Arctic regional reanalysis (see the report by Yang et al., 2021) are of direct relevance.
In this 2021 pilot study on the optimal configuration for a reanalysis over a pan-Arctic domain, Yang et al. performed parallel experiments to examine the impact of using the hydrostatic simplification in runs at grid resolutions of 2.5 km and 3.75 km. Several month-long episode runs were performed to examine the statistical behavior of the non-hydrostatic vs hydrostatic configurations. Furthermore, anticipating that, for a given grid scale, non-hydrostatic effects will be clearest in extreme weather conditions when vertical motion becomes dominant, several weather events associated with extreme wind, temperature and precipitation were selected to examine the sensitivity to introducing the hydrostatic approximation.
In arriving at the final configuration of CARRA2, tests of the hydrostatic and non-hydrostatic options have only been repeated for a short episode over the CARRA-West domain using the CARRA2 system, to confirm that the equivalence between the hydrostatic and non-hydrostatic options observed during CARRA1 still holds in the CARRA2 system, without repeating long validation runs over the entire CARRA2 domain. For completeness, we quote here some selected figures from the CARRA1 report: verification scores from month-long parallel tests and example comparisons at the peak of extreme rainfall events, to give a flavor of these tests.
Figure 16 presents the averaged errors as a function of forecast lead time for both of the CARRA1 reanalysis domains (CARRA-East and CARRA-West), comparing the CARRA data using hydrostatic and non-hydrostatic dynamics, respectively, for some of the essential meteorological parameters at screen level. The results indicate that, at a grid size of 2.5 km, the differences between the two options, as implemented in the HARMONIE-AROME models, are virtually negligible.
Figure 16: Averaged errors in standard deviation and bias along forecast lead time with CARRA1 analysis-forecast using hydrostatic (green) and non-hydrostatic (red) dynamics for temperature at 2 meters (upper panels), wind at 10 meter height (middle panels) and mean sea level pressure (lower panels) for CARRA-East domain (left panels) and CARRA-West domain (right panels) for the parallel test period of January 2017. The dark dashed curves denote the number of observations used in the statistics. (This figure is Figure 3 from Yang et al 2021.)
As an example of a case study for extreme events, Figure 17 compares the 24h precipitation simulation for an extreme rainfall case over Svalbard in November 2016, using non-hydrostatic and hydrostatic dynamics in CARRA-West at 2.5 km. It is clear that for this extreme case the two runs are virtually identical. Furthermore, the difference fields between the two show no indication of systematic trends (not shown here).
Figure 17: Simulation of the extreme rainfall event over Svalbard on November 7, 2016, by CARRA reanalysis with hydrostatic dynamics (left panel) and non-hydrostatic (right panel). The plotted fields are the 24-hour accumulated precipitation between 20161107 06 and 20161108 06. (This is Figure 4 from Yang et al 2021.)
Figure 18: Simulation of two extreme rainfall events over the southwest Greenland coast using CARRA1 with hydrostatic dynamics (left panels) and the non-hydrostatic (standard) CARRA1 (right panels). Upper panels show the 24 h precipitation accumulation between 20170913 18 UTC and 20170914 18 UTC, lower panels between 20180926 18 UTC and 20180927 18 UTC.
As shown in Figure 18, qualitatively similar results are obtained for two extreme rainfall cases over the southwest Greenland coast, one on 13 September 2017 and another on 27 September 2018, for which the non-hydrostatic effects are hard to identify and negligible.
From these experiments, it is concluded that, both in the verification score intercomparisons and in the case studies of extreme weather events, no significant differences were identified between the runs made with hydrostatic and non-hydrostatic dynamics. It may be argued that these results do not exclude exceptional cases in which such effects may be significant, but it is considered that, in terms of climate trends, a precise identification of such cases would in any case be of lesser importance.
It is also worth adding that Polichtchouk et al. (2023) likewise reported a lack of sensitivity to the hydrostatic simplification when testing the ECMWF km-scale IFS model down to a grid scale of around 2.8 km.
Based on the above reasoning, the hydrostatic option has been selected as a main cost-cutting measure in the configuration of the CARRA2 system.
The soil analysis in CARRA2 uses two steps to propagate screen-level observations into the soil variables. In the first step, in-situ observations of 2-metre temperature and humidity are interpolated horizontally onto the model grid using the optimal interpolation (OI) method. The second step updates the soil variables (moisture and temperature) using the Simplified Extended Kalman Filter (SEKF) (Albergel et al., 2017). The SEKF uses perturbed "offline" surface model simulations to approximate the linearization of the observation operator by finite differences. Since the surface model is highly non-linear, the perturbations are performed in positive and negative pairs. This approach requires two offline simulations per control variable, which for a full-state control vector is a considerable computational cost. However, since the deep soil layers are decreasingly sensitive to the screen-level parameters, it is reasonable to restrict the assimilation control variables to the uppermost active layer.
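A schematic sketch of the paired-perturbation approach is given below (illustrative only, not the SURFEX/SEKF source code; the function and variable names are assumptions):

```python
import numpy as np

def sekf_jacobian(run_offline, x_ctl, n_obs, delta):
    """Approximate the Jacobian d(obs-equivalent)/d(control) by centred finite
    differences, using one positive/negative perturbation pair per control
    variable. run_offline(x) is assumed to return the model-predicted
    screen-level equivalents (e.g. T2m, RH2m) for a control vector x of soil
    variables; delta holds the perturbation size per control variable."""
    n_ctl = x_ctl.size
    H = np.zeros((n_obs, n_ctl))
    for j in range(n_ctl):
        x_plus, x_minus = x_ctl.copy(), x_ctl.copy()
        x_plus[j] += delta[j]
        x_minus[j] -= delta[j]
        # two offline simulations per control variable (the cost noted above)
        y_plus = run_offline(x_plus)
        y_minus = run_offline(x_minus)
        H[:, j] = (y_plus - y_minus) / (2.0 * delta[j])
    return H
```

Restricting the control vector to the uppermost active layers directly reduces the number of offline simulation pairs required per assimilation cycle.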
The system was tested and tuned both over a smaller domain and for the full CARRA2 domain. Tests showed a vanishingly small impact from including layers below 4 cm for soil temperature and below 60 cm for soil moisture. It was therefore decided to remove the deeper layers from the assimilation. To evaluate the impact of the scheme, experiments using this soil data assimilation were compared to a free-running (no data assimilation) experiment. Overall, the soil analysis produced small increments relative to the variability of the variables. Compared to the free run, consistent improvements in 2-metre temperature (significant in the first 3 hours) and humidity (significant for more than 18 hours lead time) were seen in the subsequent forecasts, as shown in Figure 19.
Figure 19: Normalized mean root mean square error (RMSE) differences between the SEKF run and the open-loop (free) run for 2-metre specific humidity (left) and 2-metre temperature (right). Negative values indicate improvements by the SEKF run. Significance bars are shown at each forecast range.
We tested and evaluated the snow data assimilation using offline (not coupled to the atmosphere) and inline runs, mainly over smaller domains because of the large computational cost of running the full CARRA2 system.
In the surface scheme of CARRA1, snow was represented in terms of snow water equivalent (SWE), and snow depth observations therefore needed to be converted to SWE before assimilation. We tested how using snow depth (SD) instead of SWE as the assimilation control variable affected the snow analysis. This was done offline from 1 October 2022 until 1 March 2023, assimilating SYNOP snow depth observations while blacklisting (i.e. not assimilating) some of the observing stations, see Figure 20. The blacklisted stations serve as independent validation observations, as they are not included in the data assimilation. This was done to evaluate the effect of snow depth assimilation in unobserved regions.
Figure 20: (Left) assimilated SYNOP snow depth observations, (right) blacklisted SYNOP snow depth observations over southern Norway. Color coding is based on station latitude.
We computed summary verification scores (Table 1) using the stations that were not assimilated, and which were therefore only indirectly influenced by the snow data assimilation, through the horizontal OI. We compare OL (open loop, i.e. no snow DA), updating of snow water equivalent (SWE) and updating of snow depth (SD). In summary, working in snow-depth space improved the summary verification scores.
Table 1: Summary verification scores for SYNOP snow depth stations not assimilated in the experiments. OL is the open loop (no snow data assimilation), SWE the experiment updating snow water equivalent and SD the experiment updating snow depth. #59 is the number of observations (independent stations) used in the statistics. Summary statistics computed are: bias, mse (mean squared error), rmse (root mean squared error) and nse (Nash-Sutcliffe efficiency; see Nash and Sutcliffe, 1970).
Exp name | Bias (#59) [m] | mse (#59) [m²] | rmse (#59) [m] | nse (#59) [-] |
OL | 0.053 | 0.0245 | 0.108 | 0.32 |
SWE | 0.024 | 0.0147 | 0.088 | 0.45 |
SD | 0.019 | 0.0129 | 0.082 | 0.49 |
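For reference, the summary scores in Table 1 can be computed from paired simulated and observed snow depths as follows (a minimal sketch; the actual station matching and screening are not shown):

```python
import numpy as np

def snow_depth_scores(sim, obs):
    """Summary scores as in Table 1 from paired simulated and observed snow
    depths (metres): bias, mean squared error, root mean squared error and
    the Nash-Sutcliffe efficiency (Nash and Sutcliffe, 1970)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    err = sim - obs
    bias = err.mean()
    mse = (err ** 2).mean()
    rmse = np.sqrt(mse)
    # NSE: 1 minus the error variance relative to the observed variance
    nse = 1.0 - mse / ((obs - obs.mean()) ** 2).mean()
    return bias, mse, rmse, nse
```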
The downstream effects of snow data assimilation on river discharge were also evaluated. We compared modelled river discharge with observed discharge data from the Norwegian Water Resources and Energy Directorate (NVE) during the spring of 2023. Modelled river discharge is computed by summing the surface runoff and sub-surface drainage within a given catchment area. The catchments are illustrated in Figure 21 (bottom panels) and outlined in red. An example time series is also shown in Figure 21 (top panel), where the observed river discharge is black, the open loop (i.e. no snow data assimilation) is blue and the run with SYNOP snow depth assimilation is orange. The discharge with snow data assimilation follows the observations rather well, though the small peaks are sometimes not represented and the large magnitudes are mostly underestimated. We see that the Kling-Gupta efficiency (KGE) score increases, as does the correlation coefficient (called "cc" in Figure 21; see the legend of the top panel). The KGE score is often used in hydrology to evaluate model performance. A KGE equal to 1 indicates perfect agreement between the model and observations; in general, -0.41 < KGE <= 1 indicates reasonable model performance.
Figure 21: (Top) Example discharge time series for Storeskar (Hemsedal); (bottom left) the Storeskar catchment; (bottom right) all catchments included in the summary verification.
In Table 2 we show summary scores for all the catchments shown in Figure 21 (bottom right panel). We see an increase in the Kling-Gupta efficiency (KGE) score and an increase in the correlation coefficient. We also note that this is without filtering out the larger catchments (for which our way of calculating the river discharge has some limitations).
Table 2: Summary verification for river discharge computed against observed discharge from 1 January until 1 June 2023: open loop (no snow data assimilation) and snow DA (assimilation of SYNOP snow depth). N=83 is the number of catchments. KGE is the Kling-Gupta efficiency score, R is the correlation coefficient.
N=83 | KGE | R | Bias (mm/day) |
open-loop | -0.05 | 0.6 | 6.7 |
snow DA | 0.22 | 0.68 | 4.8 |
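For reference, the KGE score reported in Table 2 follows the commonly used formulation combining correlation, variability ratio and bias ratio; a minimal sketch (not the actual evaluation code) is:

```python
import numpy as np

def kling_gupta_efficiency(sim, obs):
    """Kling-Gupta efficiency: KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2),
    where r is the correlation, alpha the ratio of standard deviations and
    beta the ratio of means between simulated and observed discharge."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)
```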
To evaluate the effect of assimilating CryoClim satellite snow cover observations in the coupled system, we performed two experiments, named "noCryo" and "Cryo". These experiments were run from 15 April 2023 until 21 May 2023 on a domain covering the Nordic countries. In the noCryo experiment we assimilated only SYNOP snow depth observations at 06 UTC, while in the Cryo experiment we assimilated both SYNOP snow depth and CryoClim snow cover observations at 06 UTC. The effect on mean sea level pressure (MSLP) is shown in Figure 22. Although the differences are small, the assimilation of CryoClim has a slight positive impact on the MSLP scores. This slight positive impact is seen in the lower panel of Figure 22, where we have computed the normalized mean root-mean-square-error difference between the experiment assimilating CryoClim (Cryo) and the one not assimilating CryoClim (noCryo); a negative value indicates a positive impact. Figure 23 shows the normalized RMSE difference for 2m temperature and humidity. It also indicates a slight positive impact of assimilating CryoClim observations. The impact is largest for 2m temperature; the significance is illustrated by the vertical error bars.
Figure 22: (Upper panel) Mean sea level pressure time-series for Cryo (purple), noCryo (green) and observations (orange). The line overlap indicates that Cryo vs noCryo does not influence the MSLP to a large degree. (Lower panel) normalized rmse differences between Cryo and noCryo. Negative values indicate improved score. The vertical purple lines indicate the significance of the scores for each forecast range. The number of cases is represented by the dashed black line in both figures.
Figure 23: (Left) 2m temperature normalized RMSE difference between Cryo and noCryo, (right) same but for specific humidity. The vertical purple lines indicate the significance of the score for each forecast range.
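The normalized RMSE difference used in Figures 19, 22 and 23 can be sketched as below (assuming normalization by the RMSE of the reference experiment; the exact convention of the verification tool may differ):

```python
import numpy as np

def normalized_rmse_difference(err_exp, err_ref):
    """Normalized RMSE difference between an experiment (e.g. Cryo) and a
    reference (e.g. noCryo), computed per forecast lead time from the
    model-minus-observation errors. Negative values indicate that the
    experiment improves on the reference."""
    rmse_exp = np.sqrt(np.mean(np.square(err_exp)))
    rmse_ref = np.sqrt(np.mean(np.square(err_ref)))
    return (rmse_exp - rmse_ref) / rmse_ref
```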
For Figure 24 we compute the absolute value of the bias between observations and model (observation minus forecast, OmF) and then plot the difference between the noCryo and Cryo experiments (|OmF|noCryo - |OmF|Cryo). Positive (red) means a larger 2m temperature bias in noCryo than in Cryo. Overall, we found that around 325 stations showed neutral differences, 124 improved and 57 were degraded.
Figure 24: Differences between noCryo and Cryo experiments for the absolute values of the 2m temperature bias between observations and model. Positive (red) means larger 2m temperature bias in noCryo than in the Cryo experiment.
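The per-station comparison behind Figure 24 can be sketched as follows (the threshold separating "neutral" from improved or degraded stations is our illustrative assumption, not the value used for the figure):

```python
import numpy as np

def classify_stations(omf_nocryo, omf_cryo, threshold=0.1):
    """For each station, compare the absolute 2m temperature bias of the two
    experiments: diff = |OmF|_noCryo - |OmF|_Cryo. Positive diff means the
    Cryo experiment has the smaller bias (improvement). The threshold (K)
    separating 'neutral' from improved/degraded is an illustrative choice."""
    diff = np.abs(np.asarray(omf_nocryo)) - np.abs(np.asarray(omf_cryo))
    improved = int(np.sum(diff > threshold))
    degraded = int(np.sum(diff < -threshold))
    neutral = diff.size - improved - degraded
    return improved, neutral, degraded
```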
To confirm the impact of data assimilation in the final configured CARRA2 system, we conducted two additional experiments to examine the sanity of the reanalysis system: one in which only surface observations are assimilated by the surface assimilation scheme, with no upper-air 3D-Var assimilation, and one with neither upper-air nor surface data assimilation, i.e. a pure dynamical downscaling of ERA5. Note that in the latter case, the sub-surface states are only updated by forecasts from previous cycles, hence without constraints from surface observations. The experiments are conducted over two different two-month periods, from September to October in 1990 and in 2020, respectively, with results compared to the standard CARRA2 reanalysis (reference).
Figure 25 shows the time series of observation fit for the 1990 experiment for MSLP (upper left) and T2m (upper right), as well as the verification against radiosonde measurements of temperature (lower left) and wind speed (lower right). Figure 26 shows the same verification scores for the 2020 experiment. These results illustrate the benefit of upper-air assimilation for large-scale analysis parameters such as MSLP. (Note that in the CARRA2 system, as in CARRA1, upper-air data assimilation is supplemented with an adjustment of the large scales (so-called "Large Scale Mixing") towards the analysis of the host model, ERA5, which contributes to the skill of the CARRA2 data sets for large-scale parameters such as MSLP.) On the other hand, the benefit of upper-air data assimilation tends to be less clear for early years such as 1990, when observation data were generally more limited, with fewer aircraft and satellite data. Surface data assimilation, in turn, appears to be particularly beneficial in preventing key parameters from drifting, as shown in the error time series.
Figure 25: Selected verification scores for the 1990 experiments. Time series of observation fit in terms of RMSE and bias for MSLP (upper left) and T2m (upper right). The lower figures show RMSE and bias for temperature (lower left) and wind speed (lower right) against radiosonde measurements averaged over selected pressure levels. CARRA2 data sets are in magenta color, CARRA2 with no upper air assimilation in cyan, pure downscaling runs (with neither upper air nor surface assimilation) in green.
Figure 26: Same as in Figure 25, but for the 2020 experiments.
Observation quality control is essential in (variational) data assimilation, ensuring data reliability by identifying and managing errors. Traditional methods modeled gross errors with a flat probability distribution and random errors with a Gaussian distribution, though this approach often failed to capture outliers effectively. The Huber norm distribution, which combines a Gaussian core with exponential tails, has proven robust in handling outliers, improving data assimilation accuracy and allowing a relaxation of the prior background quality control (Tavolato and Isaksen, 2015). Studies show that the Huber Norm Quality Control (HN-QC) method enhances forecasts of extreme events, such as extratropical storms, by effectively integrating significant deviations from the model background.
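Conceptually, a Huber-norm observation cost implies that observations are down-weighted, rather than rejected, once their normalized departure exceeds a transition point; a minimal sketch of the implied weight (illustrative, not the code in the CARRA2 3D-Var) is:

```python
import numpy as np

def huber_weight(departure, sigma_o, c=2.5):
    """Weight implied by a Huber-norm observation cost: quadratic (Gaussian)
    behaviour for normalized departures within +/- c, linear (exponential-tail)
    behaviour outside, so large departures are down-weighted rather than
    rejected outright. The transition point c is an illustrative value."""
    z = np.abs(departure) / sigma_o          # normalized departure
    return np.where(z <= c, 1.0, c / z)      # effective relative weight
```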
To ensure good use of high-quality observations, HN-QC has been implemented in the CARRA2 3D-Var assimilation system for in-situ observations, including measurements of wind, temperature, geopotential and surface pressure. The effect of VarQC in the 3D-Var data assimilation over the CARRA2 domain is demonstrated below in terms of surface temperature differences (in Kelvin) and temperature differences at model level 50, comparing runs with VarQC and without VarQC (CNTL), as illustrated in Figure 27. The analysis is valid on June 1, 2023, at 18:00 UTC. A vertical cross-section illustrates the simulated temperature differences in Kelvin between the control run without VarQC (CNTL) and the run incorporating VarQC for the same date (Figure 28). The cross-section extends from (0.0° E, 64.5° N) to (21.5° E, 67.0° N), with the x-axis representing distance in metres and the y-axis depicting model levels from the surface (level 65) to the top of the atmosphere (level 1).
Figure 27: Impact of VarQC in 3D-Var data assimilation on the CARRA2 pan-Arctic domain. (a) Surface temperature differences (in Kelvin) and (b) temperature differences at model level 50, comparing results without VarQC (CNTL) and with VarQC. The analysis is valid on June 1, 2023, at 18:00 UTC.
Figure 28: Vertical cross-section depicting simulated temperature differences (in Kelvin). From the point (0.0° E, 64.5° N) to the point (21.5° E, 67.0° N), the temperature difference is calculated between the control run, which does not apply VarQC (CNTL), and the run that incorporates VarQC. The y-axis represents the model levels, starting from level 65 (near the surface) up to the top of the atmosphere (level 1). The x-axis indicates the distance in meters from the left endpoint (0.0° E, 64.5° N). The 3D-Var assimilation analyzed temperature field is valid on June 1, 2023, at 18:00 UTC.
The cross-section was chosen to analyze the impact of VarQC over both sea and land. The line begins at (0.0° E, 64.5° N), over the Norwegian Sea, and extends northeast toward northern Sweden, ending at (21.5° E, 67.0° N). Approximately 630 km along this line, the path transitions from sea to land. Figure 28 clearly illustrates the temperature differences with and without VarQC in both regions, especially at levels above 50 in this instance. Over land (630 km to 1000 km distance), the effect is more pronounced, highlighting the role of VarQC in the 3D-Var assimilation.

The geopotential time series for June 2023 illustrates the 3-hourly assimilation cycling for both the control run (CNTL) and the VarQC experiment (Figure 29). The top panel shows the number of assimilated observations, while the middle and bottom panels show the mean analysis departure (O-A) and mean background departure (O-B), respectively. The figure highlights the impact of VarQC in the 3D-Var assimilation cycle. The mean absolute error for the background departure (O-B) in the VarQC experiment is -1.14 m²/s², lower in magnitude than that of the CNTL experiment (-2.0 m²/s²). Additionally, the mean analysis departures (O-A) are also smaller in VarQC (0.02 m²/s²) than in CNTL (0.50 m²/s²), averaged over the entire month of June 2023. In summary, although the total number of assimilated observations is similar in the CNTL and VarQC experiments, the mean departures (O-A and O-B) are consistently lower in the VarQC experiment. This clearly demonstrates the importance of VarQC in improving the variational assimilation system. A summary of the impact of VarQC on the standard deviation and bias for mean sea level pressure (MSLP) across 1236 stations within the CARRA2 domain indicates a generally similar but slightly positive effect compared to the control run (Figure 30).
Figure 29: Time series of geopotential (Z, m²/s²) for the month of June 2023, illustrating the 3-hourly assimilation cycling for both the control run without VarQC (CNTL, blue) and the run with the VarQC method (red). The top panel displays the number of assimilated observations, the middle panel the mean analysis error (O-A), and the bottom panel the mean background error (O-B). The number in the legend of the top panel indicates the total number of assimilated observations, while the values in the middle and bottom panels represent the mean absolute errors for (O-A) and (O-B), averaged over the entire month.
Figure 30: Summary of the impact of VarQC on the standard deviation and bias for mean sea level pressure (MSLP) across 1236 stations within the CARRA2 domain, indicating a similar but slightly positive effect compared to the control run. The time series is valid from July 2, 2023, to July 11, 2023, with averaging over a 6-hour window. The bottom two curves represent the mean bias with VarQC (magenta) and without VarQC (green), while the top two curves display the standard deviation with VarQC (magenta) and without VarQC (green). The gray dotted line indicates the number of cases used to calculate the biases and standard deviations for MSLP.
The CARRA2 system uses the BRAND ensemble-based method to generate background error statistics, effectively addressing the seasonal variability of forecast error structures. The full CARRA2 system documentation (section 3.2.2 of Yang et al., 2025) details the process of generating background error statistics with the BRAND method and emphasizes the importance of seasonally varying structure functions. In summary, to compute the B-matrix, 10-member ensemble forecasts were generated using the standard CARRA2 setup, with a resolution of 2.5 km and 65 vertical levels. Using provisional structure functions from these runs, month-long forecasts were conducted with 3D-Var cycling and B-random perturbations to derive the final structure functions. The B-matrices were computed to capture variations in weather regimes, interpolating monthly between the winter and summer structure functions. This approach enhances the accuracy of the reanalysis by incorporating gradual seasonal adjustments through a weighted average of the winter and summer B-matrices.
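The monthly blending of the two sets of structure functions can be sketched as below (the cosine weighting and function names are our illustrative assumptions; the actual CARRA2 weights follow the procedure documented in Yang et al., 2025):

```python
import numpy as np

def seasonal_b_matrix(b_winter, b_summer, month):
    """Blend winter and summer background-error statistics with a weight that
    varies smoothly with calendar month (1-12). The cosine weighting here is
    only an illustrative choice: w = 1 in mid-winter (January), w = 0 in
    mid-summer (July)."""
    w = 0.5 * (1.0 + np.cos(2.0 * np.pi * (month - 1) / 12.0))
    return w * b_winter + (1.0 - w) * b_summer
```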
Observation-space statistics were calculated for all 10 ensemble members. The mean absolute error of the analysis departures (O-A) is significantly lower than that of the background departures (O-B) for both the control and the ensemble members. The error distribution, initially ranging from -10 to 10 m/s, is reduced to -5 to 5 m/s after assimilation (Figure 31), and the (O-A) error is normally distributed within this reduced range. This clearly demonstrates the benefit of high-resolution 3D-Var data assimilation over the CARRA2 domain (Figure 31), with observations assimilated for all ensemble members as well as the control. The primary goal of data assimilation is to enhance the analysis fields. To evaluate this, we compare the statistics of the background departures (O-B), calculated prior to assimilation, with the analysis departures (O-A) for all assimilated observations, calculated after assimilation. In principle, the (O-A) departures should exhibit lower errors than the (O-B) departures; this indicates that the data assimilation is functioning effectively for all ensemble members and the control, thereby improving the analysis fields. 3D-Var data assimilation plays a crucial role in enhancing forecast accuracy across all model parameters and levels, as well as in generating dynamically consistent initial conditions.
Figure 31: Histogram of background error (O-B, dark yellow) and analysis error (O-A, blue) in observation space for zonal wind (U in m/s) across 10 ensemble members for 20220111 at 00 UTC. The left panel displays error statistics for the control member, while the adjacent nine panels illustrate error statistics for each of the remaining ensemble members
The cost function (or objective function, J) is essential for understanding the performance of the minimization algorithm and its convergence behavior. If the cost function diverges, the assimilation process is not functioning correctly, either for the system as a whole or for individual ensemble members. Figure 32 presents the statistics of the minimization of the cost function during a single assimilation cycle, valid on January 11, 2022, at 00 UTC, for all 10 ensemble members. The figure clearly demonstrates that all ensemble members converge appropriately during the minimization iterations, with all members exhibiting similar convergence patterns after 50 iterations. Notably, the total number of iterations is not uniform across the members; this variation distinctly highlights the differences in spread among the ensemble members. A significant challenge in ensemble analysis is maintaining adequate spread among the members: since the same set of observations is assimilated for all ensemble members, preserving this spread during the assimilation process can be difficult. Interestingly, the results indicate that each ensemble member retains a reasonable degree of spread even during the minimization, demonstrating the robustness of the approach.
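For reference, the quantity being minimized is the standard 3D-Var cost function, whose two terms correspond to the J_Bg and J_Obs panels of Figure 32:

$$J(\mathbf{x}) \;=\; \underbrace{\tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}_b)^{\mathrm{T}}\,\mathbf{B}^{-1}\,(\mathbf{x}-\mathbf{x}_b)}_{J_{\mathrm{Bg}}} \;+\; \underbrace{\tfrac{1}{2}\,\bigl(\mathbf{y}-H(\mathbf{x})\bigr)^{\mathrm{T}}\,\mathbf{R}^{-1}\,\bigl(\mathbf{y}-H(\mathbf{x})\bigr)}_{J_{\mathrm{Obs}}},$$

where $\mathbf{x}$ is the model state, $\mathbf{x}_b$ the background, $\mathbf{B}$ the background-error covariance matrix discussed in this section, $\mathbf{y}$ the vector of assimilated observations, $H$ the observation operator and $\mathbf{R}$ the observation-error covariance matrix.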
Figure 32: Statistics of the minimization of the cost function (J) during the 3D-Var data assimilation cycle, for all 10 ensemble members. The figure illustrates how the cost function behaves during a single assimilation cycle, valid on January 11, 2022, at 00 UTC. The top panel displays the observation part of the cost function (J_Obs), while the bottom panel shows the background part (J_Bg). The x-axis represents the number of iterations.
The importance of the seasonal variation of the B-matrix is illustrated in Figure 33. The percentage of specific-humidity variance explained by unbalanced temperature in the mid-troposphere, for wave numbers from 25 to 60, is considerably higher for the summer B-matrix. In contrast, this ratio for the winter B-matrix is significantly lower, approaching zero at the mid-tropospheric levels, where cloud information is more abundant. A summary of the impact of the seasonal variation of the B-matrix, in terms of the standard deviation and bias of the vertical-profile (pressure-level) averaged relative humidity verified against observations, is presented in Figure 34, calculated for the period from December 25 to December 28, 1999. Notably, the standard deviation (STDV) and bias (BIAS) for the winter B-matrix (B1) are lower than those for the summer B-matrix (B2). This difference arises because the assimilation and forecast were conducted with the two distinct B-matrices (winter and summer) for the winter month of December 1999, and it underscores the significance of the seasonal dependence of the B-matrix: the data assimilation system performs better when the seasonally correct B-matrix is used. This is further supported by the time series of mean sea level pressure (MSLP) verification across a selection of 18 stations in the Denmark area, shown in Figure 35.
Figure 33: Percentage of the variance in specific humidity (q) explained by unbalanced temperature (Tu) as a function of model level (height) and wave number (n). The B-matrix is computed for a winter month (a.) and a summer month (b.). The winter B-matrix is calculated using all available 6-hour forecast files valid for the month of January 2022. Similarly, the summer B-matrix is calculated using the same number of files, but for the month of June 2022.
Figure 34: A summary of the impact of seasonal variation in the B-matrix, specifically regarding the standard deviation (STDV) and biases (BIAS) of the vertical profile (pressure level) averaged relative humidity for 88 stations. The average relative humidity is calculated for the period from December 25th to December 28th, 1999. B1 (represented by the magenta line) corresponds to the winter B-matrix, while B2 (represented by the green line) corresponds to the summer B-matrix.
Figure 35: A summary of the impact of seasonal variation in the B-matrix, specifically regarding the standard deviation (STDV) and biases (BIAS) of the time series-averaged mean sea level pressure (MSLP) across a selection of 18 stations in the Denmark area. B1 (represented by the magenta line) corresponds to the winter B-matrix, while B2 (represented by the green line) corresponds to the summer B-matrix. The time series data is valid from December 25th to December 29th, 1999.
The CARRA2 reanalysis covers a geographically extensive pan-Arctic region. In comparison to other land areas in the Northern Hemisphere, the region has extremely poor coverage by the in-situ observation network. Because of this lack of good observation data, the reanalysis system for this area has to rely more heavily on the land-surface scheme for a realistic description of the near-surface atmospheric state. With this in mind, efforts have been made in CARRA2 to activate the multi-layer SURFEX surface scheme, using two patches over the land tile: forest and open land. The option includes the 14-layer soil diffusion scheme (Decharme et al., 2011), the 12-layer explicit snow scheme (Decharme et al., 2016) and the multi-energy balance (MEB) for the forest canopy (Boone et al., 2017). Accordingly, the surface assimilation scheme has been adapted to use the Simplified Extended Kalman Filter (SEKF). Additional efforts have been devoted to the correction and enhancement of the ECOCLIMAP Second Generation physiographic database, to further improve the surface description in the CARRA2 system. All these updates are detailed in the CARRA2 full system documentation (Yang et al., 2025). These steps, especially the activation of the multi-layer SURFEX scheme and the change of surface assimilation method, are significant departures from previous versions of the HARMONIE-AROME system (including those used in CARRA1), which were all based on a much simpler surface parameterization using the force-restore approach. Although the multi-layer surface scheme is not yet used operationally by any meteorological service, it has long been under evaluation within the ACCORD research community; it has, for example, been applied by the HARMONIE-Climate community for several years and has also been used by MET Norway in its pre-operational AROME-Arctic setup, with positive experiences.
Quality assurance has been carried out throughout the preparation phase of the CARRA2 reanalysis system, followed by regular monitoring during the reanalysis production, including the one-year spin-up runs. These measures include sanity checks with verification statistics and case studies for individual components as well as for the full system, following the evolution of the development versions of the CARRA2 reanalysis system. In these investigations, datasets from the first batches of CARRA2 runs using both the provisional and the tagged reanalysis system have been examined, including the dataset from the one-year spin-up runs and the first year of CARRA2 reanalysis data. As reference, relevant datasets from observations, ERA5 and CARRA1 have been used. In this report we have presented some of these results, mostly from a statistical verification point of view. A separate report (Køltzow et al., 2025) focuses more on the intercomparison with CARRA1 and on the performance for individual extreme events.
Through these examinations, it is confirmed that the 2.5 km CARRA2 reanalysis dataset, as also shown for the smaller-domain CARRA1 dataset, has a clear added value over coarser-resolution reanalysis products in the description of essential Arctic climate variables such as the screen-level weather parameters. The advantages shown by the CARRA2 reanalysis in detailed, high-fidelity simulation of past weather events make it a valuable asset for users in different contexts, including climate research and analysis, especially for Arctic climatology. A reanalysis time series with this degree of detail and accuracy is also highly relevant for other applications, including the training of machine learning systems.
Reanalysis systems are complex in the sense that the ultimate quality of the resulting datasets reflects both the strengths and weaknesses of a large number of contributing system components, such as the core model dynamics and physics, input observations and lateral boundary data, the physiographic database, the data assimilation algorithm and computational aspects. It is believed, however, that in comparison to CARRA1, the update of the surface parameterization and surface assimilation scheme in the CARRA2 system is likely the most important contributor defining the characteristics of the CARRA2 dataset. The verification material collected in this evaluation report, together with the accompanying validation report (Køltzow et al., 2025), provides evidence that CARRA2 delivers a high-resolution regional reanalysis dataset that fulfils the basic quality target: it demonstrates an overall added value over global, coarser-resolution reanalysis products such as ERA5. In the verification intercomparisons over the overlapping CARRA1 regions, the score differences between CARRA1 and CARRA2 tend to be mixed but, in most cases, less significant than their differences relative to ERA5, suggesting a comparable quality of the two same-resolution datasets. For most applications, this means that the CARRA2 data set would be preferred, owing to its longer time series and wider domain coverage, and hence better consistency in both the temporal and spatial dimensions.
Albergel, C., Munier, S., Leroux, D. J., Dewaele, H., Fairbairn, D., Barbu, A. L., Gelati, E., Dorigo, W., Faroux, S., Meurey, C., Le Moigne, P., Decharme, B., Mahfouf, J.-F., & Calvet, J.-C. (2017). Sequential assimilation of satellite-derived vegetation and soil moisture products using SURFEX_v8.0: LDAS-Monde assessment over the Euro-Mediterranean area. Geosci. Model Dev. Discuss., 2017, 1–53. https://doi.org/10.5194/gmd-2017-121
Boone, A., Samuelsson, P., Gollvik, S., Napoly, A., Jarlan, L., Brun, E. and Decharme, B. (2017). The interactions between soil–biosphere–atmosphere land surface model with a multi-energy balance (ISBA-MEB) option in SURFEXv8 – Part 1: Model description. Geoscientific Model Development, 10(2), pp. 843–872. https://doi.org/10.5194/gmd-10-843-2017
Decharme, B., Boone, A., Delire, C. and Noilhan, J. (2011). Local evaluation of the Interaction between Soil Biosphere Atmosphere soil multilayer diffusion scheme using four pedotransfer functions. Journal of Geophysical Research: Atmospheres, 116(D20). https://doi.org/10.1029/2011JD016002
Decharme, B., Brun, E., Boone, A., Delire, C., Le Moigne, P. and Morin, S. (2016). Impacts of snow and organic soils parameterization on northern Eurasian soil temperature profiles simulated by the ISBA land surface model. The Cryosphere, 10(2), pp. 853–877. https://doi.org/10.5194/tc-10-853-2016
Køltzow, M. et al, (2025). Evaluation report on CARRA1 and CARRA2 data sets on CARRA1 domains. Copernicus C3S2 deliverable report C3S2_D361a.4.1.1.
Nash, J. E. and Sutcliffe, J. V. (1970). River flow forecasting through conceptual models: Part 1. A discussion of principles. J. Hydrol., 10, 282–290. https://doi.org/10.1016/0022-1694(70)90255-6
Polichtchouk, I., et al. (2023). The current forecasting equations are still valid near km-scale resolution. ECMWF blog post, 22 November 2023. https://www.ecmwf.int/en/about/media-centre/news/2023/current-forecasting-equations-are-still-valid-near-km-scale
Tavolato, C. and Isaksen, L. (2015). On the use of a Huber norm for observation quality control in the ECMWF 4D-Var. Q.J.R. Meteorol. Soc., 141: 1514-1527. https://doi.org/10.1002/qj.2440
Yang X. et al (2021). Alternative design options and recommendation for a future pan-Arctic reanalysis system. Copernicus C3S deliverable report C3S_D322_Lot2.1.5.1 (C3S_D322_Lot2.1.5.1-202103_Pan-Arctic_v3.docx).
Yang X., et al (2025). Full CARRA2 reanalysis system documentation. Copernicus C3S2 deliverable C3S2_D361a.1.1.3.
This document has been produced in the context of the Copernicus Climate Change Service (C3S). The activities leading to these results have been contracted by the European Centre for Medium-Range Weather Forecasts, operator of C3S on behalf of the European Union (Delegation agreement signed on 11/11/2014). All information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose. The users thereof use the information at their sole risk and liability. For the avoidance of all doubt, the European Commission and the European Centre for Medium-Range Weather Forecasts have no liability in respect of this document, which is merely representing the author's view.