Verification methodology
Any forecast is more usable when accompanied by an assessment of its expected quality. This page presents the verification procedures we have employed to assess the forecast quality of the different prediction systems. We highly recommend reading and understanding these technical details before interpreting the verification charts and using the products.
The verification was performed using the ExtraVERT software, following the WMO guidelines on verifying Sub-seasonal Predictions.
The forecast performance is assessed through the hindcasts, i.e. retrospective forecasts made with the same model version employed for the real-time forecasts. The contributing centres employ different hindcast configurations (reforecast period, ensemble size, start dates, etc.; see Models for details), which affects the forecast quality estimates. The verification employs the longest reforecast period available for each system, so the scores include all the available information and can be considered the best estimate of each system's performance. However, since the years and start dates analyzed differ between centres, the results are not directly comparable across centres, and any attempt to rank the quality of the prediction systems must take this uncertainty into account.
For some prediction systems, the reforecast ensemble size is smaller than the real-time forecast ensemble size. This leads to a general underestimation of skill and, in particular, of reliability. For some performance metrics, we employed analytical adjustments that correct for the limited number of ensemble members (e.g. for the Continuous Ranked Probability Skill Score).
For some systems with small ensemble sizes, the real-time forecasts are produced by lagging a few recent forecasts initialized on previous days. However, the verification does not include lagged reforecasts, which also results in an underestimation of skill for those systems.
The observational reference for all analyses is the ERA5 reanalysis.
Grids and construction of forecast anomalies and tercile probabilities
According to WMO regulations, all forecast and observation data are interpolated to a regular 1.5° × 1.5° latitude-longitude grid, regardless of the native model grid and resolution. In the regionally aggregated metrics, the unequal area of the grid boxes is taken into account (each grid box is weighted by its area).
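As an illustration of the area weighting, here is a minimal sketch of cosine-latitude weighted regional means on a regular 1.5° grid. It is not the ExtraVERT implementation; the function and variable names are ours.

```python
import numpy as np

def area_weighted_mean(field, lats, lat_min=-90.0, lat_max=90.0):
    """Area-weighted mean of a (lat, lon) field over a latitude band.

    On a regular latitude-longitude grid the area of a grid box is
    proportional to the cosine of its central latitude, so cos(lat)
    is used as the weight."""
    lat2d = np.broadcast_to(lats[:, None], field.shape)
    weights = np.cos(np.deg2rad(lat2d))
    keep = (lat2d >= lat_min) & (lat2d <= lat_max) & np.isfinite(field)
    return np.sum(np.where(keep, field * weights, 0.0)) / np.sum(weights * keep)

# Example on the regular 1.5-degree grid with a synthetic skill map
lats = np.arange(-90.0, 90.0 + 1.5, 1.5)
lons = np.arange(0.0, 360.0, 1.5)
rng = np.random.default_rng(0)
skill_map = rng.normal(0.2, 0.1, size=(lats.size, lons.size))

print(area_weighted_mean(skill_map, lats, -20.0, 20.0))   # tropics
print(area_weighted_mean(skill_map, lats, 20.0, 90.0))    # northern hemisphere
print(area_weighted_mean(skill_map, lats, -90.0, -20.0))  # southern hemisphere
```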
Weekly mean anomalies have been constructed with respect to an out-of-sample climatology, both for the observations and the forecasts. A 9-day window approach ensures that the reference climatology does not change abruptly between consecutive start dates.
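As a rough illustration, the sketch below builds such an out-of-sample climatology by pooling all hindcast samples whose start dates fall within a 9-day window and excluding the year being verified (our reading of "out-of-sample"). The array layout and names are assumptions made for the example only.

```python
import numpy as np

def out_of_sample_climatology(hindcast, dates, target_doy, target_year, window_days=9):
    """Leave-one-year-out climatology pooling all samples whose start date
    lies within +/- (window_days // 2) days of the target day of year.

    hindcast : weekly means, shape (n_samples,)
    dates    : array of (year, day_of_year) pairs, one row per sample
    """
    half = window_days // 2
    years, doys = dates[:, 0], dates[:, 1]
    # circular day-of-year distance to cope with the year boundary
    delta = np.abs(doys - target_doy)
    delta = np.minimum(delta, 365 - delta)
    keep = (delta <= half) & (years != target_year)
    return hindcast[keep].mean()

# Example with synthetic weekly means for 20 hindcast years
rng = np.random.default_rng(1)
years = np.repeat(np.arange(2000, 2020), 3)      # 3 start dates per year
doys = np.tile(np.array([120, 124, 128]), 20)    # start dates 4 days apart
samples = rng.normal(15.0, 2.0, size=years.size)
dates = np.column_stack([years, doys])

clim = out_of_sample_climatology(samples, dates, target_doy=124, target_year=2010)
anomaly = samples[(years == 2010) & (doys == 124)][0] - clim
print(clim, anomaly)
```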
The probabilities for tercile categories also employ out-of-sample climatologies with the 9-day window approach. They are computed by counting the number of ensemble members falling within each tercile category, without any kind of ensemble dressing.
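The member-counting step can be illustrated with the following minimal sketch (not the operational code); the tercile thresholds are assumed to come from an out-of-sample climatological sample such as the one described above.

```python
import numpy as np

def tercile_probabilities(ensemble, lower_tercile, upper_tercile):
    """Forecast probabilities for the three tercile categories, obtained by
    counting the ensemble members below / between / above the climatological
    tercile thresholds (no ensemble dressing)."""
    n = ensemble.size
    p_below = np.sum(ensemble < lower_tercile) / n
    p_above = np.sum(ensemble > upper_tercile) / n
    p_normal = 1.0 - p_below - p_above
    return p_below, p_normal, p_above

# Example: thresholds from a (here synthetic) climatological sample
rng = np.random.default_rng(2)
climatological_sample = rng.normal(0.0, 1.0, size=180)
lower, upper = np.percentile(climatological_sample, [100 / 3, 200 / 3])

ensemble = rng.normal(0.5, 1.0, size=11)   # 11-member forecast ensemble
print(tercile_probabilities(ensemble, lower, upper))
```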
Performance metrics
The verification charts show five distinct performance metrics that assess different aspects of forecast quality, such as accuracy, discrimination or reliability, and that target different forecast formats, such as tercile probabilities or the ensemble mean. Some metrics are presented in the form of maps, i.e. for each grid box. Other metrics show regionally aggregated results for the tropics (between 20°N and 20°S), the northern (≥20°N) and the southern (≤20°S) hemispheres.
Maps
- Discrimination of tercile category forecasts (ROC AUC)
The ability of the probabilistic forecasts for the tercile categories to discriminate between occurrence and non-occurrence has been assessed with the Area under the ROC curve (ROC AUC). This measure of discrimination gives the probability that any two distinct observed outcomes (one occurrence and one non-occurrence) can be correctly distinguished by the corresponding forecast probabilities. In practice, this means counting, for each pair of occurrence and non-occurrence cases, how many times the larger of the two forecast probabilities corresponds to the observed event. A value of 1 indicates perfect discrimination, while random forecasts have an AUC of 0.5. The plots show an AUC value for each tercile category. See additional information on the ROC curve below.
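The pairwise counting described above can be sketched as follows. Ties between probabilities are counted as one half here, which is a common convention but an assumption on our part, not necessarily the treatment used in the charts.

```python
import numpy as np

def roc_auc(probabilities, occurred):
    """ROC AUC as the fraction of (event, non-event) pairs in which the
    event received the larger forecast probability (ties count 1/2)."""
    p_event = probabilities[occurred]
    p_nonevent = probabilities[~occurred]
    # compare every event probability with every non-event probability
    greater = (p_event[:, None] > p_nonevent[None, :]).sum()
    ties = (p_event[:, None] == p_nonevent[None, :]).sum()
    return (greater + 0.5 * ties) / (p_event.size * p_nonevent.size)

# Example: probabilities for one tercile category and its observed outcomes
rng = np.random.default_rng(3)
occurred = rng.random(200) < 1 / 3                              # event observed or not
probs = np.clip(occurred * 0.2 + rng.random(200) * 0.8, 0, 1)   # moderately skilful probabilities
print(roc_auc(probs, occurred))
```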
- Accuracy of the ensemble mean (MSSS)
The ensemble mean accuracy has been analyzed with the Mean Squared Skill Score (MSSS). The MSSS is a normalization of the Mean Squared Error (MSE) by the variance of the observations, which is the MSE of a climatological forecast. A value of 1 indicates a perfect forecast, while a value of 0 indicates an MSE equal to that of a climatological forecast. The figures show the MSSS of the ensemble mean. The MSSS can be decomposed into three non-negative terms that account for lack of correlation, miscalibration of the forecast amplitude, and bias. The Unexplained Variance term penalizes any lack of correlation; one minus this term is the optimal MSSS that a linearly post-processed forecast could achieve by removing the bias and inflating or deflating the forecasts. The Miscalibration term penalizes a forecast amplitude that is too large or too small for the correlation level, and can be reduced to zero by inflating or deflating the forecasts. The Squared Bias term, similarly, can be reduced to zero by subtracting the bias from the forecasts; since anomalies are verified here, it is zero by definition and is not shown in the plots.
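For reference, here is a minimal sketch of the MSSS and one common form of this three-term decomposition, taking the climatological reference forecast as the observed mean; the exact formulation used in ExtraVERT may differ.

```python
import numpy as np

def msss_decomposition(ens_mean, obs):
    """MSSS of the ensemble mean and one common three-term decomposition:

        1 - MSSS = unexplained variance + miscalibration + squared bias
                 = (1 - r**2) + (r - s_f/s_o)**2 + (bias/s_o)**2

    where r is the correlation, s_f and s_o the forecast and observed
    standard deviations, and the climatological reference forecast is
    the observed mean (zero for anomalies)."""
    bias = ens_mean.mean() - obs.mean()
    s_f, s_o = ens_mean.std(), obs.std()
    r = np.corrcoef(ens_mean, obs)[0, 1]

    unexplained_variance = 1.0 - r**2
    miscalibration = (r - s_f / s_o) ** 2
    squared_bias = (bias / s_o) ** 2
    msss = 1.0 - (unexplained_variance + miscalibration + squared_bias)
    return msss, unexplained_variance, miscalibration, squared_bias

# Example with synthetic correlated anomalies
rng = np.random.default_rng(4)
obs = rng.normal(0.0, 1.0, size=400)
ens_mean = 0.5 * obs + rng.normal(0.0, 0.6, size=400)   # damped, noisy forecast
print(msss_decomposition(ens_mean, obs))
```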
- Accuracy of the ensemble distribution (FCRPSS)
The Fair Continuous Ranked Probability Skill Score (FCRPSS) is a skill score that indicates the performance of the full forecast ensemble compared to a climatological forecast. The underlying score, the CRPS, measures the integrated squared distance between the cumulative distribution functions of the ensemble and of the observation. The fair version includes a correction term for the limited number of ensemble members, making comparisons across ensembles of different sizes fair. An FCRPSS of 1 indicates a perfect forecast, and 0 indicates the same error level as a climatological forecast.
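A minimal sketch of the fair ensemble CRPS estimator (Ferro, 2014) and of the resulting skill score against a climatological ensemble is given below; the choice of climatological reference here is illustrative, not necessarily the one used operationally.

```python
import numpy as np

def fair_crps(ensemble, obs):
    """Fair CRPS estimator for a finite m-member ensemble (Ferro, 2014):
    the usual ensemble CRPS with the member-member term scaled by
    1/(2*m*(m-1)) instead of 1/(2*m**2), which removes the penalty
    for having a small ensemble."""
    m = ensemble.size
    term_obs = np.mean(np.abs(ensemble - obs))
    term_spread = np.sum(np.abs(ensemble[:, None] - ensemble[None, :])) / (2 * m * (m - 1))
    return term_obs - term_spread

# Skill score against a climatological "ensemble" built from past observations
rng = np.random.default_rng(5)
clim_sample = rng.normal(0.0, 1.0, size=60)                      # stand-in climatology
obs = rng.normal(0.0, 1.0, size=100)                             # verifying observations
fc = obs[:, None] * 0.6 + rng.normal(0.0, 0.8, size=(100, 11))   # 11-member forecasts

crps_fc = np.mean([fair_crps(fc[t], obs[t]) for t in range(obs.size)])
crps_clim = np.mean([fair_crps(clim_sample, obs[t]) for t in range(obs.size)])
print(1.0 - crps_fc / crps_clim)   # FCRPSS: 1 is perfect, 0 matches climatology
```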
Regional aggregates
- Reliability of tercile category forecasts (Reliability Diagram)
The Reliability Diagram shows whether the forecast probabilities are calibrated, i.e. whether, for a given probability level, the events occur with the frequency indicated by that probability. The plots show reliability diagrams for each tercile category, and the side panels show the histograms of the forecast probabilities for each category. Reliable (or calibrated) predictions lie close to the diagonal line. Please note that even perfectly reliable prediction systems produce unreliable ensembles when only a small number of ensemble members is available. The number of bins used to plot the reliability diagrams has been selected automatically with a non-parametric pool-adjacent-violators technique that prevents the appearance of non-isotonic curves and makes the results reproducible and robust.
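The pool-adjacent-violators idea can be sketched as follows. This is one possible way to derive isotonic reliability points from the hindcast probabilities and outcomes, not necessarily the exact procedure used to produce the charts.

```python
import numpy as np

def pav_reliability_points(probabilities, occurred):
    """Isotonic reliability-diagram points via the pool-adjacent-violators
    algorithm applied to the binary outcomes sorted by forecast probability.
    Each resulting block defines a bin, so the observed frequencies are
    non-decreasing by construction."""
    order = np.argsort(probabilities)
    p, y = probabilities[order], occurred[order].astype(float)

    # each block holds [sum of outcomes, sum of probabilities, count]
    merged = []
    for i in range(y.size):
        merged.append([y[i], p[i], 1])
        # pool adjacent blocks while the observed frequency decreases
        while len(merged) > 1 and merged[-2][0] / merged[-2][2] > merged[-1][0] / merged[-1][2]:
            s2, q2, n2 = merged.pop()
            merged[-1][0] += s2
            merged[-1][1] += q2
            merged[-1][2] += n2

    mean_prob = np.array([q / n for _, q, n in merged])
    obs_freq = np.array([s / n for s, _, n in merged])
    return mean_prob, obs_freq

# Example with a slightly overconfident probability forecast
rng = np.random.default_rng(6)
probs = rng.random(500)
occurred = rng.random(500) < (0.2 + 0.6 * probs)   # true frequency flatter than the probabilities
x, y = pav_reliability_points(probs, occurred)
print(np.column_stack([x, y]))
```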
- Discrimination of tercile category forecasts (ROC curves)
ROC curves show the trade-off between the hit rate and the false alarm rate for different probability thresholds. Decision-makers should weigh these two rates according to the (often uneven) consequences of misses and false alarms. Forecasts with some discrimination ability lie above the diagonal line, with perfect forecasts drawing a step function passing through the upper-left corner. ROC curves evaluate the ability to discriminate between events and non-events, but do not assess the calibration of the forecast probabilities. The plot shows ROC curves for each tercile category, and the Area under the ROC curve (AUC) is indicated in the legend.
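A minimal sketch of how such ROC points can be obtained by thresholding the forecast probabilities is shown below; the particular set of thresholds is an illustrative choice, not the one used in the charts.

```python
import numpy as np

def roc_curve(probabilities, occurred, thresholds=np.linspace(0.0, 1.0, 11)):
    """Hit rate and false-alarm rate obtained by issuing a 'yes' forecast
    whenever the forecast probability reaches each threshold."""
    hits, falses = [], []
    n_event = occurred.sum()
    n_nonevent = (~occurred).sum()
    for thr in thresholds:
        warn = probabilities >= thr
        hits.append(np.sum(warn & occurred) / n_event)         # hit rate
        falses.append(np.sum(warn & ~occurred) / n_nonevent)   # false-alarm rate
    return np.array(falses), np.array(hits)

# Example with synthetic tercile-category probabilities and outcomes
rng = np.random.default_rng(7)
occurred = rng.random(300) < 1 / 3
probs = np.clip(occurred * 0.25 + rng.random(300) * 0.75, 0, 1)
far, hr = roc_curve(probs, occurred)
print(np.column_stack([far, hr]))   # points tracing the ROC curve
```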