...

The distinction between validation (the characteristics of a forecast system) and verification (the predictive skill of a forecast system) is as relevant in probabilistic as in deterministic forecasting.  Some statistical concepts to facilitate the use and interpretation of probabilistic medium-range forecasts - ensemble forecasts - are given below.  As with a deterministic forecasting system, probability verification can address the accuracy (how close the forecast probabilities are to the observed frequencies), the skill (how the probability forecasts compare with some reference system) and the utility (the economic or other advantages of the probability forecasts).

Sceptics of probability forecasts argue that forecasters might exaggerate uncertainty “to cover their backs”.  However, as will be shown, the verification of probabilistic forecasts takes the “reliability” of the probabilities into account and will detect any such misbehaviour.  Indeed, one of the most used verification scores, the Brier score, is mathematically constructed in a way that encourages forecasters to state the probability they really believe in, rather than some misperceived “tactical” probability.

This chapter discusses only the most commonly used probabilistic validation and verification methods. For a full presentation the reader is referred to Nurmi (2003).

...

The probabilities from an over-confident forecast system could be calibrated to increase the “hidden skill” in probability forecasts that are biased in this way, but this is not really practicable.  For example, in this hypothetical case:

...

The probabilities from an under-confident forecast system may be calibrated to increase the “hidden skill” in probability forecasts that are biased in this way.  For example, in this hypothetical case:

...

Resolution is the degree to which the forecasts can discriminate between more or less probable events.  It should not be confused with sharpness, which is the tendency to issue forecast probabilities close to 0% and 100%.  Resolution and sharpness are independent of one another.

The same “uncertainty” can be understood from the familiar fact that it is easier to predict the outcome of tossing a coin if it is heavily biased.  In the same way, if it rains frequently in a region and rarely stays dry, forecasting rain can be said to be “easier” than if rain and dry events occur equally often.  The uncertainty depends purely on the observations, just like the Aa term in the RMSE decomposition.  It is also the Brier Score of the sample climatology forecast and plays the same role with the Brier Score as the Aa term does with the RMSE (see Forecast Error Baseline).  Comparisons of Brier Scores for different forecast samples can only be made if the uncertainty is the same.
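In symbols, with ō denoting the observed climatological frequency of the event, always forecasting p = ō contributes (1 - ō)² on the fraction ō of occasions when the event occurs and ō² on the remaining fraction (1 - ō), so that:

$$\mathrm{BS}_{clim} = \bar{o}\,(1-\bar{o})^2 + (1-\bar{o})\,\bar{o}^2 = \bar{o}\,(1-\bar{o})$$

which is the uncertainty term and takes its maximum value of 0.25 at ō = 0.5.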

...

A more detailed way of validating the spread is by a rank histogram (sometimes called a Talagrand diagram). It is constructed from the notion that in an ideal ensemble system the verifying analysis is equally likely to lie in any “bin” defined by two ordered adjacent members, including when the analysis lies outside the ensemble range on either side of the distribution. This can be understood by induction, if we consider an ideal ensemble with one, two or three members:

With one ensemble member ( I ), a verifying observation (red star) will always (100%) fall “outside”:  (red star) I (red star)

With two ensemble members (I I), a verifying observation will, for this ideal ensemble, fall outside in two cases out of three:  (red star) I (red star) I (red star)
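The same counting generalizes to any ensemble size.  As an illustration, a minimal Python sketch of how a rank histogram could be computed, assuming NumPy arrays ens (cases × members) and obs (one verifying value per case); the names are illustrative:

```python
import numpy as np

def rank_histogram(ens, obs):
    """Relative frequency with which the verifying value falls in each of the
    n_members + 1 bins defined by the ordered ensemble members."""
    n_cases, n_members = ens.shape
    counts = np.zeros(n_members + 1, dtype=int)
    for members, o in zip(ens, obs):
        # rank = number of members below the verifying value;
        # rank 0 and rank n_members are the two "outside" bins.
        rank = int(np.sum(members < o))
        counts[rank] += 1
    return counts / n_cases

# For an ideal ensemble the histogram is flat: every bin is hit with
# frequency 1 / (n_members + 1).
```

A flat histogram indicates well-calibrated spread, whereas a U-shaped histogram points to under-dispersion and a dome-shaped histogram to over-dispersion.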

...

In contrast to deterministic forecasts, an individual probabilistic forecast can never be “right” or “wrong”, except when 0% or 100% has been stated.  Probability forecasts can therefore only be verified from large samples of forecasts.

...

Similar to the MSE, the BS can be decomposed into three terms; the most often quoted decomposition was suggested by Allan Murphy (1973, 1986), who used “binned” probabilities:

$$\mathrm{BS} = \frac{1}{N}\sum_{k} n_k\,(p_k-\bar{o}_k)^2 \;-\; \frac{1}{N}\sum_{k} n_k\,(\bar{o}_k-\bar{o})^2 \;+\; \bar{o}\,(1-\bar{o})$$

where nₖ is the number of forecasts in probability category k, pₖ the forecast probability of that category, ōₖ the observed frequency of the event on those occasions, ō the overall observed (climatological) frequency of the event and N the total number of forecasts.

  • The first term measures the reliability (i.e. how much the forecast probabilities can be taken at face value).  On the reliability diagram this is the nₖ-weighted sum of the distances (vertical or horizontal) between each point and the 45° diagonal (see Fig12.B.10).
  • The second term measures the resolution (i.e. how much the predicted probabilities differ from a climatological average and therefore contribute information).  On the reliability diagram this is the weighted sum of the distances to a horizontal line defined by the climatological probability reference (see Fig12.B.10).
  • The third term measures the uncertainty (i.e. the variance of the observations).  It takes its highest, most “uncertain”, value when ō = 0.5 (see Fig12.B.11).

Fig12.B.10: Summary of Allan Murphy’s reliability and resolution terms.


Fig12.B.11: Uncertainty is at its maximum for a climatological observed probability average of 50%.
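The decomposition can be computed directly from a sample of forecasts and outcomes.  A minimal Python sketch, assuming the forecast probabilities have already been rounded to a discrete set of category values (function and array names are illustrative):

```python
import numpy as np

def brier_decomposition(p, o):
    """Brier score and its Murphy decomposition.
    p : forecast probabilities, already binned to discrete category values
    o : observed outcomes (1 = event occurred, 0 = event did not occur)"""
    p, o = np.asarray(p, float), np.asarray(o, float)
    n = len(p)
    o_bar = o.mean()                          # sample climatology
    reliability = resolution = 0.0
    for pk in np.unique(p):                   # loop over probability categories
        idx = (p == pk)
        nk, ok_bar = idx.sum(), o[idx].mean()
        reliability += nk * (pk - ok_bar) ** 2
        resolution  += nk * (ok_bar - o_bar) ** 2
    uncertainty = o_bar * (1.0 - o_bar)
    bs = np.mean((p - o) ** 2)
    return bs, reliability / n, resolution / n, uncertainty

# BS = reliability - resolution + uncertainty
# (exact when the categories coincide with the stated forecast values)
```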


The Brier score is a “proper” score

The Brier score is strictly “proper” (i.e. it encourages forecasters to really try to find out the probability, without thinking about whether the forecast value is “tactical” or not).  Indeed, if forecasters deviate from their true beliefs, the BS will “punish” them!  This sounds strange.  How can an abstract mathematical equation know the inner beliefs of someone?

Assume forecasters honestly think the probability of an event is p but have, for misguided “tactical” reasons, instead stated r.  If the event occurs, the contribution to the BS (first term) is (1 - r)², weighted by the probability of the outcome occurring.  If the event does not occur, the contribution to the BS (first term) is (r - 0)², weighted by the probability of the outcome not occurring.  For these weightings the “honest” probability must be used: p when the event occurs, (1 - p) when the event does not occur.  This is where the forecaster’s true beliefs are revealed!

The expected contribution to the BS is therefore:

$$E(r) = p\,(1-r)^2 + (1-p)\,r^2$$

Differentiating with respect to r yields

$$\frac{dE}{dr} = -2\,p\,(1-r) + 2\,(1-p)\,r = 2\,(r-p)$$

with a minimum for r = p.  Therefore, to minimize the expected contribution to the Brier Score, the honestly believed probability value should be used.

The Brier Skill Score (BSS)

A Brier Skill Score (BSS) is conventionally defined as the relative probability score compared with the probability score of a reference forecast.
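With BS_ref denoting the Brier score of the reference forecast (usually the sample climatology), this takes the familiar form:

$$\mathrm{BSS} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{ref}}$$

so that BSS = 1 for a perfect system and BSS = 0 for a system that does not improve on the reference.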

“Uncertainty” plays no role in the BSS.


Rank Probability Scores (RPS)

Probabilities often refer to the risk that some threshold might be exceeded, for example that the precipitation >1 mm/12hr or that the wind >15 m/s.  However, when evaluating a probabilistic system, there is no reason why these particular thresholds should be significant.  For the Rank Probability Score (RPS) the BS is calculated for different (one-sided) discrete thresholds and then averaged over all thresholds.
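A minimal sketch of this threshold averaging, assuming an ensemble array ens (cases × members), observations obs and probabilities estimated by simple member counting; the names and thresholds are illustrative:

```python
import numpy as np

def rps(ens, obs, thresholds):
    """Brier score averaged over a set of one-sided thresholds."""
    bs_values = []
    for t in thresholds:
        p = np.mean(ens > t, axis=1)             # forecast probability of exceeding t
        o = (np.asarray(obs) > t).astype(float)  # observed exceedance (1 or 0)
        bs_values.append(np.mean((p - o) ** 2))
    return np.mean(bs_values)

# Example for 12-hour precipitation thresholds in mm:
# rps(ens, obs, thresholds=[1, 5, 10, 20])
```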

...

A powerful way to verify probability forecasts, and in particular to compare their performance with deterministic forecast systems, is the two-dimensional “Relative Operating Characteristics” or “ROC” diagram.  These categorical forecasts will produce a set of pairs of “Hit Rate” and “False Alarm Rate” values to be entered into the ROC diagram: False Alarm Rate (FR) on the x-axis and Hit Rate (HR) on the y-axis (derived from the Contingency Table).  The upper left corner of the ROC diagram represents a perfect forecast system (no false alarms, only hits).  The closer any verification is to this upper left corner, the higher the skill.  The lower left corner (no false alarms, no hits) represents a system which never warns of an event.  The upper right corner represents a system where the event is always warned for (see Fig12.B.12).
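As an illustration of how such pairs can be obtained, a sketch that turns the probability forecast into categorical warnings at a series of probability thresholds and evaluates each against the contingency table (names are illustrative; the sample is assumed to contain both events and non-events):

```python
import numpy as np

def roc_points(prob, obs, prob_thresholds=np.arange(0.05, 1.0, 0.05)):
    """(False Alarm Rate, Hit Rate) pairs for a set of probability thresholds.
    prob : forecast probabilities of the event, obs : observed outcomes (1/0)."""
    prob, obs = np.asarray(prob), np.asarray(obs, bool)
    points = []
    for pt in prob_thresholds:
        warn = prob >= pt                       # issue a categorical warning?
        hits         = np.sum(warn & obs)
        misses       = np.sum(~warn & obs)
        false_alarms = np.sum(warn & ~obs)
        corr_rejects = np.sum(~warn & ~obs)
        hr = hits / (hits + misses)                        # Hit Rate
        fr = false_alarms / (false_alarms + corr_rejects)  # False Alarm Rate
        points.append((fr, hr))
    return points   # plot with False Alarm Rate on x, Hit Rate on y
```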

...

An efficient way to improve the ensemble forecast, both the EM and the probabilities, is by Statistical Post-Processing (SPP), which is an advanced form of calibration of the output from the deterministic ensemble members.  The most commonly used SPP method is “model output statistics” (MOS).


The MOS equation

Deterministic NWP forecasts are statistically matched against a long record of verifying observations through a linear regression scheme.  The predictand (Y) is normally a scalar (for example 2m temperature) and the predictors (X) one or several forecast parameters, selected by a linear regression system using the parameters which provide the most information (e.g. forecasts of 2m temperature use 850hPa temperature, 500hPa geopotential etc.).
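Schematically, with F1, F2, … denoting the selected forecast predictors (F1 typically the direct model forecast of the predictand itself) and X1, X2, … the regression coefficients referred to below, such an equation takes the form (the exact number of predictors shown here is illustrative):

$$Y = X_1 + X_2\,F_1 + X_3\,F_2 + \dots$$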

...

The MOS equation not only minimizes the RMSE, it also simultaneously corrects both systematic mean errors and the variability.  In the above equation X1 represents the mean error correction and X2 the variability correction.  There is therefore no need to apply two different schemes, one for reducing the systematic error (“bias”) and one for correcting the spread.

...

MOS also improves forecasts in the medium range but, with increasing forecast range, less and less of this improvement is due to the MOS equation’s ability to remove systematic errors.  In the medium range, the dominant errors are non-systematic.  These non-systematic errors (e.g. false model climate drift) can appear as false systematic errors (e.g. see Fig12.A.4).  They will thus be “corrected” by the MOS in the same way as true systematic errors.  By this means MOS essentially damps the forecast anomalies and thereby minimizes the RMSE.  This might be justified in a purely deterministic context but not in an ensemble context, where the most skilful damping of less predictable anomalies is achieved by ensemble averaging through the EM.  It is therefore recommended that MOS equations are calculated in the short range, typically at D+1, based on forecasts from the CTRL, and then applied to all the members of the ensemble throughout the whole forecast range, as long as any genuine model drift can be ruled out.

...

Every few years, NWP models undergo significant changes that make the MOS regression analysis obsolete.  There are, however, techniques whereby the MOS can be updated on a regular (monthly or quarterly) basis, although this does not completely eliminate the drawback of historic inertia.  Alternatively, adaptive methods have increasingly come into use.  Here the coefficients X1 and X2 in the error equation are constantly updated in the light of daily verification (Persson, 1991).
Fig12.B.25 shows forecasts and observations for the location with severe systematic 2m temperature errors depicted in Fig12.B.24.  It is not a case of “plain bias” but of “conditional bias”, since mild forecasts are less in error than cold forecasts.  A simple mean-error correction would therefore not be optimal.

...

where Tfc is the verified forecast.  The coefficients are updated from a variational principle of “least effort”, whereby the equation line is translated (by modifying X1) and rotated (by modifying X2), so that it takes the verification into account, considering the uncertainties in the verification and the coefficients (see Fig12.B.26).
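A heavily simplified sketch of one such adaptive step, using a plain least-mean-squares update rather than the variational “least effort” scheme of Persson (1991); the names, the learning rate and the use of a forecast anomaly as predictor are illustrative assumptions:

```python
def adaptive_update(x1, x2, t_fc, t_obs, alpha=0.05):
    """One adaptive step for the correction  T_corrected = x1 + x2 * t_fc.
    t_fc is taken as a forecast anomaly so that the two updates stay
    well scaled; alpha controls how quickly old cases are forgotten."""
    error = (x1 + x2 * t_fc) - t_obs   # error of the corrected forecast
    x1 -= alpha * error                # translate the line (bias correction)
    x2 -= alpha * error * t_fc         # rotate the line (variability correction)
    return x1, x2
```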

...