# Page History

...

The distinction between validation (forecast system’s characteristics) and verification (forecast system’s predictive skill) is as relevant in probabilistic as in deterministic forecasting. Some statistical concepts to facilitate the use and interpretation of probabilistic medium-range forecasts - ensemble forecasts - are given below. As with the deterministic forecasting system, probability verification can address the accuracy (how close the forecast probabilities are to the observed frequencies), the skill (how the probability forecasts compare with some reference system) and utility (the economic or other advantages of the probability forecasts).

Sceptics of probability forecasts argue that forecasters might exaggerate uncertainty “to cover their backs”. However, as will be shown, the verification of probabilistic forecasts takes the “reliability” of the probabilities into account and will detect any such misbehaviour. Indeed, one of the most used verification scores, the Brier score, is mathematically constructed in a way that encourages forecasters to state the probability they really believe in, rather than some misperceived “tactical” probability.

This chapter discusses only the most commonly used probabilistic validation and verification methods. For a full presentation the reader is referred to Nurmi (2003).

...

*The probabilities from an over-confident forecast system could be calibrated to increase “hidden skill” in probability forecasts that are biased in this way, but this is not really practicable. For example, in this hypothetical case:*

...

*The probabilities from an under-confident forecast system may be calibrated to increase “hidden skill” in probability forecasts that are biased in this way. For example, in this hypothetical case:*

...

Resolution is the degree to which the forecasts can discriminate between more or less probable events. It should not be confused with sharpness, which is the tendency to have predictions close to 0% and 100%. Resolution and sharpness are independent of one another.

The same “uncertainty” can be understood from the familiar fact that it is easier to predict the outcome of tossing a coin if it is heavily biased. In the same way, if it rains frequently in a region and rarely stays dry, forecasting rain can be said to be “easier” than if rain and dry events occur equally often. The uncertainty is purely dependent on the observations, just as the A_{a} term in the RMSE decomposition. It is also the Brier Score of the sample climatology forecast and plays the same role with the Brier Score as the A_{a} term with the RMSE (see Forecast Error Baseline). Comparisons of Brier Scores for different forecast samples can only be made if the uncertainty is the same.

...

A more detailed way of validating the spread is by a rank histogram (sometimes called a Talagrand diagram). It is constructed from the notion that in an ideal ensemble system the verifying analysis is equally likely to lie in any “bin” defined by any two ordered adjacent members, including when the analysis is outside the ensemble range on either side of the distribution. This can be understood from induction, if we consider an ideal ensemble with one, two or three members:

With one ensemble member **( I )**, verifying observations will always (100%) fall “outside”: **I**

With two ensemble members **( I I )**, verifying observations will for this ideal ensemble fall outside in two cases out of three: **I** **I**

...

An ideal ensemble system might, however, display a U-shaped distribution due to observation uncertainties. For example, with 50 ensemble members an ensemble spread of 20°C yields an average bin width of 0.4°C, whereas an ensemble spread of 5°C yields an average bin width of only 0.1°C, smaller than the observation uncertainty (see Fig12.B.9). The small bin size introduces an element of chance with respect to which bin the observation will fall into. The bin sizes, due to the normal distribution, will increase with increasing distance from the centre, and an observation is more likely to end up in a bin further away from the centre than closer to the centre. This will result in a misleading U-shaped distribution.

**Fig12.B.9:** *An observation (a filled circle) and its uncertainty assumed symmetric (the arrows). The forecast bins widen their intervals away from the centre (the mean of the distribution, here off the diagram to the right), so an observation is more likely, for random reasons, to fall into an outer and wider bin than an inner and narrower one.*
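The effect can be illustrated with a small simulation. The sketch below is not part of the operational verification; it assumes a synthetic, statistically ideal ensemble (members and truth drawn from the same normal distribution) and a hypothetical observation error with a standard deviation of 0.5. The function name `rank_histogram` is chosen here for illustration only.

```python
import numpy as np

def rank_histogram(ensemble, verification):
    """Count how often the verification falls into each of the M+1 bins
    defined by the M ordered members (including the two open-ended bins)."""
    ensemble = np.asarray(ensemble, dtype=float)          # shape (cases, members)
    verification = np.asarray(verification, dtype=float)  # shape (cases,)
    n_members = ensemble.shape[1]
    # Rank = number of members below the verifying value (0 .. M).
    ranks = (ensemble < verification[:, None]).sum(axis=1)
    return np.bincount(ranks, minlength=n_members + 1)

rng = np.random.default_rng(0)
n_cases, n_members = 20000, 10

# Ideal ensemble: truth and members come from the same distribution.
members = rng.normal(size=(n_cases, n_members))
truth = rng.normal(size=n_cases)
flat = rank_histogram(members, truth)

# Same ensemble, verified against observations with a random error (assumed).
noisy_obs = truth + rng.normal(scale=0.5, size=n_cases)
u_shaped = rank_histogram(members, noisy_obs)

print(flat / n_cases)      # roughly equal frequencies in all bins
print(u_shaped / n_cases)  # outer bins over-populated
```

With a perfect verification the bins are populated roughly equally, whereas the observation error alone is enough to over-populate the outer bins.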

...

In contrast to deterministic forecasts, an individual probabilistic forecast can never be “right” or “wrong” except when 0% or 100% has been stated. Probability forecasts can therefore only be verified from large samples of forecasts.

...

The most common verification method for probabilistic forecasts is the Brier score (BS) or the Mean Square Error of Probability Forecasts. It has a mathematical structure similar to the Mean Square Error (MSE).

BS measures the difference between the forecast probability of an event (p), and its occurrence (o) expressed as 0 (event did not occur) or 1 (event did occur). As with RMSE, the BS is negatively orientated (i.e. the lower, the better).

#### Brier Score (BS)

The Brier Score (BS) measures the mean squared probability error of the forecast against a reference dataset (observations, or analyses, or climatology).
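As a minimal sketch, assuming the usual definition of the BS as the mean of (p - o)^{2} over all forecasts, the score can be computed as follows. The function name and the five sample forecasts are invented for illustration.

```python
import numpy as np

def brier_score(prob, occurred):
    """Mean squared difference between forecast probability (0..1)
    and occurrence (1 = event occurred, 0 = event did not occur)."""
    prob = np.asarray(prob, dtype=float)
    occurred = np.asarray(occurred, dtype=float)
    return float(np.mean((prob - occurred) ** 2))

# Five hypothetical probability forecasts and their outcomes.
p = [0.9, 0.7, 0.2, 0.0, 0.5]
o = [1, 1, 0, 0, 1]
print(brier_score(p, o))
```

A perfect set of forecasts (only 0% and 100%, always correct) scores 0, and the worst possible set scores 1.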

...

Similar to the MSE, the BS can be decomposed into three terms. The most often quoted decomposition was suggested by Allan Murphy (1973, 1986), who used “binned” probabilities:

BS = (1/N) Σ_{k} n_{k} (p_{k} - ō_{k})^{2} - (1/N) Σ_{k} n_{k} (ō_{k} - ō)^{2} + ō (1 - ō)

where n_{k} is the number of forecasts of the same probability category k, N is the total number of forecasts, p_{k} is the forecast probability of category k, ō_{k} is the observed frequency of the event in that category and ō is the observed frequency over the whole sample.

- The first term measures the reliability (i.e. how much the forecast probabilities can be taken at face value). On the reliability diagram this is the n_{k}-weighted sum of the squared distances (vertical or horizontal) between each point and the 45° diagonal (see Fig12.B.10).
- The second term measures the resolution (i.e. how much the predicted probabilities differ from a climatological average and therefore contribute information). On the reliability diagram this is the weighted sum of the squared distances to a horizontal line defined by the climatological probability reference (see Fig12.B.10).
- The third term measures the uncertainty (i.e. the variance of the observations). It takes its highest, most “uncertain”, value when ō = 0.5 (see Fig12.B.11).

**Fig12.B.10:** *Summary of Allan Murphy’s reliability and resolution terms.*

**Fig12.B.11:** *Uncertainty is at its maximum for a climatological observed probability average of 50%.*
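The three terms can be computed directly from a sample of binned forecasts. The sketch below assumes that each distinct forecast probability forms its own category k and uses synthetic, statistically reliable forecasts. The function name `murphy_decomposition` and the sample data are illustrative assumptions.

```python
import numpy as np

def murphy_decomposition(prob, occurred):
    """Return (reliability, resolution, uncertainty) for binned forecasts,
    where each distinct forecast probability is one category k."""
    prob = np.asarray(prob, dtype=float)
    occurred = np.asarray(occurred, dtype=float)
    n = prob.size
    o_bar = occurred.mean()                # sample climatology
    reliability = 0.0
    resolution = 0.0
    for p_k in np.unique(prob):
        in_bin = prob == p_k
        n_k = in_bin.sum()                 # number of forecasts in category k
        o_k = occurred[in_bin].mean()      # observed frequency in category k
        reliability += n_k * (p_k - o_k) ** 2
        resolution += n_k * (o_k - o_bar) ** 2
    uncertainty = o_bar * (1.0 - o_bar)
    return reliability / n, resolution / n, uncertainty

rng = np.random.default_rng(1)
prob = rng.integers(0, 11, size=5000) / 10.0         # 0%, 10%, ..., 100%
occurred = (rng.random(5000) < prob).astype(float)   # events follow the forecasts

rel, res, unc = murphy_decomposition(prob, occurred)
bs = float(np.mean((prob - occurred) ** 2))
print(rel, res, unc, bs)
```

For this synthetic sample the reliability term is close to zero, and the Brier Score equals reliability minus resolution plus uncertainty to within rounding error.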

#### The Brier score is a “proper” score

The Brier score is strictly “proper” (i.e. it encourages forecasters to really try to find out the probability, without thinking about whether the forecast value is “tactical” or not). Indeed, if forecasters deviate from their true beliefs, the BS will “punish” them! This sounds strange. How can an abstract mathematical equation know someone’s inner beliefs?

Assume forecasters honestly think the probability of an event is p but have, for misguided “tactical” reasons, instead stated r. If the event occurs, the contribution to the BS (first term) is (1 - r)^{2} weighted by the probability for the outcome to occur. If the event does not occur the contribution to the BS (first term) is (r - 0)^{2} weighted by the probability for the outcome not to occur. For these weightings, the “honest” probability must be used: p when the event occurs, (1 - p) when the event does not occur. This is where the forecaster’s true beliefs are revealed!

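The expected contribution to the BS is therefore:

E(BS) = p (1 - r)^{2} + (1 - p) r^{2}

Differentiating with respect to r yields

dE(BS)/dr = -2 p (1 - r) + 2 (1 - p) r = 2 (r - p)

with a minimum for r = p, where the derivative vanishes. Therefore, to minimize the expected contribution to the Brier Score, the honestly believed probability value should be used.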
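The argument can also be checked numerically. The sketch below evaluates the expected contribution p (1 - r)^{2} + (1 - p) r^{2} for a range of stated probabilities r and one hypothetical honest belief (p = 0.3, chosen only for illustration).

```python
import numpy as np

def expected_brier(r, p):
    """Expected BS contribution when r is stated but p is honestly believed."""
    return p * (1.0 - r) ** 2 + (1.0 - p) * r ** 2

p_true = 0.3                          # hypothetical honest belief
stated = np.linspace(0.0, 1.0, 101)   # candidate stated probabilities, 0% to 100%
expected = expected_brier(stated, p_true)
best = stated[np.argmin(expected)]
print(best)
```

The smallest expected score is obtained when the stated probability equals the honest one; any “tactical” shift in either direction increases it.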

#### The Brier Skill Score (BSS)

A Brier Skill Score (BSS) is conventionally defined as the relative probability score compared with the probability score of a reference forecast.

“Uncertainty” plays no role in the BSS.
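As an illustration only, the sketch below assumes the common convention BSS = 1 - BS / BS_ref and takes the sample climatology as the reference forecast. Both choices are assumptions made here; other reference forecasts are possible. The sample data are invented.

```python
import numpy as np

def brier_score(prob, occurred):
    prob = np.asarray(prob, dtype=float)
    occurred = np.asarray(occurred, dtype=float)
    return float(np.mean((prob - occurred) ** 2))

def brier_skill_score(prob, occurred, ref_prob):
    """1 = perfect, 0 = no better than the reference, < 0 = worse (assumed convention)."""
    return 1.0 - brier_score(prob, occurred) / brier_score(ref_prob, occurred)

p = [0.9, 0.7, 0.2, 0.0, 0.5]            # hypothetical forecasts
o = [1, 1, 0, 0, 1]                      # outcomes
climatology = np.full(len(o), np.mean(o))  # constant reference forecast (60%)
bss = brier_skill_score(p, o, climatology)
print(bss)
```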

### Rank Probability Scores (RPS)

Probabilities often refer to the risk that some threshold might be exceeded, for example that the precipitation >1 mm/12hr or that the wind >15 m/s. However, when evaluating a probabilistic system, there is no reason why these thresholds should be particularly significant. For the Rank Probability Score (RPS) the BS is calculated for different (one-sided) discrete thresholds and then averaged over all thresholds.
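A minimal sketch of this procedure is given below. It assumes that the forecast probability of exceeding a threshold is simply the fraction of ensemble members above it, and it averages the BS over the thresholds as described above (some formulations sum instead of averaging). The function name, the thresholds and the sample values are hypothetical.

```python
import numpy as np

def rank_probability_score(ensemble, observation, thresholds):
    """Average of the Brier Scores for exceeding each threshold."""
    ensemble = np.asarray(ensemble, dtype=float)        # shape (cases, members)
    observation = np.asarray(observation, dtype=float)  # shape (cases,)
    thresholds = np.asarray(thresholds, dtype=float)    # shape (K,)
    # Forecast probability = fraction of members above each threshold.
    prob = (ensemble[:, :, None] > thresholds).mean(axis=1)      # (cases, K)
    occurred = (observation[:, None] > thresholds).astype(float)  # (cases, K)
    brier_per_threshold = ((prob - occurred) ** 2).mean(axis=0)
    return float(brier_per_threshold.mean())

# One case: four members of 12-hour precipitation (mm) and the observed amount.
ensemble = [[0.0, 2.0, 6.0, 12.0]]
observation = [3.0]
thresholds = [1.0, 5.0, 10.0]
rps = rank_probability_score(ensemble, observation, thresholds)
print(rps)
```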

...

A powerful way to verify probability forecasts, and in particular to compare their performance with deterministic forecast systems, is the two-dimensional “Relative Operating Characteristics” or “ROC” diagram. These categorical forecasts will produce a set of pairs of “Hit Rate” and “False Alarm Rate” values to be entered into the ROC diagram: False Alarm Rate (FR) on the x-axis and Hit Rate (HR) on the y-axis (derived from the Contingency Table). The upper left corner of the ROC diagram represents a perfect forecast system (no false alarms, only hits). The closer any verification is to this upper left corner, the higher the skill. The lower left corner (no false alarms, no hits) represents a system which never warns of an event. The upper right corner represents a system where the event is always warned for (see Fig12.B.12).

**Fig12.B.12:** *The principle of the ROC diagram: a large number of probability forecasts are turned into categorical forecasts depending on whether the probability values of individual forecasts are above or below a certain threshold. The false alarm rate and the hit rate are calculated, thus determining the position in the diagram (red filled circle).*

...

Probabilistic forecasts are transformed into categorical yes/no forecasts defined by thresholds varying from 0% to 100% (see Fig12.B.13).

**Fig12.B.13:** *The same as Fig12.B.12, but repeated for several thresholds between 0% and 100%. The hit rates and false alarm rates of the deterministic model, although not providing probabilistic predictions, can be represented on the diagram by its typical hit rate and false alarm rate (green filled circle).*

The ROC score is the area underneath the forecast curve (see Fig12.B.14).

**Fig12.B.14:** *The area underneath the points, joined by straight lines, defines the ROC area, which is, ideally, 1.0 and at worst 0.0. Random forecasts yield 0.5, the triangular area underneath the 45° line. There are two schools of thought on how to calculate the area: either with a smooth spline or linearly, connecting the points.*
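The construction of the ROC points and of the area can be sketched as follows. The sketch uses the linear (straight-line) variant of the area calculation, treats a forecast as a warning when its probability is at or above the threshold, and uses synthetic forecasts. The function names, thresholds and data are assumptions for illustration.

```python
import numpy as np

def roc_points(prob, occurred, thresholds):
    """False alarm rate and hit rate for each probability threshold."""
    prob = np.asarray(prob, dtype=float)
    occurred = np.asarray(occurred, dtype=bool)
    false_alarm_rates, hit_rates = [], []
    for t in thresholds:
        warned = prob >= t                      # categorical yes/no forecast
        hits = np.sum(warned & occurred)
        misses = np.sum(~warned & occurred)
        false_alarms = np.sum(warned & ~occurred)
        correct_negatives = np.sum(~warned & ~occurred)
        hit_rates.append(hits / (hits + misses))
        false_alarm_rates.append(false_alarms / (false_alarms + correct_negatives))
    return np.array(false_alarm_rates), np.array(hit_rates)

def roc_area(false_alarm_rates, hit_rates):
    """Area under the points joined by straight lines, including both corners."""
    x = np.concatenate(([0.0], false_alarm_rates, [1.0]))
    y = np.concatenate(([0.0], hit_rates, [1.0]))
    order = np.lexsort((y, x))                  # sort by x, then by y
    x, y = x[order], y[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

rng = np.random.default_rng(3)
prob = rng.random(2000)                         # synthetic forecast probabilities
occurred = rng.random(2000) < prob              # events follow the forecasts
thresholds = [0.05, 0.25, 0.5, 0.75, 0.95]
fr, hr = roc_points(prob, occurred, thresholds)
area = roc_area(fr, hr)
print(area)
```

The sample must contain both events and non-events, otherwise one of the two rates is undefined.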

...

Further reading: ECMWF Newsletter Number 80 – Summer 1998. https://www.ecmwf.int/sites/default/files/elibrary/1998/14644-newsletter-no80-summer-1998.pdf

**Statistical Post-processing**

...

**– Model Output Statistics (MOS)**

An efficient way to improve the ensemble forecast, both the EM and the probabilities, is by Statistical Post-Processing (SPP), which is an advanced form of calibration of the output from the deterministic ensemble members. The most commonly used SPP method is “model output statistics” (MOS).

### The MOS equation

Deterministic NWP forecasts are statistically matched against a long record of verifying observations through a linear regression scheme. The predictand (Y) is normally scalar (for example 2m temperature) and the predictors (X) one or several forecast parameters, selected by a linear regression system using the parameters which provide the most information (e.g. forecasts of 2m temperature use 850hPa temperature, 500hPa geopotential etc.):

**Y = X_{1} + X_{2}·T2m + X_{3}·T850hPa + X_{4}·Z500 + … etc**

where the coefficients X_{i=1, 2, 3 ... n} are estimated by the regression scheme. For this discussion it is sufficient to consider the simple MOS equation:

...

The MOS equation not only minimizes the RMSE, it also corrects simultaneously for both systematic mean errors and for the variability. In the above equation X_{1} represents the mean error correction and X_{2} the variability correction. There is therefore no necessity to apply two different schemes, one for reducing the systematic error (“bias”) and one for correcting the spread.
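The combined effect of the two coefficients can be illustrated with a least-squares fit. The sketch below assumes a two-coefficient form, Y = X_{1} + X_{2}·T2m, and uses synthetic forecasts that are too cold and over-variable. The data, the variable names and the chosen error characteristics are invented for illustration and do not come from any real MOS system.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic training sample: observed 2m temperatures and forecasts that are
# about 3 degrees too cold with 1.4 times too much variability (assumed).
obs = rng.normal(-5.0, 4.0, size=500)
forecast = -8.0 + 1.4 * (obs + 5.0) + rng.normal(0.0, 1.0, size=500)

# Least-squares estimate of X1 (intercept) and X2 (slope) in Y = X1 + X2 * T2m.
design = np.column_stack([np.ones_like(forecast), forecast])
(x1, x2), *_ = np.linalg.lstsq(design, obs, rcond=None)
corrected = x1 + x2 * forecast

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

print(x1, x2)
print(rmse(forecast, obs), rmse(corrected, obs))
```

In this example a single regression removes the mean error and, with X_{2} below 1, also reduces the excessive variability of the forecasts.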

...

However, the corrections imposed by MOS have different emphases in the short and medium range. In the short range, where most synoptic features are forecast with realistic variability, the MOS equation mainly corrects true systematic errors and representativeness errors.

**Fig12.B.24:** *A scatter diagram of forecast errors versus forecast for Tromsö in northern Norway, November 2010 - February 2012. Cold temperatures are too cold and, as a whole, the forecasts overestimate the variability of the temperature.*

...

MOS also improves forecasts in the medium range but, with increasing forecast range, less and less of this improvement is due to the MOS equation’s ability to remove systematic errors. In the medium range, the dominant errors are non-systematic. These non-systematic errors (e.g. false model climate drift) can appear as false systematic errors (e.g. see Fig12.A.4). They will thus be “corrected” by the MOS in the same way as true systematic errors. By this means MOS is essentially dampening the forecast anomalies and thereby minimizing the RMSE. This might be justified in a purely deterministic context but not in an ensemble context, where the most skilful damping of less predictable anomalies is achieved by ensemble averaging through the EM. It is therefore recommended that MOS equations are calculated in the short range, typically at D+1, based on forecasts from the CTRL, and then applied to all the members in the ensemble throughout the whole forecast range, as long as any genuine model drift can be discarded.

...

Every few years, NWP models undergo significant changes that make the MOS regression analysis obsolete. There are, however, techniques whereby the MOS can be updated on a regular (monthly or quarterly) basis, although this does not completely eliminate the drawback of historic inertia. Alternatively, adaptive methods have increasingly come into use. Here the coefficients X_{1} and X_{2} in the error equation are constantly updated in the light of daily verification (Persson, 1991).

Fig12.B.25 shows forecasts and observations for the location with severe systematic 2m temperature errors depicted in Fig12.B.24. It is not a case of “plain bias” but of “conditional bias”, since mild forecasts are less in error than cold forecasts. A simple mean-error correction would therefore not be optimal.

**Fig12.B.25:** *Adaptive Kalman filtering of 2-metre temperature forecasts for Tromsö in northern Norway during winter 2012. The forecasts are too cold and over-variable, both of which are remedied by X_{1} and X_{2} in a 2-parameter error equation.*

By a daily verification, the Kalman filter estimates the coefficients X_{1} and X_{2} in the error equation:

...

where T_{fc} is the verified forecast. The coefficients are updated from a variational principle of “least effort”, whereby the equation line is translated (by modifying X_{1}) and rotated (by modifying X_{2}), so that it takes the verification into account, considering the uncertainties in the verification and the coefficients (see Fig12.B.26).

**Fig12.B.26:** *A schematic illustration of the workings of an adaptable MOS by Kalman filtering. At a given time the error equation has a certain orientation (full red line) with a certain estimated uncertainty (red dashed lines). A forecast is verified and yields an error (red filled circle) that does not normally fall on the error line. Depending on the interplay between the equation uncertainty and the verification uncertainty (dashed red circle), the error equation line is translated and rotated to take the new information into account, after which this information is discarded. Note that the system keeps information only about the error equation and its uncertainty and the last, not yet verified forecast. When the forecast is verified and the verification has affected the error equation, the verifying observation is discarded.*
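A daily update of this kind can be sketched with a standard two-coefficient Kalman filter. The sketch is not the operational scheme. It assumes an error equation of the form error = X_{1} + X_{2}·T_{fc}, with the error defined as forecast minus observation, and the values chosen for the verification variance, the slow coefficient drift and the synthetic "true" error line are all hypothetical.

```python
import numpy as np

def kalman_update(coeffs, cov, t_fc, error, obs_var=1.0, drift_var=1e-5):
    """One daily update of (X1, X2) from a verified forecast t_fc and its error."""
    h = np.array([1.0, t_fc])                  # error = X1 + X2 * t_fc
    cov = cov + drift_var * np.eye(2)          # coefficients may drift slowly
    innovation = error - h @ coeffs            # distance of new error from the line
    innovation_var = h @ cov @ h + obs_var
    gain = cov @ h / innovation_var            # weight given to the new verification
    coeffs = coeffs + gain * innovation        # translate and rotate the line
    cov = cov - np.outer(gain, h) @ cov
    return coeffs, cov

rng = np.random.default_rng(7)
true_x1, true_x2 = -1.0, 0.3                   # hypothetical "true" error line
coeffs, cov = np.zeros(2), 10.0 * np.eye(2)    # start with no knowledge
raw_sq, corrected_sq = [], []

for day in range(400):
    t_fc = rng.normal(-8.0, 6.0)               # today's forecast temperature
    predicted_error = coeffs[0] + coeffs[1] * t_fc
    # Verified error: cold forecasts are much too cold, mild ones only slightly.
    error = true_x1 + true_x2 * t_fc + rng.normal(0.0, 1.0)
    raw_sq.append(error ** 2)
    corrected_sq.append((error - predicted_error) ** 2)
    coeffs, cov = kalman_update(coeffs, cov, t_fc, error)

print(coeffs)
print(np.mean(raw_sq[-100:]), np.mean(corrected_sq[-100:]))
```

Only the coefficients and their covariance are carried from one day to the next; each verification is used once and then dropped, in line with the description in Fig12.B.26.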

Fig12.B.27 shows an ensemble forecast for the same location with severe systematic 2m temperature errors.

**Fig12.B.27:** *A plume diagram for Tromsö, 12 February 2012. The forecast is too cold with 50-100% probabilities of temperatures < -15°C.*

The two-dimensional error equation is able to apply corrections which are different for different forecast temperatures and thus take the flow dependence into account to some degree. The error equation is applied to all ensemble members at all ranges, assuming no significant model drift (see Fig12.B.28).

**Fig12.B.28:** *The same as Fig12.B.27 but after the Kalman-filtered error equation has been applied. Mild forecasts have hardly been modified, whereas cold ones have been substantially warmed, leading to less spread and more realistic probabilities (e.g. 0% probabilities for 2m temperature < -15°C).*
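The effect on the ensemble probabilities can be illustrated with a few invented numbers. The sketch assumes the same form of error equation, error = X_{1} + X_{2}·T_{fc} with the error defined as forecast minus observation. The coefficient values and the six member temperatures are hypothetical and are not taken from the Tromsö case.

```python
import numpy as np

def apply_error_equation(members, x1, x2):
    """Subtract the estimated error X1 + X2 * T_fc from every member."""
    members = np.asarray(members, dtype=float)
    return members - (x1 + x2 * members)

# Hypothetical 2m temperature forecasts (deg C) from six ensemble members.
members = np.array([-22.0, -18.0, -16.0, -12.0, -6.0, -2.0])
x1, x2 = -1.0, 0.3                      # hypothetical filtered coefficients
corrected = apply_error_equation(members, x1, x2)

prob_before = float(np.mean(members < -15.0))
prob_after = float(np.mean(corrected < -15.0))
print(corrected)
print(prob_before, prob_after)
```

With these coefficients the mildest member is warmed by less than 2°C and the coldest by more than 7°C, so the spread shrinks and the probability of temperatures below -15°C drops.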

...