
Introduction

The distinction between validation (forecast system's characteristics) and verification (forecast system's predictive skill) is as relevant in probabilistic as in deterministic forecasting.  Some statistical concepts to facilitate the use and interpretation of probabilistic medium-range forecasts - ensemble forecasts - are given below.  As with the deterministic forecasting system, probability verification can address the accuracy (how close the forecast probabilities are to the observed frequencies), the skill (how the probability forecasts compare with some reference system) and utility (the economic or other advantages of the probability forecasts).

Sceptics of probability forecasts argue that forecasters might exaggerate uncertainty "to cover their backs".  However, as will be shown, the verification of probabilistic forecasts takes the "reliability" of the probabilities into account and will detect any such misbehaviour.  Indeed, one of the most widely used verification scores, the Brier score, is mathematically constructed in a way that encourages forecasters to state the probability they really believe in, rather than some misperceived "tactical" probability.

This chapter discusses only the most commonly used probabilistic validation and verification methods. For a full presentation the reader is referred to Nurmi (2003).


The Reliability or Discrimination Diagram

The most transparent way to illustrate the performance and characteristics of a probabilistic forecast system is the reliability diagram, where the x-axis is the predicted probability and the y-axis the frequency with which the forecasts verify.  It serves both as a way to validate the system and to verify its forecasts (see Fig12.B.1).

Fig12.B.1: Schematic explanation of the reliability diagram.  The aim is for the proportion of forecasts with a given probability to match the proportion of occasions when the event actually occurs - these are not necessarily coincident; there will be some successes and some failures.  It is the proportions that should match.  For a sample of ENS forecasts (e.g. made daily during a period of a year), plot, for each forecast probability, the proportion of occasions on which the event actually occurred.  As an example, suppose an event was forecast with 20% probability on each of 365 occasions during the sample period.  Then if the event actually occurred:

  • on 20% of those occasions (i.e. 365 x 20% = 73 occasions), this matches the 20% expected when a forecast of 20% probability is made (not more, not less). Reliability is good.
  • on 50% of those occasions (i.e. 365 x 50% ≈ 183 occasions), this is a greater proportion than the 20% expected when a forecast of 20% probability is made. Reliability is poor and the forecast system is under-forecasting.
  • on 10% of those occasions (i.e. 365 x 10% ≈ 37 occasions), this is a lower proportion than the 20% expected when a forecast of 20% probability is made. Reliability is poor and the forecast system is over-forecasting.

A similar exercise is done for other forecast probabilities (e.g. 40%, 60%, 80%, 100%) and the forecast probabilities are plotted against the observed frequencies.



Fig12.B.2: A fictitious example reliability diagram.  The size of a green circle represents the number of forecasts for each probability of an event during the sample period.  Ideally the circles should lie along the 45° diagonal.  Here, during the period, the event occurred on ~38% of the occasions when a probability of 20% was predicted, which is greater than the 20% expected.  The event occurred on ~80% of the occasions when a probability of 100% was predicted, which is less than the 100% expected.


When the forecast probabilities agree with the observed frequency of events at each probability, the points lie along the 45° diagonal.  In such a case the probability forecasts are considered reliable.
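To make the construction concrete, a minimal Python sketch of how the points of such a reliability diagram might be computed from a sample of probability forecasts is given below.  The binning into 10% intervals and the synthetic data are illustrative assumptions, not the operational ECMWF procedure.

```python
import numpy as np

def reliability_points(prob_forecasts, outcomes, n_bins=10):
    """Return (mean forecast probability, observed frequency, count) per probability bin."""
    prob_forecasts = np.asarray(prob_forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)          # 1 if the event occurred, else 0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include the right edge in the last bin so that p = 1.0 is counted
        in_bin = (prob_forecasts >= lo) & ((prob_forecasts < hi) | (hi == 1.0))
        if in_bin.any():
            points.append((prob_forecasts[in_bin].mean(),   # x: mean forecast probability
                           outcomes[in_bin].mean(),         # y: observed frequency
                           int(in_bin.sum())))              # size of the "green circle"
    return points

# Synthetic example: 365 daily probability forecasts that are reliable by construction,
# so the points should fall close to the 45 degree diagonal
rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, 365)
o = rng.uniform(0.0, 1.0, 365) < p
for x, y, n in reliability_points(p, o):
    print(f"forecast {x:4.2f}  observed {y:4.2f}  ({n} forecasts)")
```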

Sharpness

Sharpness is the ability of a probabilistic forecast system to spread away from the climatological average.  Climatological probability averages used as forecasts give perfect reliability, since the distribution would lie exactly on the 45° diagonal.  They would not, however, be very useful, since most probabilities cluster towards the climatological mean (Fig 12.B.3).  Ideally, we want the forecast system to be mainly reliable, but also to span as wide a probability interval as possible, with as many forecasts as possible away from the climatological average and as close to 0% and 100% as possible (Fig 12.B.4).


Fig12.B.3: An example of a reliable forecast system with poor sharpness.  The probabilities cluster around the climatological frequency (here assumed to be 50%).  This means that forecasts have shown low "certainty" in their predictions of an event or of no event.  When the predictability of a certain weather parameter is low, the forecasts might still be reliable but will tend to cluster around the climatological average.


Fig12.B.4: An example of a reliable forecast system with good sharpness.  The probabilities cluster as far away as possible from the climatological frequency (here assumed to be 50%).  This means that forecasts have shown greater "certainty" in their predictions of an event or of no event. 

Improvements in probability forecasts, provided they are reliable, will be accompanied by improved sharpness until, ultimately, only 0% and 100% forecasts are issued and verify, corresponding to a perfect deterministic forecast system.  However, an improvement in sharpness does not necessarily mean that the forecast system has improved.

Under- and Over-confident Probability Forecasts

Most probabilistic forecast systems, both subjective and objective, tend to produce reliability curves flatter than the 45° diagonal.  This means that low risks are underestimated and high risks overestimated - the forecast system is overconfident.  Fig12.B.5 shows an example of overconfidence.



Fig12.B.5: Probability forecasts with good sharpness but over-confident.  Forecasts of 0% probability of rain are not always verified as dry; forecasts of 100% probability of rain are not always verified as wet.

In the less common case the distribution is steeper than 45°: low risks have been overestimated and high risks underestimated.  The forecast system is then under-confident (see Fig12.B.6).

Fig12.B.6: Probabilities indicating reluctance to use very high or low probabilities. The probability forecasts are under-confident.

Under- or overconfidence can be corrected by the calibration of probabilities.

Reliability - Calibration of Probabilities

For operational purposes the reliability can be improved by calibration using verification statistics (e.g. if it is found that in cases when 0% has been forecast, the event tends to occur in 30% of the cases, and when 100% has been forecast, the event tends to occur only in 70% of the cases). If the misfit is linearly distributed in between these two extremes, the reliability can be made perfect by calibration - but at the expense of reduced sharpness, since very low and very high probabilities are never forecast (see Fig12.B.7).

Fig12.B.7: This example compares the proportion of model forecasts predicting rain over a (large) number of locations and times within an area and time period (e.g. locations within Germany during 24 hours) against the proportion of observations of rain within that area and time period.   Ideally the proportion of forecasts of rain should match the proportion of locations (and times) that rain actually occurs (i.e. lie on the diagonal).  The size of the green dots is indicative of the number of forecasts.  The diagram shows that: 

  • the model forecasts rain on 100% of occasions within the area and time period, whereas rain actually occurred on only 70% of those occasions.  It is too sure of itself and over-forecasts rain events.
  • the model forecasts rain on 0% of occasions within the area and time period, whereas rain actually occurred on 30% of those occasions.  It is too sure of itself and over-forecasts dry events (rain is under-forecast).
  • the model forecasts rain on e.g. ~20% of occasions within the area and time period, whereas rain actually occurred on ~40% of those occasions.  This matches the observed frequency of rain better, but there are fewer forecasts at this probability (smaller green dots), implying lower confidence in this part of the diagram.

The probabilities from an over-confident forecast system could in principle be calibrated to recover the "hidden skill" in probability forecasts that are biased in this way, but this is not really practicable.  For example, in this hypothetical case:

  • a forecast of 0% probability of rain may be taken to mean that rain is in fact more likely than the model suggests (nearer ~30%),
  • a forecast of 100% probability of rain may be taken to mean that rain is in fact less likely than the model suggests (nearer ~70%).

Neither calibrated value is very useful, as it gives a very poor indication of genuinely high or low probabilities of rain.


On the other hand, when the probability forecasts are under-confident, calibration might restore the reliability without giving up the sharpness (see Fig12.B.8).

Fig12.B.8: This example compares the proportion of model forecasts predicting rain over a (large) number of locations and times within an area and time period (e.g. locations within Germany during 24 hours) against the proportion of observations of rain within that area and time period.   Ideally the proportion of forecasts of rain should match the proportion of locations (and times) that rain actually occurs (i.e. lie on the diagonal).  The size of the green dots is indicative of the number of forecasts.  The diagram shows: 

  • the model forecasts rain on only 70% of occasions within the area and time period, whereas rain actually occurred on 100% of those occasions.  It is too unsure of itself and under-forecasts rain events.
  • the model forecasts rain on 30% of occasions within the area and time period, whereas rain actually occurred on 0% of those occasions.  It is too unsure of itself and overstates the risk of rain for dry events.
  • the model forecasts rain on e.g. ~45% of occasions within the area and time period, whereas rain actually occurred on ~40% of those occasions.  This matches the observed frequency of rain better, but there are fewer forecasts at this probability (smaller green dots), implying lower confidence in this part of the diagram.

The probabilities from an under-confident forecast system may be calibrated to recover the "hidden skill" in probability forecasts that are biased in this way.  For example, in this hypothetical case:

  • a forecast of 30% probability of rain may be taken to mean that rain is less likely than the model suggests (nearer 0%),
  • a forecast of 70% probability of rain may be taken to mean that rain is more likely than the model suggests (nearer 100%).


Resolution

Resolution is the degree to which the forecasts can discriminate between more or less probable events.  It should not be confused with sharpness which is the tendency to have predictions close to 0% and 100%.  Resolution and Sharpness are independent of one another.

The "uncertainty" of the observations can be understood from the familiar fact that it is easier to predict the outcome of tossing a coin if it is heavily biased.  In the same way, if it rains frequently in a region and rarely stays dry, forecasting rain can be said to be "easier" than if rain and dry events occur equally often.  The uncertainty is purely dependent on the observations, just as the Aa term in the RMSE decomposition.  It is also the Brier Score of the sample climatology forecast and plays the same role for the Brier Score as the Aa term does for the RMSE (see Forecast Error Baseline).  Comparisons of Brier Scores for different forecast samples can only be made if the uncertainty is the same.

Rank Histogram (Talagrand Diagram)

A more detailed way of validating the spread is by a rank histogram (sometimes called a Talagrand diagram).  It is constructed from the notion that in an ideal ensemble system the verifying analysis is equally likely to lie in any "bin" defined by any two ordered adjacent members, including the bins outside the ensemble range on either side of the distribution.  This can be understood by induction, considering an ideal ensemble with one, two or three members:

With one ensemble member ( I ), a verifying observation (★) will always (100%) fall "outside":  ★ I ★

With two ensemble members ( I I ), verifying observations will, for this ideal ensemble, fall outside in two cases out of three:  ★ I ★ I ★

With three ensemble members ( I I I ), verifying observations will, for this ideal ensemble, fall outside in two cases out of four:  ★ I ★ I ★ I ★

In general, if N is the number of members, the verifying observation falls outside the ensemble in two of the N + 1 bins, yielding a proportion of 2 / (N + 1) outside.  For the same reason the HRES and the ENS Control should lie outside the ensemble 2 / (N + 1) of the time.  For a 50-member ensemble system this means about 4%.  This is consistent with the consideration of probability that, owing to the limited number of ensemble members, it would be unrealistic to assume that the probability was 0% or 100% just because none or all of the members forecast the event.

In an ideal ensemble, the rank histogram distribution should, on average, be flat with equal numbers of verifying observations in each interval.  If there is a lack of spread, this will result in a U-shaped distribution with an over-representation of cases where the verifications fall outside the ensemble and under-representation of cases when they fall within the ensemble centre.  If the system has a bias with respect to the verifying parameter, the U-shape might degenerate into a J-shape.
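As an illustration of how such a rank histogram might be built, the sketch below counts, for each forecast, which of the N + 1 bins defined by the sorted ensemble members contains the verifying observation, using synthetic data rather than real ENS output; the expected fraction falling outside the ensemble, 2 / (N + 1), can be checked directly.

```python
import numpy as np

def rank_histogram(ensembles, observations):
    """ensembles: (n_cases, n_members); observations: (n_cases,).
    Returns counts over the N + 1 bins defined by the ordered members."""
    n_cases, n_members = ensembles.shape
    counts = np.zeros(n_members + 1, dtype=int)
    for members, obs in zip(ensembles, observations):
        # rank = number of members below the observation (0 .. N)
        rank = int(np.sum(np.sort(members) < obs))
        counts[rank] += 1
    return counts

# Synthetic, statistically consistent ensemble: observation and members are drawn
# from the same distribution, so the histogram should be roughly flat
rng = np.random.default_rng(1)
ens = rng.normal(size=(2000, 50))
obs = rng.normal(size=2000)
counts = rank_histogram(ens, obs)
outside = (counts[0] + counts[-1]) / counts.sum()
print(counts)
print(f"fraction outside the ensemble: {outside:.3f} (expected 2/(N+1) = {2/51:.3f})")
```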

An ideal ensemble system might, however, display a U-shape distribution due to observation uncertainties.  For example, with 50 ensemble members an ensemble spread of 20°C yields an average bin width of 0.4°C, an ensemble spread of 5°C yields an average bin width of only 0.1°C, smaller than the observation uncertainty (see Fig12.B.9).  The small bin size introduces an element of chance with respect to which bin the observation will fall into.  The bin sizes, due to the normal distribution, will increase with increasing distance from the centre, and an observation is more likely to end up in a bin further away from the centre than closer to the centre. This will result in a misleading U-shaped distribution.

Fig12.B.9: An observation (a filled circle) and its uncertainty assumed symmetric (the arrows).  The forecast bins widen their intervals away from the centre (the mean of the distribution, here off the diagram to the right), so an observation is more likely, for random reasons, to fall into an outer and wider bin than an inner and narrower one.



Verification Measures

In contrast to deterministic forecasts, an individual probabilistic forecast can never be "right" or "wrong", except when 0% or 100% has been stated.  Probability forecasts can therefore only be verified from large samples of forecasts.

Brier Score (BS) is a measure, over a large sample, of the correspondence between each forecast probability and the frequency of occurrence of the verifying observations (e.g. on average, when rain is forecast with probability p, it should occur with that same frequency p).  Observation frequency is plotted against forecast probability as a graph.  With perfect correspondence the graph lies on the diagonal; the closer the graph is to the diagonal, the lower (better) the Brier Score, whose values lie between 0 (perfect) and 1 (consistently wrong).


Brier Score (BS)

The most common verification method for probabilistic forecasts is the Brier score (BS) or the Mean Square Error of Probability Forecasts.  It has a mathematical structure similar to the Mean Square Error (MSE).

BS measures the difference between the forecast probability of an event (p) and its occurrence (o), expressed as 0 (event did not occur) or 1 (event did occur).  Over a sample of N forecasts, BS = (1/N) Σ (p − o)².  As with the RMSE, the BS is negatively orientated (i.e. the lower, the better).

Brier Score (BS)

The Brier Score (BS) measures the mean squared probability error of the forecast against a reference dataset (observations, or analyses, or climatology)

  • Brier Score for the forecast (BSforecast) is calculated as the mean squared probability error of the forecast against observations (or analyses) over a given period
  • Brier Score for the climatology (BSclimat) is calculated as the mean squared probability error of the forecast against climatology over the same period

The Brier Score (BS) is a measure of how good forecasts are in matching observed outcomes.   Where:

  • BS = 0 the forecast is wholly accurate;
  • BS = 1 the forecast is wholly inaccurate.

Brier Skill Score (BSS)

The Brier Skill Score (BSS) measures the relative skill of the forecast compared to the skill of using climatology.

BSS is evaluated for each forecast centre by calculating the function BSS = 1 - BSforecast / BSclimat

Where:

  • BSS = 1 the forecast has perfect skill compared to the reference (observations or climatology) - forecast beneficial;
  • BSS = 0 the forecast has no skill compared to the reference (observations or climatology) - forecast has no benefit over climatology;
  • BSS = a negative value the forecast is less accurate than the reference (observations or climatology) - forecast misleading.
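As a minimal illustration of the definitions above, the sketch below computes BS as the mean squared probability error and BSS = 1 - BSforecast / BSclimat; here the sample climatological frequency is used as the (assumed) reference forecast, which is one common but not the only possible choice.

```python
import numpy as np

def brier_score(prob_forecasts, outcomes):
    """Mean squared probability error; outcomes are 0 (did not occur) or 1 (occurred)."""
    p = np.asarray(prob_forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - o) ** 2))

def brier_skill_score(prob_forecasts, outcomes):
    """BSS = 1 - BS_forecast / BS_climatology, with the sample frequency as reference."""
    o = np.asarray(outcomes, dtype=float)
    bs_forecast = brier_score(prob_forecasts, o)
    climatology = np.full_like(o, o.mean())     # constant climatological probability
    bs_climat = brier_score(climatology, o)
    return 1.0 - bs_forecast / bs_climat

# Example: forecasts with some information beat climatology (BSS > 0)
probs    = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.6, 0.1]
occurred = [1,   1,   0,   0,   0,   1,   1,   0  ]
print(f"BS  = {brier_score(probs, occurred):.3f}")
print(f"BSS = {brier_skill_score(probs, occurred):.3f}")
```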


Decomposition of the Brier score

Similar to the MSE, the BS can be decomposed into three terms.  The decomposition most often quoted was suggested by Allan Murphy (1973, 1986), who used "binned" probabilities:

BS = (1/N) Σk nk (pk − ōk)² − (1/N) Σk nk (ōk − ō)² + ō (1 − ō)

where N is the total number of forecasts, nk is the number of forecasts in the same probability category k, pk is the forecast probability of category k, ōk is the observed frequency of the event when pk was forecast, and ō is the observed climatological frequency of the event over the whole sample.

  • The first term measures the reliability (i.e. how much the forecast probabilities can be taken at face value).  On the reliability diagram this is the nk-weighted sum of the squared distances (vertical or horizontal) between each point and the 45° diagonal (see Fig12.B.10).
  • The second term measures the resolution (i.e. how much the predicted probabilities differ from the climatological average and therefore contribute information).  On the reliability diagram this is the nk-weighted sum of the squared distances between each point and the horizontal line defined by the climatological probability reference (see Fig12.B.10).
  • The third term measures the uncertainty (i.e. the variance of the observations).  It takes its highest, most "uncertain", value when ō = 0.5 (see Fig12.B.11).

Fig12.B.10: Summary of Allan Murphy's reliability and resolution terms.


Fig12.B.11: Uncertainty is at its maximum for a climatological observed probability average of 50%.
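The three terms can be computed by grouping the forecasts by their (binned) probability values, as in the illustrative sketch below; when the forecasts are already issued in discrete probability steps, reliability − resolution + uncertainty recovers the Brier Score exactly.

```python
import numpy as np

def brier_decomposition(prob_forecasts, outcomes):
    """Murphy decomposition over the distinct (binned) forecast probability values:
    BS = reliability - resolution + uncertainty."""
    p = np.asarray(prob_forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    n = len(p)
    obar = o.mean()                              # sample climatological frequency
    reliability = resolution = 0.0
    for pk in np.unique(p):
        mask = p == pk
        nk = mask.sum()
        obark = o[mask].mean()
        reliability += nk * (pk - obark) ** 2    # distance from the 45 degree diagonal
        resolution  += nk * (obark - obar) ** 2  # distance from the climatology line
    uncertainty = obar * (1.0 - obar)
    return reliability / n, resolution / n, uncertainty

rng = np.random.default_rng(2)
p = np.round(rng.uniform(0, 1, 5000), 1)         # forecasts issued in 10% steps
o = rng.uniform(0, 1, 5000) < p                  # outcomes consistent with the probabilities
rel, res, unc = brier_decomposition(p, o)
bs = np.mean((p - o) ** 2)
print(f"REL={rel:.4f}  RES={res:.4f}  UNC={unc:.4f}  REL-RES+UNC={rel - res + unc:.4f}  BS={bs:.4f}")
```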


The Brier score is a "proper" score

The Brier score is strictly "proper" (i.e. it encourages forecasters to really try to find out the probability, without thinking about whether the forecast value is "tactical" or not).  Indeed, if forecasters deviate from their true beliefs, the BS will "punish" them!  This sounds strange.  How can an abstract mathematical equation know someone's inner beliefs?

Assume forecasters honestly think the probability of an event is p but have, for misguided "tactical" reasons, instead stated r.  If the event occurs, the contribution to the BS is (1 - r)², weighted by the probability of the event occurring.  If the event does not occur, the contribution to the BS is (r - 0)² = r², weighted by the probability of the event not occurring.  For these weightings, the "honest" probability must be used: p when the event occurs, (1 - p) when the event does not occur.  This is where the forecaster's true beliefs are revealed!

The expected contribution to the BS is therefore:

E = p (1 - r)² + (1 - p) r²

Differentiating with respect to r yields

dE/dr = -2 p (1 - r) + 2 (1 - p) r = 2 (r - p)

with a minimum of E at r = p.  Therefore, to minimize the expected contribution to the Brier Score, the honestly believed probability value should be used.
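This result can be checked numerically: evaluating the expected contribution p(1 - r)² + (1 - p)r² over a grid of stated probabilities r shows that it is smallest when r equals the believed probability p.  A small illustrative sketch:

```python
import numpy as np

p = 0.3                                  # the probability the forecaster really believes in
r = np.linspace(0.0, 1.0, 101)           # the probability actually stated
expected_bs = p * (1 - r) ** 2 + (1 - p) * r ** 2
best = r[np.argmin(expected_bs)]
print(f"expected Brier contribution is minimised at r = {best:.2f} (believed p = {p})")
```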

The Brier Skill Score (BSS)

A Brier Skill Score (BSS) is conventionally defined as the relative probability score compared with the probability score of a reference forecast.

"Uncertainty" plays no role in the BSS.


Rank Probability Scores (RPS)

Probabilities often refer to the risk that some threshold might be exceeded, for example that precipitation >1 mm/12hr or that wind >15 m/s.  However, when evaluating a probabilistic system, there is no reason why these particular thresholds should be especially significant.  For the Ranked Probability Score (RPS) the BS is calculated for different (one-sided) discrete thresholds and then averaged over all thresholds.

Ranked Probability Score (RPS)

The discrete Ranked Probability Score (RPS) measures how well the forecast probabilities of the outcome falling within each category (e.g. tercile, quintile, etc.) correspond to the category in which the observation actually lies, accumulated over the ordered categories.  The words "discrete" and "ranked" refer to the discrete, ordered nature of the categories.

The Ranked Probability Score (RPS) is a measure of how good forecasts are in matching observed outcomes.   Where:

  • RPS = 0 the forecast is wholly accurate;
  • RPS = 1 the forecast is wholly inaccurate.
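One common way of computing the discrete RPS is as the sum of squared differences between the cumulative forecast probabilities and the cumulative observation (0/1) vector over the ordered categories, normalised by the number of categories minus one so that the score lies between 0 and 1.  The sketch below follows that convention; it is an illustration rather than the exact operational formulation.

```python
import numpy as np

def rps(category_probs, observed_category, normalise=True):
    """Ranked Probability Score for one forecast.
    category_probs: probabilities for the K ordered categories (summing to 1).
    observed_category: index (0-based) of the category that actually occurred."""
    p = np.asarray(category_probs, dtype=float)
    o = np.zeros_like(p)
    o[observed_category] = 1.0
    cum_error = np.cumsum(p) - np.cumsum(o)      # compare cumulative distributions
    score = float(np.sum(cum_error ** 2))
    if normalise:
        score /= len(p) - 1                      # so that 0 <= RPS <= 1
    return score

# Tercile example: the observation falls in the lower tercile (category 0)
print(rps([0.6, 0.3, 0.1], observed_category=0))   # sharp and correct -> small RPS
print(rps([0.1, 0.3, 0.6], observed_category=0))   # sharp but wrong   -> large RPS
```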

Ranked Probability Skill Score (de-biased) (RPSS-D)

The Ranked Probability Skill Score (RPSS) is evaluated by calculating the function RPSS = 1 − RPSforecast / RPSreference, where:

  • Ranked Probability Skill Score (de-biased) (RPSS-D) against observations is defined as (RPSS-Dobs) = 1 - RPS(forecast) / (RPS(obs) + D).  
  • Ranked Probability Skill Score (de-biased) (RPSS-D) against climatology is defined as (RPSS-Dclimat) = 1 - RPS(forecast) / (RPS(climat) + D).  
  • where D represents the intrinsic (un)reliability of the ENS.

Where:

  • RPSS = 1 the forecast has perfect skill compared to the reference (observations or climatology) - forecast beneficial;
  • RPSS = 0 the forecast has no skill compared to the reference (observations or climatology) - forecast has no benefit over climatology;
  • RPSS = a negative value the forecast is less accurate than the reference (observations or climatology) - forecast misleading.

Weigel, A.P., D. Baggenstos, M. A. Liniger, F. Vitart, and F. Appenzeller, 2008: Probabilistic verification of monthly temperature forecasts. Mon. Wea. Rev., 136, 5162‐5182.


Continuous Ranked Probability Scores (CRPS)

Continuous Ranked Probability Score (CRPS)

A generalisation of the Ranked Probability Score (RPS) is the Continuous Ranked Probability Score (CRPS), where the thresholds are continuous rather than discrete (see Nurmi, 2003; Jolliffe and Stephenson, 2003; Wilks, 2006).  The Continuous Ranked Probability Score (CRPS) is a measure of how good forecasts are in matching observed outcomes.   Where:

  • CRPS = 0 the forecast is wholly accurate;
  • CRPS = 1 the forecast is wholly inaccurate.

CRPS is calculated by comparing the Cumulative Distribution Functions (CDF) for the forecast against a reference dataset (observations, or analyses, or climatology) over a given period.
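As an illustration, the CRPS for a single ensemble forecast and verifying observation can be approximated numerically by integrating the squared difference between the forecast CDF (a step function built from the ordered members) and the observation CDF (a step at the observed value); the score for a period is then the average over all forecast-observation pairs.  The sketch below is a simple numerical approximation, not the closed-form decomposition of Hersbach (2000).

```python
import numpy as np

def crps_ensemble(members, obs, n_points=2000):
    """Numerical CRPS: integral of (F_forecast(x) - H(x - obs))^2 dx."""
    members = np.sort(np.asarray(members, dtype=float))
    lo = min(members[0], obs) - 1.0
    hi = max(members[-1], obs) + 1.0
    x = np.linspace(lo, hi, n_points)
    forecast_cdf = np.searchsorted(members, x, side="right") / len(members)
    obs_cdf = (x >= obs).astype(float)            # Heaviside step at the observation
    dx = x[1] - x[0]
    return float(np.sum((forecast_cdf - obs_cdf) ** 2) * dx)

rng = np.random.default_rng(3)
ens = rng.normal(loc=12.0, scale=2.0, size=51)    # e.g. 51 members of 2 m temperature (degC)
print(f"CRPS = {crps_ensemble(ens, obs=13.5):.3f} degC")
```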

Continuous Ranked Probability Skill Score (CRPSS)

The Continuous Ranked Probability Skill Score (CRPSS) is a measure of how good forecasts are in matching observed outcomes.   Where:

  • CRPSS = 1 the forecast has perfect skill compared to climatology - forecast beneficial;
  • CRPSS = 0 the forecast has no skill compared to climatology - forecast has no benefit over climatology;
  • CRPSS = a negative value the forecast is less accurate than climatology - forecast misleading.

CRPSS is evaluated by calculating the function  CRPSS = 1 − CRPSforecast / CRPSclimat where:

  • Continuous Ranked Probability Score for the forecast (CRPSforecast) is calculated comparing the Cumulative Distribution Functions (CDF) for the forecast against observations (or analyses) over a given period
  • Continuous Ranked Probability Score for climatology (CRPSclimat) is calculated comparing the Cumulative Distribution Functions (CDF) for the forecast against climatology over the same period


Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559-570.

Measurement of model performance

It is useful to have some measures of the current performance and biases of the IFS.  Users can assess from Reliability and ROC diagrams whether the forecast model is:

  • effective in capturing an event (e.g. upper tercile rainfall),

  • tending to over- or under-forecast with different probabilities of an event,

  • tending to forecast events that actually happen while minimising those that don't.

The Relative Operating Characteristics (ROC) diagram

A powerful way to verify probability forecasts, and in particular to compare their performance with deterministic forecast systems, is the two-dimensional "Relative Operating Characteristics" or "ROC" diagram.  The probability forecasts are first converted into categorical yes/no forecasts using a probability threshold; these categorical forecasts produce a set of pairs of "Hit Rate" and "False Alarm Rate" values to be entered into the ROC diagram: False Alarm Rate (FAR) on the x-axis and Hit Rate (HR) on the y-axis (derived from the Contingency Table).  The upper left corner of the ROC diagram represents a perfect forecast system (no false alarms, only hits).  The closer any verification is to this upper left corner, the higher the skill.  The lower left corner (no false alarms, no hits) represents a system which never warns of an event.  The upper right corner represents a system where the event is always warned for (see Fig12.B.12).

Fig12.B.12: The principle of the ROC diagram: a large number of probability forecasts are turned into categorical forecasts depending on whether the probability values of individual forecasts are above or below a certain threshold. The false alarm rate and the hit rate are calculated, thus determining the position in the diagram (red filled circle).


Probabilistic forecasts are transformed into categorical yes/no forecasts defined by thresholds varying from 0% to 100% (see Fig12.B.13).

Fig12.B.13: The same as Fig12.B.12, but repeated for several thresholds between 0% and 100%.  The hit rates and false alarm rates of the deterministic model, although not providing probabilistic predictions, can be represented on the diagram by its typical hit rate and false alarm rate (green filled circle).

The ROC score is the area underneath the forecast curve (see Fig12.B.14).


Fig12.B.14: The area underneath the points, joined by straight lines, defines the ROC area, which is, ideally, 1.0 and at worst 0.0. Random forecasts yield 0.5, the triangular area underneath the 45° line.  There are two schools on how to calculate the area: either with a smooth spline or linearly, connecting the points.
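The construction described in Fig12.B.12 to Fig12.B.14 can be sketched as follows: probability forecasts are turned into yes/no forecasts at a series of thresholds, a (False Alarm Rate, Hit Rate) pair is computed at each threshold, and the ROC area is obtained by joining the points linearly (the second of the two schools mentioned above).  The data here are synthetic and purely illustrative.

```python
import numpy as np

def roc_points(prob_forecasts, outcomes, thresholds=np.linspace(0.0, 1.0, 11)):
    """Return (false alarm rate, hit rate) pairs, one per probability threshold."""
    p = np.asarray(prob_forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=bool)
    points = []
    for t in thresholds:
        warn = p >= t                                    # categorical yes/no forecast
        hits = np.sum(warn & o)
        misses = np.sum(~warn & o)
        false_alarms = np.sum(warn & ~o)
        correct_rejections = np.sum(~warn & ~o)
        hr = hits / max(hits + misses, 1)
        far = false_alarms / max(false_alarms + correct_rejections, 1)
        points.append((float(far), float(hr)))
    return sorted(points)                                # ascending false alarm rate

def roc_area(points):
    """Area under the ROC curve, joining the points with straight lines."""
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0              # trapezium rule
    return area

rng = np.random.default_rng(4)
truth_prob = rng.uniform(0, 1, 3000)
occurred = rng.uniform(0, 1, 3000) < truth_prob
noisy_forecast = np.clip(truth_prob + rng.normal(0, 0.2, 3000), 0, 1)   # imperfect forecasts
pts = roc_points(noisy_forecast, occurred)
print(f"ROC area = {roc_area(pts):.3f}")                 # 0.5 = no skill, 1.0 = perfect
```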


The ROC diagram gives a measure of the effectiveness of the IFS in forecasting an event that actually happens (Probability of Detection or Hit Rate) while balancing this against the undesirable cases of predicting an event that fails to occur (False Alarm Rate).  Where a ROC graph:

  • arches towards the top left of the diagram then the model is effective in forecasting events that occur without warning of events that don't.
  • follows the diagonal then the model is forecasting as many events that occur as warning of events that don't.
  • lies below the diagonal then the model is forecasting few events that occur while mostly warning of events that don't.

Effect of the distribution of forecast probabilities.

The distribution of forecast probabilities gives an indication of the tendency of the forecast towards uncertainty.  These are plotted as a histogram to give an indication of confidence in model performance:

  • A U-shaped distribution (i.e. higher proportion of forecast probabilities occur at each end of the histogram) implies a clearer decision on whether an event will or won't occur and gives a higher confidence in model performance. 
  • A peaked distribution (i.e. higher proportion of forecast probabilities occur in the centre of the histogram) implies more equivocal decision on whether an event will or won't occur and gives much less confidence in model performance.

Note that where there are only a few entries for a given probability on the histogram, confidence in the Reliability diagram is reduced for that probability.

Thus in Fig12.B.14 the predominance of probabilities below 0.2 and above 0.9 suggests there can be some confidence in the result that, when predicting lower tercile climatological temperatures at 2m, the IFS tends to be over-confident that the event will occur and under-confident that it won't.  However, there are few probabilities on the histogram between 0.2 and 0.9, which suggests that it would be unsafe to confidently draw similar deductions from the Reliability diagram within this probability range.

Conversely, in Fig12.B.15 the majority of probabilities lie between 0.2 and 0.5, and reliability within this range appears fairly good, while there is much less confidence in model performance for over- or under-forecasting an event.  This is as expected as the forecast range becomes longer.

Reliability and ROC diagrams (presented on ECMWF web)

It is useful to have some measures of the current performance and biases of the IFS.  Users can assess from Reliability and ROC diagrams whether the forecast model is:

  • effective in capturing an event (e.g. rainfall in upper tercile),
  • tending to over- or under-forecast with different probabilities of an event,
  • tending to forecast events that actually happen while minimising those that don't.


The ROC score is the area beneath the graph on the ROC diagram and lies between 1 (perfect capture of events) and 0 (consistently warning of events that don't happen).  Fig12.B.14 shows high effectiveness in forecasting events (ROC score 0.859) while Fig12.B.15 shows reduced effectiveness (ROC score 0.593).  This is as expected as the forecast range becomes longer.

   

Fig12.B.14: Reliability Diagram (left) and ROC diagram (right) regarding lower tercile for T2m in Europe area for week1 (day5-11), DT:20 Jun 2019.

 

Fig12.B.15: Reliability Diagram (left) and ROC diagram (right) regarding lower tercile for T2m in Europe area for week5 (day19-32), DT:20 Jun 2019.

In the above diagrams:

  • BrSc = Brier Score (BS), LCBrSkSc = Brier Skill Score (BSS).
  • BS_REL = Forecast reliability and BS_RSL = Forecast resolution with respect to observations.
  • BSS_RSL = Forecast resolution and BSS_REL = Forecast reliability with respect to climatology.


Cost/Loss Diagrams and Economic Value 

The cost-loss model is a simple model of economic decision-making that demonstrates some of the differences between skill and value.  It shows how different users can get contrasting benefits from the same forecast and how appropriate use of probability forecasts can improve the user's decisions.  The user decides whether or not to take mitigating action in advance of a potential adverse weather event; this can be based on either deterministic or probabilistic forecast information.  Forecasts can only have economic benefits (or value) if, as a consequence of the forecast, a user takes a course of action that would otherwise not have been taken.

There is no simple relationship between the skill and the value of a forecasting system.  A system with no skill has little value, although it may still have some value to users who need to make a decision whatever the outcome.  Conversely, a skilful system may have no value for some users, perhaps because of external constraints that prevent them making use of even excellent forecast guidance.

It is useful to understand the potential value to the user of a forecast technique (subjective, deterministic or EPS).  Calculating the value of a system is based on past performance and gives some idea of the usefulness of a forecast technique in making decisions, particularly when forecasting high or low probability events.

The concept of Cost/Loss.

Consider a weather-related event and the action to be taken to lessen the impact.

  • If action is taken it will incur a cost C irrespective of whether the event occurs (e.g. the cost of buying insurance to offset damage due to the event).
  • If no action is taken, and the event occurs, then there will be a loss L (e.g. paying for damage due to the occurrence of the uninsured event).
  • If no action is taken and the event does not happen then no cost or loss is incurred.

The expense associated with each combination of action and occurrence is shown in Table1.


                      Weather event occurs
                      Yes                 No

Take action   Yes     Cost                Cost
              No      Loss                No Cost, No Loss

Table1:  Expense matrix: Cost and Loss for different outcomes.

The cost C and loss L values depend entirely on the context of the decision and the user, and are not related to the meteorology.  One practical problem is that C and L are often poorly known, but they can at least be estimated for a given use of the forecast.

The Cost/Loss ratio (C/L) is the ratio of the cost of taking protective action (e.g. buying insurance against a forecast event) to the potential loss should the event occur with no action taken.  If the cost of protection is greater than the potential loss (i.e. C/L > 1) there is no benefit in taking protective action, and in the trivial case where the cost of protection is zero protective action should always be taken.  Thus the Cost/Loss ratio need only be considered within the range 0.0 to 1.0.


The concept of Mean Expense.

Forecasts have economic benefits (or value) if, as a consequence of the forecast, a user takes a course of action that would otherwise not have been taken.  Either action is taken (incurring a cost but no loss if the event occurs) or action is not taken (incurring no cost but a loss if the event occurs).

The mean expense (ME) is the average expense of using a forecast and depends upon both C and L.  It includes the cost C of taking action when the event is forecast and the loss L on those occasions when the event occurs but was not forecast.  It can be shown that the mean expense (ME) is a function of the Hit Rate (HR), the False Alarm Rate (FAR), the Cost/Loss ratio (C/L), and the climatological probability of the event ρc (see below).

The mean expense per unit loss (ME / L) can be plotted against Cost/Loss ratio C/L on an expense diagram (see Fig12.B.16).
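One common way of writing the mean expense per unit loss in terms of these quantities is sketched below: protective action is assumed to be taken whenever the event is forecast, so costs are paid for hits and false alarms and losses for misses.  This is an illustrative formulation; the exact expression used operationally may differ in detail.

```python
def mean_expense_per_unit_loss(hit_rate, false_alarm_rate, cost_loss_ratio, base_rate):
    """Mean expense per unit loss when protective action is taken whenever the
    event is forecast: costs are paid for hits and false alarms, losses for misses."""
    c = cost_loss_ratio          # C/L
    pc = base_rate               # climatological probability of the event
    cost_term = c * (hit_rate * pc + false_alarm_rate * (1.0 - pc))   # action taken
    loss_term = (1.0 - hit_rate) * pc                                 # event missed
    return cost_term + loss_term

# Example: HR = 0.8, FAR = 0.1, C/L = 0.3, climatological frequency 0.4
me = mean_expense_per_unit_loss(hit_rate=0.8, false_alarm_rate=0.1,
                                cost_loss_ratio=0.3, base_rate=0.4)
print(f"ME/L = {me:.3f}")        # 0.194 for this example
```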


Strategies to minimise expense.

Using climatology alone.

Climatology gives the proportion ρc of occasions that a given event occurs (e.g. rainfall at the location is greater than 5mm/hr occurs on 20% of occasions – i.e. ρc = 0.2).  If the decision to take an action is based upon climatology alone there are just two options:

  • always take action.  This incurs a cost C on each occasion, irrespective of whether the event occurs – hence the mean expense (ME) of always taking action is C.  There will be no loss L. 
  • never take action.  This incurs a loss L but only on that proportion ρc of occasions when the event occurs – hence the mean expense (ME) of never taking action is ρcL.  The mean expense per unit loss is thus ME/L = ρcL/L = ρc.

Hence if only climatology is available then the best course is always to take action if C < ρcL and never to act otherwise.


The important features are:

  • Where Cost/Loss values lie between 0 and ρc the mean expense per unit loss is equal to the Cost/Loss ratio C/L.
  • Where Cost/Loss values are greater than ρc the mean expense per unit loss is equal to the proportion ρc of occasions that a given event occurs.

Thus it is necessary to assess the values of, and the relationship between, cost C and loss L.  Note that where a different climatological proportion ρc applies (e.g. ρc2 for rainfall at the location greater than 25mm/hr rather than ρc1 for greater than 5mm/hr as used above) the graph is displaced to reflect ρc2 rather than ρc1, although the shape of the graph remains the same.  See Fig12.B.16.


Fig12.B.16: Relationship of Mean Expense against Cost/Loss. Three examples are shown for climatological probabilities of an event (ρc=0.7, green,  ρc=0.4, red,  ρc=0.2, blue)

For a given event where:

  • Cost/Loss ratio is below the climatological probability the Mean Expense per unit loss equates to the Cost/Loss ratio.
  • Cost/Loss ratio is above the climatological probability the Mean Expense remains constant at the value attained at the climatological probability.

Using a Perfect Forecast.

A perfect forecast gives perfect knowledge of future conditions – it is never wrong.  Thus action is taken:

  • only when the event is forecast to occur.  Thus cost is never unnecessarily incurred (there will never be a loss because action is always taken before the event happens).
  • never when the event is forecast not to occur.  Thus a cost is never incurred and there is no loss because the event doesn’t happen.

This appears on an expense diagram as a dashed line (See Fig12.B.17).

The important features are:

  • When Cost and Loss values are the same (i.e. where C/L = 1), the mean expense per unit loss is equal to ρc, the proportion of occasions that a given event occurs.  If C = L the expense of using a perfect forecast is the same as the expense of just using climatology.
  • When Cost is less than Loss (i.e. where C/L < 1), the mean expense per unit loss is the Cost/Loss ratio multiplied by the proportion of occasions that a given event occurs (i.e. ME/L = ρc C/L).



Fig12.B.17: Comparing the relationship between Mean Expense against Cost/Loss when using:

  • climatological probability of a given event (red line).
  • using a Perfect Forecast (dashed line).

Using deterministic forecasts.

Any practical forecasting system will not achieve a perfect forecast, but nevertheless should show an improvement over using climatology alone.  The reduction in mean expense by using a deterministic forecast over climatology compared with that by using a perfect forecast over climatology may be used as a measure of value.  Define the Value V as the reduction in mean expense (ME) by use of a forecast system as a proportion of the reduction that would be achieved by use of a perfect forecast. Thus:

  • a maximum value V = 1 is obtained from a perfect forecast.
  • a minimum value V = 0 is obtained from a forecast based on climate alone.
  • a value V > 0 indicates that the user will benefit by using the deterministic forecast.


Skill and Value for a deterministic system.

Measures of the performance of a forecast include the Hit Rate (HR), False Alarm Rate (FAR) and the Peirce Skill Score (PSS).  But these give little information on the value of the forecast.  It can be shown that ME is a function of HR and FAR, as well as ρc and C/L.  For a given weather event and forecast system, the values of ρc, HR and FAR are known characteristics, and therefore the economic value V depends only upon the Cost/Loss ratio (C/L).

The Economic Value V is the reduction in mean expense by use of a forecast system (as opposed to use only of climatology) as a proportion of the reduction that would be achieved by use of a perfect forecast (were such a thing possible).  The economic value depends on the Cost/Loss ratio (C/L).

  • a maximum value = 1 is obtained from a perfect forecast.
  • a value between 0 and 1 indicates that the user will benefit by using the deterministic forecast.
  • a minimum value = 0 is obtained where the forecast is no better than using climate.  The user gains no benefit by using the deterministic forecast over a forecast based on climate alone.


Fig12.B.18: Comparing the reduction of Mean Expense from values using climatological probability (here ρc=0.4) of a given event (red line).  The reduction of the Mean Expense using:

  • a deterministic forecast (blue arrows). 
  • a Perfect Forecast (black arrows).

Measurements of Value V for each Cost/Loss may be plotted on a Value diagram (See Fig12.B.18).  As the Cost/Loss approaches the limits of 0 and 1, climatology becomes harder to beat.  When C/L is high, the high expense resulting from even occasional incorrect False Alarm forecasts outweighs the losses avoided when the forecast was correct.  When C/L is low, the low expense of providing constant protection as the default is less than the loss incurred from an occasional incorrect Miss forecast (where the weather event occurred but was not forecast).  The maximum value always occurs at C/L=ρc.  At this point the expense of taking either climatology option (always or never protect) is the same and gives no guidance to the user while the deterministic forecast can potentially give the greatest benefit to decision-making.  This is shown diagrammatically in Fig12.B.18; at the point where C/L = ρc the vertical distance between the red line and the dashed line is greatest, indicating that this is where the greatest potential for delivering value exists.


Fig12.B.19: Value of a deterministic forecast.  Value is given by the ratio of the reduction in Mean Expenses per unit loss by using a forecast to the reduction in Mean Expenses per unit loss using a perfect forecast (i.e. the ratio of the lengths of blue to black arrows in Fig12.B.18).   Here the deterministic forecast is the ensemble control member and results have been taken over the period of a month.  More extreme events have low climatological probability and maximum Value occurs when Cost/Loss ratio  is low.  Less extreme events have higher climatological probability and maximum Value occurs when Cost/Loss ratio is higher.

In Fig12.B.19, note how the more extreme events (in this theoretical case, temperature anomalies greater than 8 deg shown green or blue) have a lower climatological probability (are more unusual) and therefore the maximum potential Value occurs when C/L is low.  The less extreme events (temperature anomalies greater than 4 deg shown red, pale blue) occur more often (higher climatological probability) and thus the potential Value that can be delivered by a forecast peaks at higher C/L ratios.

Using ensemble forecasts.

Ensemble forecasts deliver a probability ρe that an event will occur.  But at which probability threshold of an event ρ* should action be taken?  Should action be taken if the event is forecast with a moderate probability (say ρe = 20%) or should this be delayed until the forecast shows more confidence (perhaps ρe = 60%)?  And is there an optimum probability ρ* of an event (for each individual user) above which action should be taken?

Measures of the performance of ensemble forecasts can be derived in the same way as for deterministic forecasts using the Hit Rate (HR), False Alarm Rate (FAR) and the Peirce Skill Score (PSS).  For a given weather event and forecast system, the values of ρc, HR and FAR are known characteristics, and therefore the economic value V depends only upon the Cost/Loss ratio (C/L).

The optimal threshold probability can be objectively defined as p*=C/L.

At this threshold of probability, decisions are cost-neutral: the expected cost of taking protective action (C) equals the expected loss avoided by acting (p*·L).  When the forecast probability is below C/L, the cost of taking action would, on average, outweigh the expected loss avoided, so it is better not to act.  When the forecast probability is greater than C/L, taking action reduces the expected expense and so delivers positive Value.

After selection of a threshold probability p*, the Value V of the system may be determined in the same way as for a deterministic system.  Note that p* will be different for different users, different weather events, or even different seasons.  It is for the user to choose the appropriate p* for their own C/L.

Further, varying the threshold probability ρ* from 0 to 1 produces a sequence of expense diagrams (See Fig12.B.20).  From these a sequence of values of Value V corresponding to each threshold probability ρ* may be obtained and plotted on a Value diagram. See Fig12.B.21.
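Putting these pieces together, the relative economic value V = (MEclimate - MEforecast) / (MEclimate - MEperfect) can be computed for each user (each C/L) and each probability threshold p*, and the envelope over thresholds obtained by taking, for each C/L, the threshold that gives the highest value.  In the sketch below the hit rates and false alarm rates associated with each threshold are invented numbers chosen only for illustration.

```python
def value(hit_rate, false_alarm_rate, cost_loss_ratio, base_rate):
    """Relative economic value V = (ME_climate - ME_forecast) / (ME_climate - ME_perfect)."""
    c, pc = cost_loss_ratio, base_rate
    me_forecast = c * (hit_rate * pc + false_alarm_rate * (1.0 - pc)) + (1.0 - hit_rate) * pc
    me_climate = min(c, pc)          # always protect vs never protect, whichever is cheaper
    me_perfect = pc * c              # protect only when the event actually occurs
    return (me_climate - me_forecast) / (me_climate - me_perfect)

# Invented (HR, FAR) pairs for a range of probability thresholds p*:
# low thresholds warn often (high HR, high FAR), high thresholds warn rarely.
hr_far_by_threshold = {0.1: (0.95, 0.40), 0.3: (0.85, 0.20),
                       0.5: (0.70, 0.10), 0.7: (0.50, 0.04), 0.9: (0.25, 0.01)}
base_rate = 0.4

for cl in (0.05, 0.2, 0.4, 0.6, 0.8):
    # the envelope: for each user (each C/L) pick the threshold giving the highest value
    best_pstar, best_v = max(((p, value(hr, far, cl, base_rate))
                              for p, (hr, far) in hr_far_by_threshold.items()),
                             key=lambda item: item[1])
    print(f"C/L = {cl:4.2f}:  best p* = {best_pstar},  value = {best_v:5.2f}")
```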


Fig12.B.20: A hypothetical Mean Expense against Cost/Loss diagram.  Two sample EPS probability thresholds are shown (p*=0.2 in blue, p*=0.6 in orange; curves for any other probability thresholds could be drawn).  The climatological frequency of the event is here taken as ρc=0.4.  Values for each are given by the ratio of the reduction in Mean Expense per unit loss by using each forecast (blue or orange double-ended arrows) to the reduction in Mean Expense per unit loss using a perfect forecast (black double-ended arrows), as in Fig12.B.19.  Lower EPS probability thresholds (p*) give a better Value for low Cost/Loss but over a smaller range; higher EPS probability thresholds give a better Value at higher Cost/Loss over a wider range.  The maximum Value occurs for all EPS probability thresholds (p*) where Cost/Loss = ρc (in the green ellipse).


Fig12.B.21: Values of EPS plotted for different EPS probability thresholds (p*).  The Values derived from the curves such as in Fig12.B.20 are plotted against Cost/Loss. (ρ*= 0.1 in green, ρ*=0.9 in red, other sample ρ* in pale blue). The envelope of maximum Values using all EPS probability thresholds (p*) is shown in black.


Skill and Value for an ensemble system.

Diagrams Fig12.B.20 and Fig12.B.21 show:

  • selection of a threshold probability ρ* similar to the Cost/Loss ratio C/L yields the best Value.
  • selection of a high threshold probability ρ* might intuitively be expected to capture the majority of events.  But choosing a high threshold probability means all the events forecast with a lower probability are missed; and the higher the threshold, the more events will be missed.  To capture more events choose a lower threshold ρ*.

In cases of:

  • small Cost/Loss ratios (i.e. relatively large potential losses) users gain maximum benefit by taking action even when the forecast probability ρe is low.
  • high Cost/Loss ratios (e.g. relatively high insurance cost) users gain maximum benefit by taking action only when the forecast probability ρe is high.

An inappropriate selection of EPS probability thresholds (ρ*) can result in substantial reduction in forecast Value.

Value diagrams for an ENS forecast and a deterministic forecast alone (e.g. the Control) may be compared.  These show (in general) useful Value extends through a wider range of Cost/Loss ratios using an ensemble of forecasts (e.g. ENS) than using a deterministic forecast alone.  Nevertheless, there may still be little or no Value at higher Cost/Loss ratios C/L.  See Fig12.B.22.


Fig12.B.22: Value against Cost/Loss diagram comparing the envelope of Values derived from the EPS with the envelope of Values from a single deterministic forecast (here an ensemble control member).


Fig12.B.23: Example Value against Cost/Loss diagrams for week1 (left) and week5 (right), illustrating in particular:

  • potential economic value for all users reduces as lead time increases
  • the range of users for whom the forecasts can have some intrinsic economic value reduces markedly as lead time increases.

Diagnostic cost/loss ratio diagrams for weather parameters are available for 24hr precipitation (>1.0mm, >5.0mm, >10.0mm, >20.0mm), 2m temperature anomalies (<8°C below climatology, <4°C below climatology, >4°C above climatology, >8°C above climatology) and 10m wind speed (>10m/s, >15m/s) for forecast days 4, 6, and 10.  These are updated bi-monthly as an aid to users to make decisions on the value or otherwise of specific forecasts based on ENS or ENS Control and a knowledge of potential Cost and Loss.

Cost/Loss (C/L) is a simple model – real world decisions are often more complex.  However the Cost/Loss model highlights some general aspects of the utility of forecasts. 

Key points:

  • Different users get different benefits from the same forecast, depending on the costs/losses involved in their decisions.
  • Difference can be large – no benefit to some users, but high value to others.
  • Ensemble forecasts give generally higher value, especially because users can choose probability threshold p* relevant for their particular situation (so the Cost/Loss model generally benefits a wider range of users).
  • Skill scores show a single number for a given forecasting system – in some cases this can be interpreted as average value over a set of users.


Further reading: ECMWF Newsletter Number 80 – Summer 1998. https://www.ecmwf.int/sites/default/files/elibrary/1998/14644-newsletter-no80-summer-1998.pdf



Statistical Post-processing – Model Output Statistics (MOS)

An efficient way to improve the ensemble forecast, both the EM and the probabilities, is by Statistical Post-Processing (SPP), which is an advanced form of calibration of the output from the deterministic ensemble members.  The most commonly used SPP method is "model output statistics" (MOS).


The MOS equation

Deterministic NWP forecasts are statistically matched against a long record of verifying observations through a linear regression scheme.  The predictand (Y) is normally a scalar (for example 2m temperature) and the predictors are one or several forecast parameters, selected by the linear regression system as those which provide the most information (e.g. forecasts of 2m temperature use 850hPa temperature, 500hPa geopotential etc.):

Y = X1 + X2·T2m + X3·T850hPa + X4·Z500 + … etc

where the coefficients Xi (i = 1, 2, 3 … n) are estimated by the regression scheme.  For this discussion it is sufficient to consider the simple MOS equation:

Y = X1 + X2·T2m

where X1 and X2 have been estimated from a large amount of representative historical material.  This is often quite an effective correction equation, since the errors in many meteorological forecast parameters tend, to a first approximation, to be linearly dependent on the forecast itself (except perhaps for precipitation and cloudiness).
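A minimal illustration of fitting and applying such a simple MOS equation by ordinary least squares is sketched below, using synthetic forecast-observation pairs with a conditional (cold) bias; operational MOS uses long training records and more predictors, so this is purely illustrative.

```python
import numpy as np

# Synthetic training data: "forecast" 2 m temperatures with a conditional (cold) bias,
# i.e. the colder the forecast, the larger the error
rng = np.random.default_rng(5)
t2m_forecast = rng.uniform(-20.0, 10.0, 400)
t2m_observed = 2.0 + 0.8 * t2m_forecast + rng.normal(0.0, 1.5, 400)

# Fit Y = X1 + X2 * T2m by least squares (the simple MOS equation in the text)
A = np.column_stack([np.ones_like(t2m_forecast), t2m_forecast])
(x1, x2), *_ = np.linalg.lstsq(A, t2m_observed, rcond=None)
print(f"X1 = {x1:.2f}, X2 = {x2:.2f}")

# Apply the correction to a new forecast
new_forecast = -15.0
print(f"raw forecast {new_forecast:.1f} degC  ->  MOS-corrected {x1 + x2 * new_forecast:.1f} degC")
```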


Simultaneous Corrections of Mean Error and Variability

The MOS equation not only minimizes the RMSE, it also corrects simultaneously for both systematic mean errors and for the variability.  In the above equation X1 represents the mean error correction and X2 the variability correction.  There is therefore no necessity to apply two different schemes, one for reducing the systematic error ("bias") and one for correcting the spread.


Short-range MOS

However, the corrections imposed by MOS have different emphases in the short and medium range.  In the short range, where most synoptic features are forecast with realistic variability, the MOS equation mainly corrects true systematic errors and representativeness errors.

Fig12.B.24: A scatter diagram of forecast errors versus forecast for Tromsö in northern Norway, November 2010 - February 2012. Cold temperatures are too cold and, as a whole, the forecasts overestimate the variability of the temperature.

The scatter diagram in Fig12.B.24 depicts the errors at D+1 and therefore shows true systematic errors: the colder the forecast, the larger the mean error, which is equivalent to over-forecast variability.

Medium-range MOS

MOS also improves forecasts in the medium range but, with increasing forecast range, less and less of this improvement is due to the MOS equation's ability to remove systematic errors.  In the medium range, the dominant errors are non-systematic.  These non-systematic errors (e.g. false model climate drift) can appear as false systematic errors (e.g. see Fig12.A.4).  They will thus be "corrected" by the MOS in the same way as true systematic errors.  By this means MOS is essentially dampening the forecast anomalies and thereby minimizing the RMSE.  This might be justified in a purely deterministic context but not in an ensemble context, where the most skilful damping of less predictable anomalies is achieved by ensemble averaging through the EM.  It is therefore recommended that MOS equations are calculated in the short range, typically at D+1, based on forecasts from the CTRL, and then applied to all the members in the ensemble throughout the whole forecast range, as long as any genuine model drift can be discarded.

Adaptive MOS methods

Every few years, NWP models undergo significant changes that make the MOS regression analysis obsolete.  There are, however, techniques whereby the MOS can be updated on a regular (monthly or quarterly) basis, although this does not completely eliminate the drawback of historic inertia.  Alternatively, adaptive methods have increasingly come into use.  Here the coefficients X1 and X2 in the error equation are constantly updated in the light of daily verification (Persson, 1991).
Fig12.B.25 shows forecasts and observations for the location with severe systematic 2m temperature errors depicted in Fig12.B.24.  It is not a case of "plain bias" but of "conditional bias", since mild forecasts are less in error than cold forecasts.  A simple mean-error correction would therefore not be optimal.

Fig12.B.25: Adaptive Kalman filtering of 2-metre temperature forecasts for Tromsö in northern Norway during winter 2012. The forecasts are too cold and over-variable, both of which are remedied by X1 and X2 in a 2-parameter error equation.

By a daily verification, the Kalman filter estimates the coefficients X1 and X2 in the error equation:

Err = X1 + X2·Tfc

where Tfc is the verified forecast.  The coefficients are updated from a variational principle of "least effort", whereby the equation line is translated (by modifying X1) and rotated (by modifying X2), so that it takes the verification into account, considering the uncertainties in the verification and the coefficients (see Fig12.B.26).

Fig12.B.26: A schematic illustration of the workings of an adaptable MOS by Kalman filtering. At a given time the error equation has a certain orientation (full red line) with a certain estimated uncertainty (red dashed lines). A forecast is verified and yields an error (red filled circle) that does not normally fall on the error line. Depending on the interplay between the equation uncertainty and the verification uncertainty (dashed red circle), the error equation line is translated and rotated to take the new information into account, after which this information is discarded.
Note that the system keeps information only about the error equation and its uncertainty and the last, not yet verified forecast. When the forecast is verified and the verification has affected the error equation, the verifying observation is discarded.
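The adaptive update can be illustrated with a standard two-parameter Kalman filter whose state is the coefficient pair (X1, X2) of the error equation Err = X1 + X2·Tfc.  The noise variances below are arbitrary illustrative choices, not the operational settings; a real implementation would tune them carefully.

```python
import numpy as np

class AdaptiveErrorEquation:
    """Kalman filter for the coefficients (X1, X2) of Err = X1 + X2 * Tfc."""

    def __init__(self, obs_var=1.0, drift_var=1e-3):
        self.x = np.zeros(2)                 # current estimate of (X1, X2)
        self.P = np.eye(2) * 10.0            # uncertainty of that estimate
        self.R = obs_var                     # verification (observation) error variance
        self.Q = np.eye(2) * drift_var       # allows the coefficients to evolve slowly

    def update(self, t_forecast, error):
        """Translate/rotate the error equation using one verified forecast."""
        H = np.array([1.0, t_forecast])      # Err = X1 + X2 * Tfc
        self.P = self.P + self.Q             # prediction step: coefficients may have drifted
        S = H @ self.P @ H + self.R          # innovation variance
        K = self.P @ H / S                   # Kalman gain
        self.x = self.x + K * (error - H @ self.x)
        self.P = self.P - np.outer(K, H) @ self.P
        return self.x

    def correct(self, t_forecast):
        """Apply the current error equation to a new forecast."""
        return t_forecast - (self.x[0] + self.x[1] * t_forecast)

# Example: forecasts that are too cold, increasingly so for colder temperatures
rng = np.random.default_rng(6)
kf = AdaptiveErrorEquation()
for _ in range(300):
    t_fc = rng.uniform(-20.0, 5.0)
    t_obs = 2.0 + 0.8 * t_fc + rng.normal(0.0, 1.0)     # "truth" with conditional bias
    kf.update(t_fc, t_fc - t_obs)                        # error = forecast minus observation
print("estimated (X1, X2):", np.round(kf.x, 2))
print("corrected -15 degC forecast:", round(kf.correct(-15.0), 1))
```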

Fig12.B.27 shows an ensemble forecast for the same location with severe systematic 2m temperature errors.

Fig12.B.27: A plume diagram for Tromsö, 12 February 2012. The forecast is too cold with 50-100% probabilities of temperatures < -15°C.

The two-dimensional error equation is able to apply corrections which are different for different forecast temperatures and thus take the flow dependence into account to some degree.  The error equation is applied to all ensemble members at all ranges, assuming no significant model drift (see Fig12.B.28).

Fig12.B.28: The same as Fig12.B.27 but after the Kalman-filtered error equation has been applied.  Mild forecasts have hardly been modified, whereas cold ones have been substantially warmed, leading to less spread and more realistic probabilities (e.g. 0% probability of 2m temperature < -15°C).

A two- or multi-dimensional error equation is able not only to correct for mean errors, but also systematic over- and under-forecasting of the variability, thereby providing realistic probabilities.

Ensemble Model Output Statistics (EMOS)

This is a technique in which the model output statistics are optimised against the Continuous Ranked Probability Score of the ensemble forecast as a whole, rather than against the errors of a single deterministic forecast as in conventional MOS.

