
Introduction

The distinction between validation (forecast system’s characteristics) and verification (forecast system’s predictive skill) is as relevant in probabilistic as in deterministic forecasting.  Some statistical concepts to facilitate the use and interpretation of probabilistic medium-range forecasts - ensemble forecasts - are given below.  As with the deterministic forecasting system, probability verification can address the accuracy (how close the forecast probabilities are to the observed frequencies), the skill (how the probability forecasts compare with some reference system) and utility (the economic or other advantages of the probability forecasts).

...

When the forecast probabilities agree with the observed frequency of events for each particular probability, the distribution should lie along the 45° diagonal.  In such a case the probability forecasts are considered reliable.

Sharpness

Sharpness is the ability of a probabilistic forecast system to spread away from the climatological average.  Climatological probability averages used as forecasts give perfect reliability, since the distribution would be exactly on the 45° diagonal.  They would not, however, be very useful, since most probabilities cluster towards the climatological mean (Fig12.B.3).  Ideally, we want the forecast system to be mainly reliable, but also to span as wide a probability interval as possible, with as many forecasts as possible away from the climatological average and as close to 0% and 100% as possible (Fig12.B.4).

...

Improvements in probability forecasts, provided they are reliable, will be accompanied by improved sharpness until, ultimately, only 0% and 100% forecasts are issued and verify, corresponding to a perfect deterministic forecast system.  However, an improvement in sharpness does not necessarily mean that the forecast system has improved.

Under- and Over-confident Probability Forecasts

Most probabilistic forecast systems, both subjective and objective, tend to give distributions flatter than the 45° diagonal.  This means that low risks are underestimated and high risks overestimated - the forecast system is overconfident.  Fig12.B.5 shows an example of overconfidence.

...

Under- or overconfidence can be corrected by the calibration of probabilities.

Reliability - Calibration of Probabilities

For operational purposes the reliability can be improved by calibration using verification statistics (e.g. if it is found that in cases when 0% has been forecast, the event tends to occur in 30% of the cases, and when 100% has been forecast, the event tends to occur only in 70% of the cases). If the misfit is linearly distributed in between these two extremes, the reliability can be made perfect by calibration - but at the expense of reduced sharpness, since very low and very high probabilities are never forecast (see Fig12.B.7).
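
As an illustration, a minimal sketch of such a linear calibration, assuming the 30%/70% end points from the example above (illustrative only, not the operational procedure):

```python
import numpy as np

def linear_calibration(p_forecast, freq_at_0=0.30, freq_at_1=0.70):
    """Map raw forecast probabilities onto observed frequencies, assuming the
    misfit is linear between the two verified end points (here the example
    values from the text: the event occurs 30% of the time when 0% is forecast
    and 70% of the time when 100% is forecast)."""
    return freq_at_0 + (freq_at_1 - freq_at_0) * np.asarray(p_forecast, dtype=float)

raw = np.array([0.0, 0.3, 0.7, 1.0])
print(linear_calibration(raw))   # [0.3  0.42 0.58 0.7 ] - reliable, but less sharp
```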

...

  • a forecast of a 30% probability of rain may be taken to mean rain is less likely than suggested by the model (perhaps towards 0%),
  • a forecast of a 70% probability of rain may be taken to mean rain is more likely than suggested by the model (perhaps towards 100%).


Resolution

Resolution is the degree to which the forecasts can discriminate between more or less probable events.  It should not be confused with sharpness which is the tendency to have predictions close to 0% and 100%.  Resolution and Sharpness are independent of one another.

The same “uncertainty” can be understood from the familiar fact that it is easier to predict the outcome of tossing a coin if it is heavily biased.  In the same way, if it rains frequently in a region and rarely stays dry, forecasting rain can be said to be “easier” than if rain and dry events occur equally often.  The uncertainty is purely dependent on the observations, just like the Aa term in the RMSE decomposition.  It is also the Brier Score of the sample climatology forecast and plays the same role with the Brier Score as the Aa term with the RMSE (see Forecast Error Baseline).  Comparisons of Brier Scores for different forecast samples can only be made if the uncertainty is the same.

Rank Histogram (Talagrand Diagram)

A more detailed way of validating the spread is by a rank histogram (sometimes called a Talagrand diagram).  It is constructed from the notion that in an ideal ensemble system the verifying analysis is equally likely to lie in any “bin” defined by any two ordered adjacent members, including when the analysis is outside the ensemble range on either side of the distribution.  This can be understood by induction, if we consider an ideal ensemble with one, two or three members:
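
A minimal sketch of how a rank histogram could be built from an archive of ensemble forecasts and verifying values (the data here are synthetic and illustrative only; they simply demonstrate that a statistically consistent ensemble gives a roughly flat histogram):

```python
import numpy as np

def rank_histogram(ens_forecasts, verifying_values):
    """ens_forecasts: array (n_cases, n_members); verifying_values: array (n_cases,).
    For each case, count how many ordered members lie below the verifying value;
    with N members this defines N+1 bins, including the two bins outside the
    ensemble range."""
    n_cases, n_members = ens_forecasts.shape
    counts = np.zeros(n_members + 1, dtype=int)
    for members, verif in zip(ens_forecasts, verifying_values):
        rank = np.sum(np.sort(members) < verif)   # bin index 0 .. n_members
        counts[rank] += 1
    return counts

# Synthetic demonstration: members and verification drawn from the same distribution.
rng = np.random.default_rng(0)
signal = rng.normal(size=(5000, 1))
verif = signal[:, 0] + rng.normal(size=5000)
ens = signal + rng.normal(size=(5000, 10))
print(rank_histogram(ens, verif))   # roughly equal counts in each of the 11 bins
```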

...

Fig12.B.9: An observation (a filled circle) and its uncertainty assumed symmetric (the arrows).  The forecast bins widen their intervals away from the centre (the mean of the distribution, here off the diagram to the right), so an observation is more likely, for random reasons, to fall into an outer and wider bin than an inner and narrower one.



Verification Measures

In contrast to deterministic forecasts, an individual probabilistic forecast can never be “right” or “wrong”, except when 0% or 100% has been stated.  Probability forecasts can therefore only be verified from large samples of forecasts.

Brier Score (BS) is a measure, over a large sample, of the correspondence between each forecast probability and the frequency of occurrence of the verifying observations (e.g. on average, when rain is forecast with probability p, it should occur with the same frequency p).  Observation frequency is plotted against forecast probability as a graph.  A perfect correspondence means the graph will lie upon the diagonal; the departure of the graph from the diagonal contributes to the Brier Score, whose values lie between 0 (perfect) and 1 (consistently wrong).


Brier Score (BS)

The most common verification method for probabilistic forecasts is the Brier score (BS) or the Mean Square Error of Probability Forecasts.  It has a mathematical structure similar to the Mean Square Error (MSE).

...

BS measures the difference between the forecast probability of an event (p), and its occurrence (o) expressed as 0 (event did not occur) or 1 (event did occur).  As with RMSE, the BS is negatively orientated (i.e. the lower, the better).

Brier Score (BS)

The Brier Score (BS) measures the mean squared probability error of the forecast against a reference dataset (observations, or analyses, or climatology); a minimal computational sketch is given after the list below.

...

  • BS = 0 the forecast is wholly accurate;
  • BS = 1 the forecast is wholly inaccurate.
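
For illustration, a minimal sketch of the calculation with invented forecast probabilities and outcomes:

```python
import numpy as np

def brier_score(p, o):
    """Mean squared difference between the forecast probability p (0..1) and
    the binary outcome o (1 = event occurred, 0 = event did not occur)."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    return np.mean((p - o) ** 2)

p = np.array([0.9, 0.7, 0.2, 0.1])   # forecast probabilities of the event
o = np.array([1, 1, 0, 1])           # what actually happened
print(brier_score(p, o))             # 0.2375 - the lower the better, 0 is perfect
```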

Brier Skill Score (BSS)

The Brier Skill Score (BSS) measures the relative skill of the forecast compared to the skill of using climatology.

...

Similar to the MSE, the BS can be decomposed into three terms; the most often quoted decomposition was suggested by Allan Murphy (1973, 1986), who used “binned” probabilities:

BS = (1/N) Σk nk (pk - ōk)² - (1/N) Σk nk (ōk - ō)² + ō (1 - ō)

where nk is the number of forecasts in probability category k, pk is the forecast probability of that category, ōk is the observed frequency of the event when pk was forecast, ō is the sample climatological frequency and N is the total number of forecasts.

  • The first term measures the reliability (i.e. how much the forecast probabilities can be taken at face value).  On the reliability diagram this is the nk weighted sum of the distance (vertical or horizontal) between each point and the 45° diagonal (see Fig12.B.10).
  • The second term measures the resolution (i.e. how much the predicted probabilities differ from the climatological average and therefore contribute information).  On the reliability diagram this is the weighted sum of the distances to a horizontal line defined by the climatological probability reference (see Fig12.B.10).
  • The third term measures the uncertainty (i.e. the variance of the observations).  It takes its highest, most “uncertain”, value when ō = 0.5 (see Fig12.B.11).  A small numerical sketch of the decomposition follows this list.
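
The sketch below illustrates the decomposition with synthetic forecasts constructed to be perfectly reliable, so the reliability term comes out close to zero (the binning and data are illustrative only):

```python
import numpy as np

def brier_decomposition(p, o, bins=np.linspace(0.0, 1.0, 11)):
    """Murphy decomposition of the Brier Score into
    reliability - resolution + uncertainty, using binned probabilities."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    n, obar = len(p), o.mean()
    idx = np.digitize(p, bins[1:-1])               # probability category of each forecast
    reliability = resolution = 0.0
    for k in np.unique(idx):
        in_k = idx == k
        n_k = in_k.sum()                           # number of forecasts in category k
        p_k = p[in_k].mean()                       # forecast probability of the category
        o_k = o[in_k].mean()                       # observed frequency for that category
        reliability += n_k * (p_k - o_k) ** 2
        resolution += n_k * (o_k - obar) ** 2
    uncertainty = obar * (1.0 - obar)
    return reliability / n, resolution / n, uncertainty

rng = np.random.default_rng(1)
p = rng.random(10000)
o = (rng.random(10000) < p).astype(float)           # synthetic, perfectly reliable forecasts
rel, res, unc = brier_decomposition(p, o)
print(round(rel, 4), round(res, 4), round(unc, 4))  # BS is approximately rel - res + unc
```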

Fig12.B.10: Summary of Allan Murphy’s reliability and resolution terms.


Fig12.B.11: Uncertainty is at its maximum for a climatological observed probability average of 50%.


The Brier score is a “proper” score

The Brier score is strictly “proper” (i.e. it encourages forecasters to really try to find out the probability, without thinking about whether the forecast value is “tactical” or not).  Indeed, if forecasters deviate from their true beliefs, the BS will “punish” them!  This sounds strange.  How can an abstract mathematical equation know someone’s inner beliefs?

Assume forecasters honestly think the probability of an event is p but have, for misguided “tactical” reasons, instead stated r.  If the event occurs, the contribution to the BS (first term) is (1 - r)² weighted by the probability for the outcome to occur.  If the event does not occur the contribution to the BS (first term) is (r - 0)² weighted by the probability for the outcome not to occur.  For these weightings, the “honest” probability must be used: p when the event occurs, (1 - p) when the event does not occur.  This is where the forecaster’s true beliefs are revealed!

The expected contribution to the BS is therefore:

E(r) = p (1 - r)² + (1 - p) r²

Differentiating with respect to r yields

dE/dr = -2 p (1 - r) + 2 (1 - p) r = 2 (r - p)

with a minimum for r = p.  Therefore, to minimize the expected contribution to the Brier Score, the honestly believed probability value should be used.

The Brier Skill Score (BSS)

A Brier Skill Score (BSS) is conventionally defined as the relative probability score compared with the probability score of a reference forecast, BSS = 1 - BSforecast / BSreference, where the reference is usually the sample climatology.

“Uncertainty” plays no role in the BSS.


Rank Probability Scores (RPS)

Probabilities often refer to the risk that some threshold might be exceeded, for example that the precipitation >1 mm/12hr or that the wind >15 m/s.  However, when evaluating a probabilistic system, there is no reason why these particular thresholds should be especially significant.  For the Rank Probability Score (RPS) the BS is calculated for different (one-sided) discrete thresholds and then averaged over all thresholds.

Ranked Probability Score (RPS)

The discrete Ranked Probability Score (RPS) measures the deviation between the forecast probabilities of values falling within each category and the corresponding observed occurrence within that category (e.g. tercile, quintile, etc.).  The words "discrete" and "ranked" refer to the discrete nature of the categories.  A minimal computational sketch is given after the list below.

...

  • RPS = 0 the forecast is wholly accurate;
  • RPS = 1 the forecast is wholly inaccurate.
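
The sketch uses three categories, e.g. terciles, with invented category probabilities; note that definitions of RPS differ in whether they divide by the number of categories minus one, which is done here so that the score lies between 0 and 1:

```python
import numpy as np

def rps(category_probs, obs_category):
    """Ranked Probability Score for a single forecast.
    category_probs: forecast probabilities of each ordered category (summing to 1).
    obs_category:   index of the category the observation actually fell into.
    Compares the cumulative forecast and observed distributions, normalised by
    (number of categories - 1) so that 0 = perfect and 1 = wholly inaccurate."""
    p = np.asarray(category_probs, dtype=float)
    o = np.zeros_like(p)
    o[obs_category] = 1.0
    return np.sum((np.cumsum(p) - np.cumsum(o)) ** 2) / (len(p) - 1)

# Tercile example with the observation in the upper tercile:
print(rps([0.6, 0.3, 0.1], obs_category=2))   # 0.585 - most probability in the wrong tercile
print(rps([0.1, 0.2, 0.7], obs_category=2))   # 0.05  - probability concentrated correctly
```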

Ranked Probability Skill Score (de-biased) (RPSS-D)

Ranked Probability Skill Score (RPSS) is evaluated by calculating the function RPSS = 1 − RPSforecast / RPSreference where:

...

Weigel, A.P., D. Baggenstos, M. A. Liniger, F. Vitart, and F. Appenzeller, 2008: Probabilistic verification of monthly temperature forecasts. Mon. Wea. Rev., 136, 5162‐5182.


Continuous Ranked Probability Scores (CRPS)

Continuous Ranked Probability Score (CRPS)

A generalisation of the Ranked Probability Score (RPS) is the Continuous Ranked Probability Score (CRPS), where the thresholds are continuous rather than discrete (see Nurmi, 2003; Jolliffe and Stephenson, 2003; Wilks, 2006).  The Continuous Ranked Probability Score (CRPS) is a measure of how good forecasts are in matching observed outcomes.  Where:

...

CRPS is calculated by comparing the Cumulative Distribution Functions (CDF) for the forecast against a reference dataset (observations, or analyses, or climatology) over a given period.
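
A minimal sketch of the CRPS of a single ensemble forecast against one observation, using the empirical ensemble CDF (illustrative only; the operational computation, e.g. following Hersbach (2000), is more elaborate):

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast for one observation, using the identity
    CRPS = E|X - y| - 0.5 * E|X - X'|, where X and X' are independent draws
    from the ensemble (its empirical CDF) and y is the observation."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

members = np.array([11.2, 12.0, 12.4, 13.1, 13.9])   # e.g. 2m temperature (degC)
print(crps_ensemble(members, obs=12.5))              # small: observation inside a sharp ensemble
print(crps_ensemble(members, obs=16.0))              # larger: observation missed by the ensemble
```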

Continuous Ranked Probability Skill Score (CRPSS)

The Continuous Ranked Probability Skill Score (CRPSS) measures the skill of the forecast CRPS relative to that of a reference forecast (e.g. climatology).  Where:

...

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559-570.

Measurement of model performance

It is useful to have some measures of the current performance and biases of the IFS.  Users can assess from Reliability and ROC diagrams whether the forecast model is:

  • effective in capturing an event (e.g. upper tercile rainfall),

  • tending to over- or under-forecast with different probabilities of an event,

  • tending to forecast events that actually happen while minimising those that don't.

The Relative Operating Characteristics (ROC) diagram

A powerful way to verify probability forecasts, and in particular to compare their performance with deterministic forecast systems, is the two-dimensional “Relative Operating Characteristics” or “ROC” diagram.  The probability forecasts are converted into categorical (yes/no) forecasts by applying a sequence of probability thresholds; these categorical forecasts produce a set of pairs of “Hit Rate” and “False Alarm Rate” values to be entered into the ROC diagram: False Alarm Rate (FAR) on the x-axis and Hit Rate (HR) on the y-axis (derived from the Contingency Table).  The upper left corner of the ROC diagram represents a perfect forecast system (no false alarms, only hits).  The closer any verification is to this upper left corner, the higher the skill.  The lower left corner (no false alarms, no hits) represents a system which never warns of an event.  The upper right corner represents a system where the event is always warned for (see Fig12.B.12).  A minimal sketch of how HR, FAR and the ROC area can be computed is given after the list below.

...

  • arches towards the top left of the diagram, then the model is effective in forecasting events that occur without warning of events that don't.
  • follows the diagonal, then the model forecasts as many events that occur as it warns of events that don't.
  • lies below the diagonal, then the model forecasts few events that occur while mostly warning of events that don't.
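
In this sketch the probability forecasts and outcomes are synthetic and purely illustrative:

```python
import numpy as np

def roc_points(p, o, thresholds=np.linspace(0.0, 1.0, 11)):
    """For each probability threshold, issue a categorical warning when
    p >= threshold and compute Hit Rate and False Alarm Rate from the
    resulting contingency table."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=bool)
    far, hr = [], []
    for t in thresholds:
        warn = p >= t
        hr.append(np.sum(warn & o) / max(np.sum(o), 1))       # hits / observed events
        far.append(np.sum(warn & ~o) / max(np.sum(~o), 1))    # false alarms / non-events
    return np.array(far), np.array(hr)

rng = np.random.default_rng(2)
p = rng.random(5000)
o = rng.random(5000) < p                   # synthetic, reliable probability forecasts
far, hr = roc_points(p, o)
x, y = far[::-1], hr[::-1]                 # order the points by increasing FAR
roc_area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0)   # trapezoidal area under the curve
print(round(roc_area, 3))                  # roughly 0.8 here; 1 = perfect, 0.5 = no skill
```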

Effect of the distribution of forecast probabilities.

The distribution of forecast probabilities gives an indication of the tendency of the forecast towards uncertainty.  These are plotted as a histogram to give an indication of confidence in model performance:

...

Conversely, in Fig12.B.15 the majority of probabilities lie between 0.2 and 0.5 and reliability within this range appears fairly good, while there is much less confidence in model performance for over- or under-forecasting an event.  This is as expected as the forecast range becomes longer.

Reliability and ROC diagrams (presented on ECMWF web)

It is useful to have some measures of the current performance and biases of the IFS.  Users can assess from Reliability and ROC diagrams whether the forecast model is:

...

The ROC score is the area beneath the curve on the ROC diagram and lies between 1 (perfect capture of events) and 0 (consistently warning of events that don't happen); an area of 0.5 indicates no more skill than chance.  Fig12.B.14 shows high effectiveness in forecasting events (ROC score 0.859) while Fig12.B.15 shows reduced effectiveness (ROC score 0.593).  This is as expected as the forecast range becomes longer.

Cost/Loss Diagrams and Economic Value 

The cost-loss model is a simple model of economic decision-making that demonstrates some of the differences between skill and value.  It shows how different users can get contrasting benefits from the same forecast and how appropriate use of probability forecasts can improve the user's decisions.  The user decides whether or not to take mitigating action in advance of a potential adverse weather event; this can be based on either deterministic or probabilistic forecast information.  Forecasts can only have economic benefits (or value) if, as a consequence of the forecast, a user takes a course of action that would otherwise not have been taken.

...

It is useful to understand the potential value to the user of a forecast technique (subjective, deterministic or EPS).  Calculating the value of a system is based on past performance and gives some idea of the usefulness of a forecast technique in making decisions, particularly when forecasting high or low probability events.

The concept of Cost/Loss.

Consider a weather-related event and the action to be taken to lessen the impact.

...

The cost/loss ratio is the ratio of the cost of taking action (e.g. buying insurance against a forecast event) to the potential loss should an event occur.  If the cost of protection is greater than the potential loss (i.e. C/L > 1) there is no benefit in taking protective action; in the trivial case where the cost of protection is zero, protective action should always be taken.  Thus the Cost/Loss ratio (C/L) need only be considered within the range 0.0 to 1.0.


The concept of Mean Expense.

Forecasts have economic benefits (or value) if, as a consequence of the forecast, a user takes a course of action that would otherwise not have been taken.  Either action is taken (incurring a cost but no loss if the event occurs) or action is not taken (incurring no cost but a loss if the event occurs).

...

The mean expense per unit loss (ME / L) can be plotted against Cost/Loss ratio C/L on an expense diagram (see Fig12.B.16).


Strategies to minimise expense.

Using climatology alone.

Climatology gives the proportion ρc of occasions that a given event occurs (e.g. rainfall greater than 5 mm/hr occurs at the location on 20% of occasions, i.e. ρc = 0.2).  If the decision to take an action is based upon climatology alone there are just two options (a small numerical sketch follows the list below):

...

  • Cost/Loss ratio is below the climatological probability: the Mean Expense per unit loss equates to the Cost/Loss ratio.
  • Cost/Loss ratio is above the climatological probability: the Mean Expense per unit loss remains constant at the value attained at the climatological probability.
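
In this sketch, p_clim stands for the climatological probability ρc and the Cost/Loss values are invented for illustration; the perfect-forecast strategy (introduced in the next subsection) is included for comparison:

```python
import numpy as np

def mean_expense_per_unit_loss(cost_loss, p_clim):
    """Mean expense per unit loss for the two reference strategies:
    - climatology only: always protect (expense C/L) or never protect (expense
      p_clim), whichever is cheaper on average;
    - perfect forecast: protect only when the event occurs (expense p_clim * C/L)."""
    cl = np.asarray(cost_loss, dtype=float)
    climate_only = np.minimum(cl, p_clim)
    perfect = p_clim * cl
    return climate_only, perfect

cl = np.array([0.05, 0.2, 0.5, 0.9])
climate_only, perfect = mean_expense_per_unit_loss(cl, p_clim=0.2)
print(climate_only)   # [0.05 0.2  0.2  0.2 ] - follows C/L up to p_clim, then constant
print(perfect)        # [0.01 0.04 0.1  0.18]
```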

Using a Perfect Forecast.

A perfect forecast gives perfect knowledge of future conditions – it is never wrong.  Thus action is taken:

...

  • climatological probability of a given event (red line).
  • using a Perfect Forecast (dashed line).

Using deterministic forecasts.

Any practical forecasting system will not achieve a perfect forecast, but nevertheless should show an improvement over using climatology alone.  The reduction in mean expense by using a deterministic forecast over climatology compared with that by using a perfect forecast over climatology may be used as a measure of value.  Define the Value V as the reduction in mean expense (ME) by use of a forecast system as a proportion of the reduction that would be achieved by use of a perfect forecast. Thus:

  • a maximum value V = 1 is obtained from a perfect forecast.
  • a minimum value V = 0 is obtained from a forecast based on climate alone.
  • a value V > 0 indicates that the user will benefit by using the deterministic forecast.


Skill and Value for a deterministic system.

Measures of the performance of a forecast include the Hit Rate (HR), False Alarm Rate (FAR) and the Peirce Skill Score (PSS).  But these give little information on the value of the forecast.  It can be shown that ME is a function of HR and FAR, as well as ρc and C/L.  For a given weather event and forecast system, the values of ρc, HR and FAR are known characteristics, and therefore the economic value V depends only upon the Cost/Loss ratio (C/L).
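
A minimal sketch of the economic value V for a deterministic forecast characterised by HR and FAR; the mean-expense bookkeeping follows the cost-loss reasoning above, and the HR, FAR and ρc values are invented for illustration:

```python
import numpy as np

def economic_value(cost_loss, p_clim, hit_rate, false_alarm_rate):
    """V = (ME_climate - ME_forecast) / (ME_climate - ME_perfect), with all
    mean expenses (ME) expressed per unit loss."""
    cl = np.asarray(cost_loss, dtype=float)
    me_forecast = (false_alarm_rate * (1 - p_clim) * cl   # protect, but no event
                   + hit_rate * p_clim * cl               # protect, event occurs
                   + (1 - hit_rate) * p_clim)             # missed event -> full loss
    me_climate = np.minimum(cl, p_clim)
    me_perfect = p_clim * cl
    return (me_climate - me_forecast) / (me_climate - me_perfect)

cl = np.linspace(0.05, 0.95, 10)
print(np.round(economic_value(cl, p_clim=0.2, hit_rate=0.8, false_alarm_rate=0.1), 2))
# V is greatest for users whose C/L is close to the climatological probability of the event
```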

...

In Fig12.B.19, note how the more extreme events (in this theoretical case, temperature anomalies greater than 8 deg, shown green or blue) have a lower climatological probability (are more unusual) and therefore the maximum potential Value occurs when C/L is low.  The less extreme events (temperature anomalies greater than 4 deg, shown red, pale blue) occur more often (higher climatological probability) and thus the potential Value that can be delivered by a forecast peaks at higher C/L ratios.

Using ensemble forecasts.

Ensemble forecasts deliver a probability ρe that an event will occur.  But at which probability threshold of an event ρ* should action be taken?  Should action be taken if the event is forecast with a moderate probability (say ρe = 20%) or should this be delayed until the forecast shows more confidence (perhaps ρe = 60%)?  And is there an optimum probability ρ* of an event (for each individual user) above which action should be taken?
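
One way to explore these questions is to compute the value curve separately for each probability threshold ρ* and, for every Cost/Loss ratio, keep the best threshold; the black envelope in Fig12.B.21 is constructed in this way.  A minimal sketch follows, in which the (HR, FAR) pair for each threshold is invented for illustration; in practice these come from verification statistics:

```python
import numpy as np

def economic_value(cl, p_clim, hr, far):
    """Economic value per unit loss (same expression as in the earlier sketch)."""
    cl = np.asarray(cl, dtype=float)
    me_forecast = far * (1 - p_clim) * cl + hr * p_clim * cl + (1 - hr) * p_clim
    return (np.minimum(cl, p_clim) - me_forecast) / (np.minimum(cl, p_clim) - p_clim * cl)

# Invented (HR, FAR) pairs for three probability thresholds p*: a low threshold
# warns often (high HR, high FAR), a high threshold warns rarely (low HR, low FAR).
thresholds = {0.1: (0.95, 0.40), 0.5: (0.75, 0.10), 0.9: (0.40, 0.02)}
cl = np.linspace(0.05, 0.95, 19)
curves = {t: economic_value(cl, 0.2, hr, far) for t, (hr, far) in thresholds.items()}
envelope = np.max(np.vstack(list(curves.values())), axis=0)   # best threshold for each C/L
print(np.round(envelope, 2))   # low p* is best for low-C/L users, high p* for high-C/L users
```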

...

Fig12.B.21: Values of EPS plotted for different EPS probability thresholds (ρ*).  The Values derived from curves such as those in Fig12.B.20 are plotted against Cost/Loss (ρ* = 0.1 in green, ρ* = 0.9 in red, other sample ρ* in pale blue).  The envelope of maximum Values using all EPS probability thresholds (ρ*) is shown in black.


Skill and Value for an ensemble system.

Diagrams Fig12.B.20 and Fig12.B.21 show:

...

Further reading: ECMWF Newsletter Number 80 – Summer 1998. https://www.ecmwf.int/sites/default/files/elibrary/1998/14644-newsletter-no80-summer-1998.pdf



Statistical Post-processing – Model Output Statistics (MOS)

An efficient way to improve the ensemble forecast, both the EM and the probabilities, is by Statistical Post-Processing (SPP), which is an advanced form of calibration of the output from the deterministic ensemble members.  The most commonly used SPP method is “model output statistics” (MOS).


The MOS equation

Deterministic NWP forecasts are statistically matched against a long record of verifying observations through a linear regression scheme.  The predictand (Y) is normally a scalar (for example 2m temperature) and the predictors (X) one or several forecast parameters, selected by a linear regression system using the parameters which provide the most information (e.g. forecasts of 2m temperature use 850hPa temperature, 500hPa geopotential etc.):

...

where X1 and X2 have been estimated from a large amount of representative historical material.  This is often quite an effective correction equation, since the errors in many meteorological forecast parameters, to a first approximation, tend to be linearly dependent on the forecast itself (except perhaps precipitation and cloudiness).
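
A minimal sketch of estimating and applying such an equation for a single predictand and a single predictor (the data are synthetic and the numbers invented; the operational MOS uses a long historical record and several predictors):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "history": forecasts that are 2 degC too cold and 30% too variable.
truth = rng.normal(10.0, 4.0, size=2000)                    # verifying 2m temperature
forecast = 1.3 * (truth - 10.0) + 10.0 - 2.0 + rng.normal(0.0, 1.5, size=2000)

# MOS equation Y = X1 + X2 * F: X1 corrects the mean error, X2 the variability.
X2, X1 = np.polyfit(forecast, truth, 1)                     # slope first, then intercept
print(round(X1, 2), round(X2, 2))

new_forecast = 5.0                                          # a new raw model forecast
print(round(X1 + X2 * new_forecast, 2))                     # the calibrated MOS forecast
```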


Simultaneous Corrections of Mean Error and Variability

The MOS equation not only minimizes the RMSE, it also corrects simultaneously for both systematic mean errors and for the variability.  In the above equation X1 represents the mean error correction and X2 the variability correction.  There is therefore no necessity to apply two different schemes, one for reducing the systematic error (“bias”) and one for correcting the spread.


Short-range MOS

However, the corrections imposed by MOS have different emphases in the short and medium range.  In the short range, where most synoptic features are forecast with realistic variability, the MOS equation mainly corrects true systematic errors and representativeness errors.

...

The scatter diagram in Fig12.B.24 depicts the errors at D+1 and therefore shows true systematic errors: the colder the forecast, the larger the mean error, which is equivalent to over-forecast variability.

Medium-range MOS

MOS also improves forecasts in the medium range but, with increasing forecast range, less and less of this improvement is due to the MOS equation’s ability to remove systematic errors.  In the medium range, the dominant errors are non-systematic.  These non-systematic errors (e.g. false model climate drift) can appear as false systematic errors (e.g. see Fig12.A.4).  They will thus be “corrected” by the MOS in the same way as true systematic errors.  By this means MOS is essentially dampening the forecast anomalies and thereby minimizing the RMSE.  This might be justified in a purely deterministic context but not in an ensemble context, where the most skilful damping of less predictable anomalies is achieved by ensemble averaging through the EM.  It is therefore recommended that MOS equations are calculated in the short range, typically at D+1, based on forecasts from the CTRL, and then applied to all the members in the ensemble throughout the whole forecast range, as long as any genuine model drift can be discarded.

Adaptive MOS methods

Every few years, NWP models undergo significant changes that make the MOS regression analysis obsolete.  There are, however, techniques whereby the MOS can be updated on a regular (monthly or quarterly) basis, although this does not completely eliminate the drawback of historic inertia.  Alternatively, adaptive methods have increasingly come into use.  Here the coefficients X1 and X2 in the error equation are constantly updated in the light of daily verification (Persson, 1991).
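
A minimal sketch of the idea behind an adaptive scheme: nudge the two coefficients a little after each day's verification so the correction slowly follows the model.  This is only an illustration of the principle, not the Persson (1991) algorithm:

```python
import numpy as np

class AdaptiveMOS:
    """Keep Y = X1 + X2 * F up to date by adjusting the coefficients slightly
    towards each new (forecast, observation) pair; alpha sets the memory."""
    def __init__(self, x1=0.0, x2=1.0, alpha=0.1):
        self.x1, self.x2, self.alpha = x1, x2, alpha

    def correct(self, forecast):
        return self.x1 + self.x2 * forecast

    def update(self, forecast, observed):
        error = observed - self.correct(forecast)         # today's verification
        norm = 1.0 + forecast ** 2                        # normalisation keeps the update stable
        self.x1 += self.alpha * error / norm              # adjust the mean-error correction
        self.x2 += self.alpha * error * forecast / norm   # adjust the variability correction

mos = AdaptiveMOS()
rng = np.random.default_rng(4)
for _ in range(5000):                                     # simulated daily update cycle
    truth = rng.normal(10.0, 4.0)
    fc = 1.3 * (truth - 10.0) + 8.0 + rng.normal(0.0, 1.5)
    mos.update(fc, truth)
print(round(mos.x1, 2), round(mos.x2, 2))                 # drifts towards the least-squares MOS fit
```
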
Fig12.B.25 shows forecasts and observations for the location with severe systematic 2m temperature errors depicted in Fig12.B.24.  It is not a case of “plain bias” but of “conditional bias”, since mild forecasts are less in error than cold forecasts.  A simple mean-error correction would therefore not be optimal.

...

A two- or multi-dimensional error equation is able not only to correct for mean errors, but also systematic over- and under-forecasting of the variability, thereby providing realistic probabilities.

Ensemble Model Output Statistics (EMOS)

This is a technique where the model output statistics are optimised for the Continuous Ranked Probability Score of an ensemble of forecasts, rather than for the deterministic forecasts to which conventional model output statistics are applied.
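
A minimal sketch of the idea for a Gaussian predictive distribution: the ensemble mean and variance are mapped to the mean and variance of a normal distribution whose coefficients are chosen to minimise the average CRPS over a training sample.  The closed-form CRPS of a normal distribution is used; the set-up and data are illustrative, not the operational configuration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma) forecast for observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

def fit_emos(ens_mean, ens_var, obs):
    """Fit mu = a + b * ens_mean and sigma^2 = c + d * ens_var by minimising the mean CRPS."""
    def mean_crps(params):
        a, b, c, d = params
        sigma = np.sqrt(np.maximum(c + d * ens_var, 1e-6))
        return np.mean(crps_normal(a + b * ens_mean, sigma, obs))
    return minimize(mean_crps, x0=[0.0, 1.0, 1.0, 1.0], method="Nelder-Mead").x

# Synthetic training data: a biased and under-dispersive ensemble.
rng = np.random.default_rng(5)
signal = rng.normal(10.0, 4.0, size=1000)                             # predictable part
obs = signal + rng.normal(0.0, 2.0, size=1000)                        # truth adds unpredictable noise
ens = signal[:, None] + 1.5 + rng.normal(0.0, 0.8, size=(1000, 10))   # bias 1.5, spread too small
a, b, c, d = fit_emos(ens.mean(axis=1), ens.var(axis=1), obs)
print(np.round([a, b, c, d], 2))   # a, b remove the bias; c, d inflate the too-small spread
```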

...