
Introduction

Some statistical concepts are outlined to facilitate the use and interpretation of deterministic medium-range forecasts.  An NWP system can be evaluated in at least two ways.  Validation measures the realism of the model with respect to its ability to simulate the atmosphere's behaviour, whereas verification measures the system's ability to predict atmospheric states.

Only the most commonly used validation and verification methods will be discussed here, mainly with respect to upper air variables, 2 m temperature and 10 m wind. Verification of binary forecasts will be discussed in relation to utility. For a full presentation the reader is referred to Nurmi (2003), Jolliffe and Stephenson (2003) and Wilks (2006).

Forecast validation

A forecast system that perfectly simulates the behaviour of the atmosphere has the same degree of variability as the atmosphere and no systematic errors.

The Mean Error (ME)

The mean error (ME) of forecasts (f) relative to analyses (a) can be defined as

...
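
In the usual notation, averaging over N forecast-analysis pairs, this can be written as (a sketch consistent with the definitions above):

    \mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N} (f_i - a_i)

A positive ME thus indicates that the forecasts are on average too high (e.g. too warm), a negative ME that they are on average too low.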

Fig12.A.1: A convenient way to differentiate between "unconditional" and "conditional" biases is to plot scatter diagrams, with forecasts vs. forecast error or observations (analyses) vs. forecast error. From the slope and direction of the scatter in these diagrams it is also possible to find out if the forecasts are over- or under-variable. In this case the colder the forecast, the larger the positive error; the warmer the forecast, the larger the negative error. This implies that cold anomalies are not cold enough and warm anomalies not warm enough, i.e. the forecasts are under-variable.

Forecast Variability

The ability of an NWP model to forecast extremes with the same frequency as they occur in the atmosphere is crucial for any ensemble approach, whether lagged, multi-model or EPS. If the model has a tendency to over- or under-forecast certain weather elements, their probabilities will, of course, also be biased.

...

A perfect model has, by definition, no systematic errors; a stable model might have systematic errors, but they do not change their characteristics during the forecast range.  Most state-of-the-art NWP models are fairly stable in the medium range but start to display some model drift, such as gradual cooling or warming, moistening or drying, in the extended ranges.

False Systematic Errors

One of the complexities of interpreting the ME is that apparent systematic errors might, in fact, have a non-systematic origin. If this is the case, a perfect model appears to have systematic errors; a stable model appears to suffer from model drift. This is a reflection of a general statistical artefact, the "regression to the mean" effect.  (The "regression to the mean" effect was first discussed by Francis Galton (1822-1911), who found that tall (short) fathers tended to have tall (short) sons, but on average slightly shorter (taller) than themselves).

...

Fig12.A.4: A scatter diagram of forecasts versus forecast error. Warm forecasts appear too warm, cold forecasts appear too cold. If the forecasts are short range, it is reasonable to infer that the system is over-active, overdeveloping warm and cold anomalies. If, on the other hand, the forecasts are well into the medium range, this might not be the case. Due to decreased forecast skill, predicted anomalies tend to verify against less anomalous observed states.

False Model Climate Drift

This "regression to the mean" effect gives rise to another type of false systematic error. Forecasts produced and verified over a period characterized by, on average, anomalous weather will give a false impression of a model climate drift. A perfect model will produce natural-looking anomalies, independent of lead time, but since the initial state is already anomalous, the forecasts are, with decreasing skill, more likely to become less anomalous than even more anomalous. At a range where there is no longer any predictive skill, the mean error will be equal to the observed mean anomaly with the opposite sign (see Fig12.A.5).

...

The ME can be trusted to reflect the properties of the model's performance only during periods with no or small average anomalies.

Forecast Verification

Objective weather forecast verification can be performed from at least three different perspectives: accuracy (the difference between forecast and verification), skill (comparison with some reference method, such as persistence, climate or an alternative forecast system) and utility (the economic value or political consequences of the forecast). They are all "objective" in the sense that the numerical results are independent of who calculated them, but not necessarily objective with respect to what is considered "good" or "bad". The skill measure depends on a subjective choice of reference and the utility measure depends on the preferences of the end-user. Only the first approach, the accuracy measure, can be said to be fully "objective", but, as seen in 4.3.4, in particular Figure 31 and Figure 32, the purpose of the forecast might influence what is deemed "good" or "bad".

Measures of Accuracy

Root Mean Square Error (RMSE)

where f = forecast value; a = observed value.

This is the most common accuracy measure.  It measures the distance between the forecast and the verifying analysis or observation. The RMSE is negatively orientated (i.e. increasing numerical values indicate increasing "failure").

Mean Absolute Error (MAE)

where f = forecast value; a = observed value.

This is also negatively orientated.  Due to its quadratic nature, the RMSE penalizes large errors more than the non-quadratic MAE and thus takes higher numerical values.  This might be one reason why MAE is sometimes preferred, although the practical consequences of forecast errors are probably better represented by the RMSE. 

Mean Square Error (MSE)

where f = forecast value; a = observed value.

We shall concentrate on the RMSE, or rather the squared version, the mean square error (MSE) which is more convenient to analyse mathematically.
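
For illustration, the accuracy measures above can be computed from paired forecast and verifying values along the following lines (a minimal sketch; the function and variable names are arbitrary):

import numpy as np

def accuracy_measures(f, a):
    """Return ME, MAE, MSE and RMSE for paired forecasts f and verifying values a."""
    err = np.asarray(f, dtype=float) - np.asarray(a, dtype=float)
    me = err.mean()                  # mean error (bias)
    mae = np.abs(err).mean()         # mean absolute error
    mse = (err ** 2).mean()          # mean square error
    rmse = np.sqrt(mse)              # root mean square error
    return me, mae, mse, rmse

# Example with three arbitrary forecast/observation pairs:
print(accuracy_measures([2.0, -1.0, 3.5], [1.0, 0.0, 3.0]))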

The Effect of Mean Analysis and Observation Errors on the RMSE

If the forecasts have a mean error (ME), so that f = fo + ME, where fo is the bias-free forecast, and (fo - a) is uncorrelated with ME, then the MSE is the quadratic sum of the systematic and non-systematic contributions:

...

Any improvement of the NWP output must, therefore, with increasing forecast range, increasingly address the non-systematic errors (e.g. Model Output Statistics MOS).

The Decomposition of MSE

The MSE can be decomposed around c, the climate of the verifying day:

...

Hence the level of forecast accuracy is determined not only by the predictive skill, as reflected in the covariance term, but also by the general variability of the atmosphere, expressed by Aa, and by how well the model simulates this, expressed by Af.
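
In the notation of this section (a sketch, assuming Af and Aa denote the root mean square forecast and observed anomalies about the climate c), the decomposition can be written:

    \mathrm{MSE} = \overline{\left[(f-c)-(a-c)\right]^{2}} = A_f^{2} + A_a^{2} - 2\,\overline{(f-c)(a-c)}

The last term is the covariance term referred to above; the larger it is, the smaller the MSE.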

Forecast Error Baseline

When a climatological average c replaces the forecast (i.e. f = c), the model variability Af and the covariance term become zero and

...

where:

  • ENWP = (fNWP - a), the error of the forecast based on NWP
  • Ehuman = (fhuman - a), the error of the forecast based on the human forecaster and NWP
  • Eclimate = (c - a), the error of a forecast based on climatology
  • a = observed value, c = climatological value

Error Saturation Level (ESL)

Forecast errors do not grow indefinitely but asymptotically approach a maximum, the "Error Saturation Level" (ESL).

...

which is 41% larger than Eclimate, the error when a climatological average is used as a forecast (see Fig12.A.7).  The value Aa√2 is also the ESL for persistence forecasts or guesses based on climatological distributions.
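
The factor √2 follows from the decomposition above (a sketch of the reasoning): at ranges where the predicted and observed anomalies are uncorrelated the covariance term vanishes, and if the model has realistic variability (Af = Aa) then

    \mathrm{MSE} = A_a^{2} + A_a^{2} = 2A_a^{2} \quad\Rightarrow\quad \mathrm{RMSE} = A_a\sqrt{2} \approx 1.41\,A_a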

Measure of Skill - the Anomaly Correlation Coefficient (ACC)

Another way to measure the quality of a forecast system is to calculate the correlation between forecasts and observations. However, correlating forecasts directly with observations or analyses may give misleadingly high values because of seasonal variations.  It is therefore established practice to subtract the climate average from both the forecast and the verification and to verify the forecast and observed anomalies according to the Anomaly Correlation Coefficient (ACC).  In its simplest form the ACC can be written:

...

  • ACC ~0.8 corresponds to the range where there is still skill in large-scale synoptic patterns.
  • ACC=0.6 corresponds to the range down to which there is synoptic skill for the largest scale weather patterns.
  • ACC=0.5 corresponds to forecasts for which the error is the same as for a forecast based on a climatological average (i.e. RMSE = Aa, the accuracy of climatological weather information used as forecasts).
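
In practice, the simple form of the ACC can be computed as the correlation between forecast and observed anomalies with respect to climate; a minimal sketch (array names arbitrary):

import numpy as np

def acc(f, a, c):
    """Anomaly correlation coefficient in its simple (uncentred) form."""
    fa = np.asarray(f, dtype=float) - np.asarray(c, dtype=float)   # forecast anomalies
    oa = np.asarray(a, dtype=float) - np.asarray(c, dtype=float)   # observed/analysed anomalies
    return (fa * oa).sum() / np.sqrt((fa ** 2).sum() * (oa ** 2).sum())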

Interpretation of Verification Statistics

The mathematics of statistics can be relatively simple but the results are often quite difficult to interpret, due to their counter-intuitive nature: what looks "good" might be "bad", what looks "bad" might be "good". As we have seen in A-1.3, seemingly systematic errors can have a non-systematic origin and forecasts verified against analyses can yield results different from those verified against observations. As we will see below, different verification scores can give divergent impressions of forecast quality and, perhaps most paradoxically, improving the realism of an NWP model might give rise to increasing errors.

Interpretation of RMSE and ACC

Both Af and Aa and, consequently, the RMSE vary with geographical area and season.  In the mid-latitudes they display a maximum in winter, when the atmospheric flow is dominated by larger scales and stronger amplitudes, and a minimum in summer, when the scales are smaller and the amplitudes weaker.

...

Comparing RMSE verifications of different models, or of different versions of the same model, is most straightforward when Af = Aa, i.e. when the models have the same general variability as the atmosphere.

Effect of Flow Dependency

Both RMSE and ACC are flow dependent, sometimes in a contradictory way. In non-anomalous conditions (e.g. zonal flow) the ACC can easily take low ("bad") values, while in anomalous regimes (e.g. blocking flow) it can take quite high ("good") values. The opposite is true for RMSE, which can easily take high ("bad") values in meridional or blocked flow regimes and low ("good") values in zonal regimes. Conflicting indications are yet another example of "what looks bad is good", as they reflect different virtues of the forecasts and thereby provide the basis for a more nuanced overall assessment.

The "Double Penalty Effect"

A special case of the flow dependence of the RMSE and ACC is the "double penalty effect", where a bad forecast is "penalised" twice: first for not having a system where there is one and second for having a system where there is none. It can be shown that, if a wave is forecast with a phase error of half a wave length or more, it will score worse in RMSE and ACC than if the wave had not been forecast at all (see Fig12.A.8).
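
A small numerical illustration of the effect (a sketch with an idealised sine wave of unit amplitude, not taken from the original figure): a wave forecast half a wavelength out of phase scores twice the RMSE of not forecasting the wave at all, and its anomaly correlation is strongly negative.

import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
obs = np.sin(x)                  # observed wave (anomaly about a zero climate)
no_wave = np.zeros_like(x)       # the wave is not forecast at all
phase_err = np.sin(x + np.pi)    # the wave is forecast half a wavelength out of phase

rmse = lambda f: np.sqrt(np.mean((f - obs) ** 2))
print(rmse(no_wave))                        # ~0.71 (amplitude / sqrt(2))
print(rmse(phase_err))                      # ~1.41 (twice as large: the "double penalty")
print(np.corrcoef(phase_err, obs)[0, 1])    # ~ -1.0: the ACC is also penalised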

...

The double penalty effect often appears in the late medium range, where phase errors become increasingly common. At this range they will strongly contribute to false systematic errors (see A-1.3).

Subjective Evaluations

Considering the many pitfalls in interpreting objective verification results, purely subjective verifications should not be dismissed.  They might serve as a good balance and check on the interpretation of the objective verifications.  This applies in particular to the verification of extreme events, where the low number of cases makes any statistical verification very difficult or even impossible.

Graphical Representation

The interpretation of RMSE and ACC outlined above may be aided by a graphical vector notation, based on elementary trigonometry.  The equation for the decomposition of the MSE is mathematically identical to the "cosine law".  From this it follows that the cosine of the angle β between the vectors (f-c) and (a-c) corresponds to the ACC (see Fig12.A.9).

...

Fig12.A.9: The relationship between the cosine theorem and the decomposition of the RMSE.

Forecast Errors

When the predicted and observed anomalies are uncorrelated (i.e. there is no skill in the forecast), they are in a geometrical sense orthogonal and the angle β between vectors (a-c) and (f-c) is 90° and the error is on average √2 times the atmospheric variability around climate.

...

Fig12.A.11: When the ACC=50% (i.e. the angle between the anomalies (a-c) and (f-c) =60°), the RMSE=Aa, the atmospheric variability (left).  When ACC>50% the RMSE is smaller.  An ACC=60% is agreed to indicate the limit of useful synoptic forecast skill.

Flow Dependence

The flow dependence of RMSE and ACC is illustrated in Fig12.A.12, for (left) a case of, on average, large anomaly, when a large RMS error is associated with a large ACC (small angle β), and (right) a less anomalous case, when a smaller RMS error is associated with a small ACC (large angle β).

...

If the RMSE is used as the norm, it would in principle be possible, at an extended range, to pick out those "Members of the Day" that are better than the average, just by selecting those members which are less anomalous.  If, however, the ACC is used as the norm, the "Members of the Day" may turn out to be those members which are more anomalous.

Damping of Forecast Anomalies

On average, damping the variability (or jumpiness) of the forecasts reduces forecast error.  It can be shown (see Fig12.A.13) that optimal damping is achieved when the variability is reduced by a proportion equal to cos(β), i.e. the ACC.

...

Fig12.A.13: Damping the forecast variability Af will minimize the RMSE when the error vector becomes orthogonal to the damped forecast vector.  This happens when Af = ACC · Aa and the forecast vector (f - c) varies along a semi-circle with a radius equal to half Aa.
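
The result in the caption can be recovered with a short minimisation (a sketch in the vector picture of this section, with anomaly lengths Af and Aa and angle β between them, cos β = ACC). If the forecast anomaly is scaled by a factor k, then

    \mathrm{MSE}(k) = k^{2}A_f^{2} - 2\,k\,A_f A_a\cos\beta + A_a^{2}

which is minimised for k·Af = Aa·cos β, i.e. a damped forecast variability of ACC · Aa, as stated above.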

Forecast Error Correlation

At an extended forecast range, when there is low skill in the forecast anomalies and weak correlation between them, there is still a fairly high correlation between the forecast errors.  This is because the forecasts are compared with the same analysis.  Consider (see Fig12.A.14) two consecutive forecasts f and g, from the same model or two different models, with errors (f - a) and (g - a). Although the angles between (f - c), (g - c) and (a - c) at an infinite range are 90° and thus the correlations zero, the angle between the errors (f - a) and (g - a) is 60°, which yields a correlation of 50%.  For shorter ranges the correlation decreases because the forecast anomalies are more correlated and the angle between them is <60°. The perturbations in the analyses are constructed to be uncorrelated.

...

Fig12.A.14: A 3-dimensional vector figure to clarify the relation between forecast jumpiness and error.  Two forecasts, f and g, are shown at a range when there is no correlation between the forecast and observed anomalies (f - c), (g - c) and (a - c). The angles between the three vectors are 90°. The angles in the triangle a-f-g are each 60°, which means that there is a 50% correlation between the "jumpiness" (g - f) and the errors (f - a) and (g - a). The same is true for the correlation between (f - a) and (g - a).
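
The 50% figure can be checked algebraically (a sketch, assuming the three anomaly vectors are mutually orthogonal and of equal length A):

    (f-a)\cdot(g-a) = \left[(f-c)-(a-c)\right]\cdot\left[(g-c)-(a-c)\right] = |a-c|^{2} = A^{2}, \qquad |f-a| = |g-a| = A\sqrt{2}

so the correlation is A² / (A√2 · A√2) = 1/2, corresponding to an angle of 60° between the error vectors.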

Forecast Jumpiness and Forecast Skill

From the same Fig12.A.14 it follows that, since the angle between the forecast "jumpiness" (f - g) and the error (f - a) is 60°, the correlation at an infinite range between "jumpiness" and error is 50%. For shorter forecast ranges the correlations decrease because the forecast anomalies become more correlated, with the angle between them <60°.

Combining Forecasts

Combining different forecasts into a "consensus" forecast, either from different models ("the multi-model ensemble") or from the same model ("the lagged average forecast"), normally yields higher forecast accuracy (lower RMSE). The forecasts should be weighted together with respect not only to their average errors but also to the correlation between these errors (see Fig12.A.15).

...

The discussion can be extended to any number of participating forecasts.  In ensemble systems the forecast errors are initially uncorrelated but slowly increase in correlation over the integration period, though never exceeding 50%.
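
As an illustration of weighting two forecasts with respect to both their error variances and the correlation between their errors, the following sketch uses the standard minimum-variance combination of two unbiased estimates (not necessarily the exact scheme shown in Fig12.A.15):

import numpy as np

def combine_two(f, g, var_f, var_g, rho):
    """Minimum-variance combination w*f + (1-w)*g of two unbiased forecasts
    with error variances var_f, var_g and error correlation rho."""
    cov = rho * np.sqrt(var_f * var_g)
    w = (var_g - cov) / (var_f + var_g - 2.0 * cov)
    return w * np.asarray(f) + (1.0 - w) * np.asarray(g)

# Two equally skilful forecasts with uncorrelated errors get equal weight:
print(combine_two(2.0, 3.0, var_f=1.0, var_g=1.0, rho=0.0))   # 2.5

The more strongly the errors are correlated, the less is gained by combining the forecasts.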

Usefulness of Statistical Know-how

Statistical verification is normally associated with forecast product control. Statistical know-how is able not only to assure a correct interpretation but also to help add value to the medium-range NWP output. The interventions and modifications performed by experienced forecasters are to some extent statistical in nature. Modifying or adjusting a short-range NWP forecast in the light of later observations is qualitatively similar to "optimal interpolation" in data assimilation. Correcting for systematic errors is similar to linear regression analysis, and advising end-users in their decision-making involves an understanding of cost-loss analysis. Weather forecasters are not always aware that they make use of Bayesian principles in their daily tasks, even if the mathematics is not formally applied in practice (Doswell, 2004).

Investigations have shown that forecasters who have a statistical education and training do considerably better than those who do not have such understanding (Doswell, 2004).  Forecasters should therefore keep themselves informed about recent statistical validations and verifications of NWP performance.

Usefulness of the Forecast  - A Cost/Benefit Approach

The ultimate verification of a forecast service is the value of the decisions that end-users make based on its forecasts, provided that it is possible to quantify the usefulness of the forecasts; this brings a subjective element into weather forecast verification.

The Contingency Table

For evaluating the utility aspect of forecasts it is often convenient to present the verification in a contingency table with the corresponding hits (H), false alarms (F), misses (M) and correct no-forecasts (Z).  If N is the total number of cases then N=H+F+M+Z.  The sample climatological probability of an event occurring is then Pclim = (H + M) / N.

...

The terminology here may be different from that used in other books.  We refer to the definitions given by Nurmi (2003) and the recommendations from the WWRP/WGNE working group on verification.

The ““Expected Expenses”” (EE)

The Expected Expenses are defined as the sum of the costs due to protective actions and the losses endured:

...

where c is the cost of protective action, when warnings have been issued, and L is the loss, if the event occurs without protection.  Always protecting makes EE = c · N and never protecting EE = L · (M + H). The break-even point, when protecting and not protecting are equally costly, occurs when c · N = L · (H + M), which yields c / L = (H + M) / N = Pclim.  It is advantageous to protect whenever the "cost-loss ratio" c / L < Pclim, if Pclim is the only information available.
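
A minimal numerical sketch of these relations (the counts, cost and loss are arbitrary illustrative values; the "follow the warnings" strategy is the obvious third option implied by the contingency table):

def expected_expenses(H, F, M, Z, c, L):
    """Expected expenses over the verification period for three simple strategies,
    given contingency counts H, F, M, Z, protection cost c and loss L."""
    N = H + F + M + Z
    p_clim = (H + M) / N              # sample climatological frequency of the event
    ee_always = c * N                 # protect every day
    ee_never = L * (H + M)            # never protect
    ee_follow = c * (H + F) + L * M   # protect whenever a warning is issued
    return p_clim, ee_always, ee_never, ee_follow

# Break-even between always and never protecting occurs where c / L = p_clim:
print(expected_expenses(H=2, F=1, M=1, Z=6, c=20, L=100))   # (0.3, 200, 300, 160)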

Practical Examples

The following set of examples is inspired by real events in California in the 1930s (Lewis, 1994, p.73-74).

A Situation with No Weather Forecast Service

Imagine a location where, on average, it rains 3 days out of 10.  Two enterprises, X and Y, each lose €100 if rain occurs and they have not taken protective action.  X has to invest €20 for protection, whereas Y has to pay €60.
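
The reference expenses per day then follow directly (a short worked calculation; as before, rain occurs on 3 days out of 10):

  • X: always protect €20; never protect 0.3 × €100 = €30; with perfect forecasts 0.3 × €20 = €6.
  • Y: always protect €60; never protect 0.3 × €100 = €30; with perfect forecasts 0.3 × €60 = €18.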

...

Fig12.A.17: The triangle defined by the expected daily expenses for different costs (c), when the loss (L) is €100.  End-users who always protect increase their expenses (yellow); end-users who never protect lose on average €30 per day.  Even if perfect forecasts were supplied, protection costs could not be avoided (blue line). The triangle defines the area within which weather forecasts can reduce the expected expenses.  Note that the baseline is not a lack of expenses but the cost of the protection necessary if perfect knowledge about the future weather is available, in X's case €6 and in Y's €18 per day.

The Benefit of a Local Weather Service

The local weather forecast office A issues deterministic forecasts.  They are meteorologically realistic in that rain is forecast with the same frequency as it is observed.  The overall forecast performance is reflected in a contingency table (Table 3).

...

Note that end-users with very low or very high protection costs do not benefit from A's forecast service.

Effect of Introducing further Weather Services

Two new weather agencies, B and C, start to provide forecasts to X and Y.  The newcomers B and C have forecast performances in terms of H, F, M and Z:

...

There is, however, a third way, which will enable weather service A to quickly outperform B and C at no extra cost and without compromising its well-tuned forecast policy.

An Introduction to Probabilistic Weather Forecasting

The late American physicist and Nobel Laureate Richard Feynman (1918-88) held the view that it is better not to know than to be told something that is wrong or misleading.  This has recently been re-formulated thus: it is better to know that we do not know than to believe that we know when actually we do not know.

Uncertainty - how to turn a Disadvantage into an Advantage

Local forecast office A, in its competitive battle with B and C, starts to make use of this insight. It offers a surprising change to its routine service: it issues a categorical rain or no-rain forecast only when the forecast is absolutely certain. If not, a "don't know" forecast is issued. If such a "don't know" forecast is issued about four times during a typical ten-day period, the contingency table might look like this (assuming "don't know" equates to "50-50" or 50%):

...

So what might appear as "cowardly" forecasts prove to be more valuable for the end-users! If forecasters are uncertain, they should say so and thereby gain respect and authority in the longer term.

Making More Use of Uncertainty - Probabilities

However, service A can go further and quantify how uncertain the rain is. This is best done by expressing the uncertainty of rain in probabilistic terms.  If "don't know" is equal to 50%, then 60% and 80% indicate less uncertainty, and 40% and 20% greater uncertainty. Over a 10-day period the contingency table might, on average, look like this, where the four cases of uncertain forecasts have been grouped according to the degree of uncertainty or certainty:

...

X lowers his expenses to €10 and Y lowers his expenses to €24.


Towards more Useful Weather Forecasts

What looks "bad" has indeed been "good".  Using vague phrasing or expressing probabilities instead of giving a clear forecast is often regarded by the public as a sign of professional incompetence.

...

Although the ultimate rationale of probability weather forecasts is their usefulness, which varies from end-user to end-user, forecasters and developers also need verification and validation measures which are objective, in the sense that they do not reflect the subjective needs of different end-user groups.

Quality of Probabilistic Forecasts

The forecast performance in Table 7 exemplifies skilful probability forecasting.  In contrast to categorical forecasts, probability forecasts are never "right" or "wrong" (except when 0% or 100% has been forecast). They can therefore not be verified and validated in the same way as categorical forecasts.  This is further explained in Appendix B.

When probabilities are not required

If an end-user does not appreciate forecasts in probabilistic terms and, instead, asks for categorical "rain" or "no rain" statements, the forecaster must make the decisions for him.  Unless the relevant cost-loss ratio is known, this restriction puts forecasters in a difficult position.

...

Generally, categorical forecasts have to be biased, either positively (i.e. over-forecasting the event, for end-users with low cost-loss ratios) or negatively (i.e. under-forecasting, for end-users with high cost-loss ratios).  A good NWP model should neither over-forecast nor under-forecast at any forecast range.  This is another example of how computer-based forecasts differ from forecaster-interpreted, customer-orientated forecasts.

An Extension of the Contingency Table - the "SEEPS" score

The "SEEPS" (Stable Equitable Error in Probability Space) score has been developed to address the task of verifying deterministic precipitation forecasts.  In contrast to traditional deterministic precipitation verification, it makes use of three categories: "dry", "light precipitation" and "heavy precipitation".  "Dry" is defined according to WMO guidelines as ≤0.2 mm per 24 hours.  The "light" and "heavy" categories are defined by local climatology, so that light precipitation occurs twice as often as "heavy" precipitation.  In Europe the threshold between "light" and "heavy" precipitation is generally between 3 mm and 15 mm per 24 hours.
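
The three categories can be illustrated as follows (a sketch under the convention stated above, not an implementation of the SEEPS score itself; the light/heavy boundary is taken as the amount exceeded by one third of the wet cases in a local precipitation climatology, so that light precipitation occurs about twice as often as heavy):

import numpy as np

def seeps_categories(values_mm, climatology_mm, dry_max=0.2):
    """Assign 24 h precipitation totals to 0 = dry, 1 = light, 2 = heavy."""
    clim = np.asarray(climatology_mm, dtype=float)
    wet = clim[clim > dry_max]                      # wet cases in the local climatology
    heavy_threshold = np.quantile(wet, 2.0 / 3.0)   # light occurs ~twice as often as heavy
    x = np.asarray(values_mm, dtype=float)
    return np.where(x <= dry_max, 0, np.where(x <= heavy_threshold, 1, 2))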

Additional Sources of Information

Read further information on the verification of categorical predictands.

...