...

  • Validation measures the realism of the model with respect to its ability to simulate the behaviour of the atmosphere.
  • Verification measures the ability of the system to predict atmospheric states.

...

where the over-bar denotes an average over a large sample in time and space.  A perfect score, ME=0, does not exclude very large errors of opposite signs which cancel each other out.  If the mean errors are independent of the forecast and vary around a fixed value, this constitutes an “unconditional bias”.  If the ME is flow dependent (i.e. if the errors depend on the forecast itself or some other parameter), then we are dealing with systematic errors of the “conditional bias” type; in this case, variations in the ME from one month to another might not necessarily reflect changes in the model but in the large-scale flow patterns (see Fig12.A.1).
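For orientation, a minimal numerical sketch (with synthetic data, not real verification statistics) of how the ME is computed and how a conditional bias shows up as a non-zero slope of forecast error against forecast value, as in the scatter diagrams of Fig12.A.1:

```python
import numpy as np

# Illustrative data only: a hypothetical, under-variable forecast of an anomaly.
rng = np.random.default_rng(0)
analysis = rng.normal(loc=0.0, scale=3.0, size=1000)          # verifying anomalies
forecast = 0.7 * analysis + rng.normal(scale=1.0, size=1000)  # damped forecast plus noise
error = forecast - analysis

mean_error = error.mean()                  # ME: the over-bar average of (f - a)
slope = np.polyfit(forecast, error, 1)[0]  # regression slope of error on forecast value

print(f"ME = {mean_error:+.2f}")
# A slope near zero would indicate an unconditional bias;
# a clearly non-zero slope indicates a conditional (flow-dependent) bias.
print(f"slope of error vs. forecast = {slope:+.2f}")
```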

...

Fig12.A.1: A convenient way to differentiate between “unconditional” and “conditional” biases is to plot scatter diagrams, with forecasts vs. forecast errors or observations (analyses) vs. forecast errors. From the slope and direction of the scatter in these diagrams it is also possible to find out if the forecasts are over- or under-variable. In this case, the colder the forecast the larger the positive error, and the warmer the forecast the larger the negative error. This implies that cold anomalies are not cold enough and warm anomalies not warm enough, i.e. the forecasts are under-variable.

...

One of the complexities of interpreting the ME is that apparent systematic errors might, in fact, have a non-systematic origin. If this is the case, a perfect model appears to have systematic errors; a stable model appears to suffer from model drift. This is a reflection of a general statistical artefact, the “regression to the mean” effect.  (The “regression to the mean” effect was first discussed by Francis Galton (1822-1911), who found that tall (short) fathers tended to have tall (short) sons, but on average slightly shorter (taller) than themselves).

...

False Model Climate Drift

This “regression to the mean” effect gives rise to another type of false systematic error. Forecasts produced and verified over a period characterized by, on average, anomalous weather will give a false impression of a model climate drift. A perfect model will produce natural-looking anomalies, independent of lead time, but since the initial state is already anomalous, the forecasts are, with decreasing skill, more likely to become less anomalous than even more anomalous. At a range where there is no longer any predictive skill, the mean error will be equal to the observed mean anomaly with the opposite sign (see Fig12.A.5).

...

Objective weather forecast verification can be performed from at least three different perspectives: accuracy (the difference between forecast and verification), skill (comparison with some reference method, such as persistence, climate or an alternative forecast system) and utility (the economic value or political consequences of the forecast). They are all “objective” in the sense that the numerical results are independent of who calculated them, but not necessarily objective with respect to what is considered “good” or “bad”. The skill measure depends on a subjective choice of reference and the utility measure depends on the preferences of the end-user. Only the first approach, the accuracy measure, can be said to be fully “objective”, but, as seen in 4.3.4, in particular Figure 31 and Figure 32, the purpose of the forecast might influence what is deemed “good” or “bad”.

Measures of Accuracy

Root Mean Square Error (RMSE)

...

This is the most common accuracy measure.  It measures the distance between the forecast and the verifying analysis or observation.  The RMSE is negatively orientated (i.e. increasing numerical values indicate increasing “failure”).
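A minimal sketch of the RMSE computation, assuming the forecasts and verifying values are available as plain arrays (variable names are illustrative):

```python
import numpy as np

def rmse(forecast, verification):
    """Root Mean Square Error: negatively orientated (larger values = larger 'failure')."""
    diff = np.asarray(forecast, dtype=float) - np.asarray(verification, dtype=float)
    return np.sqrt(np.mean(diff ** 2))

# Example: rmse([2.0, 1.0, -1.0], [1.0, 1.0, 1.0]) -> sqrt((1 + 0 + 4) / 3) ≈ 1.29
```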

Mean Absolute Error (MAE)

...

Fig12.A.6: Forecasts verified against analyses often display a “kink” for the first forecast interval. This is because the error curve starts from the origin, where the forecast at t=0 is identical to the analysis. However, the true forecast error (forecasts vs. correct observations) at initial time (t=0) represents the analysis error and is rarely zero. The true error curve with respect to the correct observations lies at a slightly higher level than the error curve with respect to the analysis, in particular initially.

...

which can be written:

where:

  • Aa and Af are the atmospheric and model variability respectively around the climate

...

  • cov refers to the covariance. 

Hence the level of forecast accuracy is determined not only by the predictive skill, as reflected in the covariance term, but also by the general variability of the atmosphere, expressed by Aa, and by how well the model simulates this, expressed by Af.
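The decomposition can be checked numerically. The sketch below assumes Aa² and Af² denote the mean squared anomalies of observations and forecasts around the climate, and cov the corresponding (uncentred) covariance of the anomalies; the data are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
climate = 10.0                                                         # assumed climatological mean
a = climate + rng.normal(scale=3.0, size=5000)                         # "observed" states
f = climate + 0.8 * (a - climate) + rng.normal(scale=1.5, size=5000)   # forecasts

fa, aa = f - climate, a - climate        # forecast and observed anomalies
Af2 = np.mean(fa ** 2)                   # squared model variability around the climate
Aa2 = np.mean(aa ** 2)                   # squared atmospheric variability
cov = np.mean(fa * aa)                   # covariance term (the predictive skill)

print(np.mean((f - a) ** 2))             # MSE computed directly
print(Af2 + Aa2 - 2.0 * cov)             # identical, via the decomposition
```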

...

Forecast errors do not grow indefinitely but asymptotically approach a maximum, the “Error Saturation Level”.

Fig12.A.7: The error growth in a state-of-the-art NWP forecast system will at some stage display larger errors than a climatological average used as forecast and will, as do the errors of persistence forecasts and guesses, asymptotically approach an error level 41% above that of a forecast based on a climatological average.
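The 41% figure follows from the decomposition above if one assumes that, at very long range, the forecast and observed anomalies are uncorrelated and have the same variability A around the climate; a sketch:

```latex
% Why the saturation level lies about 41% above the climate-forecast error
% (assuming uncorrelated forecast and observed anomalies of equal variability A):
\[
\begin{aligned}
E_{\mathrm{clim}}^2 &= \overline{(c-a)^2} = A^2,\\
E_{\infty}^2 &= A_f^2 + A_a^2 - 2\,\mathrm{cov}\bigl(f-c,\,a-c\bigr) = A^2 + A^2 - 0 = 2A^2,\\
E_{\infty} &= \sqrt{2}\,A \approx 1.41\,E_{\mathrm{clim}}.
\end{aligned}
\]
```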

...

The Anomaly Correlation Coefficient (ACC) can be regarded as a skill score relative to the climate.  Increasing numerical values indicate increasing “success”.  It has been found empirically that:

...

The mathematics of statistics can be relatively simple but the results are often quite difficult to interpret, due to their counter-intuitive nature: what looks “good” might be “bad”, what looks “bad” might be “good”. As we have seen in A-1.3, seemingly systematic errors can have a non-systematic origin and forecasts verified against analyses can yield results different from those verified against observations. As we will see below, different verification scores can give divergent impressions of forecast quality and, perhaps most paradoxically, improving the realism of an NWP model might give rise to increasing errors.

...

For a forecast system that realistically reflects atmospheric synoptic-dynamic activity, Af = Aa.  If Af < Aa the forecasting system underestimates atmospheric variability, which will contribute to a decrease in the RMSE.  This is “bad” if we are dealing with an NWP model but “good” if we are dealing with post-processed deterministic forecasts for end-users.  On the other hand, if Af > Aa the model overestimates synoptic-dynamic activity, which will contribute to increasing the RMSE.  This is normally “bad” for all applications.

Comparing RMSE verifications of different models, or of different versions of the same model, is most straightforward when Af = Aa, i.e. when the models have the same general variability as the atmosphere.
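A small synthetic illustration of the point above, assuming a forecast range at which the covariance term is essentially zero: the smoother (under-variable) forecast obtains the lower RMSE even though it is no more skilful.

```python
import numpy as np

rng = np.random.default_rng(2)
aa = rng.normal(scale=3.0, size=5000)          # observed anomalies, Aa = 3

# Two hypothetical long-range forecasts with no skill (zero covariance with aa):
realistic = rng.normal(scale=3.0, size=5000)   # Af = Aa: realistic variability
damped = 0.3 * realistic                       # Af < Aa: smoothed towards the climate

rmse = lambda f: np.sqrt(np.mean((f - aa) ** 2))
print(rmse(realistic))   # about sqrt(9 + 9) ≈ 4.2
print(rmse(damped))      # about sqrt(0.81 + 9) ≈ 3.1, lower although no more skilful
```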

...

Both RMSE and ACC are flow dependent, sometimes in a contradictory way. In non-anomalous conditions (e.g. zonal flow) the ACC can easily take low (“bad”) values, while in anomalous regimes (e.g. blocking flow) it can take quite high (“good”) values. The opposite is true for RMSE, which can easily take high (“bad”) values in meridional or blocked flow regimes and low (“good”) values in zonal regimes. Conflicting indications are yet another example of “what looks bad is good”, as they reflect different virtues of the forecasts and thereby provide the basis for a more nuanced overall assessment.

The “Double Penalty Effect”

A special case of the flow dependence of the RMSE and ACC is the “double penalty effect”, where a bad forecast is “penalised” twice: first for not having a system where there is one and second for having a system where there is none. It can be shown that, if a wave is forecast with a phase error of half a wave length or more, it will score worse in RMSE and ACC than if the wave had not been forecast at all (see Fig12.A.8).
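A synthetic single-wave example of the double penalty, comparing a forecast wave displaced by half a wavelength with not forecasting the wave at all (illustrative data only):

```python
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
obs = np.sin(x)                      # observed wave (anomaly relative to climate)
shifted = np.sin(x + np.pi)          # forecast wave, half a wavelength out of phase
no_wave = np.zeros_like(x)           # the wave not forecast at all

rmse = lambda f: np.sqrt(np.mean((f - obs) ** 2))

print(rmse(shifted), np.corrcoef(shifted, obs)[0, 1])   # ≈ 1.41, ACC = -1.0
print(rmse(no_wave))                                    # ≈ 0.71 (ACC undefined for a flat field)
```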

...

The interpretation of RMSE and ACC outlined above may be aided by a graphical vector notation, based on elementary trigonometry.  The equation for the decomposition of the MSE is mathematically identical to the “cosine law”.  From this it follows that the cosine of the angle β between the vectors (f-c) and (a-c) corresponds to the ACC (see Fig12.A.9).
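Written out, with β the angle between the anomaly vectors and the norms understood as RMS averages over the verification sample (mean anomalies assumed removed), the correspondence is:

```latex
% The MSE decomposition in "cosine law" form, a sketch:
\[
\begin{aligned}
\|f-a\|^2 &= \|f-c\|^2 + \|a-c\|^2 - 2\,\|f-c\|\,\|a-c\|\cos\beta,\\
\cos\beta &= \frac{\langle\, f-c,\; a-c \,\rangle}{\|f-c\|\;\|a-c\|} \;=\; \mathrm{ACC}.
\end{aligned}
\]
```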

...

If the RMSE is used as the norm, it would in principle be possible, at an extended range, to pick out those “Members of the Day” that are better than the average, just by selecting those members which are less anomalous.  If, however, the ACC is used as the norm, the “Members of the Day” may turn out to be those members which are more anomalous.

...

Fig12.A.14: A 3-dimensional vector figure to clarify the relation between forecast jumpiness and error.  Two forecasts, f and g, are shown at a range where there is no correlation between the forecast and observed anomalies (f - c), (g - c) and (a - c). The angles between the three vectors are 90°. The angles in the triangle a-f-g are each 60°, which means that there is a 50% correlation between the “jumpiness” (g - f) and the errors (f - a) and (g - a). The same is true for the correlation between (f - a) and (g - a).

...

From the same Fig12.A.14 it follows that, since the angle between the forecast “jumpiness” (f - g) and the error (f - a) is 60°, the correlation at an infinite range between “jumpiness” and error is 50%. For shorter forecast ranges the correlations decrease because the forecast anomalies become more correlated, with the angle between them <60°.

...

Combining different forecasts into a “consensus” forecast, either from different models (“the multi-model ensemble”) or from the same model (“the lagged average forecast”), normally yields higher forecast accuracy (lower RMSE). The forecasts should be weighted together with respect not only to their average errors but also to the correlation between these errors (see Fig12.A.15).
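A sketch of such a weighting for two forecasts, under the usual assumption that both are unbiased; the weight depends on the two error variances and on the covariance between the errors (names and data below are illustrative, not from the text):

```python
import numpy as np

def consensus_weight(err1, err2):
    """Weight on forecast 1 in f = w*f1 + (1-w)*f2 that minimises the error variance,
    taking the correlation between the two sets of errors into account."""
    C = np.cov(err1, err2)                  # 2x2 error covariance matrix
    v1, v2, c12 = C[0, 0], C[1, 1], C[0, 1]
    return (v2 - c12) / (v1 + v2 - 2.0 * c12)

# Illustrative, partially correlated forecast errors:
rng = np.random.default_rng(3)
shared = rng.normal(size=2000)
e1 = shared + rng.normal(scale=0.8, size=2000)
e2 = shared + rng.normal(scale=1.2, size=2000)

w = consensus_weight(e1, e2)
e_blend = w * e1 + (1.0 - w) * e2           # error of the consensus forecast
print(w, np.std(e1), np.std(e2), np.std(e_blend))   # the blend has the smallest spread
```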

...

Statistical verification is normally associated with forecast product control. Statistical know-how can not only ensure a correct interpretation but also help add value to the medium-range NWP output. The interventions and modifications performed by experienced forecasters are to some extent statistical in nature. Modifying or adjusting a short-range NWP forecast in the light of later observations is qualitatively similar to “optimal interpolation” in data assimilation. Correcting for systematic errors is similar to linear regression analysis, and advising end-users in their decision-making involves an understanding of cost-loss analysis. Weather forecasters are not always aware that they make use of Bayesian principles in their daily tasks, even if the mathematics is not formally applied in practice (Doswell, 2004).

...

The terminology here may be different from that used in other books.  We refer to the definitions given by Nurmi (2003) and the recommendations from the WWRP/WGNE working group on verification.

The “Expected Expenses” (EE)

The Expected Expenses are defined as the sum of the costs due to protective actions and the losses endured:

...

where c is the cost of protective action, when warnings have been issued, and L is the loss, if the event occurs without protection.  Always protecting makes EE = c · N and never protecting EE = L · (M + H). The break-even point, when protecting and not protecting are equally costly, occurs when c · N = L · (H + M), which yields c / L = (H + M) / N = Pclim.  It is advantageous to protect whenever the “cost-loss ratio” c / L < Pclim, if Pclim is the only information available.
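The bookkeeping can be summarised in a couple of lines; the helper names below are illustrative, not from the text, and H, F, M, Z are the hits, false alarms, misses and correct rejections of the contingency table:

```python
def expected_expenses(c, L, times_protected, unprotected_events):
    """EE = cost of all protective actions + losses from all unprotected events."""
    return c * times_protected + L * unprotected_events

def break_even_ratio(H, F, M, Z):
    """Always protecting (EE = c*N) and never protecting (EE = L*(H+M)) cost the
    same when c/L = (H+M)/N, i.e. the climatological frequency P_clim."""
    return (H + F and (H + M) / (H + F + M + Z)) if False else (H + M) / (H + F + M + Z)
```

With the rain climate of the worked example below (rain on 3 days out of 10), routine protection pays only for users with c/L below 0.3.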

...

Imagine a location where, on average, it rains 3 days out of 10.  Two users, X and Y, each lose €100 if rain occurs and they have not taken protective action.  X has to invest €20 for protection, whereas Y has to pay €60.

Thanks to his low protection cost, X protects every day, which costs on average €20 per day over a longer period.  Y, on the other hand, chooses never to protect due to the high cost, and suffers an average loss of €30 per day over an average 10-day period, owing to the three rain events (see Fig12.A.17).

...

Fig12.A.17: The triangle defined by the expected daily expenses for different costs (c), when the loss (L) is €100.  End-users who always protect increase their expenses (yellow); end-users who never protect lose on average €30 per day.  Even if perfect forecasts were supplied, protection costs could not be avoided (blue line). The triangle defines the area within which weather forecasts can reduce the expected expenses.  Note the baseline is not a lack of expenses but the cost of the protection necessary if perfect knowledge about the future weather is available: for user X this is €6 per day, for user Y €18 per day.

The Benefit of a Local Weather Service

...

Relying on these forecasts over a typical 10-day period, both X and Y protect three times and are caught out unprotected only once.  X is able to lower his loss from €20 to €16, and Y from €30 to €28 (see Fig12.A.18).

Fig12.A.18: The same as Fig12.A.17, but with the expected expenses for end-users served by forecast service A. The red area indicates the added benefits for X and Y from basing their decisions on deterministic weather forecasts from service A.

Note that end-users with very low or very high protection costs do not benefit from the forecast service of Agency A.

Effect of Introducing further Weather Services

Two new weather agencies, B and C, start to provide forecasts to Users X and Y.  The newcomers have the following forecast performance in terms of H, F, M and Z:

...

Agency B heavily under-forecasts rain and Agency C heavily over-forecasts.  Both give a distorted image of atmospheric behaviour, but what might seem “bad” is actually “good” (see Fig12.A.19).

  • User Y has high protection costs.  Agency B heavily under-forecasts rain, but by following B’s forecasts User Y reduces his expenses from €28 to €26.
  • User X has low protection costs.  Agency C heavily over-forecasts rain, but by following C’s forecasts User X reduces his expenses from €16 to €12.


Fig12.A.19: The cost-loss diagram with the expected expenses according to forecasts from Agencies B and C for different end-users, defined by their cost-loss ratios. Weather service A is able to provide more useful forecasts than B and C only to a section of the potential end-users, those with c/L ratios between 33% and 50%. The green and yellow areas indicate where X and Y benefit from the forecasts from Agencies B and C respectively.

Agencies B and C have also managed to provide a useful weather service to those with very low protection costs (C) and those with very high protection costs (B).  In general, any end-user with protection costs <€33 benefits from the services of Agency C, and any end-user with protection costs >€50 benefits from the services of Agency B.  Only users with costs between €33 and €50 benefit from the services of Agency A more than they do from Agencies B and C.

There seem to be only two ways in which Agency A can compete with Agencies B and C:

  • It can improve the deterministic forecast skill – this would involve NWP model development, which takes time and is costly.
  • It can “tweak” the forecasts in the same way as Agencies B and C, thus violating its policy of well-tuned forecasts.

There is, however, a third way, which will enable Agency A to quickly outperform Agencies B and C at no extra cost and without compromising its policy of well-tuned forecasts.

...

Agency A, in its competitive battle with B and C, starts to make use of this insight. It offers a surprising change of routine service: it issues a categorical rain or no-rain forecast only when the forecast is absolutely certain. If not, a “don’t know” forecast is issued. If such a “don’t know” forecast is issued about four times during a typical ten-day period, the contingency table might look like this (assuming “don’t know” equates to “50-50” or 50%):

This does not look very impressive, rather the opposite; but, paradoxically, both Users X and Y benefit greatly from this special service.  This is because they are now free to interpret the forecasts in their own way (see Fig12.A.20).

  • User X has low protection costs and can afford to interpret the “don’t know” forecast as if it could rain, and therefore decides to take protective action.  By doing so, User X drastically lowers his costs to €10 per day.  This is €20 cheaper than following the forecasts of Agency C.
  • User Y has expensive protection costs and will prefer to interpret the “don’t know” forecast as if there will be no rain, and decides not to take protective action.  By doing so, User Y lowers his costs to €26 per day.  This is similar to following the forecasts of Agency B.


Fig12.A.20: The expected daily expenses when the end-users are free to interpret the “don’t know” forecast either as “rain”, if they have a low c/L ratio, or as “no rain”, if their c/L ratio is high.

So what might appear as “cowardly” forecasts prove to be more valuable for the end-users!  If forecasters are uncertain, they should say so; in this way they can gain respect and authority in the longer term.

...

However, Agency A can go further and quantify how uncertain the rain is. This is best done by expressing the uncertainty of rain in probabilistic terms.  If “don’t know” is equal to 50%, then 60% and 80% indicate less uncertainty, 40% and 20% greater uncertainty. Over a 10-day period the contingency table might, on average, look like this, where the four cases of uncertain forecasts have been grouped according to the degree of uncertainty or certainty:

Note: A “don’t know” forecast does not necessarily mean “50-50”.  It could mean the climatological probability.  In fact, unless the climatological rain frequency is indeed 50%, a “50-50” statement actually provides the non-trivial information that the risk is higher or lower than normal.

...

The use of probabilities allows other end-users, with protection costs different from those of Users X and Y, to benefit from the forecast service of Agency A.  They should take protective action if the forecast probability exceeds their cost-loss ratio (P > c / L).  Assuming possible losses of €100, someone with a protection cost of €30 should take action when the forecast risk exceeds 30%, and someone with costs of €75 should take action when the risk exceeds 75% (see Fig12.A.21).
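A sketch of this decision rule; the function names are illustrative:

```python
def protect(p, c, L):
    """Decision rule from the text: act when the forecast probability p exceeds c/L."""
    return p > c / L

def expenses(probs, occurred, c, L):
    """Total expense over a set of occasions when following the p > c/L rule.
    probs: forecast probabilities; occurred: booleans, True if the event happened."""
    total = 0.0
    for p, event in zip(probs, occurred):
        if protect(p, c, L):
            total += c            # cost of protection
        elif event:
            total += L            # loss from an unprotected event
    return total

# A user with c = 30, L = 100 acts when the forecast risk exceeds 30%;
# a user with c = 75 acts only when it exceeds 75%.
```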

...

Fig12.A.21: The same figure, but with the expected expenses indicated for cases where different end-users take action after receiving probability forecasts. The general performance (diagonal thick blue line) is now closer to the performance for perfect forecasts.

User X lowers his expenses to €10 and User Y lowers his expenses to €24.


Towards more Useful Weather Forecasts

What looks “bad” has indeed been “good”.  Using vague phrasing or expressing probabilities instead of giving a clear forecast is often regarded by the public as a sign of professional incompetence.

“Unfortunately, a segment of the public tends to look upon probability forecasting as a means of escape for the forecaster” (Lorenz, 1970).

Instead, it has been shown that what looks like “cowardly” forecast practice is, in reality, more beneficial to the public and end-users than perceived “brave” forecast practice.

“What the critics of probability forecasting fail to recognize or else are reluctant to acknowledge is that a forecaster is paid not for exhibiting his skill but for providing information to the public, and that a probability forecast conveys more information, as opposed to guesswork, than a simple [deterministic] forecast of rain or no rain.” (Lorenz, 1970)

Although the ultimate rationale of probability weather forecasts is their usefulness, which varies from end-user to end-user, forecasters and developers also need verification and validation measures which are objective, in the sense that they do not reflect the subjective needs of different end-user groups.

...

The forecast performance in Table 7 exemplifies skilful probability forecasting.  In contrast to categorical forecasts, probability forecasts are never “right” or “wrong” (except when 0% or 100% has been forecast). They therefore cannot be verified and validated in the same way as categorical forecasts.  This is further explained in Appendix B.

...

If an end-user does not appreciate forecasts in probabilistic terms and, instead, asks for categorical “rain” or “no rain” statements, the forecaster must make the decisions for him.  Unless the relevant cost-loss ratio is known, this restriction puts forecasters in a difficult position.

If, on the other hand, they have a fair understanding of the end-user’s needs, forecasters can simply convert their probabilistic forecast into a categorical one, depending on whether the end-user’s particular probability threshold is exceeded or not.  The forecasters are, in other words, doing what the end-user should have done.

...

Generally, categorical forecasts have to be biased, either positively (i.e. over-forecasting the event, for end-users with low cost-loss ratios) or negatively (i.e. under-forecasting, for end-users with high cost-loss ratios).  A good NWP model should neither over-forecast nor under-forecast at any forecast range.  This is another example of how computer-based forecasts differ from forecaster-interpreted, customer-orientated forecasts.

An Extension of the Contingency Table – the “SEEPS” Score

The SEEPS (Stable Equitable Error in Probability Space) score has been developed to address the task of verifying deterministic precipitation forecasts.  In contrast to traditional deterministic precipitation verification, it makes use of three categories: “dry”, “light precipitation” and “heavy precipitation”.  “Dry” is defined according to WMO guidelines as ≤0.2 mm per 24 hours.  The “light” and “heavy” categories are defined by local climatology, so that light precipitation occurs twice as often as heavy precipitation.  In Europe the threshold between “light” and “heavy” precipitation is generally between 3 mm and 15 mm per 24 hours.
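As an illustration (not the full SEEPS formulation, whose scoring matrix is not reproduced here), the three categories could be assigned from a local wet-day climatology roughly as follows; the function name and the quantile choice are assumptions consistent with the 2:1 light-to-heavy frequency stated above:

```python
import numpy as np

DRY = 0.2   # mm per 24 h; WMO threshold for a "dry" day

def seeps_category(amount_mm, wet_climatology_mm):
    """Assign a 24-h precipitation amount to one of three categories
    (0 = dry, 1 = light, 2 = heavy).  The light/heavy boundary is taken as the
    two-thirds quantile of the local wet-day climatology, so that climatologically
    light precipitation occurs twice as often as heavy precipitation."""
    boundary = np.quantile(np.asarray(wet_climatology_mm), 2.0 / 3.0)
    if amount_mm <= DRY:
        return 0
    return 1 if amount_mm <= boundary else 2
```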

...