
Using Verification Metrics alongside the Forecast

It is useful to have some measures of the current performance and biases of the IFS. Users can assess from Reliability and ROC diagrams whether the forecast model is:

effective in capturing an event.
tending to over- or under-forecast with different probabilities of an event.
tending to forecast events that actually happen while minimising those that don't.

It is vital that users understand, from the outset, the general characteristics of the model forecasts relative to the subsequent verifying observations (e.g. whether the model typically over-forecasts or under-forecasts certain types of outcome). Users should then interpret any forecast signals accordingly. Usually this will mean being wary of over-stating the significance of signals that have historically been unreliable and/or unskilful. Such a strategy should be applied at all lead times, within forecasting in general. However, it is particularly important for the longer lead forecasts (such as monthly and seasonal).

ECMWF provides a number of verification metrics to use in this way, such as anomaly correlation coefficients, reliability diagrams and ROC curves, which have all been computed using the re-forecasts.

Brier Score

Brier Score (BS) measures, over a large sample, the correspondence between the forecast probabilities and the frequency of occurrence of the verifying observations. It is the mean squared difference between each forecast probability and the corresponding outcome (1 if the event occurred, 0 if it did not). On average, when rain is forecast with probability p, it should occur with the same frequency p. Observed frequency can be plotted against forecast probability as a graph. A perfect correspondence means the graph lies upon the diagonal; departures of the graph from the diagonal contribute to the Brier Score. Values lie between 0 (perfect) and 1 (consistently wrong).
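As an illustration, the Brier Score in its standard form (the mean squared difference between forecast probability and a 0/1 outcome) can be computed as in the minimal sketch below. The function name and the sample values are hypothetical and are not taken from ECMWF verification data.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and outcome (1 or 0)."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        raise ValueError("at least one forecast is required")
    total = sum((p - o) ** 2 for p, o in zip(probabilities, outcomes))
    return total / len(probabilities)


# Hypothetical sample: forecast probabilities of rain and whether rain was observed.
forecast_probs = [0.9, 0.8, 0.2, 0.1, 0.5]
observed = [1, 1, 0, 0, 1]
bs = brier_score(forecast_probs, observed)
print(f"Brier Score: {bs:.3f}")
```

A forecast that is always right with full confidence scores 0, and one that is always wrong with full confidence scores 1, matching the range quoted above.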

Distribution of forecast probabilities

The distribution of forecast probabilities gives an indication of the tendency of the forecast towards uncertainty. These are plotted as a histogram to give an indication of confidence in model performance:

A U-shaped distribution (i.e. higher proportion of forecast probabilities occur at each end of the histogram). This implies a clearer decision on whether an event will or won't occur and gives a higher confidence in model performance.
A peaked distribution (i.e. higher proportion of forecast probabilities occur in the centre of the histogram). This implies a more equivocal decision on whether an event will or won't occur and gives much less confidence in model performance.

Where there are only a few entries for a given probability on the histogram then confidence in the Reliability diagram is reduced for that probability. In Fig8.3.5-1:

the predominance of probabilities below 0.2 suggests that the user can have some confidence that IFS tends to be over-confident that the event will occur.
the predominance of probabilities above 0.9 suggests that the user can have some confidence that IFS tends to be under-confident that the event won't occur.
there are only a few probabilities on the histogram between 0.2 and 0.9.

The scarcity of probabilities between 0.2 and 0.9 means that it would be unsafe to draw similar deductions with confidence from the Reliability diagram within this probability range. Conversely, in Fig8.3.5-2 the majority of probabilities lie between 0.2 and 0.5 and reliability within this range appears fairly good, while there is much less confidence in model performance for over- or under-forecasting an event. This is as expected as the forecast range becomes longer.
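The histogram of forecast probabilities described above can be sketched as follows. This is only an illustration: the bin count, the `min_count` cut-off used to flag sparsely populated bins, and the sample values are all assumptions, not ECMWF settings.

```python
def probability_histogram(probabilities, n_bins=10):
    """Count forecast probabilities in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def sparse_bins(counts, min_count=5):
    """Indices of bins holding fewer than min_count forecasts (arbitrary cut-off)."""
    return [i for i, c in enumerate(counts) if c < min_count]


# Hypothetical forecast probabilities clustered near 0 and 1 (a U-shaped distribution).
sample = [0.02, 0.05, 0.08, 0.12, 0.15, 0.55, 0.93, 0.95, 0.97, 1.0]
counts = probability_histogram(sample)
few = sparse_bins(counts, min_count=2)
print(counts)
print("bins with few entries:", few)
```

Bins flagged as sparse are those where the corresponding points on a reliability diagram deserve less confidence.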

The Reliability diagram

The reliability diagram shows whether, and by how much, the model tends to over- or under-forecast an event.

The diagram shows, for a given event, the relationship between:

forecast probability of an event, measured as the proportion of ensemble members,
- against
the observed frequency of that event, measured by climatology frequency taken from re-forecasts.

An example might be the probability that 2m temperature will be greater than 20°C, plotted against the climatological frequency of that event.
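The forecast probability "measured as the proportion of ensemble members" can be sketched as below for the 20°C example. The function name and the ten member values are hypothetical; a real ensemble would have many more members.

```python
def ensemble_probability(member_values, threshold):
    """Fraction of ensemble members whose value exceeds the threshold."""
    if not member_values:
        raise ValueError("at least one ensemble member is required")
    exceeding = sum(1 for value in member_values if value > threshold)
    return exceeding / len(member_values)


# Hypothetical 2m temperatures (degrees C) from ten ensemble members.
members = [18.5, 19.2, 20.4, 21.0, 22.3, 19.9, 20.1, 23.0, 17.8, 20.6]
prob_above_20 = ensemble_probability(members, threshold=20.0)
print(f"Probability of T2m > 20C: {prob_above_20:.0%}")
```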

Ideally points should lie on the diagonal. If the plotted points lie:

to the right of the diagonal then there is over-forecasting (e.g. where rain is forecast with 100% probability but actually observed on 80% of occasions).
to the left of the diagonal then there is under-forecasting (e.g. where rain is forecast with 60% probability but actually observed on 90% of occasions).

The size of the departure from the diagonal indicates the magnitude of the over- or under-forecasting error.
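The points of a reliability diagram can be derived by grouping forecasts into probability bins and comparing the mean forecast probability in each bin with the observed frequency. The sketch below assumes equal-width bins and uses made-up data; the names are illustrative only.

```python
def reliability_curve(probabilities, outcomes, n_bins=10):
    """Return (mean forecast probability, observed frequency, count) per non-empty bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, o in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)
        sums[index] += p
        hits[index] += 1 if o else 0
        counts[index] += 1
    return [
        (sums[i] / counts[i], hits[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


def classify(mean_probability, observed_frequency):
    """Below the diagonal means over-forecasting, above it under-forecasting."""
    if observed_frequency < mean_probability:
        return "over-forecasting"
    if observed_frequency > mean_probability:
        return "under-forecasting"
    return "reliable"


# Hypothetical sample: four forecasts of 15% and five forecasts of 85%.
probs = [0.15] * 4 + [0.85] * 5
obs = [0, 0, 0, 1, 1, 1, 1, 1, 0]
curve = reliability_curve(probs, obs)
for mean_p, freq, n in curve:
    print(f"forecast {mean_p:.2f} observed {freq:.2f} (n={n}): {classify(mean_p, freq)}")
```

In this made-up sample the low-probability bin is under-forecast and the high-probability bin is over-forecast, which is the shallow-slope pattern associated with over-confidence.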

A common feature of reliability diagrams is that the profile of the forecasts (red line in Fig8.3.5-1) has a shallower slope than the diagonal, but crosses it somewhere near the climatological value (blue line intersection). This means that the forecast has a tendency to be over-confident. Users should adjust forecast probabilities, even if departures from the diagonal are only small. This is to offset any model tendency to over-forecast frequently observed events, and to under-forecast rather more rare events.

Current Reliability Diagrams (which include distribution of forecast probabilities) are available on Opencharts (days 4, 6, and 10 only).

The ROC diagram

The ROC diagram gives a measure of the capacity to discriminate when events are more likely to happen.

It shows the effectiveness of the IFS:

in forecasting an event that actually happens (Probability of Detection or Hit Rate).
- balanced against
the undesirable cases of predicting an event that fails to occur (False Alarm Rate).

The effectiveness is also known as the 'resolution' of the forecasting system. The word 'resolution' here should not be confused with spatial and temporal resolution.

A system which always forecasts climatological probabilities, for example, would have no discrimination ability (i.e. zero resolution). The resolution can be investigated using the Relative Operating Characteristic (ROC) diagram. The ROC plots hit rate on the y-axis against false alarm rate on the x-axis. Ideally:

the Hit Rate should be high and the False Alarm Rate low (i.e. ideally the graph should lie well towards the top left corner).
the Hit Rate should be better than the False Alarm Rate (i.e. values should lie above the diagonal).

Where a ROC graph:

arches towards the top left of the diagram then the model is effective in forecasting events that occur without incorrectly warning of events that don't.
follows the diagonal then the model is forecasting as many events that occur as warning of events that don't.
lies below the diagonal then the model is forecasting few events that occur while mostly warning of events that don't.

The ROC score is the area beneath the graph on the ROC diagram and lies between 1 (perfect capture of events) and 0 (consistently warning of events that don't happen); a score of 0.5 corresponds to a graph along the diagonal (no ability to discriminate). Fig8.3.5-1 shows high effectiveness in forecasting events (ROC score 0.859) while Fig8.3.5-2 shows reduced effectiveness (ROC score 0.593). This is as expected as the forecast range becomes longer.
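The construction of a ROC graph and its area can be sketched as follows. The sketch assumes that a warning is issued whenever the forecast probability reaches a chosen threshold, that Hit Rate is hits divided by the number of events, and that False Alarm Rate is false alarms divided by the number of non-events. The thresholds and data are hypothetical.

```python
def roc_points(probabilities, outcomes, thresholds):
    """Return sorted (false alarm rate, hit rate) points, including (0,0) and (1,1)."""
    events = sum(1 for o in outcomes if o)
    non_events = len(outcomes) - events
    if events == 0 or non_events == 0:
        raise ValueError("sample must contain both events and non-events")
    points = {(0.0, 0.0), (1.0, 1.0)}
    for t in thresholds:
        hits = sum(1 for p, o in zip(probabilities, outcomes) if p >= t and o)
        false_alarms = sum(1 for p, o in zip(probabilities, outcomes) if p >= t and not o)
        points.add((false_alarms / non_events, hits / events))
    return sorted(points)


def roc_area(points):
    """Area under the ROC graph using the trapezium rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical forecast probabilities and observed outcomes (1 = event occurred).
probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
obs = [1, 1, 0, 1, 0, 0]
points = roc_points(probs, obs, thresholds=[0.25, 0.5, 0.75])
score = roc_area(points)
print(f"ROC score: {score:.3f}")
```

A forecast that gives the same probability on every occasion produces only the points (0,0) and (1,1), so its area is 0.5, in line with a graph that follows the diagonal.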

Current ROC Diagrams are available on Opencharts (for day 5 onwards).


Fig8.3.5-1: Reliability Diagram (left) and ROC diagram (right) regarding lower tercile for T2m in Europe area for week1 (day5-11), DT:20 Jun 2019.


Fig8.3.5-2: Reliability Diagram (left) and ROC diagram (right) regarding lower tercile for T2m in Europe area for week5 (day19-32), DT:20 Jun 2019.

In the above diagrams:

BrSc = Brier Score (BS), LCBrSkSc = Brier Skill Score (BSS).
BS_REL = Forecast reliability and BS_RSL = Forecast resolution, with respect to observations.
BSS_RSL = Forecast resolution and BSS_REL = Forecast reliability, with respect to climatology.
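The reliability and resolution terms listed above are commonly obtained from the standard (Murphy) decomposition of the Brier Score, BS = reliability - resolution + uncertainty, with the Brier Skill Score measured against a climatological reference. The sketch below illustrates that textbook decomposition on made-up data; it is an assumption that the quantities shown on the diagrams follow this form, and the exact ECMWF definitions may differ.

```python
from collections import defaultdict


def brier_decomposition(probabilities, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by distinct forecast value."""
    n = len(probabilities)
    base_rate = sum(outcomes) / n  # sample climatological frequency of the event
    groups = defaultdict(list)
    for p, o in zip(probabilities, outcomes):
        groups[p].append(o)
    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        freq = sum(obs) / len(obs)
        reliability += len(obs) * (p - freq) ** 2
        resolution += len(obs) * (freq - base_rate) ** 2
    return reliability / n, resolution / n, base_rate * (1.0 - base_rate)


# Hypothetical sample: four forecasts of 10% and five forecasts of 90%.
probs = [0.1] * 4 + [0.9] * 5
obs = [0, 0, 0, 1, 1, 1, 1, 1, 0]
rel, res, unc = brier_decomposition(probs, obs)
bs = sum((p - o) ** 2 for p, o in zip(probs, obs)) / len(probs)
bss = 1.0 - bs / unc  # skill relative to always forecasting the sample climatology
print(f"BS={bs:.4f} REL={rel:.4f} RES={res:.4f} UNC={unc:.4f} BSS={bss:.4f}")
```

Smaller reliability terms and larger resolution terms both lower the Brier Score; a positive BSS indicates an improvement over the climatological reference.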

Fig8.3.5-3: Example of Reliability Diagrams from Opencharts. Total 24h precipitation Day 6, assessed from ensemble probability forecasts during a three-month period and compared with climatology from the same period. The traces show the comparison of forecast probabilities against observed occurrences for 24h precipitation totals of >1mm, >5mm, >10mm, >20mm. Ideally the traces should lie along the dashed blue line (i.e. the ensemble probability forecast should agree with the observed frequency). The diagram shows:

reasonably good forecasting at low ensemble probabilities:
- e.g. ensemble 20% probability occurred 20% of the time for each group.
over-forecasting at higher ensemble probabilities:
- e.g. ensemble 90% probability of >1mm/24h actually occurred only 60% of the time - the wide distribution of forecast probabilities suggests some confidence in the Reliability trace.
- e.g. ensemble 90% probability of >20mm/24h actually occurred 80% of the time - but the very few forecasts of high probabilities suggest very low confidence in the corresponding implied reliabilities.


Fig8.3.5-4: Example of Reliability Diagrams from Opencharts. Temperature anomaly Day 4, assessed from ensemble probability forecasts during a three-month period and compared with climatology from the same period. The traces show the comparison of forecast probabilities of anomalies against observed occurrences of anomalies for 2 metre temperature of >8°C below, >4°C below, >4°C above, >8°C above climatology. Ideally the traces should lie along the dashed blue line (i.e. the ensemble forecast probability should agree with the observed frequency). The diagram shows:

under-forecasting at low probabilities:
- e.g. for >8°C above, >4°C above and >8°C below climatology, ensemble 20% probability actually occurred 35% of the time.
- e.g. for >4°C below climatology, ensemble 20% probability actually occurred 25% of the time - fairly good correspondence.
over-forecasting at higher ensemble probabilities e.g.:
- for >4°C below climatology, ensemble 90% probability actually occurred only 70% of the time - the wide distribution of forecast probabilities suggests some moderate confidence in the implied reliability.
- for >8°C above climatology, ensemble 90% probability actually occurred only 65% of the time - but the very few forecasts of high probabilities suggest very low confidence in the implied reliability.

However:

- for >4°C above climatology, ensemble 90% probability actually occurred 85% of the time - the wide distribution of forecast probabilities suggests some moderate confidence in the implied reliability.
- for >8°C below climatology, ensemble 90% probability actually occurred 85% of the time - but the very few forecasts of high probabilities suggest very low confidence in the implied reliability.

Fig8.3.5-5: Example reliability diagrams for 2m temperature based on July starts of the seasonal forecasts for months 4-6.

left for the tropics - a slight tendency towards over-confidence, especially when forecasting that the event (warm anomalies) will happen.
right for Europe - a tendency towards over-confidence, though the sample size for high confidence forecasts is small, making the plot noisy.


Fig8.3.5-6: Example reliability diagrams for rain based on July starts of the seasonal forecasts for months 4-6:

left for the tropics - a tendency towards over-confidence.
right for Europe - forecast not reliable at all. Thus it should not be used, unless there are exceptional circumstances that warrant an expectation of skill that is ordinarily not there.

Fig8.3.5-7: Example ROC diagrams for Europe based on July starts of the seasonal forecasts for months 4-6:

the left diagram is for 2m temperatures in the upper tercile. The Hit Rate is slightly better than the False Alarm Rate indicating that the forecast system has some limited ability to discriminate occasions when warm events are likely from occasions when they are not.
the right diagram is for precipitation in the upper tercile. The Hit Rate and False Alarm Rate are similar throughout indicating that the seasonal forecast system has no ability to distinguish occasions when it will be wet from occasions when it will not.

Anomaly Correlation

Anomaly Correlation Coefficient (ACC) charts give an assessment of the skill of the forecast. They show the correlation at all geographical locations in map form.

At ECMWF the anomaly correlation coefficient (ACC) scores represent the spatial correlation between:

the anomalies of a forecast product from a reference model climate and
the anomalies of observations or reanalysis from the same reference model climate.

Seasonal products are available in chart form and the correlation is evaluated between:

- the anomaly of the product measured relative to a model climatology based on the ERA-Interim re-analysis (based on the period 1993-2016) and
- the anomaly of the verifying observations or reanalysis relative to the seasonal model climate (S-M-climate).

The seasonal model climate (S-M-climate) is based on re-forecasts spanning the last 20 years, which used the ERA-Interim re-analysis for their initialisation.

Anomaly correlation coefficient (ACC) charts are produced for several parameters. Each chart shows the skill of the forecast at each location for the given month and lead-time.

Positive ACC implies a correlation between forecast anomalies and the verifying observed anomalies. Higher values imply a stronger correlation.
Zero ACC implies the forecasts are no better than climatology.
Negative ACC implies the forecasts have a tendency to predict the opposite of what subsequently happened (though often this is a sampling issue which should just be interpreted as "no skill").

Locations with correlation significantly (95% confidence level) different from zero are highlighted by dots.
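For a single location, an anomaly correlation of this kind can be sketched as the correlation between a series of forecast anomalies and the matching series of observed anomalies. The sketch below uses the centred (Pearson) form and made-up temperatures and climate values; whether the operational charts use exactly this form is an assumption.

```python
import math


def anomalies(values, climate_mean):
    """Departures of each value from a reference climate mean."""
    return [v - climate_mean for v in values]


def anomaly_correlation(forecast_anomalies, observed_anomalies):
    """Centred correlation between forecast and observed anomalies."""
    n = len(forecast_anomalies)
    if n != len(observed_anomalies) or n < 2:
        raise ValueError("need two series of equal length with at least two values")
    f_mean = sum(forecast_anomalies) / n
    o_mean = sum(observed_anomalies) / n
    cov = sum((f - f_mean) * (o - o_mean)
              for f, o in zip(forecast_anomalies, observed_anomalies))
    f_var = sum((f - f_mean) ** 2 for f in forecast_anomalies)
    o_var = sum((o - o_mean) ** 2 for o in observed_anomalies)
    if f_var == 0.0 or o_var == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(f_var * o_var)


# Hypothetical seasonal-mean 2m temperatures (degrees C) for four years at one location.
forecast = [15.0, 16.5, 14.0, 17.0]
observed = [14.8, 16.0, 14.5, 16.9]
acc = anomaly_correlation(anomalies(forecast, 15.5), anomalies(observed, 15.4))
print(f"ACC: {acc:.2f}")
```

With only a handful of years the value is subject to large sampling uncertainty, which is why the charts mark only those locations where the correlation is significantly different from zero.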

Fig8.3.5-8: Anomaly Correlation Coefficient for 2m temperature for months 2-4 based on November runs of the seasonal model. On the chart:

Red (high ACC) e.g. over the eastern Pacific, suggests the seasonal model captures the amplitude of the variability of the 2m temperature quite well.
Grey (ACC near zero) e.g. Siberia, suggests the seasonal model is no better than climatology (i.e. doesn't capture the variability).
Cyan (negative ACC) e.g. near Newfoundland, suggests the seasonal model can be rather unreliable and misleading in this area.

(FUG associated with Cy50r1)