Cumulative distribution function and Probability density function

Cumulative distribution function (CDF)

The cumulative distribution function (CDF) gives the probability that a continuous random variable takes a value less than or equal to a given value.  Each member of the ensemble gives a different forecast value (e.g. of temperature) for a given time and location.  These results may be used to define a CDF where the x-axis is the forecast variable (e.g. temperature) and the y-axis is the number of ensemble members (expressed as a proportion of the total number of ensemble members) forecasting a value less than or equal to a given threshold.  The median value is where the CDF is 50%.
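The construction described above can be sketched in a few lines of Python with NumPy.  This is a minimal illustration only: the ten member values are hypothetical (not real forecast data), and the function names `empirical_cdf` and `prob_not_exceeding` are invented for this example.

```python
import numpy as np

def empirical_cdf(members):
    """Return sorted member values and the proportion of members <= each value."""
    values = np.sort(np.asarray(members, dtype=float))
    proportions = np.arange(1, values.size + 1) / values.size
    return values, proportions

def prob_not_exceeding(members, threshold):
    """Proportion of ensemble members forecasting a value <= threshold."""
    return float(np.mean(np.asarray(members, dtype=float) <= threshold))

# Hypothetical 10-member ensemble of 2m temperature (degC); not real data.
members = [14.2, 15.1, 15.8, 16.0, 16.3, 16.9, 17.4, 18.0, 19.2, 21.5]

values, proportions = empirical_cdf(members)  # x-axis and y-axis of the CDF plot
p20 = prob_not_exceeding(members, 20.0)       # 9 of the 10 members are <= 20 degC
median = float(np.median(members))            # value where the CDF reaches 50%
```

Plotting `proportions` against `values` as a step curve gives a CDF of the kind described here; a real ensemble simply has more members.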

Cumulative distribution function plot design

The CDF for the ensemble is constructed from the temperature or precipitation values forecast by each ensemble member (red line in the examples below).  It is plotted together with the CDF of the temperature or rainfall M-climate (black line) for that location and for the date in question.

For temperature:

  • Some values are positive (i.e. warmer than the M-climate), and some in the tails of the plot can be extremely positive; some are negative (i.e. colder than the M-climate), and some in the tails of the plot can be extremely negative.  The x-axis on the diagrams starts from the overall minimum encountered within all the displayed CDFs (M-climate and ensemble forecasts).

For precipitation:

  • Some values are positive (i.e. wetter than the M-climate), and some in the upper tail of the plot can be extremely positive.  For the vast majority of the world the lower tail of the M-climate equates to zero precipitation for the day, and it is impossible to get less than zero precipitation.
  • The x-axis on the diagrams starts from zero.


Strictly, the M-climate varies a little with lead time (i.e. between the different coloured curves).  This is partly due to model drift and under-sampling, and is more pronounced in spring and autumn when day-to-day climatic changes are at their greatest (see Limitations of twice weekly updates to the M-Climate).  In spite of such variations it is still reasonable, helpful and recommended to compare the M-climate curve (black) with all the coloured curves, even though the comparison is strictly valid only for the lead time that the M-climate curve represents (i.e. the red curve).  Note, incidentally, that the M-climate, as used here and on meteograms, is now based on re-forecasts initialised from ERA5 data.  This gives higher quality output, and greater compatibility with actual forecasts, than was the case when the re-forecasts were initialised from ERA-Interim data (i.e. before model cycle 46r1 was introduced in June 2019).


Note: a zone of the ensemble forecast curve (red) with a steep slope implies that more ensemble members have forecast similar values, and indicates higher probabilities (assuming the forecasts are unbiased) for the values indicated.  Conversely, a shallow slope in the forecast curve (red) implies greater diversity among ensemble member forecasts and hence lower probabilities for the values indicated.

Simplistically:

  • Steep slope implies higher confidence.
  • Shallow slope implies lower confidence.
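The link between slope and probability can be made concrete: the probability of a value falling within an interval is the rise of the CDF across that interval.  The sketch below assumes a hypothetical ten-member ensemble (not real data), and `prob_in_interval` is a name invented for this illustration.

```python
import numpy as np

def prob_in_interval(members, low, high):
    """Proportion of members with low < value <= high, i.e. CDF(high) - CDF(low)."""
    m = np.asarray(members, dtype=float)
    return float(np.mean(m <= high) - np.mean(m <= low))

# Hypothetical ensemble (degC): most members cluster near 16-17 degC,
# a few are spread over colder and warmer values.
members = [12.0, 15.9, 16.1, 16.2, 16.4, 16.5, 16.8, 17.0, 19.5, 23.0]

# Two intervals of equal width (2 degC).
steep = prob_in_interval(members, 15.5, 17.5)    # CDF rises sharply here
shallow = prob_in_interval(members, 17.5, 19.5)  # CDF is nearly flat here
```

With these values 7 of the 10 members fall in the first interval and only 1 in the second, so the steep part of the curve carries the higher probability.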


Fig8.1.9.1-1: The cumulative distribution function (CDF) shows the probability of not exceeding a threshold value (e.g. not exceeding 20°C).  The figure is a schematic explanation of the principle behind the Extreme Forecast Index (EFI).  The blue line shows the cumulative probability of temperatures from the M-climate for a given location, time of year and forecast lead time.  The red line shows the corresponding cumulative probability of temperatures from the ensemble.  The EFI is measured by the area between the CDF of the M-climate (blue) and the CDF of the ensemble (red).  Almost all the ensemble forecast temperatures are above the M-climate median and about 15% are above the M-climate maximum.  In this case, the EFI is positive (the red line is to the right of the blue line), indicating higher than normal probabilities of warm anomalies.
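The "area between the CDFs" idea can be illustrated with a simplified, unweighted calculation.  This is not the operational EFI, which is understood to weight the tails of the distribution more heavily and to be scaled to lie between -1 and +1; neither feature is reproduced here.  The sample values and function names are hypothetical.

```python
import numpy as np

def ecdf_at(sample, x):
    """Empirical CDF of `sample` evaluated at the points `x`."""
    s = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(s, x, side="right") / s.size

def signed_area_between_cdfs(climate, forecast):
    """Integral over x of (climate CDF - forecast CDF).

    Positive when the forecast CDF lies to the right of (warmer than) the
    climate CDF.  Both CDFs are step functions, so the sum is exact.
    """
    grid = np.sort(np.concatenate([climate, forecast]).astype(float))
    left = grid[:-1]                  # left end of each interval
    widths = np.diff(grid)            # interval widths
    diff = ecdf_at(climate, left) - ecdf_at(forecast, left)
    return float(np.sum(diff * widths))

# Hypothetical M-climate and ensemble temperature samples (degC); not real data.
climate = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
forecast = np.array([17.0, 18.0, 19.0, 20.0, 21.0])

area = signed_area_between_cdfs(climate, forecast)
```

For this unweighted version the signed area equals the difference between the forecast mean and the climate mean, which is one reason a tail-sensitive weighting is useful for highlighting extremes.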


Probability Density Function (PDF)

The Probability Density Function (PDF) is the first derivative of the cumulative distribution function (CDF).
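The derivative relationship can be checked numerically.  The sketch below assumes a smooth logistic-shaped CDF purely for illustration (the centre `mu` and scale `s` are arbitrary, and real ensemble CDFs are step-like and would usually need smoothing first), then differentiates it with `numpy.gradient`.

```python
import numpy as np

# Assumed illustrative parameters: centre (degC) and scale of a logistic CDF.
mu, s = 15.0, 2.0

x = np.linspace(0.0, 30.0, 3001)            # temperature axis, 0.01 degC spacing
cdf = 1.0 / (1.0 + np.exp(-(x - mu) / s))   # smooth CDF rising from 0 to 1

pdf_numeric = np.gradient(cdf, x)           # first derivative of the CDF
pdf_exact = cdf * (1.0 - cdf) / s           # analytic logistic PDF, for comparison

peak_x = float(x[np.argmax(pdf_numeric)])   # PDF peak = where the CDF is steepest
```

The numerical derivative matches the analytic PDF closely, and its peak sits at `mu`, where the CDF is steepest.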

Fig8.1.9.1-2(left): Example cumulative distribution function (CDF).

Fig8.1.9.1-2(right): The probability density function (PDF) is defined as the first derivative of the CDF.  The graphs correspond to the example CDF curves in Fig8.1.9.1-2(left) with the temperature M-climate (blue) and the forecast distribution (red).

Dotted lines show the median for the M-Climate and forecast.

The median, or any other percentile, is found by reading the point on the x-axis where the horizontal line for that probability intersects the curve.  The most likely values are those where the CDF is steepest.  Similarly, the PDF shows peaks in the curve at the highest probability intervals.  The EFI can be understood and interpreted with both the CDF and PDF in mind; the former relates to the EFI value, the latter clarifies the connection to probabilities.  A steep slope of the CDF, or equivalently a narrow peak of the PDF, implies high confidence in the forecast.

In the upper frames of Fig8.1.9.1-2 the peak of the forecast PDF (red) is to the right of the peak of the M-climate PDF (blue), indicating that the forecast predicts warmer than normal conditions and the sharpness of the peak indicates fairly high probability.

In the lower frames of Fig8.1.9.1-2 the peak of the forecast PDF (red) is to the left of the peak of the M-climate PDF (blue), indicating that the forecast predicts colder than normal conditions and the sharpness of the peak indicates high probability.


Bi-modality

Sometimes the distribution of possible outcomes can have two favoured solutions.  This is called "bimodality".  On a PDF this is clearly shown by two peaks.  On a CDF curve it will be denoted by a step. A scenario in which one can sometimes see bimodal solutions is for the maximum wind gust parameter, close to the track of an active, small scale frontal wave cyclone.  North of the track relatively light winds are favoured whilst south of the track very strong winds are favoured.  Values in between may be less likely overall.

Fig8.1.9.1-3: This example probability density function (PDF) diagram shows ensemble members that are widely distributed but cluster around two distinct, more likely wind speeds: one set suggests a most probable wind speed centred around the peak at W1, and a second set suggests a probable wind speed centred around the peak at W2.  The CDF associated with the example PDF shows the probability (i.e. the proportion of ensemble members) of not exceeding each wind speed.

The third diagram shows ensemble forecast and M-Climate CDFs for maximum wind gusts at 45.9°N 45.28°W during the period Saturday 24 March 2018 00 UTC to Sunday 25 March 2018 00 UTC.  The traces show CDFs at this location from a series of recent ensemble forecasts for this period, and the black line is the CDF from the M-climate.  The red (most recent) trace shows a flat interval (at about 57% probability of not exceeding 20m/s gusts), indicating a bi-modal structure of the PDF.


A steep CDF trace corresponds to several ensemble members having similar forecast values and therefore to a peak in the PDF trace; both indicate higher confidence around that value.  The CDF increases most rapidly at W1, which corresponds to the first peak of the PDF at W1.  The CDF trace then flattens out, showing only a few additional ensemble members spread over slightly higher wind speeds, and the PDF therefore falls to a low value.  Finally at W2 the CDF trace steepens again around a higher value of wind speed, as an increasing number of ensemble members forecast the higher wind speeds, giving a second peak in the PDF.  The PDF peak at W2 is not as high as at W1 because the CDF slope at W2 is less than at W1.
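The correspondence between a flat interval in the CDF and a gap between two PDF peaks can be sketched with a binned ensemble.  The 20 gust values, the 2 m/s bin width and the variable names below are all hypothetical, and the method assumes the two clusters are separated by at least one empty bin.

```python
import numpy as np

# Hypothetical 20-member ensemble of maximum wind gust (m/s); not real data.
gusts = np.array([11, 12, 12, 13, 13, 13, 14, 14, 15, 16, 17,
                  27, 28, 29, 29, 30, 30, 31, 32, 33], dtype=float)

edges = np.arange(10.0, 36.0, 2.0)              # 2 m/s bins from 10 to 34 m/s
counts, _ = np.histogram(gusts, bins=edges)
centres = 0.5 * (edges[:-1] + edges[1:])

pdf = counts / (gusts.size * np.diff(edges))    # binned density estimate
cdf = np.cumsum(counts) / gusts.size            # CDF at the right-hand bin edges

# Empty bins are where the PDF drops to zero and the CDF stays flat.
empty = counts == 0
flat_level = float(cdf[empty][0])               # CDF value along the flat interval
gap_start = float(edges[:-1][empty][0])         # lower end of the flat interval
gap_end = float(edges[1:][empty][-1])           # upper end of the flat interval

# Most populated bin on each side of the gap (analogous to W1 and W2).
w1 = float(centres[np.argmax(np.where(centres < gap_start, counts, -1))])
w2 = float(centres[np.argmax(np.where(centres > gap_end, counts, -1))])
```

Here the CDF is flat at 55% between 18 and 26 m/s, and the lower peak is higher than the upper one because more members fall in its most populated bin, mirroring the W1 and W2 description above.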

Bi-modal patterns can occur, for example, when there is uncertainty whether a depression will pass one side or the other of the location in question.  The diagrams say nothing about the direction of the winds (e.g. they may be moderate easterlies to the north of the location or strong westerlies to the south (N Hem)) nor about timing of the depression (e.g. it may be slower or faster).  The diagrams only give information on the variation among the ensemble member solutions. 

Estimation of the mean value from a cumulative distribution function

It is also possible to estimate, graphically, the mean value of ensemble forecasts (or the model climate) from a CDF, using the method shown on Fig8.1.9.1-4.

Fig8.1.9.1-4: Estimating the mean value for a CDF graphically, using a 2m temperature example.  The mean value of a set of ensemble forecast results may be obtained by adjusting a vertical line V laterally until the area A above the CDF curve equals the area B below the curve. In this example the mean V for the black (M-Climate) profile is slightly above the median (where the y-axis probability = 50%), implying some skew to the distribution (related to the longer positive tail).  The same approach could be used to estimate the mean for any of the coloured (forecast) CDFs shown.
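The equal-area construction can be reproduced numerically.  The sketch below assumes that the two areas are the area under the CDF to the left of the vertical line V and the area above the CDF (up to 100%) to the right of V; which of these the figure labels A and which B is not assumed.  The skewed CDF used here is a hypothetical exponential shape chosen only because its mean and median are known exactly.

```python
import numpy as np

def mean_from_cdf(x, cdf):
    """Find V where the area under the CDF left of V equals the area
    above the CDF (up to 1) right of V.  Uses trapezoidal areas."""
    dx = np.diff(x)
    under = 0.5 * (cdf[1:] + cdf[:-1]) * dx            # area under CDF per interval
    over = 0.5 * ((1 - cdf[1:]) + (1 - cdf[:-1])) * dx # area above CDF per interval
    area_left = np.concatenate([[0.0], np.cumsum(under)])
    over_cum = np.concatenate([[0.0], np.cumsum(over)])
    area_right = over_cum[-1] - over_cum
    balance = area_left - area_right                   # increases steadily with V
    return float(np.interp(0.0, balance, x))           # V where the areas match

# Hypothetical skewed CDF with a long positive tail (values in degC):
# exponential shape starting at x0 with scale 3, so mean = 13 and
# median = 10 + 3*ln(2), about 12.08.
x0, scale = 10.0, 3.0
x = np.linspace(x0, 70.0, 6001)
cdf = 1.0 - np.exp(-(x - x0) / scale)

mean_est = mean_from_cdf(x, cdf)
median_est = float(np.interp(0.5, cdf, x))             # where the CDF is 50%
```

The estimated mean lies above the median, as expected for a distribution with a longer positive tail.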