CDS - Best Practices

Page under construction

A large volume of data (hundreds of TB) is downloaded from the CDS every day.

In this page we summarise:

The reasons underpinning the CDS queuing time
The size of the CEMS-Flood datasets stored on MARS and accessible through the CDS
The best practices to maximise efficiency and minimise waiting time

CDS Request times

CDS queuing can be monitored from the 'Your Requests' page or using 'CDS Live'.

CDS retrieval times can vary significantly depending on the number of requests the CDS is handling at any one time, and on the following factors, which affect EFAS and GloFAS requests:

The priority of the dataset in question
The size of the request
The number of requests submitted by a user
The number of requests to retrieve data from ECMWF Archive
The number of requests requesting a specific dataset
The number of active slots
The size of the overall queue.

The CDS strives to deliver data as fast as possible. However, it is not an operational service and should not be relied upon to deliver data in real time as it is produced.
Here we give some context on why requests can take time:

Data for the CEMS-Flood (EFAS and GloFAS) datasets are held within MARS at ECMWF.
The MARS service is a system designed for requesting GRIB files, based on a disk cache and tape storage architecture.

The most recent data is held on the disk cache, while all data is available from tape.
When a user requests data, the CDS queues the request according to its own queueing priorities, using the factors described above.

Once the job becomes eligible, it is passed to the MARS service at ECMWF for extraction of the relevant fields.

It is only at this point that you will see your job as 'Running'.

Selecting a sub-area does not mean that less than the whole globe is retrieved. Each timestep of each date of each variable is classed as an individual GRIB field.

MARS extracts sub-areas by retrieving the global grid, cropping it, and returning the requested area.
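
The field count of a request can be estimated before submitting it. The sketch below is only an illustration: `count_fields` and `FIELD_KEYS` are hypothetical names, and the request keys are borrowed from the GloFAS reforecast example further down this page. It multiplies the number of values selected for each key and shows that adding an area selection leaves the count unchanged.

```python
from math import prod

# Keys assumed to multiply the number of GRIB fields (an assumption for this sketch).
FIELD_KEYS = ["variable", "hyear", "hmonth", "hday", "leadtime_hour"]


def count_fields(request, field_keys=FIELD_KEYS):
    """Estimate the GRIB field count as the product of selected values per key."""
    sizes = []
    for key in field_keys:
        value = request[key]
        # A single string is one value; a list contributes one value per element.
        sizes.append(1 if isinstance(value, str) else len(value))
    return prod(sizes)


request = {
    "variable": "river_discharge_in_the_last_24_hours",
    "hyear": [str(y) for y in range(1999, 2019)],            # 20 years
    "hmonth": "january",
    "hday": "03",
    "leadtime_hour": [str(h) for h in range(24, 1128, 24)],  # 46 lead times
}

n = count_fields(request)
print(n)  # 20 years x 46 lead times = 920

# An area selection crops each field but does not reduce the number of fields.
n_with_area = count_fields({**request, "area": [50, 5, 45, 10]})
print(n_with_area)
```

Comparing this estimate with the API field limits in Table 2 gives a quick indication of whether a request needs to be split.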

MARS is a separate service with its own workload constraints and its own QoS limits on data jobs, because it is shared between operational services (e.g. those producing ERA5 and GloFAS) and non-operational services such as the CDS.

CEMS-Flood data on MARS

Table 1 -

Dataset	CDS Catalogue Form	Overall Size	Days on Disk
GloFAS climatology	https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-historical?tab=overview	86 GB	30
GloFAS forecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-forecast	7.6 TB	15
GloFAS reforecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-reforecast	6 TB	0 (unless recently requested)
GloFAS seasonal forecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-seasonal	0.5 TB	10 days for most recent forecast
GloFAS seasonal reforecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-seasonal-reforecast	8.75 TB	0 (unless recently requested)
EFAS climatology	https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-historical	817 GB	30
EFAS forecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-forecast	26.49 TB	10
EFAS reforecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-reforecast	25.95 TB	0 (unless recently requested)
EFAS seasonal forecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-seasonal	0.5 TB	10 days for most recent forecast
EFAS seasonal reforecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-seasonal-reforecast	13.25 TB	0 (unless recently requested)
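
One possible reading of the 'Days on Disk' column is that a forecast issued within that many days is likely still on the disk cache, while older forecasts must come from tape. The sketch below encodes that reading for the two forecast datasets only. `DAYS_ON_DISK` and `probably_on_disk` are hypothetical names, and the inclusive cut-off is an assumption, not documented CDS or MARS behaviour.

```python
from datetime import date

# Values copied from Table 1; the dictionary keys are the CDS dataset names in the catalogue URLs.
DAYS_ON_DISK = {
    "cems-glofas-forecast": 15,
    "efas-forecast": 10,
}


def probably_on_disk(dataset, issue_date, today):
    """Guess whether a forecast is still on disk cache (assumed inclusive cut-off)."""
    age_days = (today - issue_date).days
    return 0 <= age_days <= DAYS_ON_DISK[dataset]


today = date(2024, 1, 20)
recent = probably_on_disk("cems-glofas-forecast", date(2024, 1, 10), today)
old = probably_on_disk("efas-forecast", date(2024, 1, 1), today)
print(recent, old)
```

Requests for data that is probably on tape can be expected to take longer, so it may help to group them separately from requests for recent forecasts.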

Request strategy

Table 2 - Summary

Dataset	CDS Catalogue Form	API field limits	Final data size	Request strategy	Link to example script
GloFAS climatology	https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-historical?tab=overview	500	2 GB	Loop over years	API request
GloFAS forecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-forecast	60	8.1 GB	Loop over years, months, days	API request
GloFAS reforecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-reforecast	950	32 GB	Loop over months, days; subset to region of interest (ROI)	API request
GloFAS seasonal forecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-seasonal	125	31.5 GB	Loop over years, months; subset to ROI	API request
GloFAS seasonal reforecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-seasonal-reforecast	125	31.5 GB	Loop over years, months; subset to ROI	API request
EFAS climatology	https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-historical	to be confirmed	to be confirmed	to be confirmed	to be confirmed
EFAS forecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-forecast	to be confirmed	to be confirmed	to be confirmed	to be confirmed
EFAS reforecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-reforecast	to be confirmed	to be confirmed	to be confirmed	to be confirmed
EFAS seasonal forecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-seasonal	to be confirmed	to be confirmed	to be confirmed	to be confirmed
EFAS seasonal reforecast	https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-seasonal-reforecast	to be confirmed	to be confirmed	to be confirmed	to be confirmed
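
The 'Loop over ...' strategies in Table 2 amount to splitting one large request into many small ones, each covering a single combination of the looped keys. A minimal sketch of that idea follows. `split_request` is a hypothetical helper, the `year` and `month` keys are placeholders (the real key names depend on the dataset's catalogue form), and the `area` order of north, west, south, east is an assumption to check against the form.

```python
from itertools import product


def split_request(base, loop_values, area=None):
    """Yield one request per combination of the looped keys."""
    keys = list(loop_values)
    for combo in product(*(loop_values[k] for k in keys)):
        request = dict(base)               # copy so the base request is not modified
        request.update(zip(keys, combo))   # one value per looped key
        if area is not None:
            request["area"] = list(area)   # assumed order: north, west, south, east
        yield request


base = {
    "variable": "river_discharge_in_the_last_24_hours",
    "format": "grib",
}
loop_values = {
    "year": ["2020", "2021"],       # placeholder key name
    "month": ["01", "02", "03"],    # placeholder key name
}

requests = list(split_request(base, loop_values, area=(50, 5, 45, 10)))
print(len(requests))  # 2 years x 3 months = 6
print(requests[0])
```

Each generated dictionary can then be submitted as its own request, which keeps every request below the API field limit and produces files of a manageable size.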

Speed up retrieval through concurrency

The CDS enforces a per-user limit on the number of requests that can be processed in parallel: at most 10 requests can run at the same time. There are also 'global limits' that can affect user requests. More information here.

Submitting multiple requests can improve a task's position in the queuing system, but overloading the system with requests will eventually slow down overall system performance, and too many parallel requests can result in a slower overall download time.

For this reason we suggest submitting no more than 10 parallel requests.

Example code: download 20 years of GloFAS reforecasts using 10 threads.

import cdsapi
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings("ignore")

LEADTIMES = ["%d" % (l) for l in range(24, 1128, 24)]
YEARS = ["%d" % (y) for y in range(1999, 2019)]


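def get_dates(start=(2019, 1, 1), end=(2019, 12, 31)):
    """Return [month_name, day] pairs for every Monday and Thursday in the range."""
    start, end = datetime(*start), datetime(*end)
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    dates = [
        # e.g. datetime(2019, 1, 3) -> ["january", "03"]
        list(map(str.lower, d.strftime("%B-%d").split("-")))
        for d in days
        if d.weekday() in [0, 3]  # 0 = Monday, 3 = Thursday
    ]
    return dates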


DATES = get_dates()

def retrieve(client, request, date):
    """Download the reforecasts for one (month, day) pair into its own GRIB file."""
    month, day = date
    print(f"requesting month: {month}, day: {day}")
    request.update({"hmonth": month, "hday": day})
    client.retrieve(
        "cems-glofas-reforecast", request, f"glofas_reforecast_{month}_{day}.grib"
    )
    return f"retrieved month: {month}, day: {day}"


def main(request):
    """Submit one request per date, with at most 10 running concurrently."""
    client = cdsapi.Client()
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [
            executor.submit(retrieve, client, request.copy(), date) for date in DATES
        ]
        for f in as_completed(futures):
            try:
                print(f.result())
            except Exception as exc:
                # Report the failure and carry on with the remaining downloads.
                print(f"could not retrieve: {exc}")


if __name__ == "__main__":

    request = {
        "system_version": "version_2_2",
        "variable": "river_discharge_in_the_last_24_hours",
        "format": "grib",
        "hydrological_model": "htessel_lisflood",
        "product_type": "control_reforecast",
        "hyear": YEARS,
        "hmonth": "",
        "hday": "",
        "leadtime_hour": LEADTIMES,
    }

    main(request)
