
A large volume of data (hundreds of terabytes) is downloaded from the CDS every day.

On this page we summarise:

  • The reasons underpinning CDS queuing times
  • The size of the CEMS-Flood datasets stored on MARS and accessible through the CDS
  • The best practices to maximise efficiency and minimise waiting time.


CDS Request times

CDS queuing can be monitored from the 'Your requests' page or using the 'CDS Live' dashboard.

CDS retrieval times can vary significantly depending on the number of requests the CDS is handling at any one time, and on the following factors that affect EFAS and GloFAS:

  • The priority of the dataset in question
  • The size of the request
  • The number of requests submitted by a user
  • The number of requests to retrieve data from ECMWF Archive
  • The number of requests requesting a specific dataset
  • The number of active slots
  • The size of the overall queue.

The CDS strives to deliver data as fast as possible; however, it is not an operational service and should not be relied upon to deliver data in real time as it is produced.
Here we try to give some context on why requests can take time:

Data for the CEMS-Flood (EFAS and GloFAS) datasets are held within MARS at ECMWF.
The MARS service is designed for the retrieval of GRIB files and is based on a disk-cache and tape storage architecture.

The most recent data are held in the disk cache, while all data remain available from tape.
When a user requests data, the CDS queues the request according to its own queuing priorities, using the factors described above.

Once the job becomes eligible it is passed to the MARS Service at ECMWF for extraction of the relevant fields.

It is only at this point that your job is shown as 'Running'.
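
For reference, below is a minimal sketch of a single API request, assuming the cdsapi package is installed and a CDS API key is configured. The keyword values are illustrative only; additional keywords such as system_version or hydrological_model may be required and should be copied from the catalogue form.

import cdsapi

# Minimal single-request sketch (illustrative keyword values; check the
# CDS catalogue form for valid combinations and any required extra keywords).
client = cdsapi.Client()

client.retrieve(
    "cems-glofas-forecast",
    {
        "variable": "river_discharge_in_the_last_24_hours",
        "product_type": "control_forecast",
        "year": "2021",
        "month": "01",
        "day": "01",
        "leadtime_hour": "24",
        "format": "grib",
    },
    "glofas_forecast_example.grib",
)
# The call blocks while the request is 'Queued' and only returns once the
# job has run on MARS and the file has been downloaded.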

Selecting a sub-area of data does not mean that the whole globe is not being retrieved. Each time step of each date of each variable is classed as an individual GRIB field.

MARS extracts sub-areas by retrieving the global grid, cropping it to the requested area and returning only that area.
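
As an illustration, the sketch below adds a sub-area selection to the same kind of request; it assumes the dataset accepts the 'area' keyword, given as [North, West, South, East] in degrees, and the bounding box is just an example. Because the crop happens server-side after the global field has been read, a sub-area mainly reduces the volume you download rather than the extraction time.

import cdsapi

# Sub-area request sketch (assumes the dataset accepts the 'area' keyword).
# MARS still reads the full global field from disk or tape; the crop is
# applied server-side before the smaller file is returned.
client = cdsapi.Client()

client.retrieve(
    "cems-glofas-forecast",
    {
        "variable": "river_discharge_in_the_last_24_hours",
        "product_type": "control_forecast",
        "year": "2021",
        "month": "01",
        "day": "01",
        "leadtime_hour": "24",
        "format": "grib",
        "area": [60, -10, 35, 30],  # illustrative [North, West, South, East]
    },
    "glofas_forecast_subset.grib",
)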

MARS, as a separate service, also has constraints on its workload and its own QoS limits that apply to data-retrieval jobs, because it is shared between operational services (e.g. producing ERA5 and GloFAS) and non-operational services such as the CDS.


From time to time the CDS experiences periods of high user activity and increased queuing times, even for small requests. At such times we ask you to kindly wait for the queue to be processed, as there is a fixed number of slots that cannot be increased.

Figure 1 shows a period of high user activity. GloFAS and EFAS products are served by the adaptor.mars.external service; you can see that the number of active users (blue line) is well above the green line of 50 slots allocated to GloFAS and EFAS requests. When the blue line falls back below the green line, the total number of queued users starts decreasing until eventually there is no queuing time for any user request.


Figure 1 - Active and queued users for the adaptor.mars.external service during a period of high user activity.

CEMS-Flood data on MARS

Table 1 - CEMS-Flood datasets held on MARS and their availability on disk

Dataset | CDS Catalogue Form | Overall Size | Days on Disk
GloFAS climatology | https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-historical?tab=overview | 86 GB | 30
GloFAS forecast | https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-forecast | 7.6 TB | 15
GloFAS reforecast | https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-reforecast | 6 TB | 0 (unless recently requested)
GloFAS seasonal forecast | https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-seasonal | 0.5 TB | 10 days for most recent forecast
GloFAS seasonal reforecast | https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-seasonal-reforecast | 8.75 TB | 0 (unless recently requested)
EFAS climatology | https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-historical | 817 GB | 30
EFAS forecast | https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-forecast | 26.49 TB | 10
EFAS reforecast | https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-reforecast | 25.95 TB | 0 (unless recently requested)
EFAS seasonal forecast | https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-seasonal | 0.5 TB | 10 days for most recent forecast
EFAS seasonal reforecast | https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-seasonal-reforecast | 13.25 TB | 0 (unless recently requested)


Request strategy

Table 2 - Summary of API field limits and recommended request strategies

Dataset | API field limits | Downloaded data size | Request strategy | Link to example script
GloFAS climatology | 500 | 2 GB | Loop over years (see the sketch below this table) |
GloFAS forecast | 60 | 8.1 GB | Loop over years, months, days |
GloFAS reforecast | 950 | 32 GB | Loop over months, days; subset to ROI |
GloFAS seasonal forecast | 125 | 31.5 GB | Loop over years, months; subset to ROI |
GloFAS seasonal reforecast | 125 | 31.5 GB | Loop over years, months; subset to ROI |
EFAS climatology | 1000 | to be confirmed | to be confirmed | to be confirmed
EFAS forecast | 1000 | to be confirmed | to be confirmed | to be confirmed
EFAS reforecast | 200 | to be confirmed | to be confirmed | to be confirmed
EFAS seasonal forecast | 220 | to be confirmed | to be confirmed | to be confirmed
EFAS seasonal reforecast | 220 | to be confirmed | to be confirmed | to be confirmed
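
As an example of the 'loop over years' strategy from Table 2, the sketch below issues one request per year for the GloFAS climatology. The dataset name comes from Table 1; the remaining keyword names and values are illustrative, and further keywords (for example system_version) may be required and should be copied from the catalogue form.

import calendar
import cdsapi

# "Loop over years" sketch for the GloFAS climatology (cems-glofas-historical):
# one request per year keeps each request well below the API field limit.
client = cdsapi.Client()

MONTHS = [calendar.month_name[m].lower() for m in range(1, 13)]
DAYS = ["%02d" % d for d in range(1, 32)]

for year in range(2000, 2005):  # illustrative range of years
    client.retrieve(
        "cems-glofas-historical",
        {
            "variable": "river_discharge_in_the_last_24_hours",
            "hyear": str(year),
            "hmonth": MONTHS,
            "hday": DAYS,
            "format": "grib",
        },
        f"glofas_climatology_{year}.grib",
    )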

Speed up retrieval through concurrency:

Whilst submitting multiple requests can improve download time, overloading the system with too many requests will eventually slow down overall system performance.

Indeed, the CDS penalises users who submit too many requests by decreasing the priority of their requests.

Too many parallel requests could eventually result in a slower overall download time.

For this reason we suggest limiting yourself to a maximum of 10 parallel requests.


Example code: download 20 years of GloFAS reforecasts using 10 threads.

import cdsapi
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings("ignore")

# Lead times every 24 hours out to 1104 hours
LEADTIMES = ["%d" % (l) for l in range(24, 1128, 24)]
# Reforecast years 1999-2018 (20 years)
YEARS = ["%d" % (y) for y in range(1999, 2019)]


def get_dates(start=[2019, 1, 1], end=[2019, 12, 31]):
    """Return the Monday and Thursday reforecast dates as (month, day) pairs,
    e.g. ("january", "03")."""
    start, end = datetime(*start), datetime(*end)
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    dates = [
        list(map(str.lower, d.strftime("%B-%d").split("-")))
        for d in days
        if d.weekday() in [0, 3]  # 0 = Monday, 3 = Thursday
    ]
    return dates


DATES = get_dates()

def retrieve(client, request, date):
    """Download the reforecast for a single (month, day) pair."""
    month = date[0]
    day = date[1]
    print(f"requesting month: {month}, day: {day}")
    request.update({"hmonth": month, "hday": day})
    client.retrieve(
        "cems-glofas-reforecast", request, f"glofas_reforecast_{month}_{day}.grib"
    )
    return f"retrieved month: {month}, day: {day}"


def main(request):
    "concurrent requests using 10 threads"
    client = cdsapi.Client()
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [
            executor.submit(retrieve, client, request.copy(), date) for date in DATES
        ]
        for f in as_completed(futures):
            try:
                print(f.result())
            except Exception as exc:
                print(f"could not retrieve: {exc}")


if __name__ == "__main__":

    # Base request; "hmonth" and "hday" are filled in per date by retrieve()
    request = {
        "system_version": "version_2_2",
        "variable": "river_discharge_in_the_last_24_hours",
        "format": "grib",
        "hydrological_model": "htessel_lisflood",
        "product_type": "control_reforecast",
        "hyear": YEARS,
        "hmonth": "",
        "hday": "",
        "leadtime_hour": LEADTIMES,
    }

    main(request)
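
Note that each worker is given its own copy of the request dictionary (request.copy()), so the per-date hmonth and hday updates do not interfere between threads, and max_workers=10 keeps the script within the suggested limit of 10 parallel requests.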


