
Explicit time limit honoured

From 2022-12-?? ECMWF will enforce the wall-time limit, killing jobs once they reach it, whenever the "#SBATCH --time=" directive or the --time command line option was provided with the job.
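As a minimal sketch, a job requesting an explicit wall-time limit could look like the following; the job name, QoS, geometry and executable are illustrative placeholders, not ECMWF defaults:

Code Block
#!/bin/bash
# Illustrative job script: resources and names are placeholders.
#SBATCH --job-name=test
#SBATCH --qos=np
#SBATCH --nodes=2
#SBATCH --time=00:05:00   # explicit wall-time limit; the job is killed once it is reached

srun ./my_executable

The same limit can equivalently be passed on the command line, e.g. "sbatch --time=00:05:00 job.sh".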

Alternatively, ECMWF accepts jobs without the "#SBATCH --time=" directive or the --time command line option. In that case ECMWF will instead use the average runtime of previous "similar" jobs, grouping them by a job tag generated from the user, job name, job geometry and job output path.

For new jobs we will assume a 24h runtime and allow another 24h of grace time, so a new job may run for up to 48h. After the first successful run, the limit will be the average plus one standard deviation of the previous 20 successful runtimes of the job, plus a 24h grace time.

Example:

Code Block
[ECMWF-INFO -sbatch] - ------------------------------
[ECMWF-INFO -sbatch] - jobtag: sycw-test-2x512-/home/sycw/slurm/slurm-_JOBID_.out
[ECMWF-INFO -sbatch] - ------------------------------
[ECMWF-INFO -sbatch] - Average Walltime 279 with a Standard Deviation 515
[ECMWF-INFO -sbatch] - Runtime history
[ECMWF-INFO -sbatch] -   Date               | Cores   CPUTime   Walltime       Mem
[ECMWF-INFO -sbatch] -   02.11.2022 - 13:45 | 512        0          1306       200M      
[ECMWF-INFO -sbatch] -   02.11.2022 - 11:27 | 512        0          1306       200M      
[ECMWF-INFO -sbatch] -   31.10.2022 - 10:21 | 512        0          1306       200M      
[ECMWF-INFO -sbatch] -   28.10.2022 - 14:39 | 512        0          1306       200M      
[ECMWF-INFO -sbatch] -   28.10.2022 - 12:28 | 512        0          135        200M      
[ECMWF-INFO -sbatch] -   25.10.2022 - 15:19 | 512        0          5          200M      
[ECMWF-INFO -sbatch] -   25.10.2022 - 15:18 | 512        0          6          200M      
[ECMWF-INFO -sbatch] -   25.10.2022 - 15:18 | 512        0          6          200M      
[ECMWF-INFO -sbatch] -   25.10.2022 - 15:05 | 512        0          7          200M      
[ECMWF-INFO -sbatch] -   20.10.2022 - 08:48 | 512        0          136        200M      
[ECMWF-INFO -sbatch] -   19.10.2022 - 09:44 | 512        2560       5          200M      
[ECMWF-INFO -sbatch] -   19.10.2022 - 09:43 | 512        3072       6          200M      
[ECMWF-INFO -sbatch] -   19.10.2022 - 09:37 | 512        3072       6          200M      
[ECMWF-INFO -sbatch] -   18.10.2022 - 08:52 | 512        3072       6          200M      
[ECMWF-INFO -sbatch] -   18.10.2022 - 08:52 | 512        3072       6          200M      
[ECMWF-INFO -sbatch] -   18.10.2022 - 08:52 | 512        2560       5          200M      
[ECMWF-INFO -sbatch] -   18.10.2022 - 08:52 | 512        4096       8          200M      
[ECMWF-INFO -sbatch] -   18.10.2022 - 08:52 | 512        4096       8          200M      
[ECMWF-INFO -sbatch] -   18.10.2022 - 08:52 | 512        4096       8          200M      
[ECMWF-INFO -sbatch] -   18.10.2022 - 08:51 | 512        3072       6          200M
[ECMWF-INFO -sbatch] - ['/usr/bin/sbatch', '--job-name=test', '--nodes=2', '--qos=np', '--time=00:05', '--mem-per-cpu=100', '--export=EC_user_time_limit=00:05', '/home/sycw/slurm/time.job']
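As a rough sketch of how such an estimate could be derived from the Walltime column in the history above, the snippet below computes the mean plus one population standard deviation and adds the 24h grace period. This is purely illustrative and is not ECMWF's actual tooling; the walltime values are simply copied from the example output:

Code Block
#!/bin/bash
# Illustrative only: estimate a time limit as mean + 1 standard deviation
# of previous walltimes (in seconds), plus a 24h grace period.
walltimes="1306 1306 1306 1306 135 5 6 6 7 136 5 6 6 6 6 5 8 8 8 6"
echo $walltimes | tr ' ' '\n' | awk '
  { sum += $1; sumsq += $1*$1; n++ }
  END {
    mean  = sum / n
    sd    = sqrt(sumsq / n - mean * mean)
    grace = 24 * 3600
    # mean ~279 and sd ~515 reproduce the values reported by ECMWF-INFO above
    printf "mean %.0f s, stddev %.0f s, estimated limit %.0f s + %d s grace\n", mean, sd, mean + sd, grace
  }'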

For more detail, please refer to HPC2020: Job Runtime Management.

New maximum memory limit per node in parallel jobs

A maximum of 240 GB of memory per node will be enforced from . This avoids potential out-of-memory situations, ensuring enough memory is left for the Operating System and other critical services running on those nodes.

Any parallel job explicitly asking for more than 240 GB with the #SBATCH --mem directive or the --mem command line option will fail at submission time with the message:

No Format
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
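For instance, a submission along the following lines (all values illustrative) would be rejected with that error, since it asks for more than 240 GB per node:

Code Block
# Illustrative only: a --mem request above 240 GB in a parallel job is rejected at submission.
sbatch --job-name=test --nodes=2 --qos=np --mem=300G job.sh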

Since parallel jobs are assigned nodes exclusively, and can therefore use all the memory available on those nodes, it is usually easier to avoid setting that option altogether in parallel jobs.
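A minimal sketch of a parallel job relying on the full node memory instead (job name, QoS, geometry and executable are placeholders):

Code Block
#!/bin/bash
# Illustrative parallel job: no --mem request, so the job can use all the
# memory available on its exclusively assigned nodes.
#SBATCH --job-name=test
#SBATCH --qos=np
#SBATCH --nodes=2

srun ./my_executable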