UDOC:HPC2020: Batch jobs not starting - reasons

A submitted job may fail to start running for a number of reasons. When that happens, run squeue and check the STATE and NODELIST(REASON) columns:

$> squeue -j 64243399 
    JOBID       NAME  USER   QOS    STATE       TIME TIME_LIMIT NODES      FEATURES NODELIST(REASON)
 64243399     my_job  user    nf  PENDING       0:00   03:00:00     1        (null) (Priority)

A job in the PENDING state has not yet been dispatched to a node. The NODELIST(REASON) column shows why it is still waiting.
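When several jobs are waiting, it can help to summarise their reasons in one go. The following is a minimal sketch rather than an official tool: `count_reasons` is a hypothetical helper name, and the `|` separator in the format string is an arbitrary choice. It assumes the standard squeue options `-u`, `-t`, `-h` and `-o`, with `%i` for the job ID and `%r` for the reason.

```shell
#!/bin/bash
# Hypothetical helper: reads "JOBID|REASON" lines on stdin and prints
# how many jobs share each reason, most frequent first.
count_reasons() {
    awk -F'|' 'NF >= 2 { count[$2]++ }
               END { for (r in count) print count[r], r }' | sort -rn
}

# Only query Slurm where squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    # -t PENDING: pending jobs only; -h: no header; %i: job ID; %r: reason
    squeue -u "$USER" -t PENDING -h -o "%i|%r" | count_reasons
fi
```

Each output line is a count followed by a reason, so the reason holding back most of your jobs appears first.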

The most common reasons are:

...

You have reached the limit on the number of jobs you can submit to the system under a given project account. Your job will not be considered until your other jobs in the same project complete.

...

You have reached the limit on the number of jobs you can submit to a given QoS. Your job will not be considered until your other jobs in the same QoS complete.

...

Your job is part of an array job and the array's limit on the number of simultaneously running tasks has been reached. Your job will not be considered until other tasks in the same array job complete.

...

Your job depends on other jobs that have not yet completed. It will not be considered until the jobs it depends on complete.

...

Your job has a dependency on another job that will never be satisfied. You should assess why that is and cancel the job as required.

...

There are no nodes available to run your job. A System Session or outage may be in progress. Check our service status at https://www.ecmwf.int/en/service-status

...

The full list of reasons can be found in the squeue man page.
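For the dependency cases in the list above, it helps to see which job IDs the pending job is waiting on before deciding whether to cancel it. The following is a minimal sketch, assuming the `scontrol show job` output contains a `Dependency=` field as in typical Slurm releases. `job_dependency` is a hypothetical helper name, and the job ID is the one from the squeue example earlier on this page.

```shell
#!/bin/bash
# Hypothetical helper: extracts the Dependency= field from
# "scontrol show job" output supplied on stdin.
job_dependency() {
    grep -o 'Dependency=[^ ]*' | head -n 1
}

# Only query Slurm where scontrol is actually available.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show job 64243399 | job_dependency
    # If the dependency can never be met, cancel the job:
    # scancel 64243399
fi
```

The job IDs shown in that field can then be looked up with squeue or sacct to see whether they failed or were cancelled.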

...