
The Atos supercomputer is envisaged to absorb all the computing activities and workloads that have traditionally run not only on the HPCF, but also on other services such as ECGATE and internal ECMWF Linux clusters and workstations.

If you are testing your workload on TEMS, there are a number of things you should pay attention to when porting your activities. Below you will find both general advice and specific information for each of the origin platforms.

General considerations

No csh support

If you are still using csh, please move to a supported shell in advance, as csh is no longer available on the Atos HPCF. You may choose bash, which is now the default for any newly created user; ksh is also supported.
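
For illustration, typical csh constructs translate to bash along these lines. This is only a minimal sketch, and MY_RUN_DIR is a hypothetical variable used purely as an example:

    # csh/tcsh (no longer available):
    #   setenv MY_RUN_DIR $SCRATCH/run
    #   set path = ($path $HOME/bin)
    #   alias ll 'ls -l'
    # bash/ksh equivalent:
    export MY_RUN_DIR=$SCRATCH/run
    export PATH=$PATH:$HOME/bin
    alias ll='ls -l'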

See Linux Virtual Desktop VDI: Shells for more information.

No cross-mounted filesystems

The filesystems available (HOME, PERM, SCRATCH) have the same names and corresponding environment variables, but they are not the same as the ones on older platforms such as the Cray HPCF, ECGATE or other Linux clusters and workstations. If you need to access data stored on those platforms, you will need to copy it over.
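
For example, assuming standard ssh connectivity between the platforms, data could be pulled over with rsync. The hostname and source path below are placeholders for the actual location of your data:

    # Hypothetical example: copy a directory from an older platform into your Atos HOME
    rsync -av old-host:/path/to/mydata/ $HOME/mydata/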

See Linux Virtual Desktop VDI: Filesystems and Linux Virtual Desktop VDI: File transfers for more information.

New filesystem structure

You will notice your filesystems now have a flatter and simpler name structure. If you port any scripts or code with filesystem paths hardcoded from older platforms, make sure you update them. Where possible, use the environment variables provided, which work on both sides and point to the right location in each case.
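
For instance, a script can rely on the provided environment variables instead of literal paths. A minimal sketch, where the commented-out path is only an illustration of an old-style hardcoded location:

    # Before: path hardcoded from an older platform (illustrative only)
    #   OUTPUT_DIR=/old/platform/scratch/$USER/experiment1
    # After: portable version using the provided environment variable
    OUTPUT_DIR=$SCRATCH/experiment1
    mkdir -p "$OUTPUT_DIR"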

See Linux Virtual Desktop VDI: Filesystems for more information.

Improved module system

TEMS uses Lmod as the module infrastructure, which brings a number of improvements and features. Most basic commands for common tasks such as load, list or unload are the same, so in the majority of cases no action is needed. However, you may find some cosmetic changes in how module reports what it has done: modules are now less verbose and only generate output on failure or when other active modules are modified.

Another key difference is that module avail will only display the modules that can be loaded with the active compiler/MPI environment (prgenv). If you can't find a module there, try module spider.
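
A typical interactive sequence with Lmod might therefore look like the following. The module names used (prgenv/gnu, netcdf4) are only examples and may differ from what you actually need:

    module load prgenv/gnu     # select a compiler/MPI environment (example name)
    module avail               # lists only the modules loadable with the active prgenv
    module spider netcdf4      # searches for a module across all prgenvs
    module list                # shows what is currently loaded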

See The new module system: Lmod for more information.

Old and deprecated software

If you are using a legacy software package, check whether it is being discontinued in Bologna - New Data Centre.

Even if a certain package is available, you may still need to adapt your scripts or programs to more recent versions. For a number of software packages and libraries, only the default versions (or newer) found on other systems in Reading will be made available.

Some environment variables corresponding to legacy packages may no longer be defined by the corresponding module. For example, the GRIB_API_* environment variables are not exported by the ecmwf-toolbox module, but you may use the ECCODES_* equivalents.
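
As an example, a build line in a script or Makefile might need updating along these lines. The exact variable names are indicative, so check module show ecmwf-toolbox for what is actually exported:

    # Before (legacy GRIB API variables, no longer exported):
    #   gfortran myprog.f90 $GRIB_API_INCLUDE $GRIB_API_LIB -o myprog
    # After (ecCodes equivalents, assuming ECCODES_INCLUDE and ECCODES_LIB are exported):
    module load ecmwf-toolbox
    gfortran myprog.f90 $ECCODES_INCLUDE $ECCODES_LIB -o myprog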

See Linux Virtual Desktop VDI: Software stack for more details.

New location for some ECMWF packages and libraries: ecmwf-toolbox

You may find that some modules that you used to load for ECMWF packages such as ecCodes, Magics or Metview are no longer available. They have been bundled together in the ecmwf-toolbox module for greater inter-compatibility. You can simply replace any loads of the old modules with a single load of ecmwf-toolbox.
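
In practice the change in your scripts is a one-liner, for example:

    # Before, on older platforms:
    #   module load eccodes magics metview
    # After, on the Atos HPCF / TEMS:
    module load ecmwf-toolbox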

See Linux Virtual Desktop VDI: ECMWF software and libraries for more details.

No Python 2 support

Python 2 reached End of Life on 1 January 2020. Although a version of Python 2.7 is installed as part of the operating system, it does not contain any extra modules and may not be sufficient for your needs. You must make sure your Python programs can run with Python 3, which is supported. Currently you may choose between the traditional ECMWF-maintained Python installation in the python3 module and conda for greater environment customisation.
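
For example, either of the following approaches provides a Python 3 interpreter. The conda environment name my-env and the package list are placeholders, and the availability of a conda module is an assumption to be checked on the system:

    # Option 1: ECMWF-maintained installation
    module load python3
    python3 --version

    # Option 2: a custom conda environment (names and versions are examples only)
    module load conda
    conda create -n my-env python=3.10 numpy
    conda activate my-env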

See Linux Virtual Desktop VDI: Python support for more details.

Moving from ECGATE or Linux Clusters

Batch system differences

Slurm is the batch system on the Atos HPCF, so writing, submitting and managing jobs should feel very familiar. However, note that the queue names are different, so pay attention to those when porting existing jobs from older platforms. If you just want to run a simple serial job, the default queue will be enough.

The helper command sqos is not available, but you can get the same information using other commands such as squeue or sacctmgr.
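
As an illustration, a minimal serial job script on the default queue could look like the sketch below (hello.sh and the resource values are placeholders):

    #!/bin/bash
    #SBATCH --job-name=hello
    #SBATCH --output=hello-%j.out
    #SBATCH --time=00:05:00
    echo "Running on $(hostname)"

It can then be submitted with sbatch hello.sh, monitored with squeue -u $USER, and the available QoS (queues) inspected with sacctmgr show qos.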

See Linux Virtual Desktop VDI: Batch system for more details.

Moving from Cray XC40 - CCA / CCB

Batch system differences

Slurm is the batch system on the Atos HPCF, so you will need to translate your PBS job headers and get used to a new set of commands for managing your batch jobs.

Main command line tools

The table summarises the main Slurm user commands and their PBS equivalents.

User commands     | PBS                               | Slurm
Job submission    | qsub [<pbs_options>] <job_script> | sbatch [<sbatch_options>] <job_script>
Job cancellation  | qdel <job_id>                     | scancel <job_id>
Job status        | qstat [-u <uid>] [<job_id>]       | squeue [-u <uid>] [-j <job_id>]
Queue information | qstat -Q [-f] [<queue>]           | sacctmgr show qos [name=<queue>]
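
For example, a typical command sequence to submit, check and cancel a job translates directly (the job id 123456 is a placeholder):

    # On the Cray XC40 (PBS):
    #   qsub job.sh
    #   qstat -u $USER
    #   qdel 123456
    # On the Atos HPCF (Slurm):
    sbatch job.sh
    squeue -u $USER
    scancel 123456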

Queues

The queue names are very similar to those on Cray, but note that the serial queue ns has been merged into the fractional nf queue. 

Job geometries

The node configuration, in terms of number of cores and memory per core, changes with respect to the Cray XC40. If you run parallel workloads, make sure you take the Atos HPCF node configuration into account to use the allocated resources efficiently.

Example

If your parallel job on the Cray explicitly requests 72 total tasks and 36 tasks per node, it would effectively use 2 Cray nodes and all of their physical cores. Running with the same geometry on the Atos HPCF would also use 2 nodes. However, you would only be using 36 of the 128 physical cores on each node, wasting 92 of them per node.
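
For example, the job geometry above could be adapted as follows to fill the Atos nodes. This is only a sketch, as the best choice depends on how your application scales:

    # Cray XC40: 72 tasks over 2 nodes of 36 physical cores each
    #   #SBATCH --ntasks=72
    #   #SBATCH --ntasks-per-node=36
    # Atos HPCF: the same 72 tasks fit on a single 128-core node...
    #   #SBATCH --ntasks=72
    #   #SBATCH --ntasks-per-node=72
    # ...or, to use 2 full nodes, scale the task count up
    #SBATCH --ntasks=256
    #SBATCH --ntasks-per-node=128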

Directives

The table summarises the main Slurm directives and their PBS equivalents.

PBS | Slurm | Description | Default
#PBS | #SBATCH | Prefix for the directive in the job script | -
-l EC_billing_account=<account> | --account=<account>, -A <account> | Project account for resource accounting and billing purposes | default project account for the user
-l EC_job_name=<job_name> | --job-name=<name>, -J <name> | A descriptive name of the job | script name
no equivalent | --chdir=... | Working directory of the job. The output and error files can be defined relative to this directory | submitting directory
-o <path> | --output=<path>, -o <path> | Path to the file where standard output is redirected. Special placeholders for job id (%j) and the execution node (%N) | slurm-%j.out
-e <path> | --error=<path>, -e <path> | Path to the file where standard error is redirected. Special placeholders for job id (%j) and the execution node (%N) | output value
-q <queue> | --qos=<qos>, -q <qos> | Quality of Service (or queue) where the job is to be submitted. Check the available queues for the platform | normal
-l walltime=<hh:mm:ss> | --time=<time>, -t <time> | Wall clock limit of the job (not a CPU time limit). The format can be: m, m:s, h:m:s, d-h, d-h:m or d-h:m:s | qos default time limit
-m <type> | --mail-type=<type> | Notify user by email when certain event types occur. Valid values are: BEGIN, END, FAIL, REQUEUE and ALL | disabled
-M <email> | --mail-user=<email> | Email address to which notifications are sent | submitting user
-l EC_total_tasks=<tasks> | --ntasks=<tasks>, -n <tasks> | Allocate resources for the specified number of parallel tasks. Note that a job requesting more than one task must be submitted to a parallel queue; there might not be any parallel queue configured on the cluster | 1
-l EC_nodes=<nodes> | --nodes=<nodes>, -N <nodes> | Allocate <nodes> nodes to the job | 1
-l EC_threads_per_task=<threads> | --cpus-per-task=<threads>, -c <threads> | Allocate <threads> cpus for every task. Use for threaded applications | 1
-l EC_tasks_per_node=<tasks> | --ntasks-per-node=<tasks> | Allocate a maximum of <tasks> tasks on every node | node capacity
no equivalent | --threads-per-core=<threads> | Allocate <threads> threads on every core (hyperthreading) | core thread capacity
-l EC_hyperthreads=2 / 1 | --hint=[no]multithread | Whether to use hyperthreaded cores, defining the binding accordingly | not defined
-l EC_memory_per_task=<memory> | --mem-per-cpu=<mem> | Allocate <mem> memory for each task | core thread capacity
-V | --export=<vars> | Export variables to the job, as comma-separated entries of the form VAR=VALUE. ALL means export the entire environment from the submitting shell into the job; NONE means getting a fresh session | NONE
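
Putting several of these directives together, a hypothetical parallel job header could be translated as follows. The account name, QoS name and resource numbers are placeholders to be adapted to your case:

    # PBS header on the Cray XC40 (illustrative):
    #   #PBS -N myjob
    #   #PBS -q np
    #   #PBS -l EC_billing_account=myaccount
    #   #PBS -l walltime=01:00:00
    #   #PBS -l EC_total_tasks=72
    #   #PBS -l EC_tasks_per_node=36
    #
    # Equivalent Slurm header on the Atos HPCF (np is assumed to be the parallel QoS):
    #SBATCH --job-name=myjob
    #SBATCH --qos=np
    #SBATCH --account=myaccount
    #SBATCH --time=01:00:00
    #SBATCH --ntasks=72
    #SBATCH --ntasks-per-node=72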

See Linux Virtual Desktop VDI: Batch system for more details.
