This document describes the basics (and more) of using the Cray XC40 HPC systems at ECMWF.

 


For an overview of the system see Computing Facilities pages.

The material from our training course OLD High Performance Computing Facility (HPCF) - Cray is available for download.

In June 2014 we held a webinar to introduce the Cray system and to give some hints regarding the migration from the IBM HPCF.

Access

The new HPC system is called cca and is accessible under that name. As for the other systems at ECMWF, you can access cca via ecaccess. Access is only available through 'ssh'. If you are using the NX virtual desktop option, you should update the application menu to get the new entry for cca. From ecgate you can log in to cca with:

ssh -Y cca

File systems

$HOME, $SCRATCH and $PERM are the main file systems available on cca. See HPCF user filesystems for more information.

Modules environment on cca

As on ecgate, we continue to make most (ECMWF and third-party) software packages available through modules on cca. Cray also makes much of their software available through modules. These are the main commands:

  • See all the available packages and their versions:

    module avail

    The list can be very long, so it is a good practice to give a hint of what you are looking for:

    module avail grib
  • List all the modules that are currently loaded:

    module list
  • Load and unload a module. If you don't specify any version when loading a module, the default version will be chosen:

    module load grib_api
    module unload grib_api
    
    module load grib_api/1.12.0
  • Swap (switch) a loaded module for another one. This has the same effect as unloading and loading the desired modules:

    module switch PrgEnv-cray PrgEnv-gnu
    module switch grib_api/1.12.0

A set of modules is loaded by default when starting a session or a job. The modules provided by ECMWF for certain packages can be customised, and users can add their own. However, the system modules provided by Cray, including the default Programming Environment, cannot be customised in this way.

If you want to change the default Programming Environment, you can do so by adding the appropriate line to the file ~/.user_kshrc or ~/.user_bashrc, depending on your login shell. For example, to set the GNU Programming Environment by default:

module switch PrgEnv-cray PrgEnv-gnu

 

  • List the modules that will be loaded by default:

    module initlist
  • Add (append) or prepend a module to the list of default modules:

    module initadd grib_api
    module initprepend grib_api
  • Remove a module from the list of default modules:

    module initrm grib_api
  • Replace (switch) a module by another one on the list of default modules:

    module initswitch grib_api grib_api/1.12.0

If you modify the initial list of modules, it will no longer be in sync with the official list, so you will not benefit from any updates made to that list. You can revert any changes to that list with:

ln -sf /usr/local/etc/.modules ~

See also our separate Modules page for more information.

Batch environment (PBS)

See Batch environment: PBS for a detailed description of the batch environment.

  • PBS support in ECaccess (to be documented)
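
As a taster, a minimal job script might look like the sketch below. The queue name and the EC_* resource name are assumptions for illustration; the page linked above is authoritative.

#!/bin/ksh
#PBS -N hello                 # job name
#PBS -q np                    # queue name assumed for illustration
#PBS -j oe                    # join stderr and stdout into one output file
#PBS -l EC_total_tasks=24     # ECMWF-specific resource name, assumed here

cd $SCRATCH
# aprun is the Cray launcher for parallel executables on the compute nodes
aprun -n 24 ./my_mpi_prog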

Compiling Environment

  • 3 compiler suites
  • basic options
  • suggested options
  • basic profiling
  • basic debugging
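
Until these topics are documented here, the following sketch illustrates basic usage. The compiler suites are selected via the PrgEnv-* modules (PrgEnv-cray, PrgEnv-gnu, PrgEnv-intel); the wrappers ftn, cc and CC then invoke whichever suite is loaded. The flags shown are illustrative suggestions, not an official recommendation.

module switch PrgEnv-cray PrgEnv-gnu   # or PrgEnv-intel, or keep PrgEnv-cray
ftn -O2 -o hello hello.f90             # Fortran
cc  -O2 -o hello hello.c               # C
CC  -O2 -o hello hello.cc              # C++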

Crontab

If you have access to ECMWF operational data in real time, please submit your retrieval jobs via the time-critical framework option 1 so that they are queued as soon as the data is available. For other regular tasks you may want to use cron, with a crontab entry such as:

05 12 * * * ( . /etc/ksh.kshrc; qsub $HOME/cronjob )

This recipe works for ksh, bash and csh users. Note also that crontabs should be installed on the node 'cca-log' (or 'ccb-log' for ccb).
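
For example, to install or edit your crontab on the correct node (a minimal sketch):

ssh cca-log    # crontabs must live on the log node, not the interactive node
crontab -e     # then edit the crontab there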

File transfers

  • scp, bbcp, ectrans
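
As a hedged sketch (the remote host and the ectrans association are placeholders, and the association must first be set up on your ECaccess gateway):

# push a file to your local site with scp
scp $SCRATCH/results.tar user@host.example.org:/data/

# ... or hand the transfer over to ectrans, which queues and retries it
ectrans -remote my_association -source $SCRATCH/results.tar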

Accounting

For the System Billing Unit formula, tools to monitor your usage (eoj, acct_status) and general documentation of the accounting system see HPC accounting.

prepIFS work

  • cycles available
  • branches needed

Time Critical work

  • Option 1: Batch access to PBS on cca is available through the ECaccess commands. The ECaccess queue name to access the Cray is cca. The relevant commands to submit time-critical work are:
    • ecaccess-event-list: to check the events available.
    • ecaccess-job-submit: to submit a job.
    ECaccess jobs can be submitted from ecgate or cca at ECMWF, or from your local system if you have installed the ECaccess tools. To access the ECaccess tools on cca, you will have to load the module called 'ecaccess'; a minimal sketch follows below.
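
    A minimal sketch of an Option 1 session on cca (the job script name is a placeholder; see the help of ecaccess-job-submit for the options that attach a job to an event):

    module load ecaccess            # make the ECaccess tools available on cca
    ecaccess-event-list             # check which events you can subscribe to
    ecaccess-job-submit myjob.ksh   # submit the job through ECaccess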

  • Option 2: We have tried to set up an environment on cca as close as possible to the environment on the IBM systems. Here are some reminders and the few changes.
    • Users running Time Critical 2 work have access to the queues ts, tf and tp.
    • TC-2 users have access to special file systems, which are present on the two storage clusters. Currently only one storage cluster is available. The two file systems are accessible via the paths /sc1 and /sc2. As on the IBM systems, TC-2 users will need to pass a variable STHOST to each job, e.g. with '#PBS -v STHOST=sc1'.
    • One important change with the current version of ksh on cca is that functions are no longer exported to sub-shells. This is of particular importance for the error handling in SMS/ECFLOW: one usually calls an ERROR function in the trap. If you call other scripts within your job, the ERROR function defined in your job will not be available in those scripts, so errors there will not be trapped and SMS/ECFLOW will not be notified. We suggest defining the necessary functions in files in a dedicated directory and setting the FPATH variable in your jobs to point to that directory; see the sketch after the sample files below.
    • We have created a new set of job handling commands on ecgate. These commands are available via a module called 'schedule'. There are two ways to invoke them: either use the commands 'task_submit', 'task_status' and 'task_kill', or call 'schedule ...' to submit a task, 'schedule ... status' to check the status of a task and 'schedule ... kill' to cancel a task. See 'schedule -h' for the syntax.
    • The log server (for SMS or ECFLOW) should be started on cca-log, not on the interactive node (cca). The command is called 'start_logserver', and the syntax is the same as on the IBM. See 'start_logserver -h' for help. If you run on ccb, you will also need to start (and use) the log server on ccb-log. This is because the output files are accessed directly from the PBS spool, which is specific to each cluster. See the next point.
    • Job output files of running jobs are not directly visible. The following additional commands in your SMS/ECFLOW header and tail files will allow you to have live access to the job output files from xcdp/ecflowview.


Sample head.h file:

...
# To give access to job output file in PBS spool on Cray systems

if [[ $HOST = @(cc*) ]]; then
  _real_pbs_outputfile=/var/spool/PBS/spool/${PBS_JOBID}.OU
  _pbs_outputfile=/nfs/moms/$HOST${_real_pbs_outputfile}
  _running_output=${SMSJOBOUT}.running
  ln -sf $_pbs_outputfile $_running_output
fi
... 
  • SMSJOBOUT needs to be defined as a shell variable via, for example: SMSJOBOUT=%SMSJOBOUT%
  • Do not specify separate output and error files in the PBS header.  Specify only the output and use the "-j oe" directive to join error and output.
  • The SMS log server should be started on cca-log.  The SMSLOGHOST variable needs to be set to cca-log.

 

Sample tail.h file:

...
# Cleanup of link to job output file. 

if [[ $HOST = @(cc*) ]]; then
  [[ -L $_running_output ]] && rm -f $_running_output
fi
... 
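
Regarding the ksh function issue mentioned under Option 2, here is a minimal sketch of the FPATH approach. The directory name and the function body are hypothetical; adapt the abort call to your SMS or ECFLOW setup.

# File $HOME/funcs/ERROR -- ksh autoloads the function when ERROR is first
# invoked, because the file is named after the function and lies on FPATH.
function ERROR {
  echo "job failed" >&2
  smsabort          # or: ecflow_client --abort   (notify the scheduler)
  exit 1
}

# In each job, so that sub-shells also find the function:
export FPATH=$HOME/funcs
trap ERROR ERR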

HPCF phase 2

The upgrade of the current XC30 systems to XC40 has started. More information is available on the new XC40 systems.