This document describes some basics (and more) on the usage of the Cray XC40 HPC systems at ECMWF.
For an overview of the system, see the Computing Facilities pages.
The material from our training course OLD High Performance Computing Facility (HPCF) - Cray is available for download.
In June 2014 we held a webinar to introduce the Cray system and to give some hints regarding the migration from the IBM HPCF.
The new HPC is called cca and is accessible under that host name. You can access cca via ECaccess, as for the other systems at ECMWF. Access is only available through 'ssh'. If you are using the NX virtual desktop option, you should update the application menu to get the new entry for cca. From ecgate you can log in to cca with 'ssh cca'.
$HOME, $SCRATCH and $PERM are the main file systems available on cca. See HPCF user filesystems for more information.
Modules environment on cca
As on ecgate, we continue to make most (ECMWF and third-party) software packages available through modules on cca. Cray also makes much of its software available through modules. These are the main commands:
See all the available packages and their versions with 'module avail'.
The list can be very long, so it is good practice to give a hint of what you are looking for, e.g. 'module avail <name>'.
List all the modules that are currently loaded with 'module list'.
Load and unload a module with 'module load <module>' and 'module unload <module>'. If you don't specify a version when loading a module, the default version will be chosen.
Swap (switch) a loaded module for another one with 'module swap <old_module> <new_module>'. This has the same effect as unloading the old module and loading the new one.
A set of modules is loaded by default when a session or a job starts. The modules provided by ECMWF for certain packages can be customised, and users can add their own. However, the system modules provided by Cray, including the default Programming Environment, cannot be customised.
If you want to change the default Programming Environment, you can do so by adding the appropriate line to the file ~/.user_kshrc or ~/.user_bashrc, depending on your login shell. For example, to set the GNU Programming Environment by default:
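A minimal sketch of such a line, assuming that PrgEnv-cray is the Programming Environment loaded by default on your account (check with 'module list') and that the GNU environment is provided by the PrgEnv-gnu module, as on other Cray XC systems:

```shell
# Line to add to ~/.user_kshrc (ksh) or ~/.user_bashrc (bash).
# Assumes PrgEnv-cray is the default Programming Environment; adjust the
# first name if 'module list' shows a different PrgEnv module.
module swap PrgEnv-cray PrgEnv-gnu
```

'module swap' is used here rather than a plain 'module load' because the Cray PrgEnv modules conflict with each other, so only one of them can be loaded at a time.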
List the modules that will be loaded by default with 'module initlist'.
Add (append) or prepend a module to the list of default modules with 'module initadd <module>' or 'module initprepend <module>'.
Remove a module from the list of default modules with 'module initrm <module>'.
Replace (switch) a module with another one in the list of default modules with 'module initswitch <old_module> <new_module>'.
If you modify the initial list of modules, it will no longer be in sync with the official list, so you will not benefit from any later updates to that list. You can revert any changes made to that list with:
See also our separate Modules page for more info.
Batch environment (PBS)
See Batch environment: PBS for a detailed description of the batch environment.
- PBS support in ECaccess (to be documented)
- 3 compiler suites
- basic options
- suggested options
- basic profiling
If you have access to ECMWF operational data in real time, please submit your retrieval jobs via the time-critical framework option 1 so that they are queued as soon as the data is available. For other regular tasks you might want to use cron with a crontab entry such as:
This recipe works for ksh, bash and csh users. Note also that crontabs should be installed on the node 'cca-log' (or 'ccb-log' for ccb).
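As an illustration only, a crontab entry on cca-log could look like the sketch below. The schedule, the job file and the log file are hypothetical, and it assumes the PBS 'qsub' command is available on cca-log. Invoking '/bin/bash -l' explicitly runs the command under a login shell, so the system login environment (including modules) is set up whatever your own login shell is.

```shell
# Hypothetical crontab entry; install it on cca-log with 'crontab -e'.
# Fields: minute hour day-of-month month day-of-week command.
# Every day at 06:30, start a bash login shell and submit a job script.
30 6 * * * /bin/bash -l -c 'qsub $HOME/jobs/daily_task.job' >> $HOME/cron_daily_task.log 2>&1
```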
- scp, bbcp, ectrans
For the System Billing Unit formula, tools to monitor your usage (eoj, acct_status) and general documentation of the accounting system, see HPC accounting.
- cycles available
- branches needed
Time Critical work
- Option 1: Batch access to PBS on cca is available through the ECaccess commands. The ECaccess queue for the Cray is called cca. The relevant commands to submit time-critical work are:
- ecaccess-event-list: to check the available events.
- ecaccess-job-submit: to submit a job.
- Option 2: We have tried to set up an environment on cca as close as possible to the one on the IBM systems. Here are some reminders and the few changes:
- Users running Time Critical 2 work have access to the queues ts, tf and tp.
- TC-2 users have access to special file systems, which are present on the two storage clusters. Currently only one storage cluster is available. The two file systems are accessible via the paths /sc1 and /sc2. As on the IBM systems, TC-2 users need to pass the variable STHOST to each job, e.g. with '#PBS -v STHOST=sc1'.
- One important change with the current version of ksh on cca is that functions are no longer exported to sub-shells. This is of particular importance in SMS/ECFLOW for the error handling, where one usually calls an ERROR function in the trap. If you call other scripts from your job, note that the ERROR function defined in your job will not be available in the scripts you call. Therefore the trapping of any error in those scripts will not work, and SMS/ECFLOW will not be notified about the error. We suggest defining the necessary functions in one directory and setting the FPATH variable in your jobs to point to this directory.
- We have created a new set of job handling commands on ecgate. These commands are available via a module called 'schedule'. There are two ways to invoke them: either use the commands 'task_submit', 'task_status' and 'task_kill', or call 'schedule ...' to submit a task, 'schedule ... status' to check the status of a task and 'schedule ... kill' to cancel a task. See 'schedule -h' for the syntax.
- The log server (for SMS or ECFLOW) should be started on cca-log, not on the interactive node (cca). The command is called 'start_logserver' and its syntax is the same as on the IBM. See 'start_logserver -h' for help. If you run on ccb, you will also need to start (and use) the log server on ccb-log. This is because the output files are accessed directly from the PBS spool, which is specific to each cluster. See the next point.
- Job output files of running jobs are not directly visible. The following additional commands in your SMS/ECFLOW header and tail files will give you live access to the job output files from xcdp/ecflowview.
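For the STHOST requirement of TC-2 jobs described above, the start of a job could look like the following sketch. The job name, the queue choice (any of ts, tf or tp, as appropriate) and the helper function 'tc2_fs' are hypothetical; the helper only checks that STHOST was passed and prints the matching file system path. The '#PBS' lines are ordinary comments to the shell, so the script can also be run outside PBS.

```shell
#!/bin/bash
# Sketch of a Time Critical option 2 job header (names are placeholders).
#PBS -N tc2_example
#PBS -q tf
#PBS -j oe
#PBS -v STHOST=sc1

# Hypothetical helper: print the TC-2 file system selected by STHOST
# (/sc1 or /sc2), or report an error if STHOST is unset or unknown.
tc2_fs() {
  case "${STHOST:-}" in
    sc1|sc2) echo "/${STHOST}" ;;
    *) echo "STHOST must be sc1 or sc2, got '${STHOST:-}'" >&2
       return 1 ;;
  esac
}

# The job body would then stop early if STHOST is missing, e.g.:
#   TC2_FS=$(tc2_fs) || exit 1
```

When the job is submitted with 'qsub', the '-v' directive exports STHOST into the job's environment, so 'tc2_fs' prints /sc1 for this header.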
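The FPATH approach suggested above for ksh functions can be sketched as follows. The directory ~/ksh_functions and the body of ERROR are assumptions; in a real SMS/ECFLOW job the function would also report the abort to the server. In ksh93, when a command such as ERROR is neither defined nor found in PATH, the directories listed in FPATH are searched for a file of that name that defines the function. Because FPATH is an ordinary environment variable, exporting it passes it on to every script started from the job, which the function itself no longer is.

```shell
# One-off setup: each shared function lives in its own file, named after
# the function, in a directory of your choice (assumed: ~/ksh_functions).
FUNCDIR=$HOME/ksh_functions
mkdir -p "$FUNCDIR"
cat > "$FUNCDIR/ERROR" <<'EOF'
function ERROR {
  # Minimal body: report the failure. A real SMS/ECFLOW job would also
  # tell the server here that the task has aborted.
  echo "ERROR: $*" >&2
}
EOF

# In each job: export FPATH so that every ksh script started from the job
# inherits it and can load ERROR on first use.
export FPATH=$FUNCDIR
```

A script called from the job can then keep a trap that calls ERROR without defining the function itself.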
Sample head.h file:
- SMSJOBOUT needs to be defined as a shell variable via, for example:
- Do not specify separate output and error files in the PBS header. Specify only the output file and use the '-j oe' directive to join error and output.
- The SMS log server should be started on cca-log. The SMSLOGHOST variable needs to be set to cca-log.
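A hypothetical excerpt of a head.h following the points above might look like this. The %...% names are SMS variables substituted when the job file is generated; using %TASK% as the job name and %SMSJOBOUT% as the output path are assumptions, as is setting SMSLOGHOST in the suite definition rather than in the job.

```shell
#PBS -N %TASK%
# Join standard error into standard output and give only the output file.
#PBS -j oe
#PBS -o %SMSJOBOUT%

# Make the job output path available to the rest of the job as a shell variable.
SMSJOBOUT=%SMSJOBOUT%

# SMSLOGHOST, assumed here to be set as a suite variable, e.g. in the
# definition file:
#   edit SMSLOGHOST cca-log
```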
Sample tail.h file:
HPCF phase 2
The upgrade of the current XC30 systems has started. More information about the new XC40 systems is available.