
While the availability of virtual infrastructure to run ecFlow servers remains limited, you may, in the interim, start your ecFlow servers on the dedicated HPCF node so that you can run your suites.

At a later stage, those ecFlow servers will need to be moved to dedicated Virtual Machines outside the HPCF, where practically no local tasks will be able to run. All ecFlow tasks will need to be submitted to one of the HPCF complexes through the corresponding Batch system.

Please keep this in mind when migrating or designing your solution.

Starting the ecFlow server

The server needs to be started using the usual procedure on one of the AA login nodes, not through an interactive job.

module load ecflow troika
ecflow_start.sh <options>

You may wish to pass extra options to configure the port or your ecflow home. 
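For example, to run the server on a specific port with a non-default ecFlow home (the port number and directory below are purely illustrative; check ecflow_start.sh -h for the full list of options):

ecflow_start.sh -p 3141 -d $HOME/ecflow_server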

Preparing your suites and tasks

Remember that all tasks will need to be submitted as jobs through the batch system, so you should avoid running tasks locally on the node where the server runs. Make sure that your task header contains the necessary SBATCH directives to run the job. As a minimum:

head.h snippet
#!/bin/bash
#SBATCH --job-name=%ECF_JOB%
#SBATCH --qos=%QUEUE%
#SBATCH --output=%ECF_JOBOUT%
#SBATCH --error=%ECF_JOBOUT%

You may need to add more directives for parallel jobs to define the resources needed. See HPC2020: Batch system for more examples and potential options you may wish to include.
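As an illustration, a task running a larger parallel job might extend the header with directives along these lines (the QOS, task count and wall-clock limit are placeholders to adapt to your needs):

Parallel head.h sketch
#!/bin/bash
#SBATCH --job-name=%ECF_JOB%
#SBATCH --qos=np
#SBATCH --ntasks=128
#SBATCH --time=01:00:00
#SBATCH --output=%ECF_JOBOUT%
#SBATCH --error=%ECF_JOBOUT%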

Examples of task include files enabling communication between a batch job and the ecFlow server are available from the ECMWF git repository.

ecFlow delegates job management tasks such as submission, killing, and status monitoring to external applications. For your convenience, you may use troika, a tool that takes care of those tasks. To use it, just make sure you have the following variables defined at the suite level:

Job management variables in your suite.def
edit QUEUE nf
edit SCHOST aa
edit ECF_JOB_CMD troika submit -o %ECF_JOBOUT% %SCHOST% %ECF_JOB%
edit ECF_KILL_CMD troika kill %SCHOST% %ECF_JOB%
edit ECF_STATUS_CMD troika monitor %SCHOST% %ECF_JOB%

Of course, you may change QUEUE to np if you are running bigger parallel jobs, or SCHOST to run on complexes other than aa.
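Since ecFlow variables can be redefined at any level of the suite, you could for instance override QUEUE just for a family containing bigger parallel tasks (the family and task names below are only illustrative):

Overriding variables for a family in your suite.def
family parallel
  edit QUEUE np    # tasks in this family are submitted to the parallel QOS
  task big_model
endfamily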

By default, scancel does not send signals other than SIGKILL to the batch step. Consequently, you should use the "-b" or "-f" option to send a signal that the ecFlow job is designed to trap before notifying the ecFlow server that the job was killed:

scancel --signal=TERM -b ${SLURM_JOB_ID}

In this example, SIGTERM (15) is sent, but other signals can be used as well. This option can also be used with Troika to kill jobs from ecflow_ui, and can be specified in a configuration file such as the one below:

troika.yml
---
sites:
  localhost:
    type: direct
    connection: local

  hpc: &default
    type: slurm
    connection: ssh
    host: hpc-batch
    pre_submit: ["create_output_dir"]
    preprocess: ["remove_top_blank_lines", "slurm_add_output", "slurm_bubble"]
    at_exit: ["copy_submit_logfile"]
    post_kill: ["abort_on_ecflow"]
    sbatch_command: "ecsbatch"
    scancel_command: 'ecscancel'

  aa:
    << : *default
    host: aa-batch

  ab:
    << : *default
    host: ab-batch

  ac:
    << : *default
    host: ac-batch

  ad:
    << : *default
    host: ad-batch

  hpc2020:
    << : *default
    host: hpc2020-batch

  ecs:
    << : *default
    host: ecs-batch

 
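On the job side, the signal sent by scancel is only useful if the job traps it and reports back to the server before dying. Below is a minimal sketch of such a trap, of the kind usually placed in the head.h include; the exact signal list and error handling are illustrative and depend on your suite's include files:

head.h trap sketch
# Report an abort to the ecFlow server, then exit
ERROR() {
  set +e
  ecflow_client --abort=trap
  trap 0
  exit 2
}
trap ERROR 0
# Catch termination signals, including the SIGTERM sent by scancel above
trap '{ echo "Killed by a signal"; ERROR ; }' 1 2 3 15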

To use a custom configuration file with Troika, the ecFlow variables should be defined like this:

Job management variables in your suite.def
edit QUEUE nf
edit SCHOST aa
edit ECF_JOB_CMD troika -c {PATH}/troika.yml submit -o %ECF_JOBOUT% %SCHOST% %ECF_JOB%
edit ECF_KILL_CMD troika -c {PATH}/troika.yml kill %SCHOST% %ECF_JOB%
edit ECF_STATUS_CMD troika -c {PATH}/troika.yml monitor %SCHOST% %ECF_JOB%


Connecting to the ecFlow server 

Due to the current limitation in network connectivity to arbitrary ports between our Reading and Bologna Data Centres, it is not possible to connect to an ecFlow server running on AA from your usual ecflow_ui in Reading.

There are several ways to work around this issue:

Through a graphical VNC session

You may spin up a graphical VNC session on the HPCF with ecinteractive. Once in, you can then run the following from a terminal within that session:

module load ecflow
ecflow_ui

Through an SSH tunnel 

You may alternatively use the native ecflow_ui client on your End User Device or VDI, but an additional step is required to ensure connectivity between both ends. You will need to create an SSH tunnel, forwarding the port where the ecFlow server is running.

  1. Start your ecflow server with your preferred settings on one of the login nodes of AA with ecflow_start.sh
  2. Once you know the hostname and port of the server, from your Linux Desktop or VDI create the SSH tunnel

    ssh -N -L<ecflow_port>:localhost:<ecflow_port> <ecflow_host>

    For example, if the server is started on the host aa6-100, port 34567:

    ssh -N -L34567:localhost:34567 aa6-100
  3. Open ecflow_ui on your End User Device or VDI and configure the new server, using "localhost" as the host and the ecflow port used above.

You may use any other free local port if that particular one is already in use.
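For example, if port 34567 happened to be busy on your local machine, you could forward another free local port instead (9999 here is just an illustration) and then point ecflow_ui to localhost with port 9999:

ssh -N -L9999:localhost:34567 aa6-100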

X11 forwarding

This should be your last resort, since the experience running heavy graphical applications through X11 forwarding tends to be poor.

You may also run ecflow_ui remotely on the Atos HPCF, and use X11 forwarding to display on your screen:

ssh -X aa
module load ecflow
ecflow_ui

In this case, when adding the server, remember that it needs to be configured with the real name of the host running the ecFlow server.
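If you are unsure which host and port to use, you can check that the server answers before adding it to ecflow_ui, for example (the hostname and port below are illustrative):

module load ecflow
ecflow_client --ping --host=aa6-100 --port=34567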
