Files Location

For a maintainable operational suite, we recommend that you:

  • start the server in a local /tmp directory (/:ECF_HOME)

  • define the ECF_FILES and ECF_INCLUDE variables at suite and family level: script wrappers are then created and updated under these directories.

  • define /suite:ECF_HOME as another directory, where jobs and related outputs will be found. These dynamic files may then be tar’ed as a snapshot of the actual work associated with a suite/family, for later analysis or rerun.

  • when the server and the remote job destination do not share a common directory for output files, the ECF_OUT variable must be present in the suite definition: it indicates the remote output path. In this situation, the suite designer is responsible for creating the directory structure where the output file will be written. Most queuing systems will not start the job if this directory is absent, and the task may remain shown as submitted on the ecFlow server side.

  • after sending the job complete command, the job may copy its output to ECF_HOST, to enable direct access from the ecFlow server.

    When a file is requested from the ecflow_server, it is limited to 15k lines, to avoid the server spending too much time delivering very large output files.

    ecFlowview can be configured (globally, Edit-Preferences-Server-Option or locally top-node-menu->Options “Read Output on other files from disk, when possible”) to get the best expected behaviour.

  • use the ecf.lists file to restrict access to the server, granting read-write or read-only access

    #
    # ecflow_client --help=reloadwsfile
    # ecflow_client --reloadwsfile # update ecFlow server
    # $USER  # rw access, aka $LOGNAME
    # -$USER # for read only access
    # export ECF_LISTS=/path/to/file # before server starts, to change location or name
    emos
    -rdx
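
The directory advice above can be sketched in shell. This is a minimal illustration and not part of ecFlow: the function names make_out_dir and snapshot_family, and the paths in the comments, are hypothetical. The first function creates the output directory for a family before any job is submitted; the second tars the job files and outputs found under the suite-level ECF_HOME.

```shell
#!/bin/sh
# Hypothetical helpers: names and example paths are assumptions.

# Create the directory that will receive job output for a family,
# e.g. make_out_dir /tmp/logs /suite/family
make_out_dir() {
    out_root=$1      # value of ECF_OUT
    node_path=$2     # ecFlow path of the family, starting with /
    mkdir -p "${out_root}${node_path}"
}

# Archive the dynamic files (jobs, outputs) of one suite or family,
# e.g. snapshot_family /path/to/ecf_home /suite/family /tmp/snap.tar
snapshot_family() {
    home=$1          # value of the suite-level ECF_HOME
    node_path=$2     # ecFlow path of the suite or family
    archive=$3       # tar file to create
    # -C keeps the archive member names relative to ECF_HOME
    tar -cf "$archive" -C "$home" "${node_path#/}"
}
```

On a remote destination, make_out_dir would have to run on the host that owns the ECF_OUT file system, for instance from a dedicated task at the top of the suite.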


Log-Server

It is possible to set up a log-server to access ‘live’ output from the jobs. ecFlow is provided with the Perl script ecflow_logsvr.pl.

  • it is configured to deliver files under specific directories,

  • configuration variables are:

    • LOGPORT # 9316
    • LOGPATH # <path1>:<path2>:<path3>
    • LOGMAP # mapping between the requested path and the actual location

    As an example, with two possible storage destinations:

    export LOGPATH=${ECF_OUT:=/tmp/logs}:${ECF_OUT_ALTERNATIVE:=/sc1/logs} # two possible destinations
    export LOGMAP=$ECF_OUT:$ECF_OUT:$ECF_HOME:$ECF_OUT:${ECF_OUT_ALTERNATIVE}:${ECF_OUT_ALTERNATIVE} # maps itself + home
    export LOGMAP=$LOGMAP:/tmp:/s2o1/logs:/tmp:/s2o2/logs     # map from /tmp
  • it is started on the remote machine, and the ecFlowview GUI contacts it when the variables ECF_LOGHOST and ECF_LOGPORT are defined in the suite:

    edit ECF_LOGHOST c2a
    edit ECF_LOGPORT 9316
  • it can be tested from the command line with telnet:

    telnet c2a 9316 # get <file> # list <directory>


  • list all ECF_OUT variables from one server:

    ls.py -V -L -N / -R -T -v --port ${ECF_PORT:=31415} --host ${ECF_HOST:=localhost} | grep  -E "(edit ECF_OUT|edit SCLOGDIR| edit STHOST)" | grep -v ECF_HOME | cut -d: -f2 | sort | uniq 2>/dev/null
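
The LOGMAP example above reads as a colon-separated list of pairs: a requested prefix followed by the actual prefix. The sketch below illustrates that reading only. It is an assumption drawn from the example, not the logic of ecflow_logsvr.pl, and map_log_path is a hypothetical name. It prints every candidate location for a requested file, one per line, so that the caller can pick the first one that exists.

```shell
#!/bin/sh
# Sketch only: resolve a requested path through LOGMAP-style pairs.
# Assumption: the map is "from1:to1:from2:to2:...", as in the example.
map_log_path() {
    requested=$1
    map=$2
    found=1                # becomes 0 once a pair matches
    old_ifs=$IFS
    IFS=:
    set -- $map            # split the map on ':' into positional parameters
    IFS=$old_ifs
    while [ $# -ge 2 ]; do
        from=$1
        to=$2
        shift 2
        case $requested in
            "$from"/*)
                # swap the requested prefix for the actual one
                rest=${requested#"$from"}
                printf '%s\n' "$to$rest"
                found=0
                ;;
        esac
    done
    return $found
}
```

For instance, map_log_path /tmp/suite/t1.1 "$LOGMAP" lists the locations where the output of that task may be found under the mapping defined earlier.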
    
    



CMD variables

CMD variables must be set, and must be able to submit, kill, and query a job both locally and remotely. They are:

  • on the server side:

    • ECF_JOB_CMD:

      edit ECF_JOB_CMD '%ECF_JOB% > %ECF_JOBOUT% 2>&1'
      edit ECF_JOB_CMD 'rsh %HOST% %ECF_JOB% > %ECF_JOBOUT% 2>&1'
    • ECF_KILL_CMD:

      edit ECF_KILL_CMD 'kill -2 %ECF_RID% && kill -15 %ECF_RID%'
    • ECF_STATUS_CMD:

      edit ECF_STATUS_CMD 'ps --sid %ECF_RID% -f'
  • on the client side:

    • ECF_CHECK_CMD:

      edit ECF_CHECK_CMD 'ps --sid %ECF_RID% -f'
    • ECF_URL_CMD (for html man pages for tasks, plots display, products arrival html page):

      edit URLBASE https://softwareconfluence.ecmwf.int/wiki/display/
      edit URL     ECFLOW/Home
      edit ECF_URL_CMD '${BROWSER:=firefox} -remote "openURL(%URLBASE%/%URL%)"'
  • alternatively, a script may be responsible for job submission, kill, and query. At ECMWF, we use a submit script that tunes the generated job file to the remote destination. It does the following:

    • translate queuing system directives to the expected syntax,

    • tune submission timeout according to submit user and remote destination,

    • use a submission utility according to the remote system, or even the way we want the job to be submitted there: nohup, standalone, rsh, ssh, ecrcmd

    • record the remote queuing id given to the job in a ".sub" file, which the kill and query commands may use later

    • handle frequent or specific submission errors: the job may have been accepted even if the submission command reports an error, in which case the error shall not be reported to the server.

    • example:

      edit ECF_JOB_CMD    '$HOME/bin/ecf_submit %USER% %HOST% %ECF_JOB% %ECF_JOBOUT%'
      edit ECF_KILL_CMD   '$HOME/bin/ecf_kill %USER% %HOST% %ECF_RID% %ECF_JOB%'
      edit ECF_STATUS_CMD '$HOME/bin/ecf_status %USER% %HOST% %ECF_RID% %ECF_JOB%'
  • remote job submission requires the server administrator, or the suite designer, to agree with the system administration team on:

    • which file systems are shared, mounted, or local, according to the best choice for the local network topology,
    • the main submission schemes (rsh, ssh),
    • alternative submission schemes (we may use nicknames to distinguish direct job submission from submission through a queuing system on the same host),
    • fall-back schemes (when the c2a node is not available, c2a-batch is used as an alternative),
    • the best way to handle a cluster switch (from c2a to c2b: as a variable on the top node, multiple variables among the suites, a shell variable, or even a one-line switch in the submit script),
    • how to handle a remote storage switch (from /s2o1 to /s2o2: as a server variable or a shell variable in the jobs),
    • submission time-outs,
    • notification before killing a job (sending the kill -2 signal), to give the job a chance to send the abort command.
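
Two of the points above, recording the job id in a ".sub" file and notifying a job before killing it, can be sketched as a local stand-in for a submit/kill script pair. This is not the ECMWF submit script: submit_job and kill_job are hypothetical names, and a real script would talk to the remote queuing system instead of starting a local process.

```shell
#!/bin/sh
# Sketch only: local stand-in for a submit/kill script pair.

# Start the job in the background and remember its id in "<job>.sub".
submit_job() {
    job=$1           # job file, as in %ECF_JOB%
    jobout=$2        # output file, as in %ECF_JOBOUT%
    sh "$job" > "$jobout" 2>&1 &
    echo $! > "$job.sub"
}

# Send signal 2 first, wait up to "grace" seconds, then send
# signal 15 if the job is still there.
kill_job() {
    job=$1
    grace=${2:-5}    # seconds to wait between the two signals
    rid=$(cat "$job.sub") || return 1
    kill -2 "$rid" 2>/dev/null || return 0   # job already gone
    while [ "$grace" -gt 0 ] && kill -0 "$rid" 2>/dev/null; do
        sleep 1
        grace=$((grace - 1))
    done
    if kill -0 "$rid" 2>/dev/null; then
        kill -15 "$rid" 2>/dev/null || :
    fi
    return 0
}
```

One caveat for this local form: a POSIX shell without job control starts background commands with SIGINT ignored, so the first signal may have no effect on a job launched by submit_job, and only the second one ends it. A job submitted through a queuing system is not started this way.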


Python Debugging

Python suite definition files sometimes fail with a ‘Memory fault’ message. The error can be investigated by running the script under pdb or gdb:

python -m pdb  <script.py>

gdb python
> set args suite.def
> run
> bt

 

 
