CMD variables

CMD variables must be set so that jobs can be submitted, killed, and queried, both locally and remotely. They are:

  • on the server side:

    • ECF_JOB_CMD:

      edit ECF_JOB_CMD '%ECF_JOB% > %ECF_JOBOUT% 2>&1'
      edit ECF_JOB_CMD 'rsh %HOST% %ECF_JOB% > %ECF_JOBOUT% 2>&1'
    • ECF_KILL_CMD:

      edit ECF_KILL_CMD 'kill -2 %ECF_RID% && kill -15 %ECF_RID%'
    • ECF_STATUS_CMD:

      edit ECF_STATUS_CMD 'ps --sid %ECF_RID% -f'
  • on the client side:

    • ECF_CHECK_CMD:

      edit ECF_CHECK_CMD 'ps --sid %ECF_RID% -f'
    • ECF_URL_CMD (for HTML man pages for tasks, plot display, and product arrival HTML pages):

      edit URLBASE https://software.ecmwf.int/wiki/display/
      edit URL     ECFLOW/Home
      edit ECF_URL_CMD '${BROWSER:=firefox} -remote "openURL(%URLBASE%/%URL%)"'
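
As a rough illustration of how these command strings are used, the sketch below expands %NAME% placeholders from a table of variables, which is the kind of substitution applied to a value such as ECF_JOB_CMD before the command is run. The function name expand, the regular expression, and the paths are assumptions made for this example. It is not ecFlow's own implementation, which also resolves variables through the suite hierarchy.

```python
import re


def expand(command, variables):
    """Replace every %NAME% placeholder in command with variables[NAME]."""

    def lookup(match):
        name = match.group(1)
        if name not in variables:
            # Fail loudly rather than run a half-expanded command.
            raise KeyError(f"undefined variable: {name}")
        return variables[name]

    return re.sub(r"%([A-Za-z_][A-Za-z0-9_]*)%", lookup, command)


# Hypothetical values, chosen only for the example.
variables = {
    "ECF_JOB": "/tmp/suite/task.job1",
    "ECF_JOBOUT": "/tmp/suite/task.1",
}
print(expand("%ECF_JOB% > %ECF_JOBOUT% 2>&1", variables))
# prints: /tmp/suite/task.job1 > /tmp/suite/task.1 2>&1
```

An undefined name raises KeyError here, so a typo in a variable name is caught before any command is launched.
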
  • alternatively, a script may be responsible for job submission, kill, and query. At ECMWF, we use a submit script that tunes the generated job file to the remote destination. It can:

    • translate queuing system directives to the expected syntax,

    • tune submission timeout according to submit user and remote destination,

    • use a submission utility suited to the remote system, or to the way we want the job to be submitted there: nohup, standalone, rsh, ssh, ecrcmd,

    • keep track of the remote queuing id given to the job by storing it in a ".sub" file, which the kill and query commands may use later,

    • handle frequent or specific submission errors: a job may have been accepted even though the submission command reports an error, in which case the error shall not be reported to the server.

    • example:

      edit ECF_JOB_CMD    '$HOME/bin/ecf_submit %USER% %HOST% %ECF_JOB% %ECF_JOBOUT%'
      edit ECF_KILL_CMD   '$HOME/bin/ecf_kill %USER% %HOST% %ECF_RID% %ECF_JOB%'
      edit ECF_STATUS_CMD '$HOME/bin/ecf_status %USER% %HOST% %ECF_RID% %ECF_JOB%'
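
The behaviour of such a submit script can be sketched as follows. This is only an illustration under stated assumptions, not the ECMWF script: the names remote_prefix, run, submit, and submit_tool are hypothetical, the queuing system is assumed to echo the job id as "id=<value>", and the ".sub" file is assumed to sit next to the job file. Translation of queuing directives is left out.

```python
import re
import subprocess
from pathlib import Path


def remote_prefix(scheme, user, host):
    """Return the command prefix that reaches the destination host."""
    if scheme == "local":
        return []
    if scheme == "ssh":
        return ["ssh", f"{user}@{host}"]
    if scheme == "rsh":
        return ["rsh", "-l", user, host]
    raise ValueError(f"unknown submission scheme: {scheme}")


def run(argv, timeout):
    """Run argv and return (exit status, combined stdout and stderr)."""
    done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return done.returncode, done.stdout + done.stderr


def submit(scheme, user, host, job, submit_tool, timeout=60, runner=run):
    """Submit job through submit_tool; return True if it counts as accepted."""
    argv = remote_prefix(scheme, user, host) + [submit_tool, job]
    status, output = runner(argv, timeout)
    # Assumption: the queuing system echoes the job id as "id=<value>".
    found = re.search(r"id=(\S+)", output)
    if found:
        # Remember the remote id for later kill and status commands.
        Path(job + ".sub").write_text(found.group(1) + "\n")
        # A reported id means the job was accepted, even if the
        # submission command itself exited with an error.
        return True
    return status == 0
```

The timeout argument is where a per-user or per-destination submission timeout would be plugged in, and the runner argument lets the logic be exercised without a real remote host.
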
  • remote job submission requires the server administrator, or the suite designer, to work with the system administration team in order to decide:

    • whether to use shared, mounted, or local file systems, depending on the best choice for the local network topology,
    • main submission schemes (rsh, ssh),
    • alternative submission schemes (we may use nicknames to distinguish direct job submission from submission through a queuing system on the same host),
    • fall-back schemes (when the c2a node is not available, c2a-batch is used as an alternative),
    • the best way to handle a cluster switch (from c2a to c2b: as a variable on the top node, multiple variables among the suites, a shell variable, or even a one-line switch in the submit script),
    • how to handle a remote storage switch (from /s2o1 to /s22o: as a server variable or a shell variable in the jobs),
    • submission time-outs,
    • notification before killing a job (sending a kill -2 signal), to give the job a chance to send the abort command.
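
The last point, notifying a job before killing it, can be sketched as below. This is a minimal illustration and not the ECMWF kill script: the function names and the grace period are assumptions, and the job is assumed to trap signal 2 and send the abort command itself. Signal 2 is SIGINT and signal 15 is SIGTERM, matching the kill -2 and kill -15 commands shown earlier.

```python
import os
import signal
import time


def process_exists(pid):
    """Return True if a process with this pid is still present."""
    try:
        os.kill(pid, 0)  # signal 0 only checks that the pid can be signalled
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def notify_then_kill(pid, grace=10.0, poll=0.2):
    """Send SIGINT, wait up to grace seconds, then send SIGTERM if needed.

    Returns the name of the last signal sent.
    """
    os.kill(pid, signal.SIGINT)  # notification: lets the job send its abort
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        # Note: a finished child of the calling process still counts as
        # existing until its parent has reaped it.
        if not process_exists(pid):
            return "SIGINT"
        time.sleep(poll)
    os.kill(pid, signal.SIGTERM)  # the job did not stop in time
    return "SIGTERM"
```

The grace period is the design choice to agree on with the system administrators: too short and the abort command may not reach the server, too long and a kill request appears to hang.
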
