
<div class="section" id="toward-an-operational-server">
<span id="operational"></span><span id="index-0"></span>
<p>At the beginning of the course, the ecFlow server is started in a
directory, and task/header scripts are created below that same
directory. This rapidly leads to a situation with many files in one
place (administrative server files, checkpoint files, log-file, white
list, task wrappers, but also all job/output/alias files).</p>
<p>This may be difficult to <strong>maintain</strong> or update after some time.</p>
<div class="section" id="files-location">
<h2>Files Location<a class="headerlink" href="#files-location" title="Permalink to this headline">¶</a></h2>
<p>For a maintainable operational suite, we recommend to:</p>
<ul>
<li><p class="first">start the server in a local /tmp directory (<strong>/:ECF_HOME</strong>)</p>
</li>
<li><p class="first">define <strong>ECF_FILES</strong> and <strong>ECF_INCLUDE</strong> variables at suite
and family level: script wrappers will then be accessible for
creation and update under these directories.</p>
</li>
<li><p class="first">define <strong>/suite:ECF_HOME</strong> as another directory location, where
jobs and related outputs will be found. These dynamic files may
then be tar&#8217;ed as a snapshot of the effective work associated with a
suite/family, for later analysis or rerun.</p>
</li>
<li><p class="first">when the server and the remote job destination do not share a common
directory for output files, the <strong>ECF_OUT</strong> variable needs to be present
in the suite definition: it indicates the <strong>remote output path</strong>. In
this situation, the suite designer is responsible for creating the
directory structure where the output file will be found. Most
queueing systems won&#8217;t start the job if this directory is absent,
and the task may remain visible as submitted from the ecFlow server
side.</p>
</li>
<li><p class="first">after sending the job complete command, the job may copy its output
to <strong>ECF_NODE</strong>, to enable direct access from the ecFlow server.</p>
<p>When a file is requested from the ecFlow server, the transfer is
limited to 15k lines, to avoid the server spending too much time
delivering very large output files.</p>
<p>ecFlowview can be configured (<strong>globally</strong>, via
Edit-Preferences-Server-Option, or <strong>locally</strong> via the top-node-menu-&gt;Options
entry &#8220;Read Output and other files from disk when possible&#8221;) to get the
best expected behaviour.</p>
</li>
</ul>
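<p>The recommendations above might be combined in a suite definition along these lines (all paths and names here are hypothetical):</p>

```
# server started with ECF_HOME on a local /tmp file system
suite ops
  edit ECF_FILES   /home/ops/suites/ops/files    # task wrappers, easy to maintain
  edit ECF_INCLUDE /home/ops/suites/ops/include  # header files
  edit ECF_HOME    /tmp/ops_work                 # jobs and outputs (dynamic files)
  edit ECF_OUT     /remote/ops_work              # output path on the remote host
  family main
    task t1
  endfamily
endsuite
```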
</div>
<div class="section" id="log-server">
<h2>Log-Server<a class="headerlink" href="#log-server" title="Permalink to this headline">¶</a></h2>
<p>It is possible to set up a <strong>log-server</strong> to access &#8216;live&#8217;
output from the jobs. ecFlow is provided with the perl script
logsvr.pl.</p>
<ul>
<li><p class="first">it is configured to deliver files under specific directories,</p>
</li>
<li><p class="first">configuration variables are</p>
<ul class="simple">
<li>LOGPORT # 9316</li>
<li>LOGPATH # &lt;path1&gt;:&lt;path2&gt;:&lt;path3&gt;</li>
<li>LOGMAP  # mapping between a requested path and its real, actual location</li>
</ul>
<p>As an example, with two possible storage destinations:</p>
<div class="highlight-python"><pre>export LOGPATH=/s2o1/logs:/s2o2/logs # two possible
export LOGMAP=/s2o1/logs:/s2o1/logs:/s2o2/logs:/s2o2/logs # maps itself
export LOGMAP=$LOGMAP:/tmp:/s2o1/logs:/tmp:/s2o2/logs     # map from /tmp</pre>
</div>
</li>
<li><p class="first">It is started on the remote machine and ecFlowview GUI will
contact it when the variables ECF_LOGHOST and ECF_LOGPORT are
defined in the suite:</p>
<div class="highlight-python"><pre>edit ECF_LOGHOST c2a
edit ECF_LOGPORT 9316</pre>
</div>
</li>
<li><p class="first">it can be tested from the command line with telnet:</p>
<div class="highlight-python"><pre>telnet c2a 9316 # get &lt;file&gt; # list &lt;directory&gt;</pre>
</div>
</li>
</ul>
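<p>As an illustration of the LOGMAP idea, a small shell sketch (our own, not part of logsvr.pl) that resolves a requested path against the pair list &#8220;requested:actual:requested:actual...&#8221;:</p>

```shell
#!/bin/sh
# LOGMAP holds requested:actual pairs, colon separated (same layout as above)
LOGMAP=/s2o1/logs:/s2o1/logs:/tmp:/s2o1/logs

map_path() {
  req=$1 rest=$LOGMAP
  while [ -n "$rest" ]; do
    from=${rest%%:*}; rest=${rest#*:}   # requested prefix
    to=${rest%%:*}                      # actual location
    rest=${rest#"$to"}; rest=${rest#:}
    case $req in
      "$from"/*|"$from") echo "$to${req#"$from"}"; return ;;
    esac
  done
  echo "$req"                           # no mapping found: use the path as-is
}

map_path /tmp/suite/task.1   # -> /s2o1/logs/suite/task.1
```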
</div>
<div class="section" id="backup-server">
<h2>Backup Server<a class="headerlink" href="#backup-server" title="Permalink to this headline">¶</a></h2>
<p>It can be useful to have a backup ecFlow server in case of a network,
disk or host machine crash. The backup server shall be activated on
another workstation, with the most recent checkpoint file.</p>
<p>An &#8216;ecf_hostfile&#8217; file can be created with the list of hostnames to
contact when the link with the original ecFlow server is broken.</p>
<p>The common task header head.h may be updated with:</p>
<div class="highlight-python"><pre>export ECF_HOSTFILE=$HOME/.ecf_hostfile</pre>
</div>
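<p>A minimal hostfile might simply list the fall-back hosts, one per line (hostnames here are hypothetical):</p>

```
# $HOME/.ecf_hostfile: hosts to try when the main server is unreachable
backup-host1
backup-host2
```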
</div>
<div class="section" id="system">
<h2>System<a class="headerlink" href="#system" title="Permalink to this headline">¶</a></h2>
<p>As soon as the basic principles of ecFlow are understood and mastered,
setting up a project with operational constraints may face several
challenges:</p>
<ul class="simple">
<li>I/O and disk access is critical for the server:<ul>
<li>use a local file system (/:ECF_HOME /tmp),</li>
<li>use a mounted file system, capable of handling demanding I/O,</li>
<li>use a snapshot-capable file system.</li>
</ul>
</li>
</ul>
</div>
<div class="section" id="error-handling">
<h2>Error Handling<a class="headerlink" href="#error-handling" title="Permalink to this headline">¶</a></h2>
<p>A task should abort as close to the problem as possible:</p>
<ul>
<li><p class="first">trap is used to intercept external signals received by the job. On
Linux, the trapped signals are 1 2 3 4 5 6 7 8 13 15 24 31. A signal
may be sent when the job exceeds a cpu-time or memory
consumption threshold, on a kill request from the server (kill -2),
or on a &#8216;command line&#8217; kill from the (root) user.</p>
</li>
<li><p class="first">losing the trapping capability is easy:</p>
<ul>
<li><p class="first">trapping inheritance between the main ksh script and ksh functions
is system dependent. To maintain deterministic behaviour, do not
hesitate to repeat the trap setting:</p>
<div class="highlight-python"><pre># ... a function in a task wrapper ...
function make
{
  %include &lt;trap_func.h&gt;
  # body of the function
  set -e; trap 0; return 0 ##### reset trap
}

# trap_func.h example:
for sgn in $SIGNAL_LIST 0 ; do
trap "{ echo \"Error in function with signal $sgn\"; exit 1; }" $sgn
done</pre>
</div>
</li>
<li><p class="first">calling rsh or ssh within a task will not propagate a remote
error locally.</p>
<p>In most cases, the suite may appear to run &#8220;as requested&#8221;, with
jobs completing. The problem can then only be identified
through job output analysis, when a task aborts later in the
absence of the expected products, or when a product user
reports it.</p>
<p>Splitting the job into simple units (tasks), submitted directly
to the expected destination, is part of the suite design. It
leads to clear identification of submission problems, followed by
red tasks that can be rerun later, once the problem has been
solved.</p>
</li>
</ul>
</li>
<li><p class="first">Early exits must be a choice of the task designer, calling &#8216;trap 0;
ecflow_client --complete; exit 0&#8217;. With &#8216;trap ERROR 0&#8217; set, an early exit
will call the ERROR function, and then &#8216;ecflow_client --abort&#8217;.</p>
</li>
<li><p class="first">unset variables can be detected thanks to &#8216;set -u&#8217;</p>
</li>
<li><p class="first">time stamps may be added on a per-line basis with the variable PS4</p>
</li>
<li><p class="first">the ECF_TRIES variable can be increased to allow multiple submission
attempts (some jobs may become more verbose on a second submission, or
the failure may just be a &#8216;network glitch&#8217;)</p>
</li>
</ul>
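<p>The interplay between the exit trap and a deliberate early exit can be sketched in plain sh (echo stands in for the real ecflow_client calls):</p>

```shell
#!/bin/sh
# With an exit trap set, any exit runs ERROR (which would call
# "ecflow_client --abort"); a deliberate early complete must clear
# the trap first with 'trap 0'.
task() (
  ERROR() { echo abort; }
  trap ERROR 0
  if [ "$1" = early ]; then
    trap 0           # clear the exit trap before a clean early exit
    echo complete    # would be: ecflow_client --complete
    exit 0
  fi
  false              # simulated failure: the exit trap fires -> ERROR
)
task early     # prints: complete
task fail || : # prints: abort
```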
</div>
<div class="section" id="server-administration">
<h2>Server Administration<a class="headerlink" href="#server-administration" title="Permalink to this headline">¶</a></h2>
<p>An &#8216;admin&#8217; suite will be required:</p>
<ul>
<li><p class="first">to ensure that the ecFlow logfile is not filling up the disk, nor
hitting a quota limit, by regularly issuing the command:</p>
<div class="highlight-python"><div class="highlight"><pre><span class="n">ecflow_client</span> <span class="o">--</span><span class="n">port</span> <span class="o">%</span><span class="n">ECF_PORT</span><span class="o">%</span> <span class="o">--</span><span class="n">host</span> <span class="o">%</span><span class="n">ECF_NODE</span><span class="o">%</span> <span class="o">--</span><span class="n">log</span><span class="o">=</span><span class="n">new</span>
</pre></div>
</div>
</li>
<li><p class="first">to duplicate the checkpoint file on a remote backup server, or a
slower long-term archive system (to handle a disk failure, a
hosting workstation problem, or a network issue that requires
starting the backup server).</p>
</li>
<li><p class="first">to maintain a white list file controlling access for read-write and read-only users</p>
</li>
</ul>
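<p>Such housekeeping is often scheduled with cron; a sketch of a crontab (host, port and checkpoint path are hypothetical):</p>

```
# rotate the ecFlow server log every night
0 0 * * * ecflow_client --port 3141 --host myhost --log=new
# then copy the checkpoint file to a backup machine
5 0 * * * scp /tmp/myhost.3141.check backup-host:/backup/ecflow/
```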
</div>
<div class="section" id="cmd-variables">
<h2>CMD variables<a class="headerlink" href="#cmd-variables" title="Permalink to this headline">¶</a></h2>
<p>CMD variables shall be set so that they can submit/kill/query a job
locally and remotely. They are:</p>
<ul>
<li><p class="first">on the server side:</p>
<ul>
<li><p class="first">ECF_JOB_CMD:</p>
<div class="highlight-python"><pre>edit ECF_JOB_CMD '%ECF_JOB% &gt; %ECF_JOBOUT% 2&gt;&amp;1'
edit ECF_JOB_CMD 'rsh %ECF_JOB% &gt; %ECF_JOBOUT% 2&gt;&amp;1'</pre>
</div>
</li>
<li><p class="first">ECF_KILL_CMD:</p>
<div class="highlight-python"><pre>edit ECF_KILL_CMD 'kill -2 %ECF_RID% &amp;&amp; kill -15 %ECF_RID%'</pre>
</div>
</li>
<li><p class="first">ECF_STATUS_CMD:</p>
<div class="highlight-python"><pre>edit ECF_STATUS_CMD 'ps --sid %ECF_RID% -f'</pre>
</div>
</li>
</ul>
</li>
<li><p class="first">on the client side:</p>
<ul>
<li><p class="first">ECF_CHECK_CMD:</p>
<div class="highlight-python"><pre>edit ECF_CHECK_CMD 'ps --sid %ECF_RID% -f'</pre>
</div>
</li>
<li><p class="first">ECF_URL_CMD (for html man pages for tasks, plots display, products
arrival html page):</p>
<div class="highlight-python"><pre>edit URLBASE https://software.ecmwf.int/wiki/display/
edit URL     ECFLOW/Home
edit ECF_URL_CMD '${BROWSER:=firefox} -remote "openURL(%URLBASE%/%URL%)"'</pre>
</div>
</li>
</ul>
</li>
<li><p class="first">alternatively, a script may be responsible for job
submission/kill/query. At ECMWF, we use a submit script that tunes
the generated job file to the remote destination. It does:</p>
<ul>
<li><p class="first">translate queuing system directives to the expected syntax,</p>
</li>
<li><p class="first">tune submission timeout according to submit user and remote destination,</p>
</li>
<li><p class="first">use a submission utility according to the remote system, or even
to the way we want the job to be submitted there: nohup,
standalone, rsh, ssh, ecrcmd,</p>
</li>
<li><p class="first">keep a record of the <strong>remote queuing id</strong> given to the job, storing it in a
&#8220;.sub&#8221; file that may be used later by the kill and query commands,</p>
</li>
<li><p class="first">handle frequent or specific submission errors: a job may
have been accepted even if the submission command reports an
error, and such cases shall not be reported as failures to the server.</p>
</li>
<li><p class="first">example:</p>
<div class="highlight-python"><pre>edit ECF_JOB_CMD    '$HOME/bin/ecf_submit %USER% %HOST% %ECF_JOB% %ECF_JOBOUT%'
edit ECF_KILL_CMD   '$HOME/bin/ecf_kill %USER% %HOST% %ECF_RID% %ECF_JOB%'
edit ECF_STATUS_CMD '$HOME/bin/ecf_status %USER% %HOST% %ECF_RID% %ECF_JOB%'</pre>
</div>
</li>
</ul>
</li>
<li><p class="first">remote job submission needs the server administrator, or the suite
designer, to communicate with the system administration team, in
order to decide:</p>
<ul class="simple">
<li>shared, mounted, or local file systems according to best choice or
topology, in the local network.</li>
<li>main submission schemes (rsh, ssh),</li>
<li>alternative submission scheme (we may use nicknames to distinguish
direct job submission from submission through a queueing system on
the same host)</li>
<li>fall-back schemes (when c2a node is not available, c2a-batch is to
be used, as alternative)</li>
<li>the best way to handle a cluster switch (from c2a to c2b: as a
variable on the top node, multiple variables among the suites,
a shell variable, or even a one-line switch in the submit script),</li>
<li>how to handle a remote storage switch (from /s2o1 to /s2o2, as a server
variable or a shell variable in the jobs),</li>
<li>submission time-outs,</li>
<li>notification before killing a job, (sending kill -2 signal), to
give a chance to send the abort command.</li>
</ul>
</li>
</ul>
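<p>A toy sketch of such a submit wrapper (our own simplification: plain background execution stands in for the rsh/ssh/queueing submission, and all names are hypothetical):</p>

```shell
#!/bin/sh
# ecf_submit USER HOST JOB JOBOUT: run the job and keep its id in a
# ".sub" file, so later kill/status commands can find it.
ecf_submit() {
  job=$3 jobout=$4        # $1=user and $2=host are unused in this local sketch
  sh "$job" > "$jobout" 2>&1 &
  echo $! > "$job.sub"    # the "queuing id": here, just the local pid
}

job=$(mktemp)
echo 'echo hello from job' > "$job"
ecf_submit user localhost "$job" "$job.out"
wait                      # let the background job finish
cat "$job.sub"            # the recorded id
```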
</div>
<div class="section" id="task-design">
<h2>Task Design<a class="headerlink" href="#task-design" title="Permalink to this headline">¶</a></h2>
<p>Most tasks should be re-runnable, and they should have an up-to-date
&#8216;manual section&#8217;.</p>
</div>
<div class="section" id="micro">
<h2>micro<a class="headerlink" href="#micro" title="Permalink to this headline">¶</a></h2>
<p>The micro character (%) is used as the variable delimiter, and to start
preprocessing directives (include, manual, end, nopp) in task wrappers.</p>
<ul>
<li><p class="first">It can be changed in the definition file via the ECF_MICRO variable:</p>
<div class="highlight-python"><pre>edit ECF_MICRO @ # we shall find @include in the affected wrappers</pre>
</div>
</li>
<li><p class="first">micro may change through the job, thanks to the directive
%ecf_micro:</p>
<div class="highlight-python"><pre>%ecf_micro ^ # change micro to the caret character
^include "standalone_script.pl"
^ecf_micro % # revert back to original character</pre>
</div>
</li>
<li><p class="first">%nopp can be used to avoid duplicating the &#8216;%&#8217; character in
sections of the task wrapper where it is frequently used (date, perl)</p>
</li>
<li><p class="first">%includenopp &lt;file&gt; is also a simple way to import a script that does
not contain ecFlow preprocessing directives, and that may contain
the micro &#8216;%&#8217; character</p>
</li>
</ul>
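<p>A wrapper fragment using %nopp, so that literal &#8216;%&#8217; characters need no doubling between the directives:</p>

```
%nopp
DATE=$(date +%Y%m%d)   # '%' can be used freely until the next directive
%end
```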
</div>
<div class="section" id="python-debugging">
<h2>Python Debugging<a class="headerlink" href="#python-debugging" title="Permalink to this headline">¶</a></h2>
<p>Python suite definition files sometimes lead to a &#8216;Memory fault&#8217;
message. The error can be investigated by running the script with pdb or gdb:</p>
<div class="highlight-python"><pre>python -m pdb  &lt;script.py&gt;

gdb python
&gt; set args suite.def
&gt; run
&gt; bt</pre>
</div>
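<p>A complementary stdlib route (assuming Python 3.3 or later) is to enable faulthandler, so that a crashing interpreter dumps a Python-level traceback before dying, e.g. python3 -X faulthandler suite.py. The -X flag simply pre-enables the module:</p>

```shell
# confirm that -X faulthandler enables the stdlib faulthandler module
python3 -X faulthandler -c 'import faulthandler; print(faulthandler.is_enabled())'
# -> True
```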
</div>
</div>