<div class="section" id="zombie">
<span id="index-0"></span><span id="id1"></span>
<p>A <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a> is a running job that fails authentication when communicating with the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflow-server"><em class="xref std std-term">ecflow_server</em></a>.</p>
<div class="section" id="how-are-zombies-created">
<h2>How are zombies created?<a class="headerlink" href="#how-are-zombies-created" title="Permalink to this headline">¶</a></h2>
<div class="line-block">
<div class="line">There is a wide variety of reasons why a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a> is created.</div>
<div class="line">The most common causes are user actions:</div>
</div>
<ul class="simple">
<li>The <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-node"><em class="xref std std-term">node</em></a> tree is deleted, replaced or reloaded whilst jobs are running</li>
<li>A <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-task"><em class="xref std std-term">task</em></a> is rerun, whilst in a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-submitted"><em class="xref std std-term">submitted</em></a> or <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-active"><em class="xref std std-term">active</em></a> state</li>
<li>A job is forced to a new state, e.g. <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-complete"><em class="xref std std-term">complete</em></a></li>
</ul>
<p>Rarer causes include:</p>
<ul class="simple">
<li><a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecf-script"><em class="xref std std-term">ecf script</em></a> errors, where there are multiple calls to the init and complete <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a>s</li>
<li>The <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a>s in the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecf-script"><em class="xref std std-term">ecf script</em></a> are placed in the background.
In this case, the order in which the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a>s contact the server may be indeterminate.</li>
<li>LoadLeveler submitting a job twice</li>
<li>The server crashes and the recovered <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-check-point"><em class="xref std std-term">check point</em></a> file is out of date</li>
<li>Machine crash</li>
</ul>
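<p>The rerun case in the first list can be sketched with a small model. This is an illustration only, not ecFlow's implementation: the names <code>Task</code>, <code>rerun</code> and <code>authenticate</code> are hypothetical, and the idea that a rerun issues a fresh password is an assumption. The facts taken from this page are that a job carries an ECF_PASS and that a job failing authentication is treated as a zombie.</p>

```python
import itertools
from dataclasses import dataclass, field

# Hypothetical password source; a counter keeps the example deterministic.
_counter = itertools.count(1)


def _new_pass() -> str:
    return f"pass{next(_counter)}"


@dataclass
class Task:
    """Hypothetical server-side record for one task (not ecFlow's data model)."""
    path: str
    ecf_pass: str = field(default_factory=_new_pass)

    def rerun(self) -> None:
        # Assumption: a rerun issues a fresh password for the new job.
        self.ecf_pass = _new_pass()


def authenticate(task: Task, job_pass: str) -> str:
    """Return 'ok' when the job's password matches the task, else 'zombie'."""
    return "ok" if task.ecf_pass == job_pass else "zombie"


task = Task("/suite/family/t1")
first_job_pass = task.ecf_pass    # password handed to the first job
task.rerun()                      # the task is rerun while the first job is still active
second_job_pass = task.ecf_pass   # password handed to the second job

print(authenticate(task, second_job_pass))  # the new job authenticates
print(authenticate(task, first_job_pass))   # the old job no longer matches
```

<p>In this model the first job keeps running with a password the server no longer recognises, which is why its child commands are flagged.</p>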
</div>
<div class="section" id="how-can-zombie-s-be-handled">
<h2>How can zombies be handled?<a class="headerlink" href="#how-can-zombie-s-be-handled" title="Permalink to this headline">¶</a></h2>
<p>The default behaviour is to <strong>block</strong> the job.</p>
<div class="line-block">
<div class="line">The <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> continues attempting to contact the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflow-server"><em class="xref std std-term">ecflow_server</em></a>.</div>
<div class="line">This continues for a period of 24 hours. (The period is configurable; see ECF_TIMEOUT on <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflow-client"><em class="xref std std-term">ecflow_client</em></a>.)</div>
</div>
<div class="line-block">
<div class="line">Jobs can also be configured so that, if the server denies the communication,</div>
<div class="line">the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> fails immediately. (See ECF_DENIED on <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflow-client"><em class="xref std std-term">ecflow_client</em></a>.)</div>
</div>
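<p>The blocking behaviour can be pictured as a retry loop on the client side. The sketch below is a simplified model, not the ecflow_client source: <code>call_server</code>, its parameters and the 10 second retry interval are hypothetical, and it assumes that ECF_TIMEOUT is expressed in seconds and that any non-empty ECF_DENIED value enables the fail-immediately behaviour.</p>

```python
import os
import time
from typing import Callable


def call_server(send: Callable[[], bool],
                timeout_s: float,
                fail_on_denied: bool,
                retry_interval_s: float = 10.0,   # hypothetical value
                clock: Callable[[], float] = time.monotonic,
                sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry send() until it is accepted, the timeout passes, or denial is fatal.

    send() returns True when the server accepts the request and False when
    the server denies it (for example because the job is a zombie).
    """
    deadline = clock() + timeout_s
    while True:
        if send():
            return True            # accepted: the child command completes
        if fail_on_denied:
            return False           # ECF_DENIED-style behaviour: fail at once
        if clock() >= deadline:
            return False           # ECF_TIMEOUT-style behaviour: give up
        sleep(retry_interval_s)    # blocked: wait, then try again


# Assumption: ECF_TIMEOUT is in seconds; 24 hours is the default stated above.
timeout = float(os.environ.get("ECF_TIMEOUT", 24 * 60 * 60))
# Assumption: any non-empty ECF_DENIED value switches on fail-immediately.
denied = bool(os.environ.get("ECF_DENIED"))

# Demo with a fake clock so nothing really sleeps:
# the server denies the request twice, then accepts it.
now = [0.0]


def fake_sleep(seconds: float) -> None:
    now[0] += seconds


replies = iter([False, False, True])
accepted = call_server(lambda: next(replies), timeout_s=60, fail_on_denied=False,
                       clock=lambda: now[0], sleep=fake_sleep)
print(accepted, now[0])
```

<p>Passing the clock and sleep functions as parameters keeps the sketch testable without waiting for real time to pass.</p>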
<p><a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflowview"><em class="xref std std-term">ecflowview</em></a> provides a dialog which lists all the zombies and the actions that can be taken. These include:</p>
<ul>
<li><p class="first">Terminate:</p>
<div class="line-block">
<div class="line">The <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> is asked to <strong>fail</strong>.</div>
<div class="line">Depending on your scripts, this may cause the abort <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> to be called,</div>
<div class="line">which will again be flagged as a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a>.</div>
</div>
</li>
<li><p class="first">Fob:</p>
<p>Allow the job to continue. The <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> completes and hence no longer blocks the job.</p>
<div class="line-block">
<div class="line">Great care should be taken when this action is chosen.</div>
<div class="line">If two jobs are running, they may cause data corruption.</div>
<div class="line">Even with a single job, issues can arise.</div>
<div class="line">For example, if the associated command was an event <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a>, then the</div>
<div class="line"><a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-event"><em class="xref std std-term">event</em></a> would not be set. If this <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-event"><em class="xref std std-term">event</em></a> was used in a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-trigger"><em class="xref std std-term">trigger</em></a> expression,</div>
<div class="line">the expression would never evaluate to true.</div>
</div>
</li>
<li><p class="first">Delete:</p>
<div class="line-block">
<div class="line">Remove the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a> from the server. The job continues blocking, so</div>
<div class="line">when the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> next contacts the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflow-server"><em class="xref std std-term">ecflow_server</em></a>, the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a> will reappear.</div>
<div class="line">Use this option if the job has been killed manually.</div>
</div>
</li>
<li><p class="first">Rescue:</p>
<div class="line-block">
<div class="line"><strong>Adopt</strong> the zombie and update the node tree.</div>
<div class="line">The ECF_PASS on the zombie is copied over to the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-task"><em class="xref std std-term">task</em></a>, so that the next</div>
<div class="line"><a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> will continue as normal.</div>
</div>
</li>
<li><p class="first">Kill:</p>
<div class="line-block">
<div class="line">Applies the kill command (ECF_KILL_CMD) using the process id stored on the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a>.</div>
<div class="line">If the script has correct signal trapping, this should end up calling abort.</div>
<div class="line">Note: path zombies will need to be killed manually.</div>
</div>
</li>
</ul>
<div class="admonition warning">
<p class="first admonition-title">Warning</p>
<p class="last">Of the actions above, only Rescue allows a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> to change the state of the node tree.</p>
</div>
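<p>The actions described above can be summarised as a small lookup table. This is a reading aid built only from the descriptions on this page: the <code>Action</code> and <code>Outcome</code> names are hypothetical and do not correspond to the ecFlow API.</p>

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    BLOCK = "block"          # default: child command keeps retrying
    TERMINATE = "terminate"  # child command is asked to fail
    FOB = "fob"              # child command completes, node tree unchanged
    DELETE = "delete"        # zombie removed from the server, job keeps blocking
    RESCUE = "rescue"        # adopt: ECF_PASS copied from the zombie to the task
    KILL = "kill"            # ECF_KILL_CMD applied to the stored process id


@dataclass(frozen=True)
class Outcome:
    child_effect: str         # what happens to the child command or job
    updates_node_tree: bool   # may the child command change node state?


OUTCOMES = {
    Action.BLOCK: Outcome("keeps retrying", False),
    Action.TERMINATE: Outcome("fails", False),
    Action.FOB: Outcome("completes", False),
    Action.DELETE: Outcome("keeps retrying, zombie reappears", False),
    Action.RESCUE: Outcome("continues as normal", True),
    Action.KILL: Outcome("process is killed", False),
}

# Collect the actions that let a child command change the node tree.
state_changing = [a.name for a, o in OUTCOMES.items() if o.updates_node_tree]
print(state_changing)
```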
<p><strong>What to do:</strong></p>
<ol class="arabic simple">
<li>Create a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a> by starting a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-task"><em class="xref std std-term">task</em></a>, and setting it to <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-complete"><em class="xref std std-term">complete</em></a> immediately via <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflowview"><em class="xref std std-term">ecflowview</em></a></li>
<li>Inspect the log file; it shows how the zombie arose.</li>
<li>Inspect the zombie dialog in <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflowview"><em class="xref std std-term">ecflowview</em></a> (right mouse button selection on the host node)</li>
<li>Experiment with the different actions on the zombie</li>
<li>Select the host node and invoke the <strong>option...</strong> menu selection.
Select the Zombies button. This enables zombie notification via a pop-up window.</li>
</ol>
</div>
</div>