A zombie is a running job that fails authentication when communicating with the ecflow_server.

How are zombies created?

There is a wide variety of reasons why a zombie is created.
The most common causes are due to user action:
  • The node tree is deleted, replaced or reloaded whilst jobs are running
  • A task is rerun, whilst in a submitted or active state
  • A job is forced to a new state, e.g. complete
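
As a rough illustration of the rerun case, the sketch below models a server that gives each submitted job a fresh password and rejects child commands that carry an older one. The ToyServer class, its methods, and the "/suite/t1" path are hypothetical and are not part of ecFlow; only the idea of a per-job password (ECF_PASS) comes from this page.

```python
import itertools


class ToyServer:
    """Hypothetical stand-in for the server's bookkeeping; not the ecFlow API."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.passwords = {}  # task path -> password of the latest submission
        self.zombies = []    # (task path, child command) pairs that failed authentication

    def submit(self, path):
        """Start (or rerun) a task: the new job gets a new password."""
        password = f"pw{next(self._counter)}"
        self.passwords[path] = password
        return password

    def child_command(self, path, password, command):
        """Accept the command only if the password matches the latest submission."""
        if self.passwords.get(path) != password:
            self.zombies.append((path, command))
            return False
        return True


server = ToyServer()
first_job = server.submit("/suite/t1")   # job 1 starts running
second_job = server.submit("/suite/t1")  # the task is rerun while job 1 is still active

accepted_new = server.child_command("/suite/t1", second_job, "init")
accepted_old = server.child_command("/suite/t1", first_job, "complete")
```

In this model the second job authenticates normally, while the first job's complete command is recorded as a zombie. A deleted node tree behaves the same way here, because no password is stored for the path any more.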

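Rarer causes include:
  • ecf script errors, where the init and complete child commands are called more than once
  • The child commands in the ecf script are placed in the background.
    In this case the order in which the child commands contact the server may be indeterminate.
  • LoadLeveler submits a job twice
  • The server crashes and the recovered check point file is out of date
  • Machine crash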

How can zombies be handled?

The default behaviour is to block the job.

The child command continues attempting to contact the ecflow_server.
This is done for a period of 24 hours. (This period is configurable; see ECF_TIMEOUT on ecflow_client.)
The jobs can also be configured so that, if the server denies the communication,
the child command fails immediately. (See ECF_DENIED on ecflow_client.)
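
The blocking behaviour described above can be pictured as a retry loop on the client side. The sketch below is a simplified model, not the ecflow_client implementation: send, retry_interval, and the injected clock and sleep functions are assumptions made for illustration, and timeout and fail_if_denied merely play the roles of ECF_TIMEOUT and ECF_DENIED.

```python
import time


def run_child_command(send, timeout=24 * 3600, fail_if_denied=False,
                      retry_interval=10, clock=time.monotonic, sleep=time.sleep):
    """Retry a child command until it is accepted, denied (if configured), or timed out.

    send is a hypothetical callable that returns "ok" or "denied".
    """
    deadline = clock() + timeout
    while True:
        if send() == "ok":
            return "accepted"
        if fail_if_denied:
            return "failed: denied by server"  # ECF_DENIED-style behaviour
        if clock() >= deadline:
            return "failed: timed out"         # ECF_TIMEOUT-style behaviour
        sleep(retry_interval)                  # the job stays blocked between attempts


class FakeClock:
    """Simulated time, so the example finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
blocked = run_child_command(lambda: "denied", timeout=60,
                            clock=fake.time, sleep=fake.sleep)
immediate = run_child_command(lambda: "denied", fail_if_denied=True,
                              clock=fake.time, sleep=fake.sleep)
```

With the simulated clock, the first call keeps retrying until the 60-second limit is reached, and the second call gives up on the first denial.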

ecflowview provides a dialog which lists all the zombies and the actions that can be taken. These include:

  • Terminate:

    The child command is asked to fail.
    Depending on your scripts, this may cause the abort child command to be called,
    which will again be flagged as a zombie.
  • Fob:

    Allow the job to continue. The child command completes and hence no longer blocks the job.

    Great care should be taken when this action is chosen.
    If we have two jobs running, they may cause data corruption.
    Even when we have a single job, issues can arise.
    For example, if the associated command was an event child command, then the
    event would not be set. If this event was used in a trigger expression,
    the expression would never evaluate to true.
  • Delete:

    Remove the zombie from the server. The job will continue blocking, hence
    when the child command next contacts the ecflow_server, the zombie will re-appear.
    If the job is killed manually, then this option can be used.
  • Rescue:

    Adopt the zombie and update the node tree.
    The ECF_PASS on the zombie is copied over to the task, so that the next
    child command will continue as normal.
  • Kill:

    Applies the kill command (ECF_KILL_CMD) using the process id stored on the zombie.
    If the script has correct signal trapping, this should end up calling abort.
    Note: path zombies will need to be killed manually.

Warning

Of the actions above, only Rescue allows the child command to change the state of the node tree.
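
The actions listed above can be summarised as a small lookup table. This is only a reading aid built from the descriptions on this page; the Effect tuple and its field names are invented here and do not correspond to any ecFlow data structure.

```python
from typing import NamedTuple


class Effect(NamedTuple):
    child_command: str       # what happens to the waiting child command or job
    updates_node_tree: bool  # may the child command change the node state?


# Hypothetical summary of the dialog actions, taken from the text above.
ZOMBIE_ACTIONS = {
    "Terminate": Effect("asked to fail", False),
    "Fob": Effect("completes, so the job is no longer blocked", False),
    "Delete": Effect("keeps blocking; the zombie re-appears on next contact", False),
    "Rescue": Effect("adopted; ECF_PASS is copied to the task", True),
    "Kill": Effect("job is killed via ECF_KILL_CMD", False),
}

state_changing = [name for name, effect in ZOMBIE_ACTIONS.items()
                  if effect.updates_node_tree]
```

Filtering on updates_node_tree leaves only Rescue, which matches the warning above.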

What to do:

  1. Create a zombie by starting a task, and setting it to complete immediately via ecflowview
  2. Inspect the log file; it will show you how the zombie arose.
  3. Inspect the zombie dialog in ecflowview (right mouse button selection on the host node)
  4. Experiment with the different actions on the zombie
  5. Select the host node and invoke the option... menu selection. Select the Zombies button. This enables zombie notification via a pop-up window.

 


<div class="section" id="zombie">
<span id="index-0"></span><span id="id1"></span>
<p>A <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a> is a running job that fails authentication when communicating with the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflow-server"><em class="xref std std-term">ecflow_server</em></a>.</p>
<div class="section" id="how-are-zombies-created">
<h2>How are zombies created?<a class="headerlink" href="#how-are-zombies-created" title="Permalink to this headline">¶</a></h2>
<div class="line-block">
<div class="line">There is a wide variety of reasons why a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a> is created.</div>
<div class="line">The most common causes are due to user action:</div>
</div>
<ul class="simple">
<li>The <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-node"><em class="xref std std-term">node</em></a> tree is deleted, replaced or reloaded whilst jobs are running</li>
<li>A <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-task"><em class="xref std std-term">task</em></a> is rerun, whilst in a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-submitted"><em class="xref std std-term">submitted</em></a> or <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-active"><em class="xref std std-term">active</em></a> state</li>
<li>A job is forced to a new state, e.g. <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-complete"><em class="xref std std-term">complete</em></a></li>
</ul>
<p>Rarer causes include:</p>
<ul class="simple">
<li><a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecf-script"><em class="xref std std-term">ecf script</em></a> errors, where the init and complete <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a>s are called more than once</li>
<li>The <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a>s in the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecf-script"><em class="xref std std-term">ecf script</em></a> are placed in the background.
In this case the order in which the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a>s contact the server may be indeterminate.</li>
<li>LoadLeveler submits a job twice</li>
<li>The server crashes and the recovered <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-check-point"><em class="xref std std-term">check point</em></a> file is out of date</li>
<li>Machine crash</li>
</ul>
</div>
<div class="section" id="how-can-zombie-s-be-handled">
<h2>How can zombies be handled?<a class="headerlink" href="#how-can-zombie-s-be-handled" title="Permalink to this headline">¶</a></h2>
<p>The default behaviour is to <strong>block</strong> the job.</p>
<div class="line-block">
<div class="line">The <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> continues attempting to contact the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflow-server"><em class="xref std std-term">ecflow_server</em></a>.</div>
<div class="line">This is done for a period of 24 hours. (This period is configurable; see ECF_TIMEOUT on <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflow-client"><em class="xref std std-term">ecflow_client</em></a>.)</div>
</div>
<div class="line-block">
<div class="line">The jobs can also be configured so that, if the server denies the communication,</div>
<div class="line">the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> fails immediately. (See ECF_DENIED on <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflow-client"><em class="xref std std-term">ecflow_client</em></a>.)</div>
</div>
<p><a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflowview"><em class="xref std std-term">ecflowview</em></a> provides a dialog which lists all the zombies and the actions that can be taken. These include:</p>
<ul>
<li><p class="first">Terminate:</p>
<div class="line-block">
<div class="line">The <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> is asked to <strong>fail</strong>.</div>
<div class="line">Depending on your scripts, this may cause the abort <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> to be called,</div>
<div class="line">which will again be flagged as a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a>.</div>
</div>
</li>
<li><p class="first">Fob:</p>
<p>Allow the job to continue. The <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> completes and hence no longer blocks the job.</p>
<div class="line-block">
<div class="line">Great care should be taken when this action is chosen.</div>
<div class="line">If we have two jobs running, they may cause data corruption.</div>
<div class="line">Even when we have a single job, issues can arise.</div>
<div class="line">For example, if the associated command was an event <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a>, then the</div>
<div class="line"><a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-event"><em class="xref std std-term">event</em></a> would not be set. If this <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-event"><em class="xref std std-term">event</em></a> was used in a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-trigger"><em class="xref std std-term">trigger</em></a> expression,</div>
<div class="line">the expression would never evaluate to true.</div>
</div>
</li>
<li><p class="first">Delete:</p>
<div class="line-block">
<div class="line">Remove the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a> from the server. The job will continue blocking, hence</div>
<div class="line">when the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> next contacts the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflow-server"><em class="xref std std-term">ecflow_server</em></a>, the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a> will re-appear.</div>
<div class="line">If the job is killed manually, then this option can be used.</div>
</div>
</li>
<li><p class="first">Rescue:</p>
<div class="line-block">
<div class="line"><strong>Adopt</strong> the zombie and update the node tree.</div>
<div class="line">The ECF_PASS on the zombie is copied over to the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-task"><em class="xref std std-term">task</em></a>, so that the next</div>
<div class="line"><a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> will continue as normal.</div>
</div>
</li>
<li><p class="first">Kill:</p>
<div class="line-block">
<div class="line">Applies the kill command (ECF_KILL_CMD) using the process id stored on the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a>.</div>
<div class="line">If the script has correct signal trapping, this should end up calling abort.</div>
<div class="line">Note: path zombies will need to be killed manually.</div>
</div>
</li>
</ul>
<div class="admonition warning">
<p class="first admonition-title">Warning</p>
<p class="last">Of the actions above, only Rescue allows the <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-child-command"><em class="xref std std-term">child command</em></a> to change the state of the node tree.</p>
</div>
<p><strong>What to do:</strong></p>
<ol class="arabic simple">
<li>Create a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-zombie"><em class="xref std std-term">zombie</em></a> by starting a <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-task"><em class="xref std std-term">task</em></a>, and setting it to <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-complete"><em class="xref std std-term">complete</em></a> immediately via <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflowview"><em class="xref std std-term">ecflowview</em></a></li>
<li>Inspect the log file; it will show you how the zombie arose.</li>
<li>Inspect the zombie dialog in <a class="reference internal" href="/wiki/display/ECFLOW/Glossary#term-ecflowview"><em class="xref std std-term">ecflowview</em></a> (right mouse button selection on the host node)</li>
<li>Experiment with the different actions on the zombie</li>
<li>Select the host node and invoke the <strong>option...</strong> menu selection.
Select the Zombies button. This enables zombie notification via a pop-up window.</li>
</ol>
</div>
</div>