How are zombies created?
- The node tree is deleted, replaced or reloaded whilst jobs are running
- A task is rerun whilst in a submitted or active state
- A job is forced to a new state, e.g. complete
Rarer causes might be:
- ecf script errors, where there are multiple calls to the init and complete child commands
- The child commands in the ecf script are placed in the background. In this case the order in which the child commands contact the server may be indeterminate.
- LoadLeveler submitting a job twice
- Server crash, where the recovered checkpoint file is out of date
- Machine crash
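The background case in the list above can be illustrated with a small simulation. This is only a sketch: `send_child_command` is a hypothetical stand-in for a child command contacting the server, and the random delays model scheduling jitter, not anything ecFlow itself does.

```python
import random
import threading
import time


def run_in_background(commands, seed=None):
    """Start every command at once and return the order in which they arrive."""
    rng = random.Random(seed)
    arrivals = []  # order in which the simulated server sees the commands
    lock = threading.Lock()

    def send_child_command(name, delay):
        # Hypothetical stand-in for a backgrounded child command.
        time.sleep(delay)
        with lock:
            arrivals.append(name)

    threads = [
        threading.Thread(target=send_child_command, args=(name, rng.uniform(0, 0.05)))
        for name in commands
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrivals


if __name__ == "__main__":
    script_order = ["init", "meter", "event", "complete"]
    print("order in script:", script_order)
    print("order at server:", run_in_background(script_order))
```

Running this several times will typically print different arrival orders, which is why child commands are best run in the foreground, one after another.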
How can zombies be handled?
The default behaviour for the init, complete, abort and wait child commands is to block the job, and for event, label and meter to continue (fob). (This applies from version 4.0.4; previously all zombie child commands blocked.)
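The default policy can be summarised as a lookup table. The sketch below is purely illustrative: the names `DEFAULT_ZOMBIE_ACTION` and `default_zombie_action` are assumptions made for this example and are not part of the ecFlow API.

```python
BLOCK = "block"
FOB = "fob"

# Defaults as described in the text (version 4.0.4 onwards).
# Illustrative model only, not an ecFlow data structure.
DEFAULT_ZOMBIE_ACTION = {
    "init": BLOCK,
    "complete": BLOCK,
    "abort": BLOCK,
    "wait": BLOCK,
    "event": FOB,
    "label": FOB,
    "meter": FOB,
}


def default_zombie_action(child_command):
    """Return the default action applied to a zombie's child command."""
    try:
        return DEFAULT_ZOMBIE_ACTION[child_command]
    except KeyError:
        raise ValueError(f"unknown child command: {child_command!r}") from None
```

For example, `default_zombie_action("meter")` gives `"fob"`, whereas `default_zombie_action("complete")` gives `"block"`.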
ecflow_ui provides a tab which lists all the zombies and the actions that can be taken.
The zombies tab is shown in the info panel when the server node (i.e. the topmost node) is selected.
The actions include:
Allow the job to continue. The child command completes and hence no longer blocks the job. Great care should be taken when this action is chosen. If we have two jobs running, they may cause data corruption. Even when we have a single job, issues can arise, e.g. if the associated command was an event child command, then the
Kill: Applies the kill command (ECF_KILL_CMD) using the process id stored on the zombie. If the script has correct signal trapping, this should end up calling abort. Note: path zombies will need to be killed manually.
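What "correct signal trapping" means can be sketched for a job written in Python. This is a minimal illustration under stated assumptions: `notify_abort` is a hypothetical placeholder for however the job issues the abort child command, and the choice of signals to trap is an assumption, since it depends on what the configured kill command actually sends.

```python
import signal
import sys


def notify_abort(reason):
    # Hypothetical placeholder: a real job would issue the abort child
    # command here so that the server learns the job has ended.
    print(f"abort: {reason}", file=sys.stderr)


def on_signal(signum, frame):
    """Report the abort, then stop the job with a non-zero exit status."""
    notify_abort(f"trapped signal {signal.Signals(signum).name}")
    sys.exit(1)


def install_traps(signals=(signal.SIGTERM, signal.SIGINT)):
    """Register on_signal for each signal the kill command might send."""
    for sig in signals:
        signal.signal(sig, on_signal)
```

A job would call `install_traps()` once at start-up, before doing any real work, so that a kill at any later point still results in an abort being reported.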