When zombies arise, they can be handled manually with ecflowview (see Zombie) or via the command line interface.
It is also possible to ask ecflow_server to make the same response automatically. However, consider this very carefully before doing so, because an automated response could mask a serious underlying problem.
The automated response can be defined with:
python interface (see ecflow.ZombieAttr)
text interface (see Definition file Grammar)
zombie ::= "zombie" >> `zombie_type` >> ":" >> !(`client_side_action` | `server_side_action`) >> ":" >> *`child` >> ":" >> !`zombie_life_time`
zombie_type ::= "user" | "ecf" | "path"
child ::= "init" | "event" | "meter" | "label" | "wait" | "abort" | "complete"
client_side_action ::= "fob" | "fail" | "block"
server_side_action ::= "adopt" | "delete"
zombie_life_time ::= unsigned integer ( default: user(300), ecf(3600), path(900) )
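To make the grammar concrete, here is a minimal sketch of a validator for the text form. It is not ecFlow's own parser: `ZombieSpec` and `parse_zombie` are hypothetical names, the comma-separated child list is inferred from the examples in this document, and accepting `kill` as an action is an assumption based on the `--alter` examples rather than on the grammar itself.

```python
from dataclasses import dataclass
from typing import List, Optional

# Values copied from the grammar above.
ZOMBIE_TYPES = ("user", "ecf", "path")
CHILD_CMDS = ("init", "event", "meter", "label", "wait", "abort", "complete")
# Assumption: "kill" is not in the grammar above, but the --alter examples
# in this document use it, so this sketch accepts it as well.
ACTIONS = ("fob", "fail", "block", "adopt", "delete", "kill")
DEFAULT_LIFETIME = {"user": 300, "ecf": 3600, "path": 900}


@dataclass
class ZombieSpec:
    zombie_type: str
    action: Optional[str]  # None when the action field is left empty
    children: List[str]    # empty list means "all child commands"
    lifetime: int          # explicit value, or the default for the type


def parse_zombie(text: str) -> ZombieSpec:
    """Parse 'type:action:child,child:lifetime' into a ZombieSpec."""
    fields = [f.strip() for f in text.split(":")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 ':'-separated fields in {text!r}")
    ztype, action, children, lifetime = fields

    if ztype not in ZOMBIE_TYPES:
        raise ValueError(f"unknown zombie type {ztype!r}")
    if action and action not in ACTIONS:
        raise ValueError(f"unknown action {action!r}")

    child_list = [c.strip() for c in children.split(",") if c.strip()]
    for child in child_list:
        if child not in CHILD_CMDS:
            raise ValueError(f"unknown child command {child!r}")

    if lifetime:
        if not lifetime.isdigit():
            raise ValueError(f"lifetime must be an unsigned integer: {lifetime!r}")
        life = int(lifetime)
    else:
        life = DEFAULT_LIFETIME[ztype]

    return ZombieSpec(ztype, action or None, child_list, life)
```

For instance, `parse_zombie("ecf:fob:label,event,meter:")` yields type `ecf`, action `fob`, three child commands, and the default `ecf` lifetime of 3600.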
The zombie attribute is inherited in the same manner as Variable inheritance.
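The lookup this implies can be sketched as a walk up the node tree, where the nearest node that defines an attribute of the requested type wins. This is an illustration only, assuming the usual nearest-ancestor rule for variables; `Node`, `add_zombie` and `find_zombie` here are hypothetical stand-ins, not the ecFlow classes.

```python
from typing import Dict, Optional


class Node:
    """Hypothetical stand-in for a suite, family or task."""

    def __init__(self, name: str, parent: Optional["Node"] = None):
        self.name = name
        self.parent = parent
        # At most one zombie attribute per type (ecf, path, user) on a node.
        self.zombies: Dict[str, str] = {}

    def add_zombie(self, zombie_type: str, attr_text: str) -> None:
        self.zombies[zombie_type] = attr_text

    def find_zombie(self, zombie_type: str) -> Optional[str]:
        """Return the nearest attribute of this type, searching upwards."""
        node: Optional[Node] = self
        while node is not None:
            if zombie_type in node.zombies:
                return node.zombies[zombie_type]
            node = node.parent
        return None


# Suite-level attribute, overridden inside the "critical" family.
s1 = Node("s1")
s1.add_zombie("ecf", "ecf:fob:label:")
critical = Node("critical", parent=s1)
critical.add_zombie("ecf", "ecf:fail::")
t1 = Node("t1", parent=critical)  # task inside the family
t2 = Node("t2", parent=s1)        # task directly under the suite

print(t1.find_zombie("ecf"))   # ecf:fail::
print(t2.find_zombie("ecf"))   # ecf:fob:label:
print(t2.find_zombie("user"))  # None
```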
Example: For tasks under suite “s1”, add a zombie attribute such that child label commands (i.e. ecflow_client --label) never block the job (not strictly needed, as this is the default behaviour from release 4.0.5 onwards):
python
import ecflow

s1 = ecflow.Suite('s1')
child_list = [ecflow.ChildCmdType.label]  # apply only to the label child command
zombie_attr = ecflow.ZombieAttr(ecflow.ZombieType.ecf, child_list, ecflow.ZombieUserActionType.fob, 300)
s1.add_zombie(zombie_attr)
text
suite s1
   zombie ecf:fob:label:
Example: For tasks under suite “s1”, add a zombie attribute such that a job issuing the child commands event, meter or label never blocks (not strictly needed, as this is the default behaviour from release 4.0.5 onwards):
python
import ecflow

s1 = ecflow.Suite('s1')
child_list = [ecflow.ChildCmdType.label, ecflow.ChildCmdType.event, ecflow.ChildCmdType.meter]
zombie_attr = ecflow.ZombieAttr(ecflow.ZombieType.ecf, child_list, ecflow.ZombieUserActionType.fob, 300)
s1.add_zombie(zombie_attr)
text
suite s1
   zombie ecf:fob:label,event,meter:
Example: For all tasks under family “critical”, if any zombies arise then fail the job:
python
import ecflow

with ecflow.Suite('s1') as s1:
    with s1.add_family("critical") as crit:
        child_list = []  # empty child list means apply to all child commands
        # One attribute per zombie type, each failing the job
        crit.add_zombie(ecflow.ZombieAttr(ecflow.ZombieType.ecf, child_list, ecflow.ZombieUserActionType.fail, 300))
        crit.add_zombie(ecflow.ZombieAttr(ecflow.ZombieType.path, child_list, ecflow.ZombieUserActionType.fail, 300))
        crit.add_zombie(ecflow.ZombieAttr(ecflow.ZombieType.user, child_list, ecflow.ZombieUserActionType.fail, 300))
text
suite s1
   family critical
      zombie ecf:fail::
      zombie path:fail::
      zombie user:fail::
Here are some further examples of using --alter:
You can only add one zombie attribute of each type (ecf, path, user).
To delete a zombie attribute, please use one of:
Here are some more examples:
ecflow_client --alter add zombie "ecf:kill:init,complete:" /suiteZ
ecflow_client --alter add zombie "user:kill::" /suiteZ
ecflow_client --alter add zombie "ecf:adopt:complete:" /suiteZ
Sometimes zombies can arise for more obscure reasons. For example, the job sends an --init message to the server while the server is busy (e.g. processing jobs). By the time the server finally makes the task active and sends a reply back to the client/job, the ecflow_client has timed out. This causes the ecflow_client to send the same message again. This time, however, the server treats the command as a zombie, since the task is already active.
These scenarios are very rare, but tend to occur in the following situations:
To diagnose these cases, look at the log file. Typically you will see two or more --init or --complete commands for the same task, where the second is then treated as a zombie.
To get around these issues, you can add the variable ECF_NONSTRICT_ZOMBIES, which reduces the number of these false zombies.
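The sequence above can be mimicked with a toy model. This is a deliberately simplified sketch, not how ecflow_server is implemented: `ToyServer`, `handle_init` and `client_init` are hypothetical names, and the timeout is reduced to a flag saying that the first reply arrived too late.

```python
from typing import Dict, List


class ToyServer:
    """Tracks one state per task path: 'submitted' or 'active'."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}

    def submit(self, path: str) -> None:
        self.state[path] = "submitted"

    def handle_init(self, path: str) -> str:
        # The first init moves the task from submitted to active.
        if self.state.get(path) == "submitted":
            self.state[path] = "active"
            return "ok"
        # Any further init for an already active task looks like a zombie.
        return "zombie"


def client_init(server: ToyServer, path: str, reply_too_late: bool) -> List[str]:
    """Send init; resend once if the first reply missed the client timeout."""
    outcomes = [server.handle_init(path)]
    if reply_too_late:
        # The client never saw the first reply, so it repeats the same message.
        outcomes.append(server.handle_init(path))
    return outcomes


server = ToyServer()
server.submit("/s1/t1")
print(client_init(server, "/s1/t1", reply_too_late=True))   # ['ok', 'zombie']

server.submit("/s1/t2")
print(client_init(server, "/s1/t2", reply_too_late=False))  # ['ok']
```

The server-side outcome of the first message is "ok" in both runs; the zombie only appears because the client repeats a message the server has already acted on.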
ecflow_client --alter add variable ECF_NONSTRICT_ZOMBIES 1 / # adds the variable at the root/server level, and hence affects all suites on the server
ecflow_client --alter add variable ECF_NONSTRICT_ZOMBIES 1 /suiteX # adds the variable at the suite level, and hence only affects this suite