When zombies arise they can be handled manually in ecflowview (see Zombie) or via the command line interface.
Sometimes we want the job to proceed, but "ecflow_client --zombie_adopt=<task-path>" does not work. This is the case where the zombie's password matches but the process ids (ECF_RID) differ.
ecflow_client --zombie_adopt=<task-path> will not allow this, because of the potential for data corruption.
In this case the normal behaviour is to kill both processes and re-queue the task.
In the extreme, we can bypass the authentication (i.e. allow the request to be handled by the server).
This should ONLY be done when you are sure the zombie has been killed, and you don’t want to re-queue the job.
> ecflow_client --alter=add variable ECF_PASS FREE <path-to-task>
This is also available from the GUI: select the task, then RMB -> Special -> Free password.
After the job has completed, be sure to delete this variable. Otherwise if zombies arise again, there is a considerable risk of data corruption.
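The add-then-clean-up sequence can be scripted so that the clean-up step is not forgotten. The following is a minimal sketch, not part of ecFlow: `free_password_commands` and the task path are hypothetical, the first command is the one shown above, and the second assumes that `--alter=delete variable` removes the variable again.

```python
import shlex


def free_password_commands(task_path):
    """Return (free, cleanup) ecflow_client invocations as argument lists."""
    # Same command as in the text: bypass the password check for this task.
    free = ["ecflow_client", "--alter=add", "variable", "ECF_PASS", "FREE", task_path]
    # Assumption: the variable is removed again with alter's delete form.
    cleanup = ["ecflow_client", "--alter=delete", "variable", "ECF_PASS", task_path]
    return free, cleanup


# Hypothetical task path, for illustration only.
free, cleanup = free_password_commands("/s1/critical/t1")
print(shlex.join(free))     # run once you are sure the zombie has been killed
print(shlex.join(cleanup))  # run after the job has completed
```

The argument lists can be passed to `subprocess.run` as they are. Printing them first makes it easy to review what will be sent to the server.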
It is also possible to ask ecflow_server to make the same response in an automated fashion. However, this needs very careful consideration, because it could mask a serious underlying problem.
The automated response can be defined statically using the python and text interfaces, or dynamically (add/remove) via alter:
python interface (see ecflow.ZombieAttr)
text interface (see Definition file Grammar)
zombie ::= "zombie" >> `zombie_type` >> ":" >> !(`client_side_action` | `server_side_action`) >> ":" >> *`child` >> ":" >> !`zombie_life_time`
zombie_type ::= "user" | "ecf" | "path"
child ::= "init" | "event" | "meter" | "label" | "wait" | "abort" | "complete"
client_side_action ::= "fob" | "fail" | "block"
server_side_action ::= "adopt" | "delete" | "kill"
zombie_life_time ::= unsigned integer (default: user(300), ecf(3600), path(900)). The server poll timer runs every 60 seconds, so 60 seconds is the effective minimum.
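To make the grammar concrete, here is a minimal sketch of a checker for the part of the attribute that follows the `zombie` keyword (for example `ecf:kill:init,complete:`). It is an illustration only, not ecFlow's own parser: the function name `parse_zombie_attr` is hypothetical, and the comma as child separator is taken from the examples in this section rather than from the grammar itself.

```python
ZOMBIE_TYPES = {"user", "ecf", "path"}
CHILD_CMDS = {"init", "event", "meter", "label", "wait", "abort", "complete"}
ACTIONS = {"fob", "fail", "block", "adopt", "delete", "kill"}
# Default life times, as listed in the grammar above.
DEFAULT_LIFETIME = {"user": 300, "ecf": 3600, "path": 900}


def parse_zombie_attr(text):
    """Split 'type:action:children:lifetime' and validate each field."""
    fields = [f.strip() for f in text.strip().split(":")]
    if len(fields) != 4:
        raise ValueError("expected four ':' separated fields: %r" % text)
    ztype, action, children, lifetime = fields
    if ztype not in ZOMBIE_TYPES:
        raise ValueError("unknown zombie type: %r" % ztype)
    if action and action not in ACTIONS:  # the action is optional
        raise ValueError("unknown action: %r" % action)
    child_list = [c.strip() for c in children.split(",") if c.strip()]
    unknown = [c for c in child_list if c not in CHILD_CMDS]
    if unknown:
        raise ValueError("unknown child command(s): %s" % ", ".join(unknown))
    if lifetime:
        if not lifetime.isdecimal():
            raise ValueError("life time must be an unsigned integer: %r" % lifetime)
        seconds = int(lifetime)
    else:
        seconds = DEFAULT_LIFETIME[ztype]  # fall back to the per-type default
    return {"type": ztype, "action": action or None,
            "children": child_list, "lifetime": seconds}


print(parse_zombie_attr("ecf:kill:init,complete:"))
```

An empty child list is kept as an empty list, matching the convention used in the examples that an empty list applies to all child commands.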
The zombie attribute is inherited in the same manner as Variable inheritance.
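As a rough illustration of what this inheritance means, the toy model below looks up a zombie attribute by walking from a node towards the root and returning the first match. The `Node` class is hypothetical and is not ecFlow's node class; it assumes only that lookup follows the parent chain, as it does for variables.

```python
class Node:
    """Toy node: a name, an optional parent, and zombie attributes by type."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.zombies = {}  # zombie type ("ecf", "path", "user") -> attribute text

    def find_zombie(self, ztype):
        """Return the nearest attribute of this type, searching up the tree."""
        node = self
        while node is not None:
            if ztype in node.zombies:
                return node.zombies[ztype]
            node = node.parent
        return None  # nothing defined on this node or any ancestor


suite = Node("s1")
family = Node("critical", parent=suite)
task = Node("t1", parent=family)

suite.zombies["ecf"] = "ecf:fob:label:"
print(task.find_zombie("ecf"))   # found on the suite

family.zombies["ecf"] = "ecf:fail::"
print(task.find_zombie("ecf"))   # the family's attribute is nearer and is used
```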
Example: For tasks under suite “s1”, add a zombie attribute so that child label commands (i.e. ecflow_client --label) never block the job (not strictly needed, as this is the default behaviour from release 4.0.5 onwards):
python
import ecflow
from ecflow import ChildCmdType, ZombieAttr, ZombieType, ZombieUserActionType

s1 = ecflow.Suite('s1')
child_list = [ChildCmdType.label]
zombie_attr = ZombieAttr(ZombieType.ecf, child_list, ZombieUserActionType.fob, 300)
s1.add_zombie(zombie_attr)
text
suite s1
  zombie ecf:fob:label:
Example: For tasks under suite “s1”, add a zombie attribute so that a job issuing the child commands event, meter or label never blocks (not strictly needed, as this is the default behaviour from release 4.0.5 onwards):
python
import ecflow
from ecflow import ChildCmdType, ZombieAttr, ZombieType, ZombieUserActionType

s1 = ecflow.Suite('s1')
child_list = [ChildCmdType.label, ChildCmdType.event, ChildCmdType.meter]
zombie_attr = ZombieAttr(ZombieType.ecf, child_list, ZombieUserActionType.fob, 300)
s1.add_zombie(zombie_attr)
text
suite s1
  zombie ecf:fob:label,event,meter:
Example: For all tasks under family “critical”, if any zombies arise then fail the job:
python
import ecflow
from ecflow import ZombieAttr, ZombieType, ZombieUserActionType

with ecflow.Suite('s1') as s1:
    with s1.add_family("critical") as crit:
        child_list = []  # empty child list means apply to all child commands
        crit.add_zombie(ZombieAttr(ZombieType.ecf, child_list, ZombieUserActionType.fail, 300))
        crit.add_zombie(ZombieAttr(ZombieType.path, child_list, ZombieUserActionType.fail, 300))
        crit.add_zombie(ZombieAttr(ZombieType.user, child_list, ZombieUserActionType.fail, 300))
text
suite s1
  family critical
    zombie ecf:fail::
    zombie path:fail::
    zombie user:fail::
Here are some further examples of using --alter:
You can only add one zombie attribute of each type (ecf, path, user).
To delete a zombie attribute, please use one of:
Here are some more examples:
ecflow_client --alter=add zombie "ecf:kill:init,complete:" /suiteZ
ecflow_client --alter=add zombie "user:kill::" /suiteZ
ecflow_client --alter=add zombie "ecf:adopt:complete:" /suiteZ
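When such commands are generated from a script, it can help to assemble the attribute string from its parts instead of typing it by hand. The sketch below does only that: `alter_add_zombie` is a hypothetical helper, not an ecFlow function, and it mirrors the `type:action:children:lifetime` layout used in the commands above.

```python
def alter_add_zombie(node_path, ztype, action, children=(), lifetime=None):
    """Build the argument list for 'ecflow_client --alter=add zombie ...'."""
    # An empty children sequence and a missing life time both yield empty fields.
    attr = "%s:%s:%s:%s" % (
        ztype,
        action,
        ",".join(children),
        "" if lifetime is None else lifetime,
    )
    return ["ecflow_client", "--alter=add", "zombie", attr, node_path]


# Reproduces the first and second commands listed above.
print(alter_add_zombie("/suiteZ", "ecf", "kill", ["init", "complete"]))
print(alter_add_zombie("/suiteZ", "user", "kill"))
```

Because the result is an argument list, the attribute needs no shell quoting when it is passed to `subprocess.run`.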