...

  • ecflow_client --zombie_get
    Lists all the zombies currently held in the server.
  • ecflow_client --zombie_fail <task-path>
    Asks the zombie to fail. This may itself produce another zombie, because the abort child command in the job will then be called.
  • ecflow_client --zombie_fob <task-path>
    Unblocks the child command and allows the job to proceed.
    However, this only works for zombies whose password does not match.
  • ecflow_client --zombie_adopt <task-path>
    Copies the password stored on the zombie onto the task.
    This allows the job to proceed and to update the state in the server (i.e. via the init, complete and abort child commands).
    It is up to the user to ensure that the zombie has been dealt with before doing this.
  • ecflow_client --zombie_remove <task-path>
    Removes the zombie representation in the server.
    Typically this is done when we are sure the zombie has been handled.
    If it has not, the zombie will re-appear the next time it communicates with the server.
  • ecflow_client --zombie_block <task-path>
    Asks the job to block at the child command.
    This prevents the job from proceeding. (This is the default behaviour for the init, complete and abort child commands.)
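The commands above can be combined into a typical clean-up session. A minimal sketch, assuming a reachable server; the task path /suite/family/task is a hypothetical placeholder, and the run wrapper only echoes each command (a dry run) so the sequence can be inspected before pointing it at a real server:

```shell
#!/bin/sh
# Dry-run wrapper: echoes the command instead of executing it.
# Replace the echo with "$@" (or drop the wrapper) to run for real.
run() { echo "+ $*"; }

# 1. List all zombies currently held in the server.
run ecflow_client --zombie_get

# 2. Once we are sure the zombie process itself has been dealt with,
#    adopt its password so the task can proceed and update server state...
run ecflow_client --zombie_adopt /suite/family/task   # hypothetical path

# 3. ...or simply remove the zombie representation from the server.
run ecflow_client --zombie_remove /suite/family/task  # hypothetical path
```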

...

Sometimes zombies can arise for more obscure reasons. For example: the job sends an --init message to the server, but the server is busy (e.g. processing jobs). By the time the server makes the task active and sends a reply back to the client/job, the ecflow_client has already timed out, and so sends the same message again. This time the server treats the child command as a zombie, since the task is already active. Hence we get these false zombies.
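If such time-outs occur often, one knob worth checking is the child-command timeout. A sketch, assuming your ecFlow version honours the ECF_TIMEOUT environment variable (the maximum time, in seconds, a child command waits for a server reply; confirm the variable and its default in your ecFlow documentation before relying on it):

```shell
#!/bin/sh
# Assumption: ECF_TIMEOUT is honoured by the child commands in your
# ecFlow version. Raising it gives a busy server more time to reply
# before the client re-sends a message and creates a false zombie.
ECF_TIMEOUT=$((2 * 3600))   # 2 hours, in seconds (example value)
export ECF_TIMEOUT
echo "child commands will wait up to ${ECF_TIMEOUT}s for the server"
```

This would typically be exported in the job's environment (e.g. via an include header) rather than set per shell.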

These scenarios are very rare, but tend to happen in the following situations:

  • High disk latencies (i.e. check-pointing or job processing takes too long; this typically happens when using virtual machines with non-local data).
  • Very large scripts (i.e. in the megabytes); these can inflate the server memory and cause job processing to take longer.
  • Extremely large definitions requested by many users via the GUI. (The download size can be reduced by requesting only the suite you are interested in.)
  • A very busy machine and/or not enough memory available (i.e. the server is competing for resources).
  • Server is overloaded. (This can be visualised if you have gnuplot installed and available on $PATH, i.e. invoke ecflow_client --server_load=<path to the log file>.)
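The last point can be checked from the command line. A dry-run sketch using the same echo wrapper as before (requires gnuplot on $PATH when run for real; /path/to/ecf.log is a hypothetical placeholder for your server's log file):

```shell
#!/bin/sh
# Dry-run wrapper: echoes the command instead of executing it.
run() { echo "+ $*"; }

# Plot the server load from its log file (needs gnuplot on $PATH).
# /path/to/ecf.log is a placeholder; substitute your server log file.
run ecflow_client --server_load=/path/to/ecf.log
```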

...