
Problem

With pure Python tasks, the server does not honour ECF_TRIES. Why?

Solution

When ecFlow runs the default ECF_JOB_CMD, the job is spawned as a separate child process. This child process is then monitored for abnormal termination.

When the child process terminates abnormally, ecFlow calls abort and sets a special flag which prevents ECF_TRIES from working.
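
For reference, the server-generated default ECF_JOB_CMD is of the form shown below; it runs the job in the foreground of the spawned process, so the server sees that process terminate:

%ECF_JOB% 1> %ECF_JOBOUT% 2>&1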

It should be noted that the default error trapping (bash/korn shells) calls exit(0), hence the default ECF_JOB_CMD correctly handles ECF_TRIES.

head.h
....
# Define an error handler
ERROR() {
  echo "ERROR called"
  set +e                                # Clear -e flag, so we don't fail
  wait                                  # wait for background processes to stop
  trap 0 1 2 3 4 5 6 7 8 10 12 13 15    # reset traps for these signals, so the error handler is not called recursively
  ecflow_client --abort                 # Notify ecFlow that something went wrong
  trap 0                                # Remove the trap
  exit 0                                # End the script. Notice that we call exit(0)
}
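
For completeness, the handler is installed in head.h with trap lines like the following (a sketch based on the standard ecFlow head.h template; the exact signal list may differ in your installation):

set -e                # stop the shell on the first error
trap ERROR 0          # call ERROR() when the script exits
trap '{ echo "Killed by a signal"; ERROR ; }' 1 2 3 4 5 6 7 8 10 12 13 15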

In our operations we have a specialised script (trimurti) that detaches from the spawned process, either via nohup or via a special program called ecflow_standalone.

This bypasses the spawned-process termination issue. In addition, the korn shell error trap calls wait, so that background processes have stopped before the abort is reported.
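
A minimal sketch of such a detaching wrapper is shown below; the script name and argument convention are illustrative, not the actual trimurti interface:

#!/bin/ksh
# submit.ksh <job-file> <job-output>   -- hypothetical argument convention
job=$1
jobout=$2
# Detach the job with nohup: the wrapper exits immediately, so the server
# never monitors (or sees the abnormal termination of) the job process itself.
nohup "$job" > "$jobout" 2>&1 &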

Hence, when using pure Python jobs with the default ECF_JOB_CMD, the process terminates abnormally after the Python program errors/aborts.

This process termination is captured by the ecFlow server, causing an abort, whether the node state is aborted or submitted (submitted due to ECF_TRIES).

This abnormal job termination prevents the aborted job from rerunning. If a second job does start running, the task in the server is already aborted, leading to zombies.


To fix the problem:

Use a dedicated script for job submission, and invoke your Python scripts from a bash/korn shell wrapper, using korn shell trapping for robust error handling, as in the ERROR handler above.
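
For example, a task wrapper (.ecf script) running a Python program under the shell trapping might look like the sketch below (my_script.py is illustrative; it assumes head.h sets 'set -e' and the ERROR trap shown above):

%include <head.h>
python my_script.py    # a non-zero exit status triggers the ERROR trap, which calls ecflow_client --abort and exit(0)
%include <tail.h>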

Alternatively, if you want to stick with pure Python tasks, you need to detach from the spawned process. Modify your ECF_JOB_CMD:

edit ECF_EXTN .py # so the server can correctly locate your script
edit ECF_JOB_CMD "nohup python %ECF_JOB% > %ECF_JOBOUT% 2>&1 &"
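
In a text suite definition these variables could be set as in the sketch below (suite and task names are illustrative):

suite pure_python
  edit ECF_EXTN    .py
  edit ECF_JOB_CMD "nohup python %ECF_JOB% > %ECF_JOBOUT% 2>&1 &"
  task t1
endsuite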

Alternatively, always make sure your Python jobs exit cleanly after calling ecflow abort, by calling exit(0):

import sys

# Both methods belong to a class that holds an ecflow.Client instance as self.ci
def signal_handler(self, signum, frame):
    print('Aborting: Signal handler called with signal', signum)
    self.ci.child_abort("Signal handler called with signal " + str(signum))
    sys.exit(0)                           # exit cleanly, with status 0

def __exit__(self, ex_type, value, tb):
    print("Client:__exit__: ex_type:" + str(ex_type) + " value:" + str(value) + "\n" + str(tb))
    if ex_type is not None:
        self.ci.child_abort("Aborted with exception type " + str(ex_type) + ":" + str(value))
        sys.exit(0)                       # exit cleanly, so the process does not terminate abnormally
    self.ci.child_complete()
    return False
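
These methods belong to a context-manager class wrapping ecflow.Client, as in the ecFlow Python tutorial. A sketch of how the job body might use it (Client and do_work() are assumed names, not part of the ecflow API):

with Client() as ci:   # __enter__ would call child_init()
    do_work()          # hypothetical job body; an exception here reaches __exit__,
                       # which calls child_abort() and then exits cleanly with exit(0)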


