...
When ecFlow invokes the default ECF_JOB_CMD, the job is spawned as a separate child process, which is then monitored for abnormal termination. If the child terminates abnormally, ecFlow aborts the task and sets a special flag that prevents ECF_TRIES from taking effect.
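The distinction between a clean exit and abnormal termination can be illustrated with a minimal Python sketch. This is not ecFlow's actual implementation; the function name `run_and_monitor` is hypothetical. It relies only on the POSIX convention that `Popen.wait()` returns a negative value when the child died on a signal:

```python
import subprocess
import sys

def run_and_monitor(cmd):
    """Spawn cmd as a separate child process and classify how it ended.

    On POSIX, Popen.wait() returns a negative value when the child was
    terminated by a signal; this corresponds to the 'abnormal termination'
    case that makes ecFlow abort the task and disable ECF_TRIES.
    """
    proc = subprocess.Popen(cmd)
    rc = proc.wait()
    if rc == 0:
        return 'clean'   # normal exit(0): ECF_TRIES keeps working
    if rc < 0:
        return 'killed'  # terminated by signal -rc: abnormal termination
    return 'error'       # non-zero exit code
```

For example, `run_and_monitor([sys.executable, '-c', 'raise SystemExit(0)'])` returns `'clean'`, while a child that receives SIGKILL is classified as `'killed'`.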
...
It should be noted that the default error trapping (bash/Korn shells) calls exit(0), hence the default ECF_JOB_CMD correctly handles ECF_TRIES.
...
In our operations we use a specialised submission script (trimurti) that detaches from the spawned process, e.g. via nohup or a dedicated program called ecflow_standalone.
This bypasses the spawned-process termination issue. In addition, the Korn shell error trap uses wait, i.e. it waits for the background process to stop.
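trimurti and ecflow_standalone are site-specific tools, but the underlying idea, detaching the job from the submitting process, is the classic Unix double fork. A minimal Python sketch of that idea (the function name `detach_and_run` is hypothetical, and a real daemonisation would also redirect the standard streams):

```python
import os
import sys

def detach_and_run(func):
    """Run func fully detached from the caller via a double fork.

    The submitter reaps the intermediate child, which exits cleanly,
    so the monitored process terminates normally; the grandchild is
    re-parented to init and keeps running. This is the effect that
    nohup/ecflow_standalone achieve for job submission.
    """
    pid = os.fork()
    if pid > 0:
        os.waitpid(pid, 0)   # reap the intermediate child (exits at once)
        return               # submitter continues normally
    os.setsid()              # new session: detach from controlling terminal
    if os.fork() > 0:
        os._exit(0)          # intermediate child exits cleanly
    try:
        func()               # grandchild: the actual job
    finally:
        os._exit(0)          # never return into the caller's code
```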
...
This abnormal job termination prevents the aborted job from rerunning. When the second process starts running, the task in the server is already aborted, leading to zombies.
...
Use a dedicated script for job submission, and use a bash/Korn shell wrapper to invoke your Python scripts, relying on Korn shell trapping for robust error handling, as above.
Alternatively, if you want to stick with pure Python tasks, you need to detach from the spawned process. Modify your ECF_JOB_CMD:
```
edit ECF_EXTN .py   # to correctly locate your script
edit ECF_JOB_CMD "nohup python %ECF_JOB% > %ECF_JOBOUT% 2>&1 &"
```
Alternatively, always make sure your Python jobs exit cleanly after calling ecFlow abort, by calling sys.exit(0):
```python
def signal_handler(self, signum, frame):
    print('Aborting: signal handler called with signal %d' % signum)
    self.ci.child_abort('Signal handler called with signal ' + str(signum))
    sys.exit(0)
```
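A complete pure-Python task would register such a handler for the relevant signals before doing its work. A self-contained sketch follows; `StubClient` is a stand-in introduced here so the example runs without a server, whereas in a real task `self.ci` would be an `ecflow.Client` instance with the usual init/complete child commands:

```python
import signal
import sys

class StubClient:
    """Stand-in for ecflow.Client so this sketch runs without a server."""
    def child_abort(self, reason):
        print('child_abort: ' + reason)

class Task:
    def __init__(self):
        self.ci = StubClient()
        # Trap the signals a batch system or the server may deliver:
        for sig in (signal.SIGINT, signal.SIGHUP, signal.SIGTERM,
                    signal.SIGUSR1, signal.SIGUSR2):
            signal.signal(sig, self.signal_handler)

    def signal_handler(self, signum, frame):
        print('Aborting: signal handler called with signal %d' % signum)
        self.ci.child_abort('Signal handler called with signal ' + str(signum))
        sys.exit(0)   # exit cleanly so ECF_TRIES keeps working
```

Because the handler ends with sys.exit(0), the job process terminates normally even when aborted, so the monitored child never looks abnormally terminated to ecFlow.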
...