
How do I check a suite definition without loading it into the ecflow_server?

  •   If the suite definition is in a text file::

      > ecflow_client --load=test.def check_only

  •   Load test.def into Python and check it::

      import ecflow
      defs = ecflow.Defs("test.def")
      theCheckValue = defs.check()
      # a non-empty return value describes the problems that were found
      assert len(theCheckValue) == 0, "Error in expression, limits, etc.: " + theCheckValue

  •   If the suite definition is built using the Python API::

      import ecflow
      defs = ecflow.Defs()
      suite = defs.add_suite("s1")
      suite.add_task("t1").add_trigger("t2 == active)")   # deliberately mis-matched bracket
      theCheckValue = defs.check()
      print("Message: '" + theCheckValue + "'")
      assert len(theCheckValue) != 0, "Expected Error: mis-matched brackets in expression."

I have loaded my suite definition but I cannot see it on ecflow_ui?

  If you can see the suite using the suites or get command in the CLI::

      ecflow_client --suites   # list suites, this is faster than using --get
      ecflow_client --get

  but not in ecflow_ui, you should right-click on the “server” node, select “suite filter” and select your suite.

My task fails when I submit it and I do not see any job output?

   It is generally a good idea to look at the ecflow_server log, either directly or in ecflow_ui by right-clicking on the server and selecting “History”.

   Also check whether a job file was created, and that the server is not halted.

   If you click on the script button and get a “file read error” message, then ecflow_server cannot find your script.

   Check the file permissions and the location given in the variable ECF_SCRIPT (shown under variables in ecflow_ui). This is normally based on the ECF_FILES variable.

   If you right-click on the node, click the edit button and get the error message “send failed for the node”, you may not be able to access your include files.

   Check the location of the include files and how they are defined (see ecFlow Pre-processor symbols).

My task stays in a submitted state when I submit a job?

   This could be caused by a number of reasons, please check:

  •   The job may be unable to start because the queuing system cannot schedule it at the time, or because the job is failing before the child command is sent:

              > ecflow_client --init

  •   Check that the ecflow_server is not halted.
  •   Test running the job from the command line, or check its status in the queuing system used, e.g. llq for LoadLeveler, qstat for PBS.
      Run the submission command from the command line. This is based on how the ECF_JOB_CMD variable is set. The script could be failing before the ecflow_client --init command is sent.
  •   If you are using ECF_OUT to define the output directory, make sure that the directory exists, including the directories corresponding to suite/family nodes.

My task stays in an active state when I submit a job?

   This happens when the job is unable to send an ecflow_client --complete child command to the ecflow_server.

   It can be caused by an error that is not trapped, such as the job being killed with kill -9 or the host system crashing.

   Check whether the job is running and, if not, rerun the task or synchronise its status in ecFlow as appropriate.

How can I check the status of an ecflow_server?

 Invoke::

      > ecflow_client --stats   

   This will display some standard information about the ecflow_server, including the version number, node information, status, security information, usage, load, setup and up time.

How can I check the load on the ecflow_server?

   Invoke::

      > ecflow_client --server_load   

   This relies on gnuplot. If you know the location of the server log file and it is accessible, pass it on the command line to avoid overloading the server:

      > ecflow_client --server_load=/path/to/log/file

   This command will produce a <host>.<port>.png file, which can then be viewed.

Jobs run locally remain submitted when the variable ECF_OUT is used?

The ECF_OUT variable should be used in situations where the job output is not located in the same directory as the job files.

This is necessary for remote job submission when the local and remote hosts do not share a common file system.

When ECF_OUT is used, the user is responsible for creating the directory structure (all directories) in *advance*, so that the output files can be created. If a directory is missing, the job cannot write its output and the task remains submitted.

It is enough to copy the directory part of the variable ECF_JOBOUT and use it with the command mkdir -p on the host where the output is written (the remote host for remote jobs).

The ecFlow server is "target agnostic": it does not know on its own how to log in to the remote machine, so the suite designer remains responsible for directory creation.
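The directory creation described above can also be scripted. The following is a minimal sketch, assuming the ECF_JOBOUT value has been copied from the node's variables; the helper name `ensure_jobout_dir` and the example path are hypothetical, and the code has to run on the host where the output is written (it does not log in to a remote machine).

```python
import os
import tempfile


def ensure_jobout_dir(ecf_jobout):
    """Create every directory leading up to the job output file.

    Equivalent to running `mkdir -p` on the directory part of ECF_JOBOUT.
    Returns the directory that now exists.
    """
    out_dir = os.path.dirname(ecf_jobout)
    os.makedirs(out_dir, exist_ok=True)   # no error if it already exists
    return out_dir


if __name__ == "__main__":
    # Hypothetical ECF_OUT root and node path, for illustration only.
    ecf_out = tempfile.mkdtemp()
    jobout = os.path.join(ecf_out, "suite", "family", "task.1")
    print(ensure_jobout_dir(jobout))
```

Calling the helper twice with the same path is harmless, which makes it safe to run before every submission.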

A message 'locale::facet::_S_create_c_locale name not valid' is displayed when the ecflow_server or ecflow_client commands are run. How can I prevent it?

To see the list of locales on your system use

    > locale -a

Then set the LANG environment variable, e.g.

    LANG=en_GB.UTF-8 (ksh: export LANG=en_GB.UTF-8)

How can I logically 'or' time dependencies of different types?

It is important to understand how time dependencies work first.

When we have multiple time dependencies of the same type, they are 'or'ed, i.e.

       time 10:00
       time 12:00             # Task will run when time is 10:00 OR 12:00

Likewise if we have:

       date 1.12.2012
       date 2.12.2012         # Task is free to run only on the first or second of December.

When *different* types of time dependencies are added, the task is only free to run when both are satisfied:

       time 10:00
       date 1.12.2015         # Task is only free to run at 10 am on the first of December

This effectively means that time dependencies of different types are logically 'and'ed.

  

Now suppose we want to run the task at 10:00 *OR* when the date is 1.12.2015.

This can be done by adding dummy tasks.

       task dummy_time_trigger
          edit ECF_DUMMY_TASK ""   # Tell server & checking not to expect .ecf file
          time 10:00
       task dummy_date_trigger
          edit ECF_DUMMY_TASK ""   # Tell server & checking not to expect .ecf file
          date 1.12.2015
       task time_or_date
          trigger dummy_time_trigger == complete or dummy_date_trigger == complete

By combining dummy tasks with a trigger, we can achieve the effect of 'OR' for time dependencies of different types.

This technique will work for any complex dependency and has the added advantage of allowing us to manually free the dependencies via the GUI.
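The combination rules above can be modelled in a few lines. This is a toy illustration in plain Python, not ecFlow code: the function name `is_free` and the representation of each dependency as a boolean are assumptions made for the example.

```python
def is_free(dependencies):
    """Return True when a node's time dependencies allow it to run.

    `dependencies` maps a dependency type ("time", "date", ...) to a list of
    booleans, one per attribute of that type, saying whether it is satisfied.
    Attributes of the same type are OR-ed; different types are AND-ed.
    """
    return all(any(satisfied) for satisfied in dependencies.values())


# time 10:00 / time 12:00, evaluated at 12:00: one 'time' attribute matches.
print(is_free({"time": [False, True]}))

# time 10:00 + date 1.12.2015, at 10:00 on another day: the date blocks the task.
print(is_free({"time": [True], "date": [False]}))

# The dummy-task technique: each dummy holds one type, and the trigger ORs them.
dummy_time = is_free({"time": [True]})
dummy_date = is_free({"date": [False]})
print(dummy_time or dummy_date)
```

The last case shows why the dummy tasks help: the 'or' is moved out of the time attributes, where it is not available across types, and into the trigger expression.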

In the Python API, what's the difference between sync_local(), get_server_defs() and get_defs()?

First it is important to understand that the 'ecflow.Client' class **stores** the suite definition returned from the server.

The suite definition can be retrieved from the server using 'sync_local()' or 'get_server_defs()'.

While the 'ecflow.Client' object exists, the suite definition is retained.

 ecflow.Client.get_defs()

Returns the defs stored on the client. Hence either sync_local() or get_server_defs() should be called first, otherwise a null (None) object is returned.

 

  ecflow.Client.sync_local()

  #. The very *first* call always retrieves the *full* suite definition.

  #. The second and subsequent calls *may* return a delta/incremental update *or*, less typically, the full suite definition.

     If there are only event, meter, label and state changes in the server, then calling sync_local() will retrieve these *small* incremental changes and synchronise them with the suite definition held in the ecflow.Client() object.

     Typically these changes are a very small fraction of the full suite. This is the normal scenario.

     The incremental sync reduces the network bandwidth and hence improves speed.

     If however the user makes large scale changes, e.g. by deleting or adding nodes, then sync_local() will return the full suite definition.

     Hence if your Python code needs to continually poll the server, please use the same ecflow.Client() object and *always* use sync_local().

   import ecflow

   try:
         ci = ecflow.Client()                # use default host(ECF_NODE) & port(ECF_PORT)
         ci.sync_local()                     # Very first call gets the full Defs
         client_defs = ci.get_defs()         # End user access to the returned Defs
         # ... after a period of time
         ci.sync_local()                     # Subsequent calls retrieve incremental or full suite, but typically incremental
         if ci.in_sync():                    # returns true if server changed and changes applied to client
            print('Client is now in sync with server')
         client_defs = ci.get_defs()         # End user access to the returned Defs
   except RuntimeError as e:
         print(str(e))

  ecflow.Client.get_server_defs()

   This *always* returns the full suite definition. For a single use of the suite definition in Python code there is no difference between sync_local() and get_server_defs().

 *HOWEVER* if you wish to monitor the server in Python then you *MUST* use sync_local(), as it will be considerably faster and will avoid overloading the server.

 import ecflow

 try:
    ci = ecflow.Client()    # use default host(ECF_NODE) & port(ECF_PORT)
    ci.get_server_defs()    # retrieve definition from the server and store on 'ci'
    print(ci.get_defs())    # print out definition stored in the client
 except RuntimeError as e:
    print(str(e))
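For continuous monitoring, the advice above amounts to creating one client and calling sync_local() in a loop. The following is a minimal sketch, assuming a client object that offers the sync_local(), in_sync() and get_defs() methods described above; the function `poll_server`, its parameters and the callback are hypothetical names for this example. In real use `ci` would be an `ecflow.Client()` instance.

```python
import time


def poll_server(ci, on_change, interval_seconds=60, max_polls=None):
    """Poll the server with ONE client object, always using sync_local().

    ci               -- client offering sync_local(), in_sync() and get_defs()
    on_change        -- callback receiving the defs after the first sync and
                        whenever in_sync() reports that changes were applied
    interval_seconds -- pause between polls
    max_polls        -- stop after this many polls (None = run forever)
    """
    ci.sync_local()                 # the first call retrieves the full defs
    on_change(ci.get_defs())
    polls = 1
    while max_polls is None or polls < max_polls:
        time.sleep(interval_seconds)
        ci.sync_local()             # usually a small incremental update
        if ci.in_sync():            # server changed and changes were applied
            on_change(ci.get_defs())
        polls += 1
    return polls
```

With ecFlow installed this could be called as `poll_server(ecflow.Client(), print)`, wrapped in a try/except for RuntimeError as in the examples above.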

It might be useful in some circumstances to introduce a time delay when a job is submitted again thanks to ECF_TRIES.

We may not want such a feature as part of the ecFlow server. In most cases, it is enough to update the submit script to add a "sleep" when the second job (job2) is being submitted. Keeping this in the script makes it easy to change the time interval, to maintain the list of remote hosts which cannot stand an immediate resubmit, and to enable the feature for some (or all) suites, with no need to update, compile, install, stop and restart the ecFlow server. At ECMWF, the submit variable is often defined as "edit ECF_JOB_CMD '
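The "sleep in the submit script" idea can be sketched as follows. This is an illustration only, assuming the submit script is given the try number (ecFlow exposes it as ECF_TRYNO) and the target host; the function names, the host list and the 60-second interval are hypothetical values chosen for the example.

```python
import time

# Hypothetical settings, kept in the submit script rather than in the server.
SLOW_HOSTS = {"hpc1", "hpc2"}     # hosts which cannot stand immediate resubmit
RESUBMIT_DELAY_SECONDS = 60       # pause before the second and later tries


def resubmit_delay(try_no, host, slow_hosts=SLOW_HOSTS,
                   delay=RESUBMIT_DELAY_SECONDS):
    """Return how long to wait before submitting try number `try_no`."""
    if try_no > 1 and host in slow_hosts:
        return delay
    return 0


def submit(try_no, host, run_submit):
    """Sleep if needed, then call `run_submit` (the real submission step)."""
    time.sleep(resubmit_delay(try_no, host))
    return run_submit()
```

Because the interval and the host list live in the script, they can be changed without touching the server.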

 

After upgrading my system (new Linux flavour, new ecFlow version), ecflowview pop-up windows are ignored when virtual displays are used.

With KDE, open the window "System Settings - Window Behaviour" and check that "Focus Window Prevention" is set as expected (None).

The ecflowview title might also be changed so that the "Window Rules - Appearance&Fixes - Accept Focus" rule is activated. "Window Rules" also allows the ecflowview window size and position to be set.

How can I monitor a suite as a file system?

We can do this using the ecFlow Python API and the python-fuse library; see "How can I monitor my suite independent of the GUI". The status is slightly delayed because it is accessed through a Python client.

 

How do I advance a REPEAT even when some of the tasks are aborted?

There are several ways to do this.

1/ If the tasks are not very robust and are known to fail regularly, we can use custom ERROR trapping which, instead of aborting the task, logs the failure and sets the task to complete. The logging could be anything.

# Define an error handler
ERROR() {
    echo "ERROR called"
    set +e                     # Clear -e flag, so we don't fail
    wait                       # wait for background process to stop

    # Record the failure in the log file for later analysis
    ecflow_client --msg="ERROR task %ECF_NAME% failed"

    ecflow_client --complete   # replace abort with a complete
    trap 0                     # Remove the trap
    exit 0                     # End the script
}

2/ Have a special task whose job is to monitor failure in the other tasks. This task will then log the failures and then automatically set the family/repeat to complete.

suite suite
  family main
    repeat date YMD 20170101 20180101 1
    task dodgy    # this task may fail
    task ok
    task fix                                          # handle failures, so repeat will advance even if other tasks fail
      complete dodgy == complete and ok  == complete  # If there are no failures, complete fix, so repeat will advance
      time 23:30                                      # run at 23:30, allowing users to address task dodgy otherwise automatically advance the REPEAT
  endfamily
endsuite
fix.ecf
# task fix.ecf
%include <head.h>

trap 0
ecflow_client --force=complete recursive /%SUITE%/%FAMILY%   # node paths are absolute, so start with '/'
exit 0

%include <tail.h>

 "Ran out of end point" error message?

Check the remote server is started. Did you try to issue a command with a wrong port/host/ECF_PORT/ECF_HOST?

# ecflow:ClientInvoker: Connection error: (Client::handle_connect: Ran out of end points: connection error( Connection refused ) 

How do I renew the log file?

ecflow_client --port $(($(id -u) + 1500)) --log=new

How do I check current connections to the server?

ecflow_client --port $(($(id -u) + 1500)) --ch_suites

How do I clean up a server from all its connections?

export ECF_PORT=$(($(id -u) + 1500)) ECF_HOST=${ECF_HOST:=localhost}
handles="$(ecflow_client --ch_suites | cut -c7-18 | grep -v handle)"
for handle in $handles ; do ecflow_client --ch_drop $handle; done
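The commands above derive the port from the user id (uid + 1500). The same value can be computed in Python, for instance before creating a client. This is a small sketch; `default_port` is a hypothetical helper name, and `os.getuid()` is available on Unix only.

```python
import os


def default_port(uid=None, offset=1500):
    """Return uid + offset, the port used in the shell commands above."""
    if uid is None:
        uid = os.getuid()      # Unix only
    return uid + offset


if __name__ == "__main__":
    print(default_port())
```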

Pure python tasks: the server does not honour ECF_TRIES?

When ecflow calls the default ECF_JOB_CMD, this is spawned as a separate child process. This child process is then monitored for abnormal termination.

When this happens ecflow will call abort, and set a special flag which prevents ECF_TRIES from working.

It should be noted that the default error trapping calls exit(0), hence the default ECF_JOB_CMD will correctly handle ECF_TRIES.

head.h
....
# Defined a error handler
ERROR() {
  echo "ERROR called"
  set +e                                # Clear -e flag, so we don't fail
  wait                                  # wait for background process to stop
  trap 0 1 2 3 4 5 6 7 8 10 12 13 15    # when the following signals arrive do nothing, stops recursive signals/error function being called
  ecflow_client --abort                 # Notify ecflow that something went wrong
  trap 0                                # Remove the trap
  exit 0                                # End the script. Notice that we call exit(0)
}

In our operations we have a specialized script (trimurti) that detaches from the spawned process, i.e. by nohup or via a special program called standalone.

This bypasses the spawned process termination issue. Also the korn shell error trap uses wait, i.e. it waits for background processes to stop.

Hence when using pure python jobs with the default ECF_JOB_CMD, if the python program errors/aborts, the process terminates abnormally.

This process termination is captured by the ecflow server, causing an abort, i.e. either when the node state is aborted or submitted (due to ECF_TRIES).

This abnormal job termination prevents the aborted job from rerunning. When the second process starts running, the task in the server is already aborted, leading to zombies.


To fix your problem:

  • Use a dedicated script for job submission, and use a bash/korn shell to invoke your python scripts, using korn shell trapping for robust error handling as above.
  • Alternatively, if you want to stick with pure python tasks, you need to detach from the spawned process. Modify your ECF_JOB_CMD:

    edit ECF_JOB_CMD "nohup python %ECF_JOB% > %ECF_JOBOUT% 2>&1 &"
  • Alternatively, always make sure your python job exits cleanly after calling ecflow abort, by calling exit(0):

     # Fragment: methods of the job's client wrapper class; requires 'import sys'
     def signal_handler(self, signum, frame):
         print('Aborting: Signal handler called with signal ' + str(signum))
         self.ci.child_abort("Signal handler called with signal " + str(signum))
         sys.exit(0)

     def __exit__(self, ex_type, value, tb):
         print("Client:__exit__: ex_type:" + str(ex_type) + " value:" + str(value) + "\n" + str(tb))
         if ex_type is not None:
             self.ci.child_abort("Aborted with exception type " + str(ex_type) + ":" + str(value))
             sys.exit(0)                 # exit cleanly after the abort has been sent
         self.ci.child_complete()
         return False

How can I automatically kill long running jobs ?

One way of doing this is to combine the late flag with a trigger. The example below shows two alternatives, one with a script and one without; you need to choose one of them.

suite s 
      task long_running                   # if the task takes longer than 2 hours, set the late flag
           late -c +02:00                 
      task kill_long_running              # only triggered if task long_running is late
          trigger long_running<flag>late   
      task kill_long_running_noScript     # only triggered if task long_running is late
          trigger long_running<flag>late  
          edit ECF_NO_SCRIPT 1 
          edit ECF_JOB_CMD "export ECF_PASS=%ECF_PASS%;export ECF_PORT=%ECF_PORT%;export ECF_HOST=%ECF_HOST%;export ECF_NAME=%ECF_NAME%;export ECF_TRYNO=%ECF_TRYNO%; ecflow_client --init=$$; %ECF_CLIENT% --kill /s/long_running; %ECF_CLIENT% --complete"

The script for the kill_long_running task:

kill_long_running.ecf
%include <head.h>
ecflow_client --kill /s/long_running
%include <tail.h>
