How do I check a suite definition without loading it into the ecflow_server?
- If the suite definition is in a text file::
> ecflow_client --load=test.def check_only
Alternatively, load test.def into Python and check it there.
- If the suite definition is built using the Python API, check the definition in Python before loading it.
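The Python check can be sketched as follows. This is a minimal sketch, assuming the ecflow Python module is installed and that Defs.check() returns an empty string when it finds no errors or warnings. The helper names check_definition and check_file are hypothetical, and test.def is just the file name used above.

```python
def check_definition(defs):
    """Return the non-empty lines reported by defs.check().

    `defs` is expected to be an ecflow.Defs object. Any object with a
    check() method that returns a string will also work.
    """
    report = defs.check() or ""  # assumption: empty string means nothing to report
    return [line for line in report.splitlines() if line.strip()]


def check_file(path):
    """Parse a text definition file and check it, without contacting a server."""
    import ecflow  # imported here so check_definition() works without the module

    defs = ecflow.Defs(path)  # loads the definition into memory only
    return check_definition(defs)
```

Calling check_file("test.def") returns an empty list when nothing is reported. For a definition built with the Python API, pass the Defs object straight to check_definition().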
I have loaded my suite definition but I cannot see it on ecflow_ui?
If you can see the suite using the suites or get command on the command line (ecflow_client --suites or ecflow_client --get) but not in ecflow_ui, right-click on the “server” node, select “suite filter”, and select your suite.
My task fails when I submit it and I do not see any job output?
It is generally a good idea to look at the ecflow_server log, either directly or in ecflow_ui by right-clicking on the server and selecting “History”.
Also check whether a job was created and that the server is not halted.
If you click on the script button and get a “file read error” message, then ecflow_server cannot find your script.
Check the file permissions and the location given in the variable ECF_SCRIPT (using variables in ecflow_ui). This is normally based on the ECF_FILES variable.
If you right-click on the node, click the edit button, and get the error message “send failed for the node”, you may not be able to access your include files.
Check the location of the include files and how they are defined (see ecFlow Pre-processor symbols).
My task stays in a submitted state when I submit a job?
This can have a number of causes. Please check:
- The job may be unable to submit because the queuing system cannot schedule the task at the time, or because the task is failing before the child command ecflow_client --init is sent.
- Check that the ecflow_server is not halted.
- Test by running the job from the command line, or check its status in the queuing system used, e.g. llq for LoadLeveler, qstat for PBS.
  Run the submission command from the command line. This is based on how the ECF_JOB_CMD variable is set. The script could be failing before the ecflow_client --init command is sent.
- If you are using ECF_OUT to define the output directory, make sure that the directory exists, including the directories corresponding to suite/family nodes.
My task stays in an active state when I submit a job?
This can be caused by an error that is not trapped, such as the job being killed with kill -9 or the host system crashing.
Check that the job is running. If it is not, rerun the task or synchronise its status in ecFlow as appropriate.
How can I check the status of an ecflow_server?
> ecflow_client --stats
This displays standard information about the ecflow_server, including the version number, node information, status, security information, usage, load, setup and up time.
How can I check the load on the ecflow_server?
> ecflow_client --server_load
This relies on gnuplot. If you know the location of the server log file and it is accessible, pass its path to the command to avoid putting load on the server:
> ecflow_client --server_load=/path/to/log/file
This command will produce a <host>.<port>.png file, which can then be viewed.
Jobs run locally remain submitted when the variable ECF_OUT is used?
The ECF_OUT variable should be used when job output is not located in the same directory as the job files.
This is necessary for remote job submission when the local and remote hosts do not share a common file system.
When ECF_OUT is used, the user is responsible for creating the directory structure (all directories) in *advance*, so that output files can be created.
It is enough to copy the directory path from the variable ECF_JOBOUT and use it to run mkdir -p on the remote host.
The ecFlow server is "target agnostic" and does not know on its own how to log on to the remote machine, so the suite designer remains responsible for directory creation.
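The directory creation step can also be scripted. The sketch below is a Python equivalent of mkdir -p, assuming it is run on the host where the job writes its output and that it is given the value of ECF_JOBOUT (the full path of the output file). The function name make_output_dir is hypothetical.

```python
import os


def make_output_dir(ecf_jobout):
    """Create the directory that will hold the job output file.

    ecf_jobout is the full path of the output file (the value of ECF_JOBOUT).
    Only its parent directories are created, like `mkdir -p` on that directory.
    """
    out_dir = os.path.dirname(ecf_jobout)
    os.makedirs(out_dir, exist_ok=True)  # no error if the directory already exists
    return out_dir
```

The output file itself is not created. Only the suite/family directories above it are created, which is what the job needs in order to open its output.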
A message 'locale::facet::_S_create_c_locale name not valid' is displayed when the ecflow_server or ecflow_client command is run. How can I prevent it?
To see the list of locales on your system, use
> locale -a
Then set the LANG environment variable, e.g.
LANG=en_GB.UTF-8 (ksh: export LANG=en_GB.UTF-8)
How can I logically 'or' time dependencies of different types?
It is important to understand how time dependencies work first.
When we have multiple time dependencies of the same type, they are 'or'ed, i.e.
time 10:00
time 12:00 # Task will run when time is 10:00 OR 12:00
Likewise if we have:
date 1.12.2012
date 2.12.2012 # Task is free to run only on first or second of December.
When *different* types of time dependencies are added, then the task is only free to run when both are satisfied:
time 10:00
date 1.12.2015 # task is only free to run at 10 am on the first of December
This effectively means that time dependencies are logically anded.
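The rule can be illustrated with a toy model. This is not how the ecFlow server is implemented. It is only a sketch of the logic described above, with hypothetical names (is_free, the "time" and "date" keys) and with times and dates treated as plain strings.

```python
def is_free(dependencies, now):
    """Return True when the task would be free to run.

    dependencies: dict mapping a dependency type ("time", "date", ...) to the
                  list of values given for that type.
    now:          dict mapping each type to the current value.

    Values of the same type are OR'ed (any one may match).
    Different types are AND'ed (every type must match).
    """
    return all(now.get(kind) in allowed for kind, allowed in dependencies.items())


# Same type: free at 10:00 OR 12:00.
same_type = {"time": ["10:00", "12:00"]}

# Different types: free only at 10:00 AND on 1.12.2015.
mixed = {"time": ["10:00"], "date": ["1.12.2015"]}
```

With these inputs, is_free(same_type, {"time": "12:00"}) is true. is_free(mixed, {"time": "10:00", "date": "2.12.2015"}) is false, because the date does not match.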
Now suppose we wanted to run the task at 10:00 *OR* when the date is 1.12.2015.
This can be done by adding a dummy task.
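task dummy_time_trigger
   edit ECF_DUMMY_TASK "" # Tell server & checking not to expect .ecf file
   time 10:00
task dummy_date_trigger
   edit ECF_DUMMY_TASK "" # Tell server & checking not to expect .ecf file
   date 1.12.2015
# On the task that should run at 10:00 OR on 1.12.2015:
   trigger dummy_time_trigger == complete or dummy_date_trigger == complete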
By using a combination of a dummy task and trigger, we can achieve the effect of 'OR' in time dependencies of different types.
This technique will work for any complex dependency and has the added advantage of allowing us to manually free the dependencies via the GUI.
In the python API what's the difference between sync_local(), get_server_defs() and get_defs()?
First it is important to understand that the 'ecflow.Client' class **stores** the suite definition returned from the server.
The suite definition can be retrieved from the server using 'sync_local()' or 'get_server_defs()'.
While the 'ecflow.Client' object exists, the suite definition is retained.
get_defs() returns the defs stored on the client. Hence either sync_local() or get_server_defs() should be called first, otherwise a Null object is returned.
sync_local() behaves as follows:
#. The very *first* call always retrieves the *full* suite definition.
#. The second and subsequent calls *may* return delta/incremental changes *or*, less typically, the full suite definition.
If there are only event, meter, label and state changes in the server, then calling sync_local() will retrieve these *small* incremental changes and synchronise them with the suite definition held in the ecflow.Client() object.
Typically these changes are a very small fraction, when compared with the full suite. This is the normal scenario.
The incremental sync reduces the network bandwidth and hence improves speed.
If however the user makes large scale changes, i.e. by deleting or adding nodes, then sync_local() will return the full suite definition.
Hence if your python code needs to continually poll the server, please use the same ecflow.Client() object and *always* use sync_local().
get_server_defs() *always* returns the full suite definition. For a single use of the suite definition in python code there is no difference between sync_local() and get_server_defs().
*HOWEVER* if you wish to monitor the server in python then you *MUST* use sync_local(), as it will be considerably faster and will avoid overloading the server.
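A polling loop along these lines might look like the sketch below. It assumes an ecflow.Client object is passed in and uses only sync_local(), get_defs(), the suites collection, and the node methods name() and get_state(). The function name poll_suite_states, its parameters, and the reporting callback are hypothetical.

```python
import time


def poll_suite_states(client, interval_seconds, iterations, on_update, sleep=time.sleep):
    """Poll the server with one client object and report the suite states.

    client:    an ecflow.Client, reused for every call so that sync_local()
               can fetch incremental changes after the first full sync.
    on_update: callback receiving a dict {suite name: state as text}.
    sleep:     injectable for testing; defaults to time.sleep.
    """
    for i in range(iterations):
        client.sync_local()        # first call: full definition; later calls: usually deltas
        defs = client.get_defs()   # the definition held by the client object
        if defs is not None:
            on_update({suite.name(): str(suite.get_state()) for suite in defs.suites})
        if i + 1 < iterations:
            sleep(interval_seconds)
```

A possible use, assuming a server on a placeholder host and port, is to create one ecflow.Client("localhost", "3141") and call poll_suite_states(client, 60, 10, print). Creating a new client for each iteration would defeat the incremental sync.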
In some circumstances it is useful to introduce a time delay before a job is submitted again because of ECF_TRIES.
We may not want this feature to be part of the ecFlow server. In most cases it is enough to update the submit script to add a "sleep" when the second try (job2) is being submitted. Keeping this in the script makes it easy to change the time interval, to change the list of remote hosts which cannot cope with an immediate resubmit, and to enable the feature for some (or all) suites, with no need to update, compile, install, stop and restart the ecFlow server. At ECMWF, the submit variable is often defined as "edit ECF_JOB_CMD '
After upgrading my system (new Linux flavour, new ecFlow version), ecflowview pop-up windows are ignored when virtual displays are used.
With KDE, open the window "System Settings - Window Behaviour" and check that "Focus Window Prevention" is set as expected (None).
The ecflowview window title might also be changed so that the "Window Rules - Appearance & Fixes - Accept Focus" rule is activated. "Window Rules" also allows the ecflowview window size and position to be set.
How can I monitor a suite as a file system?
We can do this using the ecFlow python API and the python-fuse library; see "How can I monitor my suite independent of the GUI". The status is slightly delayed because it is accessed through a python client.
How do I advance a REPEAT even when some of the tasks are aborted?
There are several ways to do this.
1/ If the tasks are not very robust and are known to fail regularly, we can use custom error trapping which, instead of aborting the task, logs the abort and sets the task to complete. The logging could be anything.
2/ Have a special task whose job is to monitor failures in the other tasks. This task then logs the failures and automatically sets the family/repeat to complete.
"Ran out of end point" error message?
Check that the remote server is started. Did you issue a command with the wrong port/host/ECF_PORT/ECF_HOST?
How do I renew the log file?
How do I check current connections to the server?
How do I clean up a server from all its connections?
Pure python tasks: the server does not honour ECF_TRIES?
When ecflow calls the default ECF_JOB_CMD, this is spawned as a separate child process. This child process is then monitored for abnormal termination.
When this happens, ecflow calls abort and sets a special flag which prevents ECF_TRIES from working.
It should be noted that the default error trapping will call exit(0), hence the default ECF_JOB_CMD will correctly handle ECF_TRIES.
In our operations we have a specialized script (trimurti) that detaches from the spawned process, i.e. by nohup or via a special program called standalone.
This bypasses the spawned process termination issue. Also, the Korn shell error trap uses wait, i.e. to wait for background processes to stop.
Hence when using pure python jobs with the default ECF_JOB_CMD, after the python program errors/aborts, the process terminates abnormally.
This process termination is captured by the ecflow server, causing an abort, i.e. either when the node state is aborted or submitted (due to ECF_TRIES).
This abnormal job termination prevents the aborted job from rerunning. When the second process starts running, the task in the server is already aborted, leading to zombies.
To fix this problem:
- Use a dedicated script for job submission, and use a bash/Korn shell to invoke your python scripts, using Korn shell trapping for robust error handling, as above.
- Alternatively, if you want to stick with pure python tasks, you need to detach from the spawned process by modifying your ECF_JOB_CMD.
- Alternatively, always make sure your python job exits cleanly after calling ecflow abort, by calling exit(0).
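The last option can be sketched as a small wrapper. It assumes an ecflow.Client already configured for child commands is passed in, and it uses the child_init(), child_abort() and child_complete() calls. The function name run_job and the work callable are hypothetical.

```python
import sys


def run_job(client, work):
    """Run `work` between the init and complete child commands.

    On any exception the task is aborted in ecFlow, and the process then
    exits with status 0 so that it does not terminate abnormally.
    """
    client.child_init()
    try:
        work()
    except Exception as exc:
        client.child_abort(str(exc))  # report the failure to the server
        sys.exit(0)                   # exit cleanly so that ECF_TRIES can take effect
    client.child_complete()
```

The design choice is that the abort is reported through the child command only, never through the exit status, so the server sees one failure signal rather than two.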
How can I automatically kill long running jobs?
One way of doing this is a combination of the late flag and triggers. This shows an example with and without a script. The example shows both, hence you need to choose one.
The script for the kill_long_running task