
aborted

Is a node status.

When the ECF_JOB_CMD fails, or the job file sends the ecflow_client --abort child command, the task is placed into the aborted state.

active

Is a node status.

If job creation was successful and the job file has started, then the ecflow_client --init child command is received by the ecflow_server and the task is placed into the active state.

autocancel

autocancel is a way to automatically delete a node which has completed.

The delete may be delayed by an amount of time in hours and minutes, or expressed in days. Any node may have a single autocancel attribute. If the auto-cancelled node is referenced in the trigger expression of other nodes, it may leave those nodes waiting. This can be solved by making sure the trigger expression also checks for the unknown state, i.e.:

trigger node_to_cancel == complete or
node_to_cancel == unknown

This guards against 'node_to_cancel' being undefined or deleted.

For python see ecflow.Autocancel and ecflow.Node.add_autocancel. For text BNF see autocancel.
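
A minimal Python sketch of adding an autocancel attribute; the suite and task names here are hypothetical:

import ecflow

defs = ecflow.Defs()
task = defs.add_suite("s1").add_task("t1")

# delete the node 3 days after it completes
task.add_autocancel(3)

# alternatively, delete 1 hour and 30 minutes after completion
# (assumed Autocancel(hour, minute, relative) constructor)
# task.add_autocancel(ecflow.Autocancel(1, 30, True))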

check point

The check point file is like the suite definition, but includes all the state information.

It is periodically saved by the ecflow_server.

It can be used to recover the state of the node tree should the server die or the machine crash.

By default, when an ecflow_server is started it will attempt to load the check point file.

The default check point file name is <host>.<port>.ecf.check. This can be overridden by the ECF_CHECK environment variable.

child command

Child commands (or task requests) are called from within the ecf script files. The table below also includes the default action (from version 4.0.4) when the child command is part of a zombie. They include:

 

Child Command            | Description                           | Zombie (default action)
ecflow_client --init     | Sets the task to the active status    | block
ecflow_client --event    | Set an event                          | fob
ecflow_client --meter    | Change a meter                        | fob
ecflow_client --label    | Change a label                        | fob
ecflow_client --wait     | Wait for an expression to evaluate    | block
ecflow_client --abort    | Sets the task to the aborted status   | block
ecflow_client --complete | Sets the task to the complete status  | block

 

The following environment variables must be set for the child commands: ECF_NODE, ECF_NAME, ECF_PASS and ECF_RID. See ecflow_client.

clock

A clock is an attribute of a suite.

A gain can be specified to offset from the given date.

The hybrid and real clocks always run in phase with the system clock (UTC in UNIX) but can have any offset from the system clock.

The clock can be hybrid or real (see also virtual clock).

time, day, date and cron dependencies work a little differently under the different clocks.

If the ecflow_server is shutdown or halted, job scheduling is suspended. If this suspension lasts for a period of time, it can affect task submission under hybrid and real clocks. In particular it will affect tasks with time, today or cron dependencies.

  • dependencies with time series, can result in missed time slots:

    time 10:00 20:00 00:15    # if the server is suspended for more than 15 minutes, time slots can be missed
    time +00:05 20:00 00:15   # start 5 minutes after the begin of the suite, then every 15 minutes until 20:00
    
  • When the server is placed back into the running state, any time dependencies with an expired time slot are submitted straight away. i.e. if the ecflow_server is halted at 10:59 and then placed back into the running state at 11:20:

    time 11:00

    then any task with an expired single time slot dependency will be submitted straight away.

For python see ecflow.Clock and ecflow.Suite.add_clock. For text BNF see clock.
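
A hedged Python sketch of attaching a clock to a suite, assuming the ecflow.Clock(hybrid) constructor and a Clock.set_gain(hour, minute, positive) signature; the suite name is hypothetical:

import ecflow

defs = ecflow.Defs()
suite = defs.add_suite("s1")

clock = ecflow.Clock(True)    # True -> hybrid clock, False -> real clock
clock.set_gain(1, 30, True)   # offset the clock by +01:30 (assumed signature)
suite.add_clock(clock)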

complete

Is a node status.

The node can be set to complete:

  • by the complete expression
  • at job end, when the task receives the ecflow_client --complete child command

complete expression

Force a node to be complete if the expression evaluates to true, without running any of the nodes.

This allows you to have tasks in the suite that run only if others fail. In practice the node would also need a trigger.

For python see ecflow.Expression and ecflow.Node.add_complete.
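
A minimal Python sketch of adding a complete expression; the task names are hypothetical:

import ecflow

defs = ecflow.Defs()
suite = defs.add_suite("s1")
suite.add_task("main")

# 'recover' only needs to run when 'main' aborts; otherwise it is forced complete
recover = suite.add_task("recover")
recover.add_trigger("main == aborted")
recover.add_complete("main == complete")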

cron

Like time, cron defines a time dependency for a node, but it is repeated indefinitely.

cron 11:00

cron 10:00 22:00 00:30 # <start> <finish> <increment>

When the node becomes complete, it will be queued immediately. This means that the suite will never complete, and the output is not directly accessible through ecflowview.

If the task aborts, the ecflow_server will not schedule it again.

If the time the job takes to complete is longer than the interval, a time "slot" is missed, e.g.:

cron 10:00 20:00 01:00

if the 10:00 run takes more than an hour, the 11:00 run will never occur.

If the cron defines months, days of the month, week days or a single time slot, then it relies on a day change; hence if a hybrid clock is defined, the node will be set to complete at the beginning of the suite, without running the corresponding job. Otherwise, under a hybrid clock, the suite would never complete.

For python see ecflow.Cron and ecflow.Node.add_cron. For text BNF see cron.
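
A hedged Python sketch of adding a cron, assuming Cron.set_week_days and Cron.set_time_series accept the forms shown; the task name is hypothetical:

import ecflow

defs = ecflow.Defs()
task = defs.add_suite("s1").add_task("t1")

cron = ecflow.Cron()
cron.set_week_days([1, 2, 3, 4, 5])        # assumed day numbering, 0 = Sunday
cron.set_time_series("10:00 20:00 01:00")  # start, finish, increment
task.add_cron(cron)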

date

This defines a date dependency for a node.

There can be multiple date dependencies. The European format is used for dates, which is: dd.mm.yy as in 31.12.2007. Any of the three number fields can be expressed with a wildcard * to mean any valid value. Thus, 01.*.* means the first day of every month of every year.

If a hybrid clock is defined, any node held by a date dependency will be set to complete at the beginning of the suite, without running the corresponding job. Otherwise, under a hybrid clock, the suite would never complete.

For python see: ecflow.Date and ecflow.Node.add_date. For text BNF see date.
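
A minimal Python sketch of adding a date dependency, where 0 is the wildcard value for a field; the task name is hypothetical:

import ecflow

defs = ecflow.Defs()
task = defs.add_suite("s1").add_task("t1")

# run on the 1st of every month of every year (day, month, year; 0 means any)
task.add_date(1, 0, 0)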

day

This defines a day dependency for a node.

There can be multiple day dependencies.

If a hybrid clock is defined, any node held by a day dependency will be set to complete at the beginning of the suite, without running the corresponding job. Otherwise, under a hybrid clock, the suite would never complete.

For python see: ecflow.Day and ecflow.Node.add_day. For text BNF see day.
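
A minimal Python sketch of adding day dependencies, assuming add_day accepts the day names as strings; the task name is hypothetical:

import ecflow

defs = ecflow.Defs()
task = defs.add_suite("s1").add_task("t1")

# hold the task so that it only runs on Mondays or Fridays
task.add_day("monday")
task.add_day("friday")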

defstatus

Defines the default status to be assigned to a task/family when the begin command is issued.

By default a node is queued when you use begin on a suite. defstatus is useful for preventing suites from running automatically once begun, or for setting tasks to complete so that they can be run selectively.

For python see ecflow.DState and ecflow.Node.add_defstatus. For text BNF see defstatus.
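
A minimal Python sketch of adding a defstatus, using ecflow.DState for the value; the names are hypothetical:

import ecflow

defs = ecflow.Defs()
suite = defs.add_suite("s1")

# this task is marked complete when the suite is begun, so it only runs if re-queued manually
suite.add_task("optional_task").add_defstatus(ecflow.DState.complete)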

dependencies

Dependencies are attributes of a node that can suppress/hold a task from taking part in job creation.

They include trigger, date, day, time, today, cron, complete expression, inlimit and limit.

A dependent task cannot be started as long as some dependency is holding it or any of its parent nodes.

The ecflow_server will check the dependencies every minute, during normal scheduling, and whenever any child command causes a state change in the suite definition.

directives

Directives start with a % character. This is referred to as the ECF_MICRO character.

The directives are used in two main contexts:

  • Preprocessing directives. In this case the directive starts as the first character on a line in an ecf script file. See the table below, which shows the allowable values. Only one directive is allowed on a line.

  • Variable directives. These use two ECF_MICRO characters, i.e. %VAR%; they can occur anywhere on the line and in any number.

    %CAR% %TYPE% %WISHLIST%
    

    These directives take part in variable substitution .

    If the micro characters are not paired (i.e. an uneven number), variable substitution cannot take place and an error message is issued.

    port=%ECF_PORT        # error issued since '%' micro characters are not paired

    However, an uneven number of micro characters is allowed if the line begins with the '#' comment character.

    # This is a comment line with a single micro character %, no error issued
    # port=%ECF_PORT again no error issued
    

Directives are expanded during pre-processing. Examples include:

Symbol                | Meaning
%include <filename>   | The %ECF_INCLUDE% directory is searched for the file and its contents are included in the job file. If that variable is not defined, ECF_HOME is used. If ECF_INCLUDE is defined but the file does not exist, we look in ECF_HOME. This allows specific files to be placed in ECF_INCLUDE and the more general/common include files to be placed in ECF_HOME. This is the recommended form.
%include "filename"   | Include the contents of the file %ECF_HOME%/%SUITE%/%FAMILY%/filename in the job.
%include filename     | Include the contents of the file filename in the output. The only form that can be used safely must start with a slash '/'.
%includenopp filename | Same as %include, but the file is not interpreted at all.
%comment              | Starts a comment, which is ended by the %end directive. The section enclosed by %comment - %end is removed during pre-processing.
%manual               | Starts a manual, which is ended by the %end directive. The section enclosed by %manual - %end is removed during pre-processing. The manual directive is used to create the manual page shown in ecflowview.
%nopp                 | Stop pre-processing until a line starting with %end is found. No interpretation of the text is done (i.e. no variable substitutions).
%end                  | End processing of %comment, %manual or %nopp.
%ecfmicro CHAR        | Change the directive character to the character given. If set in an include file, the effect is retained for the rest of the job (or until set again). Note that an ecfmicro directive specified in the ecf script file does not affect variable substitution for the ECF_JOB_CMD, ECF_KILL_CMD or ECF_STATUS_CMD variables; they still use ECF_MICRO. If no ecfmicro directive exists, we default to using ECF_MICRO from the suite definition.

ecf file location algorithm

The ecflow_server and job creation checking use the following algorithm to locate the '.ecf' file corresponding to a task:

  • ECF_SCRIPT

    First it uses the generated variable ECF_SCRIPT to locate the script. This variable is generated from: ECF_HOME/<path to task>.ecf

    Hence if the task path is /suite/f1/f2/t1, then ECF_SCRIPT=ECF_HOME/suite/f1/f2/t1.ecf

  • ECF_FILES

    Second it checks for the user defined ECF_FILES variable. If defined the value of this variable must correspond to a directory. This directory is searched in reverse order.

    i.e. let us assume we have a task /o/12/fc/model, and ECF_FILES is defined as /home/ecmwf/emos/def/o/ECFfiles

    ecFlow will then use the following search pattern:

    1. /home/ecmwf/emos/def/o/ECFfiles/o/12/fc/model.ecf
    2. /home/ecmwf/emos/def/o/ECFfiles/12/fc/model.ecf
    3. /home/ecmwf/emos/def/o/ECFfiles/fc/model.ecf
    4. /home/ecmwf/emos/def/o/ECFfiles/model.ecf
  • ECF_HOME

    Thirdly, it searches for the script using ECF_HOME in reverse order (i.e. like ECF_FILES).

If this fails, then the task is placed into the aborted state. We can check that the files can be located before loading the suites into the server.
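
For example, a hedged Python sketch that checks that the '.ecf' scripts can be located (and jobs generated) before loading a definition; the path is hypothetical:

import ecflow

defs = ecflow.Defs("/path/to/my.def")   # hypothetical definition file

# returns an empty string when every task's '.ecf' script can be located and pre-processed
errors = defs.check_job_creation()
if errors:
    print("Job creation problems:\n" + errors)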

ecf script

The ecFlow script refers to an '.ecf' file.

The script file is transformed into the job file by the job creation process.

The base name of the script file must match its corresponding task, i.e. t1.ecf corresponds to the task named 't1'. The script, if placed in the ECF_FILES directory, may be re-used by multiple tasks belonging to different families, providing the task names match.

The ecFlow script is similar to a UNIX shell script.

The differences, however, include the addition of "C"-like pre-processing directives and ecFlow variables. The script must also include calls to the init and complete child commands so that the ecflow_server is aware when the job starts (i.e. changes state to active) and finishes (i.e. changes state to complete).

ECF_DUMMY_TASK

This is a user variable that can be added to a task to indicate that there is no associated ecf script file.

If this variable is added to a suite or family, then all child tasks are treated as dummy tasks.

This stops the server from reporting an error during job creation .

edit ECF_DUMMY_TASK ''

ECF_JOB

This is a generated variable. It defines the path name location of the job file.

The variable is composed as: ECF_HOME/ECF_NAME.job<ECF_TRYNO>

ECF_JOBOUT

This is a generated variable. It defines the path name for the job output file. The variable is composed as follows.

If ECF_OUT is specified:

         ECF_OUT/ECF_NAME.ECF_TRYNO

otherwise:

         ECF_HOME/ECF_NAME.ECF_TRYNO

ECF_MICRO

This is a suite and generated variable. The default value is %. This variable is used in variable substitution during command invocation, and as the default directive character during pre-processing. It can be overridden, but must be replaced by a single character.

ECF_NAME

This is a generated variable. It defines the path name of the task.

ECF_NONSTRICT_ZOMBIES

When the server is heavily overloaded, or when the server is being run on a virtual machine where the scripts/.ecf files are not local, the server can spend a lot of time in job generation processing. This can end up affecting clients that try to communicate with the server, causing the client calls in the jobs to time out and resulting in zombies.

i.e. if a child command 'ecflow_client --init <process_id>' is called in the job and the server is busy, then by the time the server has processed the request and replied back to the job/ecflow_client, the client has already timed out.

The job then retries sending the same message, but this time the server treats the request as a zombie. The net result is that the job suspends, since the server treats it as a zombie. The default behaviour for zombies is to block for the init, complete and abort child commands, and to fob for the event, label and meter child commands.

However when ECF_NONSTRICT_ZOMBIES is added as a variable to defs/suite/family/task, then the server behaves as follows:

  • --init: if the password and process id match, and the task is already active, the server responds with a fob, allowing the job to continue.
  • --complete: if the task is already complete, the server responds with a fob, allowing the job to continue.
  • --abort: if the task is already aborted, the server responds with a fob, allowing the job to continue.

This variable can be added/removed using the alter functionality. The following example adds the variable at the server level, and hence affects all suites.

ecflow_client --alter add variable ECF_NONSTRICT_ZOMBIES 1 /
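
The same alter can be issued from Python; a hedged sketch mirroring the command-line example above:

import ecflow

try:
    ci = ecflow.Client()   # with no arguments, host/port are assumed to come from the environment/defaults
    # add ECF_NONSTRICT_ZOMBIES at the server (root) level, affecting all suites
    ci.alter("/", "add", "variable", "ECF_NONSTRICT_ZOMBIES", "1")
except RuntimeError as e:
    print("Failed: " + str(e))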

ECF_SCRIPT

This is a generated variable. It defines the path name for the ecf script.

ECF_TRYNO

This is a generated variable that is used in file name generation. It represents the current try number for the task .

After begin it is set to 1. The number is advanced each time the job is re-run, and it is reset back to 1 after a re-queue. It is used in output and job file numbering (i.e. it avoids overwriting the job file/output during multiple re-runs).

ECF_OUT

This is a user/suite variable that specifies a directory path. It controls the location of the job output (stdout and stderr of the process) on a remote file system.

It provides an alternate location for the job and cmd output files. If it exists, it is used as a base for ECF_JOBOUT, and it is also used by ecFlow when searching for the output, when asked by ecflowview/CLI.

If the output is in ECF_OUT/ECF_NAME.ECF_TRYNO it is returned, otherwise ECF_HOME/ECF_NAME.ECF_TRYNO is used.

The user must ensure that all the directories exist, including those for the suite/family. If this is not done, you may well find that tasks remain stuck in the submitted state.

At ECMWF our submission scripts ensure that these directories exist.

ecFlow

Is the ECMWF workflow manager.

It is a general purpose application designed to schedule a large number of computer processes in a heterogeneous environment.

It helps with the design, submission and monitoring of computer jobs, both in the research and operations departments.

ecflow_client

This executable is a command line program; it is used for all communication with the ecflow_server .

To see the full range of commands that can be sent to the ecflow_server type the following in a UNIX shell:

ecflow_client --help

This functionality is also provided by the Client Server API .

The following variables affect the execution of ecflow_client.

Since the ecf script can call ecflow_client (i.e. a child command), typically some of these are set in an include header, i.e. head.h.

Variable Name | Explanation | Compulsory | Example
ECF_NODE      | Name of the host running the ecflow_server | Yes | pikachu
ECF_NAME      | Path to the task | Yes | /suite/family/task
ECF_PASS      | The job's password | Yes | (generated)
ECF_RID       | Remote id. Allows easier job kill, and disambiguates a zombie from the real job | Yes | (generated)
ECF_PORT      | Port number of the ecflow_server. Must match the ecflow_server | No | 3141
ECF_TRYNO     | The number of times the job has run. This is allocated by the server and used in job/output file name generation | No | (generated)
ECF_HOSTFILE  | File that lists alternate hosts to try if the connection to the main host fails | No | /home/user/avi/.ecfhostfile
ECF_TIMEOUT   | Maximum time in seconds for the client to deliver a message | No | 24*3600 (default value)
ECF_DENIED    | If the server denies client communication and this flag is set, exit with an error. Avoids the 24 hour connection attempts to the ecflow_server | No | 1
NO_ECF        | If set, ecflow_client exits immediately with success. This allows the scripts to be tested independently of the server | No | 1
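
The same client functionality is available from Python; a hedged sketch (the host and port are hypothetical):

import ecflow

try:
    ci = ecflow.Client("pikachu", "3141")   # hypothetical host and port
    ci.ping()                               # check that the ecflow_server is reachable
    ci.sync_local()                         # download the server's suite definition
    print(ci.get_defs())
except RuntimeError as e:
    print("Failed: " + str(e))
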
ecflow_server

This executable is the server.

It is responsible for scheduling the jobs and responding to ecflow_client requests.

Multiple servers can be run on the same machine/host, providing they are assigned a unique port number.

The server records all requests in the log file.

The server will periodically (see ECF_CHECKINTERVAL) write out a check point file.

The following environment variables control the execution of the server and may be set before the start of the server. ecflow_server will start happily without any of these variables being set, since all of them have default values.

Variable Name      | Explanation | Default value
ECF_HOME           | Home for all the ecFlow files | current working directory
ECF_PORT           | Server port number. Must be unique | 3141
ECF_LOG            | History or log file | <host>.<port>.ecf.log
ECF_CHECK          | Name of the checkpoint file | <host>.<port>.ecf.check
ECF_CHECKOLD       | Name of the backup checkpoint file | <host>.<port>.ecf.check.b
ECF_CHECKINTERVAL  | Interval in seconds between saves of the check point file | 120
ECF_LISTS          | White list file. Controls read/write access to the server for each user | <host>.<port>.ecf.lists
ECF_TASK_THRESHOLD | Report in the log file all tasks/jobs that take longer than the given threshold. Used to debug/instrument very large scripts | 4000ms (in release 4.0.6 the default was 2000ms), where 1000ms = 1 second

The server can be in several states. The default when first started is halted. See server states.

ecflowview

The ecflowview executable is the GUI-based client. It is used to visualise and monitor:

  • the hierarchical structure of the suite definition
  • state changes in the nodes and the ecflow_server, using colour coding
  • attributes of the nodes and any dependencies
  • the ecf script file and the corresponding job file

event

The purpose of an event is to signal partial completion of a task and to be able to trigger another job which is waiting for this partial completion.

Only tasks can have events and they can be considered as an attribute of a task .

There can be many events and they are displayed as nodes.

The event is updated by placing the --event child command in an ecf script.

An event has a number and possibly a name. If it is only defined as a number, its name is the text representation of the number without leading zeroes.

For python see: ecflow.Event and ecflow.Node.add_event. For text BNF see event.

If the event child command results in a zombie, the default action is for the server to fob; this allows the ecflow_client command to exit normally (i.e. without any errors). This default can be overridden by using a zombie attribute.

Events can be referenced in trigger and complete expressions.
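
A minimal Python sketch of adding an event and triggering on it; the names are hypothetical:

import ecflow

defs = ecflow.Defs()
suite = defs.add_suite("s1")

producer = suite.add_task("producer")
producer.add_event("data_ready")          # set in the job with something like: ecflow_client --event=data_ready

consumer = suite.add_task("consumer")
consumer.add_trigger("producer:data_ready == set")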

extern

This allows an external node to be used in a trigger expression.

All nodes in triggers must be known to the ecflow_server by the end of the load command. No cross-suite dependencies are allowed unless the names of tasks outside the suite are declared as external. An external trigger reference is considered unknown if it is not defined when the trigger is evaluated. You are strongly advised to avoid cross-suite dependencies.

Families and suites that depend on one another should be placed in a single suite. If you think you need cross-suite dependencies, you should consider merging the suites together and having each as a top-level family in the merged suite. For BNF see extern.

family

A family is an organisational entity that is used to provide hierarchy and grouping. It consists of a collection of tasks and families.

Typically you place tasks that are related to each other inside the same family, analogous to the way you create directories to contain related files. For python see ecflow.Family . For BNF see family

It serves as an intermediate node in a suite definition .

halted

Is an ecflow_server state. See server states.

hybrid clock

A hybrid clock is a complex notion: the date and time are not connected.

The date has a fixed value during the complete execution of the suite . This will be mainly used in cases where the suite does not complete in less than 24 hours. This guarantees that all tasks of this suite are using the same date . On the other hand, the time follows the time of the machine.

Hence the date never changes unless specifically altered or unless the suite restarts, either automatically or from a begin command.

Under a hybrid clock any node held by a date , day  or cron dependency will be set to complete at the beginning of the suite. (i.e without its job ever running). Otherwise the suite would never complete .

inlimit

The inlimit attribute works in conjunction with limit / ecflow.Limit to provide simple load management.

inlimit is added to the node that needs to be limited.

For python see ecflow.InLimit and ecflow.Node.add_inlimit. For text BNF see inlimit.
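
A hedged Python sketch combining a limit with an inlimit; the suite, family and limit names are hypothetical:

import ecflow

defs = ecflow.Defs()
suite = defs.add_suite("s1")
suite.add_limit("disk_tasks", 2)          # at most 2 jobs may hold this limit at once

family = suite.add_family("archive")
family.add_inlimit("disk_tasks", "/s1")   # tasks under this family consume the limit defined on /s1
for i in range(5):
    family.add_task("t%d" % i)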

job creation

Job creation or task invocation can be initiated manually via ecflowview, but also by the ecflow_server during scheduling when a task (and all of its parent nodes) is free of its dependencies.

The process of job creation includes:

o Generating a unique password ECF_PASS, which is placed in the ecf script during pre-processing. See head.h

o Locating the ecf script file corresponding to the task in the suite definition. See ecf file location algorithm

o Pre-processing the contents of the ecf script file

The steps above transform an ecf script into a job file that can be submitted, by performing variable substitution on the ECF_JOB_CMD variable and invoking the command.

The running jobs will communicate back to the ecflow_server by calling child commands.

This causes status changes on the nodes in the ecflow_server, and flags can be set to indicate various events.

If a task is to be treated as a dummy task (i.e. used as a scheduling task) and is not meant to be run, then a variable named ECF_DUMMY_TASK can be added.

             task.add_variable("ECF_DUMMY_TASK", "")
 
job file

The job file is created by the ecflow_server during job creation, using the ECF_TRYNO variable.

It is derived from the ecf script after expanding the pre-processing directives.

It has the form <task name>.job<ECF_TRYNO>, i.e. t1.job1.

Note that job creation checking will create a job file with an extension of zero, i.e. '.job0'. See ecflow.Defs.check_job_creation.

When the job is run, the output file has the ECF_TRYNO as the extension, i.e. t1.1, where 't1' represents the task name and '1' the ECF_TRYNO.

label

A label has a name and a value, and is a way of displaying information in ecflowview.

By placing label child commands in the ecf script, the user can be informed about progress in ecflowview.

If the label child command results in a zombie, the default action is for the server to fob; this allows the ecflow_client command to exit normally (i.e. without any errors). This default can be overridden by using a zombie attribute.

For python see ecflow.Label and ecflow.Node.add_label. For text BNF see label.
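
A minimal Python sketch of adding a label; the names and initial value are hypothetical:

import ecflow

defs = ecflow.Defs()
task = defs.add_suite("s1").add_task("t1")

# the job can later update this with something like: ecflow_client --label=progress "step 3 of 10"
task.add_label("progress", "not started")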

late

Define a tag for a node to be late.

Suites cannot be late, but you can define a late tag for submitted in a suite, to be inherited by the families and tasks. When a node is classified as being late, the only action the ecflow_server takes is to set a flag. ecflowview will display this alongside the node name as an icon (and optionally pop up a window).

For python see ecflow.Late and ecflow.Node.add_late. For text BNF see late.
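
A hedged Python sketch of adding a late attribute, assuming the Late.submitted/active/complete setters shown; the times are hypothetical:

import ecflow

defs = ecflow.Defs()
task = defs.add_suite("s1").add_task("t1")

late = ecflow.Late()
late.submitted(0, 15)       # flag as late if still submitted 15 minutes after being queued
late.active(10, 0)          # flag as late if not active by 10:00
late.complete(2, 0, True)   # flag as late if not complete within 2 hours of becoming active (relative)
task.add_late(late)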

limit

Limits provide simple load management by limiting the number of tasks submitted by a specific ecflow_server . Typically you either define limits on suite level or define a separate suite to hold limits so that they can be used by multiple suites.

The limit max value can be changed on the command line:

>ecflow_client --alter change limit_max <limit-name> <new-limit-value> <path-to-limit>
>ecflow_client --alter change limit_max limit 2 /suite

It can also be changed in python:

#!/usr/bin/env python2.7
import ecflow
try:
   ci = ecflow.Client()
   ci.alter("/suite","change","limit_max","limit", "2")
except RuntimeError, e:
   print "Failed: " + str(e)
                                     

For python see ecflow.Limit and ecflow.Node.add_limit . For BNF see limit and inlimit

manual page

Manual pages are part of the ecf script .

This is to ensure that the manual page is updated when the ecf script is updated. The manual page is a very important operational tool, allowing you to view a description of a task and possibly solutions to common problems. Pre-processing is used to extract the manual page from the script file, and it is visible in ecflowview. The manual page is the text contained within the %manual and %end directives. It can be seen using the manual button in ecflowview.

The text in the manual page is not included in the job file.

There can be multiple manual sections in the same ecf script file. When viewed they are simply concatenated. It is good practice to modify the manual pages when the script changes.

The manual page may contain %include directives.

meter

The purpose of a meter is to signal proportional completion of a task and to be able to trigger another job which is waiting on this proportional completion.

The meter is updated by placing the --meter child command in an ecf script.

For python see: ecflow.Meter and ecflow.Node.add_meter. For text BNF see meter.

If the meter child command results in a zombie, the default action is for the server to fob; this allows the ecflow_client command to exit normally (i.e. without any errors). This default can be overridden by using a zombie attribute.

Meters can be referenced in trigger and complete expressions.
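
A minimal Python sketch of adding a meter and triggering on it; the names are hypothetical:

import ecflow

defs = ecflow.Defs()
suite = defs.add_suite("s1")

model = suite.add_task("model")
model.add_meter("step", 0, 240)           # updated in the job with something like: ecflow_client --meter=step 120

post = suite.add_task("post")
post.add_trigger("model:step >= 120")     # start post-processing half way through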

node

suite, family and task form a hierarchy, where the suite serves as the root of the hierarchy, families provide the intermediate nodes, and tasks provide the leaves.

Collectively, suite, family and task can be referred to as nodes.

For python see ecflow.Node .

pre-processing

Pre-processing takes place during job creation and acts on directives specified in ecf script file.

This involves:

o expanding any include file directives, i.e. similar to 'C' language pre-processing

o removing comments and manual directives

o performing variable substitution

queued

Is a node status.

After the begin command, tasks without a defstatus are placed into the queued state.

real clock

A suite using a real clock will have its clock matching the clock of the machine. Hence the date advances by one day at midnight.

repeat

Repeats provide looping functionality. There can only be a single repeat on a node.

repeat day step [ENDDATE] # only for suites

repeat integer VARIABLE start end [step]

repeat enumerated VARIABLE first [second [third ...]]

repeat string VARIABLE str1 [str2 ...]

repeat file VARIABLE filename

repeat date VARIABLE yyyymmdd yyyymmdd [delta]

The repeat VARIABLE can be used in trigger and complete expressions.

If a "repeat date" VARIABLE is used in a trigger expression, then date arithmetic is used when the expression uses addition or subtraction, i.e.:

defs = ecflow.Defs()
s1 = defs.add_suite("s1")
t1 = s1.add_task("t1").add_repeat(ecflow.RepeatDate("YMD", 20090101, 20091231, 1))
t2 = s1.add_task("t2").add_trigger("t1:YMD - 1 eq 20081231")
assert t2.evaluate_trigger(), "Expected trigger to evaluate. 20090101 - 1 == 20081231"

 

running

Is an ecflow_server state. See server states.

scheduling

The ecflow_server is responsible for task scheduling.

It will check dependencies in the suite definition every minute. If these dependencies are free, the ecflow_server will submit the task. See job creation .

server states

The following table reflects the ecflow_server capabilities in the different states:

State    | User Request | Task Request | Job Scheduling | Auto-Check-pointing
running  | yes          | yes          | yes            | yes
shutdown | yes          | yes          | no             | yes
halted   | yes          | no           | no             | no

shutdown

Is an ecflow_server state. See server states.

status

Each node in the suite definition has a status.

Status reflects the state of the node. In ecflowview the background colour of the text reflects the status.

Task statuses are: unknown, queued, submitted, active, complete, aborted and suspended.

ecflow_server statuses are: shutdown, halted and running; this is shown on the root node in ecflowview.

submitted

Is a node status.

When the task's dependencies are resolved/free, the ecflow_server places the task into the submitted state. However, if the ECF_JOB_CMD fails, the task is placed into the aborted state.

suite

A suite is an organisational entity. It serves as the root node in a suite definition. It should be used to hold a set of jobs that achieve a common function. It can be used to hold user variables that are common to all of its children.

Only a suite node can have a clock.

It is a collection of families, variables, repeats and a single clock definition. For a complete list of attributes look at the BNF for suite. For python see ecflow.Suite.

suite definition

The suite definition is the hierarchical node tree.

It describes how your tasks run and interact.

It can be built up using the text definition file or the Python API.

Once the definition is built, it can be loaded into the ecflow_server and started. It can be monitored by ecflowview.

suspended

Is a node state. A node can be placed into the suspended state via a defstatus or via ecflowview.

A suspended node, including any of its children, cannot take part in scheduling until the node is resumed.

task

A task represents a job that needs to be carried out. It serves as a leaf node in a suite definition.

Only tasks can be submitted.

A job inside a task ecf script should generally be re-entrant so that no harm is done by rerunning it, since a task may be automatically submitted more than once if it aborts.

For python see ecflow.Task . For text BNF see task

time

This defines a time dependency for a node.

Time is expressed in the format [h]h:mm. Only numeric values are allowed. There can be multiple time dependencies for a node, but overlapping times may cause unexpected results. To define a series of times, specify the start time, end time and a time increment. If the start time begins with ‘+’, times are relative to the beginning of the suite or, in repeated families, relative to the beginning of the repeated family.

If the time the job takes to complete is longer than the interval, a time 'slot' is missed, e.g.:

time 10:00 20:00 01:00

if the 10:00 run takes more than an hour, the 11:00 run will never occur.

For python see ecflow.Time and ecflow.Node.add_time. For BNF see time.
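
A hedged Python sketch of adding time dependencies, assuming ecflow.Node.add_time accepts the string forms shown; the task name is hypothetical:

import ecflow

defs = ecflow.Defs()
task = defs.add_suite("s1").add_task("t1")

task.add_time("10:00")                # a single time slot
task.add_time("+00:05 20:00 00:15")   # 5 minutes after suite begin, then every 15 minutes until 20:00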

today

Like time, but if the suite's begin time is past the time given for the "today" command, the node is free to run (as far as the time dependency is concerned).

For example

task x
   today 10:00

If we begin or re-queue the suite at 9:00 am, then the task is held until 10:00 am. However, if we begin or re-queue the suite at 11:00 am, the task is run immediately.

Now let us look at time:

task x
   time 10:00

If we begin or re-queue the suite at 9:00 am, then the task is held until 10:00 am. If we begin or re-queue the suite at 11:00 am, the task is still held.

If the time the job takes to complete is longer than the interval, a "slot" is missed, e.g.:

today 10:00 20:00 01:00

if the 10:00 run takes more than an hour, the 11:00 run will never occur.

For python see ecflow.Today. For text BNF see today.
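
A minimal Python sketch of adding a today dependency; the task name is hypothetical:

import ecflow

defs = ecflow.Defs()
task = defs.add_suite("s1").add_task("t1")

# free to run as soon as the suite's clock has passed 10:00 on the current day
task.add_today("10:00")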

trigger

A trigger defines a dependency for a task or family.

There can be only one trigger dependency per node, but it can be a complex boolean expression of the status of several nodes. Triggers should be avoided on suites. A node with a trigger can only be activated when its trigger has expired. A trigger holds the node as long as the trigger's expression evaluates to false.

Trigger evaluation occurs whenever a child command communicates with the server, i.e. whenever there is a state change in the suite definition.

The keywords in trigger expressions are: unknown , suspended , complete , queued , submitted , active , aborted and clear and set for event status.

Triggers can also reference Node attributes like event , meter , variable , repeat and generated variables. Trigger evaluation for node attributes uses integer arithmetic:

  • event has the integer value 0 (clear) or 1 (set)
  • meter values are integers hence they are used as is
  • variable value is converted to an integer, otherwise 0 is used. See example below
  • repeat string : We use the index values as integers. See example below
  • repeat enumerated : We use the index values as integers. See example below
  • repeat integer : Use the implicit integer values
  • repeat date : Use the date values as integers. Use of plus/minus on repeat date variable uses date arithmetic

Here are some examples

suite trigger_suite
   task a
      event EVENT
      meter METER 1 100 50
      edit  VAR_INT 12
      edit  VAR_STRING "captain scarlett"         # This is not convertible to an integer, if referenced will use '0'
   family f1
      edit SLEEP 2
      repeat string NAME a b c d e f              # This has values: a(0), b(1), c(2), d(3), e(4), f(5), i.e. the index
      family f2
         repeat integer VALUE 5 10                # This has values: 5,6,7,8,9,10
         family f3
            repeat enumerated red green blue      # red(0), green(1), blue(2)
            task t1
               repeat date DATE 19991230 20000102 # This has values: 19991230,19991231,20000101,20000102
         endfamily
      endfamily
   endfamily
   family f2
      task event_meter
          trigger /suite/a:EVENT == set and /suite/a:METER >= 30
      task variable
          trigger /suite/a:VAR_INT >= 12 and /suite/a:VAR_STRING == 0
      task repeat_string
          trigger /suite/f1:NAME >= 4
      task repeat_integer
          trigger /suite/f1/f2:VALUE >= 7
      task repeat_date
          trigger /suite/f1/f2/f3/t1:DATE >= 19991231
      task repeat_date2
          # Using plus/minus on a repeat DATE will use date arithmetic
          # Since the starting value of DATE is 19991230, this task will run
          # straight away
          trigger /suite/f1/f2/f3/t1:DATE - 1 == 19991229
   endfamily
endsuite

What happens when we have multiple node attributes of the same name, referenced in trigger expressions?

task foo
   event blah
   meter blah 0 200 50
   edit  blah 10
task bar
   trigger foo:blah >= 0

In this case ecFlow will use the following precedence: event, meter, user variable, repeat, and finally generated variables.

Hence in the example above the expression 'foo:blah >= 0' will reference the event.

For python see ecflow.Expression and ecflow.Node.add_trigger

unknown

Is a node status.

This is the default node status when a suite definition is loaded into the ecflow_server.

user commands

User commands are any client-to-server requests that are not child commands.

variable

ecFlow makes heavy use of different kinds of variables. There are several kinds of variables:

Environment variables: these are set in the UNIX shell before ecFlow starts. They control the ecflow_server and ecflow_client.

Suite definition variables: also referred to as user variables. These control the ecflow_server and ecflow_client, and are available for use in the job file.

Generated variables: these are generated within the suite definition node tree during job creation and are available for use in the job file.

Variables can be referenced in trigger and complete expressions. The value part of the variable should be convertible to an integer, otherwise a default value of 0 is used.

For python see ecflow.Node.add_variable . For BNF see variable

variable inheritance

When a variable is needed at job creation time, it is first sought in the task itself.

If it is not found in the task , it is sought from the task’s parent and so on, up through the node levels until found.

For any node , there are two places to look for variables.

Suite definition variables are looked for first, and then any generated variables.

variable substitution

Takes place during pre-processing or command invocation (i.e. ECF_JOB_CMD, ECF_KILL_CMD, etc.).

It involves searching each line of the ecf script file or command for the ECF_MICRO character, typically '%'.

The text between two % characters defines a variable, i.e. %VAR%.

This variable is searched for in the suite definition.

First the suite definition variables (sometimes referred to as user variables) are searched, then the repeat variable names, and finally the generated variables. If no variable is found, then the same search pattern is repeated up the node tree.

The value of the variable is replaced between the % characters.

If the micro characters are not paired, an error message is written to the log file and the task is placed into the aborted state.

If the variable is not found in the suite definition during pre-processing, then job creation fails, an error message is written to the log file, and the task is placed into the aborted state. To avoid this, variables in the ecf script can be defined as:

%VAR:replacement%

This is similar to %VAR% but if VAR is not found in the suite definition then ‘replacement’ is used.

virtual clock

Like a real clock, except that when the ecflow_server is suspended (i.e. shutdown or halted), the suite's clock is also suspended.

Hence it will honour relative times in cron, today and time dependencies. It is possible to have a combination of hybrid/real and virtual clocks.

It is most useful when we want complete adherence to time-related dependencies, at the expense of being out of sync with the system time.

zombie

Zombies are running jobs that fail authentication when communicating with the ecflow_server.

Child commands (init, event, meter, label, abort, complete) are placed in the ecf script file and are used to communicate with the ecflow_server.

The ecflow_server authenticates each connection attempt made by a child command. Authentication can fail for a number of reasons; see zombie type.

When authentication fails, the job is considered to be a zombie. The ecflow_server will keep a note of the zombie for a period of time before it is automatically removed. However, the removed zombie may well re-appear, because each child command will continue attempting to contact the ecflow_server for 24 hours (this is configurable, see ECF_TIMEOUT on ecflow_client).

For python see ecflow.ZombieAttr , ecflow.ZombieUserActionType

There are several types of zombies; see zombie type and ecflow.ZombieType.

zombie attribute

The zombie attribute defines how a zombie should be handled in an automated fashion. Very careful consideration should be taken before this attribute is added, as it may hide a genuine problem. It can be added to any node, but is best defined at the suite or family level. If there is no zombie attribute, the default behaviour for the init, complete, wait and abort child commands is to block, whereas for label, event and meter the default behaviour is to fob (from version 4.0.4; previously all child commands blocked).

To add a zombie attribute in python, please see: ecflow.ZombieAttr
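
A hedged Python sketch of adding a zombie attribute, assuming the ecflow.ZombieAttr constructor takes a zombie type, a list of child commands, a user action and a lifetime in seconds:

import ecflow

defs = ecflow.Defs()
suite = defs.add_suite("s1")

# for 'ecf' zombies, automatically fob the label, event and meter child commands
child_cmds = [ecflow.ChildCmdType.label, ecflow.ChildCmdType.event, ecflow.ChildCmdType.meter]
suite.add_zombie(ecflow.ZombieAttr(ecflow.ZombieType.ecf, child_cmds,
                                   ecflow.ZombieUserActionType.fob, 300))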

zombie type

See zombie and class ecflow.ZombieAttr for further information. There are several types of zombies:

  • path
    • The task path cannot be found in the server, because the node tree was deleted, replaced or reloaded, the server crashed, or the backup server does not have the node tree.
    • Jobs could have been created via server scheduling or by user commands.
  • user

    The job was created by user commands such as rerun or re-queue. User zombies are differentiated from server (scheduled) zombies since they are automatically created when the force option is used and we have tasks in active or submitted states.

  • ecf

    Jobs are created as part of the normal scheduling, but:

    • The server crashed (or was terminated and restarted) and the recovered check point file is out of date.
    • A task is repeatedly re-run; earlier copies will not be remembered.
    • The job was sent by another ecflow_server, but cannot talk to the original ecflow_server.
    • Network glitches / the network is down.

The type of the zombie is not fixed and may change.
