Host Management

ecFlow is ultimately a framework for executing tasks, but task execution requires a context. pyflow uses a Host object to supply this context, and therefore requires a host to be defined before it will generate any executable nodes in the tree. The host can be set at any level (Suite, Family or Task) and is inherited unless overridden.
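
The inheritance rule can be pictured with a small stand-alone sketch. The Node class and its resolve_host method are hypothetical illustrations and not part of pyflow; they only model a host set on a parent being used by its children unless one of them overrides it.

```python
from typing import Optional


class Node:
    """Hypothetical stand-in for a Suite, Family or Task (not a pyflow class)."""

    def __init__(self, name: str, parent: Optional["Node"] = None,
                 host: Optional[str] = None):
        self.name = name
        self.parent = parent
        self._host = host  # None means "inherit from the parent"

    def resolve_host(self) -> str:
        """Walk up the tree until a node that defines a host is found."""
        node: Optional[Node] = self
        while node is not None:
            if node._host is not None:
                return node._host
            node = node.parent
        raise ValueError(f"no host defined for {self.name} or its ancestors")


suite = Node("suite", host="local")
family = Node("family", parent=suite, host="hpc")  # overrides the suite host
inherited = Node("t1", parent=suite)               # picks up "local" from the suite
overridden = Node("t2", parent=family)             # picks up "hpc" from the family
```

In this model a node with no host anywhere above it raises an error, which mirrors the requirement that a host be defined before executable nodes are generated.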

If the default behaviour of ecFlow is required, and task execution is being managed explicitly, the host may be set to NullHost() at the Suite level. This will suppress all host-related behaviour inside pyflow.

For task handling, it is important that the ecflow_client is configured (via appropriate environment variables) and that it is correctly called to trigger changes of state in the server. Further, any and all errors that may occur in a script must be correctly caught and reported to the ecFlow server.
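
These requirements can be sketched as a job wrapper. The wrap_script helper below is hypothetical and is not how pyflow builds its scripts; it only shows one plausible shape for a bash job that calls ecflow_client with the --init, --complete and --abort options. It assumes that ECF_HOST, ECF_PORT, ECF_NAME, ECF_PASS and ECF_TRYNO are already exported, and it does not handle signals.

```python
def wrap_script(body: str) -> str:
    """Return a bash script that reports start, success and failure to ecFlow.

    Hypothetical helper for illustration only. The ECF_* environment
    variables are assumed to be exported before the script runs, so that
    ecflow_client can reach the server.
    """
    lines = [
        "#!/bin/bash",
        "set -eu                      # stop on errors and on unset variables",
        "on_error() {",
        "    trap - ERR EXIT          # avoid re-entering the handler",
        "    ecflow_client --abort    # tell the server that the task failed",
        "    exit 1",
        "}",
        "trap on_error ERR EXIT",
        'ecflow_client --init="$$"    # register the job, passing its process id',
        body.rstrip("\n"),
        "trap - ERR EXIT              # success path: disarm the handler",
        "ecflow_client --complete",
    ]
    return "\n".join(lines) + "\n"


script = wrap_script('echo "hello from the task"')
print(script)
```

Trapping EXIT as well as ERR means that an early exit from the body is also reported as an abort rather than leaving the task in an active state.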

Host objects must also know how to transfer data to and from the host in order to implement the Deployable Resources functionality.

Host Arguments

Host classes have many configurable options. The options listed here configure the base Host class and are therefore available for all host classes. Other than name, all of them are optional keyword arguments with plausible defaults.

  • name - The name used for the host. Required (non-keyword argument).

  • hostname - The hostname to run the task on. Defaults to name if not supplied.

  • scratch_directory - The path in which tasks will be run, unless otherwise specified. Also to be used within suites when a scratch location is needed.

  • log_directory - The directory to use for script output. Defaults to ECF_HOME, but may need to be changed on systems with scheduling systems to make the output visible to the ecFlow server.

  • limit - How many tasks can run on the node simultaneously.

  • extra_paths - Paths that are to be added to PATH on the host.

  • extra_variables - A dictionary of additional ecFlow variables that should be set to configure the host (e.g. {'SCHOST': 'hpc'}).

  • environment_variables - Additional environment variables to export into all scripts.

  • modules - Modules to load with module load.

  • module_purge - Whether to run a module purge command before loading any modules. Default False.

  • module_source - The shell script to source to initialise the module system. Default None.

  • ecflow_path - The directory containing the ecflow_client executable.

  • label_host - Whether a Label should be created in the tree when the host property is changed on a node. Default True.
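
As a rough illustration of what the module, path and environment options imply for a generated script, the sketch below assembles shell preamble lines from them. The build_preamble helper is hypothetical, the paths are example values, and the exact lines and ordering that pyflow emits may differ; the only ordering taken from the list above is that the purge happens before any module is loaded.

```python
import shlex
from typing import Dict, List, Optional, Sequence


def build_preamble(modules: Sequence[str] = (),
                   module_purge: bool = False,
                   module_source: Optional[str] = None,
                   extra_paths: Sequence[str] = (),
                   environment_variables: Optional[Dict[str, str]] = None) -> List[str]:
    """Hypothetical helper: shell lines implied by the host options above."""
    lines: List[str] = []
    if module_source is not None:
        lines.append(f"source {shlex.quote(module_source)}")  # initialise modules
    if module_purge:
        lines.append("module purge")  # runs before any module is loaded
    lines += [f"module load {shlex.quote(m)}" for m in modules]
    if extra_paths:
        # Assumption: extra paths are prepended so that they take precedence.
        lines.append("export PATH=" + ":".join(extra_paths) + ":$PATH")
    for key, value in (environment_variables or {}).items():
        lines.append(f"export {key}={shlex.quote(value)}")
    return lines


preamble = build_preamble(
    modules=["python3"],
    module_purge=True,
    module_source="/etc/profile.d/modules.sh",  # example path, site specific
    extra_paths=["/opt/tools/bin"],             # example path
    environment_variables={"OMP_NUM_THREADS": "4"},
)
print("\n".join(preamble))
```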

Existing Host Classes

A number of host classes have already been defined. These can be extended, or alternatives provided.

LocalHost

This is essentially a trivial host. It runs tasks as background processes on the current node, i.e. on the ecFlow server host and as the same user as the server. Beyond its use in examples, it is very useful for tasks that update labels, meters, events and variables, because they run on a node that is certain to have a working ecflow_client and incur no job queuing delay.

[2]:
host = pf.LocalHost()

SSHHost

Runs a script on a remote host accessed by SSH. The name argument is treated as the target hostname unless the hostname keyword argument is explicitly supplied. The connection is made as the user that generated the pyflow suite, unless the user argument is supplied.

The SSHHost is special in that it does not require the ecflow_client to be installed on the remote host and does not require the presence of any shared filesystems or log servers to make output logs visible to the user. All of the ecflow_client commands required are executed on the server side, and the script output is piped back through the SSH command.

For these connections to be established, it is necessary that the ecflow server is configured to have SSH access to the target systems using SSH keys. Further, as this requires an SSH connection to be maintained for each of the running commands, it imposes a practical limit on the number of commands that can be run simultaneously on any remote host. There may be value in setting up SSH connections that persist across multiple commands, by making use of the ControlMaster, ControlPath and ControlPersist options in the ssh config file.
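
The connection-sharing options can also be passed on the command line rather than in the ssh config file. The sketch below builds such an ssh invocation; ssh_command is a hypothetical helper, and the socket location and the persistence time are example values to adapt to the site.

```python
from typing import List


def ssh_command(host: str, remote_command: str,
                control_dir: str = "~/.ssh",
                persist: str = "10m") -> List[str]:
    """Build an ssh argv that shares one connection per user, host and port."""
    return [
        "ssh",
        "-o", "ControlMaster=auto",  # reuse a master connection, or create one
        "-o", f"ControlPath={control_dir}/cm-%r@%h:%p",  # one socket per user@host:port
        "-o", f"ControlPersist={persist}",  # keep the master open while idle
        host,
        remote_command,
    ]


argv = ssh_command("dhs9999", "hostname")
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run. With these options, only the first command pays the cost of establishing and authenticating the connection.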

[3]:
host = pf.SSHHost('dhs9999', user='max', scratch_directory='/data/a_mounted_filesystem/tmp')

The SSHHost class also accepts the optional arguments indirect_host and indirect_user. If indirect_host is supplied then a two-hop connection is made: a connection is made to the indirect_host, and a further SSH connection is then made from there to the real host. Note that this is not the same as configuring a ProxyCommand for a normal SSH connection, because the credentials for the second hop are held on the intermediate system. indirect_user defaults to user if it is not supplied.

[4]:
host = pf.SSHHost('cloud-mvr001',
                  user='mover-user',
                  indirect_host='cloud-gateway',
                  indirect_user='cloud-user')
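
The shape of a two-hop connection can be sketched as a nested ssh invocation. The two_hop_command helper is hypothetical and is not pyflow's implementation; it only illustrates why the second hop authenticates from the intermediate system.

```python
import shlex
from typing import List, Optional


def two_hop_command(host: str, user: str, command: str,
                    indirect_host: str,
                    indirect_user: Optional[str] = None) -> List[str]:
    """Hypothetical sketch of a two-hop SSH invocation (not pyflow's code)."""
    if indirect_user is None:
        indirect_user = user  # documented default
    # The inner ssh runs on the intermediate system, so it authenticates with
    # the credentials stored there, not with those of the originating machine.
    inner = ["ssh", f"{user}@{host}", command]
    return ["ssh", f"{indirect_user}@{indirect_host}", shlex.join(inner)]


argv = two_hop_command("cloud-mvr001", "mover-user", "hostname",
                       indirect_host="cloud-gateway",
                       indirect_user="cloud-user")
print(argv)
```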

PBSHost

Connects to a remote host by SSH and submits a job to the batch scheduling system. As the task runs asynchronously on the remote system, the ecflow_client must be available there; if it is not at the default location, its directory should be configured with the ecflow_path keyword argument.

It is anticipated that for real use this class will be subclassed to add and configure site-specific functionality (such as knowledge of, and handling of, queues).

The log_directory will probably need to be changed, and the ECF_LOGHOST and ECF_LOGPORT variables are likely to be needed so that a log server can make the job output fully accessible.

SLURMHost

This executes scripts on a remote system by connecting over SSH and submitting to the SLURM job scheduling system. It is closely analogous to the PBSHost.

Limits

Host objects accept an argument limit=. This can be used to construct a limit (preferably in a sensible location within the suite). Once this has been set up, any Task that is created using this host object is automatically added to the limit for the given host.

Note that this implies that the same host object should be used to configure Tasks throughout the suite, rather than separate host objects that merely refer to the same host.

[6]:
with CourseSuite('limits', host=pf.LocalHost(limit=3)) as s:

    with pf.Family('limits'):
        s.host().build_limits()

    pf.Task('t1', script='I am limited')

s
[6]:
suite limits
  defstatus suspended
  edit ECF_FILES '/path/to/scratch/files/limits'
  edit ECF_HOME '/path/to/scratch/out'
  edit ECF_JOB_CMD 'bash -c 'export ECF_PORT=%ECF_PORT%; export ECF_HOST=%ECF_HOST%; export ECF_NAME=%ECF_NAME%; export ECF_PASS=%ECF_PASS%; export ECF_TRYNO=%ECF_TRYNO%; export PATH=/usr/local/apps/ecflow/%ECF_VERSION%/bin:$PATH; ecflow_client --init="$$" && %ECF_JOB% && ecflow_client --complete || ecflow_client --abort ' 1> %ECF_JOBOUT% 2>&1 &'
  edit ECF_KILL_CMD 'pkill -15 -P %ECF_RID%'
  edit ECF_STATUS_CMD 'true'
  edit ECF_OUT '%ECF_HOME%'
  label exec_host "localhost"
  family limits
    limit localhost 3
  endfamily
  task t1
    inlimit /limits/limits:localhost
endsuite

Job Characteristics

In pyflow, a task is generated as a synthesis of multiple pieces of information:

  • The Task object in the suite - when to run

  • The Script object (script attribute on Task) - what to run

  • The Host object - how to run

The combination of these three components provides the information to determine when, what, and how a task should be executed. The Host object is important as it provides two major components:

  1. A mechanism by which a task should be executed. This reduces to the ECF_JOB_CMD and associated machinery.

  2. Preamble and Postamble material that is used for constructing the script to execute.

Unfortunately, the breakdown is not nearly so clear in real life. Consider the case of one of the HPC machines. We can:

  • Run a task on the head node as a simple SSHHost

  • Submit a serial, fractional or parallel job

  • Submit jobs using various (machine specific) resource requirements

This is a problem. Conceptually, properties such as the number of cores and nodes, or whether to use hyperthreading or hugepages, are properties of the Task, but they depend very strongly on the Host.

Currently all properties that determine the execution process must belong to the Host. These can be parameterised to use ecFlow variables that are set on Families or Tasks, but this is a bit of a hack. We would like this parameterisation to only be needed if those properties should be changeable at runtime (e.g. by the operators).
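
The parameterisation described above can be illustrated with a simplified imitation of ecFlow's %VAR% and %VAR:default% substitution. The substitute function, the template and the variable names NODES and CORES are all hypothetical; in a real suite the substitution is performed by ecFlow itself, and the template would be whatever the host class uses for its submission settings.

```python
import re
from typing import Mapping

_VARIABLE = re.compile(r"%(\w+)(?::([^%]*))?%")


def substitute(template: str, variables: Mapping[str, str]) -> str:
    """Imitate ecFlow's %VAR% and %VAR:default% substitution (simplified)."""
    def replace(match: "re.Match") -> str:
        name, default = match.group(1), match.group(2)
        if name in variables:
            return variables[name]
        if default is not None:
            return default
        raise KeyError(f"variable {name} is not defined")
    return _VARIABLE.sub(replace, template)


# Hypothetical resource line owned by the Host; NODES and CORES are
# illustrative variable names, not ones that pyflow defines.
HOST_TEMPLATE = "nodes=%NODES:1% cores=%CORES:1%"

serial = substitute(HOST_TEMPLATE, {})                                # no Task-level variables
parallel = substitute(HOST_TEMPLATE, {"NODES": "4", "CORES": "128"})  # variables set on a Task
```

The Host owns the template, while each Task or Family only supplies values, which is why the same values remain changeable at runtime.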