Running in the cloud

Toil jobs can be run on a variety of cloud platforms. Of these, Amazon Web Services is currently the best-supported solution.

On all cloud providers, it is recommended that you run long-running jobs on remote systems using a terminal multiplexer such as screen or tmux.

Screen

Simply type screen to open a new screen session. Later, type ctrl-a and then d to disconnect from it, and run screen -r to reconnect to it. Commands running under screen will continue running even when you are disconnected, allowing you to unplug your laptop and take it home without ending your Toil jobs. See Toil Provisioner for complications that can occur when using screen within the Toil Appliance.
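
A typical session with a long-running workflow might look like the following (the file:my-jobstore job store is an illustrative placeholder):

$ screen                          # open a new screen session
$ python HelloWorld.py file:my-jobstore
# press ctrl-a, then d, to detach; the workflow keeps running
$ screen -r                       # later, reconnect to check on the run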

Autoscaling

The fastest way to get started with Toil in a cloud environment is by using Toil’s autoscaling capabilities to handle node provisioning. You can do this by using the Toil Provisioner or CGCloud.

To begin, launch a Toil leader instance using the provisioner of your choice.

Once we have our leader instance launched, the only remaining step is to kick off our Toil run with special autoscaling options. Now might be an opportune time to read up on Toil’s extensive configuration options, which you can list by passing --help to your toil script invocation.
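
For example, to list every option the quickstart’s HelloWorld.py script accepts:

$ python HelloWorld.py --help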

There are a number of autoscaling-specific options, but only two options are strictly necessary to enable autoscaling: --provisioner=aws and --nodeType=<>. These options, respectively, tell Toil that we are running on AWS (currently the only supported autoscaling environment) and which instance type to use for the Toil worker instances.
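
As a minimal sketch (the m3.large node type is an illustrative assumption, and depending on your setup the Mesos options described under Running on AWS below may also be needed), an autoscaled run might be kicked off like this:

$ python HelloWorld.py \
       --provisioner=aws \
       --nodeType=m3.large \
       aws:us-west-2:my-aws-jobstore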

Preemptability

Toil can run on a heterogeneous cluster of both preemptable and non-preemptable nodes. Our preemptable node type can be set by using the --preemptableNodeType=<> flag. While individual jobs can each explicitly specify whether or not they should be run on preemptable nodes via the boolean preemptable resource requirement, the --defaultPreemptable flag will allow jobs without a preemptable requirement to run on preemptable machines.

We can set the maximum number of preemptable and non-preemptable nodes via the flags --maxNodes=<> and --maxPreemptableNodes=<>.

Specify Preemptability Carefully

Ensure that your choices for --maxNodes=<> and --maxPreemptableNodes=<> make sense for your workflow and won’t cause it to hang: if the workflow requires preemptable nodes, set --maxPreemptableNodes to some non-zero value, and if any job requires non-preemptable nodes, set --maxNodes to some non-zero value.

Finally, the --preemptableCompensation flag can be used to handle cases where preemptable nodes may not be available but are required for your workflow.
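
Putting these flags together, a run on a mixed cluster might look like the following sketch (the m3.large node type, the 0.20 spot bid appended to the preemptable node type, and the node counts are illustrative assumptions; pass --help to confirm the exact syntax your Toil version expects):

$ python HelloWorld.py \
       --provisioner=aws \
       --nodeType=m3.large \
       --preemptableNodeType=m3.large:0.20 \
       --defaultPreemptable \
       --maxNodes=2 \
       --maxPreemptableNodes=10 \
       aws:us-west-2:my-aws-jobstore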

Using Mesos with Toil on AWS

The Mesos master and agent processes bind to the private IP addresses of their EC2 instance, so be sure to use the master’s private IP when specifying --mesosMaster. Using the public IP will prevent the nodes from properly discovering each other.
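
For example, if the master’s private IP were 10.0.0.5 (an illustrative address), you would pass that address rather than the instance’s public one:

$ python HelloWorld.py \
       --batchSystem=mesos \
       --mesosMaster=10.0.0.5:5050 \
       aws:us-west-2:my-aws-jobstore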

Running on AWS

See Amazon Web Services to get set up for running on AWS.

Having followed the Quickstart: A simple workflow guide, you can run your HelloWorld.py script on a distributed cluster just by modifying the run command. Since our cluster is distributed, we’ll use the aws job store, which uses a combination of one S3 bucket and a couple of SimpleDB domains. This gives every node in the cluster access to the job store, which would not be possible with the file job store, since that relies on a file system locally mounted on the leader.

Copy HelloWorld.py to the leader node, and run:

$ python HelloWorld.py \
       --batchSystem=mesos \
       --mesosMaster=master-private-ip:5050 \
       aws:us-west-2:my-aws-jobstore

Alternatively, to run a CWL workflow:

$ cwltoil --batchSystem=mesos \
        --mesosMaster=master-private-ip:5050 \
        --jobStore=aws:us-west-2:my-aws-jobstore \
        example.cwl \
        example-job.yml

When running a CWL workflow on AWS, input files can be provided either on the local file system or in S3 buckets using s3:// URL references. Final output files will be copied to the local file system of the leader node.
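
As a sketch, a job file pointing a File input at an object in S3 could be written like this (the input_file parameter name, bucket, and key are hypothetical, and depending on your CWL version the field may be location rather than path):

$ cat > example-job.yml <<'EOF'
input_file:
  class: File
  path: s3://my-bucket/inputs/reads.txt
EOF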

Running on Azure

See Azure to get set up for running on Azure. This section assumes that you are SSHed into your cluster’s leader node.

The Azure templates do not create a shared filesystem; you need to use the azure job store, which requires you to create an Azure storage account. You can store multiple job stores in a single storage account.

To create a new storage account, if you do not already have one:

  1. Navigate to https://portal.azure.com/#create/Microsoft.StorageAccount in your browser.
  2. If necessary, log into the Microsoft Account that you use for Azure.
  3. Fill out the presented form. The Name for the account, notably, must be a 3-to-24-character string of lowercase letters and numbers that is globally unique. For Deployment model, choose Resource manager. For Resource group, choose or create a resource group different from the one in which you created your cluster. For Location, choose the same region that you used for your cluster.
  4. Press the Create button. Wait for your storage account to be created; you should get a notification in the notifications area at the upper right when that is done.

Once you have a storage account, you need to authorize the cluster to access the storage account by giving it the access key. To find your storage account’s access key:

  1. When your storage account has been created, open it up and click the “Settings” icon.
  2. In the Settings panel, select Access keys.
  3. Select the text in the Key1 box and copy it to the clipboard, or use the copy-to-clipboard icon.

You then need to share the key with the cluster. To do this temporarily, for the duration of an SSH or screen session:

  1. On the leader node, run export AZURE_ACCOUNT_KEY="<KEY>", replacing <KEY> with the access key you copied from the Azure portal.

To do this permanently:

  1. On the leader node, run nano ~/.toilAzureCredentials.

  2. In the editor that opens, navigate with the arrow keys, and give the file the following contents:

    [AzureStorageCredentials]
    <accountname>=<accountkey>
    

    Be sure to replace <accountname> with the name that you used for your Azure storage account, and <accountkey> with the key you obtained above. (If you want, you can have multiple accounts with different keys in this file by adding multiple lines. If you do this, be sure to leave the AZURE_ACCOUNT_KEY environment variable unset.)

  3. Press ctrl-o to save the file, and ctrl-x to exit the editor.

Once that’s done, you are ready to execute a job, storing your job store in that Azure storage account. Assuming you followed the Quickstart: A simple workflow guide above, have created an Azure storage account, and have placed the storage account’s access key on the cluster, you can run the HelloWorld.py script by doing the following:

  1. Place your script on the leader node, either by downloading it from the command line or by typing or pasting it into a command-line editor.

  2. Run the command:

    $ python HelloWorld.py \
           --batchSystem=mesos \
           --mesosMaster=10.0.0.5:5050 \
           azure:<accountname>:hello-world-001
    

    To run a CWL workflow:

    $ cwltoil --batchSystem=mesos \
            --mesosMaster=10.0.0.5:5050 \
            --jobStore=azure:<accountname>:hello-world-001 \
            example.cwl \
            example-job.yml
    

    Be sure to replace <accountname> with the name of your Azure storage account.

Note that once you run a job with a particular job store name (the part after the account name) in a particular storage account, you cannot re-use that name in that account unless one of the following happens:

  1. You are restarting the same job with the --restart option.
  2. You clean the job store with toil clean azure:<accountname>:<jobstore>.
  3. You delete all the items created by that job, and the main job store table used by Toil, from the account (destroying all other job stores using the account).
  4. The job finishes successfully and cleans itself up.

Running on OpenStack

After setting up Toil on OpenStack, Toil scripts can be run by designating a job store location as shown in Quickstart: A simple workflow. Be sure to use the --workDir argument to specify a temporary directory that Toil can use to run jobs in:

$ python HelloWorld.py --workDir=/tmp file:jobStore

Running on Google Compute Engine

After setting up Toil on Google Compute Engine, Toil scripts can be run just by designating a job store location as shown in Quickstart: A simple workflow.

If you wish to use the Google Storage job store, install Toil with the google extra (Extras). Then, create a file named .boto with your credentials and some configuration:

[Credentials]
gs_access_key_id = KEY_ID
gs_secret_access_key = SECRET_KEY

[Boto]
https_validate_certificates = True

[GSUtil]
content_language = en
default_api_version = 2

gs_access_key_id and gs_secret_access_key can be generated by navigating to your Google Cloud Storage console and clicking on Settings. On the Settings page, navigate to the Interoperability tab and click Enable interoperability access. On this page you can now click Create a new key to generate an access key and a matching secret. Insert these into their respective places in the .boto file and you will be able to use a Google job store when invoking a Toil script, as in the following example:

$ python HelloWorld.py google:projectID:jobStore

The projectID component of the job store argument above refers to your Google Cloud Project ID in the Google Cloud Console, and will be visible in the console’s banner at the top of the screen. The jobStore component is a name of your choosing that you will use to refer to this job store.