Quickstart Examples

Running a basic CWL workflow

The Common Workflow Language (CWL) is an emerging standard for writing workflows that are portable across multiple workflow engines and platforms. Running CWL workflows using Toil is easy.

  1. First ensure that Toil is installed with the cwl extra (see Installing Toil with Extra Features):

    (venv) $ pip install 'toil[cwl]'
    

    This installs the toil-cwl-runner executable.

    Note

    Don’t actually type (venv) $ in at the beginning of each command. This is intended only to remind the user that they should have their virtual environment running.

  2. Copy and paste the following code block into example.cwl:

    cwlVersion: v1.0
    class: CommandLineTool
    baseCommand: echo
    stdout: output.txt
    inputs:
      message:
        type: string
        inputBinding:
          position: 1
    outputs:
      output:
        type: stdout
    

    and this code into example-job.yaml:

    message: Hello world!
    
  3. To run the workflow simply enter

    (venv) $ toil-cwl-runner example.cwl example-job.yaml
    

    Your output will be in output.txt:

    (venv) $ cat output.txt
    Hello world!
    

Congratulations! You’ve run your first Toil workflow using the default Batch System, single_machine, and the default file job store (which was placed in a temporary directory for you by toil-cwl-runner).

Toil uses batch systems to manage the jobs it creates.

The single_machine batch system is primarily used to prepare and debug workflows on a local machine. Once validated, try running them on a full-fledged batch system (see Batch System API). Toil supports many different batch systems such as Kubernetes and Grid Engine; its versatility makes it easy to run your workflow in all kinds of places.

Toil’s CWL runner is highly configurable. Run toil-cwl-runner --help to see a complete list of available options.

To learn more about CWL, see the CWL User Guide (from where this example was shamelessly borrowed). For information on using CWL with Toil see the section CWL in Toil. And for an example of CWL on an AWS cluster, have a look at Running a CWL Workflow on AWS.

Running a basic WDL workflow

The Workflow Description Language (WDL) is another emerging language for writing workflows that are portable across multiple workflow engines and platforms. WDL support in Toil is still in alpha and should be considered experimental. Toil currently supports basic workflow syntax (see WDL in Toil for more details and examples). Here we go over running a basic WDL helloworld workflow.

  1. First ensure that Toil is installed with the wdl extra (see Installing Toil with Extra Features):

    (venv) $ pip install 'toil[wdl]'
    

    This installs the toil-wdl-runner executable.

  2. Copy and paste the following code block into wdl-helloworld.wdl:

    workflow write_simple_file {
      call write_file
    }
    task write_file {
      String message
      command { echo ${message} > wdl-helloworld-output.txt }
      output { File test = "wdl-helloworld-output.txt" }
    }
    

    and this code into wdl-helloworld.json:

    {
      "write_simple_file.write_file.message": "Hello world!"
    }
    
  3. To run the workflow simply enter

    (venv) $ toil-wdl-runner wdl-helloworld.wdl wdl-helloworld.json
    

    Your output will be in wdl-helloworld-output.txt:

    (venv) $ cat wdl-helloworld-output.txt
    Hello world!
    

This will, like the CWL example above, use the single_machine batch system and an automatically-located file job store by default. You can customize Toil’s execution of the workflow with command-line options; run toil-wdl-runner --help to learn about them.

To learn more about WDL in general, see the Terra WDL documentation. For more on using WDL in Toil, see WDL in Toil.
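The key in wdl-helloworld.json is the input's fully qualified name: the workflow name, the task name, and the input name joined by dots. As a minimal sketch, the same inputs file could be generated from Python. The helper names wdl_input_key and write_inputs are hypothetical and are not part of Toil.

```python
import json
import os
import tempfile


def wdl_input_key(workflow, task, name):
    """Join workflow, task, and input names the way the example JSON does."""
    return ".".join([workflow, task, name])


def write_inputs(path, inputs):
    """Write a dict of WDL inputs to a JSON file and return the path."""
    with open(path, "w") as handle:
        json.dump(inputs, handle, indent=2)
    return path


# Build the single input used by the write_simple_file workflow above.
key = wdl_input_key("write_simple_file", "write_file", "message")
inputs_path = write_inputs(
    os.path.join(tempfile.mkdtemp(), "wdl-helloworld.json"),
    {key: "Hello world!"},
)
```

The resulting file can be passed to toil-wdl-runner in place of the hand-written JSON.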

Running a basic Python workflow

In addition to workflow languages like CWL and WDL, Toil supports running workflows written against its Python API.

An example Toil Python workflow can be run with just three steps:

  1. Install Toil (see Installation)

  2. Copy and paste the following code block into a new file called helloWorld.py:

from toil.common import Toil
from toil.job import Job


def helloWorld(message, memory="1G", cores=1, disk="1G"):
    return f"Hello, world!, here's a message: {message}"


if __name__ == "__main__":
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()
    options.clean = "always"
    with Toil(options) as toil:
        output = toil.start(Job.wrapFn(helloWorld, "You did it!"))
    print(output)
  3. Specify the name of the job store and run the workflow:

    (venv) $ python3 helloWorld.py file:my-job-store
    

For something beyond a “Hello, world!” example, refer to A (more) real-world example.

Toil’s customization options are available in Python workflows. Run python3 helloWorld.py --help to see a complete list of available options.

A (more) real-world example

For a more detailed example and explanation, we’ve developed a sample pipeline that merge-sorts a temporary file. This is not supposed to be an efficient sorting program, rather a more fully worked example of what Toil is capable of.

Running the example

  1. Download the example code

  2. Run it with the default settings:

    (venv) $ python3 sort.py file:jobStore
    

    The workflow created a file called sortedFile.txt in your current directory. Have a look at it and notice that it contains a whole lot of sorted lines!

    This workflow does a smart merge sort on a file it generates, fileToSort.txt. The sort is smart because each step of the process—splitting the file into separate chunks, sorting these chunks, and merging them back together—is compartmentalized into a job. Each job can specify its own resource requirements and will only be run after the jobs it depends upon have run. Jobs without dependencies will be run in parallel.

Note

Delete fileToSort.txt before moving on to step 3. That step introduces options that set the dimensions of fileToSort.txt, but they only take effect when the file is generated. If the file already exists, the workflow reuses it and the results will be the same as in step 2.

  3. Run with custom options:

    (venv) $ python3 sort.py file:jobStore \
                 --numLines=5000 \
                 --lineLength=10 \
                 --overwriteOutput=True \
                 --workDir=/tmp/
    

    Here we see that we can add our own options to a Toil Python workflow. As noted above, the first two options, --numLines and --lineLength, determine the number of lines and how many characters are in each line. --overwriteOutput causes the current contents of sortedFile.txt to be overwritten, if it already exists. The last option, --workDir, is an option built into Toil to specify where temporary files unique to a job are kept.
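To make the effect of --numLines and --lineLength concrete, here is a minimal sketch of a generator for the input file. It is an assumption about what a helper like makeFileToSort() does, not the code shipped in sort.py; the name make_file_to_sort, the seed parameter, and the choice of alphabet are all hypothetical.

```python
import os
import random
import string
import tempfile


def make_file_to_sort(file_name, lines, line_len, seed=None):
    """Write `lines` random lines, each `line_len` characters long."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    with open(file_name, "w") as handle:
        for _ in range(lines):
            handle.write("".join(rng.choice(alphabet) for _ in range(line_len)) + "\n")


# Mirror the options used above: --numLines=5000 --lineLength=10
demo_path = os.path.join(tempfile.mkdtemp(), "fileToSort.txt")
make_file_to_sort(demo_path, lines=5000, line_len=10, seed=0)
with open(demo_path) as handle:
    demo_lines = handle.read().splitlines()
```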

Describing the source code

To understand the details of what’s going on inside, let’s start with the main() function. It looks like a lot of code, but don’t worry, we’ll break it down piece by piece.

def main(options=None):
    if not options:
        # deal with command line arguments
        parser = ArgumentParser()
        Job.Runner.addToilOptions(parser)
        parser.add_argument('--numLines', default=defaultLines, help='Number of lines in file to sort.', type=int)
        parser.add_argument('--lineLength', default=defaultLineLen, help='Length of lines in file to sort.', type=int)
        parser.add_argument("--fileToSort", help="The file you wish to sort")
        parser.add_argument("--outputFile", help="Where the sorted output will go")
        parser.add_argument("--overwriteOutput", help="Write over the output file if it already exists.", default=True)
        parser.add_argument("--N", dest="N",
                            help="The threshold below which a serial sort function is used to sort file. "
                                 "All lines must be of length less than or equal to N or program will fail",
                            default=10000)
        parser.add_argument('--downCheckpoints', action='store_true',
                            help='If this option is set, the workflow will make checkpoints on its way through '
                                 'the recursive "down" part of the sort')
        parser.add_argument("--sortMemory", dest="sortMemory",
                        help="Memory for jobs that sort chunks of the file.",
                        default=None)

        parser.add_argument("--mergeMemory", dest="mergeMemory",
                        help="Memory for jobs that collate results.",
                        default=None)

        options = parser.parse_args()
    if not hasattr(options, "sortMemory") or not options.sortMemory:
        options.sortMemory = sortMemory
    if not hasattr(options, "mergeMemory") or not options.mergeMemory:
        options.mergeMemory = sortMemory

    # do some input verification
    sortedFileName = options.outputFile or "sortedFile.txt"
    if not options.overwriteOutput and os.path.exists(sortedFileName):
        print(f'Output file {sortedFileName} already exists.  '
              f'Delete it to run the sort example again or use --overwriteOutput=True')
        exit()

    fileName = options.fileToSort
    if options.fileToSort is None:
        # make the file ourselves
        fileName = 'fileToSort.txt'
        if os.path.exists(fileName):
            print(f'Sorting existing file: {fileName}')
        else:
            print(f'No sort file specified. Generating one automatically called: {fileName}.')
            makeFileToSort(fileName=fileName, lines=options.numLines, lineLen=options.lineLength)
    else:
        if not os.path.exists(options.fileToSort):
            raise RuntimeError("File to sort does not exist: %s" % options.fileToSort)

    if int(options.N) <= 0:
        raise RuntimeError("Invalid value of N: %s" % options.N)

    # Now we are ready to run
    with Toil(options) as workflow:
        sortedFileURL = 'file://' + os.path.abspath(sortedFileName)
        if not workflow.options.restart:
            sortFileURL = 'file://' + os.path.abspath(fileName)
            sortFileID = workflow.importFile(sortFileURL)
            sortedFileID = workflow.start(Job.wrapJobFn(setup,
                                                        sortFileID,
                                                        int(options.N),
                                                        options.downCheckpoints,
                                                        options=options,
                                                        memory=sortMemory))
        else:
            sortedFileID = workflow.restart()
        workflow.exportFile(sortedFileID, sortedFileURL)

First we make a parser to process command line arguments using the argparse module. It’s important that we add the call to Job.Runner.addToilOptions() to initialize our parser with all of Toil’s default options. Then we add the command line arguments unique to this workflow, and parse the input. The help message listed with the arguments should give you a pretty good idea of what they can do.

Next we do a little bit of verification of the input arguments. The option --fileToSort allows you to specify a file that needs to be sorted. If this option isn’t given, it’s here that we make our own file with the call to makeFileToSort().
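One detail worth noting in the parser above: --overwriteOutput is declared without a type, so argparse stores whatever string is passed, and any non-empty string (including "False") is truthy. The following is a minimal sketch of one way to get a real boolean. The str_to_bool helper and the --rawFlag option are hypothetical and are not part of sort.py.

```python
from argparse import ArgumentParser, ArgumentTypeError


def str_to_bool(value):
    """Convert common true/false spellings to a bool for argparse."""
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise ArgumentTypeError(f"Expected a boolean value, got {value!r}")


parser = ArgumentParser()
# Without type=..., argparse stores the raw string, and "False" is truthy.
parser.add_argument("--rawFlag", default=True)
# With a converter, the option becomes a real bool.
parser.add_argument("--overwriteOutput", default=True, type=str_to_bool)

options = parser.parse_args(["--rawFlag=False", "--overwriteOutput=False"])
```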

Finally we come to the context manager that initializes the workflow. We create a path to the input file prepended with 'file://' as per the documentation for toil.common.Toil() when staging a file that is stored locally. Notice that we have to check whether or not the workflow is restarting so that we don’t import the file more than once. Finally we can kick off the workflow by calling toil.common.Toil.start() on the job setup. When the workflow ends we capture its output (the sorted file’s fileID) and use that in toil.common.Toil.exportFile() to move the sorted file from the job store back into “userland”.

Next let’s look at the job that begins the actual workflow, setup.

def setup(job, inputFile, N, downCheckpoints, options):
    """
    Sets up the sort.
    Returns the FileID of the sorted file
    """
    RealtimeLogger.info("Starting the merge sort")
    return job.addChildJobFn(down,
                             inputFile, N, 'root',
                             downCheckpoints,
                             options = options,
                             preemptible=True,
                             memory=sortMemory).rv()

setup really only does two things. First it writes to the logs using RealtimeLogger.info() and then calls addChildJobFn(). Child jobs run directly after the current job. This function turns the ‘job function’ down into an actual job and passes in the inputs including an optional resource requirement, memory. The call to Job.rv() does not run the job or wait for it; it returns a promise for the return value of down. Toil fulfills that promise once down has finished, and the result is passed back as the return value of setup.

Now we can look at what down does.

def down(job, inputFileStoreID, N, path, downCheckpoints, options, memory=sortMemory):
    """
    Input is a file, a subdivision size N, and a path in the hierarchy of jobs.
    If the range is larger than a threshold N the range is divided recursively and
    a follow on job is then created which merges back the results else
    the file is sorted and placed in the output.
    """

    RealtimeLogger.info("Down job starting: %s" % path)

    # Read the file
    inputFile = job.fileStore.readGlobalFile(inputFileStoreID, cache=False)
    length = os.path.getsize(inputFile)
    if length > N:
        # We will subdivide the file
        RealtimeLogger.critical("Splitting file: %s of size: %s"
                % (inputFileStoreID, length))
        # Split the file into two copies
        midPoint = getMidPoint(inputFile, 0, length)
        t1 = job.fileStore.getLocalTempFile()
        with open(t1, 'w') as fH:
            fH.write(copySubRangeOfFile(inputFile, 0, midPoint+1))
        t2 = job.fileStore.getLocalTempFile()
        with open(t2, 'w') as fH:
            fH.write(copySubRangeOfFile(inputFile, midPoint+1, length))
        # Call down recursively. By giving the rv() of the two jobs as inputs to the follow-on job, up,
        # we communicate the dependency without hindering concurrency.
        result = job.addFollowOnJobFn(up,
                                    job.addChildJobFn(down, job.fileStore.writeGlobalFile(t1), N, path + '/0',
                                                      downCheckpoints, checkpoint=downCheckpoints, options=options,
                                                      preemptible=True, memory=options.sortMemory).rv(),
                                    job.addChildJobFn(down, job.fileStore.writeGlobalFile(t2), N, path + '/1',
                                                      downCheckpoints, checkpoint=downCheckpoints, options=options,
                                                      preemptible=True, memory=options.mergeMemory).rv(),
                                    path + '/up', preemptible=True, options=options, memory=options.sortMemory).rv()
    else:
        # We can sort this bit of the file
        RealtimeLogger.critical("Sorting file: %s of size: %s"
                % (inputFileStoreID, length))
        # Sort the copy and write back to the fileStore
        shutil.copyfile(inputFile, inputFile + '.sort')
        sort(inputFile + '.sort')
        result = job.fileStore.writeGlobalFile(inputFile + '.sort')

    RealtimeLogger.info("Down job finished: %s" % path)
    return result

down is the recursive part of the workflow. First we read the file into the local filestore by calling job.fileStore.readGlobalFile(). This puts a copy of the file in the temp directory for this particular job. This storage will disappear once this job ends. For a detailed explanation of the filestore, job store, and their interfaces, have a look at Managing files within a workflow.

Next down checks the base case of the recursion: is the size of the input file no greater than N (remember N was an option we added to the workflow in main)? In the base case, we just sort the file and return the file ID of this new sorted file.

If the base case fails, then the file is split into two new tempFiles using job.fileStore.getLocalTempFile() and the helper function copySubRangeOfFile. Finally we add a follow on Job up with job.addFollowOnJobFn(). We’ve already seen child jobs. A follow-on Job is a job that runs after the current job and all of its children (and their children and follow-ons) have completed. Using a follow-on makes sense because up is responsible for merging the files together and we don’t want to merge the files together until we know they are sorted. Again, the return value of the follow-on job is requested using Job.rv().
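The shape of this recursion can be sketched without Toil as an ordinary in-memory merge sort. This is a single-process analogue under simplifying assumptions (lists of lines instead of files, plain function calls instead of jobs), not the workflow's actual code. The len(lines) > 1 guard is an addition here so that a single line longer than n cannot recurse forever.

```python
import heapq


def up(left, right):
    """Merge two already sorted lists of lines."""
    return list(heapq.merge(left, right))


def down(lines, n):
    """Sort `lines`, splitting recursively while the chunk is larger than n characters."""
    size = sum(len(line) for line in lines)
    if size > n and len(lines) > 1:
        mid = len(lines) // 2
        # In Toil these two calls would be child jobs that can run in parallel.
        left = down(lines[:mid], n)
        right = down(lines[mid:], n)
        # In Toil this merge would be the follow-on job, up.
        return up(left, right)
    # Base case: the chunk is small enough to sort directly.
    return sorted(lines)


unsorted_lines = ["pear\n", "apple\n", "fig\n", "kiwi\n", "banana\n"]
result = down(unsorted_lines, n=10)
```

In the Toil version the two recursive calls return promises rather than values, which is what lets the children run concurrently before the follow-on merges their results.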

Looking at up

def up(job, inputFileID1, inputFileID2, path, options, memory=sortMemory):
    """
    Merges the two files and places them in the output.
    """

    RealtimeLogger.info("Up job starting: %s" % path)

    with job.fileStore.writeGlobalFileStream() as (fileHandle, outputFileStoreID):
        fileHandle = codecs.getwriter('utf-8')(fileHandle)
        with job.fileStore.readGlobalFileStream(inputFileID1) as inputFileHandle1:
            inputFileHandle1 = codecs.getreader('utf-8')(inputFileHandle1)
            with job.fileStore.readGlobalFileStream(inputFileID2) as inputFileHandle2:
                inputFileHandle2 = codecs.getreader('utf-8')(inputFileHandle2)
                RealtimeLogger.info("Merging %s and %s to %s"
                    % (inputFileID1, inputFileID2, outputFileStoreID))
                merge(inputFileHandle1, inputFileHandle2, fileHandle)
        # Clean up the input files - these deletes will occur after the completion is successful.
        job.fileStore.deleteGlobalFile(inputFileID1)
        job.fileStore.deleteGlobalFile(inputFileID2)

        RealtimeLogger.info("Up job finished: %s" % path)

        return outputFileStoreID

we see that the two input files are merged together and the output is written to a new file using job.fileStore.writeGlobalFileStream(). After a little cleanup, the output file is returned.

Once the final up finishes and all of the rv() promises are fulfilled, main receives the sorted file’s ID which it uses in exportFile to send it to the user.
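The merge() helper called inside up is not shown in this walkthrough. A minimal sketch of a streaming two-way merge with the same call shape follows; it is an assumption about the helper's behavior, not the code from sort.py, and it uses in-memory io.StringIO handles in place of job store streams.

```python
import io


def merge(handle1, handle2, out_handle):
    """Merge two sorted line streams into out_handle, one line at a time."""
    line1 = handle1.readline()
    line2 = handle2.readline()
    while line1 and line2:
        if line1 <= line2:
            out_handle.write(line1)
            line1 = handle1.readline()
        else:
            out_handle.write(line2)
            line2 = handle2.readline()
    # One input is exhausted; copy whatever remains of the other.
    while line1:
        out_handle.write(line1)
        line1 = handle1.readline()
    while line2:
        out_handle.write(line2)
        line2 = handle2.readline()


merged = io.StringIO()
merge(io.StringIO("apple\nfig\n"), io.StringIO("banana\nkiwi\n"), merged)
merged_text = merged.getvalue()
```

Reading one line at a time keeps memory use independent of file size, which is the reason to merge streams rather than load both inputs.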

There are other things in this example that we didn’t go over such as Checkpoints and the details of much of the Toil Class API.

At the end of the script the lines

if __name__ == '__main__':
    main()

are included to ensure that the main function is only run once in the ‘__main__’ process invoked by you, the user. In Toil terms, by invoking the script you created the leader process in which the main() function is run. A worker process is a separate process whose sole purpose is to host the execution of one or more jobs defined in that script. In any Toil workflow there is always one leader process, and potentially many worker processes.

When using the single-machine batch system (the default), the worker processes will be running on the same machine as the leader process. With full-fledged batch systems like Kubernetes the worker processes will typically be started on separate machines. The boilerplate ensures that the pipeline is only started once—on the leader—but not when its job functions are imported and executed on the individual workers.

Typing python3 sort.py --help will show the complete list of arguments for the workflow which includes both Toil’s and ones defined inside sort.py. A complete explanation of Toil’s arguments can be found in Commandline Options.

Logging

By default, Toil logs a lot of information related to the current environment in addition to messages from the batch system and jobs. This can be configured with the --logLevel flag. For example, to only log CRITICAL level messages to the screen:

(venv) $ python3 sort.py file:jobStore \
             --logLevel=critical \
             --overwriteOutput=True

This hides most of the information we get from the Toil run. For more detail, we can run the pipeline with --logLevel=debug to see a comprehensive output. For more information, see Commandline Options.
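The filtering behaves like a threshold in Python's standard logging module. The sketch below uses only the standard library and assumes that --logLevel maps onto the usual logging thresholds; the logger name sort_example_demo and the messages are made up for illustration.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("sort_example_demo")
logger.addHandler(handler)
logger.propagate = False

# Comparable in spirit to --logLevel=critical: only CRITICAL gets through.
logger.setLevel(logging.CRITICAL)
logger.info("Down job starting: root")
logger.critical("Splitting file: demo of size: 42")

# Comparable in spirit to --logLevel=debug: everything gets through.
logger.setLevel(logging.DEBUG)
logger.info("Down job finished: root")

logged = stream.getvalue().splitlines()
```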

Error Handling and Resuming Pipelines

With Toil, you can recover gracefully from a bug in your pipeline without losing any progress from successfully completed jobs. To demonstrate this, let’s add a bug to our example code to see how Toil handles a failure and how we can resume a pipeline after that happens. Add a bad assertion at line 52 of the example (the first line of down()):

def down(job, inputFileStoreID, N, path, downCheckpoints, options, memory=sortMemory):
    ...
    assert 1 == 2, "Test error!"

When we run the pipeline, Toil will show a detailed failure log with a traceback:

(venv) $ python3 sort.py file:jobStore
...
---TOIL WORKER OUTPUT LOG---
...
m/j/jobonrSMP    Traceback (most recent call last):
m/j/jobonrSMP      File "toil/src/toil/worker.py", line 340, in main
m/j/jobonrSMP        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
m/j/jobonrSMP      File "toil/src/toil/job.py", line 1270, in _runner
m/j/jobonrSMP        returnValues = self._run(jobGraph, fileStore)
m/j/jobonrSMP      File "toil/src/toil/job.py", line 1217, in _run
m/j/jobonrSMP        return self.run(fileStore)
m/j/jobonrSMP      File "toil/src/toil/job.py", line 1383, in run
m/j/jobonrSMP        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
m/j/jobonrSMP      File "toil/example.py", line 30, in down
m/j/jobonrSMP        assert 1 == 2, "Test error!"
m/j/jobonrSMP    AssertionError: Test error!

If we try to run the pipeline again, Toil will give us an error message saying that a job store of the same name already exists. By default, in the event of a failure, the job store is preserved so that the workflow can be restarted, starting from the previously failed jobs. We can restart the pipeline by running

(venv) $ python3 sort.py file:jobStore \
             --restart \
             --overwriteOutput=True

We can also change the number of times Toil will attempt to retry a failed job:

(venv) $ python3 sort.py file:jobStore \
             --retryCount 2 \
             --restart \
             --overwriteOutput=True

You’ll now see Toil attempt to rerun the failed job until it runs out of tries. --retryCount is useful for non-systemic errors, like downloading a file that may experience a sporadic interruption, or some other non-deterministic failure.
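The difference between a sporadic failure and a systemic one can be illustrated with a plain retry loop. This sketch is not how Toil implements retries; it assumes that a retry count of 2 means up to two further attempts after the first failure, and the names run_with_retries and flaky_download are hypothetical.

```python
def run_with_retries(job_fn, retry_count):
    """Call job_fn, retrying up to retry_count times after the first failure."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return job_fn(), attempts
        except Exception:
            # Give up once the initial attempt plus all retries have failed.
            if attempts > retry_count:
                raise


calls = {"n": 0}


def flaky_download():
    """Fail on the first two calls, then succeed, like a sporadic network error."""
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("sporadic interruption")
    return "file contents"


value, attempts_used = run_with_retries(flaky_download, retry_count=2)
```

A deterministic bug such as the bad assertion above fails on every attempt, so retrying only delays the final failure.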

To successfully restart our pipeline, we can edit our script to comment out or remove the assertion we added, and then run

(venv) $ python3 sort.py file:jobStore \
             --restart \
             --overwriteOutput=True

The pipeline will run successfully, and the job store will be removed on the pipeline’s completion.

Collecting Statistics

Please see the Status Command section for more on gathering runtime and resource info on jobs.

Launching a Toil Workflow in AWS

Once you have installed the aws extra for Toil (see Installation) and set up AWS (see Preparing your AWS environment), you can run the basic helloWorld.py script (Running a basic Python workflow) on a VM in AWS just by modifying the run command.

Note that when running in AWS, users can either run the workflow on a single instance or run it on a cluster (which is running across multiple containers on multiple AWS instances). For more information on running Toil workflows on a cluster, see Running in AWS.

Remember to use the Destroy-Cluster Command to destroy the cluster when you are finished! Otherwise the cluster's resources may not be cleaned up properly.

  1. Launch a cluster in AWS using the Launch-Cluster Command:

    (venv) $ toil launch-cluster <cluster-name> \
                 --clusterType kubernetes \
                 --keyPairName <AWS-key-pair-name> \
                 --leaderNodeType t2.medium \
                 --nodeTypes t2.medium -w 1 \
                 --zone us-west-2a
    

    The arguments keyPairName, leaderNodeType, and zone are required to launch a cluster.

  2. Copy helloWorld.py to the /tmp directory on the leader node using the Rsync-Cluster Command:

    (venv) $ toil rsync-cluster --zone us-west-2a <cluster-name> helloWorld.py :/tmp
    

    Note that the command requires defining the file to copy as well as the target location on the cluster leader node.

  3. Log in to the cluster leader node using the Ssh-Cluster Command:

    (venv) $ toil ssh-cluster --zone us-west-2a <cluster-name>
    

    Note that this command will log you in as the root user.

  4. Run the workflow on the cluster:

    $ python3 /tmp/helloWorld.py aws:us-west-2:my-S3-bucket
    

    In this particular case, we create an S3 bucket called my-S3-bucket in the us-west-2 region to store intermediate job results.

    Along with some other INFO log messages, you should get the following output in your terminal window: Hello, world!, here's a message: You did it!.

  5. Exit from the SSH connection.

    $ exit
    
  6. Use the Destroy-Cluster Command to destroy the cluster:

    (venv) $ toil destroy-cluster --zone us-west-2a <cluster-name>
    

    Note that this command will destroy the cluster leader node and any resources created to run the job, including the S3 bucket.
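The job store arguments used in this document (file:my-job-store locally, aws:us-west-2:my-S3-bucket on the cluster) follow a type-prefixed pattern. The sketch below splits such a locator into its parts for illustration only; parse_job_store_locator is a hypothetical helper, not a Toil API, and it assumes every non-file locator has the form type:region:name.

```python
def parse_job_store_locator(locator):
    """Split a locator such as 'aws:us-west-2:my-S3-bucket' into its parts."""
    kind, _, rest = locator.partition(":")
    if not rest:
        raise ValueError(f"Expected '<type>:<details>', got {locator!r}")
    if kind == "file":
        # A file job store is just a path on the local file system.
        return {"type": kind, "path": rest}
    region, _, name = rest.partition(":")
    if not name:
        raise ValueError(f"Expected '{kind}:<region>:<name>', got {locator!r}")
    return {"type": kind, "region": region, "name": name}


local_store = parse_job_store_locator("file:my-job-store")
aws_store = parse_job_store_locator("aws:us-west-2:my-S3-bucket")
```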

Running a CWL Workflow on AWS

Once you have installed the aws and cwl extras for Toil (see Installation) and set up AWS (see Preparing your AWS environment), you can run a CWL workflow with Toil on AWS.

Remember to use the Destroy-Cluster Command to destroy the cluster when you are finished! Otherwise the cluster's resources may not be cleaned up properly.

  1. First launch a node in AWS using the Launch-Cluster Command:

    (venv) $ toil launch-cluster <cluster-name> \
                 --clusterType kubernetes \
                 --keyPairName <AWS-key-pair-name> \
                 --leaderNodeType t2.medium \
                 --nodeTypes t2.medium -w 1 \
                 --zone us-west-2a
    
  2. Copy example.cwl and example-job.yaml from the CWL example to the node using the Rsync-Cluster Command:

    (venv) $ toil rsync-cluster --zone us-west-2a <cluster-name> example.cwl :/tmp
    (venv) $ toil rsync-cluster --zone us-west-2a <cluster-name> example-job.yaml :/tmp
    
  3. SSH into the cluster’s leader node using the Ssh-Cluster Command utility:

    (venv) $ toil ssh-cluster --zone us-west-2a <cluster-name>
    
  4. Once on the leader node, command line tools such as kubectl will be available to you. It’s also a good idea to update and install the following:

    sudo apt-get update
    sudo apt-get -y upgrade
    sudo apt-get -y dist-upgrade
    sudo apt-get -y install git
    
  5. Now create a new virtualenv with the --system-site-packages option and activate it:

    virtualenv --system-site-packages venv
    source venv/bin/activate
    
  6. Now run the CWL workflow with the Kubernetes batch system:

    (venv) $ toil-cwl-runner \
                 --provisioner aws \
                 --batchSystem kubernetes \
                 --jobStore aws:us-west-2:any-name \
                 /tmp/example.cwl /tmp/example-job.yaml
    

    Tip

    When running a CWL workflow on AWS, input files can be provided either on the local file system or in S3 buckets using s3:// URI references. Final output files will be copied to the local file system of the leader node.

  7. Finally, log out of the leader node and from your local computer, destroy the cluster:

    (venv) $ toil destroy-cluster --zone us-west-2a <cluster-name>
    

Running a Workflow with Autoscaling - Cactus

Cactus is a reference-free, whole-genome multiple alignment program that can be run on any of the cloud platforms Toil supports.

Note

Cloud Independence:

This example provides a “cloud agnostic” view of running Cactus with Toil. Most options will not change between cloud providers. However, each provisioner has unique inputs for --leaderNodeType, --nodeType and --zone. We recommend the following:

    Option            Used in         AWS         Google
    --leaderNodeType  launch-cluster  t2.medium   n1-standard-1
    --zone            launch-cluster  us-west-2a  us-west1-a
    --zone            cactus          us-west-2
    --nodeType        cactus          c3.4xlarge  n1-standard-8

When executing toil launch-cluster with gce specified for --provisioner, the option --boto must be specified and given a path to your .boto file. See Running in Google Compute Engine (GCE) for more information about the --boto option.
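Note that on AWS the recommended values differ between launch-cluster (us-west-2a, an availability zone) and the job store locator passed to cactus (us-west-2, a region). A minimal sketch of deriving one from the other follows. It covers AWS only, assumes the standard naming in which a zone is the region name plus one trailing letter, and uses hypothetical helper names.

```python
def aws_zone_to_region(zone):
    """Drop the trailing availability-zone letter, e.g. 'us-west-2a' -> 'us-west-2'."""
    if len(zone) < 2 or not zone[-1].isalpha() or not zone[-2].isdigit():
        raise ValueError(f"Not an AWS availability zone name: {zone!r}")
    return zone[:-1]


def job_store_locator(zone, name):
    """Build an 'aws:<region>:<name>' locator from a launch-cluster zone."""
    return f"aws:{aws_zone_to_region(zone)}:{name}"


locator = job_store_locator("us-west-2a", "cactus-pestis")
```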

Remember to use the Destroy-Cluster Command to destroy the cluster when you are finished! Otherwise the cluster's resources may not be cleaned up properly.

  1. Download pestis.tar.gz

  2. Launch a cluster using the Launch-Cluster Command:

    (venv) $ toil launch-cluster <cluster-name> \
                 --provisioner <aws, gce> \
                 --keyPairName <key-pair-name> \
                 --leaderNodeType <type> \
                 --nodeType <type> \
                 -w 1-2 \
                 --zone <zone>
    

    Note

    A Helpful Tip

    When using AWS, setting the TOIL_AWS_ZONE environment variable eliminates having to specify the --zone option for each command. This will be supported for GCE in the future.

    (venv) $ export TOIL_AWS_ZONE=us-west-2c
    
  3. Create an appropriate directory for uploading files:

    (venv) $ toil ssh-cluster --provisioner <aws, gce> <cluster-name>
    $ mkdir /root/cact_ex
    $ exit
    
  4. Copy the required files, i.e., seqFile.txt (a text file containing the locations of the input sequences as well as their phylogenetic tree, see here), organisms’ genome sequence files in FASTA format, and configuration files (e.g. blockTrim1.xml, if desired), up to the leader node:

    (venv) $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> pestis-short-aws-seqFile.txt :/root/cact_ex
    (venv) $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000169655.1_ASM16965v1_genomic.fna :/root/cact_ex
    (venv) $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000006645.1_ASM664v1_genomic.fna :/root/cact_ex
    (venv) $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000182485.1_ASM18248v1_genomic.fna :/root/cact_ex
    (venv) $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> GCF_000013805.1_ASM1380v1_genomic.fna :/root/cact_ex
    (venv) $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> setup_leaderNode.sh :/root/cact_ex
    (venv) $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> blockTrim1.xml :/root/cact_ex
    (venv) $ toil rsync-cluster --provisioner <aws, gce> <cluster-name> blockTrim3.xml :/root/cact_ex
    
  5. Log in to the leader node:

    (venv) $ toil ssh-cluster --provisioner <aws, gce> <cluster-name>
    
  6. Set up the environment of the leader node to run Cactus:

    $ bash /root/cact_ex/setup_leaderNode.sh
    $ source cact_venv/bin/activate
    (cact_venv) $ cd cactus
    (cact_venv) $ pip install --upgrade .
    
  7. Run Cactus as an autoscaling workflow:

    (cact_venv) $ cactus \
                      --retry 10 \
                      --batchSystem kubernetes \
                      --logDebug \
                      --logFile /logFile_pestis3 \
                      --configFile \
                      /root/cact_ex/blockTrim3.xml <aws, google>:<zone>:cactus-pestis \
                      /root/cact_ex/pestis-short-aws-seqFile.txt \
                      /root/cact_ex/pestis_output3.hal
    

    Note

    Pieces of the Puzzle:

    --logDebug is equivalent to --logLevel DEBUG.

    --logFile /logFile_pestis3 writes logs to a file named logFile_pestis3 in the / directory.

    --configFile is optional; pass it only if the alignment should run with a specific configuration file.

    <aws, google>:<zone>:cactus-pestis creates a bucket named cactus-pestis with the specified cloud provider to store intermediate job files and metadata. NOTE: If you want to use a GCE-based jobstore, specify google here, not gce.

    The result file, named pestis_output3.hal, is stored in the /root/cact_ex directory of the leader node.

    Use cactus --help to see all the Cactus and Toil flags available.

  8. Log out of the leader node:

    (cact_venv) $ exit
    
  9. Download the resulting output to your local machine:

    (venv) $ toil rsync-cluster \
                 --provisioner <aws, gce> <cluster-name> \
                 :/root/cact_ex/pestis_output3.hal \
                 <path-of-folder-on-local-machine>
    
  10. Destroy the cluster:

    (venv) $ toil destroy-cluster --provisioner <aws, gce> <cluster-name>