Commandline Options

A quick way to see all of Toil’s commandline options is by executing the following on a toil script:

$ toil example.py --help

For a basic toil workflow, Toil has one mandatory argument, the job store. All other arguments are optional.

The Job Store

Running Toil scripts requires a file path or URL to a central location for all of the workflow’s files. This is Toil’s one required positional argument: the job store. Using the quickstart example, if you’re on a node that has a large /scratch volume, you can have the job store created there by executing python HelloWorld.py /scratch/my-job-store or, more explicitly, python HelloWorld.py file:/scratch/my-job-store.

Syntax for specifying different job stores:

Local: file:job-store-name

AWS: aws:region-here:job-store-name

Google: google:projectID-here:job-store-name

Different types of job store options can be found below.
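
As a rough illustration of the locator syntax above, the sketch below splits a locator into its job store type and the remainder. The helper name split_locator is hypothetical and is not part of Toil's API; it assumes only that a bare path means a local file job store, as the two equivalent HelloWorld.py invocations suggest. The region and project names are placeholder examples.

```python
def split_locator(locator):
    """Split a job store locator into (store_type, remainder).

    Hypothetical helper for illustration only; this is not Toil's own parser.
    """
    known_types = ("file", "aws", "google")
    if ":" in locator:
        store_type, remainder = locator.split(":", 1)
        if store_type in known_types:
            return store_type, remainder
    # No recognised prefix: assume a local path, i.e. the file job store.
    return "file", locator


# Example locators (region and project names are placeholders).
for example in ("/scratch/my-job-store",
                "file:/scratch/my-job-store",
                "aws:us-west-2:my-job-store",
                "google:my-project:my-job-store"):
    print(example, "->", split_locator(example))
```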

Commandline Options

Core Toil Options

--workDir WORKDIR
 Absolute path to the directory where temporary files generated during the Toil run should be placed. Temp files and folders, as well as standard output and error from batch system jobs (unless --noStdOutErr is given), will be placed in a directory toil-<workflowID> within workDir. The workflowID is generated by Toil and will be reported in the workflow logs. The default is determined by the environment variables (TMPDIR, TEMP, TMP) via mkdtemp. This directory needs to exist on all machines running jobs; if capturing standard output and error from batch system jobs is desired, it will generally need to be on a shared file system.
--noStdOutErr Do not capture standard output and error from batch system jobs.
--stats Records statistics about the toil workflow to be used by ‘toil stats’.
--clean=STATE Determines the deletion of the jobStore upon completion of the program. Choices: ‘always’, ‘onError’, ‘never’, or ‘onSuccess’. The --stats option requires information from the jobStore upon completion, so the jobStore will never be deleted with that flag. If you wish to be able to restart the run, choose ‘never’ or ‘onSuccess’. Default is ‘never’ if stats is enabled, and ‘onSuccess’ otherwise.
--cleanWorkDir STATE
 Determines deletion of temporary worker directory upon completion of a job. Choices: ‘always’, ‘never’, ‘onSuccess’. Default = always. WARNING: This option should be changed for debugging only. Running a full pipeline with this option could fill your disk with intermediate data.
--clusterStats FILEPATH
 If enabled, writes out JSON resource usage statistics to a file. The default location for this file is the current working directory, but an absolute path can also be passed to specify where this file should be written. This option only applies when using scalable batch systems.
--restart If --restart is specified, Toil will attempt to restart the existing workflow at the location pointed to by the --jobStore option. An exception is raised if the workflow does not exist.
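
The --clean choices and their default, as described above, can be summarised in a few lines. This is a minimal sketch of the documented behaviour, not Toil's implementation; the function names default_clean and should_delete_job_store are assumptions made for illustration.

```python
def default_clean(stats_enabled):
    """Documented default: 'never' when --stats is given, else 'onSuccess'."""
    return "never" if stats_enabled else "onSuccess"


def should_delete_job_store(clean, succeeded):
    """Decide whether the job store is removed once the workflow ends.

    Hypothetical helper; it only mirrors the four documented choices.
    """
    if clean == "always":
        return True
    if clean == "never":
        return False
    if clean == "onSuccess":
        return succeeded
    if clean == "onError":
        return not succeeded
    raise ValueError("unknown --clean choice: %r" % (clean,))


# A failed run with the default policy keeps the job store, so it can be
# restarted with --restart.
policy = default_clean(stats_enabled=False)
print(policy, should_delete_job_store(policy, succeeded=False))
```

Under this reading, only ‘never’ and ‘onSuccess’ leave the job store in place after a failure, which is why those are the choices recommended for restartable runs.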

Logging Options

Toil hides stdout and stderr by default except in case of job failure. Log levels in toil are based on priority from the logging module:

--logOff Only CRITICAL log levels are shown. Equivalent to --logLevel=OFF or --logLevel=CRITICAL.
--logCritical Only CRITICAL log levels are shown. Equivalent to --logLevel=OFF or --logLevel=CRITICAL.
--logError Only ERROR and CRITICAL log levels are shown. Equivalent to --logLevel=ERROR.
--logWarning Only WARN, ERROR, and CRITICAL log levels are shown. Equivalent to --logLevel=WARNING.
--logInfo All log statements are shown, except DEBUG. Equivalent to --logLevel=INFO.
--logDebug All log statements are shown. Equivalent to --logLevel=DEBUG.
--logLevel=LOGLEVEL
 May be set to: OFF (or CRITICAL), ERROR, WARN (or WARNING), INFO, or DEBUG.
--logFile FILEPATH
 Specifies a file path to write the logging output to.
--rotatingLogging
 Turn on rotating logging, which prevents log files from getting too big (set using --maxLogFileSize BYTESIZE).
--maxLogFileSize BYTESIZE
 Sets the maximum log file size in bytes (--rotatingLogging must be active).
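
Since the level names above follow Python's logging module, the mapping can be sketched with the standard library alone. The helper resolve_log_level is hypothetical; it assumes only the aliases stated above (OFF for CRITICAL, WARN for WARNING) and is not taken from Toil's source.

```python
import logging

# Level names accepted by --logLevel, mapped to the logging module's values.
_LEVELS = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}
# Aliases described in the option list above.
_ALIASES = {"OFF": "CRITICAL", "WARN": "WARNING"}


def resolve_log_level(name):
    """Translate a --logLevel value into a numeric logging level.

    Hypothetical helper for illustration; not part of Toil.
    """
    canonical = _ALIASES.get(name.upper(), name.upper())
    if canonical not in _LEVELS:
        raise ValueError("unknown log level: %r" % (name,))
    return _LEVELS[canonical]


# A message is shown when its level is at least the configured threshold.
threshold = resolve_log_level("WARN")
print(logging.ERROR >= threshold)  # ERROR messages pass a WARNING threshold
print(logging.INFO >= threshold)   # INFO messages do not
```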

Batch System Options

--batchSystem BATCHSYSTEM
 The type of batch system to run the job(s) with. Currently one of LSF, Mesos, Slurm, Torque, HTCondor, singleMachine, parasol, gridEngine. (default: singleMachine)
--parasolCommand PARASOLCOMMAND
 The name or path of the parasol program. Will be looked up on PATH unless it starts with a slash. (default: parasol)
--parasolMaxBatches PARASOLMAXBATCHES
 Maximum number of job batches the Parasol batch is allowed to create. One batch is created for jobs with a unique set of resource requirements. (default: 1000)
--scale SCALE A scaling factor to change the value of all submitted tasks’ submitted cores. Used in singleMachine batch system. (default: 1)
--linkImports When using Toil’s importFile function for staging, input files are copied to the job store. Specifying this option saves space by sym-linking imported files. As long as caching is enabled Toil will protect the file automatically by changing the permissions to read-only.
--mesosMaster MESOSMASTERADDRESS
 The host and port of the Mesos master separated by a colon. (default: 169.233.147.202:5050)

Autoscaling Options

--provisioner CLOUDPROVIDER
 The provisioner for cluster auto-scaling. The currently supported choices are ‘aws’ or ‘gce’. The default is None.
--nodeTypes NODETYPES
 List of node types separated by commas. The syntax for each node type depends on the provisioner used. For the cgcloud and AWS provisioners this is the name of an EC2 instance type, optionally followed by a colon and the price in dollars to bid for a spot instance of that type, for example ‘c3.8xlarge:0.42’. If no spot bid is specified, nodes of this type will be non-preemptable. It is acceptable to specify an instance as both preemptable and non-preemptable by including it twice in the list. In that case, preemptable nodes of that type will be preferred when creating new nodes once the maximum number of preemptable nodes has been reached.
--nodeOptions NODEOPTIONS
 Options for provisioning the nodes. The syntax depends on the provisioner used. Neither the CGCloud nor the AWS provisioner supports any node options.
--minNodes MINNODES
 Minimum number of nodes of each type in the cluster, if using auto-scaling. This should be provided as a comma-separated list of the same length as the list of node types. default=0
--maxNodes MAXNODES
 Maximum number of nodes of each type in the cluster, if using autoscaling, provided as a comma-separated list. The first value is used as a default if the list length is less than the number of nodeTypes. default=10
--preemptableCompensation PREEMPTABLECOMPENSATION
 The preference of the autoscaler to replace preemptable nodes with non-preemptable nodes when preemptable nodes cannot be started for some reason. Defaults to 0.0. This value must be between 0.0 and 1.0, inclusive. A value of 0.0 disables such compensation; a value of 0.5 compensates two missing preemptable nodes with one non-preemptable node. A value of 1.0 replaces every missing preemptable node with a non-preemptable one.
--nodeStorage NODESTORAGE
 Specify the size, in gigabytes, of the root volume of worker nodes when they are launched. You may want to set this if your jobs require a lot of disk space. The default value is 50.
--metrics Enable the prometheus/grafana dashboard for monitoring CPU/RAM usage, queue size, and issued jobs.
--defaultMemory INT
 The default amount of memory to request for a job. Only applicable to jobs that do not specify an explicit value for this requirement. Standard suffixes like K, Ki, M, Mi, G or Gi are supported. Default is 2.0G
--defaultCores FLOAT
 The default number of CPU cores to dedicate to a job. Only applicable to jobs that do not specify an explicit value for this requirement. Fractions of a core (for example 0.1) are supported on some batch systems, namely Mesos and singleMachine. Default is 1.0
--defaultDisk INT
 The default amount of disk space to dedicate to a job. Only applicable to jobs that do not specify an explicit value for this requirement. Standard suffixes like K, Ki, M, Mi, G or Gi are supported. Default is 2.0G
--maxCores INT The maximum number of CPU cores to request from the batch system at any one time. Standard suffixes like K, Ki, M, Mi, G or Gi are supported.
--maxMemory INT
 The maximum amount of memory to request from the batch system at any one time. Standard suffixes like K, Ki, M, Mi, G or Gi are supported.
--maxDisk INT The maximum amount of disk space to request from the batch system at any one time. Standard suffixes like K, Ki, M, Mi, G or Gi are supported.
--retryCount RETRYCOUNT
 Number of times to retry a failing job before giving up and labeling job failed. default=1
--maxJobDuration MAXJOBDURATION
 Maximum runtime of a job (in seconds) before we kill it (this is a lower bound, and the actual time before killing the job may be longer).
--rescueJobsFrequency RESCUEJOBSFREQUENCY
 Period of time to wait (in seconds) between checks for missing or overlong jobs, that is, jobs which have been lost by the batch system.
--maxServiceJobs MAXSERVICEJOBS
 The maximum number of service jobs that can be run concurrently, excluding service jobs running on preemptable nodes. default=9223372036854775807
--maxPreemptableServiceJobs MAXPREEMPTABLESERVICEJOBS
 The maximum number of service jobs that can run concurrently on preemptable nodes. default=9223372036854775807
--deadlockWait DEADLOCKWAIT
 Time, in seconds, to tolerate the workflow running only the same service jobs, with no jobs to use them, before declaring the workflow to be deadlocked and stopping. default=60
--deadlockCheckInterval DEADLOCKCHECKINTERVAL
 Time, in seconds, to wait between checks to see if the workflow is stuck running only service jobs, with no jobs to use them. Should be shorter than --deadlockWait. May need to be increased if the batch system cannot enumerate running jobs quickly enough, or if polling for running jobs is placing an unacceptable load on a shared cluster. default=30
--statePollingWait STATEPOLLINGWAIT
 Time, in seconds, to wait before doing a scheduler query for job state. Return cached results if within the waiting period.
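
The list-valued autoscaling options earlier in this section (--nodeTypes, --maxNodes and --preemptableCompensation) can be illustrated with a short sketch. All three helper names are hypothetical and none of this is Toil's code. The node type syntax shown is the EC2 form described above, and rounding the compensation down to a whole node is an assumption, chosen because it reproduces the documented 0.5 and 1.0 cases.

```python
def parse_node_types(spec):
    """Parse a --nodeTypes value into (instance_type, spot_bid) tuples.

    A spot_bid of None marks a non-preemptable node type.
    """
    parsed = []
    for item in spec.split(","):
        name, sep, bid = item.strip().partition(":")
        parsed.append((name, float(bid) if sep else None))
    return parsed


def expand_max_nodes(spec, node_type_count):
    """Pad a --maxNodes list with its first value, as described above."""
    values = [int(v) for v in spec.split(",")]
    while len(values) < node_type_count:
        values.append(values[0])
    return values


def compensating_nodes(missing_preemptable, compensation):
    """Non-preemptable nodes to start for missing preemptable ones.

    Assumption: the product is rounded down to a whole node.
    """
    if not 0.0 <= compensation <= 1.0:
        raise ValueError("compensation must be between 0.0 and 1.0")
    return int(missing_preemptable * compensation)


node_types = parse_node_types("c3.8xlarge:0.42,c3.8xlarge")
print(node_types)
print(expand_max_nodes("10", len(node_types)))
print(compensating_nodes(2, 0.5))
```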

Miscellaneous Options

--disableCaching
 Disables caching in the file store. This flag must be set to use a batch system that does not support caching such as Grid Engine, Parasol, LSF, or Slurm.
--disableChaining
 Disables chaining of jobs (chaining uses one job’s resource allocation for its successor job if possible).
--maxLogFileSize MAXLOGFILESIZE
 The maximum size of a job log file to keep (in bytes), log files larger than this will be truncated to the last X bytes. Setting this option to zero will prevent any truncation. Setting this option to a negative value will truncate from the beginning. Default=62.5 K
--writeLogs FILEPATH
 Write worker logs received by the leader into their own files at the specified path. Any non-empty standard output and error from failed batch system jobs will also be written into files at this path. The current working directory will be used if a path is not specified explicitly. Note: By default only the logs of failed jobs are returned to leader. Set log level to ‘debug’ to get logs back from successful jobs, and adjust ‘maxLogFileSize’ to control the truncation limit for worker logs.
--writeLogsGzip FILEPATH
 Identical to --writeLogs except the logs files are gzipped on the leader.
--realTimeLogging
 Enable real-time logging from workers to the leader.
--sseKey SSEKEY
 Path to file containing 32 character key to be used for server-side encryption on awsJobStore or googleJobStore. SSE will not be used if this flag is not passed.
--setEnv NAME Set an environment variable early on in the worker. The argument may be given as NAME=VALUE or as NAME alone; -e NAME=VALUE and -e NAME are also valid. If VALUE is omitted, it will be looked up in the current environment. Independently of this option, the worker will try to emulate the leader’s environment before running a job. Using this option, a variable can be injected into the worker process itself before it is started.
--servicePollingInterval SERVICEPOLLINGINTERVAL
 Interval of time service jobs wait between polling for the existence of the keep-alive flag (default=60)
--debugWorker Experimental no forking mode for local debugging. Specifically, workers are not forked and stderr/stdout are not redirected to the log. (default=False)
--disableProgress
 Disables the progress bar shown when standard error is a terminal.
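
The two accepted forms of --setEnv described above (NAME=VALUE, or NAME alone with the value taken from the current environment) can be sketched as follows. The helper resolve_set_env is hypothetical, and raising KeyError for a name missing from the environment is an assumption; Toil's actual behaviour in that case is not stated here.

```python
import os


def resolve_set_env(spec, environ=None):
    """Turn a --setEnv argument into a (name, value) pair.

    Hypothetical helper for illustration; not Toil's implementation.
    """
    environ = os.environ if environ is None else environ
    name, sep, value = spec.partition("=")
    if not name:
        raise ValueError("environment variable name is empty: %r" % (spec,))
    if sep:
        # NAME=VALUE form: use the value exactly as given.
        return name, value
    # NAME form: look the value up in the current environment.
    # Assumption: a missing variable is an error (KeyError here).
    return name, environ[name]


# An explicit environment mapping is passed so the example is reproducible.
example_env = {"SCRATCH": "/scratch"}
print(resolve_set_env("THREADS=4", example_env))
print(resolve_set_env("SCRATCH", example_env))
```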

Restart Option

In the event of failure, Toil can resume the pipeline if you add the argument --restart and rerun the Python script. Toil pipelines can even be edited and then resumed, which is useful for development or troubleshooting.

Running Workflows with Services

Toil supports jobs, or clusters of jobs, that run as services to other accessor jobs. Example services include server databases or Apache Spark Clusters. As service jobs exist to provide services to accessor jobs, their runtime depends on their accessor jobs running concurrently. These dependencies can create deadlocks: the workflow hangs because only service jobs are running, and their accessor jobs cannot be scheduled because there are too few resources to run both simultaneously. Toil attempts to schedule services and accessors intelligently to cope with this situation. However, to avoid a deadlock in workflows running service jobs, it is advisable to use the following parameters:

  • --maxServiceJobs: The maximum number of service jobs that can be run concurrently, excluding service jobs running on preemptable nodes.
  • --maxPreemptableServiceJobs: The maximum number of service jobs that can run concurrently on preemptable nodes.

Specifying these parameters so that, at the maximum cluster size, there are sufficient resources to run accessors in addition to services ensures that such a deadlock cannot occur.

If too low a limit is specified, a different deadlock can occur, in which Toil cannot schedule enough service jobs concurrently to complete the workflow. Toil will detect this situation if it occurs and raise a toil.DeadlockException. Increasing the cluster size and these limits will resolve the issue.
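
One way to pick a value for --maxServiceJobs is to reserve part of the cluster for accessor jobs and divide what remains among the services. The sketch below is a back-of-the-envelope calculation under simplifying assumptions: it considers CPU cores only (not memory or disk), treats all services as the same size, and uses a hypothetical function name and example numbers that do not come from Toil.

```python
def service_job_limit(cluster_cores, cores_per_service, cores_for_accessors):
    """Largest number of concurrent services that leaves room for accessors.

    Hypothetical sizing helper; only CPU cores are taken into account.
    """
    if cores_per_service <= 0:
        raise ValueError("cores_per_service must be positive")
    available = cluster_cores - cores_for_accessors
    return max(available // cores_per_service, 0)


# Example: a 32-core cluster at maximum size, 4 cores per service,
# and 8 cores kept free for accessor jobs.
limit = service_job_limit(cluster_cores=32, cores_per_service=4,
                          cores_for_accessors=8)
print("--maxServiceJobs", limit)
```

A limit of zero from this calculation means the cluster cannot host a service and its accessors at the same time, which corresponds to the case where the cluster size itself has to grow.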

Setting Options directly with the Toil Script

Commandline options can be overridden in the Toil script itself. For example, toil.job.Job.Runner.getDefaultOptions() returns an options object populated with the defaults. In the example below, the script ignores any commandline args, runs with the default options, and always uses the “./toilWorkflow” directory as the job store:

from toil.common import Toil
from toil.job import Job

options = Job.Runner.getDefaultOptions("./toilWorkflow") # Get the options object

with Toil(options) as toil:
    toil.start(Job())  # Run the script

However, each option can be explicitly set within the script by assigning to attributes of the options object. In this example, we set logLevel = "DEBUG" (all log statements are shown) and clean = "always" (always delete the job store):

options = Job.Runner.getDefaultOptions("./toilWorkflow") # Get the options object
options.logLevel = "DEBUG" # Set the log level to the debug level.
options.clean = "always" # Always delete the jobStore after a run

with Toil(options) as toil:
    toil.start(Job())  # Run the script

However, the usual incantation is to accept commandline args from the user with the following:

parser = Job.Runner.getDefaultArgumentParser() # Get the parser
options = parser.parse_args() # Parse user args to create the options object

with Toil(options) as toil:
    toil.start(Job())  # Run the script

The script can then, of course, still set options explicitly as before; these assignments overwrite any user-supplied args:

parser = Job.Runner.getDefaultArgumentParser() # Get the parser
options = parser.parse_args() # Parse user args to create the options object
options.logLevel = "DEBUG" # Set the log level to the debug level.
options.clean = "always" # Always delete the jobStore after a run

with Toil(options) as toil:
    toil.start(Job())  # Run the script
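
The precedence described above (values assigned in the script win over values typed by the user) follows from how argparse namespaces work. The sketch below demonstrates this with a plain argparse parser standing in for the one returned by Job.Runner.getDefaultArgumentParser(). The stand-in parser, its defaults and the function names are assumptions for illustration, and it uses the lowercase ‘always’ choice listed for --clean.

```python
import argparse


def build_stand_in_parser():
    """A plain argparse parser with a few Toil-like options (stand-in only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("jobStore")
    parser.add_argument("--logLevel", default="INFO")
    parser.add_argument("--clean", default=None)
    return parser


def parse_with_overrides(argv):
    """Parse user args, then overwrite two options from within the script."""
    options = build_stand_in_parser().parse_args(argv)
    options.logLevel = "DEBUG"  # Replaces whatever the user passed
    options.clean = "always"    # Replaces whatever the user passed
    return options


# The user asks for ERROR logging and no cleanup, but the script's
# assignments take effect because they run after parse_args().
options = parse_with_overrides(
    ["./toilWorkflow", "--logLevel", "ERROR", "--clean", "never"])
print(options.jobStore, options.logLevel, options.clean)
```

To let the user keep control of an option, leave it unassigned in the script, or assign it only when the parsed value is still at its default.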