Toil Utilities¶
Toil includes some utilities for inspecting or manipulating workflows during and after their execution. (There are additional Toil Cluster Utilities available for working with Toil-managed clusters in the cloud.)
The generic toil subcommand utilities are:
stats — Reports runtime and resource usage for all jobs in a specified job store (the workflow must have originally been run using the --stats option).
status — Inspects a job store to see which jobs have failed, run successfully, etc.
debug-job — Runs a failing job on your local machine.
clean — Deletes the job store used by a previous Toil workflow invocation.
kill — Kills any running jobs in a rogue Toil workflow.
For information on a specific utility, run it with the --help option:
toil stats --help
Stats Command¶
To use the stats command, a workflow must first be run using the --stats option. Using this option ensures that Toil does not delete the job store, no matter what other options are specified (i.e. normally --clean=always would delete the job store, but --stats overrides this).
Running an Example¶
We can run an example workflow and record stats:
python3 workflow.py file:my-jobstore --stats
Where workflow.py is the following:
import os
import math
import time
from multiprocessing import Process

from toil.common import Toil
from toil.job import Job


def think(seconds):
    start = time.time()
    while time.time() - start < seconds:
        # Use CPU
        math.sqrt(123456)


class TimeWaster(Job):
    def __init__(self, time_to_think, time_to_waste, space_to_waste, *args, **kwargs):
        self.time_to_think = time_to_think
        self.time_to_waste = time_to_waste
        self.space_to_waste = space_to_waste
        super().__init__(*args, **kwargs)

    def run(self, fileStore):
        # Waste some space
        file_path = fileStore.getLocalTempFile()
        with open(file_path, 'w') as stream:
            for i in range(self.space_to_waste):
                stream.write("X")

        # Do some "useful" compute
        processes = []
        for core_number in range(max(1, self.cores)):
            # Use all the assigned cores to think
            p = Process(target=think, args=(self.time_to_think,))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

        # Also waste some time
        time.sleep(self.time_to_waste)


def main():
    options = Job.Runner.getDefaultArgumentParser().parse_args()

    job1 = TimeWaster(0, 0, 0, displayName='doNothing')
    job2 = TimeWaster(10, 0, 4096, displayName='efficientJob')
    job3 = TimeWaster(10, 0, 1024, cores=4, displayName='multithreadedJob')
    job4 = TimeWaster(1, 9, 65536, displayName='inefficientJob')

    job1.addChild(job2)
    job1.addChild(job3)
    job3.addChild(job4)

    with Toil(options) as toil:
        if not toil.options.restart:
            toil.start(job1)
        else:
            toil.restart()


if __name__ == '__main__':
    main()
Notice the displayName keyword argument, which can rename a job, giving it an alias when it is finally displayed in stats.
Displaying Stats¶
To see the runtime and resources used for each job when it was run, type
toil stats file:my-jobstore
This should output something like the following:
Batch System: single_machine
Default Cores: 1 Default Memory: 2097152KiB
Max Cores: unlimited
Local CPU Time: 55.54 core·s Overall Runtime: 26.23 s
Worker
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
3 | 0.34 10.83 10.80 21.23 32.40 | 0.33 10.43 17.94 43.07 53.83 | 0.01 0.40 14.08 41.85 42.25 | 177168Ki 179312Ki 178730Ki 179712Ki 536192Ki | 0Ki 4Ki 22Ki 64Ki 68Ki
Job
Worker Jobs | min med ave max
| 1 1 1.3333 2
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
4 | 0.33 10.83 8.10 10.85 32.38 | 0.33 10.43 13.46 41.70 53.82 | 0.01 1.68 2.78 9.02 11.10 | 177168Ki 179488Ki 178916Ki 179696Ki 715664Ki | 0Ki 4Ki 18Ki 64Ki 72Ki
multithreadedJob
Total Cores: 4.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.85 10.85 10.85 10.85 10.85 | 41.70 41.70 41.70 41.70 41.70 | 1.68 1.68 1.68 1.68 1.68 | 179488Ki 179488Ki 179488Ki 179488Ki 179488Ki | 4Ki 4Ki 4Ki 4Ki 4Ki
efficientJob
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.83 10.83 10.83 10.83 10.83 | 10.43 10.43 10.43 10.43 10.43 | 0.40 0.40 0.40 0.40 0.40 | 179312Ki 179312Ki 179312Ki 179312Ki 179312Ki | 4Ki 4Ki 4Ki 4Ki 4Ki
inefficientJob
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.38 10.38 10.38 10.38 10.38 | 1.36 1.36 1.36 1.36 1.36 | 9.02 9.02 9.02 9.02 9.02 | 179696Ki 179696Ki 179696Ki 179696Ki 179696Ki | 64Ki 64Ki 64Ki 64Ki 64Ki
doNothing
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 0.33 0.33 0.33 0.33 0.33 | 0.33 0.33 0.33 0.33 0.33 | 0.01 0.01 0.01 0.01 0.01 | 177168Ki 177168Ki 177168Ki 177168Ki 177168Ki | 0Ki 0Ki 0Ki 0Ki 0Ki
This report gives information on the resources used by your workflow. Note that right now it does NOT track CPU and memory used inside Docker containers, only Singularity containers.
There are three parts to this report.
Overall Summary¶
At the top is a section with overall summary statistics for the run:
Batch System: single_machine
Default Cores: 1 Default Memory: 2097152KiB
Max Cores: unlimited
Local CPU Time: 55.54 core·s Overall Runtime: 26.23 s
This lists some important settings for the Toil batch system that actually executed the jobs. It also lists:
- The CPU time used on the local machine, in core seconds. This includes time used by the Toil leader itself (excluding some startup time), and time used by jobs that run under the leader (which, for the single_machine batch system, is all jobs). It does not include CPU used by jobs that ran on other machines.
- The overall wall-clock runtime of the workflow in seconds, as measured by the leader.

These latter two numbers don't count some startup/shutdown time spent loading and saving files, so you may still want to use the time shell built-in to time your Toil runs overall.
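As a rough sanity check, dividing the local CPU time by the overall runtime gives the average number of cores the workflow kept busy. The helper below is ours, purely to illustrate the arithmetic with the numbers from the report above:

```python
def average_parallelism(cpu_core_seconds, wall_seconds):
    """Average number of cores kept busy over the whole run."""
    return cpu_core_seconds / wall_seconds

# From the report: 55.54 core-s of local CPU time over 26.23 s of runtime.
busy_cores = average_parallelism(55.54, 26.23)
print(f"{busy_cores:.2f}")  # about 2.12 cores busy on average
```

A value well below the total cores the workflow reserved suggests a lot of reserved capacity sat idle; the CPU Wait columns below break that down per worker and per job.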
Worker Summary¶
After the overall summary, there is a section with statistics about the Toil worker processes, which Toil used to execute your workflow’s jobs:
Worker
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
3 | 0.34 10.83 10.80 21.23 32.40 | 0.33 10.43 17.94 43.07 53.83 | 0.01 0.40 14.08 41.85 42.25 | 177168Ki 179312Ki 178730Ki 179712Ki 536192Ki | 0Ki 4Ki 22Ki 64Ki 68Ki
- The Count column shows that, to run this workflow, Toil had to submit 3 Toil worker processes to the backing scheduler. (In this case, it ran them all on the local machine.)
- The Real Time column shows statistics about the wall-clock times that the worker processes took. All the sub-column values are in seconds.
- The CPU Time column shows statistics about the CPU usage of the worker processes. All the sub-column values are in core seconds.
- The CPU Wait column shows statistics about CPU time reserved for but not consumed by worker processes. In this example, the max and total are relatively high compared to both real time and CPU time, indicating that a lot of reserved CPU time went unused. This can indicate that the workflow is overestimating its required cores, that small jobs are running in the same resource reservations as large jobs via chaining, or that the workflow is having to wait around for slow disk I/O.
- The Memory column shows the peak memory usage of each worker process and its child processes.
- The Disk column shows the disk usage in each worker. This is polled at the end of each job that is run by the worker, so it may not always reflect the actual peak disk usage.
Job Breakdown¶
Finally, there is the breakdown of resource usage by jobs. This starts with a table summarizing the counts of jobs that ran on each worker:
Job
Worker Jobs | min med ave max
| 1 1 1.3333 2
In this example, most of the workers ran one job each, but one worker managed to run two jobs, via chaining. (Jobs will chain when a job has only one dependent job, which in turn depends on only that first job, and the second job needs no more resources than the first job did.)
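The chaining condition described above can be sketched as a predicate. This is an illustrative simplification, not Toil's actual scheduling code; the job dicts and the can_chain helper are hypothetical:

```python
def can_chain(first, second, successors):
    """Illustrative check: `second` may chain onto `first`'s worker if it is
    `first`'s only successor, depends only on `first`, and fits within the
    same resource reservation."""
    return (
        successors[first["name"]] == [second["name"]]
        and second["predecessors"] == [first["name"]]
        and second["cores"] <= first["cores"]
        and second["memory"] <= first["memory"]
        and second["disk"] <= first["disk"]
    )

a = {"name": "A", "cores": 1, "memory": 2048, "disk": 1024, "predecessors": []}
b = {"name": "B", "cores": 1, "memory": 1024, "disk": 512, "predecessors": ["A"]}
successors = {"A": ["B"], "B": []}
print(can_chain(a, b, successors))  # True: B is A's sole successor and fits
```

If B instead asked for more cores than A reserved, it could not reuse A's reservation and would need its own worker.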
Next, we have statistics for resource usage over all jobs together:
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
4 | 0.33 10.83 8.10 10.85 32.38 | 0.33 10.43 13.46 41.70 53.82 | 0.01 1.68 2.78 9.02 11.10 | 177168Ki 179488Ki 178916Ki 179696Ki 715664Ki | 0Ki 4Ki 18Ki 64Ki 72Ki
And finally, for each kind of job (as determined by the job's displayName), we have statistics summarizing the resources used by the instances of that kind of job:
multithreadedJob
Total Cores: 4.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.85 10.85 10.85 10.85 10.85 | 41.70 41.70 41.70 41.70 41.70 | 1.68 1.68 1.68 1.68 1.68 | 179488Ki 179488Ki 179488Ki 179488Ki 179488Ki | 4Ki 4Ki 4Ki 4Ki 4Ki
efficientJob
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.83 10.83 10.83 10.83 10.83 | 10.43 10.43 10.43 10.43 10.43 | 0.40 0.40 0.40 0.40 0.40 | 179312Ki 179312Ki 179312Ki 179312Ki 179312Ki | 4Ki 4Ki 4Ki 4Ki 4Ki
inefficientJob
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 10.38 10.38 10.38 10.38 10.38 | 1.36 1.36 1.36 1.36 1.36 | 9.02 9.02 9.02 9.02 9.02 | 179696Ki 179696Ki 179696Ki 179696Ki 179696Ki | 64Ki 64Ki 64Ki 64Ki 64Ki
doNothing
Total Cores: 1.0
Count | Real Time (s)* | CPU Time (core·s) | CPU Wait (core·s) | Memory (B) | Disk (B)
n | min med* ave max total | min med ave max total | min med ave max total | min med ave max total | min med ave max total
1 | 0.33 0.33 0.33 0.33 0.33 | 0.33 0.33 0.33 0.33 0.33 | 0.01 0.01 0.01 0.01 0.01 | 177168Ki 177168Ki 177168Ki 177168Ki 177168Ki | 0Ki 0Ki 0Ki 0Ki 0Ki
For each job, we first list its name, and then the total cores that it asked for, summed across all instances of it. Then we show a table of statistics.
Here the * marker in the table headers becomes relevant; it shows that jobs are being sorted by the median of the real time used. You can control this with the --sortCategory option.
The columns' meanings are the same as for the workers:
- The Count column shows the number of jobs of each type that ran.
- The Real Time column shows statistics about the wall-clock times that instances of the job type took. All the sub-column values are in seconds.
- The CPU Time column shows statistics about the CPU usage of each job. Note that multithreadedJob managed to use CPU time faster than one core second per second, because it reserved multiple cores and ran multiple threads.
- The CPU Wait column shows statistics about CPU time reserved for but not consumed by jobs. Note that inefficientJob used hardly any of the cores it requested for most of its real time.
- The Memory column shows the peak memory usage of each job.
- The Disk column shows the disk usage at the end of each job. It may not always reflect the actual peak disk usage.
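The CPU Wait figures are consistent with reserved-but-unused core time, i.e. cores × real time − CPU time. A quick check against the inefficientJob row above (the cpu_wait helper is ours, for illustration):

```python
def cpu_wait(cores, real_time_s, cpu_time_core_s):
    """Core-seconds reserved but not used: reserved core time minus used CPU time."""
    return cores * real_time_s - cpu_time_core_s

# inefficientJob held 1 core for 10.38 s of real time but used only 1.36 core-s.
print(round(cpu_wait(1, 10.38, 1.36), 2))  # 9.02, matching the report
```

The multithreadedJob row works out similarly: 4 × 10.85 − 41.70 ≈ 1.70 core·s, versus the reported 1.68, allowing for rounding.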
Example Cleanup¶
Once we’re done looking at the stats, we can clean up the job store by running:
toil clean file:my-jobstore
Status Command¶
Continuing the example from the stats section above, if we ran our workflow with the command
python3 workflow.py file:my-jobstore --stats
we could interrogate our job store with the status command, for example:
toil status file:my-jobstore
If the run was successful, this will not return much valuable information; the output will look something like:
2018-01-11 19:31:29,739 - toil.lib.bioio - INFO - Root logger is at level 'INFO', 'toil' logger at level 'INFO'.
2018-01-11 19:31:29,740 - toil.utils.toilStatus - INFO - Parsed arguments
2018-01-11 19:31:29,740 - toil.utils.toilStatus - INFO - Checking if we have files for Toil
The root job of the job store is absent, the workflow completed successfully.
Otherwise, the toil status command will return something like the following:
Of the 3 jobs considered, there are 1 completely failed jobs, 1 jobs with children, 2 jobs ready to run, 0 zombie jobs, 0 jobs with services, 0 services, and 0 jobs with log files currently in FileJobStore(/Users/anovak/workspace/toil/tree).
The toil status command supports several useful flags, including --perJob to get per-job status information, --logs to print stored worker logs, and --failed to list all failed jobs in the workflow. For more information, run toil status --help.
Clean Command¶
If a Toil pipeline didn't finish successfully, or was run using --clean=never or --stats, the job store will persist until it is deleted. toil clean <jobStore> ensures that all artifacts associated with a job store are removed.
This is particularly useful for deleting AWS job stores, which reserve an SDB domain as well as an S3 bucket.
The deletion of the job store can be modified by the --clean argument, which may be set to always, onError, never, or onSuccess (default).
Temporary directories where jobs are running can also be saved from deletion using the --cleanWorkDir option, which has the same options as --clean. This option should only be used when debugging, as intermediate job files will fill up disk space.
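The interaction between the --clean policy, the workflow outcome, and --stats can be summarized with a small decision function. This is an illustrative sketch of the documented behavior, not Toil's internal code:

```python
def should_delete_job_store(clean, workflow_failed, stats=False):
    """Whether the job store is deleted after a run, per the --clean policy.
    --stats always preserves the job store so `toil stats` can read it later."""
    if stats:
        return False
    if clean == "always":
        return True
    if clean == "never":
        return False
    if clean == "onError":
        return workflow_failed
    if clean == "onSuccess":  # the default
        return not workflow_failed
    raise ValueError(f"unknown --clean value: {clean}")

print(should_delete_job_store("onSuccess", workflow_failed=False))  # True
print(should_delete_job_store("always", workflow_failed=False, stats=True))  # False
```

This is why a failed run under the default policy, or any run with --stats, leaves a job store behind that toil clean must remove.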
Debug Job Command¶
If a Toil workflow fails, and it wasn't run with --clean=always, the failing job will be waiting in the job store to be debugged. (With WDL or CWL workflows, you may have needed to manually set a --jobStore location you can find again.)
You can use toil debug-job on a job in the job store to run it on your local machine, to locally reproduce any error that may have happened during a remote workflow.
The toil debug-job command takes a job store and the ID or name of a job in it. If multiple jobs match a job name, and only one of them seems to have run out of retries and completely failed, it will run that one.
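The name-matching behavior just described can be sketched as follows. The job records and the pick_job helper are hypothetical, for illustration only:

```python
def pick_job(matches):
    """Illustrative selection: if several jobs match a name, pick the single one
    that has exhausted its retries, if there is exactly one such job."""
    if len(matches) == 1:
        return matches[0]
    failed = [j for j in matches if j["remaining_retries"] == 0]
    if len(failed) == 1:
        return failed[0]
    return None  # ambiguous: give a job ID instead of a name

jobs = [
    {"id": "kind-a/1", "remaining_retries": 2},
    {"id": "kind-a/2", "remaining_retries": 0},
]
print(pick_job(jobs)["id"])  # kind-a/2, the completely failed instance
```

When no single completely-failed job stands out, identify the job by its ID instead of its name.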
You can also pass the --printJobInfo flag to dump information about the job instead of running it.
Kill Command¶
To kill all currently running jobs for a given job store, use the command:
toil kill file:my-jobstore