Auto-Deployment
If you want to run a Toil Python workflow in a distributed environment, on multiple worker machines, either in the cloud or on a bare-metal cluster, the Python code needs to be made available to those other machines. If the workflow’s main module imports other modules, those modules also need to be made available on the workers. Toil can automatically do that for you, with a little help on your part. We call this feature auto-deployment of a workflow.
Let’s first examine the various scenarios for auto-deploying a workflow, including some setups that, as we’ll see shortly, cannot be auto-deployed. Lastly, we’ll deal with the issue of declaring Toil as a dependency of a workflow that is packaged as a setuptools distribution.
Toil can be easily deployed to a remote host. First, assuming you’ve followed our Preparing your AWS environment section to install Toil and used it to create a remote leader node on (in this example) AWS, you can now log into that node using the Ssh-Cluster Command. Once on the remote host, create and activate a virtualenv, making sure to use the --system-site-packages option:
$ virtualenv --system-site-packages venv
$ . venv/bin/activate
Note the --system-site-packages option, which ensures that globally-installed packages are accessible inside the virtualenv. Do not (re)install Toil after this! The --system-site-packages option has already made Toil and its dependencies from the host's global installation available for you.
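To double-check that a virtualenv really was created with that option, one approach is to read its pyvenv.cfg file. The sketch below assumes the environment was created by virtualenv 20 or later (or by the standard venv module), both of which record an include-system-site-packages key there; the helper name is hypothetical and not part of Toil.

```python
from pathlib import Path


def includes_system_site_packages(venv_dir):
    """Return True if the virtualenv at venv_dir exposes global packages.

    Assumption: the environment contains a pyvenv.cfg file with an
    'include-system-site-packages' key (virtualenv >= 20 or stdlib venv).
    """
    cfg = Path(venv_dir) / "pyvenv.cfg"
    for line in cfg.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "include-system-site-packages":
            return value.strip().lower() == "true"
    # Key not present: treat the environment as isolated.
    return False
```

Calling includes_system_site_packages("venv") from the directory in which the virtualenv was created should return True for the setup described above.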
From here, you can install a project and its dependencies:
$ tree
.
├── util
│   ├── __init__.py
│   └── sort
│       ├── __init__.py
│       └── quick.py
└── workflow
    ├── __init__.py
    └── main.py
3 directories, 5 files
$ pip install matplotlib
$ cp -R workflow util venv/lib/python3.9/site-packages
Ideally, your project would have a setup.py file (see setuptools) which streamlines the installation process:
$ tree
.
├── util
│   ├── __init__.py
│   └── sort
│       ├── __init__.py
│       └── quick.py
├── workflow
│   ├── __init__.py
│   └── main.py
└── setup.py
3 directories, 6 files
$ pip install .
Or, if your project has been published to PyPI:
$ pip install my-project
In each case, we have created a virtualenv with the --system-site-packages flag in the venv subdirectory, then installed the matplotlib distribution from PyPI along with the two packages that our project consists of. (Again, both Python and Toil are assumed to be present on the leader and all worker nodes.)
We can now run our workflow:
$ python3 main.py --batchSystem=kubernetes …
Important
If the workflow’s external dependencies contain native code (i.e. are not pure Python), then they must be manually installed on each worker.
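One rough way to spot such dependencies is to look for compiled extension files among the files a distribution installed. The sketch below uses the standard importlib.metadata module; the suffix list is a heuristic assumption, the helper names are hypothetical, and this is not how Toil itself decides what to deploy.

```python
from importlib import metadata

# Heuristic assumption: these suffixes indicate compiled (non-Python) code.
NATIVE_SUFFIXES = (".so", ".pyd", ".dylib", ".dll")


def has_native_files(file_names):
    """Return True if any of the given file names looks like native code."""
    return any(str(name).endswith(NATIVE_SUFFIXES) for name in file_names)


def distribution_has_native_code(dist_name):
    """Check the recorded files of an installed distribution.

    Raises importlib.metadata.PackageNotFoundError if dist_name is not
    installed. Distributions without a file record are treated as pure Python.
    """
    files = metadata.files(dist_name) or []
    return has_native_files(files)
```

For example, distribution_has_native_code("matplotlib") could be run inside the virtualenv to decide whether that dependency needs a manual install on the workers.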
Warning
Neither python3 setup.py develop nor pip install -e . can be used in this process as, instead of copying the source files, they create .egg-link files that Toil can’t auto-deploy. Similarly, python3 setup.py install doesn’t work either, as it installs the project as a Python .egg, which is also not currently supported by Toil (though it could be in the future).
Also note that using the --single-version-externally-managed flag with setup.py will prevent the installation of your package as an .egg. It will also disable the automatic installation of your project’s dependencies.
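To check whether a virtualenv already contains installs of the kinds described in this warning, a small scan of its site-packages directory can help. This is a minimal sketch with a hypothetical helper name; it only looks for the .egg-link and .egg entries mentioned above and does not detect other editable-install mechanisms that newer packaging tools may use.

```python
from pathlib import Path


def find_unsupported_installs(site_packages):
    """List .egg-link files and .egg entries found in site_packages.

    Returns a dict with two sorted lists of entry names.
    """
    root = Path(site_packages)
    return {
        "egg_links": sorted(p.name for p in root.glob("*.egg-link")),
        "eggs": sorted(p.name for p in root.glob("*.egg")),
    }
```

With the layout used earlier, this could be called as find_unsupported_installs("venv/lib/python3.9/site-packages"); two empty lists suggest that neither kind of install is present.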
Auto Deployment with Sibling Python Files
This scenario applies if a Python workflow imports files that are its siblings:
$ cd my_project
$ ls
userScript.py utilities.py
$ ./userScript.py --batchSystem=kubernetes …
Here userScript.py imports additional functionality from utilities.py. Toil detects that userScript.py has sibling Python files and copies them to the workers, alongside the main Python file. Note that sibling Python files will be auto-deployed regardless of whether they are actually imported by the workflow: all .py files residing in the same directory as the main workflow Python file will be auto-deployed.
This structure is a suitable method of organizing the source code of reasonably complicated workflows.
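To preview which files fall under this rule for a given script, the directory can be listed directly. The helper below is a hypothetical illustration of the rule as stated above (every .py file next to the main script), not Toil's internal implementation.

```python
from pathlib import Path


def sibling_python_files(main_script):
    """Return the .py files that live next to main_script, sorted by path.

    The main script itself is excluded from the result.
    """
    script = Path(main_script).resolve()
    return sorted(
        p for p in script.parent.glob("*.py")
        if p.is_file() and p != script
    )
```

For the my_project directory shown above, sibling_python_files("userScript.py") would list only utilities.py.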
Auto-Deploying a Package Hierarchy
Recall that in Python, a package is a directory containing one or more .py files, one of which must be called __init__.py, and optionally other packages. For more involved workflows that contain a significant amount of code, this is the recommended way of organizing the source code. Because we use a package hierarchy, the main workflow file is actually a Python module. It is merely one of the modules in the package hierarchy. We need to inform Toil that we want to use a package hierarchy by invoking Python’s -m option. This enables Toil to identify the entire set of modules belonging to the workflow and copy all of them to each worker. Note that while using the -m option is optional in the scenarios above, it is mandatory in this one.
The following shell session illustrates this:
$ cd my_project
$ tree
.
├── util
│   ├── __init__.py
│   └── sort
│       ├── __init__.py
│       └── quick.py
└── workflow
    ├── __init__.py
    └── main.py
3 directories, 5 files
$ python3 -m workflow.main --batchSystem=kubernetes …
Here the workflow entry point module main.py does not reside in the current directory, but is part of a package called workflow, in a subdirectory of the current directory. Additional functionality is in a separate module called util.sort.quick, which corresponds to util/sort/quick.py. Because we invoke the workflow via python3 -m workflow.main, Toil can determine the root directory of the hierarchy (my_project in this case) and copy all Python modules underneath it to each worker. The -m option is described in Python's command-line documentation.
When -m is passed, Python adds the current working directory to sys.path, the list of root directories to be considered when resolving a module name like workflow.main. Without that added convenience we’d have to run the workflow as PYTHONPATH="$PWD" python3 -m workflow.main. This also means that Toil can detect the root directory of the invoked module’s package hierarchy even if it isn’t the current working directory. In other words we could do this:
$ cd my_project
$ export PYTHONPATH="$PWD"
$ cd /some/other/dir
$ python3 -m workflow.main --batchSystem=kubernetes …
Also note that the root directory itself must not be a package, i.e. it must not contain an __init__.py.