Quickstart

Dask-Yarn is designed to be used like any other Python library: install it locally and use it in your code (either interactively or as part of an application). As long as the computer you’re deploying on has access to the YARN cluster (usually an edge node), everything should work fine.

Install Dask-Yarn on an Edge Node

Dask-Yarn is designed to be used from an edge node. To install, use either conda or pip to create a new environment and install dask-yarn on the edge node.

Conda Environments:

Create a new conda environment with dask-yarn installed. You may also want to add any other packages you rely on for your work.

$ conda create -n my_env dask-yarn      # Create an environment

$ conda activate my_env                 # Activate the environment

Virtual Environments:

Create a new virtual environment with dask-yarn installed. You may also want to add any other packages you rely on for your work.

$ python -m venv my_env                 # Create an environment using venv

$ source my_env/bin/activate            # Activate the environment

$ pip install dask-yarn                 # Install dask-yarn

Package Your Environment for Distribution

We need to ensure that the libraries used on the YARN cluster are the same as those you are using locally. By default, dask-yarn handles this by distributing a packaged Python environment to the YARN cluster as part of the application. This is typically handled using conda-pack for conda environments, or venv-pack for virtual environments.

See Managing Python Environments for more information.

Conda Environments:

If you haven’t already installed conda-pack, you’ll need to do so now. You can either install it in the environment to be packaged, or your root environment (where it will be available to use in all environments).

$ conda install -c conda-forge conda-pack   # Install conda-pack

$ conda-pack                                # Package environment
Collecting packages...
Packing environment at '/home/username/miniconda/envs/my_env' to 'my_env.tar.gz'
[########################################] | 100% Completed |  12.2s

Virtual Environments:

If you haven’t already installed venv-pack, you’ll need to do so now.

$ pip install venv-pack                     # Install venv-pack

$ venv-pack                                 # Package environment
Collecting packages...
Packing environment at '/home/username/my_env' to 'my_env.tar.gz'
[########################################] | 100% Completed |  8.3s
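Once packaged, dask-yarn needs to know where the archive lives. You can pass the path directly via the environment= argument of YarnCluster (as in the Usage section below), or set it once through Dask's configuration system, which dask-yarn reads under the yarn namespace. A minimal sketch, assuming the archive produced above sits in your working directory:

import dask

# Point dask-yarn at the packaged archive via the ``yarn.environment``
# configuration key, so ``environment=`` can be omitted later
dask.config.set({'yarn.environment': 'my_env.tar.gz'})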

Kinit (Optional)

If your cluster is configured to use Kerberos for authentication, you need to make sure you have an active ticket-granting ticket before continuing:

$ kinit
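If you're unsure whether you already have a valid ticket, the standard klist command lists the contents of your credential cache:

$ klist                                 # List cached Kerberos tickets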

Usage

To start a YARN cluster, create an instance of YarnCluster. This constructor takes several parameters; any you leave unspecified fall back to the defaults defined in the Configuration.

from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has two cores and eight GiB of memory
cluster = YarnCluster(environment='my_env.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GiB")

# Connect to the cluster
client = Client(cluster)

# Do some work here

# Shut down the client and cluster (alternatively, use a context manager as shown below):
client.shutdown()
cluster.shutdown()
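For the "do some work" step, any Dask computation runs on the YARN workers once the client is connected. A minimal illustrative sketch using dask.array:

import dask.array as da

# Build a lazy 10,000 x 10,000 random array in 1,000 x 1,000 chunks,
# then compute its mean on the cluster workers
x = da.random.random((10000, 10000), chunks=(1000, 1000))
print(x.mean().compute())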

By default, no workers are started on cluster creation. To change the number of workers, use the YarnCluster.scale() method. When scaling up, new workers will be requested from YARN. When scaling down, workers will be intelligently selected and scaled down gracefully, freeing up resources.

# Scale up to 10 workers
cluster.scale(10)

# ...

# Scale back down to 2 workers
cluster.scale(2)
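If you want to verify the result of a scaling operation, YarnCluster also provides a workers() method that lists the currently running worker containers. A quick sketch:

# Check how many worker containers are currently running
print(len(cluster.workers()))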

Alternatively, you can enable adaptive scaling using the YarnCluster.adapt() method. When enabled, the cluster will scale up and down automatically depending on usage. Here we turn on adaptive scaling, bounded at a minimum of 2 workers and a maximum of 10 workers.

# Adaptively scale between 2 and 10 workers
cluster.adapt(minimum=2, maximum=10)

If you’re working interactively in a Jupyter Notebook or JupyterLab, you can also use the provided graphical interface to change the cluster size, instead of calling YarnCluster.scale() or YarnCluster.adapt() manually.

[Figure: Cluster widget in a Jupyter Notebook]

Normally the cluster will persist until the YarnCluster object is deleted. To be more explicit about when the cluster is shut down, you can either use the cluster as a context manager or manually call YarnCluster.shutdown().

# Use ``YarnCluster`` as a context manager
with YarnCluster(...) as cluster:
    # The cluster will remain active inside this block,
    # and will be shut down when the context exits.
    ...

# Or manually call `shutdown`
cluster = YarnCluster(...)
# ...
cluster.shutdown()