Deploying on Google Cloud Dataproc

Dataproc is Google Cloud’s hosted service for creating Apache Hadoop and Apache Spark clusters. Dataproc supports a series of open-source initialization actions that allows installation of a wide range of open source tools when creating a cluster. In particular, the following instructions will guide you through creating a Dataproc cluster with Dask and Dask-Yarn installed and configured for you. This tutorial is loosely adapted from the README for the Dask initialization action.

What the Initialization Action is Doing

The initialization action installation script does several things:

  • It accepts a metadata parameter for configuring your cluster to use Dask with either its standalone scheduler or with Dask-Yarn to utilize Yarn.
  • For the yarn configuration, this script installs dask and dask-yarn on all machines and adds a baseline Skein config file. This file tells each machine where to locate the Dask-Yarn environment, as well as how many workers to use by default: 2. This way, you can get started with dask-yarn by simply creating a YarnCluster object without providing any parameters. Dask relies on using Yarn to schedule its tasks.
  • For the standalone configuration, this script installs dask and configures the cluster to use the Dask scheduler for managing Dask workloads.
  • The Dataproc service itself provides support for web UIs such as Jupyter and the Dask web UIs. This will be explained in more detail below.

Configuring your Dataproc Cluster

There are several ways to create a Dataproc cluster. This tutorial will focus on using the gcloud SDK to do so.

First, you’ll need to create a GCP Project. Please follow the instructions here to do so.

Decide on a name for your Dataproc cluster. Then, pick a geographic region to place your cluster in, ideally one close to you.

The following command will create a cluster for the dask-yarn configuration.

gcloud dataproc clusters create ${CLUSTER_NAME} \
  --region ${REGION} \
  --master-machine-type n1-standard-16 \
  --worker-machine-type n1-standard-16 \
  --image-version preview \
  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/dask/ \
  --metadata dask-runtime=yarn \
  --optional-components JUPYTER \

To break down this command:

  • gcloud dataproc clusters create ${CLUSTER_NAME} uses the gcloud sdk to to create a Dataproc cluster.
  • --region ${REGION} specifies the cluster region.
  • --master-machine-type and worker-machine-type allow configuration of CPUs and RAM via different types of machines.
  • image-version preview specifies the Dataproc image version. You’ll use the latest preview image of Dataproc for the most up-to-date features.
  • --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/dask/ specifies the initialization actions to install on the cluster. You can add as many as you’d like via a comma-separated list.
  • --metadata dask-runtime=yarn specifies to configure your cluster with Dask configured for use with yarn.
  • --optional-components JUPYTER configures the cluster with the Jupyter optional component to access Jupyter notebooks running on the cluster. Like initialization actions, you can add as many optional components as you’d like. These differ from initialization actions in that they come with first-class support from the Dataproc service, but there are less options available.
  • --enable-component-gateway allows you to bypass needing an SSH tunnel for a certain predetermined list of web UIs on your cluster, such as Jupyter and the Yarn ApplicationMaster, by connecting directly through the Dataproc web console.

Connecting to your cluster

You can access your cluster several different ways. If you configured your cluster with a notebook service such as Jupyter or Zeppelin and enable component gateway (explained above), you can access these by navigating to your clusters page, clicking on the name of your cluster and clicking on the Web Interfaces tab to access your web UIs.

You can also ssh into your cluster. You can do this via the Dataproc web console: from the clusters page, click on your cluster name, then VM Instances and click SSH next to the master node.

Additionally, you can also use the gcloud sdk to SSH onto your cluster. First, locate the zone that your cluster is in. This will be the region you specified earlier but with a letter attached to it, such as us-central1-b. To locate your cluster’s zone, you can find this on the clusters page next to your cluster. This was determined via Dataproc’s Auto Zone feature, but you can choose any zone to place your cluster by adding the --zone flag when creating a new cluster.

gcloud compute ssh ${CLUSTER_NAME}-m --zone ${ZONE}

Once connected, either via a Jupyter notebook or via ssh, try running some code. If your cluster is configured with Dask-Yarn:

from dask_yarn import YarnCluster
from dask.distributed import Client
import dask.array as da

import numpy as np

cluster = YarnCluster()
client = Client(cluster)

cluster.adapt() # Dynamically scale Dask resources

x = da.sum(np.ones(5))

If your cluster is configured with the standalone scheduler:

from dask.distributed import Client
import dask.array as da

import numpy as np

client = Client("localhost:8786")

x = da.sum(np.ones(5))

Monitoring Dask Jobs

You can monitor your Dask applications using Web UIs, depending on the runtime you are using.

For yarn mode, you can access the Skein Web UI via the YARN ResourceManager. To access the YARN ResourceManager, create your cluster with component gateway enabled or create an SSH tunnel. You can then access the Skein Web UI by following these instructions.

For standalone mode, you can access the native Dask UI. Create an SSH tunnel to access the Dask UI on port 8787.

Deleting your Dataproc Cluster

You can delete your cluster when you are done with it by running the following command:

gcloud dataproc clusters delete ${CLUSTER_NAME} --region ${REGION}

Further Information

Please refer to the Dataproc documentation for more information on using Dataproc.