Deploying Ray on a local Kubernetes cluster

In this post, I’ll show you how to run the Ray application covered in the previous post as a Kubernetes deployment on a local Kubernetes cluster set up using kubeadm. To recap, the application executes a Directed Acyclic Graph (DAG) that reads tabular data about wine characteristics and quality from AWS S3, trains a RandomForest model, explains the model’s predictions using different implementations of the KernelSHAP algorithm, and compares the results.

In this post, you’ll learn:

  • Creating a local Kubernetes cluster
  • Creating a Ray cluster consisting of ray-head and ray-worker Kubernetes deployments on this cluster
  • Using the Kubernetes Horizontal Pod Autoscaler (HPA) to automatically increase the number of worker pods when CPU utilization exceeds a utilization percentage threshold:
    • Using the metrics-server application to monitor pod resource usage, overriding the metric-resolution argument of the metrics-server container to 15 sec for faster resource tracking
    • Lowering the HPA downscale period from its default value of 5 min to 15 sec by using a custom kube-controller-manager configuration, which reduces the time it takes to scale down the number of worker pod replicas once the Ray application has stopped running
  • Using Kubernetes Secrets to pass AWS S3 access tokens to the cluster so that an application running on it can assume the IAM role necessary to execute Boto3 calls on the S3 bucket containing our input data

All the code for this post is here.

I will not describe Kubernetes concepts in any great detail below. I will focus on the parts specific to running a Ray application on Kubernetes.

Setting up a local Kubernetes cluster

The first step is to set up a local Kubernetes cluster. Minikube and kubeadm are two popular deployment tools. I decided to use kubeadm for its ease of setup, and the kubeadm setup guide is pretty straightforward to follow. The tricky part is installing the pod network add-on. There are several options: Calico, Cilium and Flannel, to name a few. I tried all three and ran into various arcane problems with Calico and Cilium (which are probably solvable, but stumped a Kubernetes non-expert such as me). I had success with Flannel, so the cluster setup script shown below uses the Flannel pod network add-on.

You should run these commands one by one rather than all at once. Notice that we pass a custom configuration to kubeadm init. This configuration sets pod-network-cidr to the custom IP range required by Flannel and sets the horizontal-pod-autoscaler-downscale-stabilization argument of kube-controller-manager to 15 sec from its default of 5 min. This setting controls the scale-down delay of the Kubernetes Horizontal Pod Autoscaler (HPA): a lower value means worker replicas are shut down more quickly once the target resource utilization falls below the autoscaling threshold. More about this when I discuss the HPA.
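The setup commands can be sketched as follows. This is a sketch, not the exact script from the repo: the Flannel manifest URL reflects the current flannel-io repository, and the config filename is the one referenced below.

```shell
# Initialize the control plane with our custom configuration
# (pod-network-cidr and the HPA downscale window are set in the file)
sudo kubeadm init --config custom-kube-controller-manager-config.yaml

# Make kubectl usable from your non-root user account
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install the Flannel pod network add-on
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
```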

Here’s the custom-kube-controller-manager-config.yaml that is used while initializing kubeadm:
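A minimal version consistent with the settings described above might look like this (the pod subnet is Flannel’s default CIDR; treat the exact structure as a sketch based on the kubeadm v1beta3 config format):

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  # Flannel's default pod network CIDR (equivalent to --pod-network-cidr)
  podSubnet: "10.244.0.0/16"
controllerManager:
  extraArgs:
    # Scale worker replicas down 15s after utilization drops below target
    # (the default is 5m0s)
    horizontal-pod-autoscaler-downscale-stabilization: "15s"
```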

If the kubeadm setup worked correctly, you should see something like the screenshot below when you run this command:
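I can’t reproduce the screenshot here, but a common way to verify the cluster (my assumption for the command in question) is to check node and system pod status:

```shell
# The control-plane node should report Ready, and the kube-system
# and kube-flannel pods should all be Running
kubectl get nodes
kubectl get pods --all-namespaces
```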

Note that if you previously used another pod network add-on, kubeadm sometimes picks up stale pod network state, leading to arcane initialization errors. One way around such errors is to remove any Container Networking Interface (CNI) files belonging to the unused pod network add-on; these files are located in the /etc/cni/net.d/ folder. I’m not sure if this is the right way to get around initialization problems due to unused pod network add-ons, but it worked for me.

Finally, by default your cluster will not schedule Pods on the control-plane node for security reasons. If you want to be able to schedule Pods on the control-plane node (for example, for a single-machine Kubernetes cluster for development), run:
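The untainting command is the one from the kubeadm documentation; note that on older Kubernetes versions the taint key is node-role.kubernetes.io/master instead:

```shell
# Allow regular workloads to be scheduled on the control-plane node
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```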

Set up metrics-server

Metrics Server is a scalable, efficient source of container resource metrics for Kubernetes built-in autoscaling pipelines. Metrics Server collects resource metrics from Kubelets and exposes them in Kubernetes apiserver through Metrics API for use by Horizontal Pod Autoscaler and Vertical Pod Autoscaler. Metrics API can also be accessed by kubectl top, making it easier to debug autoscaling pipelines.

To collect resource metrics more frequently, I set the --metric-resolution argument of the metrics-server container to 15 sec. See the Appendix for the full metrics-server deployment that I used.
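The relevant fragment of the metrics-server container spec looks roughly like this. The surrounding args are from the stock metrics-server manifest; the change for this post is the metric-resolution value, and the kubelet-insecure-tls flag is a common necessity on local clusters (verify against your own manifest):

```yaml
containers:
  - name: metrics-server
    args:
      - --cert-dir=/tmp
      - --secure-port=4443
      - --kubelet-insecure-tls          # kubelets on local clusters often lack proper certs
      - --kubelet-preferred-address-types=InternalIP
      - --metric-resolution=15s         # collect metrics every 15s instead of the default 60s
```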

Ray cluster deployment

Now that we have a local Kubernetes cluster, let’s deploy Ray on it. Here’s the Ray deployment YAML.

This is borrowed from the Deploying on Kubernetes tutorial in Ray’s documentation. As the tutorial says, a Ray cluster consists of a single head node and a set of worker nodes. This is implemented as:

  • A ray-head Kubernetes Service that enables the worker nodes to discover the location of the head node on start up.
  • A ray-head Kubernetes Deployment that backs the ray-head Service with a single head node pod (replica).
  • A ray-worker Kubernetes Deployment with multiple worker node pods (replicas) that connect to the ray-head pod using the ray-head Service.
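The three pieces above can be sketched in one manifest. This is a condensed sketch, not the full configuration: the namespace, image, port and ray start arguments are assumptions based on Ray’s Kubernetes tutorial, and the secret injection and resource limits discussed next are omitted here.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ray-head
  namespace: ray
spec:
  clusterIP: None            # headless Service; workers resolve the head pod via DNS
  selector:
    component: ray-head
  ports:
    - name: redis
      port: 6379
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-head
  namespace: ray
spec:
  replicas: 1                # exactly one head node
  selector:
    matchLabels:
      component: ray-head
  template:
    metadata:
      labels:
        component: ray-head
    spec:
      containers:
        - name: ray-head
          image: rayproject/ray:latest
          command: ["/bin/bash", "-c", "--"]
          args: ["ray start --head --port=6379 --block"]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-worker
  namespace: ray
spec:
  replicas: 1                # scaled up later by the HPA
  selector:
    matchLabels:
      component: ray-worker
  template:
    metadata:
      labels:
        component: ray-worker
    spec:
      containers:
        - name: ray-worker
          image: rayproject/ray:latest
          command: ["/bin/bash", "-c", "--"]
          # workers find the head node through the ray-head Service name
          args: ["ray start --address=ray-head:6379 --block"]
```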

I’ve modified the configuration in Ray’s documentation by adding Kubernetes secret injection and CPU resource limits to both the ray-head and ray-worker deployments (the latter is required for the HPA to work). Let’s look at each in more detail.

Setting up Kubernetes secrets

The input data used in our application is stored in an S3 bucket whose access policy allows Get/Put/DeleteObject operations. This policy is attached to an IAM role, which can be assumed to obtain temporary STS credentials for making Boto3 calls against the bucket’s contents. For more information about this setup, see the “Reading data from S3” section in my previous post. One way to pass these temporary credentials is through environment variables. However, Kubernetes offers a better way to manage such information: a Kubernetes Secret, an object that contains a small amount of sensitive data such as a password or token.

The shell script below shows how to use the aws sts assume-role command to obtain temporary credentials. These credentials are written to a file that can be passed to a docker run command using the --env-file argument; they then become part of the environment, accessible from a Python program via os.environ['ENV_NAME']. The credentials are also used to create a Kubernetes Secret. Notice the use of the generic keyword in the kubectl create secret command: it signifies that we are creating an Opaque secret, rather than one of the predefined Kubernetes Secret types.
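A sketch of such a script follows. The secret name (aws-token), session name, env-file name and the use of jq are my assumptions; the environment variable names match the AWS_ACCESS_KEY_ID_SESS convention used later in this post.

```shell
#!/bin/bash
# Read the IAM role ARN from iamroles.txt
ROLE_ARN=$(cat iamroles.txt)

# Assume the role and capture the temporary STS credentials as JSON
CREDS=$(aws sts assume-role --role-arn "$ROLE_ARN" --role-session-name ray-s3-session)

AWS_ACCESS_KEY_ID_SESS=$(echo "$CREDS" | jq -r '.Credentials.AccessKeyId')
AWS_SECRET_ACCESS_KEY_SESS=$(echo "$CREDS" | jq -r '.Credentials.SecretAccessKey')
AWS_SESSION_TOKEN_SESS=$(echo "$CREDS" | jq -r '.Credentials.SessionToken')

# Write an env file usable with `docker run --env-file`
cat > aws_credentials.env <<EOF
AWS_ACCESS_KEY_ID_SESS=$AWS_ACCESS_KEY_ID_SESS
AWS_SECRET_ACCESS_KEY_SESS=$AWS_SECRET_ACCESS_KEY_SESS
AWS_SESSION_TOKEN_SESS=$AWS_SESSION_TOKEN_SESS
EOF

# Create an opaque Kubernetes Secret holding the same values
kubectl create secret generic aws-token -n ray \
  --from-literal=AWS_ACCESS_KEY_ID_SESS="$AWS_ACCESS_KEY_ID_SESS" \
  --from-literal=AWS_SECRET_ACCESS_KEY_SESS="$AWS_SECRET_ACCESS_KEY_SESS" \
  --from-literal=AWS_SESSION_TOKEN_SESS="$AWS_SESSION_TOKEN_SESS"
```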

To use this script, you must have an iamroles.txt file containing your IAM role ARN.
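On the application side, the injected variables can be read back from the environment and fed to Boto3. This is a minimal sketch under the assumption that the credentials were injected under the *_SESS names used in this post; the helper name is mine.

```python
import os


def sts_credentials_from_env():
    """Collect the injected STS credentials, mapping the *_SESS environment
    variables to the keyword argument names that boto3.client() expects."""
    mapping = {
        "AWS_ACCESS_KEY_ID_SESS": "aws_access_key_id",
        "AWS_SECRET_ACCESS_KEY_SESS": "aws_secret_access_key",
        "AWS_SESSION_TOKEN_SESS": "aws_session_token",
    }
    missing = [env for env in mapping if env not in os.environ]
    if missing:
        raise RuntimeError(f"Missing credentials in environment: {missing}")
    return {kwarg: os.environ[env] for env, kwarg in mapping.items()}


# Usage (boto3 assumed installed):
#   s3 = boto3.client("s3", **sts_credentials_from_env())
```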

To use a Secret, a Pod needs to reference the Secret. A Secret can be used with a Pod in three ways:

  • As files in a volume mounted on one or more of its containers.
  • As container environment variables.
  • By the kubelet when pulling images for the Pod.

Here, we’ll use the first and second method in ray-head and the second method in ray-worker.

In the ray-head deployment, secrets are being injected in two ways:

  1. Using envFrom to define all of the Secret’s data as container environment variables. The key from the Secret becomes the environment variable name in the Pod.
  2. Using volume mounts. The secret data is exposed to Containers in the Pod through a Volume.

When you deploy the cluster, you can exec into the running head pod to verify that the secrets have been injected as expected:

  1. Get the pods running in the ray namespace.
  2. exec into the head pod.
  3. Print the value of AWS_ACCESS_KEY_ID_SESS and verify that it matches the output of the sts assume-role command.
  4. Check the contents of the /etc/awstoken directory. You should see files holding the AWS credentials used to create the Kubernetes Secret; verify that their contents match the output of the sts assume-role command.
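The verification steps above translate to commands like these (the pod name is an illustrative example; substitute the one reported by kubectl get pods):

```shell
# 1. List pods in the ray namespace
kubectl get pods -n ray

# 2. Open a shell inside the head pod (example pod name)
kubectl exec -it ray-head-5486648dc7-x2q7p -n ray -- /bin/bash

# 3. Inside the pod: print one of the injected environment variables
echo $AWS_ACCESS_KEY_ID_SESS

# 4. Inspect the mounted secret volume
ls /etc/awstoken
cat /etc/awstoken/AWS_ACCESS_KEY_ID_SESS
```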

You only need one of these methods to inject secrets. The worker pods only use the environment variable method.

Kubernetes Horizontal Pod Autoscaler

The Horizontal Pod Autoscaler (HPA) automatically scales the number of Pods in a replication controller, deployment or replica set based on observed CPU utilization or a custom application metric. For details about the algorithm used by the autoscaler, see the HPA documentation. In our HPA configuration, we specify the deployment subject to autoscaling (ray-worker), the minimum and maximum number of replicas, and the target CPU utilization percentage. When the observed CPU utilization percentage exceeds the target, autoscaling kicks in, launching new ray-worker pods. These pods automatically connect to the head node, ready to take on work.

In the ray-cluster configuration YAML, notice that I’ve added CPU resource requests and limits to both the ray-head and ray-worker deployments. This is necessary for the HPA to work: new worker replicas are launched when the pod CPU utilization percentage exceeds a threshold, and computing a utilization percentage requires the pods to declare their CPU resources. The requests.cpu field is also used to set the value of the num-cpus argument passed to the ray-head and ray-worker deployments through the MY_CPU_REQUEST variable.

You can verify the values of the environment variables using kubectl describe pods <pod-name>.

Note that requests.cpu is specified in millicpu (m) units.
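The relevant fragment of the worker container spec might look like this (the CPU values are illustrative; the MY_CPU_REQUEST variable is populated from requests.cpu via the Downward API):

```yaml
containers:
  - name: ray-worker
    image: rayproject/ray:latest
    resources:
      requests:
        cpu: 500m            # half a CPU, in millicpu
      limits:
        cpu: 1000m           # one full CPU
    env:
      - name: MY_CPU_REQUEST
        valueFrom:
          resourceFieldRef:  # expose requests.cpu to the container
            resource: requests.cpu
```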

Let’s now create an HPA that automatically launches additional worker nodes when the worker pods’ CPU resource utilization percentage exceeds a threshold. Then we’ll run the Python application and check the number of worker replicas.

Here’s the HPA configuration YAML (called ray-autoscaler.yaml):
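A minimal version of that file might look like this. The maximum of 3 replicas matches the scale-up behavior described later; the target utilization of 50% is an illustrative assumption.

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: ray-autoscaler
  namespace: ray
spec:
  scaleTargetRef:            # the deployment subject to autoscaling
    apiVersion: apps/v1
    kind: Deployment
    name: ray-worker
  minReplicas: 1
  maxReplicas: 3
  targetCPUUtilizationPercentage: 50
```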

To create the HPA, use the following commands:
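Something like the following (the filename comes from the text above; the namespace matches the rest of the deployment):

```shell
kubectl apply -f ray-autoscaler.yaml -n ray
```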

Note that the HPA is created in the same namespace as the rest of the Ray deployment.

To verify that the HPA has been successfully created, you can use:
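For example:

```shell
# Shows the HPA's target deployment, current vs. target utilization,
# and the current replica count
kubectl get hpa -n ray
```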

Having created the HPA, we can now run our Python application:

Let’s verify that our pods are running (using kubectl get pods -n ray):

After a few seconds, the CPU utilization has reached limits.cpu.

At this point, the HPA launches two additional worker replicas, increasing the number of worker replicas to 3 (the maximum allowed in the HPA settings).

Once the Shapley value calculations have finished, check the number of pods again. You’ll see the number of worker replicas scale down to 1 after a few seconds. Recall that we set the horizontal-pod-autoscaler-downscale-stabilization argument of kube-controller-manager to 15 sec while initializing kubeadm; doing so lowers the downscale delay so that idle worker replicas are terminated faster.


In this post you learnt how to use kubeadm to set up a local Kubernetes cluster, deploy a Ray cluster on it, pass AWS STS credentials using a Kubernetes Secret, and launch new worker replicas using the Kubernetes HPA. The script that executes the steps shown above, starting from setting up metrics-server, is here:



Appendix

Here’s the metrics-server deployment I used. You can cut and paste this into a YAML file and then deploy it using kubectl apply -f file_name.yaml

