Mastering Kubernetes Autoscaling: Horizontal vs Vertical Scaling

Ensuring optimal resource utilization is crucial for cloud applications built on Kubernetes, yet fluctuating workloads can quickly turn manual scaling into a tedious and inefficient task. This is where Kubernetes autoscaling comes in, offering a dynamic approach to resource management. This tutorial serves as a guide to mastering it. We’ll explore the two main techniques of Kubernetes scaling, horizontal scaling and vertical scaling, along with the tools that implement them: the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA). By understanding both, you’ll be equipped to achieve peak performance and efficient resource management for your containerized applications. Let’s get started.

Kubernetes Autoscaling: HPA vs VPA

What is Autoscaling?

Autoscaling is a technique used to automatically adjust the resources allocated to an application based on predefined metrics. A Deployment, StatefulSet, or any other workload resource can be dynamically scaled up or down to meet fluctuating demand, ensuring efficient resource utilization and optimal application performance.

Benefits of Autoscaling

Autoscaling is needed in cloud-native environments like Kubernetes for several reasons:

  1. Dynamic Workload Demands: Applications often experience fluctuating traffic patterns throughout the day or across different seasons. Autoscaling allows resources to be added or removed automatically based on these variations, ensuring that the application can handle peak loads without being over-provisioned during quieter periods.
  2. Optimal Resource Utilization: Without autoscaling, resources may be provisioned based on peak loads, leading to underutilization during off-peak times and increased costs. Autoscaling adjusts resources dynamically, optimizing resource utilization and reducing unnecessary expenses.
  3. Improved Performance and Availability: By scaling resources in response to workload changes, autoscaling helps maintain consistent performance levels and availability. It ensures that applications remain responsive even under heavy traffic conditions, enhancing user experience and minimizing downtime.
  4. Cost Efficiency: Autoscaling helps control infrastructure costs by scaling resources up only when needed and scaling down during periods of lower demand. This elasticity allows organizations to pay for resources based on actual usage rather than maintaining fixed-capacity infrastructure.
  5. Operational Efficiency: Manual scaling processes can be time-consuming and prone to human error. Autoscaling automates the scaling process, reducing the burden on operations teams and enabling faster response to workload changes.

Types of Autoscaling in Kubernetes

There are different types of workload autoscaling in Kubernetes:

  • Horizontal workload autoscaling (implemented by the Horizontal Pod Autoscaler, HPA)
  • Vertical workload autoscaling (implemented by the Vertical Pod Autoscaler, VPA)
  • Cluster-size based autoscaling (e.g., the Cluster Autoscaler or Karpenter)
  • Event-driven autoscaling (e.g., KEDA)
  • Schedule-based autoscaling

In this guide, we will be focusing on Horizontal and Vertical workload autoscaling.

Horizontal Scaling

In Kubernetes, horizontal scaling refers to the capability of increasing or decreasing the number of replicas (instances) of a particular workload, based on metrics such as CPU utilization, memory usage, or any custom metric, to meet demand. This is achieved using the Horizontal Pod Autoscaler (HPA), a Kubernetes API resource/controller that automatically adjusts the number of Pods in a replication controller, Deployment, ReplicaSet, or StatefulSet.

How does horizontal scaling work in Kubernetes?

First of all, you need to enable horizontal scaling by defining an HPA resource in Kubernetes, specifying the target resource (Deployment, ReplicaSet, StatefulSet, etc.) and the metrics against which scaling decisions should be made (e.g., CPU utilization, memory usage).

When specified metrics exceed the defined thresholds, Kubernetes increases the number of replicas (Pods) to distribute the workload across more instances, ensuring optimal performance and responsiveness. Conversely, when workload decreases and metrics fall below a specified threshold, Kubernetes scales down the number of replicas to conserve resources and reduce costs.
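Under the hood, the HPA control loop (which runs every 15 seconds by default) compares the observed metric against the target and computes the desired replica count using the formula from the Kubernetes documentation:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

For example, if 2 replicas are averaging 160% CPU utilization against an 80% target, the HPA computes ceil(2 * 160/80) = 4 replicas, capped by the configured maximum.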

The HPA API resource and its controller are available out of the box in a Kubernetes cluster.

Vertical Scaling

In Kubernetes, vertical scaling refers to the practice of dynamically adjusting the CPU and memory resource requests and limits of already running Pods within a Deployment, StatefulSet, or ReplicaSet to ensure that they have sufficient resources to meet their workload requirements effectively.

Vertical scaling is achieved using the Vertical Pod Autoscaler (VPA). Kubernetes does not ship with the VPA API resource/controller by default, so you need to install it as a custom resource if you want to use it.

So, how does vertical scaling work in Kubernetes?

Unlike the HPA, which adds or removes Pods to meet demand, the VPA focuses on analyzing individual Pod resource utilization (CPU, memory) and recommends optimal requests and limits. These requests and limits define the minimum guaranteed and maximum allowed resources for a Pod. The VPA can also update them automatically, ensuring Pods have the resources they need to function effectively without exceeding limits.
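To give you a feel for what this looks like, below is a minimal VPA manifest sketch targeting the nginx-app deployment used later in this guide. It assumes the VPA CRDs and controllers are already installed in the cluster:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-app
  namespace: apps
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app
  updatePolicy:
    updateMode: "Auto"  # use "Off" to only generate recommendations without applying them

In "Auto" mode, the VPA updater evicts Pods so they can be recreated with the recommended requests; in "Off" mode, you can inspect the recommendations with kubectl describe vpa without any disruption.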

Prerequisites for Enabling Autoscaling in Kubernetes

Enabling autoscaling in Kubernetes typically requires several prerequisites to ensure effective operation:

  1. Metrics Server: A functional Kubernetes Metrics Server is essential for gathering resource utilization metrics such as CPU and memory usage from cluster nodes and Pods. This server provides the data necessary for autoscaling controllers to make scaling decisions (a quick verification check is shown after this list).
    Follow the link below on how to install and configure Metrics API Server in Kubernetes.
    Install Kubernetes Metrics Server on a Kubernetes Cluster
  2. Resource Metrics: Ensure that Pods or applications within the cluster are configured to expose relevant resource metrics. This includes defining metrics endpoints or utilizing metrics providers compatible with Kubernetes (e.g., Prometheus).
  3. Autoscaler Installation: Depending on the type of autoscaling (horizontal or vertical), install the appropriate autoscaler components:
    • Horizontal Pod Autoscaler (HPA): Typically included by default in Kubernetes clusters, but verify and ensure it is enabled.
    • Vertical Pod Autoscaler (VPA): Install VPA components as Custom Resource Definitions (CRDs) if not already included in the Kubernetes distribution.
  4. A running Kubernetes cluster: Of course, you must have a running Kubernetes cluster, e.g., Minikube, GKE, EKS, etc.
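As a quick sanity check for the first prerequisite, you can confirm that the Metrics API is registered and that the Metrics Server is serving data:

kubectl get apiservices v1beta1.metrics.k8s.io
kubectl top nodes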

Creating Horizontal Pod Autoscalers (HPA) in Kubernetes Cluster

To horizontally scale your resources (Deployments, StatefulSets, ReplicaSets, etc.), you need to create a Horizontal Pod Autoscaler (HPA).

Install Kubernetes Metrics Server

Before you can proceed, ensure Metrics Server has been installed as outlined here.

Just to confirm that the Metrics Server is up and running, let’s check the resource usage of the Pods in the default namespace.

kubectl top pod

Sample output;

NAME                                  CPU(cores)   MEMORY(bytes)   
example-deployment-77d66d9f6f-h9nwn   0m           2Mi             
example-deployment-77d66d9f6f-hhq4t   0m           2Mi             
example-deployment-77d66d9f6f-lnpm8   0m           2Mi             
mysql-0                               9m           350Mi           
mysql-1                               9m           350Mi

If the Metrics Server is not installed yet, you will instead get an error such as: error: Metrics API not available.

Deploy an Application

If you already have an application in place that you want to autoscale, then you are good to go.
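If you don’t, here is a minimal sketch of a Deployment you could use to follow along. This is a hypothetical manifest that mirrors the nginx-app deployment used in the rest of this guide (it assumes the apps namespace already exists, e.g., via kubectl create namespace apps):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
  namespace: apps
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-app
  template:
    metadata:
      labels:
        app: nginx-app
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi

Save it as deployment.yaml and apply it with kubectl apply -f deployment.yaml.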

We already have a sample Nginx app running under a namespace called apps.

kubectl get deployment -n apps
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
nginx-app   3/3     3            3           13d
operator    1/1     1            1           11d

This is the deployment that we will try to scale horizontally. It currently has three replicas. So, first of all, let’s scale it down to a single Pod.

kubectl scale deployment nginx-app --replicas=1 -n apps

Confirm;

kubectl get deployment -n apps
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
nginx-app   1/1     1            1           13d
operator    1/1     1            1           11d

Define Resource Requests and Limits in Pod Specifications

When deploying your application, you need to specify CPU and memory requests and limits in the Pod’s container specifications. This helps the Kubernetes scheduler make placement decisions based on available resources and ensures fair resource allocation among Pods. It also helps the autoscalers improve resource usage in the cluster.

Requests specify the guaranteed minimum resources (CPU and memory) for a container, while limits define the maximum resources that the container can consume.

To check whether resource requests and limits are set for a deployment, you can use the kubectl get command. For example, to check the resource definition for the nginx-app deployment in the apps namespace:

kubectl get deployment nginx-app -n apps -o yaml

Under the deployment container template specifications, you should see the resource requests and limits for the Pods.

  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx-app
    spec:
      containers:
      - image: nginx:latest
        imagePullPolicy: Always
        name: nginx
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/nginx/html
          name: html-volume

If you don’t see them in your Deployment, then the resource requests and limits are not defined.

You can edit the deployment and set them;

kubectl edit deployment nginx-app -n apps

And add your resource requests and limits as shown above.

You can also use the kubectl set command to set the CPU and memory requests and limits.

To set the requests;

kubectl set resources deployment nginx-app -n apps --requests=cpu=100m
kubectl set resources deployment nginx-app -n apps --requests=memory=256Mi

Limits;

kubectl set resources deployment nginx-app -n apps --limits=cpu=500m
kubectl set resources deployment nginx-app -n apps --limits=memory=512Mi

Or you can combine the resources;

kubectl set resources deployment nginx-app -n apps --requests=cpu=100m,memory=256Mi
kubectl set resources deployment nginx-app -n apps --limits=cpu=500m,memory=512Mi
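Note that each kubectl set resources invocation changes the Pod template and therefore triggers a new rollout of the Deployment. To avoid several back-to-back rollouts, you can set both requests and limits in a single command:

kubectl set resources deployment nginx-app -n apps --requests=cpu=100m,memory=256Mi --limits=cpu=500m,memory=512Mi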

Without the resource requests defined, your HPA may show <unknown> for the target metric being checked.

kubectl get hpa -n apps
NAME        REFERENCE              TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
nginx-app   Deployment/nginx-app   cpu: <unknown>/80%   1         3         1          5m29s

Similarly, you may see that the HPA was unable to compute the replica count (failed to get cpu utilization: missing request for cpu in container) when you describe the HPA;

kubectl describe hpa nginx-app -n apps
Name:                                                  nginx-app
Namespace:                                             apps
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Sun, 30 Jun 2024 13:17:00 +0000
Reference:                                             Deployment/nginx-app
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 80%
Min replicas:                                          1
Max replicas:                                          3
Deployment pods:                                       1 current / 0 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: failed to get cpu utilization: missing request for cpu in container nginx of Pod nginx-app-6ff7b5d8f6-k5k48
Events:
  Type     Reason                        Age                    From                       Message
  ----     ------                        ----                   ----                       -------
  Warning  FailedComputeMetricsReplicas  4m3s (x12 over 6m48s)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value: failed to get cpu utilization: missing request for cpu in container nginx of Pod nginx-app-6ff7b5d8f6-k5k48
  Warning  FailedGetResourceMetric       108s (x21 over 6m48s)  horizontal-pod-autoscaler  failed to get cpu utilization: missing request for cpu in container nginx of Pod nginx-app-6ff7b5d8f6-k5k48

Create HorizontalPodAutoscaler (HPA) Resource

You can create an HPA resource declaratively using a YAML manifest file or imperatively via the kubectl autoscale command.

Note
Each HPA is associated with a specific workload controller (Deployment, ReplicaSet, or StatefulSet), and there isn’t a direct way to create a generic HPA that scales multiple deployments simultaneously.

Let’s see how to create an HPA imperatively via the kubectl autoscale command.

kubectl autoscale --help

Sample command to create an HPA that autoscales the nginx-app deployment in the apps namespace;

kubectl autoscale deployment nginx-app --cpu-percent=80 --min=1 --max=3 -n apps
  • --cpu-percent=80: This flag sets the target CPU utilization for the deployment. The HPA will aim to maintain the average CPU usage across all pods in the deployment at or below 80%.
  • --min=1: This flag sets the minimum number of replicas allowed for the deployment. Even if the HPA determines scaling down is necessary based on CPU usage, it won’t go below 1 replica.
  • --max=3: This flag sets the maximum number of replicas allowed for the deployment. The HPA won’t scale the deployment beyond 3 replicas regardless of how high the CPU usage climbs.

In essence:

  • Scaling Up: When the average CPU usage across all Pods in the deployment exceeds the 80% target, the HPA triggers a scale-up event. It adds replicas to the deployment to distribute the workload and bring the average CPU utilization back down towards the target of 80%.
  • Scaling Down: While the --cpu-percent flag doesn’t define a specific threshold for scaling down, the HPA does have built-in logic for scaling down deployments. This logic considers several factors, including:
    • Target CPU Utilization: The HPA aims to maintain the average CPU usage around the target of 80%. If the average CPU utilization remains consistently below 80% for a cooldown period, the HPA might consider scaling down.
    • Minimum Replicas: The HPA won’t scale the deployment below the minimum number of replicas specified by the --min=1 flag in this case.
    • HPA Cooldown: Even if the CPU usage falls below 80%, the HPA won’t immediately scale down. It waits for a stabilization window (five minutes by default) to ensure the decrease in resource usage is not a temporary fluctuation.

To create an HPA the declarative way, create a manifest file;

cat hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-app
  namespace: apps
spec:
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 80
        type: Utilization
    type: Resource
  minReplicas: 1 
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app

Then apply;

kubectl apply -f hpa.yaml
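The autoscaling/v2 API also supports an optional spec.behavior section for tuning how aggressively the HPA scales. As a sketch, the snippet below would make scale-down more conservative by keeping the default five-minute stabilization window explicit and removing at most one Pod per minute:

spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60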

List available HPAs;

Use the command below to list available HPAs. You can omit -n <namespace> to use the default namespace.

kubectl get hpa -n apps
NAME        REFERENCE              TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
nginx-app   Deployment/nginx-app   cpu: 0%/80%   1         3         1          27s

Get Details of an HPA:

kubectl describe hpa <name> -n <namespace>

Simulating Events to Trigger Horizontal Scaling

Now, let’s see how we can stress our app to test the horizontal scaling.

We will use ApacheBench (ab) to perform load testing on our web app.

while true; do sleep 0.01; ab -n 500000 -c 1000 http://192.168.122.62:30833/; done

The command repeatedly runs ab, each iteration sending 500,000 HTTP requests to http://192.168.122.62:30833/ with a concurrency level of 1000 requests at a time, pausing for 0.01 seconds between iterations.
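On Debian/Ubuntu systems, ab is provided by the apache2-utils package (sudo apt install apache2-utils). The target URL is the node IP and NodePort of the Service exposing the nginx-app deployment; a hypothetical Service manifest for such a setup might look like this (the nodePort value matches the URL above):

apiVersion: v1
kind: Service
metadata:
  name: nginx-app
  namespace: apps
spec:
  type: NodePort
  selector:
    app: nginx-app
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30833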

Before you execute the command, run the command below on the control plane node to watch how HPA responds to the load on the web app;

kubectl get hpa -n apps -w

Before we executed the load testing command, this is the status of the HPA;

NAME        REFERENCE              TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
nginx-app   Deployment/nginx-app   cpu: 0%/80%   1         3         1          27m

Now, execute the load testing command and see what happens.

Sample output of the load testing command;

This is ApacheBench, Version 2.3 <$Revision: 1903618 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.122.62 (be patient)
Completed 50000 requests
Completed 100000 requests
Completed 150000 requests
Completed 200000 requests
Completed 250000 requests
Completed 300000 requests
Completed 350000 requests
Completed 400000 requests
Completed 450000 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 499996 requests completed

...(the same output repeats for every iteration of the while loop)...

HPA response to load;

NAME        REFERENCE              TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
nginx-app   Deployment/nginx-app   cpu: 0%/80%   1         3         1          27m
nginx-app   Deployment/nginx-app   cpu: 124%/80%   1         3         1          44m
nginx-app   Deployment/nginx-app   cpu: 64%/80%    1         3         2          44m
nginx-app   Deployment/nginx-app   cpu: 135%/80%   1         3         2          44m
nginx-app   Deployment/nginx-app   cpu: 179%/80%   1         3         3          44m
nginx-app   Deployment/nginx-app   cpu: 160%/80%   1         3         3          45m
nginx-app   Deployment/nginx-app   cpu: 159%/80%   1         3         3          45m
nginx-app   Deployment/nginx-app   cpu: 157%/80%   1         3         3          45m
nginx-app   Deployment/nginx-app   cpu: 30%/80%    1         3         3          45m
nginx-app   Deployment/nginx-app   cpu: 0%/80%     1         3         3          46m
nginx-app   Deployment/nginx-app   cpu: 70%/80%    1         3         3          46m
nginx-app   Deployment/nginx-app   cpu: 156%/80%   1         3         3          46m
nginx-app   Deployment/nginx-app   cpu: 154%/80%   1         3         3          46m
nginx-app   Deployment/nginx-app   cpu: 153%/80%   1         3         3          47m
nginx-app   Deployment/nginx-app   cpu: 118%/80%   1         3         3          47m
nginx-app   Deployment/nginx-app   cpu: 31%/80%    1         3         3          47m
nginx-app   Deployment/nginx-app   cpu: 14%/80%    1         3         3          47m
nginx-app   Deployment/nginx-app   cpu: 98%/80%    1         3         3          48m
nginx-app   Deployment/nginx-app   cpu: 158%/80%   1         3         3          48m
nginx-app   Deployment/nginx-app   cpu: 160%/80%   1         3         3          48m
nginx-app   Deployment/nginx-app   cpu: 156%/80%   1         3         3          48m
nginx-app   Deployment/nginx-app   cpu: 103%/80%   1         3         3          49m
nginx-app   Deployment/nginx-app   cpu: 0%/80%     1         3         3          49m
nginx-app   Deployment/nginx-app   cpu: 96%/80%    1         3         3          49m
nginx-app   Deployment/nginx-app   cpu: 151%/80%   1         3         3          49m
nginx-app   Deployment/nginx-app   cpu: 100%/80%   1         3         3          50m
nginx-app   Deployment/nginx-app   cpu: 1%/80%     1         3         3          50m
nginx-app   Deployment/nginx-app   cpu: 0%/80%     1         3         3          50m
nginx-app   Deployment/nginx-app   cpu: 0%/80%     1         3         3          55m
nginx-app   Deployment/nginx-app   cpu: 0%/80%     1         3         1          55m

In summary, the Horizontal Pod Autoscaler (HPA) for the nginx-app deployment responded dynamically to varying levels of CPU utilization over time. Initially, with low CPU usage, it maintained 1 replica. As CPU demand increased, the HPA gradually scaled up, reaching the maximum of 3 replicas when CPU usage spiked to 179%. Throughout this process, the HPA adjusted the replica count against the configured target of 80% CPU utilization, demonstrating its ability to automatically scale the deployment to handle increased workload and then scale down as demand decreased, ensuring efficient resource utilization.

After stopping the ab load tester, the last entry shows a reduction to 1 replica as the CPU load subsided. Notice that the scale-down happened roughly five minutes after utilization dropped to 0% (the 50m to 55m entries), consistent with the HPA’s default five-minute downscale stabilization window.
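Once you are done experimenting, you can remove the HPA and restore the original replica count with:

kubectl delete hpa nginx-app -n apps
kubectl scale deployment nginx-app --replicas=3 -n apps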

Using VPA to Autoscale Kubernetes Containers

In our next guide, we will learn how to install and use the VPA to control resource usage in a Kubernetes cluster.

Conclusion

In this blog post, you have learnt about the major types of autoscaling in Kubernetes, horizontal and vertical scaling, along with the API resources that implement them, HPA and VPA. As you can see from the simulation above, Kubernetes provides robust autoscaling capabilities through the Horizontal Pod Autoscaler (HPA), which scales the number of Pod replicas based on metrics like CPU utilization or custom metrics, ensuring applications can handle varying workloads efficiently.

