Setting up autoscaling for your worker pools
Update the cluster autoscaler ConfigMap to enable automatic scaling of worker nodes in your worker pools, based on the minimum and maximum values that you set.
After you edit the ConfigMap to enable autoscaling on a worker pool, the cluster autoscaler scales your cluster in response to your workload requests. While autoscaling is enabled, you can't manually resize or rebalance your worker pools. Scanning and scaling up and down happen at regular intervals and, depending on the number of worker nodes, might take a longer period of time to complete, such as 30 minutes. Later, if you want to remove the cluster autoscaler, you must first disable each worker pool in the ConfigMap.
Beginning in version 1.2.4, the `maxEmptyBulkDelete` option is no longer supported. Remove this option from your ConfigMap by running `kubectl edit configmap iks-ca-configmap -n kube-system` and deleting the option.
As a replacement, you can use the `maxScaleDownParallelism` option, which was added in version 1.2.4. For more information, see the ConfigMap reference.
Before you begin:

1. Edit the cluster autoscaler ConfigMap YAML file.

   ```sh
   oc edit cm iks-ca-configmap -n kube-system -o yaml
   ```

   Example output:

   ```yaml
   apiVersion: v1
   data:
     workerPoolsConfig.json: |
       [
         {"name": "<worker_pool>","minSize": 1,"maxSize": 2,"enabled":false}
       ]
   kind: ConfigMap
   ```
2. Edit the ConfigMap with the parameters that define how the cluster autoscaler scales your cluster worker pool. Note: Unless you disabled all public application load balancers (ALBs) in each zone of your standard cluster, you must change the `minSize` to `2` per zone so that the ALB pods can be spread for high availability.

   - `"name": "default"`: Replace `"default"` with the name or ID of the worker pool that you want to scale. To list worker pools, run `ibmcloud oc worker-pool ls --cluster <cluster_name_or_ID>`. To manage more than one worker pool, copy the JSON line to a comma-separated line, as follows.

     ```json
     [
       {"name": "default","minSize": 1,"maxSize": 2,"enabled":false},
       {"name": "Pool2","minSize": 2,"maxSize": 5,"enabled":true}
     ]
     ```

     The cluster autoscaler can scale only worker pools that have the `ibm-cloud.kubernetes.io/worker-pool-id` label. To check whether your worker pool has the required label, run `ibmcloud oc worker-pool get --cluster <cluster_name_or_ID> --worker-pool <worker_pool_name_or_ID> | grep Labels`. If your worker pool does not have the required label, add a new worker pool and use that worker pool with the cluster autoscaler.
   - `"minSize": 1`: Specify the minimum number of worker nodes per zone. Setting a `minSize` does not automatically trigger a scale-up. The `minSize` is a threshold so that the cluster autoscaler does not scale to fewer than a certain number of worker nodes per zone. If your cluster does not yet have that number per zone, the cluster autoscaler does not scale up until you have workload resource requests that require more resources. For example, if you have a worker pool with one worker node in each of three zones (three worker nodes total) and set the `minSize` to `4` per zone, the cluster autoscaler does not immediately provision three more worker nodes per zone (12 worker nodes total). Instead, the scale-up is triggered by resource requests. If you create a workload that requests the resources of 15 worker nodes, the cluster autoscaler scales up the worker pool to meet this request. Now the `minSize` means that the cluster autoscaler does not scale down to fewer than four worker nodes per zone, even if you remove the workload that requested that amount. For more information, see the Kubernetes docs.
   - `"maxSize": 2`: Specify the maximum number of worker nodes per zone that the cluster autoscaler can scale up the worker pool to. The value must be equal to or greater than the value that you set for `minSize`.
   - `"enabled": false`: Set the value to `true` for the cluster autoscaler to manage scaling for the worker pool. Set the value to `false` to stop the cluster autoscaler from scaling the worker pool. Later, if you want to remove the cluster autoscaler, you must first disable each worker pool in the ConfigMap.
3. Save the configuration file.
4. Get your cluster autoscaler pod.

   ```sh
   oc get pods -n kube-system
   ```

5. Review the `Events` section of the cluster autoscaler pod for a `ConfigUpdated` event to verify that the ConfigMap is successfully updated. The event message for your ConfigMap is in the following format: `minSize:maxSize:PoolName:<SUCCESS|FAILED>:error message`.

   ```sh
   oc describe pod -n kube-system <cluster_autoscaler_pod>
   ```

   Example output:

   ```
   Name:         ibm-iks-cluster-autoscaler-857c4d9d54-gwvc6
   Namespace:    kube-system
   ...
   Events:
     Type    Reason         Age  From                                         Message
     ----    ------         ---  ----                                         -------
     Normal  ConfigUpdated  3m   ibm-iks-cluster-autoscaler-857c4d9d54-gwvc6  {"1:3:default":"SUCCESS:"}
   ```

If you enable a worker pool for autoscaling and then later add a zone to this worker pool, restart the cluster autoscaler pod so that it picks up this change: `oc delete pod -n kube-system <cluster_autoscaler_pod>`.
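Before you apply the ConfigMap, you can sanity-check the `workerPoolsConfig.json` value against the rules described above. The following Python sketch is illustrative only (the `validate_worker_pools` helper is not part of any IBM tooling):

```python
import json

def validate_worker_pools(config_json):
    """Check a workerPoolsConfig.json value against the rules above.

    Returns a list of problem descriptions; an empty list means the
    configuration looks valid. Illustrative helper, not IBM tooling.
    """
    problems = []
    for pool in json.loads(config_json):
        name = pool.get("name", "<missing name>")
        min_size = pool.get("minSize")
        max_size = pool.get("maxSize")
        if not isinstance(min_size, int) or min_size < 0:
            problems.append(f"{name}: minSize must be a non-negative integer")
        if not isinstance(max_size, int):
            problems.append(f"{name}: maxSize must be an integer")
        elif isinstance(min_size, int) and max_size < min_size:
            problems.append(f"{name}: maxSize must be equal to or greater than minSize")
        if not isinstance(pool.get("enabled"), bool):
            problems.append(f"{name}: enabled must be true or false")
    return problems
```

For example, a pool with `"minSize": 3, "maxSize": 2` is reported as invalid, because `maxSize` must be equal to or greater than `minSize`.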
Customizing the cluster autoscaler configuration values
Customize the cluster autoscaler settings such as the amount of time it waits before scaling worker nodes up or down.
Note: When you modify a ConfigMap parameter other than the worker pool `minSize` or `maxSize`, or when you enable or disable a worker pool, the cluster autoscaler pods are restarted.

1. Review the cluster autoscaler ConfigMap parameters.
2. Download the cluster autoscaler add-on ConfigMap and review the parameters.

   ```sh
   oc get cm iks-ca-configmap -n kube-system -o yaml > configmap.yaml
   ```

3. Open the `configmap.yaml` file and update the settings that you want to change.
4. Reapply the cluster autoscaler add-on ConfigMap.

   ```sh
   oc apply -f configmap.yaml
   ```

5. Verify that the pods are restarted successfully.

   ```sh
   oc get pods -n kube-system | grep autoscaler
   ```
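As an illustration, an edited `configmap.yaml` might look like the following sketch. The parameter names come from the reference below, but the values shown (`2m` scan interval, `15m` unneeded time) are example choices, not recommendations:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: iks-ca-configmap
  namespace: kube-system
data:
  # Scan for workload changes every 2 minutes instead of the default 1m.
  scanInterval: "2m"
  # Wait 15 minutes (default 10m) before a node is considered unneeded.
  scaleDownUnneededTime: "15m"
  workerPoolsConfig.json: |
    [
      {"name": "default","minSize": 2,"maxSize": 5,"enabled":true}
    ]
```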
Cluster autoscaler configmap reference
- `expander`: How the cluster autoscaler determines which worker pool to scale if you have multiple worker pools. The default value is `random`.
  - `random`: Selects randomly between `most-pods` and `least-waste`.
  - `most-pods`: Selects the worker pool that is able to schedule the most pods when scaling up. Use this method if you are using `nodeSelector` to make sure that pods land on specific worker nodes.
  - `least-waste`: Selects the worker pool that has the least unused CPU after scaling up. If two worker pools use the same amount of CPU resources after scaling up, the worker pool with the least unused memory is selected.
- `prometheusScrape`: Set to `true` to send Prometheus metrics. To stop sending metrics, set to `false`.
- `ignoreDaemonSetsUtilization`: Ignores DaemonSet pods when calculating resource utilization for scale-down. The default value is `false`.
- `imagePullPolicy`: When to pull the Docker image. The default value is `Always`.
  - `Always`: Pulls the image every time that the pod is started.
  - `IfNotPresent`: Pulls the image only if the image isn't already present locally.
  - `Never`: Assumes that the image exists locally and never pulls the image.
- `livenessProbeFailureThreshold`: The number of times that the `kubelet` retries a liveness probe after the pod starts and the first liveness probe fails. After the failure threshold is reached, the container is restarted and the pod is marked `Unready` for a readiness probe, if applicable. The default value is `3`.
- `livenessProbePeriodSeconds`: The interval in seconds at which the `kubelet` performs a liveness probe. The default value is `600`.
- `livenessProbeTimeoutSeconds`: The time in seconds after which the liveness probe times out. The default value is `10`.
- `maxBulkSoftTaintCount`: The maximum number of worker nodes that can be tainted or untainted with `PreferNoSchedule` at the same time. To disable this feature, set to `0`. The default value is `0`.
- `maxBulkSoftTaintTime`: The maximum amount of time that worker nodes can be tainted or untainted with `PreferNoSchedule` at the same time. The default value is `10m`.
- `maxFailingTime`: The maximum time in minutes that the cluster autoscaler pod runs without a completed action before the pod is automatically restarted. The default value is `15m`.
- `maxInactivity`: The maximum time in minutes that the cluster autoscaler pod runs without any recorded activity before the pod is automatically restarted. The default value is `10m`.
- `maxNodeProvisionTime`: The maximum amount of time in minutes that a worker node can take to begin provisioning before the cluster autoscaler cancels the scale-up request. The default value is `120m`.
- `maxNodeGroupBinpackingDuration`: The maximum time in seconds spent in bin-packing simulation for each worker pool. The default value is `10s`.
- `maxNodesPerScaleUp`: The maximum number of nodes that can be added in a single scale-up. This value is intended strictly for optimizing autoscaler algorithm latency and should not be used as a rate limit for scale-up. The default value is `1000`.
- `maxRetryGap`: The maximum time in seconds to retry after failing to connect to the service API. Use this parameter with the `retryAttempts` parameter to adjust the retry window for the cluster autoscaler. The default value is `60`.
- `parallelDrain`: Set to `true` to allow parallel draining of nodes. The default value is `false`.
- `maxScaleDownParallelism`: Supported in versions 1.2.4 and later. The maximum number of nodes, both empty and still needing to be drained, that can be deleted in parallel. The default value is `10`.
- `maxDrainParallelism`: The maximum number of nodes that still need to be drained that can be drained and deleted in parallel. The default value is `1`.
- `nodeDeletionBatcherInterval`: How long in minutes that the autoscaler can gather nodes to delete them in a batch. The default value is `0m`.
- `nodeDeleteDelayAfterTaint`: How long in seconds to wait before deleting a node after tainting it. The default value is `5s`.
- `enforceNodeGroupMinSize`: Set to `true` to scale up the worker pool to the configured minimum size if needed. The default value is `false`.
- `resourcesLimitsCPU`: The maximum amount of worker node CPU that the `ibm-iks-cluster-autoscaler` pod can consume. The default value is `600m`.
- `resourcesLimitsMemory`: The maximum amount of worker node memory that the `ibm-iks-cluster-autoscaler` pod can consume. The default value is `600Mi`.
- `coresTotal`: The minimum and maximum number of cores in the cluster. The cluster autoscaler does not scale the cluster beyond these numbers. The default value is `0:320000`.
- `memoryTotal`: The minimum and maximum amount of memory in gigabytes in the cluster. The cluster autoscaler does not scale the cluster beyond these numbers. The default value is `0:6400000`.
- `resourcesRequestsCPU`: The minimum amount of worker node CPU that the `ibm-iks-cluster-autoscaler` pod starts with. The default value is `200m`.
- `resourcesRequestsMemory`: The minimum amount of worker node memory that the `ibm-iks-cluster-autoscaler` pod starts with. The default value is `200Mi`.
- `retryAttempts`: The maximum number of attempts to retry after failing to connect to the service API. Use this parameter with the `maxRetryGap` parameter to adjust the retry window for the cluster autoscaler. The default value is `64`.
- `logLevel`: The log level for the autoscaler. Logging levels are `info`, `debug`, `warning`, and `error`. The default value is `info`.
- `scaleDownDelayAfterAdd`: The amount of time after scale-up that scale-down evaluation resumes. The default value is `10m`.
- `scaleDownDelayAfterDelete`: The amount of time after node deletion that scale-down evaluation resumes. The default value is the same as the `scanInterval`, which is `1m`.
- `scaleDownDelayAfterFailure`: The amount of time in minutes that the autoscaler must wait after a failure. The default value is `3m`.
- `kubeClientBurst`: The allowed burst for the Kubernetes client. The default value is `300`.
- `kubeClientQPS`: The QPS value for the Kubernetes client: how many queries are accepted after the burst is exhausted. The default value is `5.0`.
- `maxEmptyBulkDelete`: Supported only in versions earlier than 1.2.4. The maximum number of empty nodes that can be deleted by the autoscaler at the same time. The default value is `10`.
- `maxGracefulTerminationSec`: The maximum number of seconds that the autoscaler waits for a pod to end when scaling down a node. The default value is `600`.
- `maxTotalUnreadyPercentage`: The maximum percentage of unready nodes in the cluster. After this value is exceeded, the autoscaler stops operations. The default value is `45`.
- `okTotalUnreadyCount`: The number of allowed unready nodes, irrespective of the `maxTotalUnreadyPercentage` value. The default value is `3`.
- `unremovableNodeRecheckTimeout`: The timeout in minutes before the autoscaler rechecks a node that couldn't be removed in an earlier attempt. The default value is `5m`.
- `scaleDownUnneededTime`: The amount of time in minutes that a worker node must be unnecessary before it can be scaled down. The default value is `10m`.
- `scaleDownUtilizationThreshold`: The worker node utilization threshold. If the worker node utilization is less than the threshold, the worker node is considered for scale-down. Worker node utilization is calculated as the sum of the CPU and memory resources that are requested by all pods that run on the worker node, divided by the worker node resource capacity. The default value is `0.5`.
- `scaleDownUnreadyTime`: The amount of time in minutes that the autoscaler must wait before an unready node is considered for scale-down. The default value is `20m`.
- `scaleDownNonEmptyCandidatesCount`: The maximum number of non-empty nodes considered in one iteration as candidates for scale-down with drain. The default value is `30`.
- `scaleDownCandidatesPoolRatio`: The ratio of nodes that are considered as additional non-empty candidates for scale-down when some candidates from the previous iteration are no longer valid. The default value is `0.1`.
- `scaleDownCandidatesPoolMinCount`: The minimum number of nodes that are considered as additional non-empty candidates for scale-down when some candidates from previous iterations are no longer valid. The default value is `50`.
- `scaleDownEnabled`: When set to `false`, the autoscaler does not perform scale-down. The default value is `true`.
- `scaleDownUnreadyEnabled`: When set to `true`, unready nodes are scheduled for scale-down. The default value is `true`.
- `scanInterval`: How often in minutes the cluster autoscaler scans for workload usage that triggers scaling up or down. The default value is `1m`.
- `skipNodesWithLocalStorage`: When set to `true`, worker nodes that have pods that save data to local storage are not scaled down. The default value is `true`.
- `skipNodesWithSystemPods`: When set to `true`, worker nodes that have `kube-system` pods are not scaled down. Do not set the value to `false`, because scaling down `kube-system` pods might have unexpected results. The default value is `true`.
- `expendablePodsPriorityCutoff`: Pods with a priority under the cutoff are expendable. They can be removed without consideration during scale-down and they don't cause scale-up. Pods with `PodPriority` set to null are not expendable. The default value is `-10`.
- `minReplicaCount`: The minimum number of replicas that a ReplicaSet or replication controller must allow to be deleted during scale-down. The default value is `0`.
- `maxPodEvictionTime`: The maximum time that the autoscaler tries to evict a pod before stopping. The default value is `2m`.
- `newPodScaleUpDelay`: Pods newer than this value in seconds are not considered for scale-up. Can be increased for individual pods through the `cluster-autoscaler.kubernetes.io/pod-scale-up-delay` annotation. The default value is `0s`.
- `balancingIgnoreLabel`: The node label key to ignore during zone balancing. Apart from the fixed node labels, the autoscaler can ignore an additional 5 node labels during zone balancing. For example, `balancingIgnoreLabel1: label1`, `balancingIgnoreLabel2: custom-label2`.
- `workerPoolsConfig.json`: The worker pools that you want to autoscale, including their minimum and maximum number of worker nodes per zone, in the format `{"name": "<pool_name>","minSize": 1,"maxSize": 2,"enabled":false}`.
  - `name`: The name or ID of the worker pool that you want to enable or disable for autoscaling. To list available worker pools, run `ibmcloud oc worker-pool ls --cluster <cluster_name_or_ID>`.
  - `maxSize`: The maximum number of worker nodes per zone that the cluster autoscaler can scale up to. The value must be equal to or greater than the value that you set for `minSize`.
  - `minSize`: The minimum number of worker nodes per zone that the cluster autoscaler can scale down to. If you need your ALB pods to be spread for high availability, set the value to at least `2`. If you disabled all public ALBs in each zone of your standard cluster, you can set the value to `0`. Keep in mind that setting a `minSize` does not automatically trigger a scale-up. The `minSize` is a threshold so that the cluster autoscaler does not scale to fewer than this minimum number of worker nodes per zone. If your cluster does not have this number of worker nodes per zone yet, the cluster autoscaler does not scale up until you have workload resource requests that require more resources.
  - `enabled`: When `true`, the cluster autoscaler can scale your worker pool. When `false`, the cluster autoscaler won't scale the worker pool. Later, if you want to remove the cluster autoscaler, you must first disable each worker pool in the ConfigMap. If you enable a worker pool for autoscaling and then later add a zone to this worker pool, restart the cluster autoscaler pod so that it picks up this change: `oc delete pod -n kube-system <cluster_autoscaler_pod>`.

  By default, the `default` worker pool is not enabled, with a `maxSize` of `2` and a `minSize` of `1`.
- `scaleDownGPUUtilizationThreshold`: The sum of GPU requests of all pods running on the node, divided by the node's allocatable GPU resources. When resource requests are less than this threshold, a node can be considered for scale-down. The utilization calculation considers only GPU resources; CPU and memory utilization are ignored. The default value is `0.5`.
- `OSReservedMemoryGi`: The amount of reserved memory in GiB. The default value is `0.3`.
- `OSReservedCPUMili`: The amount of reserved CPU in millicpu (m). The default value is `30`.
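To make the `scaleDownUtilizationThreshold` behavior concrete, the following Python sketch models the check. Note an assumption: the upstream Kubernetes cluster autoscaler compares the larger of the CPU and memory utilization ratios against the threshold, so a node is a scale-down candidate only when both ratios are below it; the function name and parameters here are illustrative:

```python
def is_scale_down_candidate(cpu_requested_m, cpu_capacity_m,
                            mem_requested_mi, mem_capacity_mi,
                            threshold=0.5):
    """Model the scaleDownUtilizationThreshold check.

    Utilization per resource is the sum of pod requests divided by the
    node's capacity. Assumption: the node qualifies for scale-down only
    when the larger of the CPU and memory ratios is below the threshold.
    """
    cpu_util = cpu_requested_m / cpu_capacity_m
    mem_util = mem_requested_mi / mem_capacity_mi
    return max(cpu_util, mem_util) < threshold
```

For example, a node with 100m CPU requested out of 1000m and 100Mi memory requested out of 1000Mi is well under the default `0.5` threshold and qualifies; raising CPU requests to 600m pushes utilization to 0.6, so the node no longer qualifies.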