Migrating to a new operating system
Complete the following steps to migrate your worker nodes to a new operating system.
Beginning with cluster version 4.18, Red Hat Enterprise Linux CoreOS (RHCOS) is the default operating system and RHEL worker nodes are deprecated in this version. Support for RHEL worker nodes ends with the release of version 4.21. Migrate your clusters to use RHCOS worker nodes as soon as possible.
| Milestone | Description |
|---|---|
| 4.18 release: 23 May 2025 | Beginning with cluster version 4.18, Red Hat Enterprise Linux CoreOS (RHCOS) is the default operating system and RHEL worker nodes are deprecated in this version. RHEL workers are still available in version 4.18 only to complete the migration to RHCOS workers. |
| 4.21 release | Cluster version 4.21 supports only RHCOS worker nodes. Migrate your RHEL 9 worker nodes to RHCOS before updating to version 4.21. |
The steps to migrate to RHCOS are different based on your use case. Review the following links for steps that apply to your use case.
- Migrating worker nodes to RHCOS: Follow these steps for most use cases.
- Migrating GPU worker nodes to RHCOS: If you have GPU worker nodes, follow these steps to migrate to RHCOS.
Looking for Terraform steps? See this blog post for steps to migrate to RHCOS by using Terraform. TBD link
Migrating worker nodes to RHCOS
Complete the following steps to migrate your worker nodes to RHCOS.
To migrate to RHCOS, you must provision a new worker pool, then delete the previous RHEL worker pool. The new worker pool must reside in the same zone as the previous worker pool.
Step 1: Upgrade your cluster master
Run the following command to update the master.
ibmcloud ks cluster master update --cluster <clusterNameOrID> --version 4.18_openshift
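The master update runs in the background and can take some time to complete. To check the master status and confirm the new version, you can run the `ibmcloud oc cluster get` command.

ibmcloud oc cluster get --cluster <clusterNameOrID>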
Step 2: Create a new RHCOS worker pool
- Make sure to specify `RHCOS` as the `--operating-system` of the new pool.
- Make sure that the number of nodes specified with the `--size-per-zone` option matches the number of workers per zone for the RHEL worker pool. To list a worker pool's zones and the number of workers per zone, run `ibmcloud oc worker-pool get --worker-pool WORKER_POOL --cluster CLUSTER`.
- Make sure to include the `--entitlement ocp_entitled` option if you have a Cloud Pak entitlement.
- Run the `ibmcloud oc worker-pool create` command to create a new worker pool.

  VPC: Example command to create a RHCOS worker pool. For more information about the `worker-pool create vpc-gen2` command, see the CLI reference. For command details, see Adding worker nodes in VPC clusters.

  ibmcloud oc worker-pool create vpc-gen2 --name <worker_pool_name> --cluster <cluster_name_or_ID> --flavor <flavor> --size-per-zone <number_of_workers_per_zone> --operating-system RHCOS [--entitlement ocp_entitled]

  Satellite: Example command to create a RHCOS worker pool. Note that for Satellite clusters, you must first attach hosts to your location before you can create a worker pool.

  ibmcloud oc worker-pool create satellite --cluster CLUSTER --host-label "os=RHCOS" --name NAME --size-per-zone SIZE --operating-system RHCOS --zone ZONE [--label LABEL]
- Verify that the worker pool is created and note the worker pool ID.

  ibmcloud oc worker-pool ls --cluster <cluster_name_or_ID>

  Example output

  Name            ID                           Flavor               OS            Workers
  my_workerpool   aaaaa1a11a1aa1aaaaa111aa11   b3c.4x16.encrypted   REDHAT_8_64   0
- Add one or more zones to your worker pool. When you add a zone, the number of worker nodes that you specified with the `--size-per-zone` option is added to the zone. These worker nodes run the RHCOS operating system. It's recommended that the zones you add to the RHCOS worker pool match the zones of the RHEL worker pool that you are replacing. To view the zones attached to a worker pool, run `ibmcloud oc worker-pool zones --worker-pool WORKER_POOL --cluster CLUSTER`. If you add zones that don't match those of the RHEL worker pool, make sure that your workloads are not impacted by moving them to a new zone. Note that File and Block storage are not supported across zones.
Step 3: Add worker nodes to your RHCOS worker pool
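To add the RHCOS worker nodes, add zones to the new worker pool as described in the previous step. For example, in a VPC cluster you can run the `ibmcloud oc zone add vpc-gen2` command, which is the same command used in the GPU migration steps later in this topic. The subnet ID and zone values are placeholders for your own VPC details. Each zone that you add provisions the number of RHCOS worker nodes that you set with `--size-per-zone`.

ibmcloud oc zone add vpc-gen2 \
  --cluster <cluster_name_or_ID> \
  --worker-pool <worker_pool_name_or_ID> \
  --subnet-id <vpc_subnet_ID> \
  --zone <vpc_zone>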
Step 4: Migrate your workloads
If you have software-defined storage (SDS) solutions like OpenShift Data Foundation or Portworx, update your storage configurations to include the new worker nodes and verify your workloads before removing your RHEL worker nodes.
For more information about rescheduling workloads, see Safely Drain a Node in the Kubernetes docs or Understanding how to evacuate pods on nodes in the Red Hat OpenShift docs.
- Migrate per pod by cordoning the node and deleting individual pods.

  oc adm cordon no/<nodeName>
  oc delete po -n <namespace> <podName>
- Migrate per node by draining nodes. For more information, see Safely drain a node. An example drain command is shown after this list.
- Migrate per worker pool by deleting your entire RHEL worker pool.

  ibmcloud ks worker-pool rm --cluster <clusterNameOrID> --worker-pool <workerPoolNameOrID>
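If you drain nodes, a minimal sketch looks like the following. The `--ignore-daemonsets` and `--delete-emptydir-data` flags are commonly needed so that DaemonSet-managed pods and pods with `emptyDir` volumes don't block the drain; review whether they are appropriate for your workloads before you use them.

# Drain the RHEL node so that its pods reschedule onto the RHCOS workers.
oc adm drain <nodeName> --ignore-daemonsets --delete-emptydir-data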
Step 5: Remove the RHEL worker nodes
Remove the worker pool that contains the RHEL workers.
Consider scaling down your RHEL worker pool and keeping it for several days before you remove it. This way, you can easily scale the worker pool back up if your workload experiences disruptions during the migration process. When you have determined that your workload is stable and functions normally, you can safely remove the RHEL worker pool.
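If you prefer to keep the RHEL worker pool for a few days, you can reduce its size instead of deleting it, for example with the `ibmcloud oc worker-pool resize` command. The size value is a placeholder for the number of workers per zone that you want to keep.

ibmcloud oc worker-pool resize --cluster <cluster_name_or_ID> --worker-pool <worker_pool_name_or_ID> --size-per-zone <number_of_workers_per_zone>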
- List your worker pools and note the name of the worker pool you want to remove.
ibmcloud oc worker-pool ls --cluster CLUSTER
- Run the command to remove the worker pool.
ibmcloud oc worker-pool rm --worker-pool WORKER_POOL --cluster CLUSTER
Migrating NVIDIA GPU resources to RHCOS worker nodes
Review the following steps to migrate your NVIDIA GPU operator resources from RHEL 8 GPU worker nodes to RHCOS worker nodes.
The NVIDIA GPU operator consists of the following resources:
- `gpu-feature-discovery`
- `nvidia-container-toolkit-daemonset`
- `nvidia-cuda-validator`
- `nvidia-dcgm`
- `nvidia-dcgm-exporter`
- `nvidia-device-plugin-daemonset`
- `nvidia-driver-daemonset`
- `nvidia-node-status-exporter`
- `nvidia-operator-validator`
The main component of interest is `nvidia-driver-daemonset`. This component is responsible for installing the GPU driver on the GPU worker node. These drivers are installed differently when targeting RHEL 8 versus RHCOS worker nodes.
Official statement from NVIDIA GPU operator: All worker nodes or node groups to run GPU workloads in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems. For more information, see Installing the NVIDIA GPU operator.
The NVIDIA GPU operator isn't capable of simultaneously managing driver installations on different worker node operating systems. This limitation means that if the GPU driver installation is solely managed by the NVIDIA GPU operator, then a full migration of the driver installation is required when changing worker node operating systems.
- Migrating a Red Hat OpenShift on IBM Cloud version 4.17 VPC cluster with RHEL 8 worker nodes to version 4.18 with RHCOS worker nodes:
  - Version 4.17 does not support RHCOS worker nodes.
  - Version 4.17 supports RHEL 8 and RHEL 9 (exclusions apply).
  - Version 4.18 does not support RHEL 8 worker nodes.
  - Version 4.18 supports only RHCOS and RHEL 9.
  - The NVIDIA GPU operator does not support the RHEL 9 operating system.
Complete the following steps to migrate NVIDIA GPU operator driver installations from RHEL 8 to RHCOS worker nodes. This example specifically describes migration steps for the following cluster configuration:
Initial environment details
- Red Hat OpenShift on IBM Cloud 4.17 VPC cluster
- RHEL 8 worker nodes using NVIDIA GPU flavors
- NVIDIA GPU operator installed
- NVIDIA GPU operator's ClusterPolicy installed
- Operator, ClusterPolicy, and operands ready
- Get the details of the `nvidia-gpu-operator` pods.

  oc get po -n nvidia-gpu-operator -o wide

  Example output

  NAME                                       READY   STATUS      RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
  gpu-feature-discovery-ng7zn                1/1     Running     0          6m6s    172.23.145.152   10.240.0.15   <none>           <none>
  gpu-operator-678b489684-7zgkq              1/1     Running     0          45h     172.23.145.135   10.240.0.15   <none>           <none>
  nvidia-container-toolkit-daemonset-j4dzs   1/1     Running     0          6m6s    172.23.145.143   10.240.0.15   <none>           <none>
  nvidia-cuda-validator-l44mz                0/1     Completed   0          2m28s   172.23.145.236   10.240.0.15   <none>           <none>
  nvidia-dcgm-7sfvn                          1/1     Running     0          6m7s    172.23.145.180   10.240.0.15   <none>           <none>
  nvidia-dcgm-exporter-s5k48                 1/1     Running     0          6m6s    172.23.145.172   10.240.0.15   <none>           <none>
  nvidia-device-plugin-daemonset-xhds2       1/1     Running     0          6m6s    172.23.145.191   10.240.0.15   <none>           <none>
  nvidia-driver-daemonset-mjqls              1/1     Running     0          7m1s    172.23.145.145   10.240.0.15   <none>           <none>
  nvidia-node-status-exporter-5kvs4          1/1     Running     0          7m16s   172.23.145.235   10.240.0.15   <none>           <none>
  nvidia-operator-validator-pz7wm            1/1     Running     0          6m7s    172.23.145.153   10.240.0.15   <none>           <none>
- Get the details of the `gpu-cluster-policy` and make sure it is `ready`.

  oc get clusterpolicies.nvidia.com gpu-cluster-policy

  Example output

  NAME                 STATUS   AGE
  gpu-cluster-policy   ready    2025-03-07T03:07:00Z
Step 1: Update the cluster master
Run the following command to update the master.
ibmcloud oc cluster master update --cluster <clusterNameOrID> --version 4.18_openshift
At this point, don't update worker nodes to 4.18. For now, keep your RHEL 8 workers on version 4.17.
Step 2: Create a RHCOS worker pool
- Run the following command to create a RHCOS worker pool.

  ibmcloud oc worker-pool create vpc-gen2 \
    --cluster <clusterNameOrID> \
    --name <workerPoolName> \
    --flavor <workerNodeFlavor> \
    --size-per-zone <sizePerZoneCount> \
    --operating-system RHCOS
Don't add zones to this RHCOS worker pool yet. At this stage, the worker pool should not contain any worker nodes.
Step 3: Add worker pool labels to the RHCOS worker pool
Add the following labels to your RHCOS worker pool.

- `nvidia.com/gpu.deploy.operands=false`. For more information, see Preventing Installation of Operands on Some Nodes.
- `nvidia.com/gpu.deploy.driver=false`. For more information, see Preventing Installation of NVIDIA GPU Driver on Some Nodes.
Adding labels to the worker pool allows the node labels to exist before the worker nodes join the cluster. This ensures that NVIDIA GPU resources are not automatically scheduled from the start. If NVIDIA GPU resources are scheduled on worker nodes where drivers cannot be installed, the status of the ClusterPolicy resource degrades. Drivers cannot yet be installed on RHCOS worker nodes because the NVIDIA GPU operator is still configured to use the RHEL 8 installation method.
Run the following command to add labels to your worker pool.
ibmcloud oc worker-pool label set \
--cluster <clusterNameOrID> \
--worker-pool <workerPoolNameOrID> \
--label nvidia.com/gpu.deploy.operands=false \
--label nvidia.com/gpu.deploy.driver=false
Step 4: Add RHCOS worker nodes to cluster
Add capacity to your cluster to allow for workload migration. Adding zones to the worker pool triggers the worker nodes to start provisioning and join the cluster. Note that the NVIDIA GPU resources are not deployed on the RHCOS worker nodes yet.
ibmcloud oc zone add vpc-gen2 \
--cluster <clusterNameOrID> \
--worker-pool <workerPoolNameOrID> \
--subnet-id <vpcSubnetID> \
--zone <vpcZone>
At this point, RHCOS worker nodes are available in the cluster to begin migration.
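To confirm that the new RHCOS worker nodes finished provisioning before you continue, you can list the workers in the new worker pool.

ibmcloud oc worker ls --cluster <clusterNameOrID> --worker-pool <workerPoolNameOrID>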
Step 5: Change the driver installer on the RHEL 8 worker nodes to unmanaged
Add the `nvidia.com/gpu.deploy.driver=false` label to your RHEL 8 worker nodes. This label unschedules the existing driver installer pods from the RHEL 8 workers, but the driver is not uninstalled. Other operands, including the device plug-in, remain on the RHEL 8 workers, and the `ClusterPolicy` state remains `ready`. Because the driver is still installed and the device plug-in is still running, GPU workloads continue to function.
- Add the `nvidia.com/gpu.deploy.driver=false` label to the RHEL 8 worker nodes.

  To label an individual worker node:

  oc label nodes <nodeName> "nvidia.com/gpu.deploy.driver=false"

  To label an entire worker pool:

  ibmcloud oc worker-pool label set \
    --cluster <clusterNameOrID> \
    --worker-pool <workerPoolNameOrID> \
    --label nvidia.com/gpu.deploy.driver=false
- List the pods and confirm that the driver installer pod is terminating on the RHEL 8 worker nodes.

  oc get po -n nvidia-gpu-operator -o wide

  Example output

  NAME                                       READY   STATUS        RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
  gpu-feature-discovery-ng7zn                1/1     Running       0          4h27m   172.23.145.152   10.240.0.15   <none>           <none>
  gpu-operator-678b489684-7zgkq              1/1     Running       0          2d2h    172.23.145.135   10.240.0.15   <none>           <none>
  nvidia-container-toolkit-daemonset-j4dzs   1/1     Running       0          4h27m   172.23.145.143   10.240.0.15   <none>           <none>
  nvidia-cuda-validator-l44mz                0/1     Completed     0          4h24m   172.23.145.236   10.240.0.15   <none>           <none>
  nvidia-dcgm-7sfvn                          1/1     Running       0          4h27m   172.23.145.180   10.240.0.15   <none>           <none>
  nvidia-dcgm-exporter-s5k48                 1/1     Running       0          4h27m   172.23.145.172   10.240.0.15   <none>           <none>
  nvidia-device-plugin-daemonset-xhds2       1/1     Running       0          4h27m   172.23.145.191   10.240.0.15   <none>           <none>
  nvidia-driver-daemonset-mjqls              1/1     Terminating   0          4h28m   172.23.145.145   10.240.0.15   <none>           <none>
  nvidia-node-status-exporter-5kvs4          1/1     Running       0          4h28m   172.23.145.235   10.240.0.15   <none>           <none>
  nvidia-operator-validator-pz7wm            1/1     Running       0          4h27m   172.23.145.153   10.240.0.15   <none>           <none>
- Confirm that the `gpu-cluster-policy` is `ready`.

  oc get clusterpolicies.nvidia.com gpu-cluster-policy

  Example output

  NAME                 STATUS   AGE
  gpu-cluster-policy   ready    2025-03-07T03:07:00Z
Step 6: Schedule driver installer and other operands to RHCOS worker nodes
Add the `nvidia.com/gpu.deploy.driver=true` and `nvidia.com/gpu.deploy.operands=true` labels to your RHCOS workers.

Adding these labels attempts to schedule the driver installer, device plug-in, and other operands on the RHCOS worker nodes. Most pods remain in the `Init` state because the driver installer fails; it is still attempting to install the driver by using the RHEL 8 method.
Run the `oc label nodes` command to label individual worker nodes, or use `ibmcloud oc worker-pool label set` to label the entire worker pool.

To label an individual worker node:

oc label nodes <nodeName> "nvidia.com/gpu.deploy.driver=true"
oc label nodes <nodeName> "nvidia.com/gpu.deploy.operands=true"

To label an entire worker pool:

ibmcloud oc worker-pool label set \
  --cluster <clusterNameOrID> \
  --worker-pool <workerPoolNameOrID> \
  --label nvidia.com/gpu.deploy.driver=true \
  --label nvidia.com/gpu.deploy.operands=true
After adding the labels, continue to the next step.
Step 7: Convert driver installer from RHEL 8 to RHCOS installation method
Delete the `nvidia-driver-daemonset` DaemonSet. This DaemonSet is specific to RHEL 8 and is no longer needed. The GPU operator reconciles, detects that a RHCOS worker node is present in the cluster, and re-creates the driver installer DaemonSet, this time with the RHCOS installation method that is based on the OpenShift Driver Toolkit.
- Delete the `nvidia-driver-daemonset` DaemonSet. After deleting the DaemonSet, don't add or reload any of the RHEL 8 GPU workers.

  oc delete daemonset -n nvidia-gpu-operator nvidia-driver-daemonset
- List the pods and confirm that the GPU driver is installed on the RHCOS worker nodes and that the remaining operands are `ready`.

  oc get po -n nvidia-gpu-operator -o wide

  Example output

  NAME                                                  READY   STATUS      RESTARTS      AGE     IP               NODE                                                     NOMINATED NODE   READINESS GATES
  gpu-feature-discovery-h4bhx                           1/1     Running     0             18m     172.23.137.119   test-coajphf20ooqeeg7u9dg-btsstagevpc-gx316x8-000239a8   <none>           <none>
  gpu-feature-discovery-ng7zn                           1/1     Running     0             4h58m   172.23.145.152   10.240.0.15                                              <none>           <none>
  gpu-operator-678b489684-7zgkq                         1/1     Running     0             2d2h    172.23.145.135   10.240.0.15                                              <none>           <none>
  nvidia-container-toolkit-daemonset-79j86              1/1     Running     0             18m     172.23.137.115   test-coajphf20ooqeeg7u9dg-btsstagevpc-gx316x8-000239a8   <none>           <none>
  nvidia-container-toolkit-daemonset-j4dzs              1/1     Running     0             4h58m   172.23.145.143   10.240.0.15                                              <none>           <none>
  nvidia-cuda-validator-l44mz                           0/1     Completed   0             4h55m   172.23.145.236   10.240.0.15                                              <none>           <none>
  nvidia-cuda-validator-xgscz                           0/1     Completed   0             15m     172.23.137.121   test-coajphf20ooqeeg7u9dg-btsstagevpc-gx316x8-000239a8   <none>           <none>
  nvidia-dcgm-7sfvn                                     1/1     Running     0             4h58m   172.23.145.180   10.240.0.15                                              <none>           <none>
  nvidia-dcgm-9rpnz                                     1/1     Running     0             18m     172.23.137.117   test-coajphf20ooqeeg7u9dg-btsstagevpc-gx316x8-000239a8   <none>           <none>
  nvidia-dcgm-exporter-s5k48                            1/1     Running     0             4h58m   172.23.145.172   10.240.0.15                                              <none>           <none>
  nvidia-dcgm-exporter-x8vlc                            1/1     Running     2 (14m ago)   18m     172.23.137.116   test-coajphf20ooqeeg7u9dg-btsstagevpc-gx316x8-000239a8   <none>           <none>
  nvidia-device-plugin-daemonset-7g5hz                  1/1     Running     0             18m     172.23.137.120   test-coajphf20ooqeeg7u9dg-btsstagevpc-gx316x8-000239a8   <none>           <none>
  nvidia-device-plugin-daemonset-xhds2                  1/1     Running     0             4h58m   172.23.145.191   10.240.0.15                                              <none>           <none>
  nvidia-driver-daemonset-416.94.202502260030-0-dkcmh   2/2     Running     0             19m     172.23.137.107   test-coajphf20ooqeeg7u9dg-btsstagevpc-gx316x8-000239a8   <none>           <none>
  nvidia-node-status-exporter-5kvs4                     1/1     Running     0             5h      172.23.145.235   10.240.0.15                                              <none>           <none>
  nvidia-node-status-exporter-94v9f                     1/1     Running     0             19m     172.23.137.110   test-coajphf20ooqeeg7u9dg-btsstagevpc-gx316x8-000239a8   <none>           <none>
  nvidia-operator-validator-4wk6z                       1/1     Running     0             18m     172.23.137.118   test-coajphf20ooqeeg7u9dg-btsstagevpc-gx316x8-000239a8   <none>           <none>
  nvidia-operator-validator-pz7wm                       1/1     Running     0             4h58m   172.23.145.153   10.240.0.15                                              <none>           <none>
- Confirm that the `gpu-cluster-policy` is `ready`.

  oc get clusterpolicies.nvidia.com gpu-cluster-policy

  Example output

  NAME                 STATUS   AGE
  gpu-cluster-policy   ready    2025-03-07T03:07:00Z
- Describe your nodes and confirm the allocatable GPUs.

  oc describe no

  Example output

  ...
  Capacity:
    nvidia.com/gpu:  1
  ...
  Allocatable:
    nvidia.com/gpu:  1
Step 8: Migrate GPU-dependent workloads to your RHCOS worker nodes
Now that RHCOS GPU worker nodes have the GPU driver installed and are ready for scheduling, migrate GPU-dependent workloads to the RHCOS worker nodes.
- Migrate per pod by cordoning the node and deleting individual pods.

  oc adm cordon no/<nodeName>
  oc delete po -n <namespace> <podName>
- Migrate per node by draining nodes. For more information, see Safely drain a node.
- Migrate per worker pool by deleting your entire RHEL worker pool.

  ibmcloud oc worker-pool rm --cluster <clusterNameOrID> --worker-pool <workerPoolNameOrID>
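To confirm that your GPU workloads were rescheduled onto the RHCOS worker nodes, you can list the pods in your workload namespace with their node assignments; the namespace value is a placeholder for your own workload namespace. Check the NODE column and verify that the pods now run on the RHCOS worker nodes.

oc get po -n <namespace> -o wide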
Step 9: Remove labels from your RHCOS worker pool
Remove the worker pool labels that you added in a previous step. This removal ensures that new RHCOS worker nodes provisioned afterward don't have these labels and that the NVIDIA GPU components are automatically installed on them.
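To remove the labels from the worker pool, you can use the `ibmcloud oc worker-pool label rm` command, assuming it is available in your CLI version (run `ibmcloud oc worker-pool label --help` to check). The cluster and worker pool values are placeholders.

ibmcloud oc worker-pool label rm \
  --cluster <clusterNameOrID> \
  --worker-pool <workerPoolNameOrID> \
  --label nvidia.com/gpu.deploy.operands \
  --label nvidia.com/gpu.deploy.driver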
Step 10: Scale down or delete RHEL 8 worker pool
At this point, migration of NVIDIA GPU driver is complete. You can scale down or delete your RHEL worker pools.
ibmcloud oc worker-pool rm --cluster <clusterNameOrID> --worker-pool <workerPoolNameOrID>
Migrating to Red Hat Enterprise Linux 9
For RHEL 9, the `/tmp` directory is a separate partition that has the `nosuid`, `noexec`, and `nodev` options set. If your apps install to and run scripts or binaries under the `/tmp` directory, they might fail. Update your apps to use the `/var/tmp` directory instead of the `/tmp` directory to run temporary scripts or binaries.
The default `cgroup` implementation is `cgroup` v2. In RHEL 9, `cgroup` v1 isn't supported. Review the Kubernetes migration documentation for `cgroup` v2 and verify that your applications fully support `cgroup` v2. There are known issues with older versions of Java that may cause out of memory (OOM) issues for workloads.
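To check which `cgroup` version a worker node uses, you can inspect the filesystem type of `/sys/fs/cgroup` from a debug session on the node; a result of `cgroup2fs` indicates `cgroup` v2. This is a general Kubernetes technique rather than a step specific to this migration, and the node name is a placeholder.

oc debug node/<nodeName> -- chroot /host stat -fc %T /sys/fs/cgroup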
- Review your worker pool operating systems to find which pools you need to migrate.

  ibmcloud ks worker-pools -c CLUSTER
- Specify the `RHEL_9_64` operating system for the worker pool.

  ibmcloud oc worker-pool operating-system set --cluster CLUSTER --worker-pool POOL --operating-system RHEL_9_64
- Update each worker node in the worker pool by running the `ibmcloud oc worker update` command for Classic clusters or the `ibmcloud oc worker replace` command for VPC clusters.

  Make sure that you have enough worker nodes to support your workload while you update or replace the relevant worker nodes. For more information, see Updating VPC worker nodes or Updating classic worker nodes.
Example command to update Classic worker nodes.
ibmcloud oc worker update --cluster CLUSTER --worker WORKER1_ID [--worker WORKER2_ID]
Example command to replace VPC worker nodes.
ibmcloud oc worker replace --cluster CLUSTER --worker WORKER_ID
- Get the details for your worker pool and workers. In the output, verify that your worker nodes run the `RHEL_9_64` operating system.

  Get the details for a worker pool.

  ibmcloud oc worker-pools -c CLUSTER

  Get the details for a worker node.

  ibmcloud oc worker get --cluster CLUSTER --worker WORKER_NODE_ID