Updating Classic worker nodes that use OpenShift Data Foundation
Classic infrastructure
For Classic clusters with a storage solution such as OpenShift Data Foundation, you must cordon, drain, and replace each worker node sequentially. If you deployed OpenShift Data Foundation to a subset of the worker nodes in your cluster, then after you replace a worker node you must edit the ocscluster resource to include the new worker node.
The following tutorial covers both major and minor worker node updates.
- Major update
- Complete the steps with this label to apply a major update, for example if you are updating your worker nodes to a new major version, such as from 4.11 to 4.12, as well as OpenShift Data Foundation from 4.11 to 4.12.
- Minor update
- Complete the steps with this label to apply a patch update, for example if you are updating from 4.12.15_1542_openshift to 4.12.16_1544_openshift while keeping OpenShift Data Foundation at version 4.12.
Skipping versions during an upgrade, such as from 4.8 to 4.12, is not supported.
Before updating your worker nodes, make sure to back up your app data. Also, plan to complete the following steps for one worker node at a time. Repeat the steps for each worker node that you want to update.
Update the cluster master
Major update
- If you are updating your worker nodes to a new major version, such as from 4.11 to 4.12, update the cluster master first.
ibmcloud oc cluster master update --cluster CLUSTER [--version MAJOR.MINOR.PATCH] [--force-update] [-f] [-q]
Example command
ibmcloud oc cluster master update --cluster mycluster --version 4.16.10 --force-update
- Wait until the master update finishes.
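The master update can take several minutes. The following is a minimal polling sketch, not an official procedure; mycluster is a placeholder and the "Master Status" text that is matched is an assumption about the current CLI output format.
# Check the master status once a minute until it reports Ready.
# "mycluster" is a placeholder for your cluster name or ID.
CLUSTER=mycluster
until ibmcloud oc cluster get --cluster "$CLUSTER" | grep 'Master Status:.*Ready'; do
  echo "Master update still in progress..."
  sleep 60
done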
Determine which worker nodes you want to update
Major update Minor update
-
List your worker nodes by using the oc get nodes command and determine which worker nodes you want to update.
oc get nodes
Example output
NAME           STATUS   ROLES           AGE    VERSION
10.241.0.4     Ready    master,worker   106s   v1.21.6+4b61f94
10.241.128.4   Ready    master,worker   22d    v1.21.6+bb8d50a
10.241.64.4    Ready    master,worker   22d    v1.21.6+bb8d50a
Scale down OpenShift Data Foundation
Major update Minor update
- For each worker node that you found in the previous step, find the rook-ceph-mon and rook-ceph-osd deployments.
oc get pods -n openshift-storage -o wide | grep -i <node_name>
- Scale down the deployments that you found in the previous step.
oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
oc scale deployment rook-ceph-osd-2 --replicas=0 -n openshift-storage
oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=NODE-NAME --replicas=0 -n openshift-storage
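If the node hosts several mon and OSD pods, you can script the scale-down. The following is a minimal sketch, assuming that each deployment name equals the pod name minus its two generated suffixes; NODE_NAME is a placeholder for the node that you are replacing.
# Find the rook-ceph-mon and rook-ceph-osd pods on the node being replaced
# and scale their deployments to zero (osd-prepare job pods are skipped).
NODE_NAME=10.241.0.4
for pod in $(oc get pods -n openshift-storage --field-selector spec.nodeName="$NODE_NAME" --no-headers \
    | awk '/^rook-ceph-(mon|osd)-/ && !/osd-prepare/ {print $1}'); do
  # Strip the ReplicaSet and pod hash suffixes, for example
  # rook-ceph-osd-2-7f6b5d4c9-abcde becomes rook-ceph-osd-2.
  deploy=$(echo "$pod" | sed -E 's/-[a-z0-9]+-[a-z0-9]{5}$//')
  echo "Scaling down deployment $deploy"
  oc scale deployment "$deploy" --replicas=0 -n openshift-storage
done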
Cordon and drain the worker node
Major update Minor update
-
Cordon the node. Cordoning the node prevents new pods from being scheduled on it.
oc adm cordon NODE_NAME
Example output
node/10.241.0.4 cordoned
-
Drain the node to remove all of its pods. When you drain the worker node, the pods are rescheduled onto the other worker nodes, which ensures there is no downtime. Draining also respects any pod disruption budgets.
oc adm drain NODE_NAME --force --delete-emptydir-data --ignore-daemonsets
Example output
evicting pod "managed-storage-validation-webhooks-7fd79bc9f7-pdpv6"
evicting pod "calico-kube-controllers-647dbbd685-fmrp9"
evicting pod "certified-operators-2v852"
evicting pod "csi-snapshot-controller-77fbf474df-47ddt"
evicting pod "calico-typha-8574d89b8c-7f2cc"
evicting pod "dns-operator-6d48cbff67-vrrsw"
evicting pod "router-default-6fc798b98b-9m6kh"
evicting pod "prometheus-adapter-5b77ffdd5f-hzqrp"
evicting pod "alertmanager-main-1"
evicting pod "prometheus-k8s-0"
evicting pod "network-check-source-66c7fbb86-2r78z"
-
Wait until draining finishes, then complete the following steps to replace the worker node.
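As a recap of this section, the following minimal sketch runs the cordon and drain together. NODE_NAME is a placeholder, and the --timeout flag is an assumption added as a safeguard so the drain does not block indefinitely if a pod cannot be evicted.
# Cordon the node, then drain it. oc adm drain blocks until all evictable
# pods are gone, so once it returns you can move on to replacing the node.
NODE_NAME=10.241.0.4
oc adm cordon "$NODE_NAME"
oc adm drain "$NODE_NAME" --force --delete-emptydir-data --ignore-daemonsets --timeout=20m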
Update the worker node
Major update Minor update
-
List your worker nodes by using ibmcloud oc worker ls and find the worker node that you cordoned and drained in the previous step.
ibmcloud oc worker ls -c CLUSTER
Example output
ID                                                 Primary IP     Flavor     State    Status   Zone        Version
kube-c85ra07w091uv4nid9ug-vpcoc-default-000001c1   10.241.128.4   bx2.4x16   normal   Ready    us-east-3   4.8.29_1544_openshift*
kube-c85ra07w091uv4nid9ug-vpcoc-default-00000288   10.241.0.4     bx2.4x16   normal   Ready    us-east-1   4.8.29_1544_openshift*
kube-c85ra07w091uv4nid9ug-vpcoc-default-00000352   10.241.64.4    bx2.4x16   normal   Ready    us-east-2   4.8.29_1544_openshift*
-
Update the worker node.
ibmcloud oc worker update -c CLUSTER --worker kube-***
Example output
The replacement worker node is created in the same zone with the same flavor, but gets new public or private IP addresses. During the replacement, all pods might be rescheduled onto other worker nodes and data is deleted if not stored outside the pod. To avoid downtime, ensure that you have enough worker nodes to handle your workload while the selected worker nodes are being replaced.
Replace worker node kube-c85ra07w091uv4nid9ug-cluster-default-00000288? [y/N]> y
Deleting worker node kube-c85ra07w091uv4nid9ug-cluster-default-00000288 and creating a new worker node in cluster
-
Wait for the replacement node to be provisioned and then list your worker nodes. Note that this process might take 20 minutes or more.
oc get nodes
Example output
NAME           STATUS   ROLES           AGE   VERSION
10.241.0.4     Ready    master,worker   22d   v1.21.6+bb8d50a
10.241.128.4   Ready    master,worker   22d   v1.21.6+bb8d50a
10.241.64.4    Ready    master,worker   22d   v1.21.6+bb8d50a
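Rather than checking repeatedly by hand, you can poll until every node reports Ready. This minimal sketch assumes that no other nodes in the cluster are intentionally NotReady or cordoned during the update.
# Poll until every node is Ready, then print the final node list.
until [ -z "$(oc get nodes --no-headers | awk '$2 != "Ready"')" ]; do
  echo "Waiting for the replacement worker node to become Ready..."
  sleep 60
done
oc get nodes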
Clean up the resources from the old node
Major update Minor update
-
Navigate to the openshift-storage project.
oc project openshift-storage
-
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=true | oc create -f -
The FAILED_OSD_IDS value is the integer in the pod name immediately after the rook-ceph-osd prefix. The FORCE_OSD_REMOVAL value must be changed to true in clusters that have only three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
-
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
-
Verify that the OSD removal is completed.
oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
Example output
2023-03-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
-
Identify the Persistent Volume (PV) associated with the Persistent Volume Claim (PVC) from the old node:
oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
If there is a PV in Released state, delete it:
oc delete pv <persistent_volume>
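The checks in this section can also be chained together. The following is a minimal sketch, assuming the default ocs-osd-removal-job name used above; the 10-minute timeout is an arbitrary assumption.
# Wait for the removal job to finish, confirm the completion message in its
# logs, then delete any localblock PV left in the Released state.
oc wait --for=condition=complete job/ocs-osd-removal-job -n openshift-storage --timeout=10m
oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | grep -i 'completed removal'

for pv in $(oc get pv -L kubernetes.io/hostname | grep localblock | grep Released | awk '{print $1}'); do
  echo "Deleting released PV $pv"
  oc delete pv "$pv"
done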
Add the new storage nodes
Major update Minor update
- Wait for the OpenShift Data Foundation pods to deploy to the new worker node. Verify that the OSD persistent volumes are created and that all pods are in a Running state.
oc get pv
oc get ocscluster
oc get pods -n openshift-storage
- Verify that all other required OpenShift Data Foundation pods are in a Running state, for example the mon pods.
oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-cd575c89b-b6k66    2/2   Running   0   38m
rook-ceph-mon-b-6776bc469b-tzzt8   2/2   Running   0   38m
rook-ceph-mon-d-5ff5d488b5-7v8xh   2/2   Running   0   4m8s
- Verify that new OSD pods are running on the replacement node:
oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- Identify the crashcollector pod deployment.
oc get deployment --selector=app=rook-ceph-crashcollector,node_name=NODE-NAME -n openshift-storage
- If there is an existing crashcollector deployment, delete it.
oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=NODE-NAME -n openshift-storage
- Delete the ocs-osd-removal-job.
oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
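Before moving on, you can double-check the replacement node with the following minimal sketch, which lists the OpenShift Data Foundation pods on the new node and flags anything in the namespace that is not Running or Completed. NEW_NODE_NAME is a placeholder.
# Show the openshift-storage pods scheduled on the replacement node.
NEW_NODE_NAME=10.241.0.5
oc get pods -n openshift-storage -o wide --field-selector spec.nodeName="$NEW_NODE_NAME"

# List any pod in the namespace that is not Running or Completed.
oc get pods -n openshift-storage --no-headers | awk '$3 != "Running" && $3 != "Completed"'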
Update the OpenShift Data Foundation add-on
Major update
- Check the existing version.
ibmcloud oc cluster addon ls --cluster CLUSTER
- Update the add-on.
ibmcloud oc cluster addon update openshift-data-foundation --cluster CLUSTER --version VERSION
- Verify the add-on is updated.
ibmcloud oc cluster addon ls --cluster CLUSTER
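For example, to move to a specific add-on version; mycluster and 4.16.0 are placeholders, so first list the versions that are available for your cluster.
# List the supported add-on versions, update, then verify.
ibmcloud oc cluster addon versions --addon openshift-data-foundation
ibmcloud oc cluster addon update openshift-data-foundation --cluster mycluster --version 4.16.0
ibmcloud oc cluster addon ls --cluster mycluster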
Update your cluster resource
Major update
-
Get the name of your ocscluster resource.
oc get ocscluster
Example output
NAME             AGE
ocscluster-vpc   19d
-
Run the following command to edit your ocscluster resource. A non-interactive alternative is sketched at the end of this section.
oc edit ocscluster OCS-CLUSTER-NAME
-
Set the ocsUpgrade parameter to true.
...
spec:
  billingType: hourly
  monSize: 20Gi
  autoDiscoverDevices: true
  numOfOsd: 1
  ocsUpgrade: true
  osdSize: 250Gi
status:
  storageClusterStatus: Decreasing the capacity not allowed
-
Save and close the file.
-
Wait for the update to complete.
-
Verify that the storagecluster and cephcluster resources are both deployed correctly.
oc get storagecluster -n openshift-storage
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   43h   Ready              2023-06-21T09:22:00Z   4.11.0
oc get cephcluster -n openshift-storage
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          43h   Ready   Cluster created successfully   HEALTH_OK
oc get csv -n openshift-storage
NAME                              DISPLAY                       VERSION   REPLACES                          PHASE
mcg-operator.v4.11.8              NooBaa Operator               4.11.8    mcg-operator.v4.11.7              Succeeded
ocs-operator.v4.11.8              OpenShift Container Storage   4.11.8    ocs-operator.v4.11.7              Succeeded
odf-csi-addons-operator.v4.11.8   CSI Addons                    4.11.8    odf-csi-addons-operator.v4.11.7   Succeeded
odf-operator.v4.11.8              OpenShift Data Foundation     4.11.8    odf-operator.v4.11.7              Succeeded
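If you prefer a non-interactive alternative to the oc edit step above, a merge patch achieves the same result, and you can confirm that every ClusterServiceVersion has reached the Succeeded phase. This is a sketch rather than part of the documented procedure; OCS-CLUSTER-NAME is the name returned by oc get ocscluster, for example ocscluster-vpc.
# Set ocsUpgrade to true without opening an editor.
oc patch ocscluster OCS-CLUSTER-NAME --type merge -p '{"spec":{"ocsUpgrade":true}}'

# After the update, every CSV in openshift-storage should report Succeeded;
# this prints only the entries that do not.
oc get csv -n openshift-storage --no-headers | awk '$NF != "Succeeded"'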