Why does my worker node show a NetworkUnavailable error?
Applies to: Virtual Private Cloud, Classic infrastructure, Satellite
When you update your master or worker nodes, your worker nodes enter a `Node network unavailable` state.
Your worker nodes might enter a `NetworkUnavailable` or `Node network unavailable` state whenever the `calico-node` pod has been shut down. This might happen during a Calico patch update, but shouldn't impact your application availability.
When Calico is updated, the `node.kubernetes.io/network-unavailable:NoSchedule` taint is added to your worker node and the `Node network unavailable` condition becomes `True`. Both the taint and the condition are cleared when Calico restarts, which typically takes only a few seconds.
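If you want to confirm that a node is in this state, you can check the taint directly with `kubectl`. The following is a minimal sketch; the node name is a placeholder, so substitute one from `kubectl get nodes`.

```sh
# Check whether the network-unavailable taint is currently set on a worker.
# The node name is a placeholder; list your nodes with `kubectl get nodes`.
kubectl describe node <node-name> | grep -A2 '^Taints:'
```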
While this happens, you might see error messages similar to the following.

```
[Kubernetes] Node network unavailable is Triggered on kubernetes.node.name = 10.184.XXX.XXX
[Kubernetes] Node network unavailable is Triggered on kubernetes.node.name = 10.184.XXX.XXX
[Kubernetes] Node network unavailable is Triggered on kubernetes.node.name = 10.184.XXX.XXX
```
Sometimes, the restart might take longer. In nearly all cases, the restart is still fast enough to avoid any worker node network issues. However, if a Calico restart is delayed, network interruptions are possible. In these cases, the node network unavailable taint and condition are designed to keep new apps from being scheduled onto the node until Calico and the node recover. Calico updates are rolled out in a controlled manner to minimize overall application impact if a node problem does occur.
Monitor the `Node network unavailable` state with IBM Cloud Monitoring
By using a monitoring service such as IBM Cloud Monitoring, you can configure alerts for when a worker node goes into a `Node network unavailable` state and count each time this happens. You can also configure thresholds and tune your alerts to allow for worker nodes entering the `Node network unavailable` state during routine Calico patches.
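As a cross-check against your alerts, you can list the `NetworkUnavailable` condition for every node straight from the cluster. This is a sketch that uses standard `kubectl` output formatting, not an IBM Cloud Monitoring query.

```sh
# Print each node name with its NetworkUnavailable condition status;
# nodes reporting "True" are the ones an alert would fire on.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="NetworkUnavailable")].status}{"\n"}{end}'
```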
When you set up IBM Cloud Monitoring alerts, take the following scenarios into consideration.
- A `Node network unavailable` alert might indicate a real problem when a `calico-node` pod fails to reach a `Running` state and its container restart count continues to increase. You can check both with the command after this list.
- A worker node remains in the `Node network unavailable` state for a long time.
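To inspect the `calico-node` pod state and restart counts, a `kubectl` sketch like the following can help. The namespace is an assumption: on recent clusters the pods run in `calico-system`, while older clusters use `kube-system`.

```sh
# Show calico-node pod status, restart counts, and the node each pod runs on.
# Adjust the namespace if your cluster runs Calico in kube-system instead.
kubectl get pods -n calico-system -l k8s-app=calico-node -o wide
```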
The calico-node pod does not start after a worker update or replace
The `calico-node` pod might get stuck in a state where it is unable to start on a Red Hat OpenShift VPC cluster after a worker update or replace. This is not an issue on IKS or Classic clusters. This can occur when you have the `sysdig-admission-controller-webhook` installed and try to do a worker update or replace. This happens because:
- The VPN client pod gets moved to the new worker as it is starting.
- `calico-node` on the new worker starts up, but gets stuck because it makes an `apiserver` call that times out after 2 seconds.
- The `apiserver` call then tries to call the webhook, which fails because the VPN client pod was trying to start on the new node. The VPN pod cannot start successfully because `calico-node` hasn't started up yet.
In summary, the `calico-node` pod startup depends on the webhook working; the webhook depends on the VPN client pod; and the VPN client pod depends on `calico-node` starting up. The system is stuck in a circular dependency.
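If you suspect this loop, you can confirm that the webhook is present and look at its timeout. The commands below are a sketch; whether the Sysdig webhook is registered as validating or mutating depends on the install, so check both lists and substitute the kind and name you find.

```sh
# Find the Sysdig admission webhook among the registered webhook configurations.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i sysdig

# Inspect its timeout and namespace scoping; <kind> and <webhook-name> are
# placeholders for the values found above.
kubectl get <kind> <webhook-name> -o yaml | grep -E 'timeoutSeconds|namespaceSelector'
```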
If you are able to gather logs from a successfully deployed `calico-node` pod, you might see an error like this:

```
2022-09-08 07:13:19.719 [WARNING][9] startup/utils.go 228: Failed to set NetworkUnavailable; will retry error=Patch "https://172.21.0.1:443/api/v1/nodes/10.242.64.17/status?timeout=2s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```
Workarounds for calico-node
You can use one of the following methods to work around the issue and get the `calico-node` pod running again.
- Remove the `sysdig-admission-controller-webhook` from the system.
- Modify the `sysdig-admission-controller-webhook` to change its timeout to less than 2 seconds.
- Modify the `sysdig-admission-controller-webhook` to scope it to the appropriate namespaces, avoiding system-critical namespaces such as `calico-system`.
- Cordon the new node but don't drain it. Delete the VPN pod and wait for it to start on another worker. Then uncordon the node, as sketched after this list.
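A minimal sketch of the cordon workaround follows. The node and pod names are placeholders, and the VPN client pod's namespace is an assumption; locate the actual pod in your cluster before deleting it.

```sh
# Prevent new pods from scheduling onto the stuck node, without evicting
# the pods already on it.
kubectl cordon <new-node-name>

# Delete the VPN client pod so it reschedules onto a different worker.
# The namespace here is an assumption; find the pod first with
# `kubectl get pods -A | grep vpn`.
kubectl delete pod -n kube-system <vpn-client-pod-name>

# Once the VPN pod is Running on another worker, make the node schedulable again.
kubectl uncordon <new-node-name>
```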
After performing any of the previous workarounds, the `calico-node` pod can start successfully.