Why do cluster master operations fail due to a broken webhook?
Virtual Private Cloud and Classic infrastructure
This troubleshooting topic is not for general webhook troubleshooting. See Debugging webhooks for webhook problems not related to updating the cluster master.
During a master operation, such as updating your cluster version, the cluster had a broken webhook application.
Now, master operations can't complete, and you see an error similar to the following:
Cannot complete cluster master operations because the cluster has a broken webhook application. For more information, see the troubleshooting docs: 'https://ibm.biz/master_webhook'
Your cluster has configurable Kubernetes webhook resources, called validating or mutating admission webhooks, that can intercept and modify requests from various services in the cluster to the API server in the cluster master.
Because webhooks can change or reject requests, broken webhooks can impact the functionality of the cluster in various ways, such as preventing you from updating the master version or completing other maintenance operations. For more information, see Dynamic Admission Control in the Kubernetes documentation.
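For illustration only, a validating webhook configuration looks roughly like the following. The names, namespace, path, and rules are placeholders rather than values from your cluster; note the failurePolicy field, which controls whether requests are blocked when the webhook can't be reached.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-webhook-config           # placeholder name
webhooks:
  - name: example.webhooks.example.com   # placeholder webhook name
    clientConfig:
      service:                           # in-cluster service that serves the webhook
        name: example-webhook-svc
        namespace: example-namespace
        path: /validate
        port: 443
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                   # Fail rejects requests when the webhook is unreachable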
Potential causes for broken webhooks include:
- The underlying resource that issues the request is missing or unhealthy, such as a Kubernetes service, endpoint, or pod.
- The webhook is part of an add-on or other plug-in application that did not install correctly or is unhealthy.
- Your cluster might have a networking connectivity issue that prevents the webhook from communicating with the Kubernetes API server in the cluster master.
Run the following commands to create a test pod and get an error that identifies the broken webhook. If the test passes, the failure might have been temporary, and you can retry the master operation.
- Run the following commands to create the test pod and label the ibm-system namespace.
kubectl run webhook-test --image us.icr.io/armada-master/pause:3.10 -n ibm-system
kubectl delete pod -n ibm-system webhook-test --ignore-not-found
kubectl label ns ibm-system ibm-cloud.kubernetes.io/webhook-test-at="$(date -u +%FT%H_%M_%SZ)" --overwrite
The error message might have the name of the broken webhook. In the following example output, the webhook is trust.hooks.securityenforcement.admission.cloud.ibm.com.
Error from server (InternalError): Internal error occurred: failed calling webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com": Post https://ibmcloud-image-enforcement.ibm-system.svc:443/mutating-pods?timeout=30s: dial tcp 172.21.xxx.xxx:443: connect: connection timed out
- Get the name of the broken webhook.
- If the error message has a broken webhook, replace trust.hooks.securityenforcement.admission.cloud.ibm.com in the following command with the broken webhook that you previously identified.
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o jsonpath='{.items[?(@.webhooks[*].name=="trust.hooks.securityenforcement.admission.cloud.ibm.com")].metadata.name}{"\n"}'
Example output
image-admission-config
- If the error does not have a broken webhook, list all the webhooks in your cluster and check their configurations in the following steps. For an optional check of each webhook's failure policy, see the sketch after this step.
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
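Optionally, you can also list each configuration's failurePolicy, because webhooks with a Fail policy block API requests when the webhook can't be reached. The following command is a minimal sketch; the column headings are illustrative, but the failurePolicy field is part of the admissionregistration.k8s.io/v1 API.
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o custom-columns='NAME:.metadata.name,WEBHOOKS:.webhooks[*].name,FAILURE_POLICY:.webhooks[*].failurePolicy'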
- Review the service and location details of the mutating or validating webhook configuration in the clientConfig section in the output of the following commands. Replace image-admission-config with the name that you previously identified. If the webhook exists outside the cluster, contact the cluster owner to check the webhook status.
kubectl get mutatingwebhookconfiguration image-admission-config -o yaml
kubectl get validatingwebhookconfiguration image-admission-config -o yaml
Example output
clientConfig:
  caBundle: <redacted>
  service:
    name: <name>
    namespace: <namespace>
    path: /inject
    port: 443
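To check whether the webhook's backing service responds at all, one option is to call it from a temporary pod in the cluster. The following is a minimal sketch that reuses the placeholders from the example output; the curlimages/curl image and the -k flag (skip TLS verification) are illustrative choices, not values from your cluster.
# Expect an HTTP status code in the output; a timeout suggests the same connectivity problem that the webhook hits.
kubectl run webhook-curl --rm -it --restart=Never -n <namespace> --image=curlimages/curl --command -- curl -k -s -o /dev/null -w '%{http_code}\n' https://<name>.<namespace>.svc:443/inject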
- Optional: Back up the webhooks, especially if you don't know how to reinstall the webhook or don't have the required permissions to create webhooks.
kubectl get mutatingwebhookconfiguration <name> -o yaml > mutatingwebhook-backup.yaml
kubectl get validatingwebhookconfiguration <name> -o yaml > validatingwebhook-backup.yaml
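If you later need to restore a webhook that you removed, you can reapply the backup file, assuming that you have permission to create webhook configurations.
kubectl apply -f mutatingwebhook-backup.yaml
kubectl apply -f validatingwebhook-backup.yaml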
- Check the status of the related service and pods for the webhook.
- Check the service Type, Selector, and Endpoints fields.
kubectl describe service -n <namespace> <service_name>
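To list the endpoints directly, you can also query the Endpoints resource for the service; the placeholders are the same as in the previous command.
kubectl get endpoints -n <namespace> <service_name>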
- If the service type is ClusterIP, check that the Konnectivity agent pods are in a Running status so that the webhook can connect securely to the Kubernetes API server in the cluster master. If a pod is not healthy, check the pod events, logs, worker node health, and other components to troubleshoot. Check the Konnectivity agent pods with the following command.
kubectl describe pods -n kube-system -l app=konnectivity-agent
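To review the agent pods' recent events and logs, you might run commands similar to the following; the label selector matches the previous command, and the --tail value is an arbitrary choice.
kubectl get events -n kube-system --field-selector involvedObject.kind=Pod | grep konnectivity-agent
kubectl logs -n kube-system -l app=konnectivity-agent --tail=50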
- If the service does not have an endpoint, check the health of the backing resources, such as a deployment or pod. If the resource is not healthy, check the pod events, logs, worker node health, and other components to troubleshoot. For more information, see Debugging app deployments.
kubectl get all -n my-service-namespace -l <key=value>
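For an individual pod that backs the service, you can review its events and logs with commands similar to the following; <pod_name> is a placeholder for a pod returned by the previous command.
kubectl describe pod -n my-service-namespace <pod_name>
kubectl logs -n my-service-namespace <pod_name>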
- If the service does not have any backing resources, or if troubleshooting the pods does not resolve the issue, remove the mutating or validating webhook configuration that you identified earlier.
kubectl delete validatingwebhookconfiguration <name>
kubectl delete mutatingwebhookconfiguration <name>
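To confirm that admission requests now succeed, you can rerun the namespace label test from the first step.
kubectl label ns ibm-system ibm-cloud.kubernetes.io/webhook-test-at="$(date -u +%FT%H_%M_%SZ)" --overwrite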
- Retry the cluster master operation, such as updating the cluster.
- If you still see the error, you might have worker node or network connectivity issues.
- See Worker node troubleshooting.
- Make sure that the webhook can connect to the Kubernetes API server in the cluster master. For example, if you use Calico network policies, security groups, or some other type of firewall, set up your classic or VPC cluster with the appropriate access.
- If the webhook is managed by an add-on that you installed, uninstall the add-on. Common add-ons that cause webhook issues include the following:
- Re-create the webhook or reinstall the add-on.
- If the issue persists, contact support and open a support case. In the case details, be sure to include any relevant log files, error messages, or command outputs.