Debugging network connections between pods
Review the options and strategies for debugging connection issues between pods.
Check the health of your cluster components and networking pods
Follow these steps to check the health of your components. Networking issues might occur if your cluster components are not up to date or are not in a healthy state.
- Check that your cluster master and worker nodes run a supported version and are in a healthy state. If the cluster master or workers do not run a supported version, make any necessary updates so that they run a supported version. If the status of any component is not `Normal` or `Ready`, review the cluster master health states, cluster states, worker node states, or the steps to troubleshoot `Critical` or `NotReady` worker nodes for more information. Make sure any related issues are resolved before continuing.

  To check the cluster master version and health:

  ```
  ibmcloud oc cluster get -c <cluster-id>
  ```

  To check worker node versions and health:

  ```
  ibmcloud oc workers -c <cluster-id>
  ```
- For each worker node, verify that the Calico and cluster DNS pods are present and running in a healthy state.
- Run the following command to get the details of your cluster's pods.

  ```
  oc get pods -A -o wide | grep -e calico -e dns-default
  ```
- In the output, make sure that your cluster includes the following pods. Make sure that each pod's status is `Running`, and that the pods do not have too many restarts. To spot-check the expected counts, see the sketch after these steps.
  - Exactly one `calico-node` pod per worker node.
  - At least one `calico-typha` pod per cluster. Larger clusters might have more than one.
  - Exactly one `calico-kube-controllers` pod per cluster.
  - One `dns-default` pod per node. However, some nodes might not have the `dns-default` pod if they have labels attached. This is normal and does not cause network issues.

  Example output

  ```
  NAMESPACE       NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE         NOMINATED NODE   READINESS GATES
  calico-system   calico-kube-controllers-1a1a1a1   1/1     Running   0          37m   172.17.61.195   10.245.0.5   <none>           <none>
  calico-system   calico-node-1a1a1a1               1/1     Running   0          37m   10.245.0.5      10.245.0.5   <none>           <none>
  calico-system   calico-node-1a1a1a1               1/1     Running   0          37m   10.245.0.4      10.245.0.4   <none>           <none>
  calico-system   calico-typha-1a1a1a1              1/1     Running   0          37m   10.245.0.5      10.245.0.5   <none>           <none>
  openshift-dns   dns-default-1a1a1a1               2/2     Running   0          33m   172.17.36.144   10.245.0.4   <none>           <none>
  openshift-dns   dns-default-1a1a1a1               2/2     Running   0          33m   172.17.61.210   10.245.0.5   <none>           <none>
  ```
- If any of the listed pods are not present or are in an unhealthy state, go through the cluster and worker node troubleshooting documentation included in the previous steps. Resolve any issues with the pods in this step before moving on.
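To spot-check the expected pod counts, you can compare the number of worker nodes against the number of Calico and DNS pods. The following is a minimal shell sketch; it assumes the `calico-system` and `openshift-dns` namespaces shown in the example output, so verify the namespace names in your own cluster first.

```
# Count worker nodes; the calico-node count should match this number
oc get nodes --no-headers | wc -l

# Count calico-node pods (expect exactly one per worker node)
oc get pods -n calico-system --no-headers | grep -c 'calico-node'

# Count dns-default pods (expect one per node; nodes with certain labels might have none)
oc get pods -n openshift-dns --no-headers | grep -c 'dns-default'
```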
Debug with test pods
To determine the cause of networking issues on your pods, you can create a test pod on each of your worker nodes. Then, you can run tests and observe networking activity within the pod, which might reveal the source of the problem.
Setting up the pods
- Create a new privileged namespace for your test pods. Creating a new namespace prevents any custom policies or configurations in existing namespaces from affecting your test pods. In this example, the new namespace is called `pod-network-test`. Create the namespace.

  ```
  oc create ns pod-network-test
  ```
- Add labels to the new privileged namespace.

  ```
  oc label namespace pod-network-test --overwrite=true \
    pod-security.kubernetes.io/enforce=privileged \
    pod-security.kubernetes.io/enforce-version=latest \
    pod-security.kubernetes.io/audit=privileged \
    pod-security.kubernetes.io/audit-version=latest \
    pod-security.kubernetes.io/warn=privileged \
    pod-security.kubernetes.io/warn-version=latest \
    security.openshift.io/scc.podSecurityLabelSync="false"
  ```
- Run the following command to allow the namespace to run pods with a privileged security context.

  ```
  oc adm policy add-scc-to-group privileged system:serviceaccounts:pod-network-test
  ```
- Create a file that contains the following daemonset, which creates a test pod on each node.

  ```
  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    labels:
      name: webserver-test
      app: webserver-test
    name: webserver-test
  spec:
    selector:
      matchLabels:
        name: webserver-test
    template:
      metadata:
        labels:
          name: webserver-test
          app: webserver-test
      spec:
        tolerations:
        - operator: "Exists"
        containers:
        - name: webserver
          securityContext:
            privileged: true
          image: us.icr.io/armada-master/network-alpine:latest
          env:
          - name: ENABLE_ECHO_SERVER
            value: "true"
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
        restartPolicy: Always
        terminationGracePeriodSeconds: 1
  ```
- Apply the daemonset to deploy test pods on your worker nodes.

  ```
  oc apply --namespace pod-network-test -f <daemonset-file>
  ```
- Verify that the pods start up successfully by listing all pods in the namespace. You can also wait on the daemonset rollout instead; see the sketch after these steps.

  ```
  oc get pods --namespace pod-network-test -o wide
  ```
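Because the test pods are managed by a daemonset, you can alternatively wait for the daemonset rollout to finish instead of polling the pod list. A minimal sketch, assuming the `webserver-test` daemonset from the previous steps:

```
# Block until every node runs a ready copy of the test pod
oc rollout status daemonset/webserver-test --namespace pod-network-test
```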
Running tests within the pods
Run `curl`, `ping`, and `nc` commands to test each pod's network connection, and the `dig` command to test the cluster DNS. Review each output, then see Identifying issues to find out what the outcomes might mean.
- List your test pods and note the name and IP address of each pod.

  ```
  oc get pods --namespace pod-network-test -o wide
  ```

  Example output

  ```
  NAME                   READY   STATUS    RESTARTS   AGE   IP              NODE         NOMINATED NODE   READINESS GATES
  webserver-test-1a1a1   1/1     Running   0          68s   172.17.36.169   10.245.0.4   <none>           <none>
  webserver-test-1a1a1   1/1     Running   0          68s   172.17.61.240   10.245.0.5   <none>           <none>
  ```
- Run the `exec` command to log in to one pod.

  ```
  oc exec -it --namespace pod-network-test <pod_name> -- sh
  ```
- Run the `curl` command on the pod and note the output. Specify the IP address of the pod that you did not log in to. This tests the network connection between pods on different nodes.

  ```
  curl <pod_ip>:8080
  ```

  Example successful output

  ```
  Hostname: webserver-test-t546j

  Pod Information:
    node name:  env var NODE_NAME not set
    pod name:  webserver-test-t546j
    pod namespace:  env var POD_NAMESPACE not set
    pod IP:  env var POD_IP not set

  Connection Information:
    remote address:  172.17.36.169
    remote port:  56042
    local address:  172.17.61.240
    local port:  8080
  ```
- Run the `ping` command on the pod and note the output. Specify the IP address of the pod that you did not log in to with the `exec` command. This tests the network connection between pods on different nodes.

  ```
  ping -c 5 <pod_ip>
  ```

  Example successful output

  ```
  PING 172.30.248.201 (172.30.248.201) 56(84) bytes of data.
  64 bytes from 172.30.248.201: icmp_seq=1 ttl=62 time=0.473 ms
  64 bytes from 172.30.248.201: icmp_seq=2 ttl=62 time=0.449 ms
  64 bytes from 172.30.248.201: icmp_seq=3 ttl=62 time=0.381 ms
  64 bytes from 172.30.248.201: icmp_seq=4 ttl=62 time=0.438 ms
  64 bytes from 172.30.248.201: icmp_seq=5 ttl=62 time=0.348 ms

  --- 172.30.248.201 ping statistics ---
  5 packets transmitted, 5 received, 0% packet loss, time 4086ms
  rtt min/avg/max/mdev = 0.348/0.417/0.473/0.046 ms
  ```
- Run the `nc` command on the pod and note the output. Specify the IP address of the pod that you did not log in to with the `exec` command. This tests the network connection between pods on different nodes.

  ```
  nc -vzw 5 <pod_ip> 8080
  ```

  Example successful output

  ```
  nc -vzw 5 172.17.61.240 8080
  172.17.61.240 (172.17.61.240:8080) open
  ```
- Run the `dig` commands to test the cluster DNS.

  ```
  dig +short kubernetes.default.svc.cluster.local
  ```

  Example output

  ```
  172.21.0.1
  ```

  ```
  dig +short ibm.com
  ```

  Example output

  ```
  23.50.74.64
  ```
- Run `curl` commands to test a full TCP or HTTPS connection to a service. This example tests the connection between the pod and the cluster master by retrieving the cluster's version information. Successfully retrieving the cluster version indicates a healthy connection.

  ```
  curl -k https://kubernetes.default.svc.cluster.local/version
  ```

  Example output

  ```
  {
    "major": "1",
    "minor": "25",
    "gitVersion": "v1.25.14+bcb9a60",
    "gitCommit": "3bdfba0be09da2bfdef3b63e421e6a023bbb08e6",
    "gitTreeState": "clean",
    "buildDate": "2023-10-30T21:33:07Z",
    "goVersion": "go1.19.13 X:strictfipsruntime",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
  ```
- Log out of the pod.

  ```
  exit
  ```
- Repeat the earlier steps with each of the remaining pods. To run the same tests across every test pod automatically, see the sketch after these steps.
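If you have many worker nodes, repeating these tests by hand can be tedious. The following is a minimal shell sketch that runs the `curl` test from each test pod against every other test pod; it assumes the `pod-network-test` namespace and the port 8080 echo server that the `webserver-test` daemonset above enables.

```
# Sketch: collect the name and IP of each test pod as "name,ip" pairs
pods=$(oc get pods --namespace pod-network-test \
  -o jsonpath='{range .items[*]}{.metadata.name},{.status.podIP}{"\n"}{end}')

# From each pod, curl every other pod's echo server on port 8080
for src in $pods; do
  src_name=${src%,*}
  for dst in $pods; do
    [ "$src" = "$dst" ] && continue
    dst_ip=${dst#*,}
    echo "--- $src_name -> $dst_ip ---"
    oc exec --namespace pod-network-test "$src_name" -- \
      curl -s --max-time 5 "$dst_ip:8080" || echo "FAILED"
  done
done
```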
Identifying issues
Review the outputs from the tests in the previous section to help find the cause of your pod networking issues. The following list describes some common causes that those outputs can reveal.
- If the commands functioned normally on the test pods but you still have networking issues in the application pods in your default namespace, there might be issues related specifically to your application.
  - You might have Calico or Kubernetes network policies in place that restrict your networking traffic. If a network policy is applied to a pod, all traffic that is not specifically allowed by that policy is dropped. For more information about network policies, see the Kubernetes documentation. To list the policies in your cluster, see the sketch after this list.
  - If you use Istio or Red Hat OpenShift Service Mesh, there might be service configuration issues that drop or block traffic between pods. For more information, see the troubleshooting documentation for Istio and Red Hat OpenShift Service Mesh.
  - The issue might be related to bugs in the application rather than your cluster, and might require your own independent troubleshooting.
- If the `curl`, `ping`, or `nc` commands failed for certain pods, identify which worker nodes those pods are on. If the issue exists on only some of your worker nodes, replace those worker nodes or see additional information on worker node troubleshooting.
- If the DNS lookups from the `dig` commands failed, see the Red Hat DNS troubleshooting information.
If you still cannot resolve your pod networking issue, open a support case. Include a detailed description of the problem, how you tried to solve it, what kinds of tests you ran, and relevant logs for your pods and worker nodes. For more information about opening a support case and what information to include, see the general debugging guide.