Monitoring cluster health

For cluster metrics and app monitoring, Red Hat® OpenShift® on IBM Cloud® clusters include built-in tools to help you manage the health of a single cluster instance. For multi-cluster analysis or other use cases, you can also set up IBM Cloud tools, such as the IBM Log Analysis and IBM Cloud Monitoring cluster add-ons for IBM Cloud Kubernetes Service.

Understanding options for monitoring

To help understand when to use the built-in Red Hat OpenShift tools or IBM Cloud integrations, review the following information.

Monitoring limitation in private-only clusters with RHCOS worker nodes: The monitoring agent relies on kernel headers in the operating system, but RHCOS doesn't include kernel headers. In this scenario, the agent reaches back to sysdig.com to use the precompiled agent, and in clusters with no public network access this process fails. To allow monitoring on RHCOS clusters, you must either allow outbound traffic or see the Sysdig documentation for installing the agent in air-gapped environments.

IBM Cloud Monitoring

Review the following details about IBM Cloud Monitoring.

  • Customizable user interface for a unified look at your cluster metrics, container security, resource usage, alerts, and custom events.
  • Quick integration with the cluster via a script.
  • Aggregated metrics and container monitoring across clusters and cloud providers.
  • Historical access to metrics based on the timeline and plan, and the ability to capture and download trace files.
  • Highly available, scalable, and compliant with industry security standards.
  • Integrated with IBM Cloud IAM for user access management.
  • Free trial to try out the capabilities.
  • To get started, see Forwarding cluster and app metrics to IBM Cloud Monitoring.

For more information, see Monitoring.

Built-in Red Hat OpenShift monitoring tools

OpenShift includes a preconfigured, preinstalled, and self-updating monitoring stack that provides monitoring for core platform components on a per-cluster basis. This monitoring includes built-in Prometheus and Grafana deployments in the openshift-monitoring project for cluster metrics, which is available in a single zone only. You can view and manage your monitoring dashboards, metrics, and alerts from the Red Hat OpenShift web console. For more information, see Monitoring in the Red Hat OpenShift documentation.
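To see which components make up the stack, you can list the pods in the openshift-monitoring project. This check is only illustrative; the exact pod names and counts vary by cluster version.

    oc get pods -n openshift-monitoring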

By default, the monitoring stack does not use persistent storage to back up metric history, and instead uses a temporary emptyDir volume in the host filesystem. The retention period for metrics history ranges from 11 to 15 days, depending on your cluster version. For some workloads, these settings might use a significant amount of disk space and memory, or might not meet requirements for metrics retention. You can configure the monitoring stack to use persistent storage, change the metrics retention policies, or run Prometheus on dedicated nodes. For more information, see Configuring the monitoring stack.
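As a minimal sketch of such a configuration, the monitoring stack reads a ConfigMap that is named cluster-monitoring-config in the openshift-monitoring project. The retention period, storage class, and volume size in this example are placeholder values; review the Red Hat OpenShift documentation for the options that apply to your cluster version.

    oc apply -f - <<'EOF'
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        prometheusK8s:
          # Keep 7 days of metrics instead of the default (placeholder value).
          retention: 7d
          # Back Prometheus with persistent storage. The storage class is a
          # placeholder; use one that exists in your cluster.
          volumeClaimTemplate:
            spec:
              storageClassName: ibmc-vpc-block-10iops-tier
              resources:
                requests:
                  storage: 100Gi
    EOF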

Note that Red Hat OpenShift on IBM Cloud version 4.16 sets a default retention size of 10 GB.

Monitoring Red Hat® OpenShift® on IBM Cloud® storage metrics

Red Hat® OpenShift® on IBM Cloud® clusters include built-in tools to help cluster administrators get information about the availability and capacity of storage volumes.

If you are unable to view storage metrics in the Red Hat OpenShift monitoring dashboard, see Debugging Block Storage for VPC metrics.

The following metrics can be monitored for Red Hat® OpenShift® on IBM Cloud® clusters.

  • kubelet_volume_stats_available_bytes
  • kubelet_volume_stats_capacity_bytes
  • kubelet_volume_stats_inodes
  • kubelet_volume_stats_inodes_free
  • kubelet_volume_stats_inodes_used
  • kubelet_volume_stats_used_bytes

Want to send storage monitoring alerts to external systems such as email or Slack? See Sending notifications to external systems in the Red Hat OpenShift documentation.
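As a hedged sketch, an alert that fires when a volume is more than 90% full might look like the following PrometheusRule. The rule name, namespace, threshold, and labels are illustrative assumptions; how and where you deploy custom alerting rules depends on your monitoring configuration.

    oc apply -f - <<'EOF'
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: pvc-usage-alert  # illustrative name
      namespace: openshift-monitoring
    spec:
      groups:
      - name: storage.rules
        rules:
        - alert: PersistentVolumeFillingUp
          # Ratio of used bytes to capacity, per persistent volume claim.
          expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is more than 90% full"
    EOF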

Before you can monitor metrics for Block Storage for VPC, you must have a cluster with the Block Storage for VPC cluster add-on enabled and a Block Storage for VPC volume attached to a worker node. Storage metrics are populated only for mounted storage volumes.

  1. Navigate to the Red Hat OpenShift web console and select Monitoring and then Metrics.

  2. Enter the metric that you want to monitor in the query field, and then select Run queries. For example, the following query returns the fraction of a volume's capacity that is in use.

    kubelet_volume_stats_used_bytes{persistentvolumeclaim="NAME OF PVC"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="NAME OF PVC"}
    

    Example output

    endpoint       instance      job     metrics_path  namespace  node         persistentvolumeclaim  prometheus               service  value 
    https-metrics  11.111.1.1:XX kubelet /metrics      default    11.111.1.1   PVC-NAME               openshift-monitoring/k8s kubelet  0.003596851526321722
    

For more information, see Monitoring.

If your volume is reaching capacity, try setting up volume expansion.
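For example, if the storage class of the volume allows expansion (allowVolumeExpansion: true), you can request a larger size directly on the persistent volume claim. The claim name and size in this sketch are placeholders.

    oc patch pvc <pvc_name> -p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}'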

Forwarding cluster and app metrics to IBM Cloud Monitoring

Use the Red Hat OpenShift on IBM Cloud observability plug-in to create a monitoring configuration for IBM Cloud Monitoring in your cluster, and use this monitoring configuration to automatically collect and forward metrics to IBM Cloud Monitoring.

With IBM Cloud Monitoring, you can collect cluster and pod metrics, such as the CPU and memory usage of your worker nodes, incoming and outgoing HTTP traffic for your pods, and data about several infrastructure components. In addition, the agent can collect custom application metrics by using either a Prometheus-compatible scraper or a statsd facade.
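As an illustrative sketch, a common pattern for the Prometheus-compatible scraper is to expose a /metrics endpoint from your app and annotate the pods so that the scraper can find it. The label selector, port, and path are placeholders, and whether the agent honors these annotations depends on how its Prometheus scraping is configured; check the IBM Cloud Monitoring documentation for your agent version.

    oc annotate pods -l app=myapp \
      prometheus.io/scrape=true \
      prometheus.io/port=8080 \
      prometheus.io/path=/metrics

In practice, set these annotations in the pod template of your deployment so that they persist across pod restarts.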

Considerations for using the Red Hat OpenShift on IBM Cloud observability plug-in:

  • You can have only one monitoring configuration for IBM Cloud Monitoring in your cluster at a time. To send metrics to a different IBM Cloud Monitoring service instance, use the ibmcloud ob monitoring config replace command.
  • You can't currently use the Red Hat OpenShift on IBM Cloud console or the observability plug-in CLI to enable monitoring for Red Hat OpenShift clusters in Satellite. You must manually deploy monitoring agents to your cluster to forward metrics to Monitoring.
  • If you created a Monitoring configuration in your cluster without using the Red Hat OpenShift on IBM Cloud observability plug-in, you can use the ibmcloud ob monitoring agent discover command to make the configuration visible to the plug-in. Then, you can use the observability plug-in commands and functionality in the IBM Cloud console to manage the configuration. See the command examples after this list.
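For reference, the replace and discover commands follow the same pattern as the config create command that is shown later in this topic. The flags here mirror that pattern; run ibmcloud ob monitoring --help to confirm the syntax for your plug-in version.

    ibmcloud ob monitoring config replace --cluster <cluster_name_or_ID> --instance <Monitoring_instance_name_or_ID>
    ibmcloud ob monitoring agent discover --cluster <cluster_name_or_ID>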

Before you begin:

  • Verify that you are assigned the Editor platform access role and the Manager service access role for IBM Cloud Monitoring.
  • Verify that you are assigned the Administrator platform access role and the Manager service access role for all Kubernetes namespaces in IBM Cloud Kubernetes Service to create the monitoring configuration. To view a monitoring configuration or launch the Monitoring dashboard after the monitoring configuration is created, users must be assigned the Administrator platform access role and the Manager service access role for the ibm-observe Kubernetes namespace in IBM Cloud Kubernetes Service.
  • If you want to use the CLI to set up the monitoring configuration, install the IBM Cloud CLI and the observability CLI plug-in (ibmcloud plugin install observe-service).

To set up a monitoring configuration for your cluster:

  1. Create an IBM Cloud Monitoring service instance and note the name of the instance. The service instance must belong to the same IBM Cloud account where you created your cluster, but can be in a different resource group and IBM Cloud region than your cluster.

  2. Set up a monitoring configuration for your cluster. When you create the monitoring configuration, a Red Hat OpenShift project that is named ibm-observe is created and a Monitoring agent is deployed as a Kubernetes daemon set to all worker nodes in your cluster. This agent collects cluster and pod metrics, such as the worker node CPU and memory usage, or the amount of incoming and outgoing network traffic to your pods.

    In the console:

    1. From the Red Hat OpenShift clusters console, select the cluster for which you want to create a Monitoring configuration.
    2. On the cluster Overview page, click Connect.
    3. Select the region and the IBM Cloud Monitoring service instance that you created earlier, and click Connect.

    In the CLI:

    1. Create the Monitoring configuration. When you create the Monitoring configuration, the access key that was last added is retrieved automatically. If you want to use a different access key, add the --sysdig-access-key <access_key> option to the command.

      To use a different service access key after you created the monitoring configuration, use the ibmcloud ob monitoring config replace command.

      Version 4.15 and later: If your cluster has outbound traffic protection enabled, you must set up monitoring by using the private endpoint. To do this, specify the --private-endpoint option.

      ibmcloud ob monitoring config create --cluster <cluster_name_or_ID> --instance <Monitoring_instance_name_or_ID> [--private-endpoint]
      

      Example output

      Creating configuration...
      OK
      
    2. Verify that the monitoring configuration was added to your cluster.

      ibmcloud ob monitoring config list --cluster <cluster_name_or_ID>
      

      Example output

      Listing configurations...
      
      OK
      Instance Name                Instance ID                            CRN   
      IBM Cloud Monitoring-aaa     1a111a1a-1111-11a1-a1aa-aaa11111a11a   crn:v1:prod:public:sysdig:us-south:a/a11111a1aaaaa11a111aa11a1aa1111a:1a111a1a-1111-11a1-a1aa-aaa11111a11a::  
      
  3. Optional: Verify that the Monitoring agent was set up successfully.

    1. If you used the console to create the Monitoring configuration, log in to your cluster.

    2. Verify that the daemon set for the Monitoring agent was created and all instances are listed as AVAILABLE.

      oc get daemonsets -n ibm-observe
      

      Example output

      NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
      sysdig-agent   9         9         9       9            9           <none>          14m
      

      The number of daemon set instances that are deployed equals the number of worker nodes in your cluster.
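      To compare this count against your worker nodes, you can list the nodes in the cluster.

        oc get nodes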

    3. Review the ConfigMap that was created for your Monitoring agent.

      oc describe configmap -n ibm-observe
      
  4. Access the metrics for your pods and cluster from the Monitoring dashboard.

    1. From the Red Hat OpenShift clusters console, select the cluster that you configured.
    2. On the cluster Overview page, click Launch. The Monitoring dashboard opens.
    3. Review the pod and cluster metrics that the Monitoring agent collected from your cluster. It might take a few minutes for your first metrics to show.
  5. Review how you can work with the Monitoring dashboard to further analyze your metrics.

Viewing cluster states

Review the state of a Red Hat OpenShift cluster to get information about the availability and capacity of the cluster, and potential problems that might occur.

To view information about a specific cluster, such as its zones, service endpoint URLs, Ingress subdomain, version, and owner, use the ibmcloud oc cluster get --cluster <cluster_name_or_ID> command. Include the --show-resources option to view more cluster resources such as add-ons for storage pods or subnet VLANs for public and private IPs.
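For example, to view the details and additional resources of a cluster that is named mycluster:

    ibmcloud oc cluster get --cluster mycluster --show-resources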

You can review information about the overall cluster, the IBM-managed master, and your worker nodes. To troubleshoot your cluster and worker nodes, see Troubleshooting clusters.

For more information about cluster and worker states, see the following sections.

Master states

Your Red Hat OpenShift on IBM Cloud cluster includes an IBM-managed master with highly available replicas, automatic security patch updates applied for you, and automation in place to recover in case of an incident. You can check the health, status, and state of the cluster master by running ibmcloud oc cluster get --cluster <cluster_name_or_ID>.

The Master Health reflects the state of master components and notifies you if something needs your attention. The health might be one of the following states.

  • error: The master is not operational. IBM is automatically notified and takes action to resolve this issue. You can continue monitoring the health until the master is normal. You can also open an IBM Cloud support case.
  • normal: The master is operational and healthy. No action is required.
  • unavailable: The master might not be accessible, which means some actions such as resizing a worker pool are temporarily unavailable. IBM is automatically notified and takes action to resolve this issue. You can continue monitoring the health until the master is normal.
  • unsupported: The master runs an unsupported version of Kubernetes. You must update your cluster to return the master to normal health.

The Master Status provides details about the operation that is in progress for the master state, and includes a timestamp of how long the master has been in the same state, such as Ready (1 month ago). The Master State reflects the lifecycle of operations that can be performed on the master, such as deploying, updating, and deleting. Each state is described in the following list.

  • deployed: The master is successfully deployed. Check the status to verify that the master is Ready or to see whether an update is available.
  • deploying: The master is currently deploying. Wait for the state to become deployed before you work with your cluster, such as adding worker nodes.
  • deploy_failed: The master failed to deploy. IBM Support is notified and works to resolve the issue. Check the Master Status field for more information, or wait for the state to become deployed.
  • deleting: The master is currently deleting because you deleted the cluster. You can't undo a deletion. After the cluster is deleted, you can no longer check the master state because the cluster is completely removed.
  • delete_failed: The master failed to delete. IBM Support is notified and works to resolve the issue. You can't resolve the issue by trying to delete the cluster again. Instead, check the Master Status field for more information, or wait for the cluster to delete. You can also open an IBM Cloud support case.
  • scaled_down: The master resources are scaled down to zero replicas. This temporary state occurs while etcd is being restored from a backup. You can't interact with your cluster while it is in this state. Wait for the etcd restoration to complete and the master state to return to deployed.
  • updating: The master is updating its Kubernetes version. The update might be a patch update that is automatically applied, or a minor or major version that you applied by updating the cluster. During the update, your highly available master can continue processing requests, and your app workloads and worker nodes continue to run. After the master update is complete, you can update your worker nodes. If the update is unsuccessful, the master returns to a deployed state and continues running the previous version. IBM Support is notified and works to resolve the issue. You can check whether the update failed in the Master Status field.
  • update_cancelled: The master update is canceled because the cluster was not in a healthy state at the time of the update. Your master remains in this state until your cluster is healthy and you manually update the master. To update the master, use the ibmcloud oc cluster master update command (see the example after this list). If you don't want to update the master to the default major.minor version, include the --version option and specify the latest patch version that is available for the major.minor version that you want, such as 1.31. To list available versions, run ibmcloud oc versions.
  • update_failed: The master update failed. IBM Support is notified and works to resolve the issue. You can continue to monitor the health of the master until the master reaches a normal state. If the master remains in this state for more than 1 day, open an IBM Cloud support case. IBM Support might identify other issues in your cluster that you must fix before the master can be updated.
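For example, to resolve an update_cancelled state after your cluster is healthy again, list the available versions and then update the master. The cluster name and version are placeholders.

    ibmcloud oc versions
    ibmcloud oc cluster master update --cluster mycluster --version <version>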

Enabling remote health reporting

Telemetry is a remote health monitoring feature that collects aggregated data about your cluster, such as the health of your components and the number and types of resources in use. If you have a public cluster, you can elect to have your own Telemetry data visible in your account for your use. For more information, see Telemetry for remote health monitoring.