Monitoring cluster health

For cluster metrics and app monitoring, Red Hat® OpenShift® on IBM Cloud® clusters include built-in tools to help you manage the health of a single cluster instance. You can also set up IBM Cloud tools for multi-cluster analysis or other use cases, such as the IBM Cloud Logs and IBM Cloud Monitoring cluster add-ons.

Understanding options for monitoring

To help understand when to use the built-in Red Hat OpenShift tools or IBM Cloud integrations, review the following information.

Monitoring limitation in private-only clusters with RHCOS worker nodes: The monitoring agent relies on kernel headers in the operating system, but RHCOS does not include kernel headers. In this scenario, the agent reaches out to sysdig.com to download a precompiled agent. In clusters without public network access, this download fails. To enable monitoring on RHCOS clusters, either allow outbound traffic or see the Sysdig documentation for installing the agent in air-gapped environments.

IBM Cloud Monitoring

Review the following details about IBM Cloud Monitoring.

Customizable user interface for a unified look at your cluster metrics, container security, resource usage, alerts, and custom events.
Quick integration with the cluster via a script.
Aggregated metrics and container monitoring across clusters and cloud providers.
Historical access to metrics, with retention based on the timeline and service plan, and the ability to capture and download trace files.
Highly available, scalable, and compliant with industry security standards.
Integrated with IBM Cloud IAM for user access management.

Built-in Red Hat OpenShift monitoring tools

OpenShift includes a preconfigured, preinstalled, and self-updating monitoring stack that monitors core platform components on a per-cluster basis. The stack includes built-in Prometheus and Grafana deployments in the openshift-monitoring project for cluster metrics, and these deployments are available in a single zone only. You can view and manage your monitoring dashboards, metrics, and alerts from the Red Hat OpenShift web console. For more information, see Monitoring in the Red Hat OpenShift documentation.

By default, the monitoring stack does not use persistent storage to retain metrics history. Instead, it uses a temporary emptyDir volume on the host file system. The retention period for metrics history ranges from 11 to 15 days, depending on your cluster version. For some workloads, these defaults might consume a significant amount of disk space and memory, or might not meet your metrics retention requirements. You can configure the monitoring stack to use persistent storage, change the metrics retention policies, or run Prometheus on dedicated nodes. For more information, see Configuring the monitoring stack.
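As a minimal sketch, assuming the cluster-monitoring-config ConfigMap mechanism that the Red Hat OpenShift documentation describes for configuring the platform monitoring stack, the following Python builds a ConfigMap that sets time-based and size-based retention for Prometheus and backs it with a persistent volume claim. The storage class name ibmc-vpc-block-10iops-tier, the 40Gi claim size, and the retention values are example assumptions; choose values that fit your cluster. The script prints JSON, which oc apply -f accepts.

```python
import json


def build_cluster_monitoring_config(retention: str, retention_size: str,
                                    storage_class: str, storage_size: str) -> dict:
    """Return a cluster-monitoring-config ConfigMap as a Python dict.

    All argument values are illustrative assumptions, not recommendations.
    """
    prometheus_config = {
        "prometheusK8s": {
            "retention": retention,           # time-based retention, e.g. "15d"
            "retentionSize": retention_size,  # size-based retention, e.g. "10GB"
            # Persistent storage instead of the default emptyDir volume.
            "volumeClaimTemplate": {
                "spec": {
                    "storageClassName": storage_class,  # assumed storage class
                    "resources": {"requests": {"storage": storage_size}},
                }
            },
        }
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "cluster-monitoring-config",
            "namespace": "openshift-monitoring",
        },
        # The config.yaml key holds a YAML document; JSON is valid YAML.
        "data": {"config.yaml": json.dumps(prometheus_config, indent=2)},
    }


if __name__ == "__main__":
    cm = build_cluster_monitoring_config(
        "15d", "10GB", "ibmc-vpc-block-10iops-tier", "40Gi")
    print(json.dumps(cm, indent=2))
```

For example, redirect the output to a file such as cluster-monitoring-config.json (a hypothetical name) and run oc apply -f on it. Applying it overwrites any existing config.yaml value in that ConfigMap, so merge in your current settings first.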

Red Hat OpenShift on IBM Cloud version 4.16 sets a default retention size of 10 GB.

Monitoring Red Hat® OpenShift® on IBM Cloud® storage metrics

Red Hat® OpenShift® on IBM Cloud® clusters include built-in tools to help cluster administrators get information about the availability and capacity of storage volumes.

If you are unable to view storage metrics in the Red Hat OpenShift monitoring dashboard, see Debugging Block Storage for VPC metrics.

The following metrics can be monitored for Red Hat® OpenShift® on IBM Cloud® clusters.

kubelet_volume_stats_available_bytes
kubelet_volume_stats_capacity_bytes
kubelet_volume_stats_used_bytes
kubelet_volume_stats_inodes
kubelet_volume_stats_inodes_free
kubelet_volume_stats_inodes_used
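The byte and inode metrics in this list combine into usage ratios. The following Python is a minimal sketch of that arithmetic, assuming you already have one sample value per metric for a single PVC; the VolumeStats class and function names are hypothetical. Note that 1 - available / capacity counts any filesystem-reserved blocks as used, so it can read slightly higher than used_bytes / capacity_bytes.

```python
from dataclasses import dataclass


@dataclass
class VolumeStats:
    """Hypothetical container for one PVC's kubelet_volume_stats_* samples."""
    available_bytes: float  # kubelet_volume_stats_available_bytes
    capacity_bytes: float   # kubelet_volume_stats_capacity_bytes
    inodes: float           # kubelet_volume_stats_inodes
    inodes_used: float      # kubelet_volume_stats_inodes_used


def capacity_used_fraction(s: VolumeStats) -> float:
    """Fraction of capacity not available to the workload: 1 - available / capacity."""
    if s.capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 1.0 - s.available_bytes / s.capacity_bytes


def inodes_used_fraction(s: VolumeStats) -> float:
    """Fraction of inodes in use: inodes_used / inodes."""
    if s.inodes <= 0:
        raise ValueError("inodes must be positive")
    return s.inodes_used / s.inodes


if __name__ == "__main__":
    # Example values: a 10 GiB volume with 8 GiB available, 65536 of 655360 inodes used.
    stats = VolumeStats(8 * 2**30, 10 * 2**30, 655360, 65536)
    print(f"capacity used: {capacity_used_fraction(stats):.1%}")
    print(f"inodes used:   {inodes_used_fraction(stats):.1%}")
```

A volume can run out of inodes before it runs out of bytes, for example when a workload writes many small files, so watching both ratios is a reasonable habit.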

Want to set up storage monitoring alerts for platforms such as email or Slack? See Sending notifications to external systems in the Red Hat OpenShift documentation.
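In Red Hat OpenShift, notifications are routed by Alertmanager, whose platform configuration is stored in the alertmanager.yaml key of the alertmanager-main secret in the openshift-monitoring project. As a hedged sketch, the following Python builds an Alertmanager configuration that routes a single storage alert to a Slack channel. It assumes that your cluster's default alerting rules include an alert named KubePersistentVolumeFillingUp and that you have your own Slack incoming-webhook URL; the receiver names default and storage-slack are hypothetical.

```python
import json


def build_alertmanager_config(slack_webhook_url: str, channel: str) -> dict:
    """Return an Alertmanager config that sends one storage alert to Slack."""
    return {
        "route": {
            "receiver": "default",  # hypothetical catch-all receiver
            "routes": [
                {
                    # Assumed alert name; check the alerting rules in your cluster.
                    "matchers": ["alertname=KubePersistentVolumeFillingUp"],
                    "receiver": "storage-slack",
                }
            ],
        },
        "receivers": [
            {"name": "default"},
            {
                "name": "storage-slack",
                "slack_configs": [
                    {
                        "api_url": slack_webhook_url,  # your Slack webhook URL
                        "channel": channel,
                        "send_resolved": True,  # also notify when the alert clears
                    }
                ],
            },
        ],
    }


if __name__ == "__main__":
    # JSON is valid YAML, so this output can serve as alertmanager.yaml content.
    print(json.dumps(build_alertmanager_config(
        "https://hooks.slack.com/services/EXAMPLE", "#storage-alerts"), indent=2))
```

This sketch omits the default routes, such as the Watchdog route, that your existing configuration contains, so merge it into the current alertmanager.yaml rather than replacing the file.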

Before you can monitor metrics for Block Storage for VPC, you must have a cluster with the Block Storage for VPC cluster add-on enabled, and you must have a Block Storage for VPC volume attached to a worker node. Storage metrics are populated only for mounted storage volumes.

Navigate to the Red Hat OpenShift web console and select Monitoring and then Metrics.

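Enter a query for the metric you want to monitor in the dialog box and select Run queries. For example, the following query returns the fraction of a PVC's capacity that is in use. Replace NAME OF PVC with the name of your PVC.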

kubelet_volume_stats_used_bytes{persistentvolumeclaim="NAME OF PVC"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="NAME OF PVC"}

Example output

endpoint       instance       job      metrics_path  namespace  node        persistentvolumeclaim  prometheus                service  value
https-metrics  11.111.1.1:XX  kubelet  /metrics      default    11.111.1.1  PVC-NAME               openshift-monitoring/k8s  kubelet  0.003596851526321722
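You can also run the same query outside the web console through the Prometheus HTTP API endpoint /api/v1/query. The following Python is a minimal sketch, assuming that base_url points at a Prometheus-compatible query endpoint that you can reach, such as the thanos-querier route in the openshift-monitoring project, and that token is a bearer token that is allowed to read metrics (for example, the output of oc whoami -t). The function names are hypothetical.

```python
import json
import ssl
import urllib.parse
import urllib.request


def pvc_usage_query(pvc_name: str) -> str:
    """Build the used/capacity PromQL ratio for one PVC."""
    selector = f'{{persistentvolumeclaim="{pvc_name}"}}'
    return (f"kubelet_volume_stats_used_bytes{selector}"
            f" / kubelet_volume_stats_capacity_bytes{selector}")


def parse_vector(body: str) -> dict:
    """Map (namespace, pvc) to the float value of each sample in an instant-vector response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return {
        (sample["metric"].get("namespace", ""),
         sample["metric"].get("persistentvolumeclaim", "")): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def query_prometheus(base_url: str, token: str, promql: str) -> dict:
    """Send an instant query and return the parsed samples."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, context=ssl.create_default_context()) as resp:
        return parse_vector(resp.read().decode("utf-8"))
```

For example, query_prometheus("https://<thanos-querier-route-host>", token, pvc_usage_query("PVC-NAME")) returns a dictionary keyed by namespace and PVC name. If the route uses a certificate that your system does not trust, the default SSL context rejects the connection.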

For more information, see Monitoring.

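The query returns a value between 0 and 1; in the example output, 0.0036 means that about 0.36% of the volume's capacity is in use. If the value approaches 1, your volume is reaching capacity, so try setting up volume expansion.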
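Kubernetes volume expansion works by raising spec.resources.requests.storage on the PVC, and only when the PVC's storage class sets allowVolumeExpansion: true. The following Python is a minimal sketch, assuming sizes in whole GiB and a hypothetical target usage ratio of 0.7; it suggests a new size from the usage ratio that the query returns and produces a JSON merge patch for that size.

```python
import json
import math


def suggest_new_size_gib(capacity_gib: int, used_fraction: float,
                         target_fraction: float = 0.7) -> int:
    """Smallest whole-GiB size that brings usage down to target_fraction (hypothetical policy)."""
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be between 0 and 1")
    needed = math.ceil(capacity_gib * used_fraction / target_fraction)
    return max(needed, capacity_gib + 1)  # expansion must grow the volume


def pvc_resize_patch(current_gib: int, new_gib: int) -> str:
    """Return a JSON merge patch that requests a larger PVC size."""
    if new_gib <= current_gib:
        raise ValueError("volume expansion can only grow a PVC")
    return json.dumps({"spec": {"resources": {"requests": {"storage": f"{new_gib}Gi"}}}})


if __name__ == "__main__":
    new_size = suggest_new_size_gib(10, 0.9)  # a 10 GiB volume that is 90% full
    print(pvc_resize_patch(10, new_size))
```

You might pass the printed patch to oc patch pvc <pvc-name> -n <namespace> --type merge -p. The 0.7 target is an illustrative choice, not a recommendation.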

Enabling remote health reporting

Telemetry is a remote health monitoring feature that collects aggregated data about your cluster, such as the health of your components and the number and types of resources in use. If you have a public cluster, you can choose to make your cluster's Telemetry data visible in your account for your own use. For more information, see Telemetry for remote health monitoring.