Infrastructure Health Metrics

You can monitor the health, capacity, and performance of the infrastructure for your Compute virtual machine (VM) and bare metal instances by using metrics, alarms, and notifications.

This topic describes the metrics emitted by the metric namespace oci_compute_infrastructure_health.

Resources: Compute instances.

Overview of Metrics: oci_compute_infrastructure_health

The Compute infrastructure health metrics help you monitor the status and health of Compute instances.

  • Instance health (up/down) status: The instance_status metric lets you check whether a VM or bare metal instance is available (up) or unavailable (down) when in the running state.
  • Instance maintenance status: The maintenance_status metric lets you monitor whether a VM instance is scheduled for reboot maintenance.
  • Bare metal infrastructure health status: The health_status metric helps you monitor the health of the infrastructure for bare metal instances, including hardware components such as the CPU and memory.

Based on the value of the metrics, you can proactively move affected instances to healthy hardware and thereby minimize the impact on your applications.

Required IAM Policy

To monitor resources, you must be given the required type of access in a policy written by an administrator, whether you're using the Console or the REST API with an SDK, CLI, or other tool. The policy must give you access to the monitoring services as well as the resources being monitored. If you try to perform an action and get a message that you don’t have permission or are unauthorized, confirm with your administrator the type of access you've been granted and which compartment you should work in. For more information on user authorizations for monitoring, see the Authentication and Authorization section for the related service: Monitoring or Notifications.

Available Metrics: oci_compute_infrastructure_health

The metrics listed in the following table are automatically available for your instances. The instance_status metric is available for both VM and bare metal instances, the maintenance_status metric is available only for VM instances, and the health_status metric is available only for bare metal instances. You do not need to enable monitoring on the instance to get these metrics.

You can also use the Monitoring service to create custom queries.
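As an illustration of what a custom query might look like, the following minimal sketch assembles a Monitoring Query Language (MQL) expression as a string. It assumes the general MQL form `metric[interval]{dimension = "value"}.statistic()`. The helper name `build_mql_query` and the OCID value are hypothetical and are not part of any SDK.

```python
def build_mql_query(metric, interval="1m", dimensions=None, statistic="max"):
    """Return an MQL expression for one metric (illustrative helper, not an SDK call)."""
    selector = ""
    if dimensions:
        # Sort the dimension names so that the generated query is deterministic.
        pairs = ", ".join(
            f'{name} = "{value}"' for name, value in sorted(dimensions.items())
        )
        selector = "{" + pairs + "}"
    return f"{metric}[{interval}]{selector}.{statistic}()"


# Placeholder OCID (an assumption); substitute the OCID of your own instance.
INSTANCE_OCID = "ocid1.instance.oc1..exampleuniqueID"

# Query the up/down status of a single instance at one-minute intervals.
query = build_mql_query("instance_status", dimensions={"resourceId": INSTANCE_OCID})
print(query)
```

The resulting string can be pasted into the query editor of the Monitoring service or passed to a tool that accepts MQL. Appending a comparison such as `> 0` is one way such an expression could serve as an alarm condition.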

Each metric includes one or more of the following dimensions:

faultClass
The type of hardware issue:
  • CPU: A fault has been detected in one or more CPUs.
  • MEM-BOOT: A fault in the memory subsystem was detected during instance launch or a recent reboot.
  • MEM-RUNTIME: A fault in the memory subsystem was detected.
  • MGMT-CONTROLLER: A fault in the instance management controller has been detected.
  • PCI: A fault in the PCI subsystem has been detected.

For troubleshooting suggestions and more information about these hardware issues, see Compute Health Monitoring for Bare Metal Instances.

resourceDisplayName
The friendly name of the instance.
resourceId
The OCID of the instance.
maintenanceDueTime
The scheduled start time of the 24-hour maintenance window, in the format defined by RFC3339.

computeMaintenanceAction
The action that Oracle Cloud Infrastructure will perform on an instance during a scheduled maintenance event:
  • REBOOT: The instance is stopped on the physical VM host that needs maintenance, and then restarted on a healthy VM host.
recommendedAction
The action that you can take before the scheduled maintenance event, so that you can control how and when your applications experience downtime.
  • REBOOT: You can proactively reboot the instance before the scheduled maintenance time. When you reboot an instance for maintenance, the instance is stopped on the physical VM host that needs maintenance, and then restarted on a healthy VM host. For more information, see Moving a Compute Instance to a New Host.
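Because maintenanceDueTime is an RFC3339 timestamp that marks the start of a 24-hour window, a script can work out when the window opens and closes and how much time remains for a proactive reboot. The following is a minimal sketch using only the Python standard library. The function names and the sample timestamp are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def parse_rfc3339(value):
    """Parse an RFC3339 timestamp into a timezone-aware datetime."""
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so normalize it to an explicit UTC offset first.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def maintenance_window(due_time):
    """Return the (start, end) of the 24-hour maintenance window."""
    start = parse_rfc3339(due_time)
    return start, start + timedelta(hours=24)


def time_until_maintenance(due_time, now=None):
    """Return the time remaining before the maintenance window starts."""
    if now is None:
        now = datetime.now(timezone.utc)
    return parse_rfc3339(due_time) - now


# Sample value (an assumption); real values come from the maintenanceDueTime dimension.
start, end = maintenance_window("2024-05-01T12:00:00Z")
print(start.isoformat(), end.isoformat())
```

A negative result from `time_until_maintenance` means the window has already started.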

| Metric | Metric Display Name | Unit | Description | Dimensions |
| --- | --- | --- | --- | --- |
| health_status | Infrastructure Health Status | Issues | The number of health issues for a bare metal instance. Any non-zero value indicates a health defect. | faultClass, resourceDisplayName, resourceId |
| instance_status | Instance Status | Count | The status of a running VM or bare metal instance. A value of 0 indicates that the instance is available (up). A value of 1 indicates that the instance is not available (down) due to an infrastructure issue. If the instance is stopped, then the metric does not have a value. | resourceDisplayName, resourceId |
| maintenance_status | Maintenance Status | Count | The maintenance status of a VM instance. A value of 0 indicates that the instance is not scheduled for a maintenance reboot. A value of 1 indicates that the instance is scheduled for a maintenance reboot. | maintenanceDueTime, computeMaintenanceAction, recommendedAction, resourceDisplayName, resourceId |
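The value semantics described for the three metrics can be sketched as a small lookup function. This is an illustrative helper only: the function name and the returned wording are assumptions, and `None` stands in for a missing datapoint, such as when an instance is stopped.

```python
def describe_datapoint(metric, value):
    """Translate a raw metric value into a short human-readable status."""
    if value is None:
        # For instance_status, a stopped instance emits no value.
        return "no data"
    if metric == "instance_status":
        if value == 0:
            return "up"
        if value == 1:
            return "down (infrastructure issue)"
        return f"unexpected value: {value}"
    if metric == "maintenance_status":
        if value == 0:
            return "no maintenance reboot scheduled"
        if value == 1:
            return "maintenance reboot scheduled"
        return f"unexpected value: {value}"
    if metric == "health_status":
        # Any non-zero value indicates a health defect.
        if value == 0:
            return "healthy"
        return f"{int(value)} health issue(s) detected"
    raise ValueError(f"unknown metric: {metric}")


for metric, value in [("instance_status", 1), ("maintenance_status", 0), ("health_status", 2)]:
    print(metric, "->", describe_datapoint(metric, value))
```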

Using the Console

To view infrastructure health metrics for a single Compute instance
  1. Open the navigation menu. Under Core Infrastructure, go to Compute and click Instances.
  2. Click the instance that you're interested in.
  3. Under Resources, click Metrics.
  4. In the Metric Namespace list, select oci_compute_infrastructure_health.

    The Metrics page displays a default set of charts for the current instance.

For more information about monitoring metrics and using alarms, see Monitoring Overview. For information about notifications for alarms, see Notifications Overview.

To view infrastructure health metrics for all Compute instances in a compartment
  1. Open the navigation menu. Under Solutions and Platform, go to Monitoring and click Service Metrics.
  2. Select a compartment.
  3. For Metric Namespace, select oci_compute_infrastructure_health.

    The Service Metrics page dynamically updates to show charts for each metric that is emitted by the selected metric namespace.

For more information about monitoring metrics and using alarms, see Monitoring Overview. For information about notifications for alarms, see Notifications Overview.