Oracle Cloud Infrastructure Documentation

Infrastructure Health Metrics

You can monitor the health, capacity, and performance of your Compute bare metal instances by using metrics, alarms, and notifications.

This topic describes the metrics emitted by the metric namespace oci_compute_infrastructure_health.

Resources: Bare metal Compute instances.

Overview of Metrics: oci_compute_infrastructure_health

The infrastructure health metrics help you monitor the health of the infrastructure for your bare metal instances, including hardware components such as the CPU, motherboard, DIMM, and NVMe drives. You can use the metrics to identify hardware issues, and proactively take action to minimize the impact on your applications.

Required IAM Policy

To monitor resources, you must be given the required type of access in a policy written by an administrator, whether you're using the Console or the REST API with an SDK, CLI, or other tool. The policy must give you access to the monitoring services as well as the resources being monitored. If you try to perform an action and get a message that you don't have permission or are unauthorized, confirm with your administrator the type of access you've been granted and which compartment you should work in. For more information on user authorizations for monitoring, see the Authentication and Authorization section for the related service: Monitoring or Notifications.

Available Metrics: oci_compute_infrastructure_health

The metric listed in the following table is automatically available for each bare metal instance that you create. You do not need to enable monitoring on the instance to get this metric.

You can also use the Monitoring service to create custom queries.

The metric includes the following dimensions:

  • The type of hardware issue:
      • CPU: A fault has been detected in one or more CPUs.
      • MEM-BOOT: A fault in the memory subsystem was detected during instance launch or a recent reboot.
      • MEM-RUNTIME: A fault in the memory subsystem was detected.
      • MGMT-CONTROLLER: A fault in the instance management controller has been detected.
      • PCI: A fault in the PCI subsystem has been detected.

  • The friendly name of the instance.

  • The OCID of the instance.

Metric: health_status
Metric Display Name: Infrastructure Health Status
Unit: Issues
Description: Number of issues. Any non-zero value indicates a health defect.
Dimensions: The type of hardware issue, the friendly name of the instance, and the OCID of the instance.
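
As an illustration of the "any non-zero value indicates a health defect" rule, the following minimal Python sketch builds a Monitoring Query Language (MQL) expression that could serve as a custom query or alarm condition, and applies the same rule to a list of returned values. The helper names build_health_query and has_health_defect are hypothetical, and the dimension key resourceId is an assumption (this page does not name the dimension keys), so verify it against the dimensions shown for the metric in your tenancy.

```python
from typing import Iterable, Optional

# Namespace and metric name as given in this topic.
NAMESPACE = "oci_compute_infrastructure_health"
METRIC = "health_status"


def build_health_query(instance_ocid: Optional[str] = None,
                       interval: str = "1m") -> str:
    """Return an MQL expression that is true when health_status is non-zero.

    Assumption: "resourceId" is the dimension key holding the instance OCID.
    """
    selector = f'{{resourceId = "{instance_ocid}"}}' if instance_ocid else ""
    # max() over each interval, so a single non-zero data point is enough.
    return f"{METRIC}[{interval}]{selector}.max() > 0"


def has_health_defect(values: Iterable[float]) -> bool:
    """Apply the rule from the metric description: non-zero means a defect."""
    return any(v != 0 for v in values)


if __name__ == "__main__":
    print(build_health_query())
    print(build_health_query("ocid1.instance.oc1..example", "5m"))
```

Leaving out the OCID yields a query across every bare metal instance visible in the compartment being queried; passing one narrows the query to a single instance.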




Using the Console

To view infrastructure health metrics for a single Compute instance
To view infrastructure health metrics for all Compute instances in a compartment

Using the API

For information about using the API and signing requests, see REST APIs and Security Credentials. For information about SDKs, see Software Development Kits and Command Line Interface.
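
For orientation, here is a hedged sketch of retrieving the metric with the OCI SDK for Python. It assumes the oci package is installed and that a valid configuration file exists at the SDK's default location. The function names, the one-hour window, and the query string are illustrative choices, not values prescribed by this topic.

```python
from datetime import datetime, timedelta, timezone


def build_summarize_request(hours=1, now=None):
    """Build the fields for a SummarizeMetricsData request as a plain dict."""
    end = now or datetime.now(timezone.utc)
    return {
        "namespace": "oci_compute_infrastructure_health",
        # Illustrative query: highest health_status value per minute.
        "query": "health_status[1m].max()",
        "start_time": end - timedelta(hours=hours),
        "end_time": end,
    }


def fetch_health(compartment_id, hours=1):
    """Query the Monitoring service; returns (dimensions, timestamp, value) tuples."""
    import oci  # third-party dependency, imported lazily: pip install oci

    config = oci.config.from_file()  # assumes a default OCI config profile
    client = oci.monitoring.MonitoringClient(config)
    details = oci.monitoring.models.SummarizeMetricsDataDetails(
        **build_summarize_request(hours=hours)
    )
    response = client.summarize_metrics_data(
        compartment_id=compartment_id,
        summarize_metrics_data_details=details,
    )
    return [
        (series.dimensions, point.timestamp, point.value)
        for series in response.data
        for point in series.aggregated_datapoints
    ]
```

Calling fetch_health with a compartment OCID returns one tuple per aggregated data point; any tuple with a non-zero value corresponds to a reported health defect.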

Use the following APIs for monitoring: