Infrastructure Health Metrics
This topic describes the metrics emitted by the metric namespace oci_compute_infrastructure_health.
Resources: Compute instances.
Overview of Metrics: oci_compute_infrastructure_health
The Compute infrastructure health metrics help you monitor the status and health of Compute instances.
- Instance health (up/down) status: The instance_status metric lets you check whether a VM or bare metal instance is available (up) or unavailable (down) when in the running state.
- Instance maintenance status: The maintenance_status metric lets you monitor whether a VM instance is scheduled for reboot maintenance.
- Bare metal infrastructure health status: The health_status metric helps you monitor the health of the infrastructure for bare metal instances, including hardware components such as the CPU and memory.
Based on the value of the metrics, you can proactively move affected instances to healthy hardware and thereby minimize the impact on your applications.
Required IAM Policy
To monitor resources, you must be given the required type of access in a policy written by an administrator, whether you're using the Console or the REST API with an SDK, CLI, or other tool. The policy must give you access to the monitoring services as well as the resources being monitored. If you try to perform an action and get a message that you don’t have permission or are unauthorized, confirm with your administrator the type of access you've been granted and which compartment you should work in. For more information on user authorizations for monitoring, see the Authentication and Authorization section for the related service: Monitoring or Notifications.
Available Metrics: oci_compute_infrastructure_health
The metrics listed in the following table are automatically available for your instances. The instance_status metric is available for both VM and bare metal instances, the maintenance_status metric is available only for VM instances, and the health_status metric is available only for bare metal instances. You do not need to enable monitoring on the instance to get these metrics.
You also can use the Monitoring service to create custom queries.
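As an illustration of what a custom query can look like, the following minimal sketch assembles Monitoring Query Language (MQL) expressions of the general form metric[interval]{dimension = "value"}.statistic(). The helper name build_query, the dimension name resourceId, and the OCID value are assumptions made for this example and are not taken from this topic. Check the dimension names for your metrics before relying on them.

```python
def build_query(metric, interval="1m", statistic="max", dimensions=None):
    """Build an MQL expression string (hypothetical helper for illustration)."""
    selector = ""
    if dimensions:
        # Dimension filters are written as name = "value" pairs inside braces.
        pairs = ", ".join(f'{name} = "{value}"' for name, value in dimensions.items())
        selector = "{" + pairs + "}"
    return f"{metric}[{interval}]{selector}.{statistic}()"


# Placeholder OCID; substitute the OCID of your own instance.
INSTANCE_OCID = "ocid1.instance.oc1..exampleuniqueID"

# Assumption: the instance OCID is exposed under a dimension named resourceId.
down_query = build_query("instance_status", dimensions={"resourceId": INSTANCE_OCID})

# No dimension filter: covers every instance in the selected compartment.
maintenance_query = build_query("maintenance_status")

print(down_query)
print(maintenance_query)
```

Using max() over the interval means that a single datapoint of 1 within the interval is enough to surface a down or scheduled-for-maintenance state.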
The metrics include the following dimensions:
- The type of hardware issue:
CPU: A fault has been detected in one or more CPUs.
MEM-BOOT: A fault in the memory subsystem was detected during instance launch or a recent reboot.
MEM-RUNTIME: A fault in the memory subsystem was detected.
MGMT-CONTROLLER: A fault in the instance management controller has been detected.
PCI: A fault in the PCI subsystem has been detected.
For troubleshooting suggestions and more information about these hardware issues, see Compute Health Monitoring for Bare Metal Instances.
- The friendly name of the instance.
- The OCID of the instance.
- The scheduled start time of the 24-hour maintenance window, in the format defined by RFC 3339.
- The action that Oracle Cloud Infrastructure will perform on an instance during a scheduled maintenance event:
REBOOT: The instance is stopped on the physical VM host that needs maintenance, and then restarted on a healthy VM host.
- The action that you can take before the scheduled maintenance event, so that you can control how and when your applications experience downtime:
REBOOT: You can proactively reboot the instance before the scheduled maintenance time. When you reboot an instance for maintenance, the instance is stopped on the physical VM host that needs maintenance, and then restarted on a healthy VM host. For more information, see Moving a Compute Instance to a New Host.
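Because the maintenance window start time is an RFC 3339 timestamp and the window lasts 24 hours, you can work out when the window ends and how much time remains to reboot proactively. The following is a minimal sketch using only the Python standard library. The function names and the sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def maintenance_window(start_rfc3339):
    """Return (start, end) of the 24-hour maintenance window as aware datetimes."""
    text = start_rfc3339.strip()
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so rewrite it as an explicit UTC offset first.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    start = datetime.fromisoformat(text)
    return start, start + timedelta(hours=24)


def hours_until(start, now=None):
    """Hours remaining before the window opens (negative once it has started)."""
    now = now or datetime.now(timezone.utc)
    return (start - now).total_seconds() / 3600


# Sample value for illustration only.
start, end = maintenance_window("2024-05-01T10:00:00Z")
print(start.isoformat(), end.isoformat())
```

This sketch assumes the timestamp carries a UTC designator or a numeric offset, as RFC 3339 requires. Timestamps with unusual fractional-second precision might need a more tolerant parser.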
|Metric||Metric Display Name||Unit||Description||Dimensions|
|health_status||Infrastructure Health Status||Issues||The number of health issues for a bare metal instance. Any non-zero value indicates a health defect.||
|instance_status||Instance Status||Count||The status of a running VM or bare metal instance. A value of 0 indicates that the instance is available (up). A value of 1 indicates that the instance is not available (down) due to an infrastructure issue. If the instance is stopped, then the metric does not have a value.||
|maintenance_status||Maintenance Status||Count||The maintenance status of a VM instance. A value of 0 indicates that the instance is not scheduled for a maintenance reboot. A value of 1 indicates that the instance is scheduled for a maintenance reboot.||
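The value semantics described in the table can be captured in a small lookup routine, for example when you post-process datapoints in a script. The following is a minimal sketch. The function name interpret and the returned strings are hypothetical, and treating any non-zero instance_status or maintenance_status value as the "1" case is a simplifying assumption.

```python
def interpret(metric, value):
    """Translate one metric datapoint into a short status string."""
    if value is None:
        # instance_status has no value while the instance is stopped.
        return "no data"
    if metric == "instance_status":
        return "up" if value == 0 else "down"
    if metric == "maintenance_status":
        return "reboot scheduled" if value != 0 else "not scheduled"
    if metric == "health_status":
        # Any non-zero value indicates a health defect on a bare metal instance.
        return "healthy" if value == 0 else f"{int(value)} health issue(s)"
    raise ValueError(f"unknown metric: {metric}")


for name, sample in [("instance_status", 1), ("maintenance_status", 0), ("health_status", 2)]:
    print(name, "->", interpret(name, sample))
```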
Using the Console
To view default metric charts for a single instance:
- Open the navigation menu. Under Core Infrastructure, go to Compute and click Instances.
- Click the instance that you're interested in.
- Under Resources, click Metrics.
- In the Metric Namespace list, select oci_compute_infrastructure_health.
The Metrics page displays a default set of charts for the current instance.
To view default metric charts for multiple instances:
- Open the navigation menu. Under Solutions and Platform, go to Monitoring and click Service Metrics.
- Select a compartment.
- For Metric Namespace, select oci_compute_infrastructure_health.
The Service Metrics page dynamically updates to show charts for each metric that is emitted by the selected metric namespace.
Using the API
For information about using the API and signing requests, see REST APIs and Security Credentials. For information about SDKs, see Software Development Kits and Command Line Interface.
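As a rough sketch of what a metric data request might contain, the following example builds a JSON body for the Monitoring service's SummarizeMetricsData operation using only the Python standard library. It does not sign or send the request. The field names (namespace, query, startTime, endTime, resolution) reflect the request details as understood at the time of writing and should be verified against the API reference. The function name and the sample query are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

NAMESPACE = "oci_compute_infrastructure_health"


def summarize_request(query, hours=1, resolution="1m", now=None):
    """Build a request body covering the last `hours` hours (not sent anywhere)."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339 timestamp in UTC
    return {
        "namespace": NAMESPACE,
        "query": query,
        "startTime": start.astimezone(timezone.utc).strftime(fmt),
        "endTime": end.astimezone(timezone.utc).strftime(fmt),
        "resolution": resolution,
    }


body = summarize_request("instance_status[1m].max()", hours=2)
print(json.dumps(body, indent=2))
```

In practice, an SDK or the CLI handles request signing and transport for you, so a body like this is mainly useful for understanding what those tools send.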