Oracle Cloud Infrastructure Documentation

Monitoring Overview

The Oracle Cloud Infrastructure Monitoring service enables you to actively and passively monitor your cloud resources using the Metrics and Alarms features.

This image shows metrics and alarms as used in the Monitoring service.

Metrics Feature Overview

The Metrics feature relays metric data about the health, capacity, and performance of your cloud resources. A metric is a measurement related to health, capacity, or performance of a given resource. Resources, services, and applications emit metrics to the Monitoring service. Common metrics reflect data related to availability and latency, application uptime and downtime, completed transactions, failed and successful operations, and key performance indicators (KPIs), such as sales and engagement quantifiers.

By querying Monitoring for this data, you can understand how well the systems and processes are working to achieve the service levels you commit to your customers. For example, you can monitor the CPU utilization and disk reads of your Compute instances. You can then use this data to determine when to launch more instances to handle increased load, troubleshoot issues with your instance, or better understand system behavior.

Example Metric: Failure Rate

For application health, one of the common KPIs is failure rate, for which a common definition is the number of failed transactions divided by total transactions. This KPI is usually delivered through application monitoring and management software.

As a developer, you can capture this KPI from your applications using custom metrics. Simply record observations every time an application transaction takes place and then post that data to the Monitoring service. In this case, set up metrics to capture failed transactions, successful transactions, and transaction latency (time spent per completed transaction).
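
The following is a minimal sketch of that recording step, not an Oracle-provided implementation. The class name TransactionRecorder, the metric names (FailedTransactions, SuccessfulTransactions, TransactionLatencyMs), and the tuple layout are hypothetical; the sketch only prepares timestamp-value pairs and leaves the actual PostMetricData call to your SDK or HTTP client of choice.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TransactionRecorder:
    """Collects per-transaction observations (hypothetical helper)."""
    failed: int = 0
    succeeded: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, ok: bool, latency_ms: float) -> None:
        # One observation per application transaction.
        if ok:
            self.succeeded += 1
        else:
            self.failed += 1
        self.latencies_ms.append(latency_ms)

    def failure_rate(self) -> float:
        # Failed transactions divided by total transactions.
        total = self.failed + self.succeeded
        return self.failed / total if total else 0.0

    def to_data_points(self, now: datetime) -> dict:
        # One raw data point (timestamp, value) per hypothetical metric name.
        ts = now.strftime("%Y-%m-%dT%H:%M:%SZ")
        mean_latency = (
            sum(self.latencies_ms) / len(self.latencies_ms)
            if self.latencies_ms else 0.0
        )
        return {
            "FailedTransactions": (ts, float(self.failed)),
            "SuccessfulTransactions": (ts, float(self.succeeded)),
            "TransactionLatencyMs": (ts, mean_latency),
        }


recorder = TransactionRecorder()
for ok, latency in [(True, 100.0), (True, 200.0), (False, 300.0), (True, 400.0)]:
    recorder.record(ok, latency)

points = recorder.to_data_points(datetime(2018, 5, 10, 22, 19, 0, tzinfo=timezone.utc))
```

Posting the failed and successful counts as separate metrics, rather than a precomputed ratio, keeps the raw data available so that the failure rate can be derived later for any interval.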

Alarms Feature Overview

The Alarms feature of the Monitoring service works with the Notifications service to notify you when metrics meet alarm-specified triggers. When configured, repeat notifications remind you of a continued firing state at the configured repeat interval. You are also notified when an alarm transitions back to the OK state, or when an alarm is reset.

You can search for alarms using Search-supported attributes. For more information about Search, see Overview of Search.

Monitoring Concepts

The following concepts are essential to working with Monitoring.

aggregated data
The result of applying a statistic and interval to a selection of raw data points for a given metric. For example, you can apply the statistic max and interval 1h (one hour) to the last 24 hours of raw data points for the metric CpuUtilization. Aggregated data is displayed in default metric charts in the Console. You can also build metric queries for specific sets of aggregated data. For instructions, see Viewing Default Metric Charts and Building Metric Queries.
alarm
The alarm query to evaluate and the notification destination to use when the alarm is in the firing state, along with other alarm properties. For instructions on managing alarms, see Managing Alarms.
alarm query
The Monitoring Query Language (MQL) expression to evaluate for the alarm. An alarm query must specify a metric, statistic, interval, and a trigger rule (threshold or absence). The Alarms feature of the Monitoring service interprets results for each returned time series as a Boolean value, where zero represents false and a non-zero value represents true. A true value means that the trigger rule condition has been met. For more information, see Building Metric Queries and the query attribute description in the Alarm API reference.
data point
A timestamp-value pair for the specified metric. Example: 2018-05-10T22:19:00Z, 10.4
A data point is either raw or aggregated. Raw data points are posted by the metric namespace to the Monitoring service using the PostMetricData operation. The frequency of the data points posted varies by metric namespace. For example, your custom namespace might send data points for a given metric at a 20-second frequency.
Aggregated data points are the result of applying a statistic and interval to raw data points. The interval of the aggregated data points is determined by the SummarizeMetricsData request. For example, a request specifying the statistic sum and interval 1h (one hour) returns a sum value for each hour of available raw data points for the given metric.
dimension
A qualifier provided in a metric definition. Example: Resource identifier (resourceId), provided in the definitions of oci_computeagent metrics. Use dimensions to filter or group metric data. Example dimension name-value pair for filtering by availability domain: availabilityDomain = "VeBZ:PHX-AD-1"
frequency
The time period between each posted raw data point for a given metric. (Raw data points are posted by the metric namespace to the Monitoring service.) While frequency varies by metric, default service metrics typically have a frequency of 60 seconds (that is, one data point posted per minute). See also resolution.
interval
The time window used to convert the given set of raw data points.
The timestamp of the aggregated data point corresponds to the beginning of the time window during which raw data points are assessed. For example, for a five-minute interval, the timestamp "2:05" corresponds to the five-minute time window from 2:05 to 2:09:n. During the time window, the value of this aggregated data point dynamically updates. The final value for the aggregated data point is obtained when the last raw data point is assessed. There may be a short delay before the final value is posted.
This image shows how the timestamp of an aggregated data point corresponds to the interval.
The following example query specifies a 5-minute interval:
CpuUtilization[5m].max()
For supported values, see Monitoring Query Language (MQL) Reference.
See also resolution.
message
The content that the Alarms feature of the Monitoring service publishes to topics in the alarm’s configured notification destinations. A message is sent when the alarm transitions to another state, such as from "OK" to "FIRING." For more information about messages, see How Monitoring Works.
metadata
A reference provided in a metric definition. Example: unit (bytes), provided in the definition of the oci_computeagent metric DiskBytesRead. Use metadata to determine additional information about a given metric. For metric definitions, see Supported Services.
metric
A measurement related to health, capacity, or performance of a given resource. Example: The oci_computeagent metric CpuUtilization, which measures CPU usage of a Compute instance. For metric definitions, see Supported Services.
Note

Metric resources do not have OCIDs.
metric definition
A set of references, qualifiers, and other information provided by a metric namespace for a given metric. For example, the oci_computeagent metric DiskBytesRead is defined by dimensions (such as resource identifier) and metadata (specifying bytes for unit) as well as identification of its metric namespace (oci_computeagent). Each posted set of data points carries this information. Use the ListMetrics API operation to get metric definitions. For metric definitions, see Supported Services.
metric namespace
Indicator of the resource, service, or application that emits the metric. Provided in the metric definition. For example, the CpuUtilization metric definition emitted by the OracleCloudAgent software on Compute instances lists the metric namespace oci_computeagent as the source of the CpuUtilization metric. For metric definitions, see Supported Services.
metric stream
An individual set of aggregated data for a metric. A stream can be either specific to a single resource or aggregated across all resources in the compartment. Within a metric chart in the Console, each metric stream is represented as a line. By default, metric streams are resource-specific, so the chart displays a line for each resource. If you choose to aggregate all metric streams, then the chart displays one line for all resources.
notification destination
Protocol and other details for sending messages when the alarm transitions to another state, such as from "OK" to "FIRING." The details and setup may vary by destination service. For the Notifications service, each destination includes a topic and subscription protocol (such as PagerDuty). For more information about messages, topics, and subscriptions, see Notifications Overview.
oraclecloudagent software
Software that allows a Compute instance to post raw data points to the Monitoring service. Automatically installed with the latest versions of supported images. See Enabling Monitoring for Compute Instances.
query
The Monitoring Query Language (MQL) expression to evaluate for returning aggregated data. The query must specify a metric, statistic, and interval. For more information, see Building Metric Queries.
resolution
The period between time windows, or the regularity at which time windows shift. For example, use a resolution of 1m to retrieve aggregations every minute.
Note

For metric queries, the interval you select drives the default resolution of the request, which determines the maximum time range of data returned.

For more information about the resolution parameter as used in metric queries, see SummarizeMetricsData.

For alarm queries, the specified interval has no effect on the resolution of the request. The only valid value of the resolution for an alarm query request is 1m. For more information about the resolution parameter as used in alarm queries, see Alarm.

As shown in the following illustration, resolution controls the start time of each aggregation window relative to the previous window, while interval controls the length of the windows. Both requests apply the statistic max to the data within each five-minute window (from the interval), resulting in a single aggregated data point representing the highest CPU utilization value for that window. Only the resolution value differs. The resolution changes the regularity at which the aggregation windows shift, that is, the start times of successive aggregation windows. Request A does not specify a resolution and thus uses the default value equal to the interval (5 minutes). This request's five-minute aggregation windows are thus taken from the sets of data points emitted from minute 0 to minute 4:n, minute 5 to minute 9:n, and so forth. Request B specifies a 1-minute resolution, so its five-minute aggregation windows start every minute and are taken from the sets of data points emitted from minute 0 to 4:n, 1 to 5:n, and so forth.
This image shows how aggregation windows start according to the resolution.
statistic
The aggregation function applied to the given set of raw data points. For supported statistics, see Monitoring Query Language (MQL) Reference.
suppression
A configuration to avoid publishing messages during the specified time range. Useful for suspending alarm notifications during system maintenance. Each suppression applies to a single alarm. In the Console, you can apply one definition of a suppression to multiple alarms. The result is an individual suppression for each alarm. For instructions on suppressing alarms, see To suppress alarms.
trigger rule
The condition that must be met for the alarm to be in the firing state. A trigger rule can be based on a threshold or absence of a metric.
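
The relationship between statistic, interval, and resolution described in the concepts above can be sketched as follows. This is an illustration only, not the Monitoring service's implementation: the function name aggregate and the STATISTICS table are hypothetical, and the sketch starts the first window at the first raw data point, which is an assumption about window alignment.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical subset of statistics, for illustration only.
STATISTICS = {
    "max": max,
    "min": min,
    "sum": sum,
    "mean": lambda values: sum(values) / len(values),
}


def aggregate(points, statistic, interval, resolution=None):
    """Aggregate raw (timestamp, value) pairs into (window_start, value) pairs.

    interval sets the length of each window; resolution sets how far the
    window start shifts each step and defaults to the interval.
    """
    if not points:
        return []
    resolution = resolution or interval
    points = sorted(points)
    func = STATISTICS[statistic]
    start, last = points[0][0], points[-1][0]
    aggregated = []
    while start <= last:
        end = start + interval
        values = [value for ts, value in points if start <= ts < end]
        if values:
            # The aggregated timestamp is the beginning of the window.
            aggregated.append((start, func(values)))
        start += resolution
    return aggregated


# Ten raw data points at a 60-second frequency with values 0.0 through 9.0.
t0 = datetime(2018, 5, 10, 22, 0, tzinfo=timezone.utc)
raw = [(t0 + timedelta(minutes=i), float(i)) for i in range(10)]

# Request A: five-minute interval, default resolution (equal to the interval).
request_a = aggregate(raw, "max", timedelta(minutes=5))
# Request B: five-minute interval, one-minute resolution.
request_b = aggregate(raw, "max", timedelta(minutes=5), timedelta(minutes=1))
```

With this sample data, Request A yields two non-overlapping windows, while Request B yields a window starting at every minute, each still five minutes long.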

How Monitoring Works

The Monitoring service uses metrics to monitor resources and alarms to notify you when these metrics meet alarm-specified triggers.

Metrics are emitted to the Monitoring service by resources as raw data points, or timestamp-value pairs, along with dimensions and metadata. For example, the Compute service (metric namespace "oci_computeagent") posts this data for monitoring-enabled Compute instances. The posted data includes all oci_computeagent metrics, such as CpuUtilization. Metric data posted to the Monitoring service is only presented to you or consumed by the Oracle Cloud Infrastructure features that you enable to use metric data.

When you query a metric, the Monitoring service returns aggregated data according to the specified parameters. You can specify a range (such as the last 24 hours), statistic, and interval. The Console displays one monitoring chart per metric for selected resources. The aggregated data in each chart reflects your selected statistic and interval. API requests can optionally filter by dimension and specify a resolution. API responses include the metric name along with its source compartment and metric namespace. You can feed the aggregated data into a visualization or graphing library.
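
As a rough illustration of how these parameters combine into a query expression, the sketch below assembles an MQL string of the form CpuUtilization[5m].max(). The helper name build_query is hypothetical, and the placement of the optional dimension filter in braces is an assumption to verify against the Monitoring Query Language (MQL) Reference.

```python
def build_query(metric, interval, statistic, dimensions=None):
    """Assemble an MQL expression from a metric, interval, and statistic.

    dimensions is an optional mapping of dimension names to values.
    """
    dimension_filter = ""
    if dimensions:
        pairs = ", ".join(
            f'{name} = "{value}"' for name, value in dimensions.items()
        )
        dimension_filter = "{" + pairs + "}"
    return f"{metric}[{interval}]{dimension_filter}.{statistic}()"


basic = build_query("CpuUtilization", "5m", "max")
filtered = build_query(
    "CpuUtilization", "1m", "mean",
    dimensions={"availabilityDomain": "VeBZ:PHX-AD-1"},
)
```

The time range and resolution are not part of the expression itself; in this sketch they would be passed separately with the request.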

Metric and alarm data is accessible via the Console, CLI, and API. For retention periods, see Storage Limits.

The Alarms feature of the Monitoring service publishes alarm messages to configured destinations managed by the Notifications service. Each destination is a topic with a set of subscriptions. For more information about the Notifications service, see Notifications Overview.
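
The following sketch illustrates the idea that a message is produced only when an alarm changes state. It is a simplification under stated assumptions: the function names are hypothetical, a threshold trigger rule is assumed, the alarm is assumed to fire when any metric stream meets the condition, the message labels are illustrative, and repeat notifications and suppressions are left out.

```python
def evaluate_alarm(stream_values, threshold):
    """Return "FIRING" if any metric stream exceeds the threshold, else "OK"."""
    # Each stream's result is read as a Boolean: non-zero (true) means the
    # trigger rule condition has been met for that stream.
    results = [int(value > threshold) for value in stream_values]
    return "FIRING" if any(results) else "OK"


def transition_messages(evaluations, threshold, initial_state="OK"):
    """Yield one (evaluation_index, label) pair per state transition."""
    state = initial_state
    messages = []
    for index, stream_values in enumerate(evaluations):
        new_state = evaluate_alarm(stream_values, threshold)
        if new_state != state:
            # Illustrative label; the real message format is defined by the service.
            messages.append((index, f"{state}_TO_{new_state}"))
            state = new_state
    return messages


# Four successive evaluations of two metric streams (for example, two instances).
evaluations = [[10.0, 20.0], [95.0, 20.0], [96.0, 30.0], [40.0, 30.0]]
messages = transition_messages(evaluations, threshold=90.0)
```

In this sample, the second evaluation moves the alarm from OK to firing, the third produces no message because the state is unchanged, and the fourth returns the alarm to OK.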

Availability

Monitoring is currently available in the following regions:

Region Name      Region Location               Region Key
ap-tokyo-1       Asia-Pacific: Tokyo, Japan    NRT
ca-toronto-1     Canada: Toronto               YYZ
eu-frankfurt-1   Europe: Frankfurt, Germany    FRA
uk-london-1      United Kingdom: London        LHR
us-ashburn-1     United States: Ashburn, VA    IAD
us-phoenix-1     United States: Phoenix, AZ    PHX

Supported Services

The following services have resources or components that can emit metrics to Monitoring:

Resource Identifiers

Most types of Oracle Cloud Infrastructure resources have a unique, Oracle-assigned identifier called an Oracle Cloud ID (OCID). For information about the OCID format and other ways to identify your resources, see Resource Identifiers.

Note

Metric resources do not have OCIDs.

Ways to Access Monitoring

You can access the Monitoring service using the Console (a browser-based interface) or the REST API. Instructions for the Console and API are included in topics throughout this guide. For a list of available SDKs, see Software Development Kits and Command Line Interface.

Console: To access Monitoring using the Console, you must use a supported browser. You can use the Console link at the top of this page to go to the sign-in page. You will be prompted to enter your cloud tenant, your user name, and your password. Open the navigation menu. Under Solutions, Platform and Edge, go to Monitoring.

API: To access Monitoring through APIs, use Monitoring API for metrics and alarms and Notifications API for notifications (used with alarms).

Authentication and Authorization

Each service in Oracle Cloud Infrastructure integrates with IAM for authentication and authorization, for all interfaces (the Console, SDK or CLI, and REST API).

An administrator in your organization needs to set up groups, compartments, and policies that control which users can access which services, which resources, and the type of access. For example, the policies control who can create new users, create and manage the cloud network, launch instances, create buckets, download objects, and so on. For more information, see Getting Started with Policies. For specific details about writing policies for each of the different services, see Policy Reference.

If you’re a regular user (not an administrator) who needs to use the Oracle Cloud Infrastructure resources that your company owns, contact your administrator to set up a user ID for you. The administrator can confirm which compartment or compartments you should be using.

Administrators: For common policies that give groups access to metrics, see Let users view metric definitions in a compartment and Restrict user access to a specific metric namespace. For a common alarms policy, see Let users view alarms. To authorize resources, such as instances, to make API calls, add the resources to a dynamic group. Use the dynamic group's matching rules to add the resources, and then create a policy that allows that dynamic group access to metrics. See Let instances make API calls to access monitoring metrics in the tenancy.

Limits on Monitoring

See Monitoring Limits for a list of applicable limits and instructions for requesting a limit increase.

Other limits include the following.

Storage Limits

Item                    Time range stored
Metric definitions      14 days
Alarm history entries   90 days

Returned Data Limits (Metrics)

When you query metrics and view metric charts, the returned data is subject to certain limits. Limits information for returned data includes the 100,000 data point maximum and time range maximums (determined by resolution, which relates to interval). See MetricData Reference.
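
To get a rough sense of whether a query approaches the data point maximum, you can estimate the number of aggregated data points from the time range and resolution. The sketch below is a back-of-the-envelope estimate only: the function names are hypothetical, and it assumes one aggregated data point per resolution step per metric stream and that the 100,000 maximum counts data points across all returned streams. Check the MetricData Reference for the authoritative rules.

```python
import math
from datetime import timedelta

MAX_DATA_POINTS = 100_000  # data point maximum stated above


def estimated_data_points(time_range, resolution, stream_count=1):
    """Estimate aggregated data points: one per resolution step per stream."""
    windows = math.ceil(time_range / resolution)
    return windows * stream_count


def within_limit(time_range, resolution, stream_count=1):
    """Return True if the estimate stays at or under the maximum."""
    return estimated_data_points(time_range, resolution, stream_count) <= MAX_DATA_POINTS


one_day = estimated_data_points(timedelta(hours=24), timedelta(minutes=1))
one_week_ok = within_limit(timedelta(days=7), timedelta(minutes=1), stream_count=9)
one_week_too_many = within_limit(timedelta(days=7), timedelta(minutes=1), stream_count=10)
```

Under these assumptions, widening the resolution or narrowing the time range or the set of streams brings the estimate back under the maximum.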

Troubleshooting Limits

If you see an error that the query has exceeded the maximum number of metric streams, then update the query to evaluate a number of metric streams that is within the limit. For example, you can reduce the metric streams by specifying dimensions. You can continue to evaluate all metric streams that were in the original query by spreading the metric streams across multiple queries (or alarms).
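
Spreading metric streams across multiple queries can be sketched as a simple batching step. The function name batch_streams and the sample resource identifiers are hypothetical, and the per-query maximum is passed in as a parameter because the applicable value depends on your limits. How each batch is then expressed as a dimension filter is left to the query you build.

```python
def batch_streams(resource_ids, max_streams_per_query):
    """Split resource identifiers into batches, one batch per query or alarm."""
    if max_streams_per_query < 1:
        raise ValueError("max_streams_per_query must be at least 1")
    return [
        resource_ids[start:start + max_streams_per_query]
        for start in range(0, len(resource_ids), max_streams_per_query)
    ]


# Placeholder identifiers, not real OCIDs.
resource_ids = ["instance-a", "instance-b", "instance-c", "instance-d", "instance-e"]
batches = batch_streams(resource_ids, max_streams_per_query=2)
```

Every resource from the original query appears in exactly one batch, so the combined results of the smaller queries cover the same metric streams.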