Troubleshooting Exadata Cloud@Customer Systems

These topics cover some common issues you might run into and how to address them.

Patching Failures on Exadata Cloud@Customer Systems

You can patch Oracle Database and Oracle Grid Infrastructure using the dbaascli utility and update the Cloud Tooling on Exadata Cloud@Customer.

Patching operations can fail for various reasons. Typically, an operation fails because a database node is down, there is insufficient space on the file system, or the database host cannot access the object store.

Determining the Problem

In the Console, you can identify a failed patching operation by viewing the patch history of an Exadata Cloud@Customer system or an individual database.

A patch that was not successfully applied displays a status of Failed and includes a brief description of the error that caused the failure. If the error message does not contain enough information to point you to a solution, you can use the database CLI and log files to gather more data. Then, refer to the applicable section in this topic for a solution.

Troubleshooting and Diagnosis

Diagnose the most common issues that can occur during the patching process of any of the Exadata Cloud@Customer components.

Host Issues

One or more of the following conditions on the database host can cause patching operations to fail.

File System is Full

Patching operations require a minimum of 25 GB of free space for Oracle Grid Infrastructure patching or 15 GB for Oracle Database patching. If the required Oracle home locations do not meet these storage requirements, then an error message like the following is displayed during the patching pre-check operation:
[FATAL] [DBAAS-31009] - One or more Oracle patching pre-checks resulted in error conditions that needs to be addressed before proceeding: not enough space for s/w backups
ACTION: Verify the logs at /var/opt/oracle/log/exadbcpatch.

Use the df -h command on the host to check the available space. If the file system has insufficient space, you can remove old log or trace files to free up space.
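For example, a minimal space check and cleanup sketch is shown below; the mount points and the 30-day retention window are assumptions, so adjust them to your environment and review the file list before deleting anything:
df -h /u01 /u02
# list the largest diagnostic subdirectories to see where the space is going
du -sh $ORACLE_BASE/diag/* 2>/dev/null | sort -rh | head
# preview trace files older than 30 days; append -delete only after reviewing the output
find $ORACLE_BASE/diag -name "*.tr[cm]" -mtime +30 -print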

Node Connectivity Problems

Cloud tooling relies on proper networking and connectivity configuration between the nodes of a cluster. If the configuration is not set up properly, any operation that requires cross-node processing can fail. For example, the tooling might be unable to download the files required to apply a given patch. In that case, an error like the following is observed during a patch pre-check or apply request:
[FATAL] [DBAAS-31009] - One or more Oracle patching pre-checks resulted in error conditions that needs to be
        addressed before proceeding: % Total % Received % Xferd Average Speed Time Time Time Current
        Dload Upload Total Spent Left Speed0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to [host address]

In this case, you can perform the following actions:

  • Verify that the node or the URL is reachable by using the following commands (see the connectivity sketch after this list):
    ping hostname
    curl target url
  • Verify that your DNS configuration is correct so that the relevant node addresses are resolvable within the VM cluster.
  • Refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.
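The following is a minimal connectivity-check sketch; the node name node2 and the object store endpoint shown are placeholder examples, so substitute your own cluster node name and the endpoint for your region:
ping -c 3 node2
nslookup node2
curl -v https://objectstorage.us-ashburn-1.oraclecloud.com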

Oracle Grid Infrastructure Issues

One or more of the following conditions on Oracle Grid Infrastructure can cause patching operations to fail.

Oracle Grid Infrastructure is Down

Oracle Clusterware enables servers to communicate with each other so that they can function as a collective unit. The clusterware stack must be up and running on the VM cluster for patching operations to complete. Occasionally, you might need to restart Oracle Clusterware to resolve a patching failure.

In such cases, verify the status of the Oracle Grid Infrastructure as follows:
[grid@host:$GRID_HOME/bin]$ ./crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
If Oracle Grid Infrastructure is down, then restart it by running the following commands:
crsctl start cluster -all
crsctl check cluster
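If the cluster stack reports as online but patching still fails, you can also review the state of the individual cluster resources. A minimal sketch using the standard resource status listing:
$GRID_HOME/bin/crsctl stat res -t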

Oracle Grid Infrastructure Upgrade Pre-check Failures

During the pre-check operation, failures can be reported when the target to be patched does not meet the minimum requirements for the patching operation. An example of the pre-check command follows:
[root@host:~][0]# dbaascli patch db prereq --patchid <patch id> --dbnames GRID
DBAAS CLI version 19.4.4.2.0
Executing command patch db prereq --patchid LATEST --dbnames grid
INFO: DBCS patching
...

Patch ID Not Being Recognized

If Cloud tooling fails to recognize the specified patch ID, then an error like the following is observed:
[FATAL] [DBAAS-10002] - The provided value for the parameter patchnum is invalid: Incorrect patchnum.
ACTION: Verify the corresponding application usage and/or logs at /var/opt/oracle/log/exadbcpatchmulti and try again.

To verify that the specified patch ID is correct, confirm that it is listed as an available patch in the Console.

If the specified patch ID is listed and if the prerequisite operation still fails to recognize the patch ID, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Specific Pre-check Validation Failed

Once the pre-check validation starts, Cloud tooling performs a series of validations to determine whether the minimum requirements for the requested patching operation are met. If any of these minimum requirements are not met, then a failure like the following is observed:
[FATAL] [DBAAS-31009] - One or more Oracle patching pre-checks resulted in error conditions that needs to be addressed before proceeding: <Specific Pre-check Validation Failure>

Depending on the specific failed prerequisite validation, perform the corresponding corrections on the environment or the Oracle home as required. Once those corrections have been made, the operation can be reattempted.

If the failure persists, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Oracle Grid Infrastructure Patch Apply Failures

During the installation of the requested patch on Oracle Grid Infrastructure, the procedure may run into errors or unexpected conditions, as in the following example:
[root@host:~][0]# dbaascli patch db apply --patchid <patch id> --dbnames GRID
...
ERROR: Grid upgrade failed. Please check corresponding log in /var/opt/oracle/log/exadbcpatch

If a failure is detected on a given node during the patch installation process, then do the following:

  • Address the issue that caused the failure, if it is evident, and then retry the same command so that the operation resumes from the failure point.
  • If the issue persists after retrying the command, or if it is not possible to identify the root cause of the failure, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Oracle Databases Issues

An improper database state can lead to patching failures.

Oracle Database is Down

The database must be active and running on all the active nodes so that patching operations can complete successfully across the cluster.

Use the following command to check the state of your database, and ensure that any problems that might have put the database in an improper state are resolved:
srvctl status database -d db_unique_name -verbose

The system returns a message including the database instance status. The instance status must be Open for the patching operation to succeed.

If the database is not running, use the following command to start it:
srvctl start database -d db_unique_name -o open
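For example, a minimal check-and-start sequence, assuming a hypothetical database unique name of myexadb, looks like the following:
# each instance should report a status of Open
srvctl status database -d myexadb -verbose
# start the database if it is not running, then re-check before retrying the patch operation
srvctl start database -d myexadb -o open
srvctl status database -d myexadb -verbose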

Oracle Database Patching Pre-check Failures

During the pre-check operation, failures can be reported when the databases to be patched do not meet the minimum requirements for the patching operation. An example of the pre-check command follows:
[root@host:~][0]# dbaascli patch db prereq --patchid <patch id> --dbnames <database 1,...,database n>
DBAAS CLI version 19.4.4.2.0
Executing command patch db prereq --patchid LATEST --dbnames grid
INFO: DBCS patching
...

Patch ID Not Being Recognized

If Cloud tooling fails to recognize the specified patch ID, then an error like the following is observed:
[FATAL] [DBAAS-10002] - The provided value for the parameter patchnum is invalid: Incorrect patchnum.
ACTION: Verify the corresponding application usage and/or logs at /var/opt/oracle/log/exadbcpatchmulti and try again.

To verify that the specified patch ID is correct, confirm that it is listed as an available patch in the Console.

Alternatively, you can verify the patch level installed in a given Oracle home by using the following command:
dbaascli dbhome info

If the specified patch ID is listed and if the prerequisite operation still fails to recognize the patch ID, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Specific Prereq Validation Failed

Once the prerequisite validation starts, Cloud tooling performs a series of validations to determine whether the minimum requirements for the requested patching operation are met. If any of these minimum requirements are not met, then a failure like the following is observed:
[FATAL] [DBAAS-31009] - One or more Oracle patching pre-checks resulted in error conditions that needs to be addressed before proceeding: <Specific Prereq Validation Failure>

Depending on the specific failed prerequisite validation, perform the corresponding corrections on the environment or the Oracle home as required. Once those corrections have been made, the operation can be reattempted.

If the failure persists, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Oracle Database Patch Apply Failures

During the installation of the requested patch on the corresponding Oracle Database home, the procedure may run into errors or unexpected conditions, as in the following example:
[root@host:~][0]# dbaascli patch db apply --patchid <patch id> --dbnames <database 1,...,database n>
...
ERROR: Error during creation, empty dbhome patching failed. Check the corresponding logs

If it is not possible to identify the root cause of the failure and its corresponding solution, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Oracle Cloud Tooling Issues

No Applicable Cloud Tooling Patches Available

One issue that can occur when you attempt a Cloud tooling patch installation is that the operation fails because there are no applicable RPMs to install. An example of this condition follows:
[root@host:~]# dbaascli patch tools apply --patchid LATEST
DBAAS CLI version 19.4.4.2.0
Executing command patch tools apply --patchid LATEST
...
[FATAL] [DBAAS-33032] - An error occurred while performing the installation of the Oracle DBAAS tools: No applicable dbaastools rpms found.
ACTION: Verify the logs at /var/opt/oracle/log/exadbcpatch.
To confirm that there are indeed no applicable patches to be installed for Cloud tooling, you can run the following command:
dbaascli patch tools list

If the Cloud tooling patch level is eligible for patching but Cloud tooling does not list any applicable patch ID, then refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.

Obtaining Further Assistance

If you were unable to resolve the problem using the information in this topic, follow the procedures below to collect relevant database and diagnostic information. After you have collected this information, contact Oracle Support.

Collecting Cloud Tooling Logs

Collect the relevant log files that can assist Oracle Support in the investigation and resolution of a given issue.

DBAASAPI Logs

These logs are applicable for actions that are performed from the Console.

/var/opt/oracle/log/dbaasapi/db/db:
  • Job HASH.log, corresponding to the backend API request
Note

All the log files are timestamped, so issues can be traced back to a specific point in time during the DB system operation.

DBAASCLI Logs

/var/opt/oracle/log/dbaascli:
  • dbaascli.log

DBAAS ExaPatch Logs

/var/opt/oracle/log/exadbcpatchmulti:
  • exadbcpatchmulti.log
  • exadbcpatchmulti-cmd.log
/var/opt/oracle/log/exadbcpatchsm:
  • exadbcpatchsm.log
/var/opt/oracle/log/exadbcpatch:
  • exadbcpatch.log
  • exadbcpatch-cmd.log
  • exadbcpatch-dmp.log
  • exadbcpatch-sql.log
Note

All the log files are timestamped, so issues can be traced back to a specific point in time during the DB system operation.
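When opening a service request, it can be helpful to bundle these directories into a single archive. A minimal collection sketch follows; the archive name and the /tmp location are arbitrary choices:
tar czf /tmp/exacc_patch_logs_$(hostname -s)_$(date +%Y%m%d).tar.gz \
    /var/opt/oracle/log/dbaascli \
    /var/opt/oracle/log/exadbcpatch \
    /var/opt/oracle/log/exadbcpatchmulti \
    /var/opt/oracle/log/exadbcpatchsm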

Collecting Configuration Tools Logs

$GRID_BASE/cfgtoollogs
$ORACLE_BASE/cfgtoollogs

Collecting Oracle Diagnostics

To collect the relevant Oracle diagnostic information and logs, run the dbaas_diag_tool.pl script.
/var/opt/oracle/misc/dbaas_diag_tool.pl

For more information about the usage of this utility, see My Oracle Support note 2219712.1.
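A minimal invocation sketch is shown below; the script's command-line options are not documented here, so consult the MOS note above for the supported parameters before running it:
perl /var/opt/oracle/misc/dbaas_diag_tool.pl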

Failed Patch Modifies the Home Name in oraInventory with the Suffix "_PIP"

Description: The image-based patching process temporarily changes the name of the Oracle home being patched in the oraInventory by adding the suffix '_pip' (patching in progress). For example, OraDB19Home1 becomes OraDB19Home1_pip.

When a patch fails on node 2, the name is not reverted to the original. This causes a home subsequently installed on node 2 to use the home name OraDB19Home1.

Action: On the failing node, run the following command to clear the corresponding _pip entry from the inventory:
/var/opt/oracle/exapatch/exadbcpatchmulti -rollback_async <patch id> \
  -instance1=<hostname>:<ORACLE_HOME path> -dbname=<dbname1> -run_datasql=1

After performing the local rollback, resume applying the corresponding patch.

Database is Down While Performing Downgrade to Release 11.2 or 12.1

Description: An error like the following is thrown while running the database upgrade command with the --revert flag.
[FATAL] [DBAAS-54007] - An error occurred when open the a121db database with resetlog options: ORA-01034: ORACLE not available.

Action: For Oracle Database releases 11.2 and 12.1, apply the one-off patch for bug 31561819 before attempting the downgrade.
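To check whether the one-off is already installed in the 11.2 or 12.1 home, you can list the installed patches; the sketch below assumes the one-off patch number matches the bug number:
$ORACLE_HOME/OPatch/opatch lspatches | grep 31561819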

If the issue persists and impacts a given database, then, per the similar bug 31762303 filed by the MAA team, Oracle recommends that you run the following commands after the failure to complete the database downgrade:
/u02/app/oracle/product/19.0.0.0/dbhome_3/bin/srvctl downgrade database -d
<DB_UNIQUE_NAME> -o /u02/app/oracle/product/11.2.0/dbhome_2 -t 11.2.0.4
/u02/app/oracle/product/11.2.0/dbhome_2/bin/srvctl setenv database -d <DB_UNIQUE_NAME> -T
"TNS_ADMIN=/u02/app/oracle/product/11.2.0/dbhome_2/network/admin/<DB_NAME>"

After Database Upgrade, the Standby Database Remains in Mounted State in Oracle Data Guard Configurations

Description: After performing the upgrade as recommended in Oracle MOS note 2628228.1, the standby database is left in MOUNTED state.

Action: If it is required to bring the standby database back to read-only mode, then proceed with the following steps:

Run the following query on the primary database:
SELECT dest_id, thread#, sequence#, resetlogs_change#, standby_dest, archived, applied, status, to_char(completion_time,'DD-MM-YYYY:hh24:mi') FROM v$archived_log;

Ensure that all the logs have been replicated successfully to the standby database after the upgrade operation.
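For example, a minimal gap check run on the primary compares the highest archived and applied sequence for each thread; the DEST_ID value of 2 is an assumption, so substitute your standby destination ID:
SELECT thread#, MAX(sequence#) AS last_archived FROM v$archived_log WHERE dest_id = 2 AND archived = 'YES' GROUP BY thread#;
SELECT thread#, MAX(sequence#) AS last_applied FROM v$archived_log WHERE dest_id = 2 AND applied = 'YES' GROUP BY thread#;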

Then on the standby database, run the following commands:
# dbaascli database stop --dbname <standby dbname>
# dbaascli database start --dbname <standby dbname>

The database should then be open in read-only mode again.

Primary Database Fails to Downgrade to 18c in Oracle Data Guard Configurations

Description: The following failure is observed while downgrading a primary Data Guard database from 19c to 18c:
[FATAL] [DBAAS-54007]
- An error occurred when open the db163 database with resetlog options:
ORA-16649: possible failover to another database prevents this database from
being opened.

Action: Follow these steps to fix the issue:

  1. Open the initdbname.ora file:
    /var/opt/oracle/dbaas_acfs/upgrade_backup/dbname/initdbname.ora
  2. Set the *.dg_broker_start parameter to false and save the changes:
    *.dg_broker_start=FALSE
  3. Bring down the local instance and start it in mount mode with the saved pfile:
    shutdown immediate;
    startup mount pfile='/var/opt/oracle/dbaas_acfs/upgrade_backup/dbname/initdbname.ora';
  4. Then open it with the following command:
    alter database open resetlogs;
  5. Re-enable the Data Guard broker.
    alter system set dg_broker_start=true scope=BOTH;
  6. Restore the spfile.
    create spfile='DATA DISKGROUP/db_unique_name/spfiledbname.ora' from pfile='/var/opt/oracle/dbaas_acfs/upgrade_backup/dbname/initdbname.ora';
  7. Shut down the local instance.
    shutdown immediate;
  8. Manually downgrade the service.
    19c Oracle home/bin/srvctl downgrade database -d db_unique_name -oraclehome 18c Oracle home path -targetversion 18.0.0.0.0
  9. Restore the TNS_ADMIN variable.
    19c Oracle home/bin/srvctl setenv database -d db_unique_name -t "TNS_ADMIN=18c Oracle home/network/admin/dbname"
  10. Bounce the database across the cluster.
    18c Oracle home/bin/srvctl stop database -d db_unique_name
    18c Oracle home/bin/srvctl start database -d db_unique_name

Patching Primary and Standby Databases Configured with Oracle Data Guard Fails

Description: In OCI environments, patching primary or secondary nodes using the exadbcpatchmulti tool fails if there's no SSH connectivity between the primary and standby nodes.

Action: Depending on the node you're patching, add the -primary or -secondary flag. You can add flags to identify the nodes only if you're patching using the exadbcpatchmulti tool.

For example:

To patch standby nodes, use the -secondary flag:
/var/opt/oracle/exapatch/exadbcpatchmulti action [patchid] dbname|instance_num -secondary
To patch primary nodes, use the -primary flag:
/var/opt/oracle/exapatch/exadbcpatchmulti action [patchid] dbname|instance_num -primary
Note

Always patch standby nodes first and then proceed to primary nodes.