RAC Cluster Failure Detection and Recovery Process ~ Oracle DBA Secrets

RAC relies on the Cluster Services for failure detection. The Cluster Services are a distributed kernel component that monitors whether cluster members can communicate with each other and, through this process, enforces the rules of cluster membership. This is handled by Cluster Synchronization Services (CSS) through the CSSD process. CSS performs the following functions:

1. Forms a cluster and adds or removes members.
2. Tracks which members of the cluster are active.
3. Maintains a cluster membership list that is consistent on all member nodes.
4. Provides timely notification of membership changes.
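
The functions listed above can be pictured with a small toy model. This is only a sketch for illustration and not how CSSD is implemented: the class name `ClusterMembership`, its methods, and the `JOIN`/`LEAVE` event names are all hypothetical.

```python
from typing import Callable, List, Set


class ClusterMembership:
    """Toy membership list (hypothetical; not Oracle's CSS code)."""

    def __init__(self) -> None:
        self.active: Set[str] = set()
        self.listeners: List[Callable[[str, str], None]] = []

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        # Register a callback that is told about every membership change.
        self.listeners.append(callback)

    def _notify(self, event: str, node: str) -> None:
        for callback in self.listeners:
            callback(event, node)

    def join(self, node: str) -> None:
        # Add a member and notify subscribers only if it is new.
        if node not in self.active:
            self.active.add(node)
            self._notify("JOIN", node)

    def leave(self, node: str) -> None:
        # Remove a member and notify subscribers only if it was active.
        if node in self.active:
            self.active.remove(node)
            self._notify("LEAVE", node)

    def members(self) -> List[str]:
        # Sorted so that every caller sees the same, consistent view.
        return sorted(self.active)


events = []
cluster = ClusterMembership()
cluster.subscribe(lambda event, node: events.append((event, node)))
cluster.join("node1")
cluster.join("node2")
cluster.leave("node2")
```

After these calls, `cluster.members()` contains only `node1`, and `events` holds the three membership changes in the order they happened.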

When a node polls another node (the target) in the cluster and the target fails to respond after repeated attempts, a timeout occurs after approximately 60 seconds. Among the nodes that are still responding, the one that was started first declares that the target is not responding and has failed. This node becomes the new master and starts evicting the non-responding node from the cluster. Once the eviction is complete, cluster reformation begins: the reorganization process regroups the accessible nodes and removes the failed ones.
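
The detection and eviction logic described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual CSS algorithm: the `Node` fields, the function names, and the exact 60-second constant are hypothetical stand-ins for the "approximately 60 seconds" mentioned in the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TIMEOUT_SECONDS = 60.0  # assumption: the approximate timeout from the text


@dataclass
class Node:
    name: str
    started_at: float     # when the node was started
    last_response: float  # time of its last successful poll response


def is_alive(node: Node, now: float, timeout: float = TIMEOUT_SECONDS) -> bool:
    # A node counts as alive if it responded within the timeout window.
    return now - node.last_response < timeout


def pick_master(nodes: List[Node], now: float) -> Optional[Node]:
    # Among the responding nodes, the one started first becomes master.
    alive = [n for n in nodes if is_alive(n, now)]
    return min(alive, key=lambda n: n.started_at) if alive else None


def reform(nodes: List[Node], now: float) -> Tuple[Optional[str], List[str], List[str]]:
    # Evict non-responding nodes and regroup the accessible ones.
    master = pick_master(nodes, now)
    evicted = [n.name for n in nodes if not is_alive(n, now)]
    survivors = [n.name for n in nodes if is_alive(n, now)]
    return (master.name if master else None, survivors, evicted)


nodes = [
    Node("rac1", started_at=0.0, last_response=100.0),
    Node("rac2", started_at=10.0, last_response=158.0),
    Node("rac3", started_at=20.0, last_response=159.0),
]
master, survivors, evicted = reform(nodes, now=160.0)
```

In this example `rac1` has been silent for 60 seconds, so it is evicted, and `rac2` becomes master because it is the earliest-started node that is still responding.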

LMON is a background process that monitors the entire cluster to manage global resources. By constantly probing the other instances, it detects instance death and manages the associated recovery for Global Cache Service (GCS). When a node joins or leaves the cluster, LMON handles the reconfiguration of locks and associated resources, that is, the part of recovery associated with global resources. Service failover is also triggered by the EVMD process, which fires a down event.
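
To make the idea of reconfiguring resource ownership concrete, here is a minimal sketch of remastering. It assumes a hash-based assignment purely for illustration; the text does not say how Oracle chooses a new master for each resource, and the names `master_for` and `remaster` are hypothetical.

```python
import zlib
from typing import Dict, List


def master_for(resource: str, instances: List[int]) -> int:
    # Illustrative rule only: hash the resource name onto the instance list.
    return instances[zlib.crc32(resource.encode()) % len(instances)]


def remaster(masters: Dict[str, int], failed: int, survivors: List[int]) -> Dict[str, int]:
    # Only resources mastered by the failed instance are moved;
    # resources mastered by surviving instances stay where they are.
    return {
        resource: master_for(resource, survivors) if owner == failed else owner
        for resource, owner in masters.items()
    }


# Instance numbers start at zero, as in the text.
masters = {"blk-1": 0, "blk-2": 1, "blk-3": 2}
new_masters = remaster(masters, failed=1, survivors=[0, 2])
```

The point of the sketch is that only the failed instance's resources are redistributed, which keeps the amount of reconfiguration work proportional to what was lost.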

Once the reconfiguration of the nodes is complete, Oracle, in coordination with EVMD and CRSD, performs several tasks:

1. Database/Instance recovery.

2. Failover of VIP system service.

3. Failover of the user/database services to another instance.

Database/Instance Recovery

After a node in the cluster fails, the cluster goes through several recovery steps to complete changes at both the instance (cache) level and the database level:

1. During the first phase of recovery, Global Enqueue Services (GES) remasters the enqueues, and Global Cache Services (GCS) remasters its resources from the failed instance among the surviving instances.

2. The first step in the GCS remastering process is for Oracle to assign a new incarnation number.

3. Oracle determines how many nodes remain in the cluster. (Nodes are identified by a numeric identifier that starts at zero and is incremented by one for every additional node in the cluster.)

4. In an attempt to recreate the resource master of the failed instance, all GCS resource requests and write requests are temporarily suspended (the GRD is frozen).

5. All the dead shadow processes related to the GCS are cleaned from the failed instance.

6. After enqueues are reconfigured, one of the surviving instances can grab the instance recovery enqueue.

7. At the same time as GCS resources are remastered, SMON determines the set of blocks that need recovery. This set is called the Recovery set. With Cache Fusion an instance ships the contents of its block to the requesting instance without writing that dirty block to the disk (i.e. the on-disk version of the blocks may not contain the changes that are made by either instance). Because of this behavior, SMON needs to merge the content of all the online redo logs of each failed instance to determine the recovery set and the order of recovery.

8. At this stage, buffer space for recovery is allocated, and the resources that were identified in the previous reading of the redo logs are claimed as recovery resources. This is done to prevent other instances from accessing those resources.

9. A new master node for the cluster is created (a new master node is only assigned if the failed node was the previous master node in the cluster). All GCS shadow processes are now traversed, GCS is released from its frozen state, and this completes the reconfiguration process.

10. During the remastering of GCS from the failed instance (during cache recovery), most work on the instance performing recovery is paused, and while transaction recovery takes place, work proceeds at a slower pace. Subsequently, Oracle starts the database recovery process and begins cache recovery (i.e., rolling forward the changes recorded in the redo logs). This is made possible by reading the redo log files of the failed instance. Because of the shared storage subsystem, the redo log files of all instances participating in the cluster are visible to the other instances. This allows the instance that detected the failure to read the redo log files of the failed instance and start the recovery process.

11. After cache recovery completes, Oracle starts transaction recovery, i.e. it rolls back the uncommitted transactions. The committed changes were already rolled forward during cache recovery.
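
The redo merge, recovery set, and roll-forward/rollback phases from the steps above can be sketched in a toy model. This is only an illustration under simplifying assumptions: the `RedoRecord` layout is hypothetical, each thread's redo is assumed to be already ordered by SCN, whole blocks stand in for real change vectors, and no two transactions touch the same block.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class RedoRecord:
    scn: int      # orders changes across redo threads
    thread: int   # redo thread (one per instance)
    txn: str      # transaction identifier
    block: str    # block changed (empty for commit records)
    value: str    # new block content (empty for commit records)
    commit: bool = False


def merge_redo(threads: Iterable[List[RedoRecord]]) -> List[RedoRecord]:
    # Merge the per-thread redo streams into one stream ordered by SCN.
    return list(heapq.merge(*threads, key=lambda r: r.scn))


def recover(threads: Iterable[List[RedoRecord]],
            disk: Dict[str, str]) -> Tuple[Set[str], Dict[str, str]]:
    merged = merge_redo(threads)
    # Recovery set: every block that has a change recorded in the redo.
    recovery_set = {r.block for r in merged if not r.commit}
    blocks = dict(disk)
    undo: List[Tuple[str, str, Optional[str]]] = []
    committed: Set[str] = set()
    # Cache recovery: roll forward every change, committed or not.
    for r in merged:
        if r.commit:
            committed.add(r.txn)
        else:
            undo.append((r.txn, r.block, blocks.get(r.block)))
            blocks[r.block] = r.value
    # Transaction recovery: undo uncommitted changes, newest first.
    for txn, block, old in reversed(undo):
        if txn not in committed:
            if old is None:
                blocks.pop(block, None)
            else:
                blocks[block] = old
    return recovery_set, blocks


thread1 = [RedoRecord(10, 1, "T1", "A", "a1"),
           RedoRecord(30, 1, "T1", "", "", commit=True)]
thread2 = [RedoRecord(20, 2, "T2", "B", "b1"),
           RedoRecord(40, 2, "T3", "C", "c1")]
recovery_set, blocks = recover([thread1, thread2],
                               {"A": "a0", "B": "b0", "C": "c0"})
```

In this example all three blocks end up in the recovery set, the committed change to block `A` survives, and the uncommitted changes to `B` and `C` are rolled back to their on-disk values.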

Please feel free to ask questions. Thank you 🙂
Toufique Khan

Friday, August 30, 2024
