Wednesday, September 7, 2016

RAC Node Eviction - Why? Who? When? How to diagnose?


Why Node Eviction?  OR  Why Split Brain syndrome? 

A node eviction/reboot is used for I/O fencing: it ensures that writes from I/O-capable clients can be cleared, avoiding potential corruption scenarios in the event of a network split, node hang, or some other fatal event in a clustered environment.

By definition, I/O fencing (cluster industry technique) is the isolation of a malfunctioning node from a cluster's shared storage to protect the integrity of data.

We want the reboot here to happen as fast as possible. Choosing not to flush local disks or kill off processes gracefully helps us shut down quickly. It is imperative that we do not flush any I/O to the shared disks; otherwise stale writes could corrupt clusterware components (the OCR or voting disk) or database files.

Who evicts/reboot the node?

The daemons for Oracle Clusterware (CRS) are started by init when the machine boots, viz. CRSD, OCSSD, EVMD, OPROCD (when vendor clusterware is absent), and OCLSOMON.

There are three fatal processes, i.e. processes whose abnormal halt or kill will provoke a node reboot:
1. ocssd.bin (runs as oracle)
2. oclsomon.bin (monitors OCSSD; runs as root)
3. oprocd.bin (I/O fencing in a non-vendor-clusterware environment; runs as root)

—Other non-CRS Processes Capable of Evicting:

◦OCFS2 (if used)
◦Vendor Clusterware (if used)
◦Operating System (panic)
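As a first triage step you can confirm the fatal daemons are alive. A minimal sketch; the ps output below is fabricated for illustration, and on a real node you would pipe `ps -ef` into the loop instead:

```shell
# Fabricated 'ps -ef' sample; on a live node use: ps -ef | grep -E 'ocssd|oclsomon|oprocd'
ps_output='root   4321  1  0 10:00 ?  00:00:01 /u01/crs/bin/oprocd.bin run -t 1000 -m 500
oracle 4330  1  0 10:00 ?  00:02:10 /u01/crs/bin/ocssd.bin
root   4340  1  0 10:00 ?  00:00:05 /u01/crs/bin/oclsomon.bin'

# Any of these three dying abnormally provokes a node reboot
missing=0
for daemon in ocssd.bin oclsomon.bin oprocd.bin; do
  if printf '%s\n' "$ps_output" | grep -q "$daemon"; then
    echo "$daemon: running"
  else
    echo "$daemon: MISSING - node reboot risk"
    missing=$((missing + 1))
  fi
done
```

With the sample above all three daemons are found, so `missing` stays 0.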

When Node Eviction?

Read below to understand when OCSSD will trigger a node reboot.

OCSSD's primary job is internode health monitoring (via its NM and GM services); its other roles are not discussed in depth here.

—It is a multi-threaded application, i.e. several threads run simultaneously, each performing a specific task. The ocssd.log reveals the thread names: clss%, clsc%, etc.

E.g. threads for performing heartbeats (network and disk) and monitoring, sending/receiving cluster messages, etc.

—Evictions occur when CSS detects a heartbeat problem that must be dealt with.  For example, lost network communication with another node(s) or lost disk heartbeat information from another node(s). CSS initiated evictions (via poison packet or kill block) should always result in a node reboot.

—init spawns init.cssd, which in turn spawns OCSSD as a child. If ocssd dies, is killed, or exits, the node-kill functionality of the init script will kill the node. Killing init.cssd (its respawn would create a duplicate cssd) will also result in a reboot.

Read below to understand when OCLSOMON will trigger a node reboot.

By now we know that the correct functioning of CSS is crucial for the cluster. This calls for a process that keeps track of CSS's health. Its name: OCLSOMON.

This process monitors the CSS daemon for hangs or scheduling issues and can reboot the node if it perceives a hang of the CSS threads.
A variety of problems, such as issues with the OS scheduler, resources, hardware, drivers, the network, misconfiguration, or an Oracle code bug, may cause a process or thread to hang or crash.

Some routines in the OS kernel cause a kernel lockup and are non-preemptible; this leads to CPU starvation or scheduling issues. Typically, on AIX, memory overcommitment may lead to heavy paging activity, resulting in scheduling issues.


Read below to understand when OPROCD will trigger a node reboot.

Unlike CSS, which maintains and monitors the health of all nodes in the cluster, OPROCD monitors the health of the local node on which it runs (when vendor clusterware is absent).

The OPROCD process is locked in memory to monitor the local cluster node on which it executes, to detect scheduling latencies caused by hardware and driver freezes on the machine, and to provide I/O fencing functionality (10g and 11gR1 only).

To provide this functionality, OPROCD performs its check, stops running (sleeps for the timeout, -t, default 1000 ms), and if it wakes up later than the margin (-m, default 500 ms) beyond the expected time, it reboots the local node. (Think of an alarm clock snooze before an exam.)
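OPROCD's check-sleep-compare loop can be sketched in shell. This only illustrates the timing logic with the default -t/-m values; the real daemon is a C program locked in memory, and the sketch assumes GNU date (`%N`) for millisecond timestamps:

```shell
timeout_ms=1000   # -t: how long OPROCD stops running
margin_ms=500     # -m: allowed scheduling slack on wake-up

before_ms=$(($(date +%s%N) / 1000000))   # GNU date: nanoseconds since epoch
sleep 1                                   # OPROCD "stops running" for timeout_ms
after_ms=$(($(date +%s%N) / 1000000))

elapsed_ms=$((after_ms - before_ms))
if [ "$elapsed_ms" -gt $((timeout_ms + margin_ms)) ]; then
  echo "woke up ${elapsed_ms}ms later: beyond timeout+margin, would reboot the node"
else
  echo "woke up ${elapsed_ms}ms later: within margin, keep running"
fi
```

On a heavily loaded or frozen machine the sleep overshoots the margin, which is exactly the latency OPROCD treats as fatal.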

The default values for OPROCD can be overly sensitive to scheduling latencies and may cause a FALSE reboot. This is more likely in pre-11.2 releases, because its code does not function in tandem with the CSS eviction code (e.g. the NM polling threads).

A FALSE reboot is one that takes place when no formal CSS eviction was in progress. CSS expiring misscount/disktimeout and rebooting the node is not considered a false reboot.

Due to the fast nature of the reboot, the CRS log messages might not actually get flushed to disk. However, with newer CRS releases and on some platforms (except AIX), a kernel crash dump/panic/TOC is now performed on reboot so that OS support can investigate what the system looked like when the node was crashed.

IMPORTANT: OCLSOMON and OPROCD do not exist in 11gR2. The CSSD monitor (ora.cssdmonitor) takes over the functionality of oclsomon and oprocd.

- CPU or memory starvation caused by non-clusterware services may also lead to node eviction. Sometimes a hardware freeze will cause a node eviction as well.

Troubleshooting - Who Evicted Me?

- Check the OS log (messages file) for clues. What was the reboot command? Were there any OCFS2, vendor clusterware, or panic-related messages?

Below are three specimens to look for in /var/log/messages before moving on to the CRS logs.

Example 1: System restart due to node eviction
The log files may look like this:
Feb 18 17:20:42 db01 kernel: SysRq : Resetting
Feb 18 17:20:44 db01 kernel: printk: 6 messages suppressed.
Feb 18 17:20:44 db01 kernel: type=1701 audit(1392744044.855:28194): auid=4294967295 uid=1000 gid=1001 ses=4294967295 pid=8368 comm="ocssd.bin" sig=6
Feb 18 17:24:26 db01 syslogd 1.4.1: restart.
Feb 18 17:24:26 db01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Feb 18 17:24:26 db01 kernel: Linux version 2.6.18-238.12.2.0.2.el5 (xyz@ca-build9.us.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50))

Explanation:
In the logs above, the important point to notice is "kernel: SysRq : Resetting". Whenever you see a SysRq command issued, it means this node was evicted by some other node in the cluster.
You should therefore investigate in the node-eviction direction by collecting diagcollection output from the evicted node and the other nodes.

Example 2: Node reboot/system restart due to memory pressure
The log files may look like this:
Feb 12 07:32:42 db02 kernel: Total swap = 25165816kB
Feb 12 07:32:42 db02 kernel: Free swap:        97972kB
Feb 12 07:32:42 db02 kernel: 25690112 pages of RAM
Feb 12 07:32:42 db02 kernel: 978057 reserved pages
Feb 12 07:32:42 db02 kernel: 175190262 pages shared
Feb 12 07:32:42 db02 kernel: 45130 pages swap cached
Feb 12 07:35:49 db02 xinetd[7315]: START: omni pid=8176 from=::ffff:10.77.9.254
Feb 12 07:57:57 db02 syslogd 1.4.1: restart.
Feb 12 07:57:57 db02 kernel: klogd 1.4.1, log source = /proc/kmsg started.

Explanation:
In the case above, no SysRq is reported. This indicates a self-initiated reboot, so don't look to the cluster side to investigate. Though there can be many possible reasons for a self-initiated reboot, the most common is high load causing memory pressure on the system, after which the system reboots. To investigate high load, start by looking at the OS Watcher top output at, or just prior to, the system restart time.
  
Example 3: Node reboot/system restart due to a Linux kernel bug
Another reason for a self-initiated reboot could be a Linux kernel bug that causes a system panic and reboot.

Example 4: "Oracle CSSD failure. Rebooting for cluster integrity" or a similar message in the OS log messages file... CSS probably did it!

- Check the CSSD log for clues. Eviction messages plus missed checkins or a disktimeout in the CSS log *prior* to the node rebooting... CSS probably did it!
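The patterns from the examples above can be rolled into one quick grep pass over the OS messages file. A sketch using a fabricated sample line; in practice, point it at the real /var/log/messages:

```shell
# Fabricated sample of /var/log/messages for illustration
messages='Feb 18 17:20:42 db01 kernel: SysRq : Resetting'

# Classify the reboot using the markers discussed in Examples 1-4
verdict="unknown - check OS Watcher and kernel panic traces"
if printf '%s\n' "$messages" | grep -q "SysRq : Resetting"; then
  verdict="evicted by another node (SysRq reset)"
elif printf '%s\n' "$messages" | grep -q "Oracle CSSD failure"; then
  verdict="CSS initiated the reboot"
elif printf '%s\n' "$messages" | grep -qi "out of memory"; then
  verdict="memory pressure - investigate load, not the cluster"
fi
echo "verdict: $verdict"
```

The sample line contains the SysRq reset, so the sketch reports an eviction by another node.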


Private N/W heartbeat timeout values

The CSS misscount is the maximum number of seconds a network heartbeat can be missed before CSS evicts the node. The default is 30 seconds (600 seconds when third-party vendor clusterware is present).
Disk heartbeat timeout values

The CSS disktimeout (disk heartbeat timeout) also varies with the version of Oracle Clusterware. For Oracle 10.2.1 and up, the default value is 200 seconds.

Commands to get the value of n/w HB /Disk HB timeout values:
crsctl get css misscount
crsctl get css disktimeout

Example 1:
Cause :
   Network failure (e.g. packet errors/retransmits) or network latency (e.g. high send/receive times with retransmits) between nodes; heartbeats are issued once per second.
Rediscovery :
    The syslog or /var/log/messages populated with following messages..
       <timestamp> <nodename> logger: Oracle clsomon failed with fatal status 12.
       <timestamp> <nodename> logger: Oracle CSSD failure 134.
       <timestamp> <nodename> logger: Oracle CRS failure. Rebooting for cluster integrity.
   Missed checkins appear in the CSS log when the local node's CSS receiver thread is not receiving incoming packets from the CSS sending thread on the remote node.
     WARNING: clssnmPollingThread: node <node name> (n) at 50% heartbeat fatal, eviction in 29.100 seconds
     WARNING: clssnmPollingThread: node <node name> (n) at 75% heartbeat fatal, eviction in 14.960 seconds
     WARNING: clssnmPollingThread: node <node name> (n) at 75% heartbeat fatal, eviction in 13.950 seconds
Note
If the OS messages file reboot timestamp < missed checkin timestamp then the node eviction was likely not due to these missed checkins.
If the OS messages file reboot timestamp > missed checkin timestamp then the node eviction was likely a result of the missed checkins.
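That timestamp comparison can be sketched with epoch seconds (the two timestamps below are fabricated for illustration):

```shell
# Fabricated epoch timestamps for illustration
reboot_ts=1392744266          # reboot time taken from the OS messages file
missed_checkin_ts=1392744240  # last missed-checkin warning in the CSS log

if [ "$reboot_ts" -lt "$missed_checkin_ts" ]; then
  conclusion="reboot happened before the missed checkins: eviction likely NOT due to them"
else
  conclusion="reboot happened after the missed checkins: eviction likely caused by missed checkins"
fi
echo "$conclusion"
```

Here the reboot follows the warning, so the missed checkins are the likely cause.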

Example 2:
Cause :
   A majority of the voting disks are unavailable, or there is I/O latency; heartbeats are issued once per second.
Rediscovery :
   Voting disk access hanging or unavailable messages appear in the CSS log:
    ERROR: clssnmDiskPingMonitorThread: voting device access hanging (200008 miliseconds)
    -and/or-
    ERROR: clssnmDiskPMT: Aborting, 2 of 3 voting disks unavailable
Note
If strictly more than half of the voting disks are up and contain consistent information, CSS can use that consistent data.
This self-termination prevents the formation of disjoint sub-clusters (split-brain).
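The majority rule is simple arithmetic: strictly more than half of the configured voting disks must be accessible. A sketch with fabricated disk counts matching the "2 of 3 unavailable" error above:

```shell
total_disks=3
available_disks=1   # e.g. "2 of 3 voting disks unavailable"

required=$((total_disks / 2 + 1))   # strict majority
if [ "$available_disks" -lt "$required" ]; then
  echo "only $available_disks of $total_disks voting disks reachable: node self-terminates"
else
  echo "$available_disks of $total_disks voting disks reachable: majority intact, node survives"
fi
```

With 3 disks the node needs at least 2; having only 1 means it aborts, which is why voting disks are always configured in odd numbers.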


- Check the OCLSOMON log for clues. This file lives in the <CRS_HOME>/log/<hostname>/cssd/oclsomon directory. If there are errors there, oclsomon probably did it!

   $CRS_HOME/log/<host>/cssd/oclsomon/oclsmon.log
Possible causes:
> A problem with the executables (e.g. removed CRS Home files, patch issues)
> A problem with the CRS_HOME filesystem mount point (free space/slow/hanging)
> Misconfiguration of CRS, such as:
    > The wrong network selected as the private network for CRS
    > Putting the CSS vote file on a NetApp shared over a public or otherwise excessively loaded/unreliable network

An unexpected failure of the OCSSD process, with no stack and no core: use MOS Note 605449.1.

- Check the OPROCD log for clues. An error in the oprocd log prior to the node rebooting... oprocd probably did it!

Example 1:
Cause :
   An OS scheduler problem, the OS getting locked up in a driver or hardware, excessive load, or a backward system clock change.
   /etc/oracle/oprocd/<hostname>.oprocd.log
   /etc/oracle/oprocd/<hostname>.oprocd.lgl
Rediscovery :
   OPROCD writes an error message out to the file <hostname>.oprocd.log:
    May 10 18:01:40.668 | INF | monitoring started with timeout(1000), margin(500)
    May 10 18:23:02.490 | ERR | AlarmHandler:  timeout(1739751316) exceeds interval(1000000000)+margin(500000000)
    With newer releases, the same log shows historical trends of clock skew:
    May 13 16:03:32.061 | INF | TrackHistoricalTrends: added first sample 1500000000 in 90 to 100 percentile
    May 13 16:03:32.187 | INF | AlarmHandler: clock skew detected, reseting interval timer
    May 13 16:03:32.187 | INF | skew: ctime 4413579130000000, ltime 4413578800000000, cclock 1210716212187524000, lclock 1210716212061590000, cdiff 330000000, tdiff 125934000, diff 20406600
     The last-gasp file <hostname>.oprocd.lgl shows:
     Feb 03 21:31:50.030602 | LASTGASP | AlarmHandler: timeout(1570 msec) exceeds
interval(1000 msec)+margin(500 msec). Rebooting NOW


- Check the OSWatcher logs for clues. CPU or memory starvation just before the eviction indicates a resource issue on the server.

Read the link below for help interpreting OSWatcher output:

http://www.dbas-oracle.com/2013/05/How-to-Analyze-or-Read-OS-Watcher-Output-in-three-easy-steps-With-Example.html

Refer to the flowchart diagram in MOS document Note 265769.1, Troubleshooting CRS Reboots.

CLUSTERWARE Logs and their location:

In pre-11.2 releases the main clusterware logs live under $CRS_HOME/log/<hostname>/:
   alert<hostname>.log   (clusterware alert log)
   crsd/crsd.log
   cssd/ocssd.log
   evmd/evmd.log
   client/   (client tools such as crsctl, ocrconfig)
