Why Node Eviction? Or: Why Split-Brain Syndrome?
Node eviction/reboot is used for I/O fencing: it ensures that writes from I/O-capable clients are cleared, avoiding potential corruption scenarios in the event of a network split, node hang, or some other fatal event in a clustered environment.
By definition, I/O fencing (an industry-standard cluster technique) is the isolation of a malfunctioning node from the cluster's shared storage to protect the integrity of data.
We want the reboot here to happen as fast as possible. Choosing not to flush local disks or kill off processes gracefully helps us shut down quickly. It is imperative that we do not flush any I/O to the shared disks; otherwise stale writes might corrupt clusterware components (the OCR or voting disk) or database files.
Who Evicts/Reboots the Node?
The Oracle Clusterware (CRS) daemons are started by init when the machine boots, viz. CRSD, OCSSD, EVMD, OPROCD (when vendor clusterware is absent), and OCLSOMON.
There are three fatal processes, i.e. processes whose abnormal halt or kill will provoke a node reboot (a quick check for these daemons follows the lists below):
1. ocssd.bin (runs as oracle)
2. oclsomon.bin (monitors OCSSD; runs as root)
3. oprocd.bin (I/O fencing in non-vendor-clusterware environments; runs as root)
Other non-CRS Processes Capable of Evicting:
◦OCFS2 (if used)
◦Vendor Clusterware (if used)
◦Operating System (panic)
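A quick way to confirm which of these daemons are alive on a node; this is a minimal sketch, assuming a pre-11.2 Linux installation:
ps -ef | egrep 'ocssd|oclsomon|oprocd|init.cssd' | grep -v grep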
When Does Node Eviction Happen?
Read below to understand when OCSSD will trigger a node reboot.
OCSSD's primary job is internode health monitoring, via its Node Monitor (NM) and Group Membership (GM) services; its other roles are not discussed here in depth.
It is a multi-threaded application, i.e. several threads run simultaneously, each performing a specific task: heartbeats (network and disk), monitoring, sending/receiving cluster messages, and so on. The ocssd.log reveals the thread names, which start with clss%, clsc%, etc.
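For instance, a quick way to see which threads have been active recently (the path assumes the pre-11.2 log layout):
grep -E 'clssnm|clssgm' $CRS_HOME/log/`hostname`/cssd/ocssd.log | tail -20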
Evictions occur when CSS detects a heartbeat problem that must be dealt with, for example lost network communication with other node(s) or lost disk heartbeat information from other node(s). CSS-initiated evictions (via a poison packet or kill block) should always result in a node reboot.
init spawns init.cssd, which in turn spawns OCSSD as a child. If ocssd dies, is killed, or exits, the node-kill functionality of the init script will kill the node. Killing init.cssd (the respawn would create a duplicate cssd) will also result in a reboot.
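On a typical 10g Linux installation the respawn entries in /etc/inittab look similar to the following (exact runlevels and flags vary by platform and version, so treat this as an illustrative sketch):
h1:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null
Note the "fatal" argument to init.cssd: that mode is what ties OCSSD's death to a node reboot.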
Read below to understand when OCLSOMON will trigger a node reboot.
By now we know that the healthy working of CSS is crucial for cluster functioning. This calls for a process that can keep track of CSS's good health. Its name: OCLSOMON.
This process monitors the CSS daemon for hangs or scheduling issues and can reboot a node if there is a perceived hang of CSS threads.
A variety of problems, such as issues with the OS scheduler, resources, hardware, drivers, the network, misconfiguration, or an Oracle code bug, may cause a process/thread to hang or crash.
Some routines in the OS kernel cause a kernel lockup and are non-preemptible; this leads to CPU starvation or scheduling issues. Typically, on AIX, memory overcommitment may lead to heavy paging activity, resulting in scheduling issues.
Read below to understand when OPROCD will trigger a node reboot.
Unlike CSS, which is responsible for maintaining and monitoring the good health of all nodes in the cluster, OPROCD monitors the health of the local node on which it runs (when vendor clusterware is absent).
The OPROCD process is locked in memory to monitor each local cluster node where it executes, to detect scheduling latencies caused by hardware and driver freezes on a machine, and to provide I/O fencing functionality (10g and 11gR1 only).
To provide this functionality, OPROCD performs its check, stops running (sleeps for the timeout, -t 1000 ms), and if the wake-up comes later than the expected time plus the margin (-m 500 ms), OPROCD reboots the local node. (Think of an alarm-clock snooze on the morning of an exam.)
The default values for OPROCD can be overly sensitive to scheduling latencies and may cause a false reboot, all the more so in pre-11.2 releases, because its code does not function in tandem with the CSS eviction code (e.g. the NM polling threads).
A false reboot is a reboot that takes place when no formal CSS eviction was in progress. CSS expiring misscount/disktimeout and rebooting the node is not considered a false reboot.
Due to the fast nature of the reboot, the CRS logging messages might not actually get flushed to disk. However, with newer CRS releases and on some platforms (except AIX), a kernel crash dump/panic/TOC is now performed on reboot so that OS support can investigate what the system looked like when the node was crashed.
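A common mitigation for false reboots and lost log writes (see MOS Note 559365.1) is to set CSS diagwait, which widens OPROCD's margin and gives the OS time to flush logs before the reboot. A sketch of the documented procedure, run as root with Clusterware stopped on all nodes:
crsctl set css diagwait 13 -force
Restart Clusterware afterwards; crsctl get css diagwait confirms the new value.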
IMPORTANT: OCLSOMON and OPROCD do not exist in 11gR2. The CSSD monitor (ora.cssdmonitor) takes over the functionality of oclsomon and oprocd.
- Also, CPU or memory starvation caused by non-clusterware services may lead to node eviction. Sometimes a hardware freeze will also cause a node eviction.
Troubleshooting - Who Evicted Me?
- Check the OS log messages file for clues… What was the reboot command? Were there any ocfs2, vendor clusterware, or panic related messages?
Below are three specimen excerpts to look for in /var/log/messages before getting into the CRS logs.
Example 1: System Restart due to Node Eviction
Log files may look like this.
Feb 18 17:20:42 db01 kernel: SysRq : Resetting
Feb 18 17:20:44 db01 kernel: printk: 6 messages suppressed.
Feb 18 17:20:44 db01 kernel: type=1701 audit(1392744044.855:28194): auid=4294967295 uid=1000 gid=1001 ses=4294967295 pid=8368 comm="ocssd.bin" sig=6
Feb 18 17:24:26 db01 syslogd 1.4.1: restart.
Feb 18 17:24:26 db01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Feb 18 17:24:26 db01 kernel: Linux version 2.6.18-238.12.2.0.2.el5 (xyz@ca-build9.us.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50))
Explanation:
In the above logs, the important point to notice is "kernel: SysRq : Resetting". Whenever you see a SysRq command issued, it shows this node was evicted by some other node in the cluster. So you should investigate in the node-eviction direction by collecting diagcollection output from the evicted node and the other nodes.
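A minimal sketch of collecting that output, assuming a pre-11.2 installation (run as root on each node; $CRS_HOME is your Clusterware home):
$CRS_HOME/bin/diagcollection.pl --collect
This gathers the CRS logs into archive files in the current directory, ready to upload to support.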
Example 2: Node Reboot/System Restart due to Memory Pressure
Log files may look like this.
Feb 12 07:32:42 db02 kernel: Total swap = 25165816kB
Feb 12 07:32:42 db02 kernel: Free swap: 97972kB
Feb 12 07:32:42 db02 kernel: 25690112 pages of RAM
Feb 12 07:32:42 db02 kernel: 978057 reserved pages
Feb 12 07:32:42 db02 kernel: 175190262 pages shared
Feb 12 07:32:42 db02 kernel: 45130 pages swap cached
Feb 12 07:35:49 db02 xinetd[7315]: START: omni pid=8176 from=::ffff:10.77.9.254
Feb 12 07:57:57 db02 syslogd 1.4.1: restart.
Feb 12 07:57:57 db02 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Explanation:
In the above case, no SysRq is reported. This is a case of self-inflicted reboot, so don't look at the cluster side to investigate. There could be many possible reasons for a node to reboot itself, but the most common is high load causing memory pressure on the system, after which the system reboots. To investigate high load, start by looking at the OS Watcher top command output at the system restart time or just prior to it.
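For example, assuming OS Watcher archives to /opt/osw/archive (a hypothetical path; substitute the archive directory your install was started with), the top samples sit in hourly files named roughly <hostname>_top_YY.MM.DD.HH00.dat:
ls /opt/osw/archive/oswtop/
The timestamped filenames make it easy to open the sample covering the minutes just before the restart.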
Example 3: Node Reboot/System Restart due to Linux Kernel Bug
Another reason for a self-inflicted reboot could be a Linux kernel bug which causes a system panic, after which the system reboots.
Example 4: An "Oracle CSSD failure. Rebooting for cluster integrity" or similar message in the OS log messages file... CSS probably did it!
- Check the CSSD log for clues. Eviction messages plus missed checkins or a disktimeout in the CSS log *prior* to the node rebooting... CSS probably did it!
Private network heartbeat timeout values (misscount)
Disk heartbeat timeout values (disktimeout)
The CSS timeout for the disk heartbeat varies with the Oracle Clusterware version; for Oracle 10.2.0.1 and up, the default value is 200 seconds.
Commands to get the network-heartbeat and disk-heartbeat timeout values:
crsctl get css misscount
crsctl get css disktimeout
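On 11.2, for example, the output looks like this (values vary by platform, version, and vendor clusterware presence; older releases print just the number):
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.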
Example 1:
Cause:
Network failure (e.g. packet errors/retransmits) or network latency (e.g. high send/receive times with retransmits) between nodes; heartbeats are issued once per second.
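To look for these symptoms on the private interconnect, basic OS tools suffice (eth1 as the interconnect interface is an assumption; substitute yours):
ifconfig eth1 (inspect the errors/dropped/overruns counters)
netstat -s | grep -i retrans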
Rediscovery:
The syslog or /var/log/messages is populated with messages like the following:
<timestamp> <nodename> logger: Oracle clsomon failed with fatal status 12.
<timestamp> <nodename> logger: Oracle CSSD failure 134.
<timestamp> <nodename> logger: Oracle CRS failure. Rebooting for cluster integrity.
The missed checkins in the CSS log occur when the local node's CSS receiver thread is not receiving incoming packets from the CSS sending thread on the remote node:
WARNING: clssnmPollingThread: node <node name> (n) at 50% heartbeat fatal, eviction in 29.100 seconds
WARNING: clssnmPollingThread: node <node name> (n) at 75% heartbeat fatal, eviction in 14.960 seconds
WARNING: clssnmPollingThread: node <node name> (n) at 75% heartbeat fatal, eviction in 13.950 seconds
Note
•If the OS messages file reboot timestamp < missed checkin timestamp, then the node eviction was likely not due to these missed checkins.
•If the OS messages file reboot timestamp > missed checkin timestamp, then the node eviction was likely a result of the missed checkins.
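A quick way to pull both timestamps for comparison (paths assume the pre-11.2 layout):
grep 'heartbeat fatal' $CRS_HOME/log/`hostname`/cssd/ocssd.log
grep 'restart' /var/log/messages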
Example 2:
Cause:
A majority of the voting disks are unavailable, or voting-disk I/O latency is too high; heartbeats are issued once per second.
Rediscovery:
Voting disk device access hanging or unavailable messages in the CSS log:
ERROR: clssnmDiskPingMonitorThread: voting device access hanging (200008 miliseconds)
-and/or-
ERROR: clssnmDiskPMT: Aborting, 2 of 3 voting disks unavailable
Note
•If strictly more than half of the voting disks are up and contain consistent information, CSS can use that consistent data.
•This self-termination is to prevent the formation of disjoint sub-cluster cohorts (split-brain).
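To see how many voting disks are configured and where they live, run:
crsctl query css votedisk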
- Check the OCLSOMON log for clues. This file lives in the <CRS_HOME>/log/<hostname>/cssd/oclsomon directory; check it for errors. If there are errors, oclsomon probably did it!
$CRS_HOME/log/<host>/cssd/oclsomon/oclsomon.log
Possible reasons for an OCLSOMON-initiated reboot:
•A problem with the executables (e.g. removed CRS home files, patch issues)
•A problem with the CRS_HOME filesystem mount point (free space/slow/hanging)
•Misconfiguration of CRS. Possible misconfigurations:
> Wrong network selected as the private network for CRS
> Putting the CSS vote file on a NetApp that is shared over some kind of public network or an otherwise excessively loaded/unreliable network
•An unexpected failure of the OCSSD process, with no stack and no core. Use MOS Note 605449.1.
- Check the OPROCD log for clues. Errors in the oprocd log prior to the node rebooting... oprocd probably did it!
Example 1:
Cause:
An OS scheduler problem, the OS getting locked up in a driver or hardware, excessive amounts of load, or a backward change of the system clock.
Log files:
/etc/oracle/oprocd/<hostname>.oprocd.log
/etc/oracle/oprocd/<hostname>.oprocd.lgl
Rediscovery:
OPROCD writes an error message out to the file <hostname>.oprocd.log:
May 10 18:01:40.668 | INF | monitoring started with timeout(1000), margin(500)
May 10 18:23:02.490 | ERR | AlarmHandler: timeout(1739751316) exceeds interval(1000000000)+margin(500000000)
With newer releases the same log shows historical trends of clock skewing:
May 13 16:03:32.061 | INF | TrackHistoricalTrends: added first sample 1500000000 in 90 to 100 percentile
May 13 16:03:32.187 | INF | AlarmHandler: clock skew detected, reseting interval timer
May 13 16:03:32.187 | INF | skew: ctime 4413579130000000, ltime 4413578800000000, cclock 1210716212187524000, lclock 1210716212061590000, cdiff 330000000, tdiff 125934000, diff 20406600
The last-gasp file <hostname>.oprocd.lgl shows:
Feb 03 21:31:50.030602 | LASTGASP | AlarmHandler: timeout(1570 msec) exceeds interval(1000 msec)+margin(500 msec). Rebooting NOW.
- Check the OSWATCHER logs for clues. CPU or memory starvation just before the eviction indicates a resource issue on the server.
Read the below link for interpreting OSW output:
http://www.dbas-oracle.com/2013/05/How-to-Analyze-or-Read-OS-Watcher-Output-in-three-easy-steps-With-Example.html
Refer to the flowchart diagram in MOS document Note 265769.1, Troubleshooting CRS Reboots.
Clusterware Logs and Their Locations:
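For pre-11.2 installations, the key logs live under the Clusterware home (layout per the 10g/11gR1 releases; verify against your install):
$CRS_HOME/log/<hostname>/alert<hostname>.log (clusterware alert log; the first place to look)
$CRS_HOME/log/<hostname>/crsd/crsd.log
$CRS_HOME/log/<hostname>/cssd/ocssd.log
$CRS_HOME/log/<hostname>/evmd/evmd.log
In 11gR2 these move under the Grid Infrastructure home: $GRID_HOME/log/<hostname>/.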