Monday, October 15, 2018

CSSD fails to start on one node of a two-node RAC database cluster after reboot

Symptom:

CSSD fails to start on one node after a reboot.
The other node is OK.

The log file shows that the disk heartbeat (HB) is good but the network heartbeat fails.

Checking:

olsnodes -n -i -s -t # check nodes

node1 1 active unpinned
node2 2 inactive unpinned ## unpinned is not an issue.
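
On a larger cluster it can help to filter this output instead of reading it by eye. The following is a minimal sketch, assuming the status appears as its own whitespace-separated word as in the sample above; inactive_nodes is an illustrative helper name, not part of the Oracle tooling.

```shell
#!/bin/sh
# Read olsnodes-style output on stdin and print the name (first column)
# of every node whose status word is "inactive" (case-insensitive).
inactive_nodes() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^#/) break                      # stop at trailing comments
            if (tolower($i) == "inactive") { print $1; break }
        }
    }'
}

# Sample lines copied from the output above. On a live cluster, pipe the
# real command in instead:  olsnodes -n -i -s -t | inactive_nodes
printf '%s\n' 'node1 1 active unpinned' 'node2 2 inactive unpinned' | inactive_nodes
```

With the sample lines this prints node2, the node where CSSD did not start.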

According to the Metalink doc Top 5 Grid Infrastructure Startup Issues (Doc ID 1368382.1),
issue #3:

check
ping -s 8900 hostname/IP # check the jumbo frame setting
ping -s 1500 hostname/IP # check whether the default MTU of 1500 works
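
One caveat on the sizes: ping -s sets the ICMP payload, and each IPv4 packet adds a 20-byte IP header (without options) and an 8-byte ICMP header on top. The largest payload that fits in a single unfragmented packet is therefore MTU minus 28. The sketch below computes that size and prints a matching ping command. It assumes Linux iputils ping, where -M do forbids fragmentation; the helper names and the host name node2-priv are hypothetical.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header, no options) minus 8 bytes (ICMP header).
max_icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Print (but do not run) a ping command for the given MTU and target.
# -M do : forbid fragmentation (Linux iputils ping; an assumption here)
# -c 3  : send three probes and stop
mtu_ping_cmd() {
    echo "ping -M do -c 3 -s $(max_icmp_payload "$1") $2"
}

mtu_ping_cmd 9000 node2-priv   # node2-priv is a hypothetical private host name
mtu_ping_cmd 1500 node2-priv
```

For MTU 9000 this gives a payload of 8972 bytes, and for MTU 1500 a payload of 1472 bytes. If the 1472-byte probe succeeds but the 8972-byte probe fails, something on the path is not passing jumbo frames.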

Root cause:

MTU is set to 9000 at the Unix OS level, but the network does not have jumbo frames enabled for the private and public network interfaces.
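
To confirm the OS-side setting, a sketch like the following lists the configured MTU of every interface. It assumes Linux, where sysfs exposes the value as /sys/class/net/<interface>/mtu; other Unix systems need a different command. list_mtus is an illustrative helper name.

```shell
#!/bin/sh
# Print "<interface> <mtu>" for each network interface.
# Assumes the Linux sysfs layout; the base directory can be overridden
# with the first argument (defaults to /sys/class/net).
list_mtus() {
    base=${1:-/sys/class/net}
    for dev in "$base"/*; do
        if [ -r "$dev/mtu" ]; then
            echo "$(basename "$dev") $(cat "$dev/mtu")"
        fi
    done
}

list_mtus
```

An interface that reports 9000 here while large pings to the other node fail points to the network side, not the host, as the place where jumbo frames are missing.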

Fix:

The network team needs to enable jumbo frames (MTU 9000) and make sure "ping -s 9000 IP" works.
Then restart the cluster or reboot the server.

Reference:

Troubleshooting Clusterware startup problems with detailed debugging info

https://www.hhutzler.de/blog/troubleshooting-clusterware-startup-problems/

benjamindba