Steps to Check the Host Name for a Clustered SQL Server Instance

While troubleshooting a SQL Server cluster failover issue, it is essential to know the time needed for the cluster failover and the name of the node where SQL Server was running before the failover occurred. In this tip, I will show you the different options to find the failover time and the node name where SQL Server was running before the failover.


Split-Brain/tiebreaker in Cluster

HA clusters usually use a private heartbeat network connection to monitor the health and status of each node in the cluster. One subtle but serious condition that all clustering software must be able to handle is split-brain. Split-brain occurs when all of the private links go down simultaneously while the cluster nodes are still running. If that happens, each node in the cluster may mistakenly decide that every other node has gone down and attempt to start services that other nodes are still running. Duplicate instances of a service may cause data corruption on the shared storage.
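To make the failure mode concrete, here is a minimal Python sketch of two nodes that judge peer health purely from heartbeats. The `Node` class, the 5-second timeout, and the timestamps are assumptions chosen for illustration; they are not taken from any real cluster product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster node that tracks when it last heard from each peer."""
    name: str
    timeout: float = 5.0  # assumed heartbeat timeout in seconds (illustrative)
    last_seen: dict = field(default_factory=dict)  # peer name -> last heartbeat time

    def record_heartbeat(self, peer: str, now: float) -> None:
        self.last_seen[peer] = now

    def peers_presumed_dead(self, now: float) -> list:
        # A peer is presumed dead once its heartbeat is older than the timeout.
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


def simulate_split_brain() -> dict:
    a, b = Node("A"), Node("B")
    # t = 0: the private links work and both nodes hear each other.
    a.record_heartbeat("B", 0.0)
    b.record_heartbeat("A", 0.0)
    # All private links fail here. Both nodes keep running, but no further
    # heartbeats arrive on either side.
    now = 10.0
    return {n.name: n.peers_presumed_dead(now) for n in (a, b)}


if __name__ == "__main__":
    print(simulate_split_brain())
```

In this sketch each node ends up presuming the other is dead even though both are healthy, which is exactly the situation in which both would try to start the same services.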

The quorum resource is also used as a tiebreaker when nodes can no longer communicate (that is, are “split-brain”). When a node cannot communicate with the other nodes, Cluster Service cannot really detect the problem: it’s possible that the nodes are dead, but it’s also possible that only the communication links are. In this situation, to prevent each node from thinking that it is the sole survivor and bringing your database online, the nodes go into arbitration, using the quorum resource.

The node that owns the quorum resource places a reservation on the device every three seconds; this guarantees that the second node cannot write to the quorum resource. When the second node determines that it cannot communicate with the quorum-owning node and wants to take over the quorum, it first issues a reset on the bus.

The reset breaks the reservation. The second node then waits for about 10 seconds, to give the first node time to renew its reservation at least twice, and then tries to place its own reservation on the quorum. If the second node’s reservation succeeds, it means that the first node failed to renew the reservation, and the only reason for that failure is that the first node is dead. At this point, the second node can take over the quorum resource and restart all the resources.
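The arbitration sequence described above can be sketched as a small simulation. This is a toy model under stated assumptions, not how Cluster Service is implemented: the `QuorumDisk` class, the `arbitrate` function, and the node names are hypothetical, and time advances in whole seconds. Only the 3-second renewal interval and the roughly 10-second wait come from the text.

```python
class QuorumDisk:
    """Toy model of a shared disk that allows one reservation holder at a time."""

    def __init__(self):
        self.reserved_by = None

    def reserve(self, node: str) -> bool:
        # A reservation succeeds only if the disk is free or already held by this node.
        if self.reserved_by in (None, node):
            self.reserved_by = node
            return True
        return False

    def bus_reset(self) -> None:
        # A bus reset breaks whatever reservation is currently in place.
        self.reserved_by = None


def arbitrate(owner_alive: bool, renew_interval: int = 3, wait: int = 10) -> str:
    """Return the name of the node that holds the quorum after a challenge."""
    disk = QuorumDisk()
    disk.reserve("node1")  # node1 is the current quorum owner

    # The challenger (node2) cannot reach node1, so it resets the bus.
    disk.bus_reset()

    # The challenger waits; a live owner renews its reservation on schedule.
    for second in range(1, wait + 1):
        if owner_alive and second % renew_interval == 0:
            disk.reserve("node1")

    # After the wait, the challenger tries to reserve the quorum for itself.
    if disk.reserve("node2"):
        return "node2"  # the owner never renewed, so it is presumed dead
    return "node1"      # the owner defended its reservation


if __name__ == "__main__":
    print(arbitrate(owner_alive=True))
    print(arbitrate(owner_alive=False))
```

With a live owner, the renewal lands during the wait window and the challenger's reservation attempt fails; with a dead owner, nothing re-reserves the disk after the reset and the challenger takes over.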

Reference: http://technet.microsoft.com/en-us/library/bb742593.aspx