This is one of the common issues which I heard from my clients using SQL Server AlwaysOn and they don’t know why it happens.
What is resolving state in SQL Server AlwaysOn?
When there is an availability group, the replica would be either in primary state or secondary state – when its online in failover cluster manager. Resolving is an intermediate state when the transition is happening from primary to secondary or vice versa. If due to some reason the transition is not successful, it goes to “resolving” state. In this state, the database is not accessible.
What you should do to find the cause?
I always ask for ERRORLOG in such situations to know the series of events. Where is ERRORLOG? Various Ways to Find its Location
2016-02-22 17:36:14.75 Server ***Unable to get thread context for spid 0
2016-02-22 17:36:14.75 Server * *******************************************************************************
2016-02-22 17:36:14.75 Server *
2016-02-22 17:36:14.75 Server * BEGIN STACK DUMP:
2016-02-22 17:36:14.75 Server * 04/22/15 17:36:14 spid 7416
2016-02-22 17:36:14.75 Server *
2016-02-22 17:36:14.75 Server * Non-yielding Scheduler
2016-02-22 17:36:14.75 Server *
2016-02-22 17:36:14.75 Server * *******************************************************************************
2016-02-22 17:36:14.76 Server Stack Signature for the dump is 0x000000000000025C
……
2016-02-22 17:36:26.19 Server Error: 19407, Severity: 16, State: 1.
2016-02-22 17:36:26.19 Server The lease between availability group ‘ProdAG’ and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
2016-02-22 17:36:26.20 Server AlwaysOn: The local replica of availability group ‘ProdAG’ is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.
When I dug further I found that non-yielding came due to a restore command. The restore contributed to the IO load on the server, causing a non-yielding scheduler to occur within the SQL Server. This kept the SQL Server from renewing the lease with the cluster, causing the lease timeout for the availability group, which then caused the availability to go into a resolving state
Along with Errorlog, it’s always good to have Windows event logs (Application and System) and Cluster log from all participating nodes.
Have you ever faced such situation where the availability group was in resolving error?
If you like this blog post, here is another blog post where we have discussed how you can add files to the database in the AlwaysOn Availability Group.
Reference: Pinal Dave (https://darkslategrey-bat-805937.hostingersite.com)