Hi everyone,
First of all, I'll briefly describe the environment:
2 Windows Server 2008 R2 virtual servers, with 16 vCPUs and 64 GB of RAM each
5 SQL Server 2012 instances on each virtual server (with max memory and CPU affinity settings)
5 Availability Groups
This SQL subsystem is used by 5 SharePoint 2013 farms.
For almost a year we've been experiencing some issues related to Availability Groups.
For example, more than once within 24 hours the connections of all databases in an AG are terminated and then immediately re-established (i.e. there's no failover):
AlwaysOn Availability Groups connection with secondary database terminated for primary database 'WordAutomationServices_e666ce2ffff24e6592f081cd755d3e9e' on the availability replica with Replica ID: {48f50542-dcad-493a-a2bb-f2b2a4d6ed73}. This is an informational message only. No user action is required.
[...]
At the same time we receive email alerts like this one:
The recovery LSN (403:52:1) was identified for the database with ID 8. This is an informational message only. No user action is required.
Moreover, at least once per day we receive this email alert about thread pool exhaustion:
The thread pool for AlwaysOn Availability Groups was unable to start a new worker thread because there are not enough available worker threads. This may degrade AlwaysOn Availability Groups performance. Use the "max worker threads" configuration option to increase number of allowable threads.
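To put the "max worker threads" hint in context, here is a rough sketch of the arithmetic as I understand it from the documentation. It assumes a 64-bit instance left at the default setting (0) that sees all 16 vCPUs (I'm not sure how the affinity settings change that count), that the AG thread pool is capped at max worker threads minus 40, and that a busy primary needs one log capture thread per database plus one log send thread per database for each secondary replica. The function names and the database count in the example are placeholders of mine, not our real figures.

```python
# Rough sketch only: function names and sample numbers are hypothetical.

def default_max_worker_threads(logical_cpus: int) -> int:
    """Default 'max worker threads' for a 64-bit instance with up to 64 CPUs."""
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    if logical_cpus <= 4:
        return 512
    if logical_cpus <= 64:
        return 512 + (logical_cpus - 4) * 16
    raise ValueError("more than 64 CPUs is outside the scope of this sketch")


def hadr_thread_cap(max_worker_threads: int) -> int:
    """Threads usable by Availability Groups: max worker threads minus 40."""
    return max_worker_threads - 40


def primary_hadr_threads(primary_dbs: int, secondary_replicas: int) -> int:
    """Busy-state estimate on a primary: one log capture thread per database,
    plus one log send thread per database for each secondary replica."""
    return primary_dbs * (1 + secondary_replicas)


max_threads = default_max_worker_threads(16)   # 16 vCPUs, as on our servers
cap = hadr_thread_cap(max_threads)
demand = primary_hadr_threads(primary_dbs=150, secondary_replicas=1)  # 150 is a placeholder
print(f"max worker threads: {max_threads}, AG cap: {cap}, estimated AG demand: {demand}")
```

This is only an upper-bound style estimate: idle databases release their threads, and user sessions on the same instance compete for the same worker pool.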
These events happen on every instance, obviously at different times; the production instances, however, are more affected than the others (because of the number of databases in their AGs).
Recently these issues seem to have an impact on performance too.
Trying to drill down to the bottleneck with wait stats analysis, I found that the main culprits seem to be the PREEMPTIVE, HADR and THREADPOOL waits.
The wait stats also don't seem to show any evidence of I/O, memory or CPU pressure.
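For anyone who wants to repeat the grouping, this is roughly how the wait types can be bucketed into families once the rows are exported from the wait stats. It is a minimal sketch: the function name is mine, and the sample rows are made-up placeholder values, not my real measurements.

```python
from collections import defaultdict


def wait_share_by_family(rows):
    """Group (wait_type, wait_time_ms) pairs by family and return each
    family's share of the total wait time, in percent."""
    totals = defaultdict(int)
    for wait_type, wait_ms in rows:
        if wait_type.startswith("PREEMPTIVE_"):
            family = "PREEMPTIVE"
        elif wait_type.startswith("HADR_"):
            family = "HADR"
        elif wait_type == "THREADPOOL":
            family = "THREADPOOL"
        else:
            family = "OTHER"
        totals[family] += wait_ms

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {family: 100.0 * ms / grand_total for family, ms in totals.items()}


# Placeholder rows for illustration only (not real measurements).
sample = [
    ("HADR_SYNC_COMMIT", 400),
    ("PREEMPTIVE_OS_WRITEFILEGATHER", 300),
    ("THREADPOOL", 200),
    ("PAGEIOLATCH_SH", 100),
]
shares = wait_share_by_family(sample)
for family, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{family:<12}{pct:5.1f}%")
```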
Any clues on how to pinpoint these issues and isolate the underlying problem?