SQL Server 2017 with 500 databases - Frequent AG disconnects since CU9
Update:
- The frequent Availability Group disconnects were confirmed to be a regression introduced by CU9, and they were resolved after installing CU12.
- The blocking issues on the secondary replica were confirmed to be caused by an update to the VSS writer code that was introduced in CU10. Hopefully it will be resolved in CU13. The interim solution is to manually replace the VSS writer DLLs with the pre-CU10 DLLs...
BEGIN RANT-SACTION;
Unfortunately, Microsoft seems to be repeatedly failing to properly QA not only Windows 10 updates but also enterprise mission-critical software such as SQL Server.
I much preferred their previous strategy of service packs. At least then they had enough time to test them properly before inflicting production crises and data loss on their customers through the careless release of half-baked updates.
COMMIT RANT-SACTION;
Did you check the worker threads? Always On normally uses more worker threads, and the default value is often not sufficient. I had the same issue with 600 databases in an Always On setup, so we increased the worker threads in the instance configuration, and that fixed our issue. Hope this helps!
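
In case it helps, here is a rough sketch of how the worker thread check could look in T-SQL. It assumes the setting in question is the instance-level 'max worker threads' option, and the value 1200 is purely an illustrative placeholder, not a recommendation; size it for your own CPU count and database count.

```sql
-- Configured value (0 means SQL Server computes the default from the CPU count)
SELECT name, value_in_use
FROM sys.configurations
WHERE name = N'max worker threads';

-- Effective ceiling currently in force on this instance
SELECT max_workers_count
FROM sys.dm_os_sys_info;

-- Current usage: a persistently non-zero queued_tasks value suggests
-- tasks are waiting for a free worker
SELECT SUM(current_workers_count) AS current_workers,
       SUM(active_workers_count)  AS active_workers,
       SUM(work_queue_count)      AS queued_tasks
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';

-- Raise the limit (1200 is a hypothetical example value)
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max worker threads', 1200;
RECONFIGURE;
```

If current_workers sits close to max_workers_count while the disconnects are happening, that would point towards worker thread exhaustion; if not, the cause is probably elsewhere.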