How far to go when Validating a SQL Service Pack install?
The ideal situation is to have a server that, in terms of OS and other software, exactly matches the production environment, upon which you can perform the update first. That way you know everything that is required.
This would preferably be a VM that you could snapshot, so if you encounter and fix a problem you can revert to the snapshot and start again with your updated procedure. Repeat this until the upgrade procedure works and you know the reboot requirements and such, then plan to repeat the process in production.
One of your development/testing VMs might be ideal for this if you have them (i.e. if your dev/test/release process isn't just "smash code together and throw directly into production"!). This way you are essentially treating the service pack the same as one of your own bug-fix or feature releases, which means you can run a full regression test on your application after the service pack is applied to the test environment. That confirms MS haven't introduced any bugs, changed undefined behaviour that your application relies on, or fixed a bug that your code depends on!
Obviously this "ideal" could be more time consuming than other options...
Should I stop validation at "Check Files in Use" or "Ready to update"? Have I validated everything I can by the end of "Check Files in Use"?
Just go ahead, click NEXT and apply the service pack; you can safely ignore this step. The "Check Files in Use" step handles the scenario where the end user does not want to restart after applying the service pack; in that case you need to make sure all the listed processes are stopped. In any case, I strongly recommend restarting the Windows machine/node on which you are applying the service pack.
The only thing that can happen is that, after a successful upgrade, you need to restart the Windows machine. This in itself is NOT going to cause the SP to fail.
Does going to "Ready to update" add value to the validation?
The "Ready to update" page basically shows which features you are going to update and nothing more. You have to click Update here; it does not add any value, it just shows the configuration you have chosen.
To answer your question you can skip the files in use validation; it doesn't ever block you from proceeding and is only there to inform you if you might need to reboot AFTERWARDS.
There are many more situations where you must still reboot and it won't tell you (usually .NET Framework related), so you're always going to reboot no matter what. Besides, if you don't reboot now you'll need to do it next month when the next pack comes out, because a pending reboot IS a patch blocker.
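If you want to know about a pending reboot before starting the installer, a rough sketch like the one below can help. It assumes the commonly cited registry indicators listed in `INDICATORS`; that list is an assumption on my part, not necessarily the exact set SQL Server setup checks, and `PendingFileRenameOperations` in particular can give false positives. The names `registry_probe` and `pending_reboot_reasons` are made up for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Commonly cited pending-reboot indicators under HKEY_LOCAL_MACHINE (an assumption,
# not an official list). Each entry is (key path, value name); a value name of None
# means the mere existence of the key is the signal.
INDICATORS: List[Tuple[str, Optional[str]]] = [
    (r"SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending", None),
    (r"SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired", None),
    (r"SYSTEM\CurrentControlSet\Control\Session Manager", "PendingFileRenameOperations"),
]


def registry_probe(key_path: str, value_name: Optional[str]) -> bool:
    """Return True if the key (or the named value under it) is present. Windows only."""
    import winreg  # imported lazily so the module still loads on other platforms

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            if value_name is None:
                return True
            value, _value_type = winreg.QueryValueEx(key, value_name)
            return bool(value)
    except OSError:
        # Missing key or value (or no access): treat as "no indicator found".
        return False


def pending_reboot_reasons(
    probe: Callable[[str, Optional[str]], bool] = registry_probe,
) -> List[str]:
    """List the indicators that are currently set; an empty list means none were found."""
    reasons = []
    for key_path, value_name in INDICATORS:
        if probe(key_path, value_name):
            suffix = f" [{value_name}]" if value_name else ""
            reasons.append(key_path + suffix)
    return reasons
```

On the server itself, `pending_reboot_reasons()` returns whichever indicators are set. The `probe` parameter exists only so the logic can be exercised without touching a real registry.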
But to address the elephant in the room: even if you're patching just one or two servers, you should allocate more than 30 minutes; 60-120 minutes is about right, especially if you have AGs/FCIs/mirroring/replication and Enterprise features. If you have a few dozen servers you can compress that into about four hours, because you'll be partly automating at that stage and it's pretty rare that they'll all fail with completely different issues.
The reason you need more time is that you never know what's going on with an ESX host, a temporarily slow SAN, or the bugs in 2012 with slow update installs that they allegedly fixed recently. Or you somehow forgot to remove SSISDB from an AG first and now it's hosed and you need to fix it. Or the repeated failures from MS screwing up instances with filestream, so you have to go into Add/Remove Programs and do a repair before reapplying the update. Or you need to wait for the AG to come back in sync after patching (easily 30 minutes on a busy server) before failing over and patching the next replica.
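The "wait for the AG to come back in sync" step is the kind of thing worth scripting with an explicit timeout, so the wait counts against your maintenance window instead of silently eating it. Here is a minimal sketch. `wait_until` and `all_synchronized` are hypothetical helpers, and where the state strings come from is left to you (for example the `synchronization_state_desc` column of `sys.dm_hadr_database_replica_states`, fetched with whatever client you already use).

```python
import time
from typing import Callable, Iterable


def all_synchronized(states: Iterable[str]) -> bool:
    """True when every reported state is SYNCHRONIZED.

    Assumes synchronous-commit replicas; async replicas report SYNCHRONIZING
    and would need a different rule.
    """
    states = list(states)
    return bool(states) and all(s == "SYNCHRONIZED" for s in states)


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed.

    Returns True on success, False on timeout. check() always runs at least once.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In use, you would wrap your own query in a function that returns the list of states, call `wait_until(lambda: all_synchronized(fetch_states()), timeout_s=45 * 60)`, and only fail over if it returns True. `fetch_states` is a placeholder for that query.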
What basic health checks have you automated? It takes a few minutes per server to run all the AG policy checks. If you're doing it by hand it takes longer: validating that DQS, MDS, SSRS and SSAS have all come back up and aren't throwing stupid errors.
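As a rough illustration of what "automated" can mean here, the sketch below runs a set of named checks and reports which ones failed, without letting one broken check stop the rest. The function name and the check names in the usage note are hypothetical; each check is just a callable you supply (a service status query, a test connection, an HTTP ping of the report server, and so on).

```python
from typing import Callable, Dict, List, Tuple


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Run every check and return (name, reason) for each one that failed.

    A check fails if it returns a falsy value or raises an exception.
    An empty result means everything passed.
    """
    failures: List[Tuple[str, str]] = []
    for name, check in checks.items():
        try:
            if not check():
                failures.append((name, "check returned False"))
        except Exception as exc:  # keep going so a single error doesn't hide the others
            failures.append((name, f"{type(exc).__name__}: {exc}"))
    return failures
```

You might call it as `run_health_checks({"ssrs_up": check_ssrs, "ssas_up": check_ssas, "ag_policies": check_ag})`, where those three functions are placeholders for your own probes, and treat any non-empty result as "patching is not finished yet".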
I can fairly confidently say that while it's useful to test on QA first, there have been many, many times a patch has only failed in PROD because someone, somewhere, at some point did something differently.
Anyway, the list isn't endless but it's definitely more than 30 minutes. You don't want to be looking at the clock while you're trying to fix a disaster just because you underquoted with a short time limit. I understand managers want to hear a short estimate, and that's why DBAs get paid the big bucks: because we need to say No.