The Art of SQL Server Database Administration, Development, and Career Skills for the Technically Minded
Recently we had a disaster recovery event on a couple of mirrored servers. By that, I mean we were required to move primary database operations to an offsite mirror partner. Because of the amount of data as well as the distance between the two mirrors they run in an ASYNC mode. To be clear, this isn’t a high availability solution, it is a DR solution. Our high availability solution (clustering) went offline as a whole during this event.
We were required to make a decision, hope we can get primary back up in a matter of moments, or force failover with the possibility of data loss. Immediately the question from the CIO was – well, how much data could we lose? “Probably not much, if any, but it’s hard to say” isn’t a great answer at the C or E level.
My mission after that event was to not be in a position where I couldn’t have a more educated answer to that question. It got put on the back burner while the complexities of daily life for a team of 24/7 production DBAs worked themselves out, but we finally got a chance to get some additional monitoring in place this week that might help us if we are ever in that situation again.
The question was … how could we estimate the potential data loss of a force online DR event. Checking the redo queue on the mirror only tells you how much data is yet to harden – the key being here, it can still harden. What about the data on the other end of the mirror?
Buried in the myriad of stats and counters in that DMV are a set of metrics related to mirroring. The particular stat of note for us was Log Send Queue KB. This is a point in time counter that reports how many kilobytes of data are currently sitting in the queue waiting to be sent to the mirroring partner. Revealed by using something like this on your primary box.
SELECT counter_name , instance_name , cntr_value AS Log_Send_queue FROM sys.dm_os_performance_counters WHERE object_name = 'SQLSERVER:Database Mirroring' AND counter_name = 'Log Send Queue KB'
The obvious problem is the primary is likely not available when the decision to force the partner online needs to be made, so you should have some type of monitoring, logging in place that can capture this information for you pre-emptively. Pssst, if your databases are important enough to need mirroring, monitoring is hopefully already in place. This specific metric is pretty easy to set up yourself, or add as a custom metric to your off the shelf monitoring solution (i.e. Redgate SQL Monitor, SolarWinds DPA, or Idera DM).
Then when the time for action comes you can reference your monitoring tool and get an idea of the amount of data that might have built up on the primary Log Send Queue just prior to failure. Decision making is so much more fun when you have numbers.
There are a lot of metrics related to mirroring that could hopefully help you avoid a DR event to begin with, but everyone needs a fallback plan.