The Art of SQL Server Database Administration, Development, and Career Skills for the Technically Minded
We recently did a hardware/software upgrade of a set of servers participating in synchronous mirroring for HA purposes, with a log shipping solution for disaster recovery to a remote location. The availability requirement for these servers is 99.999% (five nines), and the approximate footprint of the entire operation is about 13TB. Encryption is involved, as are linked servers and read-only snapshots on the failover partner for ad hoc access.
The starting point was SQL Server 2008 (in 2000 compatibility mode). Today these servers are all running SQL Server 2012 natively on brand new hardware. Next spring we will rebuild this entire environment again, on the same hardware, with SQL Server 2014 AlwaysOn Availability Groups.
We did the entire upgrade with less than 30 seconds during which the primary server was unavailable to connections at any level. It was not without a lot of hiccups, bumps, and struggles along the way. If you have a similar upgrade on your plate, let me share some of the lessons we learned.
Elements of a Rolling Upgrade (the plan):
Microsoft publishes a decent walkthrough for performing a rolling upgrade under these conditions. A few things that applied to our environment, however, are not covered in that document; I point them out below.
Step One: Fixing Dependencies on Compatibility Mode 80
SQL Server 2012 no longer supports compatibility mode 80 (thank heavens), so we needed to get rid of these dependencies. The Extended Events for capturing deprecated code were invaluable in identifying these legacy code blocks. In a 13TB environment, however, there is a lot of code that may or may not run within an Extended Events session window, or even within a week or a month. Yet this can be code that is still highly critical when it is needed (e.g., year-end financials).
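For anyone setting up a similar capture, here is a minimal sketch of the kind of Extended Events session we're describing. The session name and file name are placeholders, and the syntax shown is the 2012-era form (`package0.event_file`); on a 2008 instance the file target is named differently.

```sql
-- Illustrative XE session to catch deprecated syntax as it executes.
-- Session/file names are placeholders, not our production values.
CREATE EVENT SESSION [TrackDeprecatedCode] ON SERVER
ADD EVENT sqlserver.deprecation_announcement
    (ACTION (sqlserver.sql_text, sqlserver.client_hostname)),
ADD EVENT sqlserver.deprecation_final_support
    (ACTION (sqlserver.sql_text, sqlserver.client_hostname))
ADD TARGET package0.event_file (SET filename = N'TrackDeprecatedCode.xel')
WITH (STARTUP_STATE = ON);  -- survive restarts; leave it running for weeks

ALTER EVENT SESSION [TrackDeprecatedCode] ON SERVER STATE = START;
```

The `sql_text` action is what makes the capture actionable: it hands you the offending statement, not just a count.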
To identify the legacy code blocks within the database that the Extended Events didn't catch, we relied on building Visual Studio database projects out of our schemas. A great side project of this upgrade was moving to TFS for all database source control. When you attempt to compile legacy code into a single project, the build generates a lot of errors wherever deprecated code is present. A great tool for digging out the cruft. Drill down on the errors and update the syntax to non-deprecated approaches.
Finally, to catch code blocks outside of the database, we had to work with the dev teams on rudimentary code scans looking for items such as old-style star joins ('*='), inferred schemas, and other code smells we could automate searches for.
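The same rudimentary pattern-matching idea works inside the database too. A sketch of the kind of sweep we mean, against `sys.sql_modules` (the pattern list here is illustrative and will throw false positives, e.g. '*=' inside a string literal, so treat hits as leads rather than proof):

```sql
-- Illustrative sweep of stored module definitions for deprecated patterns.
SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
       OBJECT_NAME(m.object_id)        AS object_name
FROM   sys.sql_modules AS m
WHERE  m.definition LIKE '%*=%'    -- old-style left outer join
   OR  m.definition LIKE '%=*%'    -- old-style right outer join
   OR  m.definition LIKE '%FASTFIRSTROW%';  -- deprecated table hint
```

Run it per database; it only sees code that lives in the database, which is why the dev-team scans of application code were still necessary.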
Pushing this modernized code to production (pre-upgrade) required the typical politics of change management, business priorities, bribery, begging, and threats that any project originating in operations must undergo. I swear, sometimes working with change management is like buying off drug lords to look the other way while the good guys are inserted onto enemy beaches.
Step Two: Upgrading DR and the Failover Partner
Mercifully, SQL Server allows you to log ship and mirror to servers running a higher version than the primary (see the rolling upgrade walkthrough above), so we upgraded Orange and Red without breaking their links to Green. This doesn't work in the other direction. The rub? Once you fail over to these servers, there is no failing back; the partners have been upgraded permanently. For this reason we disabled the witness server and set the mirror to synchronous with manual failover.
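For reference, removing the witness and forcing synchronous, manual-only failover is a two-statement change run on the principal. `MyDB` is a placeholder for each mirrored database:

```sql
-- Illustrative commands, run on the principal for each mirrored database.
ALTER DATABASE MyDB SET WITNESS OFF;          -- no automatic failover mid-upgrade
ALTER DATABASE MyDB SET PARTNER SAFETY FULL;  -- synchronous commit, manual failover only
```

With the witness gone, nothing can fail the database over to the upgraded partner behind your back, which matters enormously when failover is a one-way door.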
1. While you can mirror a database to a partner that is running a higher version of SQL Server, you cannot take read-only snapshots on that server until you have failed over, making it live, at least once. As stated above, once you do fail over on servers with mismatched versions, there is no going back until the second one has been upgraded.
Attempting to do so produces the error: "Cannot open database <db_name> version <version>. Upgrade the database to the latest version."
Our read-only traffic (on Orange) needed a new home until we could finalize the upgrade on the primary, something still a few days out. Luckily, we were able to repurpose Blue on short notice. We restored our databases with new names matching the old snapshots, put them in stand-by read-only mode, set up a quick job to update them nightly, and then used a DNS alias to redirect traffic looking for the snapshots there.
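The stand-by restore pattern looks roughly like this; all names and paths here are placeholders, not our actual environment:

```sql
-- Illustrative stand-by restore: readable between log restores.
RESTORE DATABASE MyDB_Snapshot
    FROM DISK = N'\\backupshare\MyDB.bak'
    WITH MOVE N'MyDB_Data' TO N'D:\Data\MyDB_Snapshot.mdf',
         MOVE N'MyDB_Log'  TO N'L:\Log\MyDB_Snapshot.ldf',
         STANDBY = N'D:\Data\MyDB_Snapshot_undo.dat';

-- The nightly refresh job applies log backups the same way:
RESTORE LOG MyDB_Snapshot
    FROM DISK = N'\\backupshare\MyDB_log.trn'
    WITH STANDBY = N'D:\Data\MyDB_Snapshot_undo.dat';
```

`STANDBY` (rather than `NORECOVERY`) is what keeps the database readable between restores, which is exactly what redirected snapshot users need.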
2. Make sure you check non-mirrored databases (e.g., master) for user tables that may have been created there. Don't trust legacy documentation for everything. When doing schema compares for fail-safe checking, don't leave out the system tables. Red Gate and Idera both make great schema compare tools.
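A quick sanity check of this kind can be done without any tooling; one sketch, run against each system database:

```sql
-- Illustrative check for user-created tables hiding in a system database.
SELECT name, create_date
FROM   master.sys.tables
WHERE  is_ms_shipped = 0;  -- 0 = not shipped by Microsoft
```

Anything this returns needs a home (or a migration plan) before the old primary goes away.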
3. Migrate logins and double-check SIDs for your SQL accounts (if applicable). Again, I love the Idera security account migration tool that is part of the admin toolset.
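If you're doing the SID check by hand rather than with a tool, a sketch of the classic orphaned-user query, run in each user database on the new partner:

```sql
-- Illustrative orphaned-user check: SQL users whose SID has no server login.
SELECT dp.name AS db_user, dp.sid
FROM   sys.database_principals AS dp
LEFT   JOIN sys.server_principals AS sp ON dp.sid = sp.sid
WHERE  sp.sid IS NULL
  AND  dp.type = 'S'   -- SQL-authenticated users
  AND  dp.name NOT IN (N'guest', N'dbo', N'INFORMATION_SCHEMA', N'sys');

-- Recreating the login with its original SID avoids the mismatch entirely
-- (login name, password, and SID below are placeholders):
CREATE LOGIN [app_login]
    WITH PASSWORD = N'ChangeMe!',
         SID = 0x01020304050607080910111213141516;
```

Copy the real SID from `sys.server_principals` on the old principal; mirrored databases carry the old SIDs with them, so the logins must match.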
4. Make sure all Agent jobs exist on both servers. Document which ones run on both servers and which ones are disabled on one and enabled on the other. Some jobs may have error checking so that they only run when their side of the mirror is the principal. Others might rely on manual enabling after a failover. Don't wait until after you upgrade to start wondering which were which.
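The job inventory itself is easy to capture for a side-by-side diff; one illustrative query, run on each partner:

```sql
-- Illustrative Agent job inventory: name, enabled flag, and category
-- from msdb, ready to diff between the two partners.
SELECT j.name,
       j.enabled,          -- 1 = enabled, 0 = disabled
       c.name AS category
FROM   msdb.dbo.sysjobs AS j
JOIN   msdb.dbo.syscategories AS c ON j.category_id = c.category_id
ORDER  BY j.name;
```

Snapshot the output from both servers before the upgrade; the diff answers "which were which" when it matters.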
Step Three: Fail Over / Official Upgrade
Since failing over is a one-way operation, we made sure to have the replacement Green (old primary) server and related storage ready to move into place ASAP. As soon as we failed over, we took old Green offline and implemented another domain-level alias pointing traffic looking for Green at the new partner (the replacement for Orange). Now we could bring up the Green replacement. After it was up and healthy, we pointed the old secondary alias at the new server (old Green is now new Orange; old Orange is now new Green). The replacement for the old secondary is now the primary, and the replacement for the old primary is now the secondary. We could then redirect the snapshots that had been moved to Blue back to new Orange.
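The failover itself is a single statement, run on the principal (`MyDB` is a placeholder):

```sql
-- Illustrative manual failover, run on the current principal.
-- With mismatched partner versions this is one-way: once the mirror
-- becomes principal, its database is upgraded and cannot fail back.
ALTER DATABASE MyDB SET PARTNER FAILOVER;
```

Everything in the surrounding plan exists because of the comment above: the moment this runs, the old primary's copy is obsolete.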
1. Using domain aliases in this way is very helpful, but don't let it become a long-term crutch or solution. Our project, now that the new boxes are up, is to begin migrating all connections to the new server names. As part of this project we had to update old aliases from the previous upgrade, four years ago, that still hadn't been fixed. Yuck.
2. Communicate, communicate, communicate. We worked very hard to make sure everyone knew our plan, knew the server names, was prepared to update connections, and had documented the known deprecated blocks. We had the NOC online and a crit-sit call open. Sys admins, SAN admins, and security admins were on the phone. We had notified customers that there would be a small maintenance-window blip in connectivity, but we were still caught by surprises.
3. Don't assume all connections will come in looking for server names or aliases. A number of our connections were connecting via IP address. The domain aliases weren't any help there; those simply broke and had to be fixed post-upgrade.
4. Save DR for last. If an emergency, a failed upgrade, or an unidentified show-stopper appears, you'll want DR available under the old environment until everything else has been verified.
5. We originally intended to practice the upgrade on our stage server (Blue). Luckily, we practiced elsewhere, since we ended up needing Blue for redirected snapshot access.
All in all? We felt very good about the entire process. As stated, critical business traffic was without a primary connection for less than 30 seconds, which we had already communicated to all clients as a scheduled maintenance window. We did not receive a single customer-facing help desk call regarding the upgrade; only a few internal ones, as documented above (surprise IP connections, snapshot traffic).
Next step – 2014 and AlwaysOn!