The Art of SQL Server Database Administration, Development, and Career Skills for the Technically Minded
It’s TSQL Tuesday again !! Props to host SQLSoldier for the topic, WAITS.
We have been trying to grow the DBA team – so the past few months have been busy on the hiring front. I have a few favorite questions that I ask potential candidates:
Another question I regularly ask:
This question tells me what kind of background they have and gives me a jumping off point to assess their troubleshooting skills. If they draw a blank – that also tells me something. I expect answers such as CXPACKET, PAGEIOLATCH, ASYNC_NETWORK_*, the various LCK_ waits – with steps to address such as: “fiddled with MAXDOP”, “bought a tray of flash; implemented tiered storage”, “rethought isolation levels” etc.
If I were on the other side of the table, I’d be tempted to answer: WAITFOR
Of all the wait types, a WAITFOR is unique; you only get one when you specifically ask for it. The coder is literally telling the server to wait for a specified amount of time before continuing. Like cursors, whenever I see a WAITFOR my instinct is that someone, somewhere, is doing it wrong.
Don’t get me wrong – I’d find a way to strategically reassure the interviewer that I was comfortable with the waits above – blah blah blah – but to sell myself I’d tell this story.
There is some artistic liberty here – this all happened but is compressed into a concise story with a single actor for space. I will call him Horacio.
Awhile back Horacio was troubleshooting a slow query – fairly easy to fix – blocking. While troubleshooting he saw a WAITFOR in the data collection. Once the primary issue was resolved he returned to investigate further.
Horacio found a sproc whose job was to insert a new record. The PK was a natural-ish key based on client id and a timestamp out to hhmmss. A try/catch had been implemented to capture collisions when two keys were generated for the same client within the same one second window. The proc would wait one second (WAITFOR) and then try the insert again.
In short: Each client would be allowed no more activity than 60 new records per minute. Why would they do that?
Finding someone who actually knew where that design decision had come from took some footwork. He had to track down a few people – ask a few questions – and ultimatley win a few hearts and minds.
A prior design used an auto-incrementing primary key which was doing double duty as the reference number clients used when calling for support. Support staff complained that they spent too much time looking up related information and that the reference number should be more descriptive. Some processes were still manual, mind you.
The try/catch with WAITFOR had been added later after business had grown and the random PK violations started occurring – classic technical debt – solutions that work now but demand payment (with interest) later.
Horacio met resistance when he started asking questions. Don’t you dare change it – threatened support staff. The developers had already been “recognized” for their brilliant new PK so they weren’t about to go back. “We’ll cross that bridge when we have to”, was the response when he warned that some day soon they could outgrow this design in tragic fashion. A ton of foreign keys and business processes had been built off this new key and they had no intention of rethinking schema.
With more investigation Horacio established that while support staff and the business wanted the time of the issue in the ref number; hour and minute was accurate enough. The seconds had only been added to avoid collissions (un-succesfully). So he proposed replacing the seconds with a sequential number – when a collision was detected code could continue without a system enforced requirement to wait for the actual passage of time. Turns out the way issues were added there weren’t really any more collisions at the minute level than there were at the second level.
The business was happy because the existing PK entries could stay in place. The dev team was happy because they didn’t have to drastically rethink anything and roll back a “feature”. From the DBA perspective it was still far from ideal; he had to design a mechnism to issue consecutive numbers reliably upon collisions (this was pre SQL 2012 and the Sequence Object). The company didn’t pay off all technical debt related to this systems design, but they made a down payment. More importantly; Horacio gained the trust of the other teams and they were now more likely to involve him in design decisions earlier.
Why would I like Horacio as an interviewer?
Rarely are we as DBAs able to address WAITS or other resource related issues in a vaccume. CPU contention can only be tuned so much from the server end. At some point we have to involve actual breathing bi-peds. We may have to pitch ideas to the business, prioritize backlog for the dev team, mentor the architects, and even compromise (shutter). Throwing money and hardware at a problem only goes so far, configuration and server tuning goes a little further – succesfully tackling the underlying architecture as a thought leader and becoming a proactive champion of the business? That’s varsity team right there kid.
Diagnosing a WAIT is important but if you can’t do anything about it, you’re kind of like a high maintenance girlfriend/boyfriend; attractive – but ultimately not much use.
As for my answers to the other interview questions.
Room smell funny?
Any animal and why?