The Art of SQL Server Database Administration, Development, and Career Skills for the Technically Minded
The logical joins (inner, outer, left, right etc) are familiar to any SQL pro. As the optimizer selects the physical join type that will actually perform the work however, this part of the operation is a bit of a mystery to many. The DBA challenge this month was to write three queries that would benefit most from each of the three physical join operations. Merge, hash, and loops joins. To complete the challenge myself I use AdventureWorks and some XKCD style stick figures to demonstrate my queries and the three physical joins in action
Books online gives a decent over-view of physical joins. Craig Freedman goes a lot deeper on his blog providing great depth on each physical join type. For a deep dive into the entire query optimization process I highly recommend Query Tuning & Optimization by Benjamin Nevarez.
Stalking with a Loops Join
There is a stocker down in the AdventureWorks warehouse in need of a date. Kim is a great gal. Her job title is an unfortunate homophone, she is not a stalker – she is a stocker. She can lift very heavy boxes over her head… also, she can drive a fork-lift. What else could you ask for?
Let’s help her find someone her own age. (pssst, age column added after calculating it from the birth date)
SQL picks a loops join to physically match up dates for Kim. A loop (or loops) join takes each candidate from one record set – in this case Kim – and loops through every candidate in the other data set looking for matches across the ON clause – in this case age (after filtering for single men).
Physical join type here is an easy choice for SQL Server. It only has to find matches for one record (Kim) and it’ll have to scan the table at least once anyway as age has no index. So, it can just look for dates for Kim while it does. Nothing creepy about that at all !! Good luck Kim!!
If more than one stocker were looking for dates the SQL engine might choose another join. Loops joins work best with smaller data sets. Especially when one is significantly smaller than the other.
Class Dating and the Merge Join
What if we wanted to line up potential romance for a larger group. We could do it a few ways. Nothing says long term relationship like salary. There are a few ways to write this query. To best illustrate a merge join however I broke it up. I first created two tables with clustering keys based on salary rank.
SELECT e.LoginID, p.rate , ROW_NUMBER() OVER ( ORDER BY p.rate ) AS salaryrank -- rank by salary INTO PayRankedMen -- new table FROM HumanResources.employee E INNER JOIN HumanResources.EmployeePayHistory P -- link salary to guys ON e.BusinessEntityID = p.BusinessEntityID WHERE E.gender = 'M' AND E.maritalstatus = 'S'; -- single men GO SELECT e.LoginID , p.rate , ROW_NUMBER() OVER ( ORDER BY p.rate ) AS salaryrank -- rank by salary INTO PayRankedWomen -- new table FROM HumanResources.employee E INNER JOIN HumanResources.EmployeePayHistory P -- link salary to gals ON e.BusinessEntityID = p.BusinessEntityID WHERE E.gender = 'F' AND E.maritalstatus = 'S'; -- single women GO -- add clustering primary keys to ensure sets are ordered on disk by salary rank ALTER TABLE [dbo].[PayRankedMen] ADD CONSTRAINT [PK_PayRankedMen] PRIMARY KEY CLUSTERED ([salaryrank]) GO ALTER TABLE [dbo].[PayRankedWomen] ADD CONSTRAINT [PK_PayRankedWomen] PRIMARY KEY CLUSTERED ([salaryrank]) GO
Now that we have two tables of single employees physically ranked by pay I can create my join with something like this:
In this circumstance SQL Server picks merge as the physical join operation. A merge takes two sets of data that already have order, lines them up, pairs them off, and sends them on their way. This can still be a one to many or many to many operation, though a merge would be a less likely choice in the case of the latter. The real key is that both sets MUST be ordered. SQL Server leverages this order for efficiency.
Using a join hint you CAN force non ordered sets to use a merge join but you’ll notice two additional operations in the execution plan. A merge must have ordered sets. Obviously this could be very expensive without pre-existing order.
Speed Dating and the Hash Join
Ok, maybe match-making by salary isn’t a great approach. What about available vacation hours? This would also be a great way for AdventureWorks to shore up the books before year end. Accountants hate vacation hoarders – it’s a budget liability. Workaholics need time off and someone to spend it with. Perfect!
Matching candidates on vacation hours is a perfect problem for the hash match approach. The engine takes one data set, builds a hash table made up of the join candidates, and then probes that hash table with the join candidates from the second set. The hash table approach allows the SQL engine to leverage parallelism and other efficiencies to speed this otherwise heavy operation.
SQL considers a hash match the best approach because no pre-existing order on vacation hours exists to support a merge join and the record sets are large and balanced enough where a loops approach is far too much work. Further, the lack of indexes on the join columns makes a hash an almost certain choice for the optimizer.
Just for fun I added an index to the vacation hours column named hx_loveconnection and re-ran the blind date query to see if the physical operation would change. In fact, SQL did change it’s mind, picking a loops join.
I also added a merge hint (not shown) and re-checked statistics with the hx_loveconnection index available. This showed to be nearly identical in performance and impact as the loops join. In a much larger data set with higher cardinality on the join key than vacation hours provides, I suspect a merge may have been picked.
As a side effect, this plan also demonstrates how a loop join can leverage an index to keep from having to do a scan on both sets, known in big brain circles as a naïve loop. Kim from our first example should index by age if she’s in a big hurry!!
Want to know more? Seriously, dig into the links in the first couple paragraphs and/or go buy Benjamin’s book. Understanding the operation of these physical joins can make a big difference in the design of your queries and more importantly the foundational architecture of your tables and relationships if you’re lucky enough to be involved that early in the process.