Russ Thomas – SQL Judo

The Art of SQL Server Database Administration, Development, and Career Skills for the Technically Minded

Violating AdventureWorks Dating Policy with Merge, Hash, and Loop Joins

The logical joins (inner, outer, left, right etc) are familiar to any SQL pro.  As the optimizer selects the physical join type that will actually perform the work however, this part of the operation is a bit of a mystery to many.  The DBA challenge this month was to write three queries that would benefit most from each of the three physical join operations.  Merge, hash, and loops joins.  To complete the challenge myself I use AdventureWorks and some XKCD style stick figures to demonstrate my queries and the three physical joins in action

Books online gives a decent over-view of physical joins.  Craig Freedman goes a lot deeper on his blog providing great depth on each physical join type.  For a deep dive into the entire query optimization process I highly recommend Query Tuning & Optimization by Benjamin Nevarez.

truelove

Stalking with a Loops Join

There is a stocker down in the AdventureWorks warehouse in need of a date.  Kim is a great gal.  Her job title is an unfortunate homophone, she is not a stalker – she is a stocker.  She can lift very heavy boxes over her head… also, she can drive a fork-lift.  What else could you ask for?

Let’s help her find someone her own age.  (pssst, age column added after calculating it from the birth date)

loop_join_kim

SQL picks a loops join to physically match up dates for Kim.  A loop (or loops) join takes each candidate from one record set – in this case Kim – and loops through every candidate in the other data set looking for matches across the ON clause – in this case age (after filtering for single men).

loops

Physical join type here is an easy choice for SQL Server.  It only has to find matches for one record (Kim) and it’ll have to scan the table at least once anyway as age has no index.  So, it can just look for dates for Kim while it does. Nothing creepy about that at all !!  Good luck Kim!!

If more than one stocker were looking for dates the SQL engine might choose another join.  Loops joins work best with smaller data sets.  Especially when one is significantly smaller than the other.

Class Dating and the Merge Join

What if we wanted to line up potential romance for a larger group.  We could do it a few ways.  Nothing says long term relationship like salary.  There are a few ways to write this query.  To best illustrate a merge join however I broke it up.  I first created two tables with clustering keys based on salary rank.

SELECT  e.LoginID,  p.rate ,
        ROW_NUMBER() OVER ( ORDER BY p.rate ) AS salaryrank  -- rank by salary
INTO    PayRankedMen                                         -- new table
FROM    HumanResources.employee E
        INNER JOIN HumanResources.EmployeePayHistory P       -- link salary to guys
        ON e.BusinessEntityID = p.BusinessEntityID
WHERE   E.gender = 'M' AND E.maritalstatus = 'S';            -- single men
GO
SELECT  e.LoginID , p.rate ,
        ROW_NUMBER() OVER ( ORDER BY p.rate ) AS salaryrank  -- rank by salary
INTO    PayRankedWomen                                       -- new table
FROM    HumanResources.employee E
        INNER JOIN HumanResources.EmployeePayHistory P       -- link salary to gals
        ON e.BusinessEntityID = p.BusinessEntityID
WHERE   E.gender = 'F' AND E.maritalstatus = 'S';            -- single women
GO
-- add clustering primary keys to ensure sets are ordered on disk by salary rank
ALTER TABLE [dbo].[PayRankedMen] ADD  CONSTRAINT [PK_PayRankedMen] PRIMARY KEY CLUSTERED ([salaryrank])
GO
ALTER TABLE [dbo].[PayRankedWomen] ADD  CONSTRAINT [PK_PayRankedWomen] PRIMARY KEY CLUSTERED ([salaryrank])
GO

Now that we have two tables of single employees physically ranked by pay I can create my join with something like this:

merge_join_snobbery

In this circumstance SQL Server picks merge as the physical join operation.  A merge takes two sets of data that already have order, lines them up, pairs them off, and sends them on their way.  This can still be a one to many or many to many operation, though a merge would be a less likely choice in the case of the latter.  The real key is that both sets MUST be ordered.  SQL Server leverages this order for efficiency.

merge

 

Using a join hint you CAN force non ordered sets to use a merge join but you’ll notice two additional operations in the execution plan.  A merge must have ordered sets.  Obviously this could be very expensive without pre-existing order.

merge_force

Speed Dating and the Hash Join

Ok, maybe match-making by salary isn’t a great approach.  What about available vacation hours?  This would also be a great way for AdventureWorks to shore up the books before year end.  Accountants hate vacation hoarders – it’s a budget liability.  Workaholics need time off and someone to spend it with.  Perfect!

hash_join

Matching candidates on vacation hours is a perfect problem for the hash match approach. The engine takes one data set, builds a hash table made up of the join candidates, and then probes that hash table with the join candidates from the second set.  The hash table approach allows the SQL engine to leverage parallelism and other efficiencies to speed this otherwise heavy operation.

hash

SQL considers a hash match the best approach because no pre-existing order on vacation hours exists to support a merge join and the record sets are large and balanced enough where a loops approach is far too much work.  Further, the lack of indexes on the join columns makes a hash an almost certain choice for the optimizer.


Just for fun I added an index to the vacation hours column named hx_loveconnection and re-ran the blind date query to see if the physical operation would change.  In fact, SQL did change it’s mind, picking a loops join.

I also added a merge hint (not shown) and re-checked statistics with the hx_loveconnection index available.  This showed to be nearly identical in performance and impact as the loops join.  In a much larger data set with higher cardinality on the join key than vacation hours provides, I suspect a merge may have been picked.

blind_date_loop

As a side effect, this plan also demonstrates how a loop join can leverage an index to keep from having to do a scan on both sets, known in big brain circles as a naïve loop.  Kim from our first example should index by age if she’s in a big hurry!!


Want to know more?  Seriously, dig into the links in the first couple paragraphs and/or go buy Benjamin’s book.  Understanding the operation of these physical joins can make a big difference in the design of your queries and more importantly the foundational architecture of your tables and relationships if you’re lucky enough to be involved that early in the process.

Stick Image Copyrights:  Russ Thomas, Oct 13, 2014 – Created with paint.net and major assistance from my 12 year old daughter.
Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Information

This entry was posted on October 13, 2014 by in Database Development, Query Optimizer.
%d bloggers like this: