The Art of SQL Server Database Administration, Development, and Career Skills for the Technically Minded
I ran into a question I haven’t gotten in a while – does including a clustered key as part of a non-clustered index offer any benefit?
It came up because a major ETL job that we are trying to tune has a stage table that has to update a reference key by doing a join against the live table. I recommended adding an index on the join columns in order to speed up that ETL lookup step. The reference key that needs to be updated is also the clustered index / primary key of the live table – so the index create was built like this:
CREATE NONCLUSTERED INDEX [IX_RefIDHelper] ON [pv].[MyTable] ([RefID] ASC) INCLUDE (PKID);
Looking at the deploy scripts I took the opportunity to drop some knowledge pointing out that including a column that is already the cluster key is redundant, because the cluster key is always a part of every non clustered key, that’s how it finds it’s way back to the underlying rows on the cluster. So, the include should be discarded.
CREATE NONCLUSTERED INDEX [IX_RefIDHelper] ON [pv].[MyTable] ([RefID] ASC);
If that’s news to you, you might be asking yourself… really ??? how does a non clustered key on a heap work then. The answer is that SQL Server has to create it’s own row ID called a RID that it adds to all of your indexes to help them find their way back to the actual underlying row. That’s a topic for another day however.
Being curious, the immediate follow up question was, will re-creating the index without the INCLUDE statement reduce total index size, or is the underlying index on disk the same… since the cluster key is always included anyway. In other words, even though including the cluster key adds no obvious benefit, does it introduce a tangible disadvantage (performance, overhead, disk space waste etc.)
Frankly, I didn’t know, I kinda figured it would be the same, but for science we did a quick test. First we created and then sized out the index that included the cluster key explicitly as part of the include clause. We then looked up the objectID and hit the DM to find physical stats:
SELECT * FROM sys.indexes WHERE name = 'IX_RefIDHelper' -- returned objectid 59147256, indexid 19 for query below SELECT page_count 'No. Pages', convert(float, page_count * 8.00/1024.00) 'Used Space (MB)' FROM sys.dm_db_index_physical_stats(db_id(), 59147256, 19, NULL , NULL)
Total size of index on disk: 6626.67MB
We then dropped the original index and recreated without the include of the PKID and ran again. Footprint on disk was absolutely identical. 6,626.67MB and 848,214 data pages.
Now, it may be the same on disk, but storage of the index definition as found in the sys.index_columns still explicitly includes the cluster key in it’s own row with is_included_column = 1.
SELECT * FROM sys.index_columns WHERE index_id = 19
Unable to find any disadvantage other than a single extra row in the definition, I started pondering any possible advantage. I suppose it could offer some amount of protection to the intent of the index if the cluster key was ever changed, unlikely, but hey, possible. Really, however, that is the only benefit I could think of to explicitly include a cluster key in an index definition that needs it. What say you?
I prefer including the clustering key because it’s explicit. When I look at an index, I don’t want any guessing or surprises. By including the clustering key, I don’t rely on the person coming after me knowing that the clustering key is automatically included in the index.
I suppose in your line of work, constantly opening the hood of other people’s boxes, it’s helpful when more people are explicit about their intent. Thanks for the take.
Pingback: (SFTW) SQL Server Links 16/10/15 - John Sansom