The Art of SQL Server Database Administration, Development, and Career Skills for the Technically Minded
I’m excited to be hosting T-SQL Tuesday this month. The topic is focused on performance tuning. Please be sure to check out the other posts submitted for this month’s blog party, now in its 64th installment since Adam Machanic started it.
A few months ago I got a brutal introduction to the world of large-scale ETL against legacy-format data. Don’t get me wrong, I’ve done a lot of ETL over the years – but nothing like the project we just finished up. The final package loads several hundred flat files at a time from a COBOL environment, encoded in EBCDIC with COMP-3 fields storing nybbles. Typical runs process several hundred thousand records at a time; hundreds of millions of rows monthly.
A nybble, by the way, is half a byte: four bits instead of the normal eight. COMP-3 packs one decimal digit into each nybble – a clever way to save storage space if you’re a 1970s COBOL developer, since the digits 0–9 only need four bits – but it’s a horrible way to make friends with a 2015 SSIS ETL developer working in Windows code page 1252.
We recently finished up all the business logic and vetted the transformations for our warehouse. Now that everything works, the goal is to speed it up. Moving that much data, that often, gets pretty time consuming. Enter SQL Server In-Memory OLTP for the staging tables.
I get why Microsoft calls this technology In-Memory OLTP: the vast majority of scenarios it improves with its optimistic, lock-free strategy are related to concurrency and other OLTP problems. ETL, however, is an excellent use case for this technology; and while OLTP is still technically accurate from a data architecture point of view, people typically relate ETL to the data warehousing side of the fence, not OLTP.
In-Memory ETL Lessons Learned
A lot of documents talking about the merits of In-Memory optimizations say something like, “few, if any, changes need to be made for existing code to work against memory-optimized tables.” My answer? Not in my experience.
One of the reasons ETL is such a great fit for memory-optimized tables in the stage layer is that staging tables are typically heaps, they take varying amounts of business logic and lookups, and their schema-only durability option can be leveraged since they’re short-term storage structures anyway. All of that lines up really well with the MSDN matrix on which types of tables benefit most from being memory optimized.
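Before any of this works, the database needs a memory-optimized filegroup. A minimal sketch of that prerequisite, with an assumed database name and file path – adjust both for your environment:

```sql
-- Assumed database name (StageDB) and container path; both are placeholders.
ALTER DATABASE StageDB
    ADD FILEGROUP StageDB_mod CONTAINS MEMORY_OPTIMIZED_DATA;

ALTER DATABASE StageDB
    ADD FILE (NAME = 'StageDB_mod_dir', FILENAME = 'C:\Data\StageDB_mod')
    TO FILEGROUP StageDB_mod;
```

A database gets at most one memory-optimized filegroup, and once it contains data it can’t simply be dropped, so it’s worth naming deliberately.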
Here goes a play by play of the tuning process I experienced as if you were in my head.
Alright, I’m going to re-create this staging table as “memory optimized”. I’ll just drop and recreate it with the MEMORY_OPTIMIZED = ON and DURABILITY = SCHEMA_ONLY options. (Pssst, the memory-optimized filegroup is already in place.)
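Reconstructed from memory, that first attempt looked roughly like this – a straight copy of the heap’s structure with the new options bolted on:

```sql
IF OBJECT_ID('stage.SampleData') IS NOT NULL
    DROP TABLE [stage].[SampleData];

-- No index or primary key yet -- this is what triggers the error below.
CREATE TABLE [stage].[SampleData](
    [LookupID]   [bigint] NULL,
    [LegacyKey]  [varchar](50) NOT NULL,
    [SampleData] [varchar](20) NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);
```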
Msg 41327, Level 16, State 7
The memory optimized table ‘SampleData’ must have at least one index or a primary key.
Oh yeah, forgot about that requirement. It WAS a heap, but I can figure out something to index it on so it can be dropped and re-created as a memory-optimized table. I’ll specify an index on one of the internal reference keys I’ll use later in a lookup, and re-run like this:
CREATE TABLE [stage].[SampleData](
    [LookupID]   [bigint] NULL,
    [LegacyKey]  [varchar](50) NOT NULL INDEX idxLegacyKey NONCLUSTERED,
    [SampleData] [varchar](20) NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);
Msg 12328, Level 16, State 102
Indexes on character columns that do not use a *_BIN2 collation are not supported with indexes on memory optimized tables.
What the heck is BIN2? I’d never admit to my peers that I have no idea – but in my head, that’s what I’m thinking. After some Bing-fu I discover BIN2 is a binary collation – it makes sense to compare raw bytes for an in-memory object. It’s not my default collation though. Is it anyone’s??
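If you’re curious which binary collations are available on your instance, the built-in fn_helpcollations function will list them:

```sql
-- List every *_BIN2 collation the server knows about.
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name LIKE '%[_]BIN2';

-- And for comparison, the server's default collation:
SELECT SERVERPROPERTY('Collation') AS ServerCollation;
```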
CREATE TABLE [stage].[SampleData](
    [LookupID]   [bigint] NULL,
    [LegacyKey]  [varchar](50) COLLATE Latin1_General_100_BIN2 NOT NULL
        INDEX idxLegacyKey NONCLUSTERED,
    [SampleData] [varchar](20) NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);
Hooray – I have a memory-optimized table!!! Let’s do the same to the 30 other staging tables and then try the ETL. The table names and structures are the same, so let’s just run it – should work, no problem.
What happens next
“Invalid object name ‘[stage].[SampleData]’.”
I struggle with it for a while, check the spelling and existence of the object about 1,000 times, and finally figure out that the destination component simply doesn’t like in-memory tables. I find that a destination wired against an OLE DB connection in the project finally works. Others might work as well – but for whatever reason, the default destination type didn’t. OK, run the ETL again.
After wading through endless code-page warnings (due to the BIN2 collation), which I fix by selecting “default code page” instead of specifying one in SSIS, I’m presented with this error:
Msg 10794, Level 16
The statement ‘TRUNCATE TABLE’ is not supported with memory optimized tables.
One of the most common steps in any ETL process using staging tables is truncation, but In-Memory doesn’t support TRUNCATE. That makes sense considering the architecture of the storage.
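So every pre-load step that looked like the first statement below had to become the second:

```sql
-- Old pre-load step: fails with Msg 10794 against a memory-optimized table.
TRUNCATE TABLE [stage].[SampleData];

-- Replacement. Logically equivalent for staging here, since SCHEMA_ONLY
-- durability means the data doesn't survive a restart anyway.
DELETE FROM [stage].[SampleData];
```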
So I change all those steps to DELETE statements instead of TRUNCATE, and now a ton of errors tell me this:
Msg 10794, Level 16
The table option ‘tablock’ is not supported with memory optimized tables.
A common performance booster for ETL is to use bulk loads with table locks. That’s not going to fly here. Remember that locking on In-Memory tables is not even remotely like traditional storage objects – basically, it’s nonexistent. In-Memory concurrency is entirely optimistic, via row versioning, with conflicting writes simply failing.
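In raw T-SQL terms the same problem looks like this (the #incoming temp table here is a hypothetical stand-in for whatever feeds the load):

```sql
-- Fails with Msg 10794: locking hints are meaningless on an in-memory table.
INSERT INTO [stage].[SampleData] WITH (TABLOCK)
SELECT LookupID, LegacyKey, SampleData
FROM #incoming;

-- Works: just drop the hint and let the optimistic engine do its thing.
INSERT INTO [stage].[SampleData]
SELECT LookupID, LegacyKey, SampleData
FROM #incoming;
```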
So, remove all the table lock requests from the SSIS destinations and try again.
Cannot resolve the collation conflict between “SQL_Latin1_General_CP1_CI_AS” and “Latin1_General_100_BIN2” in the equal to operation.
Remember that BIN2 violation we got while trying to build the table and index? We solved it by adding the collation to the column. Now, anywhere I compare that column to another column, I have to manually specify which collation to use as well.
In my case that meant updating around 40 separate lookups and other statements that use the original column in an equality comparison. Specifying a collation on a single column looks like this in my lookup scripts:
UPDATE A
SET ID = C.NewID
FROM stage.ATable A
INNER JOIN dbo.CTable C
    ON A.LegacyKey = C.LegacyKey COLLATE Latin1_General_100_BIN2;
Mercifully, from here on out things started flowing much more smoothly. I noticed immediate performance gains on the steps that perform heavy operations such as decrypts, encrypts, and transforms. Not as much gain as I would have thought in the bulk loads – it turns out the bottleneck there was reading from the fixed-width files, not writing to the heaps. The overall run, however, is noticeably faster: ETL times were cut nearly in half, from roughly 20 minutes down to just over 10. Given that the stage portions are only about 25% of the end-to-end work, a 100% improvement there is a pretty good return on over 1.5 million rows loaded.
Implementing In-Memory is not for the faint of heart: it’s a totally different structure with some very specific rules and constraints. This post has run long, but in the next one I’ll dig into how we got even better performance with the new hash indexes that In-Memory makes possible.
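As a teaser, the syntax difference is small: instead of the NONCLUSTERED index used above, you declare a HASH index with a bucket count sized to the expected number of distinct keys. The bucket count below is purely illustrative – sizing it properly is exactly what the next post is about:

```sql
CREATE TABLE [stage].[SampleData](
    [LookupID]   [bigint] NULL,
    [LegacyKey]  [varchar](50) COLLATE Latin1_General_100_BIN2 NOT NULL
        -- BUCKET_COUNT is an assumption here; size it to the
        -- expected number of distinct LegacyKey values.
        INDEX idxLegacyKey HASH WITH (BUCKET_COUNT = 1048576),
    [SampleData] [varchar](20) NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);
```

Hash indexes give O(1) point lookups – ideal for equality-only ETL lookups like the UPDATE above – but they’re useless for range scans, where the nonclustered (range) index still wins.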