Staging data is an important part of the ETL (or ELT) process. The general concept is this: before you bring data into your database, data store, data mart or data warehouse, you first land it in an intermediate staging area. This process of extracting data and loading it somewhere temporary before moving it on is known as “staging”.
In the examples below I’m going to load the contents of a CSV file into a SQL Server table using BULK INSERT.
Truncate and Load
This is the simplest option and usually fine for small data sets, say rows in the thousands. With this strategy you simply delete the copy of the data from your last load and replace it with a fresh copy.
TRUNCATE TABLE dbo.Products;

BULK INSERT dbo.Products
FROM 'C:\Temp\Products_20200408_235959.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    TABLOCK
);
Append
The append strategy works if your data never changes once written. Good examples of data like this are event logs or till transactions. You’ll want to put some checks in place, though, to ensure you don’t insert the same data more than once.
-- Truncate and load file data to staging table.
TRUNCATE TABLE dbo.Staging_Trans;

BULK INSERT dbo.Staging_Trans
FROM 'C:\Temp\Transactions_20200408_235959.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    TABLOCK
);

-- Append staged data to loading table.
INSERT INTO dbo.Transactions (tranId, tranDate, productId, tillId, amount, discount)
SELECT tranId, tranDate, productId, tillId, amount, discount
FROM dbo.Staging_Trans AS S -- Source
WHERE NOT EXISTS ( -- Don't insert duplicate rows.
    SELECT *
    FROM dbo.Transactions AS T -- Target
    WHERE S.tranId = T.tranId
);
Merge
This strategy is needed when your data set changes often. For example, let’s say we have a company that receives orders and then sends out the goods. We may therefore have an Order Date, a Shipped Date, maybe even the odd change of address.
With this strategy you want to merge your data set on an identifiable subset.
Let us assume that generally none of my orders take longer than a month to fulfil, and I’ve tested this by reviewing the history of my data.
This being the case I can extract a few months of data at a time (a delta) and merge the results into my target table.
So I would have a primary key and a delta key. The delta key is important because it helps me work out which rows to delete without touching historical data.
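To make the delta key concrete, here is a small illustrative query (the date is a made-up example, not from a real load):

```sql
-- Illustrative only: computing a year-month delta key.
-- A stock row received on 2020-04-08 gets the key 202004, so a
-- delta covering April 2020 onwards only ever deletes or replaces
-- rows with yearMonthKey >= 202004; older history is left alone.
SELECT YEAR('2020-04-08') * 100 + MONTH('2020-04-08') AS yearMonthKey; -- 202004
```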
-- ======================
-- Truncate and load file data to staging table.
TRUNCATE TABLE dbo.Staging_Stock;

BULK INSERT dbo.Staging_Stock
FROM 'C:\Temp\Stock_20200408_235959.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    TABLOCK
);

-- ======================
-- Get the delta key value.
DECLARE @YearMonth int;

-- Get the last month in the file.
SELECT TOP (1) @YearMonth = YEAR(inDate) * 100 + MONTH(inDate)
FROM dbo.Staging_Stock
ORDER BY inDate DESC;

-- ======================
-- Merge differences from staging to target.
MERGE dbo.Stock AS T -- Target
USING dbo.Staging_Stock AS S -- Staging
    ON T.stockItemId = S.stockItemId
WHEN MATCHED AND EXISTS (
    -- Check if column values are different.
    SELECT T.inDate, T.outDate, T.productId, T.quantity, T.listPrice
    EXCEPT
    SELECT S.inDate, S.outDate, S.productId, S.quantity, S.listPrice
)
THEN UPDATE SET -- Update rows that have changed.
     T.inDate = S.inDate
    ,T.outDate = S.outDate
    ,T.productId = S.productId
    ,T.quantity = S.quantity
    ,T.listPrice = S.listPrice
    ,T.updateDate = GETDATE() -- Update stamp.
WHEN NOT MATCHED BY TARGET THEN INSERT -- Insert new rows.
(
     stockItemId
    ,inDate
    ,outDate
    ,productId
    ,quantity
    ,listPrice
    ,insertDate
    ,yearMonthKey
)
VALUES
(
     S.stockItemId
    ,S.inDate
    ,S.outDate
    ,S.productId
    ,S.quantity
    ,S.listPrice
    ,GETDATE() -- Insert stamp.
    ,YEAR(S.inDate) * 100 + MONTH(S.inDate) -- Insert delta key.
)
WHEN NOT MATCHED BY SOURCE AND T.yearMonthKey >= @YearMonth -- Only delete rows within the delta.
    THEN DELETE;
That covers the three main strategies:
- Truncate and load for small and simple data sets.
- Append for datasets that never change.
- Merge deltas for large and changing data sets.