Creating a slowly changing dimension (SCD) is an essential part of Data Warehousing. There are many tools for the job but I prefer to do the work using T-SQL against the database rather than using something like SSIS.
This will generally involve staging the data from source and then merging the source copy against target.
Source > Stage > Target
In this post I use a sample data set and go through how to accomplish this using a merge statement in a sub query with an output clause.
I’ve created the below with T-SQL against SQL Server 2016.
Example Data Set
Lets say we have our sample data of Imperial employees from the Star Wars universe.
You can follow along with the below SQL.
DROP TABLE IF EXISTS dbo.Employee; GO CREATE TABLE dbo.Employee ( EmployeeId int IDENTITY(1,1) NOT NULL ,Forename varchar(100) NOT NULL ,Surname varchar(250) NOT NULL ,CurrentLocation varchar(100) NOT NULL ,CurrentStatus varchar(50) NOT NULL CONSTRAINT [PK_dbo_Employee_EmployeeId] PRIMARY KEY CLUSTERED (EmployeeId) ); GO INSERT INTO dbo.Employee (Forename, Surname, CurrentLocation, CurrentStatus) VALUES ('Emperor', 'Palpatine' , 'Coruscant', 'Active') ,('Darth', 'Vader' , 'Mustafar', 'Active') ,('Wilhuff', 'Tarkin' , 'Death Star', 'Active') ,('Orson', 'Krennic' , 'Scarif', 'Active') ,('Galen', 'Erso' , 'Eadu', 'Active') GO
The SCD Type 2 Table
I now need my target SCD Type 2 table that captures the changes through a start and end date range. Every-time a row is changed in our sample data a new record needs to be inserted into the target. The only exception to this will be when a row is deleted.
DROP TABLE IF EXISTS dbo.dimEmployeeT2; GO CREATE TABLE dbo.dimEmployeeT2 ( EmployeeKey int IDENTITY(1,1) NOT NULL ,EmployeeId int NOT NULL ,Forename varchar(100) NOT NULL ,Surname varchar(250) NOT NULL ,CurrentLocation varchar(100) NOT NULL ,CurrentStatus varchar(50) NOT NULL ,_$DateFrom datetime2 NOT NULL ,_$DateTo datetime2 NULL ,_$IsCurrent bit NOT NULL ,_$IsDeleted bit NOT NULL CONSTRAINT [PK_dbo_Employee_EmployeeKey] PRIMARY KEY CLUSTERED (EmployeeKey) ); GO
You’ll notice that we have a column for identifying when a record has been deleted, _$IsDeleted. When a row is deleted we will mark it with an end date and and set _$IsDeleted to true.
The table also contains another boolian column _$IsCurrent for reconigsing when a row is the most current record.
You’ll also notice that we have a surrogate key named EmployeeKey with IDENTITY and the business key EmployeeId so we can join source and target rows in our merge statement.
The SCD SQL Logic
This SQL logic is long so I’m going to break it down into parts as I explain but first here it is in its entirety. You can run and create this procedure in the same db as your test tables from above.
DROP PROCEDURE IF EXISTS dbo.uspEtlDimEmployeeT2 GO CREATE PROCEDURE dbo.uspEtlDimEmployeeT2 AS BEGIN TRY /******************************* * Title: Merge dbo.dimEmployeeT2 * SCD Type: Type 2 ********************************/ SET NOCOUNT ON; -- Variables DECLARE @Date datetime2 = GETDATE() -- Insert changed record. INSERT INTO dbo.dimEmployeeT2 -- Target ( EmployeeId ,Forename ,Surname ,CurrentLocation ,CurrentStatus ,_$DateFrom ,_$IsCurrent ,_$IsDeleted ) SELECT M.EmployeeId ,M.Forename ,M.Surname ,M.CurrentLocation ,M.CurrentStatus ,@Date ,1 -- IsCurrent ,0 -- IsDeleted FROM ( -- Merge MERGE dbo.dimEmployeeT2 AS T -- Target USING dbo.Employee AS S -- Source -- Join on business key. ON T.EmployeeId = S.EmployeeId WHEN MATCHED AND -- Match the record on the latest (IsCurrent) and there being a diference. _$IsCurrent = 1 AND EXISTS ( -- Column values are different. SELECT T.Forename ,T.Surname ,T.CurrentLocation ,T.CurrentStatus EXCEPT SELECT S.Forename ,S.Surname ,S.CurrentLocation ,S.CurrentStatus ) THEN UPDATE SET _$DateTo = @Date ,_$IsCurrent = 0 -- Insert new rows. WHEN NOT MATCHED BY TARGET THEN INSERT ( EmployeeId ,Forename ,Surname ,CurrentLocation ,CurrentStatus ,_$DateFrom ,_$IsCurrent ,_$IsDeleted ) VALUES ( S.EmployeeId ,S.Forename ,S.Surname ,S.CurrentLocation ,S.CurrentStatus ,@Date ,1 ,0 ) -- When deleted update deleted flag. WHEN NOT MATCHED BY SOURCE AND _$IsCurrent = 1 THEN UPDATE SET _$DateTo = @Date ,_$IsCurrent = 0 ,_$IsDeleted = 1 -- Output the rows that have been updated in order to insert in outer query. OUTPUT $ACTION AS ChangeType ,S.EmployeeId ,S.Forename ,S.Surname ,S.CurrentLocation ,S.CurrentStatus ) AS M WHERE -- Return rows that where updated inside the MERGE sub query. M.ChangeType = 'UPDATE' AND EmployeeId IS NOT NULL -- Don't include deleted rows. END TRY BEGIN CATCH THROW END CATCH GO
The Merge Statement
Firstly lets start with the merge statement. The merge is inside the sub query and essentially it’s joining on the business key and looking for differences. The logic works like this:
- If the row has changed then mark it as no longer current and add an end date.
- If a row is new (exists in the source but not target) then insert a new row with a current start date and no end date. Mark as the most current record.
- If a row has been deleted (exists in target but not in source) then mark it as deleted and add an end date.
The Output Clause
Output clauses are useful for archiving, logging changes or throwing up messages from results to an application layer. In this instance we want to use it as if we’re archiving. So the procedure outputs changes from the sub query and in the SELECT we insert the new changes into the target table.
Lets Test the SCD
So now we have all our SQL objects ready lets run the procedure and check the results.
EXEC dbo.uspEtlDimEmployeeT2 GO SELECT * FROM dbo.dimEmployeeT2 GO
Excellent, we have all our records and they’re the most current. But what happens when Galen Erso runs away from the Empire before Rogue One? Run the following to see.
-- Galen runs away. UPDATE dbo.Employee SET CurrentStatus = 'Missing' WHERE Surname = 'Erso' GO -- Update the SCD. EXEC dbo.uspEtlDimEmployeeT2 GO SELECT * FROM dbo.dimEmployeeT2 GO
As you can see we now have two records for Erso. The first one is no longer marked as active and has an end date. The second row is a reflection of the updated row in the source table.
This time lets run through some more changes with the below…
-- Krennic and Erso don't make it. UPDATE dbo.Employee SET CurrentStatus = 'KIA' WHERE Surname IN ('Krennic','Erso') GO EXEC dbo.uspEtlDimEmployeeT2 GO -- Erso gets deleted from imperial records. DELETE FROM dbo.Employee WHERE Surname = 'Erso' GO EXEC dbo.uspEtlDimEmployeeT2 GO SELECT * FROM dbo.dimEmployeeT2 GO
Now we can see an example of further updates and a deletion.
So there you go. Hope that’s helpful!
For finding out the useful applications of Slowly Changing Dimensions I highly recommend reading The Data Warehouse Toolkit by Ralph Kimball and Margy Ross.