<< Click to Display Table of Contents >> Navigation: 5. Detailed description of the Actions > 5.24. TA - Data Engineering > 5.24.1. Historized Slowly Changing Dimensions |
Icon:
Function: SlowlyChangingDimensions
Property window:
Short description:
Create a historized view of a SCD table.
Long Description:
When engaging in predictive analytic activities, we must be able to “see” any table as it was at a specific point in time in the past (i.e. in technical terms, we must see a table at the “observation date”). Unfortunately, most operational systems do not keep an history of their tables and “going back in time” is almost never possible… …unless you use the Anatella “Slowly Changing Dimensions” Action!
This Action creates an output table that keeps tracks of all the changes made to the specified (database) table. This process is typically referred as “Keeping a historical view of a Slowly Changing Dimension”: you’ll find more details on this process here:
https://en.wikipedia.org/wiki/Slowly_changing_dimension
Typically, you run this action every day (or week) and it will look for all the changes inside an input table and logs them all inside one output table.
According to wikipedia, there exists 6 different techniques to handle slowly changing dimensions:
SCD Type |
Description |
Type - 0 |
Retaining Original: you just get one dump |
Type - 1 |
Overwriting: you overwrite the dimension with the last version |
Type - 2 |
Adding new rows (you keep history as rows) |
Type - 3 |
Adding columns (you keep history as extra columns) |
Type - 4 |
Adding new records on new table (History Table) |
Type - 6 |
Hybrid approach (Type 1 + Type 2 + Type 3) ß used here |
Here we will be using a ‘Type - 6" implementation including :
•A surrogate (or technical) key as unique id: This is a composite key composed of the columns “SCD_Key” and “SCD_Seq”.
•Active line indicator: “SCD_current_flag”
•Start date and End date: [SCD_from_date, SCD_to_date[
•Seq number of history: “SCD_Seq”
•A very basic deactive flag (only when the parameter P3 is “contains the full table”): “SCD_active”
The parameter P4 is the name of a unique key (e.g. customer key) inside the input table. This key must be unique in the input file.
Ideally, the input file should also contain a column that contains the date of the last update of a row (the date format is yyyyMMdd). The name of this “lastUpdateDate” column is given in the parameter P7. If you don’t have such a column, no worries: This is a very common situation: Just set the parameter P6 to “…does not contains a "lastUpdateTime" column (use the current date instead)”.
The output table contains these additional columns (in addition to all the columns from the input table):
SCD_Key |
Logical key of the slowly changing dimension. Equal to the key (e.g. customer key) in the input file |
SCD_seq |
The sequence of statuses… first status = 1… next update = 2 … |
SCD_from_date |
Start moment of the validity for the record (YYYYMMDD) |
SCD_to_Date |
End date of the validity (YYYYMMDD) |
SCD_current_flag |
1 if the record is the current active record 0 if the record is a historic record |
SCD_active |
This is a technical active flag only relevant when the parameter P3 is “contains the full table”): A = active D = deleted If the primary key is in the input file, then the status is set to A. If the primary key is missing in the input file the status will be ‘D’. |