Slowly Changing Dimensions using dbt snapshots

Sarath Dhulipalla
6 min readDec 30, 2023

--

The picture was created with Bing Image Creator.

As data engineers, we are often tasked with designing and implementing data warehouses that can provide accurate and timely insights into our business. One key challenge in designing data warehouses is managing changes to data over time. Slowly changing dimensions (SCDs) are a data warehousing concept that helps us effectively manage these changes.

There are three main types of SCDs — Type 1, Type 2 and Type 3. There are a couple of more types — Type 4 and Type 6, which are infrequently used.

Let’s explore Different Types of Slowly Changing Dimensions

Type 1: Overwriting the Past

Type 1 SCD is straightforward. It involves overwriting existing data with new information, retaining only the current version of the data. Historical changes are not preserved, simplifying the dataset but at the cost of losing historical context.

Type 2: Preserving Historical Records

Type 2 SCD is widely adopted for its balance between complexity and practicality. When there’s a change in one or more attributes, a new record is added to the dataset. Two additional columns, start_date and end_date, manage the changes. The old record is ‘end dated,’ indicating its validity period, while the new record is inserted with a future end date, preserving historical changes for analysis.

Type 3: Selective Historical Tracking

Type 3 SCD is a compromise between simplicity and historical context. Instead of adding new rows, this method introduces new columns to track changes. However, it only maintains the current and immediate previous state of selected attributes. This approach is useful when detailed historical data is not necessary for all columns.

Choosing the Right Approach

Each SCD type serves a specific purpose, catering to different data management needs. The choice depends on the project requirements, business goals, and the level of historical granularity required. Understanding the nuances of each type is crucial for data engineers, ensuring they select the right method to maintain data integrity and relevance over time.

Now that we have explored the intricacies of Slowly Changing Dimensions (SCD) types, let’s delve into the practical implementation of the most commonly used method, SCD Type 2, using dbt (data build tool). dbt’s intuitive approach makes handling these complex transformations surprisingly straightforward.

Understanding the SCD Type 2 implementation in dbt

In dbt, SCD Type 2 models are aptly referred to as “snapshots”. These snapshots essentially capture the historical changes in the dataset, employing a strategy that distinguishes between different versions of the same record over time.

What sets dbt apart is its user-friendly approach to building snapshots. dbt abstracts away the complexity, allowing data engineers to focus on the core logic. Here’s how it works:

1. Mention the Strategy and Columns

To build an SCD Type 2 snapshot model in dbt, you start by specifying the strategy you want to use. dbt offers two options:

  • Timestamp Strategy: This method requires a timestamp column (e.g., updated_at) in the source data. dbt uses this timestamp to track changes over time, ensuring that historical versions are correctly preserved.
  • Check Strategy: When a reliable timestamp is absent, you use the Check strategy. In this approach, dbt compares a list of columns that you provide to detect changes. If any of these columns’ values changed since last time they were loaded, dbt recognizes it as a change and creates a new record.

2. Streamlined Implementation

Once you define the strategy and columns, dbt takes care of the heavy lifting. It automatically generates the necessary SQL queries to handle the historical tracking of records. This automation significantly reduces the manual effort required, allowing data engineers to focus on other critical aspects of the project.

Here is an example of how you would do this implementation in dbt

Let say we are tracking changes on orders table in the jaffle shop schema. Here are a few sample records

Here is how the orders_snapshot table looks like

The two additional columns valid_from and valid_to represent the validity period of the data.

Now lets say status on order_id 123 changed from pending to shipped here is how the data looks in the source and the snapshot tables

As you can see, the order_id 123 record with pending status has been end dated and a new record has been created.

Now let’s look at the implementation of dbt model

First you would define your source as yml

version: 2
sources:
- name: jaffle_shop
tables:
- name: orders
columns:
- name: order_id
- name: order_description
- name: quantity
- name: order_status

and then the snapshot model

{% snapshot orders_snapshot %}

{{
config(
target_database='target_db',
target_schema='jaffle_shop_snapshots',
unique_key='order_id',

strategy='check',
check_cols=['order_status', 'quantity'],
)
}}

select * from {{source('jaffle_shop','orders')}}

{% endsnapshot %}

If you are looking for more steps, there are none. That is all. it’s that simple. dbt essentially made it that easy to implement a SCD type 2 table or a snapshot table as dbt calls it.

Armed with the knowledge gained in this article, you’re ready to track changes in your data with the power of SCD Type 2 tables. But for those who yearn to peek behind the curtain of magic, the journey continues…

What’s being done under the hood?

First dbt creates a temporary table

With target as 
(
SELECT
order_id
, order_description
, quantity
, order_status
, valid_from
, valid_to
FROM orders_snapshot
WHERE valid_to is NULL
)
, source as
(
SELECT
order_id
, order_description
, quantity
, order_status
, CURRENT_TIMESTAMP as valid_from
, NULL as valid_to
FROM source_table
)

, inserts as
(
SELECT
s.order_id
, s.order_description
, s.quantity
, s.order_status
, s.valid_from
, s.valid_to
, 'insert' as change_type
FROM source s
left join target t
ON s.order_id = t.order_id
WHERE t.order_id is null
)
, updates as
(
SELECT
t.order_id
, t.order_description
, t.quantity
, t.order_status
, t.valid_from
, s.valid_from as valid_to
, 'update' as change_type
FROM source s
INNER JOIN target t
ON s.order_id = t.order_id
AND t.valid_to is NULL
WHERE t.quantity <> s.quantity
OR t.order_status <> s.order_status
)

, all_changes
(
SELECT * FROM inserts
UNION ALL
SELECT * FROM updates
)

INSERT INTO changes_stage_table AS
SELECT * FROM all_cahnges;

and then it merges the data into the final table.

below is the final merge statement. Please note that there are two more metadata columns dbt adds to the final table. dbt_updated_at and dbt_scd_id which we did not talk about.

dbt_updated_at is self explanatory. It is the time when the record was updated. It is has no business meaning. It is pure timestamp when the record was either inserted or updated.

dbt_scd_id is a surrogate key dbt generates for each record. dbt uses this key to join to the target table. This key is generated based on the the unique key we specified.

MERGE INTO
orders_snapshot AS DBT_INTERNAL_DEST
USING
changes_stage_table AS DBT_INTERNAL_SOURCE
ON
DBT_INTERNAL_SOURCE.dbt_scd_id = DBT_INTERNAL_DEST.dbt_scd_id
WHEN MATCHED AND DBT_INTERNAL_DEST.dbt_valid_to IS NULL AND DBT_INTERNAL_SOURCE.dbt_change_type IN ('update', 'delete') THEN UPDATE SET dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to
WHEN NOT MATCHED
AND DBT_INTERNAL_SOURCE.dbt_change_type = 'insert' THEN
INSERT
(`order_id`,
`order_description`,
`quantity`,
`order_status`,
`dbt_updated_at`,
`dbt_valid_from`,
`dbt_valid_to`,
`dbt_scd_id`)
VALUES
(`order_id`, `order_description`, `quantity`, `order_status`, `dbt_updated_at`, `dbt_valid_from`, `dbt_valid_to`, `dbt_scd_id`)

A final note.

While we’ve explored the mechanisms for handling new insertions and updates within dbt snapshots, there will be scenarios of record deletions as well. To ensure data integrity and consistency, it’s crucial that our snapshots accurately reflect these removals from the source.

dbt empowers us to gracefully handle these scenarios through the invalidate_hard_deletes parameter. This boolean flag, when set to True within your snapshot configuration, activates dbt's ability to track and invalidate deleted records.

In the forthcoming article, we’ll delve into the intricacies of implementing an SCD Type 3 model within the dbt framework. Stay tuned…

--

--