Snapshots
Snapshots capture and track slow changes in data over time. A snapshot turns a table that receives slow in-place updates into a slowly changing dimension (SCD) Type 2 table, a concept widely used in data warehousing. Snapshots maintain a record of a row's changes over time.
Slowly Changing Dimensions (SCD) Type 2
Slowly Changing Dimensions (SCD) Type 2 is a technique used in data warehousing to handle changes in slowly changing attributes over time. It maintains a full history of changes by creating a new record whenever a dimension attribute changes.
Consider an example to illustrate how SCD Type 2 works with Quary snapshots.
Raw Table (Source or Seed)
Suppose there is the following raw customers table. It captures the current state of the data, and updates are made in place; in data warehouse terms, this is an SCD Type 1 table. For warehousing purposes, however, it's important to track changes in the data over time.
customer_id | created_at | updated_at | name | country |
---|---|---|---|---|
1 | 2024-04-17 01:00:00 | 2024-04-17 01:00:00 | John | Germany |
Initial snapshot
When the initial snapshot is run at 2024-04-17 01:00:00, Quary captures the data from the raw table and creates a snapshot table that looks like this:
customer_id | created_at | updated_at | name | country | quary_valid_from | quary_valid_to | quary_scd_id |
---|---|---|---|---|---|---|---|
1 | 2024-04-17 01:00:00 | 2024-04-17 01:00:00 | John | Germany | 2024-04-17 01:00:00 | (NULL) | 8c7804f5fcbc88d035204462ccdb79f7 |
The table includes a few additional columns:
- `quary_valid_from` represents the timestamp when the record was first inserted into the snapshot table.
- `quary_valid_to` is set to `NULL`, indicating that the record is currently valid. All records have `quary_valid_to = NULL` on the first snapshot run.
- `quary_scd_id` is a unique identifier generated for each snapshot record, composed of the `unique_key` and `updated_at` values.
Snapshot Update
Now suppose the customer's country is changed in place in the original table, which now looks like this:
customer_id | created_at | updated_at | name | country |
---|---|---|---|---|
1 | 2024-04-17 01:00:00 | 2024-04-18 01:00:00 | John | France |
Without snapshots, the previous record would be lost. When the snapshot is run again, Quary detects the change and updates the snapshot table, creating a permanent record of the change:
customer_id | name | country | updated_at | quary_valid_from | quary_valid_to | quary_scd_id |
---|---|---|---|---|---|---|
1 | John | Germany | 2024-04-17 01:00:00 | 2024-04-17 01:00:00 | 2024-04-18 13:49:16.806+01 | 8c7804f5fcbc88d035204462ccdb79f7 |
1 | John | France | 2024-04-18 01:00:00 | 2024-04-18 13:49:16.806+01 | (NULL) | 4840341e7eb62c1dc3f976a35c889588 |
- The previous record's `quary_valid_to` is updated with the timestamp of the new snapshot, marking the end of its validity period.
- A new record is inserted with the updated country, a new `quary_valid_from` timestamp, and a new `quary_scd_id`.
- The new record's `quary_valid_to` is set to `NULL`, indicating that it is the currently valid record.
This process continues with each subsequent snapshot run, capturing changes in the raw data and maintaining a historical record of those changes in the snapshot table.
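With the meta-columns in place, the state of the data at any point in time can be reconstructed. As a sketch, using the snapshot table and columns from the example above:

```sql
-- State of each customer as of 2024-04-17 12:00:00:
-- a snapshot row applies if the timestamp falls inside its validity window.
SELECT customer_id, name, country
FROM customers_snapshot
WHERE quary_valid_from <= '2024-04-17 12:00:00'
  AND (quary_valid_to IS NULL OR quary_valid_to > '2024-04-17 12:00:00');
```

Against the example data, this returns the Germany record, since the France record only becomes valid at the second snapshot run.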
Using Snapshots in Quary
To define a snapshot in Quary:
- Create a SQL file with the `.snapshot.sql` extension in the project's `models` directory, for example `customers_snapshot.snapshot.sql`.
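The file body is a plain `SELECT` over the source. A minimal sketch, assuming a hypothetical source named `raw_customers` and Quary's `q.` reference style:

```sql
-- customers_snapshot.snapshot.sql
-- Keep the query simple: select the raw source with no joins or transformations.
SELECT
    customer_id,
    created_at,
    updated_at,
    name,
    country
FROM q.raw_customers
```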
- Define the snapshot configuration in the project's schema. The configuration should include:
  - `name`: the file name of the snapshot (e.g., `customers_snapshot`)
  - `unique_key`: the primary key column or expression that uniquely identifies each record in the snapshot source table (e.g., `customer_id`)
  - `strategy`: the snapshot strategy to use. Currently, Quary only supports the `timestamp` strategy.
Here's an example snapshot configuration in the schema:
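The exact schema layout may vary by Quary version; the following YAML is a sketch consistent with the fields described above:

```yaml
# Hypothetical schema fragment for the customers_snapshot example
snapshots:
  - name: customers_snapshot
    unique_key: customer_id
    strategy:
      timestamp:
        updated_at: updated_at
```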
In this example, the strategy's `updated_at` field is set to the `updated_at` column, indicating that this column records when the source row was last updated.
Strategies
Timestamp
The `timestamp` strategy uses a timestamp column to determine whether a row has changed. When the snapshot is run, Quary compares each row's value in the specified timestamp column with the value recorded in the previous snapshot. If the timestamp is more recent, Quary closes out the existing snapshot record and inserts a new record with the updated values.
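Conceptually, the change detection resembles the following query. This is an illustrative sketch against the example tables, not Quary's actual generated SQL:

```sql
-- Source rows whose updated_at is newer than the currently valid snapshot row;
-- these are the rows the timestamp strategy would re-snapshot.
SELECT src.*
FROM raw_customers AS src
JOIN customers_snapshot AS snap
  ON snap.customer_id = src.customer_id
 AND snap.quary_valid_to IS NULL
WHERE src.updated_at > snap.updated_at;
```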
Snapshot Meta-fields
Quary adds the following meta-fields to the snapshot table to track changes over time:
Field | Meaning | Usage |
---|---|---|
`quary_valid_from` | The timestamp when this snapshot row was first inserted. | Can be used to order the different "versions" of a record. |
`quary_valid_to` | The timestamp when this row was invalidated. | The most recent snapshot record has `quary_valid_to` set to `NULL`. |
`quary_scd_id` | A unique key generated for each snapshot record. | Used internally by Quary. |
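A common downstream pattern is filtering on these meta-fields to select only the currently valid version of each record, for example in a view built on top of the snapshot:

```sql
-- Current state of each customer: exactly one row per unique_key.
SELECT *
FROM customers_snapshot
WHERE quary_valid_to IS NULL;
```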
Running Snapshots
To run snapshots in Quary, use the `quary snapshot` CLI command. This command executes the snapshot queries and updates the snapshot tables based on the changes detected. It's recommended to run snapshots at a regular interval (e.g., daily or weekly) to capture changes in the data over time. The appropriate interval depends on the specific data and business requirements, for example how frequently the source table is updated.
Best practices
Timing snapshots correctly
One of the most important best practices is to think about how quickly the dimensions change and to ensure that snapshots are run frequently enough to capture those changes. The appropriate frequency depends on the specific data and business requirements. For example, if the source table is updated daily, running snapshots weekly may not be sufficient to capture changes. In such cases, consider running snapshots more frequently, such as daily or twice daily, to help capture important changes.
While it is tempting to run snapshots frequently, it's also important to consider the performance implications of running snapshots too often. Frequent snapshots can put a strain on the database and may not be necessary if the data changes infrequently. It's a balancing act to find the right frequency that captures changes without overloading the system.
In some cases, it may well be that the source data changes too frequently to be captured by snapshots. In such cases, consider addressing the problem at the source itself.
Snapshot the data at the source
When defining a snapshot query, it's important to keep it as simple as possible and focused on capturing the source data in its rawest form. This means:
- Snapshotting the source data with minimal transformations, using `SELECT *` if performance allows. Even if a column doesn't seem useful at the moment, it may be worth including in case it becomes useful later, as it won't be possible to recreate the column retrospectively.
- Avoiding joins in the snapshot query, as they can make it challenging to build a reliable `updated_at` timestamp. Instead, snapshot the tables separately and join them in downstream models as required.
- Minimizing business logic or complex transformations in the snapshot query, as changes to the logic in the future may be difficult or impossible to apply to historical snapshots.
By keeping the snapshot query simple and focused on capturing the source data, you can ensure that the snapshots accurately represent the state of the data at each point in time and remain flexible for future analysis.
Ensure your unique_key is unique
The unique key column is used by Quary to match rows between snapshot runs, so it's important that this key is actually unique. Use tests to verify that the unique key is indeed unique in the source data.
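As a sketch, a query like the following (against the hypothetical `raw_customers` source from the example) can confirm there are no duplicate keys before relying on a snapshot:

```sql
-- Returns rows only if customer_id is NOT unique; an empty result is a pass.
SELECT customer_id, COUNT(*) AS occurrences
FROM raw_customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
```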