Snapshots
Snapshots capture and track slow changes in data over time. A snapshot turns a table that receives slow in-place updates into a slowly changing dimension (SCD) Type 2 table, a concept widely used in data warehousing. Snapshots maintain a record of a row's changes over time.
Slowly Changing Dimensions (SCD) Type 2
Slowly Changing Dimensions (SCD) Type 2 is a technique used in data warehousing to handle changes in slowly changing attributes over time. It maintains a full history of changes by creating a new record whenever a dimension attribute changes.
Consider an example to illustrate how SCD Type 2 works with Quary snapshots.
Raw Table (Source or Seed)
Suppose there is the following raw customers table. It captures the current state of the data, and updates are made in place; in data warehouse terms, this is an SCD Type 1 table. For warehousing purposes, however, it's important to track changes in the data over time.
customer_id | created_at | updated_at | name | country |
---|---|---|---|---|
1 | 2024-04-17 01:00:00 | 2024-04-17 01:00:00 | John | Germany |
Initial snapshot
When the initial snapshot is run at 2024-04-17 01:00:00, Quary captures the data from the raw table and creates a snapshot table that looks like this:
customer_id | created_at | updated_at | name | country | quary_valid_from | quary_valid_to | quary_scd_id |
---|---|---|---|---|---|---|---|
1 | 2024-04-17 01:00:00 | 2024-04-17 01:00:00 | John | Germany | 2024-04-17 01:00:00 | (NULL) | 8c7804f5fcbc88d035204462ccdb79f7 |
The table includes a few additional columns:
- `quary_valid_from` represents the timestamp when the record was first inserted into the snapshot table.
- `quary_valid_to` is set to `NULL`, indicating that the record is currently valid. All records have `quary_valid_to = NULL` on the first snapshot run.
- `quary_scd_id` is a unique identifier generated for each snapshot record, composed of the `unique_key` and `updated_at` values.
Snapshot Update
Now suppose the customer's country is changed in place in the original table, which now looks like this:
customer_id | created_at | updated_at | name | country |
---|---|---|---|---|
1 | 2024-04-17 01:00:00 | 2024-04-18 01:00:00 | John | France |
Without snapshots, the previous record would be lost. When the snapshot is run again, Quary detects the change and updates the snapshot table, creating a permanent record of the change:
customer_id | name | country | updated_at | quary_valid_from | quary_valid_to | quary_scd_id |
---|---|---|---|---|---|---|
1 | John | Germany | 2024-04-17 01:00:00 | 2024-04-17 01:00:00 | 2024-04-18 13:49:16.806+01 | 8c7804f5fcbc88d035204462ccdb79f7 |
1 | John | France | 2024-04-18 01:00:00 | 2024-04-18 13:49:16.806+01 | (NULL) | 4840341e7eb62c1dc3f976a35c889588 |
- The previous record's `quary_valid_to` is updated with the timestamp of the new snapshot, marking the end of its validity period.
- A new record is inserted with the updated country, a new `quary_valid_from` timestamp, and a new `quary_scd_id`.
- The new record's `quary_valid_to` is set to `NULL`, indicating that it is the currently valid record.
This process continues with each subsequent snapshot run, capturing changes in the raw data and maintaining a historical record of those changes in the snapshot table.
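With the meta-columns in place, the state of the data at any point in time can be reconstructed. As a sketch, using the snapshot table and columns from the example above:

```sql
-- State of each customer as of 2024-04-17 12:00:00:
-- a snapshot row applies if the timestamp falls inside its validity window.
SELECT customer_id, name, country
FROM customers_snapshot
WHERE quary_valid_from <= '2024-04-17 12:00:00'
  AND (quary_valid_to IS NULL OR quary_valid_to > '2024-04-17 12:00:00');
```

Against the example data, this returns the Germany record, since the France record only becomes valid at the second snapshot run.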
Using Snapshots in Quary
To define a snapshot in Quary:
- Create a SQL file with the `.snapshot.sql` extension in the project's `models` directory, for example `customers_snapshot.snapshot.sql`.
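The file body is a plain `SELECT` over the source. A minimal sketch, assuming a hypothetical source named `raw_customers` and Quary's `q.` reference style:

```sql
-- customers_snapshot.snapshot.sql
-- Keep the query simple: select the raw source with no joins or transformations.
SELECT
    customer_id,
    created_at,
    updated_at,
    name,
    country
FROM q.raw_customers
```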
- Define the snapshot configuration in the project's schema. The configuration should include:
  - `name`: the file name of the snapshot (e.g., `customers_snapshot`)
  - `unique_key`: the primary key column or expression that uniquely identifies each record in the snapshot source table (e.g., `customer_id`)
  - `strategy`: the snapshot strategy to use. Currently, Quary only supports the `timestamp` strategy.
Here's an example snapshot configuration in the schema:
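The exact schema layout may vary by Quary version; the following YAML is a sketch consistent with the fields described above:

```yaml
# Hypothetical schema fragment for the customers_snapshot example
snapshots:
  - name: customers_snapshot
    unique_key: customer_id
    strategy:
      timestamp:
        updated_at: updated_at
```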
In this example, the strategy's `updated_at` field is set to the `updated_at` column, indicating that this column records when the source row was last updated.
Strategies
Timestamp
The `timestamp` strategy uses a timestamp column to determine whether a row has changed. When the snapshot is run, Quary compares each row's value in the specified timestamp column with the value recorded in the previous snapshot. If the timestamp is more recent, Quary closes out the existing snapshot record and inserts a new record with the updated values.
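Conceptually, the change detection resembles the following query. This is an illustrative sketch against the example tables, not Quary's actual generated SQL:

```sql
-- Source rows whose updated_at is newer than the currently valid snapshot row;
-- these are the rows the timestamp strategy would re-snapshot.
SELECT src.*
FROM raw_customers AS src
JOIN customers_snapshot AS snap
  ON snap.customer_id = src.customer_id
 AND snap.quary_valid_to IS NULL
WHERE src.updated_at > snap.updated_at;
```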
Snapshot Meta-fields
Quary adds the following meta-fields to the snapshot table to track changes over time:
Field | Meaning | Usage |
---|---|---|
`quary_valid_from` | The timestamp when this snapshot row was first inserted. | Can be used to order the different "versions" of a record. |
`quary_valid_to` | The timestamp when this row was invalidated. | The most recent snapshot record has `quary_valid_to` set to `NULL`. |
`quary_scd_id` | A unique key generated for each snapshot record. | Used internally by Quary. |
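A common downstream pattern is filtering on these meta-fields to select only the currently valid version of each record, for example in a view built on top of the snapshot:

```sql
-- Current state of each customer: exactly one row per unique_key.
SELECT *
FROM customers_snapshot
WHERE quary_valid_to IS NULL;
```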
Running Snapshots
To run snapshots in Quary, use the `quary snapshot` CLI command. This command executes the snapshot queries and updates the snapshot tables based on the changes detected. It's recommended to run snapshots at a regular interval (e.g., daily or weekly) to capture changes in the data over time. The appropriate interval depends on the specific data and business requirements, for example how frequently the source table is updated.
Best practices
Timing snapshots correctly
One of the most important best practices is to think about how quickly the dimensions change and to ensure that snapshots are run frequently enough to capture those changes. The appropriate frequency depends on the specific data and business requirements. For example, if the source table is updated daily, running snapshots weekly may not be sufficient to capture changes. In such cases, consider running snapshots more frequently, such as daily or twice daily, to help capture important changes.
While it is tempting to run snapshots frequently, it's also important to consider the performance implications of running snapshots too often. Frequent snapshots can put a strain on the database and may not be necessary if the data changes infrequently. It's a balancing act to find the right frequency that captures changes without overloading the system.
In some cases, it may well be that the source data changes too frequently to be captured by snapshots. In such cases, consider addressing the problem at the source itself.
Snapshot the data at the source
When defining a snapshot query, it's important to keep it as simple as possible and focused on capturing the source data in its rawest form. This means:
- Snapshotting the source data with minimal transformations, using `SELECT *` if performance allows. Even if a column doesn't seem useful at the moment, it may be worth including in case it becomes useful later, as it won't be possible to recreate the column retrospectively.
- Avoiding joins in the snapshot query, as they can make it challenging to build a reliable `updated_at` timestamp. Instead, snapshot the tables separately and join them in downstream models as required.
- Minimizing business logic or complex transformations in the snapshot query, as changes to the logic in the future may be difficult or impossible to apply to historical snapshots.
By keeping the snapshot query simple and focused on capturing the source data, you can ensure that the snapshots accurately represent the state of the data at each point in time and remain flexible for future analysis.
Ensure your unique_key is unique
The unique key column is used by Quary to match rows between snapshot runs, so it's important that this key is actually unique. Use tests to verify that the unique key is indeed unique in the source data.
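As a sketch, a query like the following (against the hypothetical `raw_customers` source from the example) can confirm there are no duplicate keys before relying on a snapshot:

```sql
-- Returns rows only if customer_id is NOT unique; an empty result is a pass.
SELECT customer_id, COUNT(*) AS occurrences
FROM raw_customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
```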