DuckDb Guide
DuckDb is an in-process SQL database. That's a mouthful but really matters is that it is a great way to query data locally on your machine. It's fast, feature rich and has a number of plugins that make it easy to pull various sources of data, from csv
files, parquet
files, to other databases like Postgres
. In this guide, we'll show you how to get up and running with DuckDb locally with quary
and give you the basics to understand the DuckDB template. We'll show you how to add data, transform it and test it.
Installation
DuckDb is bundled into Quary, so you don't need to install anything else other than Quary. If you haven't already, download Quary and install it.
Once you have Quary installed, both as a CLI and as VSCode extension, you can start using DuckDb with its full capability.
Demo Project
If you want to look at the DuckDb template, you can find it here. You can also use the template by running the following command in a new folder:
This will create a new project with the DuckDb template. You can then open the project in VSCode and start working with it. Rather than do that though, this guide will walk you through the basics of DuckDb.
Starting From Scratch
Configuring DuckDb
To start from scratch, create a new folder. In that folder, let's create the config file for DuckDb. Create a new file called quary.yaml
and add the following content:
This tells Quary to use DuckDb and to store the database in a file called data.db
. You can change the path to whatever you want.
Adding source data
Now that we have the config file, let's add some data. Create a new file called people.csv
, place it in a folder called data
and add the following content:
This is a simple CSV file with three columns: id
, name
, and age
. We'll use this file as the data source in DuckDb. To do this, create a folder called models
and add a new file called sources.yaml
with the following content:
The above file names the source people
and tells Quary to use the people.csv
file as the source. It also specifies the columns and their tests. The id
column should be unique and not null, the name
column should not be null, and the age
column should not be null and greater than or equal to 0.
From the root of the project, run the following command. This will run the tests on the source and ensure that the data is valid.
At this point, your file structure should look like this and if that is the case you now have a source that you can use in DuckDb to start transforming data.
Transforming data
Now that we have the source, let's query the data. Create a new file called age_distribution.sql
in the models
folder. In this file, we will write a SQL query to get the count of the number of people in each age group (1-10, 11-20, 21-30, etc.).
That query groups the ages into different buckets and counts the number of people in each bucket. The q.people
refers to the people
source that we defined earlier. The name for a source is always q.<source_name>
where the source name is the name of the source in the sources.yaml
file. To build this query as a view in DuckDb, run the following command:
This will create the structure in the data.db
file. You can now query the data using the duckdb CLI for example.
At this point, your file structure should look like this.
Testing transformation
In addition to building the view, you can also test the transformation. Create a new file called schema.yaml
in the models
folder and add the following content:
This file specifies the tests for the age_distribution
model. The age_group
column should not be null and the count
column should not be null and greater than or equal to 0. To run the tests, run the following command:
Note that the name of the model is inferred from the file name. The file name should be the name of the model with the extension .sql
. Just like you can refer to a source using q.<source_name>
, you can refer to a model using q.<model_name>
.