Can I create an auto-incrementing ID in dbt?

Is it possible to create an auto number field while creating models?


TL;DR: No, it’s not possible to create an auto-incrementing ID in dbt.

In fact, auto-incrementing keys in general are a bad idea. The alternative is to generate a surrogate_key using a hash, and test the uniqueness of this key.

Auto-incrementing IDs in the time of ELT

This is good context for anyone who doesn’t come from an ETL background!

A long time (a few years) ago, SQL statements that were used for ETL usually had two parts – the first created the table using DDL (data definition language), and the second inserted/updated/deleted/copied data using DML (data manipulation language).

Here’s an example that should work on Redshift:

create table if not exists analytics.customer (
  customer_key     integer identity(1,1),
  customer_id      integer not null,
  name             varchar(25) not null,
  address          varchar(25) not null,
  created_at       timestamp not null default current_timestamp
);

copy analytics.customer (
  customer_id,
  name,
  address
) from 's3://<your-bucket-name>/load/key_prefix'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
;

This approach was often required because the same engineers writing the code to transform data were also loading it into the warehouse from an S3 bucket. These days, it’s a good idea to use a tool like Stitch or Fivetran to get data into your warehouse instead.

In this example, there’s an auto-incrementing key: customer_key. Often, an auto-incrementing key was used to create a primary key for a table, especially when there can be duplicate records for the same “natural key” (in this case, if a customer changes address, there will be two records with the same customer_id but different customer_keys).

Does dbt support auto-incrementing keys?

In a word: no.

Behind the scenes, if you materialize a model as a table, dbt uses a create table as statement to create that table.

Let’s say that you already have your customers table in your warehouse. You can then run a query like this to create the analytics version of customers:

create table analytics.customers as (
    select
        id,
        name,
        address,
        created_at
    from raw_jaffle_shop.customers
)

^ This is kind of a silly example because in all likelihood, you’ll be doing some transformation along the way, but it’s just an example :slight_smile:

We can do this because in dbt we assume that data is already loaded into your warehouse; this is why we refer to dbt as the T in the ELT (extract, load, transform) process.
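For reference, the dbt model behind that statement is just the select. It would look something like this (assuming you’ve defined a source named jaffle_shop that points at the raw_jaffle_shop schema):

-- models/customers.sql
-- dbt wraps this select in a "create table as" statement when the model is materialized as a table

{{ config(materialized='table') }}

select
    id,
    name,
    address,
    created_at
from {{ source('jaffle_shop', 'customers') }}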

The create table as syntax does not support auto-incrementing keys, since you don’t explicitly define the columns. Similarly, it doesn’t support column default values, constraints, or encoding.

Why use the create table as syntax then?

create table as statements are so powerful because they allow you to express your transformations as select statements – there’s no need to run two separate DDL & DML statements. In a dbt project, every model is a select – dbt handles wrapping the select statement in the appropriate DDL.

Defining every model as a select enables anyone that knows how to write these queries to own the transformation layer. We frequently see analysts with no prior knowledge of DDL and DML get up and running with dbt really quickly as a result!

Why do you say that auto-incrementing keys are a bad idea?

Auto-incrementing keys make it impossible (or at least very difficult) to maintain separate development and production versions of your transformations. This is because an auto-incrementing key is not idempotent.

If you were to create a development environment (say, a separate schema named dev_claire) and re-run all your transformations, the values of your auto-incrementing keys would differ from those in your production environment. Maintaining a robust development environment is critical for a high-functioning data team, as it allows you to work quickly and test changes before deploying them.
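As a rough sketch, that dev/prod separation typically lives in the targets of your profiles.yml (the schema and connection values here are just placeholders):

# profiles.yml (sketch: schema and connection values are placeholders)
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: redshift
      host: <your-cluster-endpoint>
      user: <your-username>
      password: <your-password>
      dbname: <your-database>
      schema: dev_claire   # development models build here
      port: 5439
      threads: 4
    prod:
      type: redshift
      host: <your-cluster-endpoint>
      user: <your-username>
      password: <your-password>
      dbname: <your-database>
      schema: analytics    # production runs build here
      port: 5439
      threads: 4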

But I need auto-incrementing keys to track versions of a record!

One of the primary use cases for auto-incrementing keys is to generate a unique “surrogate key” in a table where there may be multiple versions of the same “natural key”.

For example:

  • Your customer table has a customer_key, which is a unique, auto-incrementing key. This is required because your customer can change address over time.
  • Your order table has a foreign key to customers (also customer_key). Since you’re using customer_key, you can tell which version of the customer record was valid when the order was placed.

If you’re using an auto-incrementing key on top of mutable source data, instead consider snapshotting your source data and building models on top of your snapshots.
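If you haven’t used snapshots before, a minimal one looks roughly like this (the source and column names are illustrative, and this assumes your source table has a reliable updated_at column):

-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- capture the state of the source table each time `dbt snapshot` runs
select * from {{ source('jaffle_shop', 'customers') }}

{% endsnapshot %}

Each run records changed rows with validity timestamps, so you can reconstruct which version of a customer was current at any point in time.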

OK, but I need a primary key for my model, what are my other options?

Yup, I agree with you there! We advocate for having a primary key (i.e. unique and not null) on every model.

Every model should have some “grain” to it, i.e. what one record represents. For example:

  • customers has one record per customer
  • subscription_periods has one record per subscription, per valid period
  • ad_spend_by_campaign_by_day has one record per campaign, per day

We use the surrogate_key macro from dbt-utils to generate a primary key based on the grain of a model. For our ad_spend_by_campaign_by_day example:

select
  {{ dbt_utils.surrogate_key(
      'campaign_name',
      'date_day'
  ) }} as ad_spend_id,
  campaign_name,
  platform,
  date_day,
  spend,
  ...
  

Behind the scenes, this macro hashes the combination of columns, which means that your resulting id will always be the same for a particular record. Or, put another way, the generated ad_spend_id is now idempotent.
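Roughly speaking (the exact output depends on your dbt-utils version and your warehouse), the compiled SQL looks something like this:

-- approximate compiled SQL for the surrogate_key macro
md5(
    coalesce(cast(campaign_name as varchar), '')
    || '-' ||
    coalesce(cast(date_day as varchar), '')
) as ad_spend_id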

So how do I force my surrogate key to be unique/not null?

In DDL statements, where you define the table explicitly before loading it, you can add column constraints to your table definition:

create table if not exists analytics.customer (
  customer_id      integer not null unique,
  ...
)

Not so with the create table as syntax! So how can you enforce this constraint?

(As an aside, did you know that Redshift doesn’t actually enforce uniqueness constraints? Check out the code at the end!)

In dbt, rather than enforcing constraints on the table definition and hoping the transformation fails, we instead state our constraints as assumptions about the results of our select statements, and then test those assumptions. Here’s a test to make sure that our ad_spend_id is unique and not null:

version: 2

models:
  - name: ad_spend_by_campaign_by_day
    columns:
      - name: ad_spend_id
        tests:
          - unique
          - not_null

Behind the scenes, these tests get compiled to queries, and dbt raises an error if a query returns a count greater than 0.

-- unique test
select count(*)
from (
    select
        ad_spend_id

    from analytics.ad_spend_by_campaign_by_day
    where ad_spend_id is not null
    group by ad_spend_id
    having count(*) > 1
) validation_errors

-- not_null test 
select count(*)
from analytics.ad_spend_by_campaign_by_day
where ad_spend_id is null

We can run these tests when writing code (since we can now maintain a separate development environment thanks to our idempotent primary keys), before merging it, and after our transformations have run in production. Basically – we’re going to find out whenever this assumption doesn’t hold true anymore!
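In practice that just means running dbt test at each of those points, for example (depending on your dbt version, the model selection flag is --models or --select):

# run every test in the project
dbt test

# or just the tests defined on this model
dbt test --models ad_spend_by_campaign_by_day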

This pattern of testing is much more powerful than column-level constraints, as you can define custom tests for any constraint that can be turned into a query. It’s also much more readable IMO.
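For example, a custom data test is just a SQL file in your tests/ directory that should return zero rows. Here’s a hypothetical one asserting that spend is never negative:

-- tests/assert_ad_spend_is_non_negative.sql
-- the test fails if this query returns any rows
select *
from {{ ref('ad_spend_by_campaign_by_day') }}
where spend < 0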

Final thoughts

There are some teams using dbt who have tried to add auto-incrementing keys to their dbt models. If you go down this route, implementing dbt will likely feel like paddling upstream. To quote myself (:roll_eyes:):

dbt is definitely a different approach to data transformation, and it could potentially feel a little unusual as you first get started. We encourage those who are just picking it up to find ways to work with it instead of attempting to hack around it: you may find that your code looks a little different than you’re used to, but we think you’ll be happier for it over the long run.


Side notes

This shouldn’t execute successfully on Redshift, yet it does. Good one, Redshift! :clap:

create table jaffle_shop_dev.test_uniqueness (
  id integer unique
);

insert into jaffle_shop_dev.test_uniqueness (
    select 1
    union all
    select 1
);

select * from jaffle_shop_dev.test_uniqueness;
>>>
| id | 
|----| 
| 1  | 
| 1  | 


Thank you for the explanation @claire.

What about the guideline that fact tables should be thin, long and avoid unique strings?

Do you have benchmarks on what effect 128-bit varchar keys have on performance?

Even with column store DBs there would be many unique values and high memory usage.


@Kimcha As I understand it, the conventional wisdom against fact tables containing millions of unique varchar strings is based on storage considerations. (varchar is a byte-hungry data type, and global uniqueness means a given value can’t be stored more efficiently as a numeric factor mapped to its original string value.)

One of the premises of modern databases, in addition to their columnar structure, is that storage is cheap—much cheaper than it was a few decades ago, when many principles of database design were canonized.

In fact, we’ve seen performance gains in Redshift by creating a common surrogate key across multiple tables that we can distribute and join on, rather than distributing evenly and joining on several columns at once. It’s a varchar surrogate key, for instance, that powers our join between ad spend and session data when modeling UTM performance.

To take it a step further, we believe that storage is cheap and human intuition is expensive. To my analyst eyes, integer keys look like natural keys deriving from source databases proper; 128-bit hashes look like something produced as part of an analytics workflow. That ease of association is often worth a marginal increase in storage cost.

That said, there are functions available in most databases (Redshift, Snowflake, BigQuery) that enable you to create/convert and store your surrogate keys as integers.
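For instance (a sketch; the exact functions and signatures vary by warehouse, so check the docs), expressions like these all return integer-typed hashes:

-- BigQuery: farm_fingerprint returns an int64
farm_fingerprint(concat(campaign_name, '-', cast(date_day as string)))

-- Snowflake: hash returns a 64-bit number
hash(campaign_name, date_day)

-- Redshift: fnv_hash returns a bigint
fnv_hash(campaign_name || '-' || cast(date_day as varchar))

As with any hash, collisions are theoretically possible, so it’s still worth testing uniqueness on the result.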

(I may be missing something major, in which case I’d be curious to hear more of your take!)

Thank you for your response jerco. All of these are very interesting points.

To be honest I am not experienced enough with databases to comment on this. I only just started my DWH research. Most of what I have learned was based on Kimball’s and similar star schema books.

Either way, after working with dbt for a bit there’s no way I will go back to stored procedures and custom table creation – even if there is a decent performance hit in my environment (MS SQL Server & Power BI).

DBT is just too elegant and speaks to my software-developer background very much :slight_smile:
