Generating Supernatural (Guaranteed Unique) Keys in Snowflake

Why Unique Keys?

There are about 1 billion reasons not to use natural keys (i.e. the source record identifiers like user_id or account_number) in a production data warehouse. The TL;DR version is that your warehouse is where the buck stops: as the warehouse engineer you are responsible for the validity of unique identifiers, so it only makes sense to take full ownership of them.

So how do we generate unique keys?
Snowflake supports Postgres-like sequences. A sequence is an RDBMS object that maintains a series of always-incrementing integers, so every time you ask a sequence for a number, you get one that is guaranteed to:
a. have never been given out before
b. be higher than any number given out before (though not necessarily adjacent to it; Snowflake permits gaps in sequence values)

You can create sequences just like in Postgres:

    CREATE OR REPLACE SEQUENCE your_sequence START = 1000

I like to start mine at a higher number just to make the resulting IDs easier to read.

Then, when calling the sequence, simply join it in and select the resulting nextval.

    SELECT
        h.*,
        seq.nextval AS totally_unique_key
    FROM
        public.hamburgers_table h,
        -- GETNEXTVAL is a table function; joining it in like this
        -- evaluates the sequence once per row of hamburgers_table
        table(getnextval(your_sequence)) seq
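
If you're persisting the result, the same pattern works inside an INSERT. A sketch, assuming a hypothetical target table analytics.hamburgers_dim whose columns line up with this SELECT:

    -- analytics.hamburgers_dim is a hypothetical target table
    INSERT INTO analytics.hamburgers_dim
    SELECT
        seq.nextval AS totally_unique_key,  -- warehouse-owned surrogate key
        h.*
    FROM
        public.hamburgers_table h,
        table(getnextval(your_sequence)) seq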

I’ll write more about surrogate keys vs. IDs vs. natural keys later, but for now this is how you can populate (and own) the uniqueness of your data instances.

Thanks @EthanK, this is helpful.

I find myself having to answer the question “why not natural keys” all the time, so if you get around to writing a thorough article on why they are a bad idea, it would be great.

How do you retrieve these generated sequence numbers using dbt? I have created a sequence in the Snowflake DB and I am trying to include it in a dbt model.
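
One minimal sketch (assuming the sequence your_sequence already exists in the target schema, created manually or via an on-run-start hook) is to call the sequence directly in the model's SQL, since a dbt model compiles down to a plain SELECT:

    -- models/hamburgers_with_keys.sql (hypothetical model name)
    -- Assumes your_sequence already exists; fully qualify it if it
    -- lives in a different database/schema than the model's target
    SELECT
        h.*,
        seq.nextval AS totally_unique_key
    FROM
        public.hamburgers_table h,  -- in a real project this would be a source() or ref()
        table(getnextval(your_sequence)) seq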

The problem with using sequences with dbt is that sequence key generation is not idempotent: rebuild a model and every row gets a brand-new key, so you can end up with broken references between tables.

Sequences, as I remember, also seemed pretty costly/blocking when it came to generating enough values for large tables.

We’ve moved consistently to using hash keys generated from unique business elements in the underlying data. This has the benefit of being an entirely deterministic and idempotent approach to key generation, and it is also a handy and more performant way to detect duplicate records based on combinations of attributes in incoming business data.
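
Roughly like this in Snowflake SQL (a sketch; customer_id, order_number, and order_date are hypothetical stand-ins for whatever combination of business attributes uniquely identifies a record):

    SELECT
        -- deterministic key: the same business values always hash to the
        -- same key, so rebuilding the table never breaks references
        md5(
            coalesce(customer_id::varchar, '')  || '|' ||
            coalesce(order_number::varchar, '') || '|' ||
            coalesce(order_date::varchar, '')
        ) AS order_key,
        o.*
    FROM public.orders_table o

The delimiter and coalesce() calls guard against distinct inputs concatenating to the same string (NULLs, or adjacent values running together). If you're on dbt, the dbt_utils generate_surrogate_key macro wraps up essentially this pattern.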


The only solution I came up with was to use some Jinja to select the max(warehouse_id) from the target table, then, when inserting data into the table, add that to a ROW_NUMBER() value.

It's a bit rubbish, but it seems to work OK. Not sure how scalable it becomes if you have a large number of rows in your target table.
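
For reference, roughly what that looks like as an incremental model (a sketch; loaded_at and the table names are made up):

    -- hypothetical incremental dbt model
    {{ config(materialized='incremental') }}

    SELECT
        -- offset new keys past the current max; coalesce covers the empty-table case
        {% if is_incremental() %}
        (SELECT coalesce(max(warehouse_id), 0) FROM {{ this }})
        {% else %}
        0
        {% endif %}
        + row_number() OVER (ORDER BY s.loaded_at) AS warehouse_id,
        s.*
    FROM public.hamburgers_table s
    {% if is_incremental() %}
    -- only key rows we haven't loaded yet
    WHERE s.loaded_at > (SELECT max(loaded_at) FROM {{ this }})
    {% endif %}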