Generating Supernatural (Guaranteed Unique) Keys in Snowflake

Why Unique Keys?

There are about 1 billion reasons not to use natural keys (i.e. the source record identifiers like user_id or account_number) in a production data warehouse. The TL;DR version is that your warehouse is where the buck stops: as the warehouse engineer you are responsible for the validity of unique identifiers, so it only makes sense to take full ownership of them.

So how do we generate unique keys?
Snowflake supports Postgres-like sequences. A sequence is an RDBMS object that maintains a series of always-incrementing integers, so every time you ask a sequence for a number, you get one that is guaranteed to:
a. have never been given out before
b. be higher than any number given out before (though not necessarily adjacent to it; Snowflake permits gaps in sequence values)

You can create sequences just like in Postgres:

    CREATE OR REPLACE SEQUENCE your_sequence START = 1000

I like to start mine at a higher number just to make the resulting IDs easier to read.

Then, when calling the sequence, simply join it in and select the resulting nextval.

    SELECT
        h.*,
        seq.nextval AS totally_unique_key
    FROM
        public.hamburgers_table h,
        -- GETNEXTVAL is a table function; joining it in like this
        -- evaluates the sequence once per row of hamburgers_table
        table(getnextval(your_sequence)) seq
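
If you're persisting the result, the same pattern works inside an INSERT. A sketch, assuming a hypothetical target table analytics.hamburgers_dim whose columns line up with this SELECT:

    -- analytics.hamburgers_dim is a hypothetical target table
    INSERT INTO analytics.hamburgers_dim
    SELECT
        seq.nextval AS totally_unique_key,  -- warehouse-owned surrogate key
        h.*
    FROM
        public.hamburgers_table h,
        table(getnextval(your_sequence)) seq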

I’ll write more about surrogate keys vs. IDs vs. natural keys later, but for now this is how you can populate (and own) the uniqueness of your data instances.

Thanks @EthanK, this is helpful.

I find myself having to answer the question “why not natural keys” all the time, so if you get around to writing a thorough article on why they are a bad idea, it would be great.

How do you retrieve these generated sequence numbers using dbt? I have created a sequence in the Snowflake DB and I am trying to include it in a dbt model.
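
One minimal sketch (assuming the sequence your_sequence already exists in the target schema, created manually or via an on-run-start hook) is to call the sequence directly in the model's SQL, since a dbt model compiles down to a plain SELECT:

    -- models/hamburgers_with_keys.sql (hypothetical model name)
    -- Assumes your_sequence already exists; fully qualify it if it
    -- lives in a different database/schema than the model's target
    SELECT
        h.*,
        seq.nextval AS totally_unique_key
    FROM
        public.hamburgers_table h,  -- in a real project this would be a source() or ref()
        table(getnextval(your_sequence)) seq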

The problem with using sequences with dbt is that sequence key generation is not idempotent: rebuild a model and every row gets a brand-new key, so you can end up with broken references between tables.

Sequences, as I remember, also seemed pretty costly/blocking when it came to generating enough values for large tables.

We’ve moved consistently to using hash keys generated from unique business elements in the underlying data. This has the benefit of being an entirely deterministic and idempotent approach to key generation, and it is also a handy and more performant way to detect duplicate records based on combinations of attributes in incoming business data.
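
Roughly like this in Snowflake SQL (a sketch; customer_id, order_number, and order_date are hypothetical stand-ins for whatever combination of business attributes uniquely identifies a record):

    SELECT
        -- deterministic key: the same business values always hash to the
        -- same key, so rebuilding the table never breaks references
        md5(
            coalesce(customer_id::varchar, '')  || '|' ||
            coalesce(order_number::varchar, '') || '|' ||
            coalesce(order_date::varchar, '')
        ) AS order_key,
        o.*
    FROM public.orders_table o

The delimiter and coalesce() calls guard against distinct inputs concatenating to the same string (NULLs, or adjacent values running together). If you're on dbt, the dbt_utils generate_surrogate_key macro wraps up essentially this pattern.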


The only solution I came up with was to use some Jinja to select the max(warehouse_id) from the target table, then, when inserting data into the table, add that to a ROW_NUMBER() value.

It's a bit rubbish, but it seems to work OK. Not sure how scalable it becomes if you have a large number of rows in your target table.
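
For reference, roughly what that looks like as an incremental model (a sketch; loaded_at and the table names are made up):

    -- hypothetical incremental dbt model
    {{ config(materialized='incremental') }}

    SELECT
        -- offset new keys past the current max; coalesce covers the empty-table case
        {% if is_incremental() %}
        (SELECT coalesce(max(warehouse_id), 0) FROM {{ this }})
        {% else %}
        0
        {% endif %}
        + row_number() OVER (ORDER BY s.loaded_at) AS warehouse_id,
        s.*
    FROM public.hamburgers_table s
    {% if is_incremental() %}
    -- only key rows we haven't loaded yet
    WHERE s.loaded_at > (SELECT max(loaded_at) FROM {{ this }})
    {% endif %}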