This is a companion discussion topic for the original entry at Surrogate keys in dbt: Integers or hashes? | dbt Developer Blog
Great article, I do have a question:
BI tools like (Azure) Analysis Services and Power BI perform best when join columns have a small datatype (like int) and memory usage is kept as low as possible. Importing a pure star schema with hashed values as PKs/FKs into these tools wouldn't be the best solution, especially not with wide fact tables that have many dimension FKs.
Since last year, Databricks / Delta Lake has supported auto-increment generated (identity) columns.
Is it possible to include such a column in the model of a dimension? I made some attempts but wasn't able to succeed. I wonder if there is a way to hook into the create table statement of the dimension and add a line like:
pk_customer_id BIGINT GENERATED ALWAYS AS IDENTITY
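For reference, the plain Databricks SQL DDL I have in mind looks roughly like this (table and column names are just examples); the open question is how to get dbt to emit it when it builds the dimension:

```sql
-- Plain Databricks / Delta Lake DDL with an identity column (illustrative names)
CREATE TABLE dim_customer (
  pk_customer_id BIGINT GENERATED ALWAYS AS IDENTITY,
  customer_id    STRING,
  customer_name  STRING
) USING DELTA;
```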
Looking forward to your reply.
Did you ever figure anything out for this? I’m currently researching the same thing.
@mwielen @Johnny
I saw your questions while researching this topic, and I thought I'd add some color on why I believe auto-incremented generated columns will also not work as a substitute for an idempotent surrogate key:
One thing I think it's easy to forget is that, from the DWH side, the human thought pattern is that tables get created once, are never dropped, and are only altered after that.
I suspect that's because writing and executing CREATE TABLE SQL by hand is time-consuming, so the idea that we would drop and re-create a table for any reason whatsoever is hard to embrace as a task we would want to manage manually. It also likely feels familiar from larger organizations, where a DB admin often had to get involved to create a new table.
When that is the case (a table is created once by a DB admin), something like MIISK or sequences can work, and you can at least expect that ID 1 will always relate to "Bob's order", not because it's idempotent, but ONLY because it's generated once for any given row of data.
But the dbt paradigm drops and recreates tables every time a model is run. That's the principal reason that key strategies that aren't idempotent fail: since the table is recreated on every dbt run (with the exception of incremental models that aren't full-refreshed), a key that isn't generated idempotently from the row's own data carries no guarantee that ID 1 will remain linked to Bob's order.
This is true for auto-incremented generated columns, MIISK, sequences, etc. When dbt has the data warehouse rebuild a table, there is no guarantee from the DWH that the rows will be recreated in the same order with the same data. As a result, the auto-incremented generated columns, for example, will be regenerated in perfect sequence, but won't necessarily link to the same data. That's the real problem.
Even an "order by" change to the SQL model (or an upstream model) would result in these strategies creating row IDs linked to different data.
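To make the contrast concrete, here is a minimal sketch of a non-idempotent key next to an idempotent one in the same model (model and column names are illustrative, and it assumes the dbt_utils package):

```sql
select
    -- Non-idempotent: the integer assigned to a given order depends on how the
    -- warehouse happens to sort the rows on this particular rebuild, so it can
    -- change on every full dbt run.
    row_number() over (order by order_date) as order_sk_not_idempotent,

    -- Idempotent: the key is a hash of the row's own natural key, so the same
    -- order always gets the same surrogate key no matter how often the table
    -- is dropped and rebuilt.
    {{ dbt_utils.generate_surrogate_key(['order_id']) }} as order_sk_idempotent,

    order_id,
    order_date
from {{ ref('stg_orders') }}
```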
And the answer is not that "dbt's paradigm is wrong" just because MIISK row-ID strategies don't work as a result. (That's a topic for a different debate, but in my opinion the disadvantages of not using dbt's paradigm far outweigh the loss of MIISK strategies as a record's primary key.)
The answer is that, where possible, you put a human-readable identifier for rows in the BI tool (maybe there's an integer account ID created at the source, for example), but behind the scenes every row has a unique and idempotent surrogate key that joins and other operations are performed against, and that follows that fact or dim throughout the data journey.
To me, that's the key here (no pun intended): we need an idempotent key in the back end in order to reliably support operations against non-idempotent row information in the front end.
If you want a surrogate key that results in an integer, try a function like Snowflake's md5_number_lower64() or md5_number_upper64(). The key space is limited to 64 bits (values up to 18,446,744,073,709,551,615), so that ID might only be unique in some contexts, but it could be enough to keep you in integers as surrogate keys.
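As a minimal sketch of what that might look like (Snowflake syntax; table and column names are illustrative):

```sql
-- Derive an integer-typed surrogate key from the lower 64 bits of an md5 hash
select
    md5_number_lower64(
        coalesce(to_varchar(order_id), '') || '-' || coalesce(to_varchar(customer_id), '')
    ) as order_sk,
    order_id,
    customer_id
from orders;
```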
Hope this helps!
Maybe dbt isn’t the tool for me and my org then.
We're building curated data models using a star schema pattern in either Fabric or Databricks, for consumption in Power BI further downstream. Integer surrogate keys are a real need for us, regardless of the origins of that pattern from a DDL perspective.
You're asserting that the answer is not "the dbt paradigm is wrong", but maybe it is "the dbt paradigm does not fit my use case".
Before Databricks introduced generated IDs, I had a workaround for this that involved using row_number and worked OK. I may try to rework that to work with dbt and will report back if I have any success.
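Roughly, the kind of pattern I mean looks like this (just a sketch, not the exact code; model and column names are illustrative, and it still renumbers everything on a full refresh, which is exactly the fragility described above):

```sql
{{ config(materialized='incremental') }}

with new_rows as (

    -- Only rows that don't already have a key in the existing table
    select customer_id, customer_name
    from {{ ref('stg_customers') }}
    {% if is_incremental() %}
    where customer_id not in (select customer_id from {{ this }})
    {% endif %}

)

select
    -- New keys continue from the current max key on incremental runs
    row_number() over (order by customer_id)
    {% if is_incremental() %}
        + coalesce((select max(pk_customer_id) from {{ this }}), 0)
    {% endif %}
        as pk_customer_id,
    customer_id,
    customer_name
from new_rows
```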
@Johnny @will-sargent_dbtlabs thanks for your thoughts. At the time of writing the initial post I assumed that having an integer surrogate key would always lead to better compression and join performance in tabular models. Thanks to Marco Russo I have learned that the VertiPaq engine always uses a 'hash' datatype in PK and FK relationships:
Increasing compression in Power BI and Analysis Services – Unplugged #15 - SQLBI
That said, there is no performance or storage reason to have integer surrogate keys in your star schema, as VertiPaq will convert these values to hash anyway. We did some extensive testing, and the storage consumption and performance of a star schema with hundreds of millions of records was the same whether we used integer surrogate keys or composite business keys.
Because of that, we have decided not to use integer surrogate keys in our star schemas anymore.
Thanks for the share - I’ll check it out.
…still makes me feel icky though
@will-sargent_dbtlabs
Currently looking into truncating the result of dbt_utils.generate_surrogate_key to fit into NUMERIC(38).
It handles all the requirements, but I'm not sure about performance yet.
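A minimal sketch of that idea (Snowflake-flavored, and assuming TO_NUMBER's hexadecimal format model; column names are illustrative): generate_surrogate_key returns a 32-character md5 hex string (128 bits), while NUMBER(38,0) tops out just under 10^38 (about 126 bits), so the sketch keeps only the first 31 hex characters (124 bits), which slightly increases the theoretical collision risk.

```sql
-- Keep the first 31 hex chars (124 bits) so the value fits in NUMBER(38,0)
select
    to_number(
        upper(left({{ dbt_utils.generate_surrogate_key(['order_id', 'customer_id']) }}, 31)),
        'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'  -- 31 hexadecimal digits
    ) as order_sk,
    order_id,
    customer_id
from {{ ref('stg_orders') }}
```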