This is a companion discussion topic for the original entry at Surrogate keys in dbt: Integers or hashes? | dbt Developer Blog
Great article! I do have a question:
BI tools like (Azure) Analysis Services & Power BI perform best when join columns have a small datatype (like int) and memory usage is reduced as much as possible. Importing a pure star schema with hashed values as PKs/FKs into these tools wouldn't be the best solution, especially not with wide fact tables that have many dimension FKs.
Since last year, Databricks / Delta Lake supports auto-increment generated (identity) columns.
Is it possible to include such a column in the model of a dimension? I made some attempts but wasn't able to succeed. I wonder if there is a way to hook into the CREATE TABLE statement of the dimension and add a line like:
pk_customer_id BIGINT GENERATED ALWAYS AS IDENTITY
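For reference, the full DDL I would want dbt to end up producing looks roughly like this (a sketch only; the table name and the other columns are illustrative):

CREATE TABLE dim_customer (
  pk_customer_id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
  customer_bk    STRING,   -- business key from the source system
  customer_name  STRING
)
USING DELTA;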
Looking forward to your reply.
Did you ever figure anything out for this? I'm currently researching the same thing.
@mwielen @Johnny
I saw your questions when I was researching this topic and thought I would add some color on why I believe auto-incremented generated columns will also not work as a substitute for an idempotent surrogate key:
One thing that I think is easy to forget is that, from the DWH side, the human thought pattern is that tables get created once, are never dropped, and are just altered after that.
I suspect that is because creating a table by hand involves time-consuming SQL to write and execute, so the idea that we would drop and re-create a table for any reason whatsoever is hard to embrace as a task we would want to manage manually. It also likely feels familiar from larger organizations, where a DB admin often had to get involved to create a new table.
When that is the case (a table is created once by a DB admin), a strategy like a MIISK (monotonically increasing integer surrogate key) or a sequence can work, and you can at least expect that id 1 will always relate to "Bob's order", not because it's idempotent, but only because it's generated once for any given row of data.
But the dbt paradigm drops and recreates tables every time a model is run. That's the principal reason that surrogate key strategies that are not idempotent fail: since the table is recreated on every dbt run (with the exception of incremental models on non-full-refresh runs), a key generated without idempotency from the row's data gives no guarantee that id 1 will remain linked to Bob's order.
This is true for auto-incremented generated columns, MIISKs, sequences, etc. When dbt has the data warehouse rebuild a table, there is no guarantee from the DWH that the rows will be recreated in the same order with the same data. As a result, the auto-incremented generated columns, for example, will be re-created in perfect order but won't necessarily link to the same data. That's the real problem.
Even an ORDER BY change in the SQL model (or an upstream model) would result in these strategies creating row IDs linked to different data.
And the answer is not that "dbt's paradigm is wrong" just because MIISK row ID strategies don't work under it. (That's a topic for a different debate, but the disadvantages of not using dbt's paradigm far outweigh the loss of MIISK strategies as a record's primary key, in my opinion.)
The answer is that you want to put a human-readable identifier for rows in the BI tool if possible (maybe there is an integer account ID created at the source, for example), but behind the scenes, every row has a unique and idempotent surrogate key that joins and other operations are performed against, and that follows that fact or dim throughout the data journey.
To me, that's the key here (no pun intended): we need an idempotent key in the backend in order to reliably support operations against non-idempotent row information in the front end.
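As a concrete illustration, here is a minimal sketch of that in a dbt model using dbt_utils (the model and column names are placeholders):

select
    -- idempotent surrogate key: a hash of the business keys, so every rebuild
    -- produces the same key for the same row
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'order_id']) }} as order_sk,
    *
from {{ ref('stg_orders') }}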
If you want a surrogate key that results in an integer, try a function like Snowflake's md5_number_lower64() or md5_number_upper64(). The key space is limited to 64 bits (about 1.8 × 10^19 values), so that ID might only be unique in some contexts, but it could be enough to keep you in integers as surrogate keys.
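A rough sketch of that idea on Snowflake (column names are illustrative):

select
    -- lower 64 bits of the MD5 hash, returned as a NUMBER, so the surrogate key
    -- stays an integer; collisions are theoretically possible within 64 bits
    md5_number_lower64(
        coalesce(cast(customer_id as varchar), '') || '-' ||
        coalesce(cast(order_id as varchar), '')
    ) as order_sk,
    *
from {{ ref('stg_orders') }}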
Hope this helps!
Maybe dbt isn't the tool for me and my org then.
We're building curated data models using a star schema pattern in either Fabric or Databricks for consumption in Power BI further downstream. The need for integer surrogate keys is real, regardless of the origins of that pattern from a DDL perspective.
You're asserting that the answer is not that "the dbt paradigm is wrong", but maybe it is that "the dbt paradigm does not fit my use case".
Before Databricks introduced generated IDs, I did have a workaround for this that used row_number and worked OK. I may try to rework that to work with dbt and will report back if I have any success.
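For anyone curious, the shape of that workaround was roughly this (a sketch; as discussed above, the assignment is not idempotent across full rebuilds):

select
    -- integer key assigned at build time; a re-run or a change in ordering can
    -- assign the same number to different rows
    row_number() over (order by customer_bk) as pk_customer_id,
    *
from {{ ref('stg_customers') }}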
@Johnny @will-sargent_dbtlabs thanks for your thoughts. At the time of writing the initial post, I thought that having an integer surrogate key would always lead to better compression and join performance in tabular models. Thanks to Marco Russo, I have learned that the VertiPaq engine always uses a "hash" datatype for PK and FK relationship columns:
Increasing compression in Power BI and Analysis Services - Unplugged #15 - SQLBI
That said, there is no performance or storage reason to have integer surrogate keys in your star schema, as VertiPaq will convert these values to hash anyway. We did some extensive testing, and the storage consumption and performance of a star schema with hundreds of millions of records was the same whether it used integer surrogate keys or composite business keys.
Due to that we have decided not to use integer surrogate keys in our star schemas anymore.
Thanks for the share - I'll check it out.
...still makes me feel icky though
@will-sargent_dbtlabs
Currently looking into truncating the result of dbt_utils.generate_surrogate_key to fit into a NUMERIC(38).
It handles all the requirements, but I'm not sure about performance yet.
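Roughly what that could look like, assuming a Databricks / Spark SQL warehouse (an assumption on my part; column names are illustrative, and truncating the hash shrinks the key space and increases collision risk):

select
    -- keep 15 hex characters (60 bits) of the md5 hash and convert them to a
    -- decimal value, which comfortably fits in DECIMAL(38, 0)
    cast(
        conv(
            substr(md5(concat_ws('-',
                cast(customer_id as string),
                cast(order_id as string))), 1, 15),
            16, 10
        ) as decimal(38, 0)
    ) as order_sk,
    *
from {{ ref('stg_orders') }}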