surrogate key, why?

The problem I’m having

I don’t understand the need for surrogate keys via generate_surrogate_key when I already have a perfectly fine natural key (product_id, INT). I do get the need for one when a table doesn’t have a primary key: it’s nice for testing uniqueness rather than doing a “test unique combination of columns”.

My first confusion is that dbt seems to conflict with other literature.
dbt says generate_surrogate_key(natural key), while other sources say you should not do this and should instead make the surrogate key independent of the natural key.
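
For reference, this is roughly the pattern I understand the dbt docs to recommend (model and column names here are just an example):

```sql
-- Sketch of the dbt-recommended pattern (illustrative model/column names).
-- dbt_utils.generate_surrogate_key takes a list of column names and returns an
-- md5 hash of their null-handled, concatenated values, so the surrogate key is
-- entirely derived from the natural key.
select
    {{ dbt_utils.generate_surrogate_key(['product_id']) }} as product_sk,
    product_id,
    product_name
from {{ ref('stg_products') }}
```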

The second problem is that I don’t find the arguments compelling.

  1. If you have a product table and buy another company, you would need to merge the product tables of the two companies.
    Okay, so then I union my tables and add a generate_surrogate_key on that unioned model (see the sketch just after this list).
    It doesn’t help me to create a surrogate key before this is actually a problem. And since I need to make DWH changes anyway, adding it at that point is trivial; I just did the work earlier.

  2. If the source system deletes records. I’m not sure how this helps me: my surrogate key is a hash of the natural key, so I would still just be missing that record.
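
For point 1, something like this is what I have in mind (illustrative model names; I’m assuming both companies expose a product_id):

```sql
-- Sketch of the union approach from point 1 (illustrative model names).
-- Hashing a source-system label together with the natural key keeps the
-- surrogate unique even if both companies reuse the same product_id values.
with unioned as (
    select 'company_a' as source_system, product_id, product_name
    from {{ ref('stg_company_a__products') }}
    union all
    select 'company_b' as source_system, product_id, product_name
    from {{ ref('stg_company_b__products') }}
)
select
    {{ dbt_utils.generate_surrogate_key(['source_system', 'product_id']) }} as product_sk,
    source_system,
    product_id,
    product_name
from unioned
```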

The main reasons for using surrogate keys:

  1. To consolidate information from different sources, which do not share the same natural key, into the same target data set. This is not about doing the work earlier; it is about avoiding changes to downstream investments when upstream systems providing the same data context are added or replaced.

  2. To protect downstream data sets from upstream key structure changes. For this purpose, a hash of the source natural key is not a good idea, as it does not provide the intended stability (see the key-map sketch after this list).

(1) and (2) cannot be implemented simply by creating a surrogate key, although surrogate keys are an essential part of the solution. To achieve (1) and (2), you will also need to model homogeneous definitions of the data items, close to the definitions used in the analytical world.

Applying this is not about doing the work earlier. It is about anticipating upstream change that would otherwise cause instability and code impact on hundreds or even thousands of downstream pipelines or use cases. So it is about a huge saving, not about bringing forward a cost that would be there anyway.

  3. To simplify complex keys, such as composite keys.

  4. To absorb new or changed key values for the same object. Some systems update keys, unfortunately. A classic example is when part of the key is also used to order the items within a bigger set. Also, some systems create new keys for an object to complete a task or process, even though the object is not new and functionally requires key stability in the analytical context.

  5. To manage data retention policies that differ from the source, and to anticipate key-recycling mechanisms in the sources.
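
As one possible illustration of reasons 2, 4 and 5, here is a minimal sketch of a persistent key map built as an incremental dbt model. All model and column names are assumptions, and the exact SQL will vary per warehouse; the point is only that each business key is assigned an arbitrary surrogate once, on first sight, and keeps it even if the source later restructures, re-keys or recycles its natural keys.

```sql
-- Sketch of a persistent key map (illustrative names, warehouse syntax may vary).
-- Each (source_system, product_id) pair gets an integer surrogate exactly once;
-- the surrogate is NOT derived from the natural key, so upstream re-keying or
-- key recycling only requires remapping here, not rebuilding downstream tables.
{{ config(materialized='incremental') }}

with new_keys as (
    select distinct s.source_system, s.product_id
    from {{ ref('stg_products_unioned') }} as s
    {% if is_incremental() %}
    -- only business keys that are not yet in the map get a new surrogate
    where not exists (
        select 1
        from {{ this }} as m
        where m.source_system = s.source_system
          and m.product_id = s.product_id
    )
    {% endif %}
)

select
    {% if is_incremental() %}
    (select coalesce(max(product_sk), 0) from {{ this }})
        + row_number() over (order by source_system, product_id) as product_sk,
    {% else %}
    row_number() over (order by source_system, product_id) as product_sk,
    {% endif %}
    source_system,
    product_id
from new_keys
```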

The importance and relevance of the above depend a lot on your data pipeline and modelling strategies.

The extremes:

A. Each data use case has its own dedicated data pipeline: a stovepipe starting from the operational system’s data.

B. All data is correctly integrated into a data warehouse using definitions close to most of the (analytical) use cases. These definitions often differ a lot from the operational system’s raw data, which is designed to support the operational processes. Think of automating the same operational process with system vendor X or Y: the facts about the process that you want to capture for analytics will be highly stable, but the raw data from vendors X and Y will differ significantly.

The reality is most likely a mix of A and B. When thinking about developing new pipelines, or about changes requiring retesting, consider that B will probably save a lot of downstream cost, increase agility, and increase quality. When starting with A, after some time the teams will spend more and more of their time maintaining A-type pipelines. Agility and time to market will suffer because of the portion of the budget spent on maintenance. Whereas A was intended to improve time to market and cost, the gain is only short-term unless the lifecycle of the pipelines being built is short. Quality will also decrease because of missed impacts, human errors, and test scope cuts when upstream changes occur.

The question comes down to the maturity level and protection level you want to achieve with the investments in downstream data pipelines and use cases.

People see (B) as a problem when large companies implement (B) only in the corporate-level data zone. They associate (B) with the difficulties of central data structure management and its inability to delegate and create agility. This is, however, the wrong association: the delegation and scale issues of a central data warehouse should not be confused with the modelling concepts required to protect and stabilise investments.

The middle ground might be seen as a mesh-like architecture with delegated data pipeline responsibilities. Even in this context, a downstream data product will require key stability (both structure and assignment) from the upstream data products. In many scenarios this again means surrogate keys, and data definition stability instead of exposing raw data down the supply chain. Since data modelling is delegated much more, the need for definition and structure stability of each data product is even higher when individual data products rely on each other. Hence, surrogate keys and stable definitions/structures are essential, even more so in a data mesh.