I’m modeling a customer dimension table which is made from various sources
e.g. sql server, mysql and hubspot. I read Claire’s comment on using surrogate key
in dbt, but I couldn’t come up with a good surrogate key inputs for the table.
A customer data might be created at different time in different sources.
For example, at start, we have staging customer data like this.
user | mssql | mysql | hubspot
-----|-------|-------|--------
A | O | X | X
B | X | O | X
C | O | O | X
D | O | X | O
E | X | O | O
Then after time passes, other sources populate customer data also.
user | mssql | mysql | hubspot
-----|-------|-------|--------
A | O | O | O
B | O | O | X
C | O | O | X
D | O | O | O
E | O | O | O
The grain for the dimension table is 1 row 1 customer.
So, the surrogate key should be made from dbt_utils.surrogate_key(['mssql_id', 'mysql_id', 'hubspot_id', ])
.
But, this means the customer A’s key will change from “O-X-X” to “O-O-O”.
The key change would affect all fact tables downstream.
How should I implement a surrogate key for this table?
So that a row can be pinned and updated when new sources arrive.