Bitemporality is the tracking, in a database, of two different aspects of temporal information associated with each tuple:
- Valid Time: The time period during which a fact is true in the real world.
- Transaction Time: The time period during which a fact stored in the database was known.
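To make the two dimensions concrete, here's a sketch of what a bitemporal table looks like when both pairs of timestamps are carried explicitly. All table and column names are illustrative, not from any particular product:

```sql
-- Illustrative bitemporal table: each row carries both time dimensions.
CREATE TABLE customer_address (
    customer_id  INT,
    address      VARCHAR(200),
    valid_from   TIMESTAMP,  -- when the fact became true in the real world
    valid_to     TIMESTAMP,  -- when it stopped being true (or a far-future sentinel)
    txn_from     TIMESTAMP,  -- when the database recorded this version of the fact
    txn_to       TIMESTAMP   -- when the database superseded it
);

-- "What address did we believe, as of 2023-06-01, that customer 42 had on 2023-01-15?"
SELECT address
FROM customer_address
WHERE customer_id = 42
  AND TIMESTAMP '2023-01-15' >= valid_from AND TIMESTAMP '2023-01-15' < valid_to
  AND TIMESTAMP '2023-06-01' >= txn_from   AND TIMESTAMP '2023-06-01' < txn_to;
```

The point of the second pair of predicates is that you can reconstruct not just what was true, but what the database believed at any earlier point in time.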
I think we all understand why we need to track Valid Time; it is usually what a Type 2 SCD dimension tracks. However, it can also be helpful to track Transaction Time, to understand the lag between when a fact was (or will be) true in the real world and when we became aware of it in the database. One application I was recently exploring was understanding data latency across three steps: a) data being created or edited in upstream source systems, b) landing in our raw tables, and c) being processed via snapshots and stored in a snapshot table.
I found that I couldn’t accurately differentiate between these steps, because how dbt processes and applies timestamps varies depending on the snapshot strategy used. With the “timestamp” strategy, there doesn’t seem to be a way to answer the question “When was this row written into the snapshot?”, because the dbt_updated_at column is set to the value of the lastchanged timestamp from the upstream source system, not to the time the row was written into the snapshot.
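For reference, a timestamp-strategy snapshot is configured roughly like this (the snapshot, source, and column names here are made up for illustration):

```sql
-- snapshots/orders_snapshot.sql -- illustrative names throughout
{% snapshot orders_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='lastchanged'   -- source-provided change timestamp
    )
}}
select * from {{ source('erp', 'orders') }}
{% endsnapshot %}
```

With this configuration, dbt_updated_at and dbt_valid_from take their values from lastchanged, so the snapshot records when the source says the row changed, not when the snapshot run observed it.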
So the challenge is that dbt snapshots, in their current incarnation, don’t really differentiate between these two aspects of temporality, or they mix them depending on which snapshot strategy is used. When using the timestamp strategy, which depends on a lastchanged timestamp coming from the source database, dbt uses that source timestamp to set the various dbt timestamps in the snapshot table (e.g. dbt_updated_at, dbt_valid_from, and dbt_valid_to). When using the check strategy, however, dbt instead uses the database time at which the snapshot transaction was applied.
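For comparison, a check-strategy snapshot of the same (still illustrative) source has no source timestamp to draw on, so dbt falls back to the database clock at snapshot time:

```sql
-- snapshots/orders_check_snapshot.sql -- illustrative names throughout
{% snapshot orders_check_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='check',
      check_cols=['status', 'amount']   -- or 'all'; columns are illustrative
    )
}}
select * from {{ source('erp', 'orders') }}
{% endsnapshot %}
```

Here dbt detects changes by comparing column values, and dbt_updated_at, dbt_valid_from, and dbt_valid_to are populated from the time of the snapshot run.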
In the former case, whether the lastchanged timestamp reflects Valid Time or Transaction Time depends on what the source system is doing; it is more likely to be Transaction Time, but may not be. In the latter case of the check strategy, it is almost certainly Transaction Time. This gets even more confusing when loading historical data all at once into a snapshot. None of this is clearly documented, however. Recently, when I switched some snapshots from the check strategy to the timestamp strategy (thanks to our discovering how to get lastchanged timestamps from the upstream SaaS service that provides us data), the pattern of how the timestamps were written changed, and I could see dbt switching from using the database’s CURRENT_TIMESTAMP() (BigQuery, in this case) to using the values from the source system’s timestamp.
There is probably yet another time dimension here: the “Transaction Time” according to the source system we get the data from is likely different from the “Transaction Time” according to the database platform we load the raw data into and process it with (Snowflake or BigQuery, say). So part of my issue is also differentiating between these two Transaction Times.
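One way to keep those two Transaction Times separate is to stamp rows with the warehouse clock as they land in the raw layer, alongside the source system’s own timestamp. A sketch, with illustrative source and column names:

```sql
-- Raw-layer staging model; source and column names are illustrative.
select
  *,
  lastchanged         as source_txn_at,       -- Transaction Time per the source system
  current_timestamp() as warehouse_loaded_at  -- Transaction Time per the warehouse
from {{ source('saas_app', 'orders') }}
```

With both columns captured explicitly, the lag between the source recording a change and the warehouse receiving it becomes a simple column difference rather than something inferred from dbt’s snapshot metadata.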
Many databases, including SQL Server, Oracle, Db2, and Teradata, support temporal or bitemporal tables; Snowflake, BigQuery, and Redshift don’t appear to. I’m not arguing here for using database-engine-supported temporality with dbt, but rather giving examples to show that the concept is well established and available in commercial software. References here:
Aspects of bitemporality have also been incorporated into the ANSI SQL standard (notably SQL:2011). I would be interested in having the notion of bitemporality more clearly explained and managed in dbt snapshots, to help better manage these aspects of time. Would others be interested in the same thing?