dbt, Bitemporality, and Snapshots

Bitemporality is the tracking, in a database, of two different aspects of temporal information associated with each tuple:

  1. Valid Time: The time period during which a fact is true in the real world.
  2. Transaction Time: The time period during which a fact stored in the database was known.
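
To make the two timelines concrete, here is a minimal sketch of what a bitemporal table might look like; the table and column names are illustrative, not any particular standard:

```sql
-- Illustrative bitemporal layout (names are made up for this sketch).
create table customer_address_history (
    customer_id    integer,
    city           varchar,
    -- Valid Time: when the fact is true in the real world
    valid_from     timestamp,
    valid_to       timestamp,
    -- Transaction Time: when the database stored / knew about the fact
    recorded_from  timestamp,
    recorded_to    timestamp
);
```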

I think we all understand why we need to track the notion of Valid Time; Valid Time is usually what is being tracked in a Type 2 SCD dimension. However, it can also be helpful to track Transaction Time, to understand the difference or lag between when a fact was (or will be) true in the real world and when we became aware of it in the database. One application I was recently exploring was trying to understand data latency across three steps: a) data being created/edited in upstream source systems, b) landing in our raw tables, and c) being processed via snapshots and stored in a snapshot table.
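
As a sketch of the kind of query I had in mind (everything here is hypothetical: source_updated_at stands for the last-changed timestamp from the source system, _loaded_at for whenever the EL tool wrote the row into the raw table, and dbt_snapshot_at for a snapshot-run timestamp you would have to record yourself), in BigQuery-flavoured SQL:

```sql
-- Hypothetical latency breakdown per snapshot row (BigQuery syntax).
select
    timestamp_diff(_loaded_at, source_updated_at, SECOND)      as source_to_raw_lag_s,
    timestamp_diff(dbt_snapshot_at, _loaded_at, SECOND)        as raw_to_snapshot_lag_s,
    timestamp_diff(dbt_snapshot_at, source_updated_at, SECOND) as total_lag_s
from {{ ref('orders_snapshot') }}  -- hypothetical snapshot model
```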

I found that I couldn’t accurately differentiate between these steps, because the details of how dbt derives and applies timestamps seem to vary depending on the snapshot strategy used. With the “timestamp” (last-changed) strategy there doesn’t seem to be a way to answer the question “When was this row written into the snapshot?”, because the dbt_updated_at column is set to the value of the last-changed timestamp from the upstream source system, not the timestamp at which the row was written into the snapshot.

So the challenge is that dbt snapshots, in their current incarnation, don’t really differentiate between these aspects of temporality; or rather, they mix them depending on which snapshot approach is used. When using the timestamp strategy, which depends on a last-changed timestamp coming from the source database, dbt uses that source timestamp to set the various dbt timestamps in the snapshot table (e.g. dbt_updated_at, dbt_valid_from, and dbt_valid_to). However, when using the check strategy, dbt instead uses the database time at which the snapshot was applied.
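
As an illustration of the two behaviours described above, here are two minimal, hypothetical snapshot definitions; the source and column names are made up:

```sql
-- Timestamp strategy: dbt_updated_at / dbt_valid_from are taken from the
-- configured updated_at column ("lastchanged"), i.e. a source-system time.
{% snapshot customers_ts_snapshot %}
{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='lastchanged'
) }}
select * from {{ source('crm', 'customers') }}
{% endsnapshot %}

-- Check strategy: with no source timestamp to lean on, dbt_updated_at /
-- dbt_valid_from are set from the warehouse clock when the snapshot runs.
{% snapshot customers_check_snapshot %}
{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='check',
    check_cols=['city', 'status']
) }}
select * from {{ source('crm', 'customers') }}
{% endsnapshot %}
```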

In the former case, whether the last-changed timestamp reflects Valid Time or Transaction Time depends on what the source system is actually doing; it is more likely to be Transaction Time, but it may not be. In the latter case of the check strategy, it is almost certainly Transaction Time. This gets even more confusing when loading historical data all at once into a snapshot. None of this is clearly documented, however. Recently, when I switched some snapshots from the “check” strategy to the “timestamp” strategy (thankfully, we figured out how to get last-changed timestamps from the upstream SaaS service that is providing us data), the pattern of how the timestamps were written changed: I could see dbt switch from using the database’s CURRENT_TIMESTAMP() (BigQuery in this case) to using the values of the source system timestamp.

There is probably yet another time dimension to this, in that the “Transaction Time” as per the source system we are getting the data from differs from the “Transaction Time” as per the database platform that we load the raw data into and process with (Snowflake or BigQuery, say). So perhaps part of my issue is also differentiating between these two Transaction Times.

Many databases, such as SQL Server, Oracle, Db2, and Teradata, support temporal (and in some cases bitemporal) tables; Snowflake, BigQuery, and Redshift don’t appear to. I’m not making an argument here for actually using database-engine-supported temporality with dbt, but rather giving examples to show that the concept is well established and available in commercial software (a small SQL Server illustration follows the reference list). References here:

  1. Temporal database - Wikipedia
  2. https://docs.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables
  3. IBM Documentation
  4. Oracle 12c Temporal Database | by Eye On | Eye on Databases
  5. Teradata Online Documentation | Quick access to technical manuals
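
For illustration, here is roughly what a system-versioned (i.e. transaction-time) table looks like in SQL Server. Note that this covers only one of the two timelines; the valid-time half of a full bitemporal design is handled differently by each vendor (e.g. Db2’s BUSINESS_TIME periods).

```sql
-- SQL Server system-versioned table: SysStartTime/SysEndTime are maintained
-- by the engine and record the transaction-time period of each row version.
CREATE TABLE dbo.Customer (
    CustomerId   INT PRIMARY KEY,
    City         NVARCHAR(100),
    SysStartTime DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime   DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerHistory));

-- Ask what the table looked like at a point in the transaction timeline:
SELECT * FROM dbo.Customer
FOR SYSTEM_TIME AS OF '2023-01-01T00:00:00';
```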

Also, aspects of bitemporality have been incorporated into the ANSI SQL standard (notably the system-versioned and application-time period tables added in SQL:2011). I would be interested in having the notion of bitemporality more clearly explained and managed in dbt snapshots, to help better manage these aspects of time. Would others be interested in the same thing?


I know this doesn’t address your larger questions around bitemporal modeling, but I’ve recently started adding my own CURRENT_TIMESTAMP()::TIMESTAMP AS dbt_snapshot_at column to all of my snapshots to track when the dbt snapshot was taken.
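
For anyone wanting to copy the idea, a minimal sketch of a snapshot with such a column might look like this; the snapshot and source names are made up, and the ::timestamp cast is Snowflake/Postgres-flavoured:

```sql
{% snapshot orders_snapshot %}
{{ config(
    target_schema='snapshots',
    unique_key='id',
    strategy='check',
    check_cols='all'
) }}

select
    *,
    -- evaluated by the warehouse when the snapshot runs, so it records
    -- (approximately) when this row version was written into the snapshot
    current_timestamp()::timestamp as dbt_snapshot_at
from {{ source('app', 'orders') }}

{% endsnapshot %}
```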

Really good post, Josh. It is absolutely true that the whole SQL ecosystem does a really bad job of managing the temporal aspects of data.

The reasons for this are many. The concept of transactions really messes up natural notions of time, so that does not help. Just look at e.g. statement_timestamp(), transaction_timestamp(), and clock_timestamp() in Postgres, and while doing that, note that they don’t offer a commit_timestamp() because that is “tricky” (although Spanner manages it with PENDING_COMMIT_TIMESTAMP(), albeit not as a value readable within the transaction itself, of course). So that means we have four DBMS timestamps before we even start looking at application-generated timestamps. Typically those will be set close to, but rarely identical to, one of the DBMS timestamps, and so can also be considered a strand of “transaction/recorded time”.
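
A quick Postgres illustration of how those three differ inside a single transaction (nothing here is dbt-specific):

```sql
begin;
-- transaction_timestamp() is frozen at BEGIN,
-- statement_timestamp() is frozen at the start of each statement,
-- clock_timestamp() keeps moving even within a statement.
select transaction_timestamp(), statement_timestamp(), clock_timestamp();
select pg_sleep(2);
select transaction_timestamp(), statement_timestamp(), clock_timestamp();
-- the second select shows transaction_timestamp() unchanged while the
-- other two have advanced; the commit time itself is later still.
commit;
```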

Other reasons include temporal being a bit of a backwater in database research. Darwen had to badger Date for many years to get him to take the topic seriously, and even then they focused more on the consequences of interval (aka range) values/types for the relational model than on the arrow of time itself. Nor does their ambivalence towards transactions help (see page 447 of the 2nd edition of Time and Relational Theory, where they try and fail to answer their own question on how transaction timestamps might be implemented in a real system). Other, more mainstream(?) academic work such as Snodgrass’s notwithstanding, this whole area is under-researched IMHO.

And I would say that most of the work on temporal tables in SQL DBMSes has been focused on OLTP use cases, rather than analytical ones such as ETL-populated “type-2” dimensions. Maybe this is less the case with Teradata’s implementation, but that is not something I have worked with myself.

Now, Josh, you said at the start that “Valid Time is usually what is being tracked in a Type 2 SCD dimension”. I would disagree. Almost always those tables hold one kind of transaction/recorded time, i.e. what did the customer record (that we extracted from the source) look like yesterday, a week ago, and so on. Very rarely are business-time columns used. E.g. if a customer says “I moved city last month” or “I’m moving city next month”, the SCD won’t be back- or forward-dated to reflect the “true in the real world” timeline. We don’t want our historic reports changing just because a customer “forgot” to tell us something in due time.

Actually, the “bi-temporal” (and “tri-temporal”!) stuff is all an unimportant distraction. Anything other than the recorded/transaction timeline is just normal columns. When a customer says something happened in the real world, or when a fraud investigator says something actually happened, those are just normal business data items. Trying to make them special under the “bi-temporal” (or should that be n-temporal?) banner obscures the more important matter of the serializable nature of data recording.

So what to do? I’m sure dbt could pay more attention to this. Even simple things, such as recording dbt_snapshot_at as @gnilrets suggests, would help. Leveraging Time Travel facilities (where available/performant) to ensure dbt knows what the snapshot point is in the local warehouse timeline would also help. Understanding what any timestamps coming from the source systems actually represent would be better still (e.g. CDC’ed sources can sometimes include a true(ish) source commit timestamp). Still, I’m not sure dbt really works at a high enough level of abstraction to incorporate such matters.
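
To sketch the Time Travel point: on Snowflake, for example, a snapshot run could read its source at one explicit point in the warehouse’s own timeline and record that point alongside the rows. The table name and timestamp below are made up; this is just the shape of the idea, not something dbt does today.

```sql
-- Read the raw source as of a fixed point in the warehouse timeline
-- (Snowflake Time Travel), so the snapshot point is explicit and repeatable.
select
    *,
    '2024-01-15 06:00:00'::timestamp_ltz as snapshot_point_at
from raw.crm.customers
  at (timestamp => '2024-01-15 06:00:00'::timestamp_ltz);
```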
