FAQ: Handling timezone conversion (and timestamp truncation) in data warehouses


#1

A common question we see in the dbt community is around how people are handling timestamps in their data warehouse – more specifically:

  • whether people store timestamps in UTC or in a timezone, or both, and
  • whether people perform timestamp truncations in their warehouse

The orthodox, Kimball-style, advice on the subject is to store timestamps as both UTC and local, and store date and time as separate fields. However, when working in a modern data stack (as described here, and here), that advice feels a bit outdated for a few reasons:

  • Data warehouses have advanced so that you do not need to be concerned about the computational burden of performing timestamp truncations and/or conversions at query time
  • Often, users interacting with your models will do so through a BI tool rather than raw SQL, and any BI tool worth its salt handles timestamp truncation and timezone conversions easily.
  • This approach does not necessarily prioritize the computational burden (i.e. brain time :wink:) of analysts using the warehouse. Adding extra fields, especially for time parts, can bloat models, degrading the information architecture. Building those fields in the first place can lead to extra code, when it’s likely your BI tool can do it for you.

So, if you’re using a modern data stack, and not strictly married to Kimball style modeling, how should you handle your timestamps? Well, like a lot of things, it depends on your situation. Below is some consolidated advice based on what we’ve seen in the amazing dbt Slack¹.

If you are consistently reporting from one timezone:

  • Store timestamps in UTC
  • Perform timezone conversion in your BI tool
  • Perform timestamp truncation in your BI tool

This represents the simplest approach, and for many organizations, is the best option. Use this approach until it doesn’t work.

If you are rolling up your data by a date part (e.g. daily signups) OR using a timestamp in a join to another model:

  • If you are doing timestamp conversions and/or truncations in your select statements when aggregating data in your models², à la:
select
date_trunc('day', convert_timezone('EST', created_at)) as date_est
, count(*) as number_of_signups
from users
  • Or if you are doing timestamp conversions and/or truncations in your join conditions when joining to a table at a different grain:
select
orders.id
, calendar.business_day
from orders
left join calendar on calendar.date_est = date_trunc('day', convert_timezone('EST', orders.created_at))
  • You might consider doing this calculation in an upstream model, so that your models look like:
select
created_date_est
, count(*) as number_of_signups
from users
select
orders.id
, calendar.business_day
from orders
left join calendar on calendar.date_est = orders.created_date_est
  • If implementing this approach, it’s likely a good idea to still provide UTC timestamps alongside the target timezone timestamps in your warehouse.

If you have different timezones for different records within the same model OR the local context is important:

  • It may be useful to store timestamps in both UTC and the local timezone. Each timestamp should be suffixed _utc and _local respectively.
  • Additionally, it may be useful to store the local timezone (e.g. 'EST') for each record in a local_timezone field.
  • You may be able to use your BI tool to use the local_timezone field and the UTC timestamp to convert timestamps to the correct timezone – I recently wrote about how I approached this using Looker. For me, this negated the need to store the local timestamp.

Generally:

  • Timezones are hard. There’s no single best approach, so choose one that makes sense for your organization.
  • Consistency is key – whatever approach you do choose, use it consistently throughout all your models.
  • If you are using timezones other than UTC (or even if you are using UTC but want to be explicit), suffix (as opposed to prefix) your timestamps with the timezones. For example, the fields created_at_utc and created_at_est are preferable to utc_created_at and est_created_at. Most BI tools order things alphabetically so using suffixes keeps the two fields together in a list, improving your information architecture.

¹A massive thanks to @TimothyBurke, @ted.conbeer, Leon T, Harlan H and @mplovepop (and others I’ve lost to the Slack ether!) for contributing to the discussion.

²Generally consider whether you should be doing this pre-aggregation in your warehouse as opposed to in your BI tool. Pre-aggregating often limits the ability to explore the data, but is required in cases, for example rolling up daily revenue to compare actuals to sales targets, where sales targets are defined at a daily level.