Is Kimball dimensional modeling still relevant in a modern data warehouse?

Here’s my chaos modeling special blueprint:

  1. Raw - Ingest as raw with Fivetran / Stitch / Matillion / Custom Stages, making sure to have an ingestion time and pulling out expected predicate columns if they’re packed in a variant. Strongly prefer append-only ingestion when possible.
  2. Stage - All tables named STG_<SOURCE_TABLE_NAME> with the most basic cleaning - timestamp conversion, upper/lowercasing of strings, null fills, maybe some basic mapping (like mapping AWS Account Id values to human-readable names using a case statement). These are often views.
  3. Analytics - Working from the end of the pipeline, identify use case / pain points and determine the data required for this. Anything a BI tool will hit should be materialized as a table and should have human-readable names with no prefix. Schemas and databases at this level should also be human-readable with schemas split by business function or access level
  4. Curated - Iteratively apply DRY concepts to pull useful models out of individual end-table computations and into a curated layer (prepended with CUR_<TABLE_NAME>). These are a mix of views and tables and should reference only other curated tables and stage tables. Once you do this a few times, some CURATED tables become obvious to you when doing step 3 so you don’t have to go down the path of repeating yourself a ton then reworking backward.

At no point do I normalize incoming sources simply to conform to a formal data model, particularly when I will need to denormalize them by the end of the pipeline.

This approach assumes a smaller data team and a very broad community of mostly BI-focused access. It takes a centralized authority view of the world and doesn’t account for citizen data scientists. This is an approach designed primarily for enterprise clients new to the cloud.

I think formal modeling is very useful, especially in an OLTP environment, but I find that too many large companies that I’ve worked with have gotten obsessed with formal modeling and have completely neglected end-user needs. It has largely been a theatrical exercise that gives the impression of adding value, but at the end of the day we’d spend all our time accommodating some arbitrary modeling needs and end up with no real usage of the warehouse.

I definitely don’t have anything against modeling, I would just prefer to focus efforts on understanding users and delivering value early and intentionally leveraging technical debt in the warehouse in a greenfield cloud data warehouse migration. If we feel the pain of the technical debt, it’ll be due to increases in usage and requirement (ostensibly due to users gaining value) so we can address modeling needs after we’ve delivered value (and using dbt makes it easy to fix this debt). If we don’t feel the pain, then we avoided a costly modeling engagement and can spend all our cycles iterating on use cases to find ways to deliver real value.

I feel like I’m probably in the minority and I’m definitely over biased towards the kinds of projects / companies I’ve worked with.

9 Likes