Is Kimball dimensional modeling still relevant in a modern data warehouse?

We see this discussion a lot on dbt Slack, so I’d love to move it to a Discourse post! Comment below with your thoughts!

5 Likes

I have a lot of thoughts from my anecdotal experience and knowledge of redshift internals, but would love to rigorously test them. If you or your company would like to sponsor an analysis of performance using snowflake / star / denormalized across the big 3 warehouse solutions, get at me (I don’t want to have to cover the redshift / bigquery / snowflake compute costs on my own).

Hey, so I guess I’d like to give some context and get a feel for what other people are doing. These are the “denominations” of data people that I know about. Feel free to add.

  • Inmon - Bill Inmon
  • Data Vault 2.0 - Dan Linstedt
  • Kimball - Ralph Kimball
  • Functional Kimball - Maxime Beauchemin
  • Agile Kimball - Lawrence Corr
  • Denormalised Tables

My org is looking at a greenfield implementation, so I'm pretty attracted to Maxime's Functional Kimball approach, but I hear lots of merit in just making wide tables.

The challenge with wide data is that it's hard to contextualise a 500-column-wide table properly, and there is no recognised "denomination" in which I can train my analysts in a consistent fashion and recruit already-skilled applicants. I don't want to advertise for a "data munger"…

I'll add Anchor Modeling, proposed by Lars Rönnbäck, to this as well as a method of Temporal Dimensional Modeling.

So in Kimball modeling there were some assumptions made in 1996 that just aren't true in 2019:

  1. Databases are slow and expensive
  2. SQL language is limited
  3. You can never join fact tables because of one-to-many or many-to-one joins
  4. Businesses are slow to change

A lot has happened:

  1. Databases are now fast and cheap
  2. SQL has evolved to include many more date functions and window functions that cut down on the ETL work required (see the sketch after this list)
  3. Looker solved this with Symmetric Aggregates
  4. Business doesn’t wait for you to finish your perfect DWH
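
To make point 2 concrete, here's a minimal sketch of the kind of running calculation that used to require a separate ETL pass or a self-join but is now a single window function. The table and column names (orders, customer_id, order_date, amount) are made up for illustration:

```sql
-- Running total and order sequence per customer, computed inline
-- with window functions instead of a dedicated ETL step.
select
    customer_id,
    order_date,
    amount,
    sum(amount) over (
        partition by customer_id
        order by order_date
        rows between unbounded preceding and current row
    ) as running_total,
    row_number() over (
        partition by customer_id
        order by order_date
    ) as order_sequence
from orders
```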

So I respect the work of Kimball, and there are some great tips and techniques in the DWH Toolkit that I use every day. But I see too many "over-engineered" data warehouses that don't need to be that complex and don't encompass new things happening in the business.

2 Likes

The “core” of Kimball to me is based on a few ideas:

  1. Users will need to query data in any number of different ways, but fully normalized data models are way too complex for non-developers to deal with. So we need a simpler way to model data for analytics.
  2. We need to think about dimensions in a conformed way. If you are slicing a bunch of facts by, say, customer, it needs to be the same, common list of customers no matter how many different things you slice it by (see the sketch after this list). Otherwise users across the business eventually start getting inconsistent results and then everybody loses faith in the accuracy of the data in the warehouse. And no, denormalizing all of your dimensions down into every fact is not a good solution when you have 30 facts and 30 dimensions in a DW of respectable size.
  3. We need to model data in such a way that when we add new data it doesn’t break existing reports and analytics. This means making sure we think thoroughly about the grain of all things ahead of time. If we change the existing columns or the grain, especially the grain, it’s going to break a lot of existing stuff. Well, maybe you can change things if you work in a small shop, but in an Enterprise Data Warehouse with hundreds or thousands of users, getting everybody to change their existing analytics takes literally years.
  4. But, after #3, we are also going to constantly find new data and reports. So we have to make sure that we have modeled in a way that allows us to add new data.
  5. We need a consistent conceptual way to get people to think about events we analyze (facts) versus the things we are analyzing them by, so that we aren't reinventing the wheel every time we go to build data models. I can tell every time I see a DW or analytics solution built by somebody who doesn't understand this and doesn't follow some approach (whether Kimball or Inmon), because it's invariably some crazy denormalization approach that is internally inconsistent and impossible to explain in an organized way. Or even worse, they've just imported all the fully normalized datasets from the source and pointed the reports right at the source tables.
  6. Users will invariably want to use multiple different BI and query tools to access the same dataset. And you can try to herd them in a certain direction (like Looker) but you’ll never get them all. So the DW should be built so that, to the best extent possible, users can still query with the tool of their choice. Which means not relying on BI tools, like Looker, to perform major further transformations of the data in the BI layer. Minor transformations, documentation, labels, etc.? Sure.
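
To make the conformed-dimension idea in point 2 concrete, here is a minimal sketch with hypothetical names (dim_customer, fct_orders, fct_support_tickets, customer_key). Both facts join to the same customer dimension, so "revenue by customer" and "tickets by customer" are sliced by the exact same list of customers:

```sql
-- fct_orders: one row per order; fct_support_tickets: one row per ticket.
-- Both reference the same conformed dim_customer via customer_key, so every
-- report that slices by customer uses one shared definition of "customer".
select c.customer_name, sum(o.order_amount) as total_revenue
from fct_orders o
join dim_customer c on o.customer_key = c.customer_key
group by c.customer_name;

select c.customer_name, count(*) as ticket_count
from fct_support_tickets t
join dim_customer c on t.customer_key = c.customer_key
group by c.customer_name;
```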

Database performance is much better now than it was when the Kimball books were written, but Kimball's model of conformed dimensional design was not primarily concerned with database performance. I will admit it is helpful in managing performance, and he discusses performance and storage space, but that was a secondary benefit; regardless, the primary use cases for it are still relevant.

22 Likes

Here’s my chaos modeling special blueprint:

  1. Raw - Ingest as raw with Fivetran / Stitch / Matillion / Custom Stages, making sure to have an ingestion time and pulling out expected predicate columns if they’re packed in a variant. Strongly prefer append-only ingestion when possible.
  2. Stage - All tables named STG_<SOURCE_TABLE_NAME> with the most basic cleaning - timestamp conversion, upper/lowercasing of strings, null fills, maybe some basic mapping (like mapping AWS Account ID values to human-readable names using a case statement). These are often views (see the sketch after this list).
  3. Analytics - Working from the end of the pipeline, identify use cases / pain points and determine the data required for them. Anything a BI tool will hit should be materialized as a table and should have human-readable names with no prefix. Schemas and databases at this level should also be human-readable, with schemas split by business function or access level.
  4. Curated - Iteratively apply DRY concepts to pull useful models out of individual end-table computations and into a curated layer (named CUR_<TABLE_NAME>). These are a mix of views and tables and should reference only other curated tables and stage tables. Once you've done this a few times, some curated tables become obvious while doing step 3, so you don't have to repeat yourself a ton and then rework backward.
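
As a hedged sketch of what one of those stage views might look like in dbt, assuming a hypothetical raw AWS billing source loaded append-only with an ingestion timestamp and a Snowflake-style VARIANT payload (the source, columns, and account ID values here are all made up):

```sql
-- STG_AWS_BILLING: light cleaning only, materialized as a view.
{{ config(materialized='view') }}

select
    _ingested_at                                as ingested_at,
    payload:account_id::string                  as account_id,
    -- basic mapping of account IDs to human-readable names
    case payload:account_id::string
        when '111111111111' then 'prod'
        when '222222222222' then 'dev'
        else 'other'
    end                                         as account_name,
    upper(payload:service::string)              as service,
    coalesce(payload:unblended_cost::float, 0)  as unblended_cost,
    payload:usage_start_date::timestamp         as usage_start_at
from raw.aws_billing
```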

At no point do I normalize incoming sources simply to conform to a formal data model, particularly when I will need to denormalize them by the end of the pipeline.

This approach assumes a smaller data team and a very broad community of mostly BI-focused access. It takes a centralized authority view of the world and doesn’t account for citizen data scientists. This is an approach designed primarily for enterprise clients new to the cloud.

I think formal modeling is very useful, especially in an OLTP environment, but I find that too many large companies that I’ve worked with have gotten obsessed with formal modeling and have completely neglected end-user needs. It has largely been a theatrical exercise that gives the impression of adding value, but at the end of the day we’d spend all our time accommodating some arbitrary modeling needs and end up with no real usage of the warehouse.

I definitely don't have anything against modeling; I would just prefer to focus efforts on understanding users, delivering value early, and intentionally leveraging technical debt in the warehouse in a greenfield cloud data warehouse migration. If we feel the pain of the technical debt, it'll be due to increases in usage and requirements (ostensibly because users are gaining value), so we can address modeling needs after we've delivered value (and using dbt makes it easy to fix this debt). If we don't feel the pain, then we avoided a costly modeling engagement and can spend all our cycles iterating on use cases to find ways to deliver real value.

I feel like I'm probably in the minority, and I'm definitely over-biased towards the kinds of projects / companies I've worked with.

8 Likes

I prefer a combination of two approaches:

  1. Some consistent implementation of a normalized model, e.g. any of the ones discussed above.
  2. Flat, denormalized tables built on top which contain the most commonly used facts and dimensions.

In terms of ownership: 1. is managed by data engineering or by more engineering-focused analysts, and 2. is managed by a committee of analysts or by a senior analyst. The flat table should aim to cover 80% of use cases, or common queries - for example users, orders, key events, etc.

For the remaining 20% of use cases, or for new pieces of data and analysis, the analyst needs to query the tables built in 1. Once an edge case becomes a common query, the facts and dimensions should be added into the flat table.
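
As a rough sketch of what approach 2 might look like, assuming hypothetical fct_orders, dim_customers, and dim_products models built in layer 1:

```sql
-- orders_wide: a flat, denormalized table covering the most common
-- order-related questions (the 80% of use cases).
select
    o.order_id,
    o.ordered_at,
    o.order_amount,
    c.customer_name,
    c.customer_segment,
    p.product_name,
    p.product_category
from fct_orders o
left join dim_customers c on o.customer_key = c.customer_key
left join dim_products p on o.product_key = p.product_key
```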

That’s a very interesting approach. For approach 2, presumably you still have keys to the dimensions from which those attributes are pulled to produce the flat tables?