Is Kimball dimensional modeling still relevant in a modern data warehouse?
We see this discussion a lot on dbt Slack, so I’d love to move it to a Discourse post! Comment below with your thoughts!
I have a lot of thoughts from my anecdotal experience and knowledge of Redshift internals, but would love to rigorously test them. If you or your company would like to sponsor an analysis of performance using snowflake / star / denormalized schemas across the big 3 warehouse solutions, get at me (I don’t want to have to cover the Redshift / BigQuery / Snowflake compute costs on my own).
Hey, so I guess I’d like to give some context and get a feel for what other people are doing. These are the “denominations” of data people that I know about. Feel free to add.
My org is looking at a greenfield implementation, so I’m pretty attracted to Maxime’s Functional Kimball approach, but I hear lots of merit in just making wide tables.
The challenge with wide data is that it’s hard to contextualise a 500-column-wide table properly, and there is no recognised “denomination” I can use to train my analysts in a consistent fashion or to recruit already-skilled applicants. I don’t want to advertise for a “data munger”…
I’ll add Anchor Modeling proposed by Lars Rönnbäck to this as well for a method of Temporal Dimensional Modeling.
So in Kimball modeling there were some assumptions made in 1996 that just aren’t true in 2019:
A lot has happened:
So I respect the work of Kimball, and there are some great tips and techniques in the DWH Toolkit that I use every day. But I see too many “over engineered” data warehouses that don’t need to be that complex and don’t encompass new things happening in the business.
The “core” of Kimball to me is based on a few ideas:
Database performance is much better now than it was when Kimball was writing, but Kimball’s model of conformed dimensional design was not primarily concerned with database performance. I will admit it is helpful in managing performance, and he does discuss performance and storage space, but that was a secondary benefit; regardless, the primary use cases for it are still relevant.
Here’s my chaos modeling special blueprint:
1. STG_<SOURCE_TABLE_NAME> - with the most basic cleaning: timestamp conversion, upper/lowercasing of strings, null fills, maybe some basic mapping (like mapping AWS Account Id values to human-readable names using a case statement). These are often views (sketch below).
2. CUR_<TABLE_NAME> - these are a mix of views and tables and should reference only other curated tables and stage tables. Once you do this a few times, some CURATED tables become obvious to you when doing step 3, so you don’t have to go down the path of repeating yourself a ton and then reworking backward.

At no point do I normalize incoming sources simply to conform to a formal data model, particularly when I will need to denormalize them by the end of the pipeline.
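For a concrete flavour of what one of those STG_ views might look like, here is a minimal sketch; every table and column name is made up for the example:

```sql
-- Hypothetical STG_ view: light cleaning only, no remodeling.
create view stg_aws_billing as
select
    cast(bill_start as timestamp)  as bill_start_at,   -- timestamp conversion
    upper(trim(service_code))      as service_code,    -- consistent casing
    coalesce(cost_usd, 0)          as cost_usd,        -- null fill
    case line_item_account_id                          -- basic human-readable mapping
        when '111111111111' then 'prod account'
        when '222222222222' then 'dev account'
        else 'unmapped'
    end                            as account_name
from raw.aws_billing_line_items;
```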
This approach assumes a smaller data team and a very broad community of mostly BI-focused access. It takes a centralized authority view of the world and doesn’t account for citizen data scientists. This is an approach designed primarily for enterprise clients new to the cloud.
I think formal modeling is very useful, especially in an OLTP environment, but I find that too many large companies that I’ve worked with have gotten obsessed with formal modeling and have completely neglected end-user needs. It has largely been a theatrical exercise that gives the impression of adding value, but at the end of the day we’d spend all our time accommodating some arbitrary modeling needs and end up with no real usage of the warehouse.
I definitely don’t have anything against modeling, I would just prefer to focus efforts on understanding users, delivering value early, and intentionally leveraging technical debt in the warehouse during a greenfield cloud data warehouse migration. If we feel the pain of the technical debt, it’ll be due to increases in usage and requirements (ostensibly because users are gaining value), so we can address modeling needs after we’ve delivered value (and using dbt makes it easy to fix this debt). If we don’t feel the pain, then we avoided a costly modeling engagement and can spend all our cycles iterating on use cases to find ways to deliver real value.
I feel like I’m probably in the minority, and I’m definitely overly biased towards the kinds of projects / companies I’ve worked with.
I prefer a combination of two approaches:
In terms of ownership: approach 1 is managed by data engineering or by more engineering-focused analysts, and approach 2 is managed by a committee of analysts or by a senior analyst. The flat table should aim to cover 80% of use cases, or common queries - for example users, orders, key events, etc.
For the remaining 20% of use cases, or for new pieces of data and analysis, the analyst needs to query the tables built in approach 1. Once an edge case becomes a common query, the facts and dimensions should be added into the flat table.
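For illustration only (every name here is hypothetical), the flat table in approach 2 can be a thin layer over the conformed tables from approach 1, with the dimension keys kept so edge cases can still join back out:

```sql
-- Hypothetical "80% of use cases" flat table built on top of the
-- fact/dim layer from approach 1; all names are illustrative.
create table analytics.orders_flat as
select
    f.order_id,
    f.ordered_at,
    f.order_amount,
    f.customer_key,              -- key retained for edge-case joins
    c.customer_name,
    c.customer_segment,
    c.signup_channel
from fct_orders f
left join dim_customers c on c.customer_key = f.customer_key;
```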
That’s a very interesting approach. For approach 2, presumably you still have keys to the dimensions from which those attributes are pulled to produce the flat tables?
Hi there, I’ve been a Kimball DW architect for a while now and have consistently built denormalized dimensional models as described in the books, without a normalized underlying layer like Inmon suggests. I’m trying to figure out if, or how, creating Kimball models is different in dbt than it is in the Microsoft world. Any thoughts?
One implication of ColumnStore indexing to me is that the cost of degenerate dimensions is reduced, which means you might consider using more of them. If you took this to the extreme it would result in just one big table (OBT), but everyone I’ve spoken to has seen that this hits a wall at scale.
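To make that concrete with a made-up example: attributes like an order number, which have no attributes of their own, can just sit on the fact table as degenerate dimensions, and cheap columnar compression makes it tempting to inline more attributes the same way (which is how you drift toward OBT):

```sql
-- Illustrative only: order_number kept on the fact table as a classic
-- degenerate dimension; order_channel inlined the same way instead of
-- being split out, since columnar storage compresses it cheaply.
create table fct_order_lines (
    order_line_key  bigint,
    customer_key    bigint,        -- still a reference to a real dimension
    product_key     bigint,
    order_number    varchar(32),   -- degenerate dimension
    order_channel   varchar(16),   -- inlined attribute
    quantity        integer,
    net_amount      numeric(18,2)
);
```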
I am probably totally naïve about the difference, but I can’t see how dbt or MS or any other tool, for that matter, has any bearing on wide table design. Now there are cost differences: based on what some people have mentioned in the Slack community, BigQuery has some weird costs associated with this.
I don’t know about this wall, but there are some practical implications of the extreme, where EVERYTHING is in one table… for one, you’ll never find anything, and there will be many ambiguous columns.
I think Kimball might not suit mobile app/games for the following reasons:
1 - It needs very fast implementation. Analysts are usually in the field as soon as data regarding engagement is out. A full DWH, even an MVP (say 5-7 dim tables and maybe 2-3 fact tables), would have to wait for tons of data and would take many weeks of development. It’s just not practical.
2 - Columnar databases welcome wide tables and hate lots of joins. Kimball naturally needs a lot of joins for a single analysis. Again, the DBAs won’t be very happy.
I think the most practical method is:
Step 1 - Gather enough requirements (for the analysis) BEFORE the developers start developing the feature.
Step 2 - Communicate the requirements (translated into the fields needed, so that developers can better understand them) to the developers and make sure the data contains some connecting fields.
Step 3 - When the raw data is out, just import it into a raw table with very little transformation.
Step 4 - Look at the requirements again and build up a DWH that results in a wide table that analysts don’t need to join (or with very few joins; <= 3 joins should be good). It’s definitely not gonna be Kimball, but each wide table is related to an analysis requirement.
Step 5 - Done and profit. No need to use keys. Got a data issue? If it’s less than, say, 2%, just fix it and move on. Move fast, and as long as it doesn’t derail things it’s OK. Worst case we can still fall back to the raw table and rebuild everything.
OK this also has some implicit requirements on the setup of the teams:
1 - The data team (analysts and developers and DBA) should be grouped together and the head should at least be a Director so that he can fight the developers and business sides on equal footing.
2 - Training for any member of the team should cover all three fields – analysis, data development (including DE and DWH dev), and admin – with the right weights. So if you are a data developer, you still get exposed to analysis and admin, but 60% of the training goes to developing.
3 - The data developer squadron needs to sync with app developers or project managers on a per-sprint basis. This is to make sure that nothing important is neglected. Before the meeting, the data developers should also make sure they have consumed the requirements taken from the analysts and the suggestions taken from the DBAs.
4 - You can remove the DBA team if the data developers do that job.
It is a great question, and one people should ask when entering data analytics. I’ll answer by first posing a question of my own: why do any data modelling at all?
It is an enterprise question, and when you think of the word enterprise you think of large banks or car manufacturers, but in fact an enterprise question relates to another discipline: enterprise architecture. The first block of enterprise architecture is business architecture. Here’s what I know of business architecture:
It defines business capabilities - these are the building blocks of any business, and the collection of which helps fuel the business strategy (example capabilities are attached)
Each capability may have one or many overlapping business processes; start to finish, these are things like onboarding prospects, hardening applications to accounts, processes to deploy one or many accounts for a customer, relating them to products, and so on. All of these fall under capabilities supporting a strategy, right?
Each business process in turn has overlapping business rules, the automated procedures to get work done – units of work. There is always a data model. Kimball models serve the analytic questions by separating the dimensions from the high-churn metrics / measures about them. The dimensions are the things we care about in the business – products, accounts, customers, clients, planes, plants, factories, etc. The business objects. Absolutely Kimball has its place! So too does Data Vault 2.0!
As the enterprise continues to flex and grow it needs to adapt without refactoring, and applying DV2.0 achieves that. Kimball rests on top of DV2.0 to deliver business-focused answers from what DV2.0 stores: DV2.0 recognizes the business objects and business processes, whereas Kimball delivers the analytics needed. They live together. See: DV2 learning is like learning how to Brew!
What you might be thinking of in terms of other data requirements is real-time data modelling, or no modelling at all! Let me explain:
Mobile apps need very fast access to analytical data - explore the Kashlev Data Modeller (url: Portal); this takes the data mart and deploys all results into a query-driven data table approach, but the data still came from an effort to denormalise it in the first place!
Data lake landed data? Well, to be honest, how did it get there? All data needs data modelling – data modelling describes the business. After all, data architecture is the third block in enterprise architecture; without a business architecture you cannot have a data architecture – what are you modelling?
As for Anchor… it’s a cute modelling technique but ask yourself… if you told your Database Administrator you’re expecting to manage every column as its own table (6NF) what do you think his/her reaction would be? It may be logically efficient (no nulls) but it is not physically efficient on relational data stores.
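For anyone who hasn’t seen it, here is a schematic (not actual Anchor Modeling syntax, just illustrative DDL) of what “every column as its own table” means; reassembling even a simple current view of a customer already costs one join per attribute:

```sql
-- Schematic 6NF-style layout, illustrative names only:
-- one anchor table plus one table per attribute.
create table customer         (customer_id bigint primary key);
create table customer_name    (customer_id bigint, valid_from date, name    varchar(100));
create table customer_email   (customer_id bigint, valid_from date, email   varchar(100));
create table customer_segment (customer_id bigint, valid_from date, segment varchar(50));
```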
What I’m thinking is that I 100% agree that data should be modelled under business requirements. What I would argue is that we should directly build models that business users can use, instead of dividing them into dim/fact.
Take mobile gaming as an example. Typical data analytics requires a lot of slicing on one measure, say profit. Not only do analysts want to slice against common “dimensions” such as date/platform/geo, they also want to slice against a lot of other measures such as balance/player behavior/etc.
Now, with the benefit of columnar databases, I really don’t see the benefit of Kimball here. Analysts don’t care how the data is laid out; what they care about is dumping the data into a reporting tool or a mathematical model, and the last thing they want is to take care of the joins themselves. With Kimball, they are going to need at least 5-10 joins just for one analysis, because Kimball breaks everything apart.
My proposal is simple. You don’t care about the business meaning of your data, because frankly business owners NEVER use the database; it’s the business ANALYSTS who use the database. The “business meaning” is in fact the final requirement that the business analysts ask for. What do they ask for? They want a measure and a bunch of slicers, so why not give them that?
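In other words, the ideal end state (with made-up table and column names) is a single wide table where the measure and all the slicers are already in place, and the analyst’s query is just a GROUP BY:

```sql
-- One measure, a bunch of slicers, zero joins (illustrative names).
select
    event_date,
    platform,
    geo,
    player_segment,
    sum(profit) as profit
from analytics.daily_player_profit_wide
group by event_date, platform, geo, player_segment;
```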
I forget whether Kimball mentioned that the data modeller should work closely with the developers (who give them the raw data), but off the top of my head I think he emphasizes working closely with the clients more. I think working with developers is at least of equal importance, because you want to make sure the raw data is in a form you can transform easily.
So here is my method:
Step 1: The BI developer (data modeller) meets with the analysts to discuss requirements. At the end of the meeting they need to have a table with all the fields needed mapped out. And in fact this table is the end product, not just a helper on the road.
Step 2: You meet with the upstream developers and make sure they can give you this data. They won’t necessarily give you the data in one piece, because developers also have frameworks and guidelines to follow, and giving you the data in one piece may damage those guidelines. But you need to make sure that they give you some linking fields to connect the dots.
Step 3: You import each piece into a different raw table with minimal transformation.
Step 4: You transform the raw tables into one single wide table that matches exactly the one you mapped out in Step 1 (see the sketch below). Profit.
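As a minimal sketch of Step 4 (all table and column names are hypothetical), the raw tables get stitched together on the linking fields agreed with the developers, producing exactly the wide table from Step 1:

```sql
-- Step 4 sketch: raw pieces joined on the agreed linking field into the
-- single wide table the analysts asked for (illustrative names only).
create table dwh.session_revenue_wide as
select
    s.session_id,              -- linking field agreed with the developers
    s.player_id,
    s.session_start,
    s.platform,
    s.geo,
    p.purchase_amount,
    p.currency
from raw.app_sessions   s
left join raw.purchases p on p.session_id = s.session_id;
```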
OK, one major caveat of this method: if the analysts suddenly change requirements, then you have to expose the raw tables to them, which is uglier than what Kimball would expose. However, I’d argue that analysts should be able to figure the business information out from the raw tables. After all, they are not business owners. Business owners only know the business, but analysts need to know how to slice/segment the business, and even the raw data should more or less give them that.
In conclusion, I really don’t see how Kimball offers any additional value here. If you break your data up into dim/fact tables, the analysts STILL WANT the wide table, but in the case of Kimball they have to build it up themselves, or you have to write a view for them. So essentially you add an additional unnecessary layer into the picture and waste a ton of time building a nice-looking palace.
Different models and model types serve different purposes; dims and facts are very much relevant, but maybe not for your use case.
I am curious whether there are any good references for how organizations have blended traditional Kimball dims and facts with more of a hypercube/OBT approach.
Popular BI tools like Looker and Tableau still relate to data sets using concepts of facts and dimensions even if the data aren’t strictly modeled as a star schema. I think the logical concepts are still relevant and even helpful for ad-hoc data exploration.
Since raw data nearly always needs some measure of transformation to be useful, choosing a modeling design pattern (like star schemas) can help guide the process. As the data modeler, you’re free to extend or hybridize the pattern, but it’s nice to have a starting point.
I have been designing and building dimensional data warehouses (DWs) for a few years now and I believe this methodology gives us simple rules for ‘judicious denormalization’ which creates business friendly tables.
I agree with Josh’s points here. I just want to add that the idea of joining fact tables together makes my skin crawl: if they’re truly the same grain, then widen your table; if not, then you’re risking gaps and duplicates that analysts will be annoyed by (and payback is a-coming). For similar reasons I’m not a fan of ODS (Inmon) DWs; those subject-area-specific tables are great until they need to be expanded, combined, and compared (and inevitably, the grain changes).
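A quick made-up example of why: join an order-grain fact to a shipment-grain fact and every order amount gets repeated once per shipment, so a naive sum is silently inflated:

```sql
-- Illustrative only: fct_orders is one row per order, fct_shipments is
-- one row per shipment. An order with 3 shipments has its amount
-- counted 3 times here.
select
    o.order_id,
    sum(o.order_amount) as inflated_order_amount
from fct_orders o
join fct_shipments s on s.order_id = o.order_id
group by o.order_id;
```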
I would like to comment on Linstedt’s Data Vault 2.0 (DV). It was very popular about 5-10 years ago (at least in Australia); it seemed like everyone was decommissioning their Kimball DWs for DV. At the time, it made sense: DWs and SANs were slow, poorly designed, and their DBAs stubborn (or cash poor). Overnight fact/dim loads were running well into the morning if not the afternoon. DV’s methodology of parallel loading hubs and spokes from multiple sources seemed like the obvious next evolution. I worked on 2 sites that implemented DV and they were both a mess (technically and culturally). User queries and cube builds from DV tables were difficult, leading to analysts and BI devs building ETLs on top of ETLs; data governance and lineage suffered immeasurably. Not sure how it all turned out on those sites (I didn’t stick around); needless to say, cloud computing meant the Kimball DW load issues dissipated, and hopefully so did DV.