Data Modeling approaches in modern data-times

Hey folks! :wave:

As discussed in the dbt Slack, I am moving this to Discourse:
I have a very general question for you, and it is something I've been wondering about a lot recently.
According to classical business intelligence theory, a data warehouse should include some kind of transformation/data modelling layer such as normalization, a star/snowflake/galaxy schema, and/or a data vault. Often you hear of a mixture of these approaches being applied, most commonly normalization or a data vault, and then e.g. a star schema for end-user access to the data via the BI tool…

However, I feel like this is mostly something huge corporations do, since their need for such transformations is much higher: much bigger data, far more (and more varied) data sources, a much bigger data analysis department, and so on.

To me it seems as if a lot of “younger” companies that don't have that much data, that many sources, or that many data people are skipping these classical transformations or only applying them partially (e.g. transforming the data from raw to a star schema right away when giving colleagues access).
I am wondering now… is this something that is actually happening? Or are most of you actually following that “classical” way? :thinking:

How are you approaching this? Are you using these modelling techniques in your company? And if so, what does your data stack look like (how many data people, how many sources, small or big data - of course only if you are happy to share it :slightly_smiling_face: )?

I would be very interested in getting an idea of how other companies are approaching this, and how commonly people follow the standard way or deviate from it. Looking forward to your experiences!

PS: I know there is already a great discussion about Kimball modeling here: Is Kimball dimensional modeling still relevant in a modern data warehouse? - #6 by josh. However, I am more interested in how people are actually setting up their warehouses in their companies and which factors influence those decisions. Also, I am not necessarily talking about Kimball only.

Thanks in advance :slight_smile:

Best,
Andre


Re-posting my reply from the Slack:

Really good question. My thinking is that these schemas can be useful, but you should only adopt them if they are actually useful to you, not just because it's what people usually do.

For example, at Zearn, we use a snowflake-ish schema, but it’s by necessity. In our prod db, we have several large event tables, and several of our user-level dimension tables have different ways of joining to other tables depending on what kind of user you are looking at. In Redshift, doing all those joins would be really slow.

Creating a snowflake schema with dbt allows us to combine all our event tables into one and take advantage of Redshift's dist/sort keys so that joins to dimensions perform well. So it works well for us. But if you don't get some kind of clear benefit from it (e.g. speed, ease of understanding the tables, etc.), it doesn't seem like a good idea.
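To make that concrete, here is a rough sketch of what such a dbt model could look like on Redshift. This is not our actual code; the source, table, and column names are made up, and the dist/sort choices are just illustrative:

```sql
-- Hypothetical models/fct_events.sql: union the prod event tables into one
-- fact table, distributed on user_id and sorted on event_at so that joins
-- to the user-level dimension (also distributed on user_id) stay local.
{{ config(
    materialized = 'table',
    dist = 'user_id',
    sort = 'event_at'
) }}

select user_id, event_at, 'page_view' as event_type, page_id as object_id
from {{ source('prod', 'page_views') }}

union all

select user_id, event_at, 'lesson_completed' as event_type, lesson_id as object_id
from {{ source('prod', 'lesson_completions') }}
```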

For context, we have one data person (me) and only one data source we do this for (our production database). Our event tables are in the low billions of rows, with some dimension tables in the millions of rows.


At ThriftBooks we organize our data into three layers:

  1. Raw data: atomic events and assorted data at various levels of tidiness (i.e. not super tidy for human consumption).
  2. Normalized analytics tables built from that disorganized raw data. This lets the engineers keep things organized and sanitized, with correct names and column orders, without duplicating data.
  3. De-normalized tables created from the normalized tables to allow for easy human consumption without having to join (see the sketch after this list). With the de-normalization comes some duplication. There is a small number of these tables - 10 or so - that can answer 99% of people’s questions.
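As an illustration only (these are not our real models; table and column names are made up), a layer 3 table might look something like this in dbt, joining the normalized layer into one wide table:

```sql
-- Hypothetical models/denormalized/order_facts.sql: a wide, de-normalized
-- table built from the normalized layer so people can query without joins.
{{ config(materialized = 'table') }}

select
    o.order_id,
    o.ordered_at,
    o.sale_price,
    c.customer_id,
    c.customer_region,
    b.isbn,
    b.title,
    b.list_price
from {{ ref('orders') }}         as o   -- layer 2: normalized
left join {{ ref('customers') }} as c on o.customer_id = c.customer_id
left join {{ ref('books') }}     as b on o.book_id = b.book_id
```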

I don’t know what the different data warehousing methodologies are - I’ve heard of Kimball and others, but never really looked into them. We’ve found our process easy for our engineers to work with, easy for our customers to query, and efficient for our data engine (Snowflake at this time) to process.


@trevor that’s a neat approach. Do you have an example of what might be in each of the three layers? I am trying to wrap my head around whether/when something like that would be useful for us.

This is an excellent question. I work for a large non-profit, and we have been grappling with it as we re-design our data warehouse for our move to the cloud. We have settled on:

  1. An extract process that takes data from our source systems and dumps it raw into cloud storage. We call this our data lake.
  2. A secondary extract process that loads from cloud storage into our cloud data warehouse.
  3. Kimball star schema modeling. We do this primarily to reduce complexity, so we have a layer where the data is easy to understand. This reduces long-term technical debt.
  4. Some of our analytical tools can use the star schema directly, and we are done. Other analytical tools need a wide table that exposes hundreds of columns. We call this a “context”, and we define it as a fact table with all possible dimension tables joined together into one wide table (sketched below).
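For illustration, a “context” model along those lines could be sketched in dbt like this (the fact, dimension, and column names here are made up, not our actual schema):

```sql
-- Hypothetical models/context_donations.sql: one fact table with its
-- dimensions joined on, exposed as a single wide table for tools that
-- can't navigate the star schema themselves.
{{ config(materialized = 'table') }}

select
    f.donation_id,
    f.donation_amount,
    f.donated_at,
    dn.donor_name,
    dn.donor_segment,
    cp.campaign_name,
    cp.campaign_channel
from {{ ref('fct_donations') }}     as f
left join {{ ref('dim_donor') }}    as dn on f.donor_key = dn.donor_key
left join {{ ref('dim_campaign') }} as cp on f.campaign_key = cp.campaign_key
```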

We are in the process of creating this next-generation data warehouse. We have implemented all of the technical pieces, and we are now in the implementation stage where we model our tables and build it out step by step. Time will tell how well it works, but we already see value.


Thanks for sharing that, @Craig and @trevor.
I am wondering: is the Kimball star schema a layer that only non-analytics people and the tools use, or do most of the data analysts also access that star schema directly?
One thing that puzzles me: isn’t there a lot of data exploration involved for data analysts (or BI managers or analytics engineers, whatever you call them)? Especially when investigating topics or creating reports that are not yet accessible via the star schema?
I have the same question about a normalized table layer. Do you try to include all the data you get from the start and normalize it even though there is no clear use for it yet, or do you add new tables/columns only once they are needed?

Hi @drege! Good questions.

Do you try to include all the data you get from the start and normalize it even though there is no clear use for it yet, or do you add new tables/columns only once they are needed?

I’ve been at ThriftBooks for over 8 years - as a developer, dev manager, director, and VP. I’ve seen the types of questions people ask for a long time, so I have a good sense of what data to include and what to leave out. But generally, if the data fits the grain and might answer a question, I include it. Adding an extra column to a columnar database is not a big deal. If you don’t add it, people won’t be able to see it; if you do add it, maybe nobody will want it and it’ll just clutter things up. I tend to err on the side of explorability: I’d rather give people the chance to see that a field is available and decide not to use it than not show it to them at all.

Isn’t there a lot of data exploration involved? Especially when investigating topics or creating reports that are not yet accessible via the star schema?

We tell our analysts that if there’s anything they need, they should let us know and we’ll add it as quickly as possible. If there’s a question they want answered, the data isn’t in the modeled tables, and we have it in our raw data, we usually add it by the next day. Our models are pretty comprehensive because of this. At the beginning we were adding a lot of tables; now we add or modify something a few times a week.

Also interesting that @Craig uses both a star schema and wide, de-normalized tables. I think if you can build one, it’s easy to build the other: if you have a star schema, you just join up all the facts and dimensions and you’ve got a wide de-normalized table; if you have a wide de-normalized table, you just separate the facts from the dimensions and you’re done. The reason we go with the wide de-normalized tables is that they are faster to query in Snowflake. A single massive table with billions of rows can be queried pretty easily with an extra-small warehouse, but when we started joining tables together, performance was not as good unless they had clustering keys. Clustering keys cost money to maintain, and larger warehouses cost money to run. So the big table works for us.
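For anyone weighing the same trade-off: if you do go the join route, this is roughly how a clustering key is declared on a dbt model in Snowflake. A minimal sketch with made-up model and column names - we simply found the single wide table cheaper for our workload:

```sql
-- Hypothetical models/fct_orders.sql: declare clustering keys via dbt so
-- Snowflake keeps the table organized on the columns you usually filter
-- and join on. Clustering maintenance itself consumes credits, which is
-- the cost trade-off mentioned above.
{{ config(
    materialized = 'table',
    cluster_by = ['order_date', 'customer_id']
) }}

select *
from {{ ref('stg_orders') }}
```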


@Craig curious how this journey has gone a year in. We are in the process of building out our analytics stack at my company and I’ve gone back and forth with this. I haven’t found much material on how this is being implemented, use cases, etc.

Reading this as part of dbt onboarding, and I wanted to share a “how we did it” from a 2-3 person analytics engineering team responsible for delivering data to both in-house stakeholders and client-facing reporting. I’d call it mid-sized data on reverse supply chain and recommerce (~100M rows of units through the system).

Notably, we also had a number of (Snowflake) SQL power users in our data stakeholder group, including people on the Ops team on warehouse floors, so intuitive navigation in the db was a high priority in our modeling.

We settled on a schema per business entity (e.g. unit) and then four types of tables:

  • <entity>_details: all the immutable information about the entity (e.g. unit_details would have things like client_id and created_timestamp)
  • <entity>_current: everything in the _details model, plus the current value of any mutable fields (e.g. status or warehouse_location) - see the sketch after this list
  • <entity>_aggregates: everything in the _current model, plus any aggregates from other entities (e.g. storage_location_aggregates had a unit_count based on unit_current.storage_location)
  • Any number of <entity>_<event> tables, comparable to fact tables, where we preserved the event stream of anything that could impact each entity (e.g. unit_received, unit_sorted, unit_fulfilled).
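To illustrate the _current pattern (a sketch only; the model, event table, and column names here are stand-ins, not our real code), the idea is the immutable _details columns plus the latest value of each mutable field pulled from the event stream:

```sql
-- Hypothetical models/unit_current.sql (Snowflake SQL): unit_details plus
-- the most recent status/location taken from a status-change event table.
with latest_status as (

    select
        unit_id,
        status,
        warehouse_location
    from {{ ref('unit_status_changed') }}
    qualify row_number() over (
        partition by unit_id
        order by changed_at desc
    ) = 1

)

select
    d.unit_id,
    d.client_id,
    d.created_timestamp,
    ls.status,
    ls.warehouse_location
from {{ ref('unit_details') }} as d
left join latest_status as ls
    on d.unit_id = ls.unit_id
```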

While there’s obviously some cruftiness to building table, table-plus, and table-plus-plus, we found this an acceptable amount of duplication in exchange for users being able to join to a single table depending on the type of field they needed (i.e. if they just needed status, they could stop at _current rather than _aggregates). That said, dbt’s style guide has suggested it might have been better to surface JUST the _<event> and _aggregates models in the marts pattern.

Still, even this approach was a huge improvement from just-in-time query builds.


Hi Drege,

I think it depends on the actual requirements: big corporations tend to use more sophisticated designs such as Data Vault because they have sophisticated requirements (security, privacy as in GDPR, data governance, agile extension, multi-cloud, etc.).

However, there is a misconception about Data Vault: we sometimes hear that DV is too complicated for small use cases, but that’s actually not true. If your requirements are simple, DV stays simple as well. It also offers a lot of solutions for more advanced scenarios, but every one of those solutions adds something to the resulting model. So the model looks (and becomes) more complicated, but the underlying reason is that your requirements are complicated, and that is reflected in the resulting solution.
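To make the basic building blocks concrete, here is a deliberately minimal sketch (Snowflake-flavored SQL to match the rest of the thread; not from any real project, every table and column name is made up): a hub carries only the business key, and a satellite carries the descriptive attributes. In dbt each would be its own model:

```sql
-- Hypothetical models/raw_vault/hub_customer.sql: one row per business key.
select
    md5(customer_number)  as customer_hashkey,
    customer_number       as customer_business_key,
    loaded_at             as load_date,
    'crm'                 as record_source
from {{ ref('stg_crm__customers') }}
qualify row_number() over (
    partition by customer_number
    order by loaded_at
) = 1

-- Hypothetical models/raw_vault/sat_customer_details.sql: descriptive
-- attributes for the hub, with a hash_diff to detect changes over time.
select
    md5(customer_number)                    as customer_hashkey,
    md5(concat(customer_name, '|', email))  as hash_diff,
    customer_name,
    email,
    loaded_at                               as load_date,
    'crm'                                   as record_source
from {{ ref('stg_crm__customers') }}
```

Adding a new source or a new set of attributes then means adding another satellite rather than reworking existing tables, which is where the extensibility comes from.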

One of our clients gave me excellent feedback on this: he was involved in a non-Data Vault project and told me they were actually super fast at delivering something. However, it only worked for a while, until the requirements grew and the solution more or less broke (because it wasn’t extensible).

Everyone knows that data volume and velocity increase over time, so I guess everybody agrees that a solution must scale in that regard. However, it’s also true that the requirements for how we process data increase over time (think of the introduction of GDPR in Europe, which was a steep step for many). So a solution should also scale with the requirements, IMHO.

Hope that helps a bit,

Mike
