Modern Data Modeling: Are Junk dimensions still relevant?

Hi Guys,

I have a question about what’s the best practice in modern data modeling.

Background info…
The old school Kimball style way is to keep fact tables long and thin, meaning they should only contain measures and keys to dimensions.

All further information, particularly low-cardinality data, such status flags should be in dimensions.

The one exception is high-cardinality data, such as order ids, that are unique for a majority of the fact table rows. Adding those to a dimension table would make the dimension table a similar size as the fact table and defeat the purpose. So they are kept as degenerate dimensions inside the fact table.

In order to avoid creating dozens of dimensions, all data that doesn’t fit into separate dimensions is put into a junk dimension.

The benefit of these best practices is to keep data storage low and the number of dimensions manageable.

The question…
With that in mind… my question is whether this best practice of using a junk dimension still makes sense or whether it’s better to just keep the data as degenerate dimensions?

If the junk dimension columns aren’t used to drill across to other fact tables, is there a benefit of moving them into a separate dimension?

Modern DBs that use column storage compress low-cardinality data in the fact tables. So storage should not be an issue.

Are there any other disadvantages I am missing if I keep some dimensional data in my fact tables?

And only create dimensions for data that is used across multiple fact tables?

1 Like

Hi Kimcha,

I don’t consider myself an expert in modeling strategies, but your point of view makes sense to me.

Some aspects of Kimball’s Dimensional Modeling exist to implement a data compression strategy. The advice to keep fact tables “long and thin” achieves just that: values are looked up, rather than duplicated. That’s data compression. This is excellent advice when the underlying storage is row-based, uncompressed, and joins are cheap.

But as you note, a modern MPP columnar data warehouse implements data compression directly.

MPP systems are also distributed and joins tend to be more costly in a distributed system. The specific costs for joins will vary depending on which warehouse you use.

It is the increased and varying cost of joins, relative to the databases of Kimball’s time, that make me hesitant to say “Kimball is always right.” It’s a different constraint than Kimball was designing for, and I’m certain we have to balance this cost in our modern designs.

1 Like