📚DW Design: which books should I read in 2020?

On a related note to this discourse post, I’m interested to hear what resources people recommend for developing a strong theoretical grounding in DW design.

I’m after a book or a few long and in-depth blog posts to explain the pros and cons of different models so that I can then make informed decisions given a particular use case. I’ve previously enjoyed Martin Fowler’s lengthy posts on software architecture, but he hasn’t done much on data architecture (and what is there is laden with buzzwords and light on technical detail for my liking).

I know The DW Toolkit by Kimball would be a standard reference, but following on from the discussion on this discourse post it seems like there is some debate about its relevance in the MPP database age (although it still seems like a good starting point to me).

Can anyone recommend any favourite DW design books? Or books / long blog posts in general that are must-reads for a technical data lead?

In return, I recommend my favourite technical data architecture book, Designing Data-Intensive Applications by Martin Kleppman, which is AMAZING as a foundation for understanding and combining all kinds of different data systems.

3 Likes

Hi @dan,
Here are two great books (in my opinion) :

  • Building a scalable dwh with datavault 2.0 (by Linstedt & Olschimke)
  • The elephant in the fridge (by Giles)

It’s all about scaling, automation and capturing business rules correctly.

(Edited) I forgot to mention Roelant Vos’ famous blog !! Icing on the cake, I noticed, while looking for the blog’s url, that Roelant Vos himself is testing the integration of his ‘Virtual DataWarehouse’ automation tool with DBT… It’s a small world !

http://roelantvos.com/blog/generating-dbt-etl-for-snowflake/

Best,

1 Like

Thanks the suggestions @fabrice.etanchaud! I’ve seen Datavault come up a few times so this is a good prompt to read a bit further into it, and I’ve saved the blog post for weekend reading!

I also received a few more suggestions on the dbt Slack:

  • Sean C: I’ve found Agile Data Warehouse Design is a good book for this. Half of it is about the process of working with stakeholders and designing models. And then the other half is about design patterns for common types of dimensions. (a few people agreed on this one)
  • Jeff A: I LOVE Mastering Data Modeling: A User Driven Approach for getting a firm understanding how (and how well) a data model maps data to reality and vice versa. It was the book that made data modeling “click” for me. After that the Kimball stuff made much more sense.

There was some more discussion on the original Slack thread, Dave says:

Kimball is still relevant, it’s also worth learning about Inmon and about Data Vault and plain old transactional database modelling. The core parts of Kimball are

  1. semantic layer - converting technical naming into business-aligned naming.
  2. De-normalizing as appropriate
  3. laser focus on business perspective

What I don’t think is so useful is when people read all the technical parts and develop a fixation on the perfect one-size-fits-all-queries star/snowflake schema. There’s some context around when a lot of that was written, mainly to do with the limitations of database technology. Much of it is still relevant, the laws of physics haven’t changed, but some things can be relaxed now.
For example it’s not such a big deal to store multiple redundant copies of denormalized tables that are aggregated in different ways or at different granularities. Especially with cloud based systems like Redshift Spectrum, Athena, Snowflake etc storing the data on blob storage is virtually unlimited and relatively cheap.
I also think it is valuable to realise that if you do create a one-size-fits-all data warehouse then it will not be optimised for any particular use-case. It will be a bit lowest-common-denominator. You can spend months and months trying to build something “perfect” like that and then discover the 5 -10 really important use cases don’t run quickly at all.
One place where dim / fact tables can clash with MPP architecture is that you really want to avoid shuffling between nodes. It’s always going to be faster to have zero joins in a query because a single table can be partitioned across nodes and queries parallelised easily… assuming queries use predicates that are in sympathy with the partitioning strategy. That said, small dimension tables can be stored redundantly in full, on each node, with larger fact tables partitioned across multiple nodes. Co-located (merge) joins are faster, but not as fast as no joins. And as soon as you get complex subqueries then the query engine no longer has sorting guarantees (and hence merge joins) and colocation can’t be relied on either.
What remains is the laser focus on the business requirements. An architecture that is a bit messy and organic, but has some denormalized datamart-style tables is going to be more useful more quickly than a full blown data warehouse.
If you do have extensive data lineage and auditing requirements, then certainly using Data Vault as an intermediary storage will be well worth the additional development time required.
Ultimately the best thing you can do is equip yourself with knowledge of all the different approaches and use the ones that are relevant, disregarding the dogma.

To which Jose followed up:

I absolutely agree with everything in your comment Dave. To be honest, the technology has advanced a lot since these data warehouse designs methodologies were created. So many people tend to either ignore them completely (so there’s a risk of not having a proper design for a company wide data warehouse that covers the requirements from all departments) or they say the new design paradigms focus only on the technical side and don’t go in enough detail over the business requirements so they think they have to follow every single Kimball/Inmon detail. It’s really important to understand the methodologies and the why of them to make proper decisions