On a related note to this discourse post, I'm interested to hear what resources people recommend for developing a strong theoretical grounding in DW design.
I'm after a book or a few long, in-depth blog posts that explain the pros and cons of different models, so that I can make informed decisions given a particular use case. I've previously enjoyed Martin Fowler's lengthy posts on software architecture, but he hasn't done much on data architecture (and what is there is laden with buzzwords and too light on technical detail for my liking).
I know The Data Warehouse Toolkit by Kimball would be a standard reference, but following on from the discussion on this discourse post it seems there is some debate about its relevance in the MPP database age (although it still seems like a good starting point to me).
Can anyone recommend any favourite DW design books? Or books / long blog posts in general that are must-reads for a technical data lead?
In return, I recommend my favourite technical data architecture book, Designing Data-Intensive Applications by Martin Kleppmann, which is AMAZING as a foundation for understanding and combining all kinds of different data systems.
Hi @dan,
Here are two great books (in my opinion):
Building a Scalable Data Warehouse with Data Vault 2.0 (by Linstedt & Olschimke)
The Elephant in the Fridge (by Giles)
It's all about scaling, automation and capturing business rules correctly.
(Edited) I forgot to mention Roelant Vos' famous blog! Icing on the cake: while looking for the blog's URL, I noticed that Roelant Vos himself is testing the integration of his "Virtual DataWarehouse" automation tool with dbt… It's a small world!
Thanks for the suggestions @fabrice.etanchaud! I've seen Data Vault come up a few times, so this is a good prompt to read a bit further into it, and I've saved the blog for weekend reading!
I also received a few more suggestions on the dbt Slack:
Sean C: I've found Agile Data Warehouse Design is a good book for this. Half of it is about the process of working with stakeholders and designing models, and the other half is about design patterns for common types of dimensions. (A few people agreed on this one.)
Jeff A: I LOVE Mastering Data Modeling: A User Driven Approach for getting a firm understanding of how (and how well) a data model maps data to reality and vice versa. It was the book that made data modeling "click" for me. After that, the Kimball stuff made much more sense.
There was some more discussion on the original Slack thread, where Dave says:
Kimball is still relevant, but it's also worth learning about Inmon, Data Vault, and plain old transactional database modelling. The core parts of Kimball are:
- Semantic layer: converting technical naming into business-aligned naming (a sketch follows this list)
- De-normalizing as appropriate
- A laser focus on the business perspective
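To make the first point concrete, here is a tiny sketch of a semantic-layer view; the source table and column names are invented for illustration. Terse technical names go in, business-aligned names come out:

```sql
-- Hypothetical semantic-layer view: expose business-aligned names
-- over a source table with terse technical column names
CREATE VIEW orders AS
SELECT
    ord_id        AS order_id,
    cust_ref_nbr  AS customer_id,
    ord_ttl_amt   AS order_total,
    ord_crt_dttm  AS ordered_at
FROM erp.t_ord_hdr;
```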
What I don't think is so useful is when people read all the technical parts and develop a fixation on the perfect one-size-fits-all-queries star/snowflake schema. There's some context around when a lot of that was written, mainly to do with the limitations of database technology. Much of it is still relevant, the laws of physics haven't changed, but some things can be relaxed now.
For example, it's not such a big deal to store multiple redundant copies of denormalized tables that are aggregated in different ways or at different granularities. Especially with cloud-based systems like Redshift Spectrum, Athena, Snowflake, etc., where data lives on blob storage, capacity is virtually unlimited and relatively cheap.
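As an illustration, here is a minimal sketch (table and column names are hypothetical) of the same fact data materialized redundantly at two different grains:

```sql
-- Fine grain: daily revenue per product
CREATE TABLE agg_revenue_by_product_day AS
SELECT order_date, product_id, SUM(amount) AS revenue
FROM fct_order_lines
GROUP BY order_date, product_id;

-- Coarse grain: monthly revenue per region, a second redundant copy
-- (assumes region has been denormalized onto the fact table)
CREATE TABLE agg_revenue_by_region_month AS
SELECT DATE_TRUNC('month', order_date) AS order_month,
       region,
       SUM(amount) AS revenue
FROM fct_order_lines
GROUP BY 1, 2;
```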
I also think it is valuable to realise that if you do create a one-size-fits-all data warehouse, then it will not be optimised for any particular use case. It will be a bit lowest-common-denominator. You can spend months and months trying to build something "perfect" like that, and then discover the 5-10 really important use cases don't run quickly at all.
One place where dim/fact tables can clash with MPP architecture is that you really want to avoid shuffling between nodes. It's always going to be faster to have zero joins in a query, because a single table can be partitioned across nodes and queries parallelised easily… assuming queries use predicates that are in sympathy with the partitioning strategy. That said, small dimension tables can be stored redundantly in full on each node, with larger fact tables partitioned across multiple nodes. Co-located (merge) joins are faster, but not as fast as no joins. And as soon as you get complex subqueries, the query engine no longer has sorting guarantees (and hence merge joins), and co-location can't be relied on either.
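To ground that in syntax, here is a sketch using Amazon Redshift's distribution options (table and column names are hypothetical): the small dimension is replicated in full to every node, while the fact table is distributed on the join key so joins can be co-located.

```sql
-- Small dimension: DISTSTYLE ALL stores a full copy on every node,
-- so joining to it never requires a shuffle
CREATE TABLE dim_product (
    product_id   INT,
    product_name VARCHAR(255),
    category     VARCHAR(100)
) DISTSTYLE ALL;

-- Large fact table: distribute rows by the join key; joins on product_id
-- to tables distributed the same way stay node-local
CREATE TABLE fct_sales (
    sale_id    BIGINT,
    product_id INT,
    amount     DECIMAL(12,2),
    sale_date  DATE
) DISTKEY (product_id) SORTKEY (sale_date);
```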
What remains is the laser focus on the business requirements. An architecture that is a bit messy and organic, but has some denormalized datamart-style tables, is going to be more useful more quickly than a full-blown data warehouse.
If you do have extensive data lineage and auditing requirements, then using Data Vault as an intermediary storage layer will certainly be well worth the additional development time required.
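For anyone who hasn't met Data Vault before, a minimal sketch of its core shapes (names are illustrative): the hub carries the business key, and the satellite historizes the descriptive attributes, which is where the lineage and audit trail come from.

```sql
-- Hub: one row per business key, with load metadata for auditing
CREATE TABLE hub_customer (
    customer_hk   CHAR(32)     NOT NULL,  -- hash of the business key
    customer_id   VARCHAR(50)  NOT NULL,  -- the business key itself
    load_date     TIMESTAMP    NOT NULL,
    record_source VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_hk)
);

-- Satellite: descriptive attributes, historized by load_date
CREATE TABLE sat_customer_details (
    customer_hk   CHAR(32)     NOT NULL,
    load_date     TIMESTAMP    NOT NULL,
    customer_name VARCHAR(255),
    email         VARCHAR(255),
    record_source VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
```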
Ultimately the best thing you can do is equip yourself with knowledge of all the different approaches and use the ones that are relevant, disregarding the dogma.
To which Jose followed up:
I absolutely agree with everything in your comment, Dave. To be honest, the technology has advanced a lot since these data warehouse design methodologies were created. Many people tend either to ignore them completely (so there's a risk of not having a proper design for a company-wide data warehouse that covers the requirements of all departments), or to feel that the newer design paradigms focus only on the technical side and don't go into enough detail on the business requirements, so they think they have to follow every single Kimball/Inmon detail. It's really important to understand the methodologies, and the why behind them, to make proper decisions.
There is some background to the book. When we wrote it, we wanted to address two target audiences: the professional DWH developer, and beginners who are probably sitting in college and learning data warehousing for the first time (with Data Vault 2.0). So I believe/hope it's a good starting point for many youngsters.
From our experience, dbt works great with Data Vault 2.0, especially if you follow the "data-driven" approach (due to the flexibility of Data Vault, there are many approaches to the modelling style). We have had great experiences with DV & dbt at clients of varying scale.
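To sketch what that can look like in practice (the model, column names and hashing choice below are hypothetical, not taken from any particular project), a hub-loading dbt model might be an incremental model that only inserts business keys it hasn't seen before:

```sql
-- models/raw_vault/hub_customer.sql (hypothetical dbt model)
{{ config(materialized='incremental', unique_key='customer_hk') }}

SELECT DISTINCT
    MD5(UPPER(TRIM(customer_id))) AS customer_hk,  -- hash key from the business key
    customer_id,
    CURRENT_TIMESTAMP             AS load_date,
    'crm'                         AS record_source
FROM {{ ref('stg_customers') }}

{% if is_incremental() %}
  -- only load business keys that are not already in the hub
  WHERE MD5(UPPER(TRIM(customer_id))) NOT IN (SELECT customer_hk FROM {{ this }})
{% endif %}
```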