Creating many-to-many tables from dbt

Hello everybody :slight_smile:

We are starting to use dbt in my company (rewriting our whole data workflow, migrating from MongoDB to Postgres, etc.). We are querying billing data; everything is in a single, very large raw table, which currently contains around 100M rows and is slowly growing.

With dbt, our goal is to split this large table into multiple smaller, regrouped tables to make our queries faster for the end user. But we are facing an issue with array unnesting and many-to-many relations; I'll explain with examples.

First, here is an example of our raw table:

As you can see, there are tags in a varchar(255)[] array. We want to make two tables:

  • tags: which will contain every distinct tag, split into name and value columns.
  • tags_source: which is a pivot between the raw table ID and the tag ID.
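
In other words, the target layout would look something like the sketch below (column names and types are assumptions for illustration, not our actual schema):

```sql
-- Hypothetical target schema
create table tags (
    tag_id  varchar(32) primary key,  -- e.g. md5 of the raw tag string
    name    varchar(255),
    value   varchar(255)
);

create table tags_source (
    source_id bigint,                             -- ID of the row in the raw table
    tag_id    varchar(32) references tags (tag_id)
);
```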

Actually, as I am writing this, it does work, and we found a way to optimize this query: it now takes around 500 seconds to process 70M rows. We went through two iterations before getting this result:

  • The first attempt was to make the tags table and then run a second query doing the pivot with the @> operator (array contains). This was really slow; the @> operator does not handle millions of rows well, and it took more than an hour to process.
  • The second and current attempt is an ephemeral materialization that unnests tags, adds the original raw ID to each unnested row, and also computes an MD5 of the unnested value. Then, we can launch two parallel queries:
    • One that will create the tags table by doing a distinct on the MD5 generated before, preventing any duplicates.
    • The second does a pivot by grouping on the MD5 and the source_id.
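
For reference, the second approach could be sketched as three dbt models like the ones below. All model, source, and column names are assumptions, and the `name:value` delimiter used to split the tag string is a guess:

```sql
-- models/tags_unnested.sql  (hypothetical names throughout)
{{ config(materialized='ephemeral') }}

select
    raw.id     as source_id,
    t.tag,
    md5(t.tag) as tag_id                -- stable key for the unnested value
from {{ source('billing', 'raw') }} as raw
cross join lateral unnest(raw.tags) as t(tag)

-- models/tags.sql  (one row per distinct tag)
select distinct on (tag_id)
    tag_id,
    split_part(tag, ':', 1) as name,    -- assumed 'name:value' encoding
    split_part(tag, ':', 2) as value
from {{ ref('tags_unnested') }}

-- models/tags_source.sql  (pivot between raw ID and tag ID)
select
    source_id,
    tag_id
from {{ ref('tags_unnested') }}
group by source_id, tag_id
```

Because the first model is ephemeral, dbt inlines it as a CTE into each of the two downstream models, so `tags` and `tags_source` can build in parallel.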

Using the second method, these two tables now take around 500s to create, in parallel, which is much better.
However, my real questions are:

  • Can I make this query even faster?
  • Is there a better, more “dbt” convenient approach?

Hi there!

What warehouse are you using (sounds like Postgres)? Your warehouse likely has some support for parsing JSON, which will be much faster than treating it as text. I'd create a "staging" table downstream from this raw table that casts tags to json. See §9.16 "JSON Functions and Operators" in the PostgreSQL 14 docs (other warehouses have something similar).
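
For example, the staging model could convert the array once so everything downstream can use the jsonb operators. A minimal sketch, assuming hypothetical model and source names:

```sql
-- models/staging/stg_billing.sql (hypothetical name)
select
    id,
    to_jsonb(tags) as tags    -- varchar(255)[] -> jsonb array
from {{ source('billing', 'raw') }}
```

Downstream, `jsonb_array_elements_text(tags)` expands that jsonb array back into one row per tag.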