Hello everybody
We are starting to use dbt at my company (rewriting all our data workflows, migrating from MongoDB to Postgres, etc.). We query billing data; everything lives in a single, very large raw table that currently contains around 100M rows and keeps slowly growing.
With dbt, our goal is to split this large table into multiple smaller, regrouped tables to make queries faster for the end user. But we are facing an issue with array unnesting and many-to-many relations; I’ll explain with examples.
First, here is an example of our raw table:
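Only the `tags` column below is taken from the description that follows; the table name and the other columns are hypothetical, just to illustrate the shape:

```sql
-- Hypothetical shape of the raw billing table (real column names differ).
create table raw_billing (
    id          bigint primary key,
    customer_id bigint,
    amount      numeric(12, 2),
    created_at  timestamptz,
    tags        varchar(255)[]  -- array of tag strings, to be split out into their own tables
);
```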
As you can see, the tags are stored in a `varchar(255)[]` array. We want to make two tables (rough shapes sketched below):

- `tags`: contains every unique tag, deduplicated and split into `name` and `value` columns.
- `tags_source`: a pivot table between the raw table ID and the tag ID.
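As a plain-SQL sketch (dbt builds these as models rather than hand-written DDL, and the key types here are assumptions), the targets look roughly like this:

```sql
-- Hypothetical target shapes for the two derived tables.
create table tags (
    id    text primary key,  -- surrogate key for the tag (we end up using an MD5, see below)
    name  varchar(255),
    value varchar(255)
);

create table tags_source (
    source_id bigint,  -- ID of the row in the raw table
    tag_id    text     -- references tags.id
);
```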
Actually, as I’m writing this, it does work: we found a way to optimize this query, which now takes around 500 seconds to process 70M rows. We went through two iterations before getting this result:
- The first attempt was to build the `tags` table and then have a second query do the pivot using the `@>` operator (array contains). This was really slow; the `@>` operator is not good at handling millions of rows, and it took more than an hour to process.
- The second and current attempt is an `ephemeral` materialization that `unnest`s the tags, keeps the original raw ID on the same row, and also computes an MD5 of the unnested value (a sketch of this setup follows the list). From there we can launch two parallel queries:
  - One creates the `tags` table by doing a distinct on the MD5 generated before, preventing any duplicates.
  - The second builds the pivot by grouping on the MD5 and the `source_id`.
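A minimal sketch of this second approach, assuming Postgres and made-up model/source names (`stg_unnested_tags`, `billing.raw_billing`), and assuming tags are stored as `name:value` strings, which the examples above don’t actually show:

```sql
-- models/staging/stg_unnested_tags.sql (hypothetical name)
-- Ephemeral model: unnest the tags array once, keep the raw row ID,
-- and compute an MD5 of each tag so downstream models can key on it.
{{ config(materialized='ephemeral') }}

select
    raw.id     as source_id,
    t.tag      as tag,
    md5(t.tag) as tag_md5
from {{ source('billing', 'raw_billing') }} as raw
cross join lateral unnest(raw.tags) as t(tag)
```

```sql
-- models/marts/tags.sql
-- One row per distinct tag; the name/value split via split_part() assumes a
-- 'name:value' format, which is an assumption, not something shown above.
{{ config(materialized='table') }}

select distinct on (tag_md5)
    tag_md5                 as id,
    split_part(tag, ':', 1) as name,
    split_part(tag, ':', 2) as value
from {{ ref('stg_unnested_tags') }}
```

```sql
-- models/marts/tags_source.sql
-- Pivot table: one row per (raw row, tag) pair, grouped to drop duplicates.
{{ config(materialized='table') }}

select
    source_id,
    tag_md5 as tag_id
from {{ ref('stg_unnested_tags') }}
group by source_id, tag_md5
```

Because the ephemeral model is inlined as a CTE into each downstream model, the unnest work is repeated in both queries, but the two table builds can run in parallel, which matches the timings described above.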
Using this second method, the two tables now take around 500 seconds to create, running in parallel, which is much better.
However, my real questions are:
- Can I make this query even faster?
- Is there a better, more convenient and more “dbt-idiomatic” approach?