Creating many-to-many tables from dbt

Hello everybody :slight_smile:

We are starting to use dbt in my company (rewriting our whole data workflow, migrating from MongoDB to Postgres, etc.). We are querying billing data; everything is in a single, very large raw table, which currently contains around 100M rows and is slowly growing.

With dbt, our goal is to split this large table into multiple smaller, regrouped tables to make our queries faster for the end user. But we are facing an issue with array unnesting and many-to-many relations; I'll explain with examples.

First, here is an example of our raw table:

As you can see, there are tags in a varchar(255)[] array. We want to make two tables:

  • tags: which will contain every distinct tag, split into name and value columns.
  • tags_source: which is a pivot between the raw table ID and the tag ID.
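
In other words, the target layout would look something like the sketch below (column names and types are assumptions for illustration, not our actual schema):

```sql
-- Hypothetical target schema
create table tags (
    tag_id  varchar(32) primary key,  -- e.g. md5 of the raw tag string
    name    varchar(255),
    value   varchar(255)
);

create table tags_source (
    source_id bigint,                             -- ID of the row in the raw table
    tag_id    varchar(32) references tags (tag_id)
);
```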

Actually, as I am writing this, it does work, and we found a way to optimize this query: it now takes around 500 seconds to process 70M rows. We went through two iterations before getting this result:

  • The first attempt was to make the tags table and then run a second query doing the pivot with the @> operator (array contains). This was really slow; the @> operator does not handle millions of rows well, and it took more than an hour to process.
  • The second and current attempt is an ephemeral materialization that unnests tags, adds the original raw ID to each unnested row, and also computes an MD5 of the unnested value. Then, we can launch two parallel queries:
    • One that will create the tags table by doing a distinct on the MD5 generated before, preventing any duplicates.
    • The second does a pivot by grouping on the MD5 and the source_id.
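
For reference, the second approach could be sketched as three dbt models like the ones below. All model, source, and column names are assumptions, and the `name:value` delimiter used to split the tag string is a guess:

```sql
-- models/tags_unnested.sql  (hypothetical names throughout)
{{ config(materialized='ephemeral') }}

select
    raw.id     as source_id,
    t.tag,
    md5(t.tag) as tag_id                -- stable key for the unnested value
from {{ source('billing', 'raw') }} as raw
cross join lateral unnest(raw.tags) as t(tag)

-- models/tags.sql  (one row per distinct tag)
select distinct on (tag_id)
    tag_id,
    split_part(tag, ':', 1) as name,    -- assumed 'name:value' encoding
    split_part(tag, ':', 2) as value
from {{ ref('tags_unnested') }}

-- models/tags_source.sql  (pivot between raw ID and tag ID)
select
    source_id,
    tag_id
from {{ ref('tags_unnested') }}
group by source_id, tag_id
```

Because the first model is ephemeral, dbt inlines it as a CTE into each of the two downstream models, so `tags` and `tags_source` can build in parallel.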

Using the second method, these two tables now take around 500s to create, in parallel, which is much better.
However, my real questions are:

  • Can I make this query even faster?
  • Is there a better, more “dbt” convenient approach?

Hi there!

What warehouse are you using (sounds like Postgres)? Your warehouse likely has some support for parsing JSON, which will be much faster than treating it as text. I'd create a "staging" table downstream from this raw table that casts tags to json. See §9.16 "JSON Functions and Operators" in the PostgreSQL 14 docs (other warehouses have something similar).
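
For example, the staging model could convert the array once so everything downstream can use the jsonb operators. A minimal sketch, assuming hypothetical model and source names:

```sql
-- models/staging/stg_billing.sql (hypothetical name)
select
    id,
    to_jsonb(tags) as tags    -- varchar(255)[] -> jsonb array
from {{ source('billing', 'raw') }}
```

Downstream, `jsonb_array_elements_text(tags)` expands that jsonb array back into one row per tag.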