Querying hierarchical data with a dbt macro (Redshift)


#1

Some of our data has a parent/child relationship, where one of the columns is a foreign key that joins to a record in the same table.
For example:

company_id parent_company_id name revenue
1 Hooli 45
2 1 HooliPhones 84
3 1 HooliChat 32
4 3 HooliMessage 25

There may be cases where you want to know the “top parent of a child”, for example, to later be able to sum up revenue by top parent companies.

company_id parent_company_id name revenue top_parent_company_id level
1 Hooli 45 1 0
2 1 HooliPhones 84 1 1
3 1 HooliChat 32 1 1
4 3 HooliMessage 25 1 2

However in Redshift, it’s not possible to recursively join on a table to find the top parent id. Instead, you have to join the table onto itself multiple times, with the caveat that you must know the maximum depth of the hierarchy.

with level0 as (

    select
      0 as level, 
      companies.company_id,
      companies.parent_company_id,
      companies.company_id as top_parent_company_id
    from companies
    where parent_company_id is null
    
), level1 as (

    select
      1 as level, 
      companies.company_id,
      companies.parent_company_id,
      parent_companies.top_parent_company_id
    from companies
    inner join level0 as parent_companies on companies.parent_company_id = parent_companies.company_id
    
), level2 as (

    select
      2 as level, 
      companies.company_id,
      companies.parent_company_id,
      parent_companies.top_parent_company_id
    from companies
    inner join level1 as parent_companies on companies.parent_company_id = parent_companies.company_id
    
)

select * from level0
union all
select * from level1
union all
select * from level2

dbt macros can be used to make this code much cleaner - I wrote a fairly flexible macro that finds the top_parent_id of an entity based on a given id and parent_id column. You can then join the results of this table back onto your original table for further modelling. You can find the macro, along with an advanced example, in this gist.

I also played around with writing a macro that returns the size of each “level” table, and only continues executing if there are records at this level. I was able to build a working version of that macro using dbt statements, however in the end I decided to stick with the above approach as it was much simpler and achieved what I needed. If you’re interested in the statement-based macro, let me know :slight_smile:


#2

Thanks @claire! This was super helpful. I have a bunch of e-commerce bundled items that needed to be recursively expanded (sometimes there are bundles of bundles). This totally put me on the right track and the leveled approach means I can make it dynamic for the user in Looker!