Some of our data has a parent/child relationship, where one of the columns is a foreign key that joins to a record in the same table.
There may be cases where you want to know the “top parent of a child”, for example, to later be able to sum up revenue by top parent companies.
However in Redshift, it’s not possible to recursively join on a table to find the top parent id. Instead, you have to join the table onto itself multiple times, with the caveat that you must know the maximum depth of the hierarchy.
with level0 as ( select 0 as level, companies.company_id, companies.parent_company_id, companies.company_id as top_parent_company_id from companies where parent_company_id is null ), level1 as ( select 1 as level, companies.company_id, companies.parent_company_id, parent_companies.top_parent_company_id from companies inner join level0 as parent_companies on companies.parent_company_id = parent_companies.company_id ), level2 as ( select 2 as level, companies.company_id, companies.parent_company_id, parent_companies.top_parent_company_id from companies inner join level1 as parent_companies on companies.parent_company_id = parent_companies.company_id ) select * from level0 union all select * from level1 union all select * from level2
dbt macros can be used to make this code much cleaner - I wrote a fairly flexible macro that finds the
top_parent_id of an
entity based on a given
parent_id column. You can then join the results of this table back onto your original table for further modelling. You can find the macro, along with an advanced example, in this gist.
I also played around with writing a macro that returns the size of each “level” table, and only continues executing if there are records at this level. I was able to build a working version of that macro using dbt statements, however in the end I decided to stick with the above approach as it was much simpler and achieved what I needed. If you’re interested in the statement-based macro, let me know