Some data sources are hierarchical, where a record has parent that is a record in the same table. This is usually expressed as a column (such as
parent_company_id) which is a foreign key that joins to another column in the same table. It’s probably best expressed in an example:
There may be cases where you want to know the “top parent of a child”, for example, to later be able to sum up revenue by top parent companies.
The way to solve this problem differs across data warehouses.
Redshift & BigQuery
In Redshift, it’s not possible to recursively join on a table to find the top parent id. Instead, you have to join the table onto itself multiple times, with the caveat that you must know the maximum depth of the hierarchy.
with level0 as ( select 0 as level, companies.company_id, companies.parent_company_id, companies.company_id as top_parent_company_id from companies where parent_company_id is null ), level1 as ( select 1 as level, companies.company_id, companies.parent_company_id, parent_companies.top_parent_company_id from companies inner join level0 as parent_companies on companies.parent_company_id = parent_companies.company_id ), level2 as ( select 2 as level, companies.company_id, companies.parent_company_id, parent_companies.top_parent_company_id from companies inner join level1 as parent_companies on companies.parent_company_id = parent_companies.company_id ) select * from level0 union all select * from level1 union all select * from level2
dbt macros can be used to make this code much cleaner - you can find such a macro, along with an example, in this gist.
Snowflake allows you to do this with recursive CTEs (currently in beta) – more to come on what solve this problem looks like in Snowflake.