Some data sources are hierarchical, where a record has parent that is a record in the same table. This is usually expressed as a column (such as parent_company_id
) which is a foreign key that joins to another column in the same table. It’s probably best expressed in an example:
company_id | parent_company_id | name | revenue |
---|---|---|---|
1 | Hooli | 45 | |
2 | 1 | HooliPhones | 84 |
3 | 1 | HooliChat | 32 |
4 | 3 | HooliMessage | 25 |
There may be cases where you want to know the “top parent of a child”, for example, to later be able to sum up revenue by top parent companies.
company_id | parent_company_id | name | revenue | top_parent_company_id | level |
---|---|---|---|---|---|
1 | Hooli | 45 | 1 | 0 | |
2 | 1 | HooliPhones | 84 | 1 | 1 |
3 | 1 | HooliChat | 32 | 1 | 1 |
4 | 3 | HooliMessage | 25 | 1 | 2 |
The way to solve this problem differs across data warehouses.
Redshift & BigQuery
In Redshift, it’s not possible to recursively join on a table to find the top parent id. Instead, you have to join the table onto itself multiple times, with the caveat that you must know the maximum depth of the hierarchy.
with level0 as (
select
0 as level,
companies.company_id,
companies.parent_company_id,
companies.company_id as top_parent_company_id
from companies
where parent_company_id is null
),
level1 as (
select
1 as level,
companies.company_id,
companies.parent_company_id,
parent_companies.top_parent_company_id
from companies
inner join level0 as parent_companies
on companies.parent_company_id = parent_companies.company_id
),
level2 as (
select
2 as level,
companies.company_id,
companies.parent_company_id,
parent_companies.top_parent_company_id
from companies
inner join level1 as parent_companies
on companies.parent_company_id = parent_companies.company_id
)
select * from level0
union all
select * from level1
union all
select * from level2
dbt macros can be used to make this code much cleaner - you can find such a macro, along with an example, in this gist.
Snowflake
Snowflake allows you to do this with recursive CTEs (currently in beta) – more to come on what solve this problem looks like in Snowflake.