Querying hierarchical data

Some data sources are hierarchical, where a record has parent that is a record in the same table. This is usually expressed as a column (such as parent_company_id) which is a foreign key that joins to another column in the same table. It’s probably best expressed in an example:

company_id parent_company_id name revenue
1 Hooli 45
2 1 HooliPhones 84
3 1 HooliChat 32
4 3 HooliMessage 25

There may be cases where you want to know the “top parent of a child”, for example, to later be able to sum up revenue by top parent companies.

company_id parent_company_id name revenue top_parent_company_id level
1 Hooli 45 1 0
2 1 HooliPhones 84 1 1
3 1 HooliChat 32 1 1
4 3 HooliMessage 25 1 2

The way to solve this problem differs across data warehouses.

Redshift & BigQuery

In Redshift, it’s not possible to recursively join on a table to find the top parent id. Instead, you have to join the table onto itself multiple times, with the caveat that you must know the maximum depth of the hierarchy.

with level0 as (

    select
      0 as level, 
      companies.company_id,
      companies.parent_company_id,
      companies.company_id as top_parent_company_id
    from companies
    where parent_company_id is null
    
),

level1 as (

    select
      1 as level, 
      companies.company_id,
      companies.parent_company_id,
      parent_companies.top_parent_company_id
    from companies

    inner join level0 as parent_companies
      on companies.parent_company_id = parent_companies.company_id
    
),

level2 as (

    select
      2 as level, 
      companies.company_id,
      companies.parent_company_id,
      parent_companies.top_parent_company_id
    from companies

    inner join level1 as parent_companies
      on companies.parent_company_id = parent_companies.company_id
    
)

select * from level0
union all
select * from level1
union all
select * from level2

dbt macros can be used to make this code much cleaner - you can find such a macro, along with an example, in this gist.

Snowflake

Snowflake allows you to do this with recursive CTEs (currently in beta) – more to come on what solve this problem looks like in Snowflake.

3 Likes

Thanks @claire! This was super helpful. I have a bunch of e-commerce bundled items that needed to be recursively expanded (sometimes there are bundles of bundles). This totally put me on the right track and the leveled approach means I can make it dynamic for the user in Looker!

We’ve been able to solve the hierarchical problem using Snowflake’s Recursive CTE. https://docs.snowflake.net/manuals/LIMITEDACCESS/recursive_cte.html

The documentation is fairly clear on how to set it up, you have an anchor clause with a UNION ALL to a recursive clause.

We use it to understand our Zuora subscription data. Our subscriptions can link to future subscriptions as a renewal. When we see a subscription that is a renewal, we want to understand what the first subscription in the lineage actually is. We do this by recursively joining on the renewal subscription id to the main subscription id until there is nothing left to join.

Our code here details how we do this https://gitlab.com/gitlab-data/analytics/blob/70991c7b6beb71858a97f548521c335b2314a384/transform/snowflake-dbt/models/zuora/xf/zuora_subscription_lineage.sql

WITH RECURSIVE

flattening AS (

  SELECT
	{{ dbt_utils.star(from=ref('zuora_subscription_intermediate'), except=["ZUORA_RENEWAL_SUBSCRIPTION_NAME_SLUGIFY"]) }},

    IFF(array_to_string(ZUORA_RENEWAL_SUBSCRIPTION_NAME_SLUGIFY,',') IS NULL,
    subscription_name_slugify,
    subscription_name_slugify || ',' || array_to_string(ZUORA_RENEWAL_SUBSCRIPTION_NAME_SLUGIFY,','))
     					AS lineage,
    renewal.value::string 		AS ZUORA_RENEWAL_SUBSCRIPTION_NAME_SLUGIFY
  FROM {{ref('zuora_subscription_intermediate')}},
    LATERAL flatten(input => zuora_renewal_subscription_name_slugify, OUTER => TRUE) renewal

),

zuora_sub (base_slug, renewal_slug, parent_slug, lineage, children_count) AS (

  SELECT
    subscription_name_slugify              	AS base_slug,
    zuora_renewal_subscription_name_slugify	AS renewal_slug,
    subscription_name_slugify              	AS parent_slug,
    lineage 					AS lineage,
    2     					AS children_count
  FROM flattening

  UNION ALL

  SELECT
    iter.subscription_name_slugify											AS base_slug,
    iter.zuora_renewal_subscription_name_slugify							AS renewal_slug,
    anchor.parent_slug														AS parent_slug,
    anchor.lineage || ',' || iter.zuora_renewal_subscription_name_slugify 	AS lineage,
    iff(iter.zuora_renewal_subscription_name_slugify IS NULL,
		0,
		anchor.children_count + 1) 											AS children_count
  FROM zuora_sub anchor
  JOIN flattening iter
    ON anchor.renewal_slug = iter.subscription_name_slugify

),

I’ll highlight a few items here. First, the recursive CTE does not have to be the first CTE in your model. You do need WITH RECURSIVE at the start, but in our case the first CTE flattening does some simple data prep. The zuora_sub CTE is the recursive CTE.

The first select statement is the anchor query (aliased as anchor). This query pulls from the subscription intermediate table and sets up the initial lineage. If there is a renewal subscription then it will continue to the next part of the CTE, but if there are no renewals then the recursive clause will return no additional results.

The recursive clause (aliased as iter) joins the renewal slug from the anchor clause to the subscription slug of the next iteration of the recursive clause. We’re keeping track of the parent slug as the “root” for the initial recursion (this is the “ultimate parent” of the lineage). Within the recursive clause we’re checking if there are any further renewals before setting the child count.

Overall this has worked great for us! Snowflake sets a limit of 100 levels for the recursion so you don’t have to worry too much about setting up an infinite loop or wasting a ton of resources unless each iteration is a very heavy query.

1 Like