I was recently pairing with an early-career analyst who asked me for help debugging a query she had written. The problem she had to solve was similar to this:
A company has two newsletters that people can subscribe to — for the sake of this conversation, let’s call them “The Daily” and “The Weekly”.
Each month, we want to know:
- How many people that were members of the Daily, subscribed to the Weekly?
- How many people that were members of the Weekly, subscribed to the Daily?
Earlier in my career, when thinking how to solve this problem, my first instinct would have been to look at the data we had available, and start writing queries — after all, you have to build your DAG from left to right, right?
Now, I take a backwards approach, and write the final query first.
Vid here:
and the google sheet.
Step 1: Go to sheets.new to create a new Google Sheet
(Did you know that this is a valid URL? RIP my Google Drive, which is now full of Untitled Sheet
s)
Step 2: Mock out the final report you’re trying to build, by entering values into the sheet.
For us, this looked like the following:
subscription_month | weekly_subscribers_to_daily | daily_ subscribers_to_weekly |
---|---|---|
Jan 2021 | 100 | 150 |
Feb 2022 | 120 | 160 |
^ These numbers don’t have to be reflective of real numbers at all, and the column names don’t even have to be perfect at this point.
Step 3. Mock the query that would produce this table
You might also simultaneously mock out the table you’re selecting from to write this query.
This is where things got tricky for us. At first, we thought we’d be selecting from a table with one record per subscriber:
subscribers
: one record per subscriber
subscriber_id | date_subscribed_to_daily | date_subscribed_to_weekly |
---|---|---|
1 | 2021-01-01 | 2021-01-08 |
2 | 2021-02-07 | |
3 | 2021-03-18 | 2021-03-14 |
But when we went to write the query, we realized we had a problem – what field would we date_trunc
on?
select
date_trunc('month', ????) as subscription_month,
count(???) as weekly_subscribers_to_daily,
count(???) as daily_ subscribers_to_weekly
from subscribers
group by 1
Step 4. Iterate
From here, we realized the model that we were selecting from needed to be one record per subscription — a particular subscriber (human being) can appear twice. Our query was going to look something like this:
select
date_trunc('month', subscription_date) as subscription_month,
count(case when was_previously_subscribed_to_weekly and newsletter = 'daily' then subscriber_id end) as weekly_subscribers_to_daily,
count(case when was_previously_subscribed_to_daily and newsletter = 'weekly' then subscriber_id end) as daily_subscribers_to_weekly
from subscribers
group by 1
^ Note that we also filled in some more of our count
logic here, which lets us know "ok we need to make sure our upstream model has fields like newsletter
and was_previously_subscribed_to_daily
etc.
Now our mock table gets updated to:
subscriptions
: one record per subscriber, per newsletter they are subscribed to
subscriber_id | date_subscribed | newsletter | was_previously_subscribed_to_daily | was_previously_subscribed_to_weekly |
---|---|---|---|---|
1 | 2021-01-01 | daily | FALSE | FALSE |
1 | 2021-01-08 | weekly | TRUE | FALSE |
2 | 2021-02-07 | daily | FALSE | FALSE |
3 | 2021-03-14 | weekly | FALSE | FALSE |
3 | 2021-03-18 | daily | FALSE | TRUE |
We can also take a guess as to what the SQL for those last two columns looks like, they’re going to be some sort of window function, partitioned by the subscriber_id
:
select
subscriber_id,
date_subscribed,
newsletter,
boolor_agg(newsletter = 'daily') over (
partition by subscriber_id
order by date_subscribed
rows between unbounded preceding and current row
) as was_previously_subscribed_to_daily,
-- etc
from ???
Step 5. Match this up with source data
So far, we’ve written pseudo-output and pseudocode without actually looking at our source data! At this point, we need to consider what the shape of our source data is! Do we have a table of subscriptions? Do the subscriptions to each newsletter live in separate tables (and therefore we need to union
the data together)
Step 6. Move the logic into your dbt project
Now that you’ve broken up the problem into smaller pieces, you can start moving this code into your dbt project! Sometimes the theoretical backwards-working is enough for me to start building my DAG from left-to-right. Sometimes though, it’s not, and I need to work backwards in my dbt project too! In those cases, I’ll often put placeholder columns in my models (especially if I’m adjusting existing models by adding a new column):
select
...,
null::boolean as was_previously_subscribed_to_daily
from ...
This approach has been met with mixed reactions internally — I’m curious, do you think this is a good approach, or more confusing than just writing the SQL?