Writing models backwards: An unorthodox approach to data transformation

I was recently pairing with an early-career analyst who asked me for help debugging a query she had written. The problem she had to solve was similar to this:

A company has two newsletters that people can subscribe to — for the sake of this conversation, let’s call them “The Daily” and “The Weekly”.

Each month, we want to know:

  • How many people that were members of the Daily, subscribed to the Weekly?
  • How many people that were members of the Weekly, subscribed to the Daily?

Earlier in my career, when thinking how to solve this problem, my first instinct would have been to look at the data we had available, and start writing queries — after all, you have to build your DAG from left to right, right?

Now, I take a backwards approach, and write the final query first.

Vid here:

and the google sheet.

Step 1: Go to sheets.new to create a new Google Sheet

(Did you know that this is a valid URL? RIP my Google Drive, which is now full of Untitled Sheets)

Step 2: Mock out the final report you’re trying to build, by entering values into the sheet.

For us, this looked like the following:

subscription_month weekly_subscribers_to_daily daily_ subscribers_to_weekly
Jan 2021 100 150
Feb 2022 120 160

^ These numbers don’t have to be reflective of real numbers at all, and the column names don’t even have to be perfect at this point.

Step 3. Mock the query that would produce this table

You might also simultaneously mock out the table you’re selecting from to write this query.

This is where things got tricky for us. At first, we thought we’d be selecting from a table with one record per subscriber:

subscribers: one record per subscriber

subscriber_id date_subscribed_to_daily date_subscribed_to_weekly
1 2021-01-01 2021-01-08
2 2021-02-07
3 2021-03-18 2021-03-14

But when we went to write the query, we realized we had a problem – what field would we date_trunc on?

select
  date_trunc('month', ????) as subscription_month,
  count(???) as weekly_subscribers_to_daily,
  count(???) as daily_ subscribers_to_weekly
 
from subscribers

group by 1

Step 4. Iterate

From here, we realized the model that we were selecting from needed to be one record per subscription — a particular subscriber (human being) can appear twice. Our query was going to look something like this:

select
  date_trunc('month', subscription_date) as subscription_month,
  count(case when was_previously_subscribed_to_weekly and newsletter = 'daily' then subscriber_id end) as weekly_subscribers_to_daily,
  count(case when was_previously_subscribed_to_daily and newsletter = 'weekly' then subscriber_id end) as daily_subscribers_to_weekly
 
from subscribers

group by 1

^ Note that we also filled in some more of our count logic here, which lets us know "ok we need to make sure our upstream model has fields like newsletter and was_previously_subscribed_to_daily etc.

Now our mock table gets updated to:
subscriptions: one record per subscriber, per newsletter they are subscribed to

subscriber_id date_subscribed newsletter was_previously_subscribed_to_daily was_previously_subscribed_to_weekly
1 2021-01-01 daily FALSE FALSE
1 2021-01-08 weekly TRUE FALSE
2 2021-02-07 daily FALSE FALSE
3 2021-03-14 weekly FALSE FALSE
3 2021-03-18 daily FALSE TRUE

We can also take a guess as to what the SQL for those last two columns looks like, they’re going to be some sort of window function, partitioned by the subscriber_id:

select
  subscriber_id,
  date_subscribed,
  newsletter,

  boolor_agg(newsletter = 'daily') over (
     partition by subscriber_id
     order by date_subscribed
     rows between unbounded preceding and current row
  ) as was_previously_subscribed_to_daily, 
  -- etc
from ???

Step 5. Match this up with source data

So far, we’ve written pseudo-output and pseudocode without actually looking at our source data! At this point, we need to consider what the shape of our source data is! Do we have a table of subscriptions? Do the subscriptions to each newsletter live in separate tables (and therefore we need to union the data together)

Step 6. Move the logic into your dbt project

Now that you’ve broken up the problem into smaller pieces, you can start moving this code into your dbt project! Sometimes the theoretical backwards-working is enough for me to start building my DAG from left-to-right. Sometimes though, it’s not, and I need to work backwards in my dbt project too! In those cases, I’ll often put placeholder columns in my models (especially if I’m adjusting existing models by adding a new column):

select
  ..., 
  null::boolean as was_previously_subscribed_to_daily

from ...

This approach has been met with mixed reactions internally — I’m curious, do you think this is a good approach, or more confusing than just writing the SQL?

14 Likes

I like it. It has some echoes of test-driven development which I’ve found to be helpful at times. Excited to try it out myself :).

1 Like

“Invert, always invert” - Carl Gustav Jacob Jacobi

Starting at the end is a very powerful problem solving approach, I’ll definitely give this a go on my next transformation challenge.

Thanks for sharing Claire!

This method of thinking also parallels the structure of many of the SQL questions on HackerRank and other sites, where the expected output table’s schema is typically provided alongside the question. Which is to say - learning to think this way could help you with the SQL test in your next interview. :slightly_smiling_face:

This approach is the one proposed by Kimball in the DWH book :slight_smile:

You start from the answer that you need to provide to the business, then you model your tables, then you figure out how to populate your model (first you look for the data, then you write the ETL :slight_smile: ).

It is a good approach :slight_smile:

1 Like

This approach was introduced to me by my new manager last month and I’m actually finding myself struggling do to the shift to think this way (right to left).

as someone that come from web application development world to the analytics\data engineering world it feels more convenient and efficient to me to start writing code (queries) and adjust it step by step (left to right).

But I’m giving this approach a chance, I wonder how it will feel in the future :smiley:

Great writeup, Claire!

I think the most important part when starting something like this is to understand how your data needs to be structured in the end result for your stakeholder (i.e. having decent-enough requirements). So from that perspective, I always start at the end…but where I go from there depends on a few things. I think your Step 2 (mocking out the final report) is going to look different for people depending on where the BI calculations are done. What you currently have in there is essentially the end result of the report - but is that the end result of your code? It probably depends on where you’re calculating the metrics - in the BI tool or the code itself. I might be wrong with this assumption, but I think most people currently have the calculations in the BI tool itself since it’s more performant to e.g. apply date parameters to the dataset than it is to re-query the dataset with new date parameters.

So for me personally, my Step 2 here would be to mock up what the end dataset is that I need to produce in order to support generating the calculated metrics that go into the report. Here are some questions I ask myself during this process (assuming the calculations are done in the BI tool):

  • What are the metrics that need to go into the report? What level(s) of granularity (dimensions) are the final metrics reported at?
  • For each final metric, what are the underlying fields that contribute to that metric’s calculation?
  • For each individual metric required to support the calculations, what dimensions are available to me and how do the dimensions at that level differ from the dimensionality needed in the final report or dataset?
  • During this thought process, I also think about the operations that need to be applied to the metrics (e.g. SUM, AVG) and think about my GROUP BYs for dimensionality.

In order to answer these questions, you kinda need to start at the beginning of the DAG workflow to understand all the little pieces of data you have available.

I think where people (myself included) are potentially getting tripped up with this “destination-first paradigm” is writing the SQL queries at each stage. Think of it like putting together something from IKEA. You know what the end result needs to be. And if you start from the back of the instruction manual, you’ll see that the final steps often require you to screw allen wrench something in that was assembled from individual components earlier on in the process. And so while it’s great (necessary, I’d argue) to understand what those modular (assembled) components are that you need in the final steps, there’s no way for me to actually build those without starting from the beginning, with the actual parts. Not to mention, the first thing you need to do with IKEA furniture is verify that you have all the necessary parts to be able to actually build the thing (aka initial data discovery).

If I were to change the process you’ve outlined to work better for me personally, I’d probably take out the writing SQL backwards part and instead focus on just modeling my data backwards. I’d want to understand what my data needs to look like at each stage, and once I have all the stages outlined, I’d start from the beginning (left side of the DAG) to start building towards those stages. I think there’s also something to be said for failing early – failing early doesn’t only mean you fail quickly…it also relates to where you fail (i.e. early on in the data process). Point being, I’d rather fail in developing the initial building blocks vs. assuming I’ll have certain building blocks in the end. Now, that might be a little confusing, but what I’m saying is - I can’t assume the building blocks I need later on in the process will be built exactly the way I had outlined if I were writing the SQL backwards - there may be some nuances earlier on in the data process that would cause me to shift my entire approach.

Lastly, I wanted to touch on something you said early on in the article.

Earlier in my career, when thinking how to solve this problem, my first instinct would have been to look at the data we had available, and start writing queries

First off - I think this is a testament to how much you’ve grown as a developer, as this, in my opinion, is the difference between junior and senior developers. Junior devs often take a problem and then sit down and start writing code, hoping that it’ll play out for the best. Senior devs take a problem, then map/plan everything out before executing on their plan.

3 Likes

Thanks for sharing @Claire!

I found your post echoing with my recent experience.

We just onboarded dbt and I am now working on migrating our Postgres project with ~100 tables/views/materialised views to Redshift using dbt, with the goal of improving our schema design. As a newbie in analytics engineering (I am a data analyst), I had struggles at the beginning when I tried to design upstream models (dimension and fact tables) first as I was not so sure what we would need.

Then I decided to start with cube models (an intermediate layer between dimensions & facts and reports), I copied the old SQL definition and start to clean up the code and logic. It was during this process that it became clear to me which parts of the old code should be decoupled, which are repeated across the project, and eventually what should be moved upstream/downstream. This approach of starting from (not the very end) helped me a lot!