Refactoring a query by improving the modeling


#1

Hey all - I’m looking for advice about the best way to model something:

I have a users model and a user_stats model with facts about those users. One of the facts is total_revenue representing how much they’ve spent all time and another is upgraded_at with the date/time they first paid.

To display a list of the most recent customers with their total revenue, I can write a query like this:

select
  email,
  total_revenue
from analytics.users
join analytics.user_stats on user_stats.user_id = users.id
where total_revenue > 0
order by upgraded_at desc
limit 20

I’m wondering whether there are ways to improve this. A few ideas…

Option 1

I’ll be filtering on total_revenue > 0 often. I could create a new column in user_stats like is_customer that way I can just write where is_customer rather than repeatedly write total_revenue > 0 in analyses:

select
  email,
  total_revenue
from analytics.users
join analytics.user_stats on user_stats.user_id = users.id
where is_customer
order by upgraded_at desc
limit 20

Option 2

Create a model that just represents customers:

select users.*
from {{ ref('users') }} users
join {{ ref('user_stats') }} on user_stats.user_id = users.id
where total_revenue > 0

Then in the analysis I can write:

select
  email,
  total_revenue
from analytics.customers customers
join analytics.user_stats on user_stats.user_id = customers.id
order by upgraded_at desc
limit 20

Option 3

Take this a step further and create a recent_customers model:

select customers.*
from {{ ref('customers') }} customers
join {{ ref('user_stats') }} on user_stats.user_id = customers.id
order by upgraded_at desc
limit 20

Then in the analysis just write:

select
  email,
  total_revenue
from analytics.recent_customers
join analytics.user_stats on user_stats.user_id = recent_customers.id

Do any of these approaches stand out as much better or worse than the others? How in general do you all think about how much to model vs writing in the actual analysis query?


#2

Hey @mhmazur - I think that creating a customers model is a fantastic idea.

Your logic today might be where total_revenue > 0, but I’ve definitely seen cases where promotions or discounts result in a legitimate customer not paying any money for the first month of a service. I think if I was doing this, I’d add an is_customer field to the users model, then make a customers model to just select from users where is_customer = TRUE.

This abstraction will make it possible to change the definition of a “customer” in the future without needing to update consumers (either other models, or analyst queries). Further, you can still query users.is_customer to see a breakdown of behavior segmented by user type.

I’d avoid Option 3, as this limit can easily be applied by a consumer if necessary. If you really need to segment by new/old, I’d use a window function to rank the customers, then produce an is_recent boolean.

Something like:

case
  when row_number() over (partition by customer_id order by upgraded_at desc) < 20
    then TRUE
  else FALSE
end as is_recent

My guess is that there are other ways to define “recent” customers, like upgrading in the past week, for instance. This is_recent is a nice abstraction that will again let you change this definition in the future if you need to!


#3

Thank you Drew, appreciate the thoughtful response as always.

One related thing I’ve been wondering about are best practices around where to place new models within the models directory. For example, in this project I have a models/base with the users model and other base models.

When I create a new model like customers here, it’s not obvious where the best place is to put it. My current approach is something like this:

  1. If there’s already a directory with related models, place it in there and add the tests to the existing schema.yml. For example, if I already had a models/users/ with other models related to segmenting users, put the model in models/users/customers.sql
  2. If there’s not already a directory with related models, place it in it’s own directory and create a new schema. For example, models/customers/customers.sql and models/customers/schema.yml. Later, if there do wind up being related models, combine them into a single directory like models/users/ and merge the schemas.

Is that a decent approach or are there better best practices to follow?

I’ve noticed in some of the open source projects there’s often a models/transform/ directory like in the Stripe project - what models would typically go in there vs their own directory?