Jump straight to the project docs, or to the source code
This article was prompted by a question from a team member
Our data
Let’s say we have a table with one record per user, per day they were active:
user_id | date_day | is_active |
---|---|---|
1 | 2020-08-01 | true |
1 | 2020-09-01 | true |
1 | 2020-09-04 | true |
1 | 2020-09-05 | true |
1 | 2020-09-12 | true |
1 | 2020-09-17 | true |
2 | 2020-09-01 | true |
The problem
We’ve been asked to create reports for daily, weekly, and monthly active users.
On the surface, we know that we could calculate daily active users like this:
Daily active users
select
date_day,
count(distinct user_id) as daily_active_users
from {{ ref('user_active_days') }}
group by 1
order by 1
date_day | daily_active_users |
---|---|
2020-08-13 | 1 |
2020-09-01 | 2 |
2020-09-04 | 1 |
2020-09-05 | 1 |
… | … |
(Note, we haven’t filled in the missing days, which might be something worth doing )
We can follow similar patterns for weekly and monthly active users, to generate outputs like this.
Weekly active users SQL
select
date_trunc('week', date_day) as date_week,
count(distinct user_id) as weekly_active_users
from {{ ref('user_active_days') }}
group by 1
order by 1
date_week | weekly_active_users |
---|---|
2020-08-10 | 1 |
2020-08-17 | 1 |
2020-08-24 | 1 |
2020-08-31 | 2 |
2020-09-07 | 1 |
2020-09-14 | 1 |
Monthly active users SQL
select
date_trunc('month', date_day) as date_month,
count(distinct user_id) as monthly_active_users
from {{ ref('user_active_days') }}
group by 1
order by 1
date_month | monthly_active_users |
---|---|
2020-08-01 | 1 |
2020-09-01 | 2 |
However, wouldn’t it be if we could see this changing on a daily basis? That can feel strange when thinking about “monthly active users” for a day in the middle of the month, but essentially we want to get to something like this — for each day:
- How many active users were there today?
- How many active users were there in the last week / 7 days?
- How many active users were there in the last month / 30 days?
GOAL TABLE
date_day | daily_active_users | weekly_active_users | monthly_active_users |
---|---|---|---|
… | … | … | … |
2020-09-01 | 2 | 2 | 2 |
2020-09-02 | 0 | 2 | 2 |
2020-09-03 | 0 | 2 | 2 |
2020-09-04 | 1 | 2 | 2 |
2020-09-05 | 1 | 2 | 2 |
2020-09-06 | 0 | 2 | 2 |
2020-09-07 | 0 | 2 | 2 |
2020-09-08 | 0 | 1 | 2 |
2020-09-09 | 0 | 1 | 2 |
… | … | … | … |
The “semi-additive” nature of these metrics means that we can’t just add up our daily active users each week to get the number of weekly active users, so the way to get here may not feel immediately obvious.
The solution
Fortunately we’ve seen this pattern before!
Before we can build the summary table, we need to build a table with one record per user, per day (regardless of whether they were active that day or not), and fill in columns for:
active_today
active_l7_days
active_l30_days
To solve this, you’ll need to use a date spine and some window functions.
SQL to create this table
with user_active_days as (
select * from {{ ref('user_active_days') }}
),
all_days as (
select * from {{ ref('all_days') }}
),
user_first_active_day as (
select
user_id,
min(date_day) as first_active_day
from user_active_days
group by 1
),
-- use a date spine to fill in missing days
spined as (
select
user_first_active_day.user_id,
all_days.date_day
from user_first_active_day
left join all_days
-- all days after the user was first active
on all_days.date_day >= user_first_active_day.first_active_day
),
filled as (
select
spined.date_day,
spined.user_id,
coalesce(user_active_days.is_active, false) as is_active_today,
max(is_active_today) over (
partition by spined.user_id
order by spined.date_day
rows between 6 preceding and current row
) as is_active_l7_days,
max(is_active_today) over (
partition by spined.user_id
order by spined.date_day
rows between 29 preceding and current row
) as is_active_l30_days
from spined
left join user_active_days
on spined.date_day = user_active_days.date_day
and spined.user_id = user_active_days.user_id
)
select * from filled
date_day | user_id | is_active_today | is_active_l7_days | is_active_l30_days |
---|---|---|---|---|
2020-08-13 | 1 | true | true | true |
2020-08-14 | 1 | false | true | true |
2020-08-15 | 1 | false | true | true |
2020-08-16 | 1 | false | true | true |
2020-08-17 | 1 | false | true | true |
2020-08-18 | 1 | false | true | true |
2020-08-19 | 1 | false | true | true |
2020-08-20 | 1 | false | false | true |
2020-08-21 | 1 | false | false | true |
2020-08-22 | 1 | false | false | true |
… | … | … | … | … |
Once we have this, it’s a reasonably simple hop from that table to our goal table (shown above):
with active_users as (
select * from {{ ref('user_active_days_spined') }}
),
final as (
select
date_day,
sum(is_active_today::integer) as daily_active_users,
sum(is_active_l7_days::integer) as weekly_active_users,
sum(is_active_l30_days::integer) as monthly_active_users
from active_users
group by 1
)
select * from final
order by date_day
See the code
We’ve published some sample code for this over here and deployed the docs for this project here. Feel free to explore! (Note: the code has only been written for Snowflake)