Modeling changes to daily/weekly/monthly active users

Jump straight to the project docs, or to the source code

This article was prompted by a question from a team member :slight_smile:

Our data

Let’s say we have a table with one record per user, per day they were active:

user_id date_day is_active
1 2020-08-01 true
1 2020-09-01 true
1 2020-09-04 true
1 2020-09-05 true
1 2020-09-12 true
1 2020-09-17 true
2 2020-09-01 true

The problem

We’ve been asked to create reports for daily, weekly, and monthly active users.

On the surface, we know that we could calculate daily active users like this:

Daily active users

select
   date_day,
   count(distinct user_id) as daily_active_users
from {{ ref('user_active_days') }}
group by 1
order by 1
date_day daily_active_users
2020-08-13 1
2020-09-01 2
2020-09-04 1
2020-09-05 1

(Note, we haven’t filled in the missing days, which might be something worth doing :wink: )

We can follow similar patterns for weekly and monthly active users, to generate outputs like this.

Weekly active users SQL
select
   date_trunc('week', date_day) as date_week,
   count(distinct user_id) as weekly_active_users
from {{ ref('user_active_days') }}
group by 1
order by 1
date_week weekly_active_users
2020-08-10 1
2020-08-17 1
2020-08-24 1
2020-08-31 2
2020-09-07 1
2020-09-14 1
Monthly active users SQL
select
   date_trunc('month', date_day) as date_month,
   count(distinct user_id) as monthly_active_users
from {{ ref('user_active_days') }}
group by 1
order by 1
date_month monthly_active_users
2020-08-01 1
2020-09-01 2

However, wouldn’t it be :cool: if we could see this changing on a daily basis? That can feel strange when thinking about “monthly active users” for a day in the middle of the month, but essentially we want to get to something like this — for each day:

  • How many active users were there today?
  • How many active users were there in the last week / 7 days?
  • How many active users were there in the last month / 30 days?

:sparkles: GOAL TABLE :sparkles:

date_day daily_active_users weekly_active_users monthly_active_users
2020-09-01 2 2 2
2020-09-02 0 2 2
2020-09-03 0 2 2
2020-09-04 1 2 2
2020-09-05 1 2 2
2020-09-06 0 2 2
2020-09-07 0 2 2
2020-09-08 0 1 2
2020-09-09 0 1 2

The “semi-additive” nature of these metrics means that we can’t just add up our daily active users each week to get the number of weekly active users, so the way to get here may not feel immediately obvious.

The solution

Fortunately we’ve seen this pattern before!

Before we can build the summary table, we need to build a table with one record per user, per day (regardless of whether they were active that day or not), and fill in columns for:

  • active_today
  • active_l7_days
  • active_l30_days

To solve this, you’ll need to use a date spine and some window functions.

SQL to create this table
with user_active_days as (
    select * from {{ ref('user_active_days') }}
),

all_days as (
    select * from {{ ref('all_days') }}
),

user_first_active_day as (
    select
        user_id,
        min(date_day) as first_active_day
    from user_active_days
    group by 1
),

-- use a date spine to fill in missing days
spined as (
    select
        user_first_active_day.user_id,
        all_days.date_day
    from user_first_active_day
    left join all_days
        -- all days after the user was first active
        on all_days.date_day >= user_first_active_day.first_active_day
),

filled as (
    select
        spined.date_day,
        spined.user_id,

        coalesce(user_active_days.is_active, false) as is_active_today,

        max(is_active_today) over (
            partition by spined.user_id
            order by spined.date_day
            rows between 6 preceding and current row
        ) as is_active_l7_days,

        max(is_active_today) over (
            partition by spined.user_id
            order by spined.date_day
            rows between 29 preceding and current row
        ) as is_active_l30_days

    from spined

    left join user_active_days
        on spined.date_day = user_active_days.date_day
        and spined.user_id = user_active_days.user_id
)

select * from filled

date_day user_id is_active_today is_active_l7_days is_active_l30_days
2020-08-13 1 true true true
2020-08-14 1 false true true
2020-08-15 1 false true true
2020-08-16 1 false true true
2020-08-17 1 false true true
2020-08-18 1 false true true
2020-08-19 1 false true true
2020-08-20 1 false false true
2020-08-21 1 false false true
2020-08-22 1 false false true

Once we have this, it’s a reasonably simple hop from that table to our goal table (shown above):

with active_users as (
    select * from {{ ref('user_active_days_spined') }}
),

final as (
    select
        date_day,

        sum(is_active_today::integer) as daily_active_users,
        sum(is_active_l7_days::integer) as weekly_active_users,
        sum(is_active_l30_days::integer) as monthly_active_users

    from active_users

    group by 1
)

select * from final
order by date_day

See the code

We’ve published some sample code for this over here and deployed the docs for this project here. Feel free to explore! (Note: the code has only been written for Snowflake)

5 Likes

What if someone logs in again? Since you’re pulling the first time someone logged in, I’m missing the computation that resets the “clock” if they log in again.

Do you mean multiple times in the same day? The initial table is at the grain of user and day:

This line:

Is to facilitate the date spine - by doing an inequality join you can get a row for every day even if they didn’t log in. See Finding active days for a subscription/user/account (date spining) for a more in depth discussion