How to create group identifiers with repeating grouping criteria?

ianterrell · October 30, 2018, 8:23pm

Can anyone share some window function magic on uniquely identifying groups of sequential values, where I might have some repeated values later in the set?

For instance, given the events below with grouping values X and Y and a natural ordering in the order column, how do I group them so that every time X or Y changes, reading in that order, I get a new group identifier?

Example input:

X | Y | data | order
--------------------
A | B |   1  |   1
A | B |   2  |   2
A | C |   1  |   3
B | C |   3  |   4
A | B |   2  |   5
A | C |   1  |   6
B | C |   3  |   7

Desired output:

X | Y | data | order | group
-----------------------------
A | B |   1  |   1   |   1
A | B |   2  |   2   |   1
A | C |   1  |   3   |   2
B | C |   3  |   4   |   3
A | B |   2  |   5   |   4
A | C |   1  |   6   |   5
B | C |   3  |   7   |   6

I can tell when X or Y changes with lag, and I can create group identifiers without repeats with rank() over (order by X, Y), but I can’t quite figure out how to do both at once.
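
To make the gap concrete, here is a minimal sketch of the rank() attempt against the sample rows. It assumes Postgres-style SQL, and the short column names d and o are placeholders for data and order (order being a reserved word).

```sql
-- Sample rows from the example input; d and o are assumed names for data and order.
with base as (
            select 'A' as x, 'B' as y, 1 as d, 1 as o
  union all select 'A', 'B', 2, 2
  union all select 'A', 'C', 1, 3
  union all select 'B', 'C', 3, 4
  union all select 'A', 'B', 2, 5
  union all select 'A', 'C', 1, 6
  union all select 'B', 'C', 3, 7
)

select
    x
  , y
  , d
  , o
    -- rank() depends only on the (x, y) values, not on where they sit in the sequence
  , rank() over (order by x, y) as rank_id
from base
order by o
```

Read in o order, rank_id comes out as 1, 1, 4, 6, 1, 4, 6. The A/B row at o = 5 shares rank 1 with the first two rows, so the later run is not split into its own group.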

Thanks!

ianterrell · October 30, 2018, 8:35pm

@drew gave great advice in Slack:

I’d start by making a sort of surrogate key to combine X and Y into a single column. Then, lag that field to determine when changes occur. If a change occurred, emit a 1, otherwise, emit a 0 — call it is_new_value. Finally, sum over the output of the is_new_value to get a group. I…. think that will work?

This strategy of counting changes works exactly right. Here’s a functioning example:

with base as (
            select 'A' as x, 'B' as y, 1 as d, 1 as o
  union all select 'A' as x, 'B' as y, 2 as d, 2 as o
  union all select 'A' as x, 'C' as y, 1 as d, 3 as o
  union all select 'B' as x, 'C' as y, 3 as d, 4 as o
  union all select 'A' as x, 'B' as y, 2 as d, 5 as o
  union all select 'A' as x, 'C' as y, 1 as d, 6 as o
  union all select 'B' as x, 'C' as y, 3 as d, 7 as o
),

keyed as (
  select
      *
    -- the delimiter keeps ('AB', 'C') distinct from ('A', 'BC')
    , x || '|' || y as grouping_key
  from base
),

lagged as (
  select
      *
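    , case 
        -- the first row has no previous key, so the comparison is null and coalesce treats it as a change
        when coalesce(grouping_key != lag(grouping_key, 1) over (order by o), true) then 1 
        else 0
      end as is_new_value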
  from keyed
 )

select 
    x
  , y
  , d
  , o
  , sum(is_new_value) over (order by o 
                            rows between unbounded preceding and current row) as group_id
from lagged
order by o
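
A possible variant, sketched here rather than taken from the advice above, skips the surrogate key and compares x and y to their lagged values directly. It assumes a dialect that supports is distinct from (Postgres does; check your warehouse), and the CTE name flagged is just a placeholder. The null-safe comparison means a null in x or y is treated as a value of its own instead of nulling out the key.

```sql
with base as (
            select 'A' as x, 'B' as y, 1 as d, 1 as o
  union all select 'A', 'B', 2, 2
  union all select 'A', 'C', 1, 3
  union all select 'B', 'C', 3, 4
  union all select 'A', 'B', 2, 5
  union all select 'A', 'C', 1, 6
  union all select 'B', 'C', 3, 7
),

flagged as (
  select
      *
    , case
        -- first row: there is no previous row (assumes o itself is never null)
        when lag(o) over (order by o) is null then 1
        -- null-safe comparisons against the previous row's values
        when x is distinct from lag(x) over (order by o) then 1
        when y is distinct from lag(y) over (order by o) then 1
        else 0
      end as is_new_value
  from base
)

select
    x
  , y
  , d
  , o
    -- running total of change flags gives the group identifier
  , sum(is_new_value) over (order by o
                            rows between unbounded preceding and current row) as group_id
from flagged
order by o
```

On the sample rows this should give group_id values 1, 1, 2, 3, 4, 5, 6, matching the desired output.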
