How to create group identifiers with repeating grouping criteria?


#1

Can anyone share some window function magic on uniquely identifying groups of sequential values, where I might have some repeated values later in the set?

For instance, given the events below with grouping values X and Y and a natural order, how do I group them so that every time X or Y changes, following that order, I get a new group identifier?

Example input:

X | Y | data | order
--------------------
A | B |   1  |   1
A | B |   2  |   2
A | C |   1  |   3
B | C |   3  |   4
A | B |   2  |   5
A | C |   1  |   6
B | C |   3  |   7

Desired output:

X | Y | data | order | group
-----------------------------
A | B |   1  |   1   |   1
A | B |   2  |   2   |   1
A | C |   1  |   3   |   2
B | C |   3  |   4   |   3
A | B |   2  |   5   |   4
A | C |   1  |   6   |   5
B | C |   3  |   7   |   6

I can tell when X or Y changes with lag, and I can create group identifiers without repeats with rank() over (order by X, Y), but I can’t quite figure out how to do both at once.

Thanks!


#2

@drew gave great advice in Slack:

"I’d start by making a sort of surrogate key to combine X and Y into a single column. Then, lag that field to determine when changes occur. If a change occurred, emit a 1, otherwise, emit a 0, and call it is_new_value. Finally, sum over the output of is_new_value to get a group. I… think that will work?"

This strategy of counting changes works exactly right. Here’s a functioning example:

with base as (
            select 'A' as x, 'B' as y, 1 as d, 1 as o
  union all select 'A' as x, 'B' as y, 2 as d, 2 as o
  union all select 'A' as x, 'C' as y, 1 as d, 3 as o
  union all select 'B' as x, 'C' as y, 3 as d, 4 as o
  union all select 'A' as x, 'B' as y, 2 as d, 5 as o
  union all select 'A' as x, 'C' as y, 1 as d, 6 as o
  union all select 'B' as x, 'C' as y, 3 as d, 7 as o
),

keyed as (
  select
      *
      -- the delimiter keeps pairs like ('AB', 'C') and ('A', 'BC') from producing the same key
    , x || '|' || y as grouping_key
  from base
),

lagged as (
  select
      *
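    , case
        -- the first row has no previous key, so the comparison is null; coalesce treats that as a change
        when coalesce(grouping_key != lag(grouping_key, 1) over (order by o), true) then 1
        else 0
      end as is_new_value
  from keyed
)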

select 
    x
  , y
  , d
  , o
  , sum(is_new_value) over (order by o 
                            rows between unbounded preceding and current row) as group_id
from lagged
order by o
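
If you'd rather skip the surrogate key, here is a sketch of a shorter variant that compares x and y to their lagged values directly. It assumes a Postgres-style dialect; the CTE name flagged is my own. One caveat: a null in x or y never compares equal to anything, so under this version every row containing a null starts a new group.

```sql
with base as (
            select 'A' as x, 'B' as y, 1 as d, 1 as o
  union all select 'A' as x, 'B' as y, 2 as d, 2 as o
  union all select 'A' as x, 'C' as y, 1 as d, 3 as o
  union all select 'B' as x, 'C' as y, 3 as d, 4 as o
  union all select 'A' as x, 'B' as y, 2 as d, 5 as o
  union all select 'A' as x, 'C' as y, 1 as d, 6 as o
  union all select 'B' as x, 'C' as y, 3 as d, 7 as o
),

flagged as (
  select
      *
    , case
        -- same group only when both columns match the previous row;
        -- on the first row lag() is null, so this falls through to 1
        when lag(x) over (order by o) = x
         and lag(y) over (order by o) = y then 0
        else 1
      end as is_new_value
  from base
)

select
    x
  , y
  , d
  , o
    -- running count of changes = group identifier
  , sum(is_new_value) over (order by o
                            rows between unbounded preceding and current row) as group_id
from flagged
order by o
```

On the sample data this gives group_id values 1, 1, 2, 3, 4, 5, 6, matching the desired output in the question.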