How to identify rows between rows with certain values?


#1

I have some event stream data, and I’d like to detect whether some events fall between other events of known types. I’m familiar with window functions, but I’m struggling to find a query here that doesn’t nest them. I’m running on Redshift.

As an example with the following data, I’d like to know which events occur after an A or E but before an X or Z.

Row Event
1 A
2 B
3 C
4 B
5 Z
6 B
7 C
8 E
9 B
10 X
11 E
12 C
13 B

Transformed, it would look like this:

Row Event Between A/E and X/Z?
1 A No
2 B Yes
3 C Yes
4 B Yes
5 Z No
6 B No
7 C No
8 E No
9 B Yes
10 X No
11 E No
12 C Yes
13 B Yes

Thanks!


#2

I think @drew is going to give a better answer, but I wanted to put forward a solution:

If, for each row, you grab the row numbers of the previous and next A or E and of the previous and next X or Z, you can then compare them.

with prep as (

select
    *,
    max(case when event in ('A','E') then "row" end) over (order by "row" rows between unbounded preceding and 1 preceding)  as previous_ae,
    max(case when event in ('X','Z') then "row" end) over (order by "row" rows between unbounded preceding and 1 preceding)  as previous_xz,
    min(case when event in ('A','E') then "row" end) over (order by "row" rows between 1 following and unbounded following)  as next_ae,
    min(case when event in ('X','Z') then "row" end) over (order by "row" rows between 1 following and unbounded following)  as next_xz
from events  -- substitute your actual table name; "table" is a reserved word

)

select
    "row",
    event,
    previous_ae > previous_xz and next_xz < next_ae as row_between_ae_and_xz
from prep

I think that should replicate the logic you’re looking for. You’ll need to deal with the edge cases where either of the inequalities in the final query involves null values, but I wasn’t sure exactly how you’d want that logic to work.

Obviously, if you had something like a user_id, you would partition by that in the window function.
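If you want to poke at those null edge cases concretely, here’s a quick harness using Python’s built-in sqlite3 (just a sketch — SQLite supports the same window frames; I’ve renamed the table to events and the column to row_num to sidestep reserved-word quoting):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table events (row_num integer, event text)")
# Rows 1..13 with the events from the question: A B C B Z B C E B X E C B
con.executemany(
    "insert into events values (?, ?)",
    list(enumerate("ABCBZBCEBXECB", start=1)),
)

sql = """
with prep as (
    select
        *,
        max(case when event in ('A','E') then row_num end) over (
            order by row_num rows between unbounded preceding and 1 preceding) as previous_ae,
        max(case when event in ('X','Z') then row_num end) over (
            order by row_num rows between unbounded preceding and 1 preceding) as previous_xz,
        min(case when event in ('A','E') then row_num end) over (
            order by row_num rows between 1 following and unbounded following) as next_ae,
        min(case when event in ('X','Z') then row_num end) over (
            order by row_num rows between 1 following and unbounded following) as next_xz
    from events
)
select
    row_num,
    event,
    previous_ae > previous_xz and next_xz < next_ae as row_between_ae_and_xz
from prep
"""
for row_num, event, flag in con.execute(sql):
    print(row_num, event, flag)
```

Running it shows rows 1–4 coming back as NULL, because there’s no preceding X/Z (or A/E) yet — exactly the edge case you have to make a decision about.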


#3

Thank you! This does indeed work. It looks like current row captures the edges better, and some coalescing on the window values gets me a fully non-null boolean.

with raw_events as (
    select 1 as "row", 'A' as event
    union all select 2, 'B'
    union all select 3, 'C'
    union all select 4, 'B'
    union all select 5, 'Z'
    union all select 6, 'B'
    union all select 7, 'C'
    union all select 8, 'E'
    union all select 9, 'B'
    union all select 10, 'X'
    union all select 11, 'E'
    union all select 12, 'C'
    union all select 13, 'B'),

prep as (
select
    *,
    max(case when event in ('A','E') then "row" end) over (order by "row" rows between unbounded preceding and current row)  as previous_ae,
    max(case when event in ('X','Z') then "row" end) over (order by "row" rows between unbounded preceding and current row)  as previous_xz,
    min(case when event in ('A','E') then "row" end) over (order by "row" rows between current row and unbounded following)  as next_ae,
    min(case when event in ('X','Z') then "row" end) over (order by "row" rows between current row and unbounded following)  as next_xz
from raw_events
)

select
    "row",
    event,
    coalesce(previous_ae, 0) > coalesce(previous_xz, 0)
        and coalesce(next_xz, 2147483647) <= coalesce(next_ae, 2147483647) as row_between_ae_and_xz
from prep
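To convince myself the null handling is right, here’s a quick check using Python’s sqlite3 (a sketch — SQLite supports the same window frames; I renamed "row" to row_num and used 2147483647 as a stand-in for “no next event”):

```python
import sqlite3

con = sqlite3.connect(":memory:")

sql = """
with raw_events(row_num, event) as (
    values (1,'A'),(2,'B'),(3,'C'),(4,'B'),(5,'Z'),(6,'B'),(7,'C'),
           (8,'E'),(9,'B'),(10,'X'),(11,'E'),(12,'C'),(13,'B')
),
prep as (
    select
        *,
        max(case when event in ('A','E') then row_num end) over (
            order by row_num rows between unbounded preceding and current row) as previous_ae,
        max(case when event in ('X','Z') then row_num end) over (
            order by row_num rows between unbounded preceding and current row) as previous_xz,
        min(case when event in ('A','E') then row_num end) over (
            order by row_num rows between current row and unbounded following) as next_ae,
        min(case when event in ('X','Z') then row_num end) over (
            order by row_num rows between current row and unbounded following) as next_xz
    from raw_events
)
select
    row_num,
    event,
    coalesce(previous_ae, 0) > coalesce(previous_xz, 0)
        and coalesce(next_xz, 2147483647) <= coalesce(next_ae, 2147483647) as between_flag
from prep
"""
expected = "NYYYNNNNYNNYY"  # the hand-worked answer from the question
actual = "".join("Y" if flag else "N" for _, _, flag in con.execute(sql))
print(actual == expected)  # prints True
```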

Thanks!


#4

Hey @ianterrell - this is a great question! I like @dylanbaker’s approach above, so I’ll instead offer a more general way to think about questions like this.

In general, I find that window functions confuse the heck out of me when I try to do too much at once. This is one of those cases where optimizing for the fewest lines of code makes things more complicated, and you should instead optimize for clarity.

Whenever I see a problem that can be solved with window functions, I try to break it into its simplest components, then build up to a working select statement through several CTEs. This might mean that the eventual solution is much longer than the shortest possible implementation, but it’ll be much easier to reason about and maintain, which I think is a net-positive.

I’ll also add that a certain amount of pattern recognition is helpful for questions like these. This question reminds me of “event sessionization”, in which we try to group events into “sessions” separated by at least 30 minutes of inactivity. It’s not an identical problem, but it’s similar enough that I could use existing work as a good starting point.

Nice work @dylanbaker and thanks for the great question @ianterrell!


#5

I did something similar a couple years back in an operations context:
Given task start and end times and the id of the worker each task is assigned to, determine how many tasks are assigned to each worker at every point in history, then use that to divide the time spent in each time period among that worker’s concurrent tasks.

I don’t remember all the details, but I remember it had a lot of CTEs for exactly the reasons you mention. If I had tried to do it all in one go, it would have been completely unmanageable, so I broke it up so that each CTE was one step of the algorithm, which made it much more readable.
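The counting step, at least, is easy to sketch outside SQL. It amounts to a sweep over start/end events, something like this (plain Python, with all names invented for illustration):

```python
from collections import defaultdict

def concurrent_counts(tasks):
    """Given (worker_id, start, end) tuples, return, per worker, a list of
    (time, concurrent_task_count) change points, sweep-line style."""
    events = defaultdict(list)  # worker_id -> [(time, +1 or -1), ...]
    for worker, start, end in tasks:
        events[worker].append((start, +1))   # a task begins
        events[worker].append((end, -1))     # a task ends
    result = {}
    for worker, evs in events.items():
        evs.sort()
        running, points = 0, []
        for time, delta in evs:
            running += delta                 # running count of open tasks
            points.append((time, running))
        result[worker] = points
    return result

counts = concurrent_counts([("w1", 0, 10), ("w1", 5, 15), ("w2", 2, 4)])
print(counts["w1"])  # [(0, 1), (5, 2), (10, 1), (15, 0)]
```

Each change point tells you how many tasks were open for that worker from that time forward, which is the input you need for splitting time among concurrent tasks.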


#6

Hey @drew do you happen to have some resources on the “event sessionization” problem?


#8

Hey @zeevs: I like to do this with window functions.

-- postgres example
with source as (

	select 1 as user_id, '2018-01-01 12:00:01'::timestamp as event_time, 'A' as event union all
	select 1 as user_id, '2018-01-01 12:00:10' as event_time, 'B' as event union all
	select 1 as user_id, '2018-01-01 12:45:00' as event_time, 'C' as event union all
	select 1 as user_id, '2018-01-01 12:46:00' as event_time, 'D' as event

),

with_prev as (

	select *,
		lag(event_time) over (partition by user_id order by event_time) as prev_event_time
		
	from source
	
),

session_increments as (

	select
		*,
		case
			when event_time - prev_event_time > interval '30 minute' then 1
			else 0
		end as is_new_session
		
	from with_prev
	
),

sessionized as (

	select
		*,
		sum(is_new_session) over (
			partition by user_id
			order by event_time
			rows between unbounded preceding and current row
		) as session_index
		
	from session_increments
	
)

select
	*,
	user_id || '.' || session_index as session_id
	
from sessionized

You can kind of see how this is similar to @ianterrell’s original question. I took a stab at solving his problem using an approach like this. It didn’t turn out too great, but I think this sort of paradigm is a useful way of thinking about the problem!
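If you want to play with the idea outside of a database, the same lag-then-running-sum logic is only a few lines of Python (a sketch using the same toy data as above; the 0-based session index matches what the SQL’s sum produces):

```python
from datetime import datetime, timedelta

events = [
    ("2018-01-01 12:00:01", "A"),
    ("2018-01-01 12:00:10", "B"),
    ("2018-01-01 12:45:00", "C"),
    ("2018-01-01 12:46:00", "D"),
]

session_timeout = timedelta(minutes=30)
session_index, prev_time = 0, None
sessionized = []
for ts, event in events:
    event_time = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    # lag + case: bump the session index whenever the gap exceeds 30 minutes
    if prev_time is not None and event_time - prev_time > session_timeout:
        session_index += 1
    sessionized.append((event, session_index))
    prev_time = event_time

print(sessionized)  # [('A', 0), ('B', 0), ('C', 1), ('D', 1)]
```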


#9

Thanks @drew, I get this part; it’s pretty straightforward. I wonder if there are any best practices for handling this in an incremental model, with late-arriving data, etc. (of course I can google :slight_smile: )