How to identify rows between rows with certain values?


#1

I have some event stream data, and I’d like to detect whether some events fall between other events of known types. I’m familiar with window functions, but I’m struggling to find a query here that doesn’t nest them. I’m running on Redshift.

As an example with the following data, I’d like to know which events occur after an A or E but before an X or Z.

Row Event
1 A
2 B
3 C
4 B
5 Z
6 B
7 C
8 E
9 B
10 X
11 E
12 C
13 B

Transformed, it would look like this:

Row Event Between A/E and X/Z?
1 A No
2 B Yes
3 C Yes
4 B Yes
5 Z No
6 B No
7 C No
8 E No
9 B Yes
10 X No
11 E No
12 C Yes
13 B Yes

Thanks!


#2

I think @drew is going to give a better answer, but I wanted to put forward a solution:

If, for each row, you grab the row numbers of the previous and next A or E and of the previous and next X or Z, you can then compare them.

with prep as (

select
    *,
    max(case when event in ('A','E') then "row" end) over (order by "row" rows between unbounded preceding and 1 preceding)  as previous_ae,
    max(case when event in ('X','Z') then "row" end) over (order by "row" rows between unbounded preceding and 1 preceding)  as previous_xz,
    min(case when event in ('A','E') then "row" end) over (order by "row" rows between 1 following and unbounded following)  as next_ae,
    min(case when event in ('X','Z') then "row" end) over (order by "row" rows between 1 following and unbounded following)  as next_xz
from events  -- substitute your actual table name; "table" is a reserved word

)

select
    "row",
    event,
    previous_ae > previous_xz and next_xz < next_ae as row_between_ae_and_xz
from prep

I think that should replicate the logic you’re looking for. You’ll need to deal with the edge cases where either of the inequalities in the final query involves null values, but I wasn’t sure exactly how you’d want that logic to work.

Obviously, if you had something like a user_id, you would partition by that in the window function.
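If you want to poke at those null edge cases concretely, here’s a quick harness using Python’s built-in sqlite3 (just a sketch — SQLite supports the same window frames; I’ve renamed the table to events and the column to row_num to sidestep reserved-word quoting):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table events (row_num integer, event text)")
# Rows 1..13 with the events from the question: A B C B Z B C E B X E C B
con.executemany(
    "insert into events values (?, ?)",
    list(enumerate("ABCBZBCEBXECB", start=1)),
)

sql = """
with prep as (
    select
        *,
        max(case when event in ('A','E') then row_num end) over (
            order by row_num rows between unbounded preceding and 1 preceding) as previous_ae,
        max(case when event in ('X','Z') then row_num end) over (
            order by row_num rows between unbounded preceding and 1 preceding) as previous_xz,
        min(case when event in ('A','E') then row_num end) over (
            order by row_num rows between 1 following and unbounded following) as next_ae,
        min(case when event in ('X','Z') then row_num end) over (
            order by row_num rows between 1 following and unbounded following) as next_xz
    from events
)
select
    row_num,
    event,
    previous_ae > previous_xz and next_xz < next_ae as row_between_ae_and_xz
from prep
"""
for row_num, event, flag in con.execute(sql):
    print(row_num, event, flag)
```

Running it shows rows 1–4 coming back as NULL, because there’s no preceding X/Z (or A/E) yet — exactly the edge case you have to make a decision about.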


#3

Thank you! This does indeed work. It looks like current row captures the edges better, and some coalescing on the window values gets me a fully non-null boolean.

with raw_events as (
    select 1 as "row", 'A' as event
    union all select 2, 'B'
    union all select 3, 'C'
    union all select 4, 'B'
    union all select 5, 'Z'
    union all select 6, 'B'
    union all select 7, 'C'
    union all select 8, 'E'
    union all select 9, 'B'
    union all select 10, 'X'
    union all select 11, 'E'
    union all select 12, 'C'
    union all select 13, 'B'),

prep as (
select
    *,
    max(case when event in ('A','E') then "row" end) over (order by "row" rows between unbounded preceding and current row)  as previous_ae,
    max(case when event in ('X','Z') then "row" end) over (order by "row" rows between unbounded preceding and current row)  as previous_xz,
    min(case when event in ('A','E') then "row" end) over (order by "row" rows between current row and unbounded following)  as next_ae,
    min(case when event in ('X','Z') then "row" end) over (order by "row" rows between current row and unbounded following)  as next_xz
from raw_events
)

select
    "row",
    event,
    coalesce(previous_ae, 0) > coalesce(previous_xz, 0)
        and coalesce(next_xz, 2147483647) <= coalesce(next_ae, 2147483647) as row_between_ae_and_xz
from prep
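To convince myself the null handling is right, here’s a quick check using Python’s sqlite3 (a sketch — SQLite supports the same window frames; I renamed "row" to row_num and used 2147483647 as a stand-in for “no next event”):

```python
import sqlite3

con = sqlite3.connect(":memory:")

sql = """
with raw_events(row_num, event) as (
    values (1,'A'),(2,'B'),(3,'C'),(4,'B'),(5,'Z'),(6,'B'),(7,'C'),
           (8,'E'),(9,'B'),(10,'X'),(11,'E'),(12,'C'),(13,'B')
),
prep as (
    select
        *,
        max(case when event in ('A','E') then row_num end) over (
            order by row_num rows between unbounded preceding and current row) as previous_ae,
        max(case when event in ('X','Z') then row_num end) over (
            order by row_num rows between unbounded preceding and current row) as previous_xz,
        min(case when event in ('A','E') then row_num end) over (
            order by row_num rows between current row and unbounded following) as next_ae,
        min(case when event in ('X','Z') then row_num end) over (
            order by row_num rows between current row and unbounded following) as next_xz
    from raw_events
)
select
    row_num,
    event,
    coalesce(previous_ae, 0) > coalesce(previous_xz, 0)
        and coalesce(next_xz, 2147483647) <= coalesce(next_ae, 2147483647) as between_flag
from prep
"""
expected = "NYYYNNNNYNNYY"  # the hand-worked answer from the question
actual = "".join("Y" if flag else "N" for _, _, flag in con.execute(sql))
print(actual == expected)  # prints True
```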

Thanks!


#4

Hey @ianterrell - this is a great question! I like @dylanbaker’s approach above, so I’ll instead offer a more general way to think about questions like this.

In general, I find that window functions confuse the heck out of me when I try to do too much at once. This is one of those cases where optimizing for the fewest lines of code makes things more complicated, and you should instead optimize for clarity.

Whenever I see a problem that can be solved with window functions, I try to break it into its simplest components, then build up to a working select statement through several CTEs. This might mean that the eventual solution is much longer than the shortest possible implementation, but it’ll be much easier to reason about and maintain, which I think is a net-positive.

I’ll also add that a certain amount of pattern recognition is helpful for questions like these. This question reminds me of “event sessionization”, in which we try to group events into “sessions” separated by at least 30 minutes of inactivity. It’s not an identical problem, but it’s similar enough that I could use existing work as a good starting point.

Nice work @dylanbaker and thanks for the great question @ianterrell!


#5

I did something similar a couple years back in an operations context:
Given task start and end times and the id of the worker each task is assigned to, determine how many tasks are assigned to each worker at every point in history, then use that to divide the time spent in each time period among that worker’s concurrent tasks.

I don’t remember all the details, but I remember it had a lot of CTEs for exactly the reasons you mention. If I had tried to do it all in one go, it would have been completely unmanageable, so I broke it up so that each CTE was one step of the algorithm, which made it much more readable.
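The counting step, at least, is easy to sketch outside SQL. It amounts to a sweep over start/end events, something like this (plain Python, with all names invented for illustration):

```python
from collections import defaultdict

def concurrent_counts(tasks):
    """Given (worker_id, start, end) tuples, return, per worker, a list of
    (time, concurrent_task_count) change points, sweep-line style."""
    events = defaultdict(list)  # worker_id -> [(time, +1 or -1), ...]
    for worker, start, end in tasks:
        events[worker].append((start, +1))   # a task begins
        events[worker].append((end, -1))     # a task ends
    result = {}
    for worker, evs in events.items():
        evs.sort()
        running, points = 0, []
        for time, delta in evs:
            running += delta                 # running count of open tasks
            points.append((time, running))
        result[worker] = points
    return result

counts = concurrent_counts([("w1", 0, 10), ("w1", 5, 15), ("w2", 2, 4)])
print(counts["w1"])  # [(0, 1), (5, 2), (10, 1), (15, 0)]
```

Each change point tells you how many tasks were open for that worker from that time forward, which is the input you need for splitting time among concurrent tasks.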


#6

Hey @drew do you happen to have some resources on the “event sessionization” problem?


#8

Hey @zeevs: I like to do this with window functions.

-- postgres example
with source as (

	select 1 as user_id, '2018-01-01 12:00:01'::timestamp as event_time, 'A' as event union all
	select 1 as user_id, '2018-01-01 12:00:10' as event_time, 'B' as event union all
	select 1 as user_id, '2018-01-01 12:45:00' as event_time, 'C' as event union all
	select 1 as user_id, '2018-01-01 12:46:00' as event_time, 'D' as event

),

with_prev as (

	select *,
		lag(event_time) over (partition by user_id order by event_time) as prev_event_time
		
	from source
	
),

session_increments as (

	select
		*,
		case
			when event_time - prev_event_time > interval '30 minute' then 1
			else 0
		end as is_new_session
		
	from with_prev
	
),

sessionized as (

	select
		*,
		sum(is_new_session) over (
			partition by user_id
			order by event_time
			rows between unbounded preceding and current row
		) as session_index
		
	from session_increments
	
)

select
	*,
	user_id || '.' || session_index as session_id
	
from sessionized

You can kind of see how this is similar to @ianterrell’s original question. I took a stab at solving his problem using an approach like this. It didn’t turn out too great, but I think this sort of paradigm is a useful way of thinking about the problem!
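If you want to play with the idea outside of a database, the same lag-then-running-sum logic is only a few lines of Python (a sketch using the same toy data as above; the 0-based session index matches what the SQL’s sum produces):

```python
from datetime import datetime, timedelta

events = [
    ("2018-01-01 12:00:01", "A"),
    ("2018-01-01 12:00:10", "B"),
    ("2018-01-01 12:45:00", "C"),
    ("2018-01-01 12:46:00", "D"),
]

session_timeout = timedelta(minutes=30)
session_index, prev_time = 0, None
sessionized = []
for ts, event in events:
    event_time = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    # lag + case: bump the session index whenever the gap exceeds 30 minutes
    if prev_time is not None and event_time - prev_time > session_timeout:
        session_index += 1
    sessionized.append((event, session_index))
    prev_time = event_time

print(sessionized)  # [('A', 0), ('B', 0), ('C', 1), ('D', 1)]
```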


#9

Thanks @drew, I get this part; it’s pretty straightforward. I wonder if there are any best practices for handling this in an incremental model, with late-arriving data, etc. (of course I can google :slight_smile: )