Grouping events into intervals using hedgehog rolling style

- I have a challenging SQL problem for you, – yelled one of my coworkers one morning entering my office and dropping on his way a few pictures, SQL server system views map and a visitors chair. – Much more interesting than all your alerts and twitter conversations! ( “Well”, I thought to myself, “it might beat the alerts but not the twitter conversations” )

The problem presented by him is as follows:

There is a table containing information about a device’s, like phones or tablets, history of connections to hotspots as shown below:

The idea was to bind together the above mentioned events into groups. Each group, or we can call it - an interval, should represent the time that any device was connected to the specific access point. Each time the device has moved to the new access point, a new interval should start. If there were no events for the specific device longer than 30 min, the next event should start a new interval even if the new event was reported on the same access point as before.

Queries will eventually slice the data to see how many users were on the same access point at the specific time, average time devices stayed connected to specific access_point and other similar business questions.

The matter was indeed an attention-grabbing one since the desired result cannot be achieved by using a regular GROUP BY logic.

First of all, here is a test data:

SELECT *

INTO test_events

FROM (VALUES (1,1,10,'20130101 10:01'),

(3,2 , 10 ,'20130101 10:03'),

(4,3 , 10 ,'20130101 10:04'),

(5,1 , 10,'20130101 10:05'),

(6,1 , 11,'20130101 10:09'),

(7,2 , 12,'20130101 10:06'),

(8,3 , 10,'20130101 10:10'),

(9,1 , 11,'20130101 10:40'),

(10,3 , 10,'20130101 10:47'),

(11,3 , 10 ,'20130101 10:52'),

(12,3 , 10 ,'20130101 11:05'),

(13,2 , 10,'20130101 10:53')

)vals(id,hardware_id,access_point,datein);

The most trivial and easy algorithm would be to travel through the table, ordered by devices and the time. For each new event in the resultlset, access point and hardware_id should be analyzed. If the access point is different from the device’s previous event or more than 30 min have passed since the previous event, we will consider it as an interval start and use it’s own id as an Interval id. Keeping the Interval id we will go to the next event. If the access point is the same, use a previous event id as an Interval Id.

This logic is easily implemented using a cursor. I haven’t used the cursors for the last 10 years but I wondered what would be the execution times.

I have added two columns to the table, IntervalId to group the events into the intervals and the IntervalDateEnd to add information for when the interval ends in order to satisfy the above-mentioned queries.

ALTER TABLE test_events ADD IntervalId int;

ALTER TABLE test_events ADD IntervalDateEnd datetime ;

The first cursor was using the aforesaid logic as is and had updated the table on every iteration. On a table with on 250000 rows I have stopped it after 4 hours, when less than half of the rows have got updated.

My next cursor implementation was a bit different. First of all I have created a temporary table. Instead of updating the table on every iteration, I added a new row to the temporary table grasping the event id and Intervalid. When the cursor finished I updated the main table by joining it to the temporary results table. This solution took 1 min ( ~ 35 sec cursor and ~30 sec table update) on table with 250000 rows and 10 min on table with 500000 rows. The test was performed on a fairly slow VM machine so the execution times reflect that.

The execution is quite heavy, so if you are using the cursors please make sure you are using them wisely. Differences between the cursors are marked in yellow.

-----------Better cursor----------

DECLARE@idint,@accessPointint,@hardware_idvarchar(50),@timestampdatetime

DECLARE@parentIdint,@parentAccessPointint,@ParentHardwareIDvarchar(50),@ParentTimeStamp datetime

CREATETABLE#Mapping(Idint, IntervalIdint,IntervalDateEnddatetime);

DECLAREEventLoopCURSORREAD_ONLYSTATIC FOR

SELECTid,access_point,hardware_id,datein

FROMtest_events

ORDERBYhardware_id,datein;

OPENEventLoop

FETCHNEXTFROMEventLoop

INTO@id,@accessPoint,@hardware_id,@timestamp

WHILE@@FETCH_STATUS= 0

BEGIN

IF (@hardware_id=@ParentHardwareID

AND@accessPoint=@parentAccessPoint

ANDDATEDIFF(mi,@ParentTimeStamp,@timestamp)<= 30 )BEGIN

INSERTINTO#Mapping(Id,IntervalId)SELECT @id,@parentId

END

ELSEBEGIN

INSERTINTO#Mapping(Id,IntervalId) SELECT @id,@id

UPDATE#Mapping

SETIntervalDateEnd=@ParentTimeStamp

WHEREid=@parentId

SELECT @parentId=@id

END

SELECT

@parentAccessPoint=@accessPoint,

@ParentHardwareID=@hardware_id,

@ParentTimeStamp=@timestamp;

FETCHNEXTFROMEventLoop

INTO@id,@accessPoint,@hardware_id,@timestamp

END

UPDATE#Mapping

SETIntervalDateEnd=@ParentTimeStamp

WHEREid=@parentId

CLOSEEventLoop;

DEALLOCATEEventLoop;

UPDATEe

SET e.IntervalId= m.IntervalId,

e.IntervalDateEnd = m.IntervalDateEnd

FROM test_eventse

JOIN#Mappingm

ONe.id=m.id

-----------BAD cursor----------

DECLARE@idint,@accessPointint,@hardware_idvarchar(50),@timestampdatetime

DECLARE@parentIdint,@parentAccessPointint,@ParentHardwareIDvarchar(50),@ParentTimeStamp datetime

DECLAREEventLoop CURSORREAD_ONLYSTATICFOR

SELECTid,access_point,hardware_id,datein

FROMtest_events

ORDERBYhardware_id,datein;

OPENEventLoop

FETCHNEXTFROMEventLoop

INTO@id,@accessPoint,@hardware_id,@timestamp

WHILE@@FETCH_STATUS= 0

BEGIN

IF (@hardware_id=@ParentHardwareID

AND@accessPoint=@parentAccessPoint

ANDDATEDIFF(mi,@ParentTimeStamp,@timestamp)<= 30 )BEGIN

UPDATEtest_events

SET IntervalId=@parentId

WHEREid=@id

END

ELSEBEGIN

UPDATEtest_events

SET IntervalId=@id

WHEREid=@id

UPDATEtest_events

SETIntervalDateEnd=@ParentTimeStamp

WHEREid=@parentId

SELECT @parentId=@id

END

SELECT

@parentAccessPoint=@accessPoint,

@ParentHardwareID=@hardware_id,

@ParentTimeStamp=@timestamp;

FETCHNEXTFROMEventLoop

INTO@id,@accessPoint,@hardware_id,@timestamp

END

UPDATEtest_events

SETIntervalDateEnd=@ParentTimeStamp

WHEREid=@parentId

CLOSEEventLoop;

DEALLOCATEEventLoop;

I’m habitually trying to avoid cursors for the reason that in most cases they will require more memory and produce much more locks. Therefore I needed to find a true set based solution which, as I have believed, additionally would have much better execution times.

That’s where I came up with 4 steps logic.

1st step. Find all interval starters.

By thinking about the data you can see that each event can belong to only one of the two possible statuses: Start the interval or continue the interval. We can find this by comparing each event to the previous event that was reported for the same device. I can achieve this by using LAG() window function to bring the previous event. Here you can find more information about the analytic functions.

Here is a query:

UPDATEr1

SET r1.IntervalId=r2.IntervalId

FROMtest_eventsr1

JOIN(SELECTid,IntervalId =

CASEWHEN

lag(access_point,1)OVER (PARTITIONBYhardware_idORDERBYdatein)=access_point

and

DATEDIFF(mi,

lag(datein,1)OVER (PARTITIONBYhardware_idORDERBYdatein),

datein

)< 30

THEN

NULL

ELSEid

END

FROMtest_events

WHEREIntervalIdISNULL)r2

onr1.id=r2.id

WHEREr1.IntervalIdISNULL

Unfortunately I can use windowed functions only in SELECT or ORDER BY clause, otherwise I could have used it in the SET clause of the update and the query would have been even lighter.

This step took 7 sec on 250000 rows and 10 sec on the 500000 rows.

Now I need to link the rest of the events to their intervals. In order to do that I can use this simple query that for each event that is continuing the interval find the closest event that is an interval starter using sequential id column which is PK of the table.

UPDATEe1

SETIntervalId=(SELECTtop 1 IntervalId

FROMtest_eventse2

WHEREe2.hardware_id=e1.hardware_idande2.access_point=e1.access_pointande2.id<e1.id

andIntervalIdisnotnull

ORDERBYe2.dateindesc)

FROMtest_eventse1

WHEREIntervalIdISNULL

Although at the first look this query looks better than the cursor, in fact, is not entirely a set based query. For each event in the table, the subquery searches for the interval starter by going back to the table. This step took 2 minutes on table with 250000 rows and 6 min on table with 500000 rows.

Which means I need another solution.

If you take a look at the example below, you can clearly see that the intervals belong to one of the two categories

Closed intervals: there are other appearances of the same hardware_id on the different access_point ( marked in blue )

Open intervals : there are none other appearances of the same hardware_id on the different access_point ( marked in green)

2nd step. Link the events within the closed intervals.

For each start of the interval find the next start interval’s starting time. We can easily find the next value by using the LEAD() window function.

Together with the current timestamp they define the interval boundaries which will be between the CurrentTime and LESS than NextIntervalStartTime.

This step took 15 sec on 250000 table and 25 sec on 500000 table.

UPDATEe1

SETe1.IntervalId=e2.IntervalId

FROMtest_eventse1

JOIN(select*,NextEventStart = LEAD(datein)OVER (PARTITIONBYhardware_idORDERBYdatein)

FROMtest_events

WHEREIntervalIdISNOTNULL)e2

ONe1.hardware_id=e2.hardware_idande1.access_point=e2.access_point

ande1.datein>=e2.dateinande1.datein<e2.NextEventStart

WHEREe1.IntervalIdISNULL

3rd step. Link the events within the open intervals.

For each start of the interval we will need to find the last timestamp for the same device on the same access point. Analytical function LAST_VALUE() will help us to achieve that. Take into the consideration that the default range of this function includes only preceding events so you need to explicitly specify RANGEBETWEENCURRENTROWANDUNBOUNDEDFOLLOWING option.

Together with the current timestamp they define the interval boundaries which lay between the CurrentTime and the LastTimeForThisDevice

UPDATE e1

SET e1.IntervalId=e2.IntervalId

FROM test_eventse1

JOIN(SELECT *, LastTimeForThisDevice = LAST_VALUE(datein) OVER (PARTITION BY hardware_id,access_point ORDER BY datein RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)

FROMt est_events)e2

on e1.hardware_id=e2.hardware_id and e1.access_point=e2.access_point

and e1.datein=e2.datein and e1.datein=e2.LastTimeForThisDevice

WHEREe1.IntervalIdISNULL

This query took 25 sec on 250000 rows and 45 sec on 500000 rows.

4th step. Grab all the Interval End Dates.

If we want to find an interval end dates, there is a need for another query that took 30 sec on 500000 rows.

UPDATEe1

set e1.IntervalDateEnd=e2.IntervalDateEnd

from test_eventse1

join(select id, IntervalDateEnd = LAST_VALUE(datein) OVER (PARTITION BY hardware_id,IntervalId ORDER BY datein RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)

from test_events)e2

one1.id=e2.id

Total set based solution time is about 2 min. Now, the next step would be to identify good indexes that will support the queries and crawler updates that will take care of fresh data, will not slow down the inserts and will not cause page fragmentation.

You may think that since the (simpler to write) second cursor took only(!) 10 min to finish there would be not much reason for all the hassle. However, a set based solution is far more scalable which is extremely important these days where every small table tends to turn into BIG data.

Forcing the set based technology to use iterative functionality is not graceful at all. Think about the hedgehogs, who would never use their tiny paws to descent from a climb. They will typically roll into a ball and roll down.

May all your TSQL solutions roll.

Search This Blog

Maria's journey in the data fields

Grouping events into intervals using hedgehog rolling style

Comments

Post a Comment

Popular posts from this blog

SQL Awesomeness: Finding a QUALIFY query clause in the depths of the database ocean

Look back and realize how far you came

The Greatest Reasons to use or not to use a Centralized Data Access Architecture