Grouping events into intervals using hedgehog rolling style
- I have a challenging SQL problem for you, – yelled one of my coworkers one morning entering my office and dropping on his way a few pictures, SQL server system views map and a visitors chair. – Much more interesting than all your alerts and twitter conversations! ( “Well”, I thought to myself, “it might beat the alerts but not the twitter conversations” )
The problem presented by him is as follows:
There is a table containing information about a device’s, like phones or tablets, history of connections to hotspots as shown below:
The idea was to bind together the above mentioned events into groups. Each group, or we can call it - an interval, should represent the time that any device was connected to the specific access point. Each time the device has moved to the new access point, a new interval should start. If there were no events for the specific device longer than 30 min, the next event should start a new interval even if the new event was reported on the same access point as before.
Queries will eventually slice the data to see how many users were on the same access point at the specific time, average time devices stayed connected to specific access_point and other similar business questions.
The matter was indeed an attention-grabbing one since the desired result cannot be achieved by using a regular GROUP BY logic.
First of all, here is a test data:
SELECT *
INTO test_events
FROM (VALUES (1,1,10,'20130101 10:01'),
(3,2 , 10 ,'20130101 10:03'),
(4,3 , 10 ,'20130101 10:04'),
(5,1 , 10,'20130101 10:05'),
(6,1 , 11,'20130101 10:09'),
(7,2 , 12,'20130101 10:06'),
(8,3 , 10,'20130101 10:10'),
(9,1 , 11,'20130101 10:40'),
(10,3 , 10,'20130101 10:47'),
(11,3 , 10 ,'20130101 10:52'),
(12,3 , 10 ,'20130101 11:05'),
(13,2 , 10,'20130101 10:53')
)vals(id,hardware_id,access_point,datein);
The most trivial and easy algorithm would be to travel through the table, ordered by devices and the time. For each new event in the resultlset, access point and hardware_id should be analyzed. If the access point is different from the device’s previous event or more than 30 min have passed since the previous event, we will consider it as an interval start and use it’s own id as an Interval id. Keeping the Interval id we will go to the next event. If the access point is the same, use a previous event id as an Interval Id.
This logic is easily implemented using a cursor. I haven’t used the cursors for the last 10 years but I wondered what would be the execution times.
I have added two columns to the table, IntervalId to group the events into the intervals and the IntervalDateEnd to add information for when the interval ends in order to satisfy the above-mentioned queries.
ALTER TABLE test_events ADD IntervalId int;
ALTER TABLE test_events ADD IntervalDateEnd datetime ;
The first cursor was using the aforesaid logic as is and had updated the table on every iteration. On a table with on 250000 rows I have stopped it after 4 hours, when less than half of the rows have got updated.
My next cursor implementation was a bit different. First of all I have created a temporary table. Instead of updating the table on every iteration, I added a new row to the temporary table grasping the event id and Intervalid. When the cursor finished I updated the main table by joining it to the temporary results table. This solution took 1 min ( ~ 35 sec cursor and ~30 sec table update) on table with 250000 rows and 10 min on table with 500000 rows. The test was performed on a fairly slow VM machine so the execution times reflect that.
The execution is quite heavy, so if you are using the cursors please make sure you are using them wisely. Differences between the cursors are marked in yellow.
-----------Better cursor----------
DECLARE@idint,@accessPointint,@hardware_idvarchar(50),@timestampdatetime
DECLARE@parentIdint,@parentAccessPointint,@ParentHardwareIDvarchar(50),@ParentTimeStamp datetime
CREATETABLE#Mapping(Idint, IntervalIdint,IntervalDateEnddatetime);
DECLAREEventLoopCURSORREAD_ONLYSTATIC FOR
SELECTid,access_point,hardware_id,datein
FROMtest_events
ORDERBYhardware_id,datein;
OPENEventLoop
FETCHNEXTFROMEventLoop
INTO@id,@accessPoint,@hardware_id,@timestamp
WHILE@@FETCH_STATUS= 0
BEGIN
IF (@hardware_id=@ParentHardwareID
AND@accessPoint=@parentAccessPoint
ANDDATEDIFF(mi,@ParentTimeStamp,@timestamp)<= 30 )BEGIN
INSERTINTO#Mapping(Id,IntervalId)SELECT @id,@parentId
END
ELSEBEGIN
INSERTINTO#Mapping(Id,IntervalId) SELECT @id,@id
UPDATE#Mapping
SETIntervalDateEnd=@ParentTimeStamp
WHEREid=@parentId
SELECT @parentId=@id
END
SELECT
@parentAccessPoint=@accessPoint,
@ParentHardwareID=@hardware_id,
@ParentTimeStamp=@timestamp;
FETCHNEXTFROMEventLoop
INTO@id,@accessPoint,@hardware_id,@timestamp
END
UPDATE#Mapping
SETIntervalDateEnd=@ParentTimeStamp
WHEREid=@parentId
CLOSEEventLoop;
DEALLOCATEEventLoop;
UPDATEe
SET e.IntervalId= m.IntervalId,
e.IntervalDateEnd = m.IntervalDateEnd
FROM test_eventse
JOIN#Mappingm
ONe.id=m.id
|
-----------BAD cursor----------
DECLARE@idint,@accessPointint,@hardware_idvarchar(50),@timestampdatetime
DECLARE@parentIdint,@parentAccessPointint,@ParentHardwareIDvarchar(50),@ParentTimeStamp datetime
DECLAREEventLoop CURSORREAD_ONLYSTATICFOR
SELECTid,access_point,hardware_id,datein
FROMtest_events
ORDERBYhardware_id,datein;
OPENEventLoop
FETCHNEXTFROMEventLoop
INTO@id,@accessPoint,@hardware_id,@timestamp
WHILE@@FETCH_STATUS= 0
BEGIN
IF (@hardware_id=@ParentHardwareID
AND@accessPoint=@parentAccessPoint
ANDDATEDIFF(mi,@ParentTimeStamp,@timestamp)<= 30 )BEGIN
UPDATEtest_events
SET IntervalId=@parentId
WHEREid=@id
END
ELSEBEGIN
UPDATEtest_events
SET IntervalId=@id
WHEREid=@id
UPDATEtest_events
SETIntervalDateEnd=@ParentTimeStamp
WHEREid=@parentId
SELECT @parentId=@id
END
SELECT
@parentAccessPoint=@accessPoint,
@ParentHardwareID=@hardware_id,
@ParentTimeStamp=@timestamp;
FETCHNEXTFROMEventLoop
INTO@id,@accessPoint,@hardware_id,@timestamp
END
UPDATEtest_events
SETIntervalDateEnd=@ParentTimeStamp
WHEREid=@parentId
CLOSEEventLoop;
DEALLOCATEEventLoop;
|
I’m habitually trying to avoid cursors for the reason that in most cases they will require more memory and produce much more locks. Therefore I needed to find a true set based solution which, as I have believed, additionally would have much better execution times.
That’s where I came up with 4 steps logic.
1st step. Find all interval starters.
By thinking about the data you can see that each event can belong to only one of the two possible statuses: Start the interval or continue the interval. We can find this by comparing each event to the previous event that was reported for the same device. I can achieve this by using LAG() window function to bring the previous event. Here you can find more information about the analytic functions.
Here is a query:
UPDATEr1
SET r1.IntervalId=r2.IntervalId
FROMtest_eventsr1
JOIN(SELECTid,IntervalId =
CASEWHEN
lag(access_point,1)OVER (PARTITIONBYhardware_idORDERBYdatein)=access_point
and
DATEDIFF(mi,
lag(datein,1)OVER (PARTITIONBYhardware_idORDERBYdatein),
datein
)< 30
THEN
NULL
ELSEid
END
FROMtest_events
WHEREIntervalIdISNULL)r2
onr1.id=r2.id
WHEREr1.IntervalIdISNULL
Unfortunately I can use windowed functions only in SELECT or ORDER BY clause, otherwise I could have used it in the SET clause of the update and the query would have been even lighter.
This step took 7 sec on 250000 rows and 10 sec on the 500000 rows.
Now I need to link the rest of the events to their intervals. In order to do that I can use this simple query that for each event that is continuing the interval find the closest event that is an interval starter using sequential id column which is PK of the table.
UPDATEe1
SETIntervalId=(SELECTtop 1 IntervalId
FROMtest_eventse2
WHEREe2.hardware_id=e1.hardware_idande2.access_point=e1.access_pointande2.id<e1.id
andIntervalIdisnotnull
ORDERBYe2.dateindesc)
FROMtest_eventse1
WHEREIntervalIdISNULL
Although at the first look this query looks better than the cursor, in fact, is not entirely a set based query. For each event in the table, the subquery searches for the interval starter by going back to the table. This step took 2 minutes on table with 250000 rows and 6 min on table with 500000 rows.
Which means I need another solution.
If you take a look at the example below, you can clearly see that the intervals belong to one of the two categories
- Closed intervals: there are other appearances of the same hardware_id on the different access_point ( marked in blue )
- Open intervals : there are none other appearances of the same hardware_id on the different access_point ( marked in green)
2nd step. Link the events within the closed intervals.
For each start of the interval find the next start interval’s starting time. We can easily find the next value by using the LEAD() window function.
Together with the current timestamp they define the interval boundaries which will be between the CurrentTime and LESS than NextIntervalStartTime.
This step took 15 sec on 250000 table and 25 sec on 500000 table.
UPDATEe1
SETe1.IntervalId=e2.IntervalId
FROMtest_eventse1
JOIN(select*,NextEventStart = LEAD(datein)OVER (PARTITIONBYhardware_idORDERBYdatein)
FROMtest_events
WHEREIntervalIdISNOTNULL)e2
ONe1.hardware_id=e2.hardware_idande1.access_point=e2.access_point
ande1.datein>=e2.dateinande1.datein<e2.NextEventStart
WHEREe1.IntervalIdISNULL
For each start of the interval we will need to find the last timestamp for the same device on the same access point. Analytical function LAST_VALUE() will help us to achieve that. Take into the consideration that the default range of this function includes only preceding events so you need to explicitly specify RANGEBETWEENCURRENTROWANDUNBOUNDEDFOLLOWING option.
Together with the current timestamp they define the interval boundaries which lay between the CurrentTime and the LastTimeForThisDevice
UPDATE e1
SET e1.IntervalId=e2.IntervalId
FROM test_eventse1
JOIN(SELECT *, LastTimeForThisDevice = LAST_VALUE(datein) OVER (PARTITION BY hardware_id,access_point ORDER BY datein RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROMt est_events)e2
on e1.hardware_id=e2.hardware_id and e1.access_point=e2.access_point
and e1.datein=e2.datein and e1.datein=e2.LastTimeForThisDevice
WHEREe1.IntervalIdISNULL
This query took 25 sec on 250000 rows and 45 sec on 500000 rows.
If we want to find an interval end dates, there is a need for another query that took 30 sec on 500000 rows.
UPDATEe1
set e1.IntervalDateEnd=e2.IntervalDateEnd
from test_eventse1
join(select id, IntervalDateEnd = LAST_VALUE(datein) OVER (PARTITION BY hardware_id,IntervalId ORDER BY datein RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
from test_events)e2
one1.id=e2.id
Total set based solution time is about 2 min. Now, the next step would be to identify good indexes that will support the queries and crawler updates that will take care of fresh data, will not slow down the inserts and will not cause page fragmentation.
You may think that since the (simpler to write) second cursor took only(!) 10 min to finish there would be not much reason for all the hassle. However, a set based solution is far more scalable which is extremely important these days where every small table tends to turn into BIG data.
Forcing the set based technology to use iterative functionality is not graceful at all. Think about the hedgehogs, who would never use their tiny paws to descent from a climb. They will typically roll into a ball and roll down.
Comments
Post a Comment