User Behavior Analysis in Practice 1：Conventional Grouping and Aggregation

Target task

We have a user events table T. Below is its structure and part of its data:

Time	UserID	EventType
2022/6/1 10:20	1072755	Search
2022/6/1 12:12	1078030	Browse
2022/6/1 12:36	1005093	Submit
2022/6/1 13:21	1048655	Login
2022/6/1 14:46	1037824	Logout
2022/6/1 15:19	1049626	AddtoCart
2022/6/1 16:00	1009296	Submit
2022/6/1 16:39	1070713	Browse
2022/6/1 17:40	1090884	Search

Fields in table T:

Field name	Data type	Description
Time	Datetime	Time stamp of an event, accurate to milliseconds
UserID	Integer	User ID
EventType	String	Types of events

Computing task:

Find the number of events under each type and that of distinct users who perform that type of event in the specified time period.

Techniques involved:

Use binary file storage instead of database storage.
Use parallel processing during traversal.

Sample code

Dump data from database and store it in a binary file

Stocked data: the data is retrieved from the database and written to a bin file:

	A
1	=connect("demo").cursor@x("select * from T")
2	=file("T.btx").export@b(A1)

A1 Connect to the database, retrieve data from table T and generate a cursor. @x option enables to automatically close database connection after data retrieval finishes.

A2 Export A1’s data to bin file T.btx. @b option enables writing data to a binary file.

Newly-increased data: the newly-increased data can be obtained using a filtering condition in SQL when there is any that needs to be appended to an existing bin file. @a option enables appending data to a bin file.

The newly-increased data can be identified through time stamp. Each day after 0 o’clock we append the newly-generated data in the past day to a bin file:

	A
1	=connect("demo").cursor@x("select * from T where Time>=? && Time<?",date(now()-1), date(now()))
2	=file("T.btx").export@ba(A1)

A1 Get data generated in the previous day through filtering condition and store it in a cursor.

A2 Fetch A1’s data from the cursor and append it to bin file T.btx. @a enables data appending; without it the existing bin file will be overwritten.

Perform grouping and aggregation on a bin file

Suppose we need to summarize data that falls in between 2022-03-15 and 2022-06-16:

	A
1	=file("T.btx").cursor@mb()
2	>start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
3	=A1.select(Time<=end && Time>=start).groups(EventType; count(1):Num, icount(UserID):iNum)

A1 Generate cursor for bin file "T.btx". @m enables parallel processing through a multicursor; set the default number of parallel threads in esProc configuration file raqsoftconfig.xml, or just write f.cursor@m(n) where n represents the number of parallel threads. It is recommended that n should be less than the number of computer’s CPU cores, otherwise speed will be lower.

A2 Generate two variables – start and end – to filter data by the time stamp. The variables will be passed in through parameters in real-life computations.

A3 Perform filtering and grouping & aggregation on A1’s cursor. The grouping field is EventType; count(1) performs a simple count while icount(UserID) finds the number of unique UserIDs. Note that no matter how many operations are performed on a cursor, like multiple rounds of filtering, sorting or grouping, they will be executed together at one time – data is retrieved once through one traversal – to get the final result.

Execution result:

EventType	Num	iNum
AddtoCart	1845674	175476
Browse	3578901	348791
Login	4033000	393400
Logout	4033000	393400
Search	2947931	257539
Submit	867345	83375

SPL Resource： SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

User Behavior Analysis in Practice 1：Conventional Grouping and Aggregation

Target task

Techniques involved:

Sample code

Clone this wiki locally