-
Notifications
You must be signed in to change notification settings - Fork 339
User Behavior Analysis in Practice 5:Using Dimension Table
We have a user events table T. Below is its structure and part of its data:
Time | UserID | EventTypeID | ProductID | Quantity |
---|---|---|---|---|
2022/6/1 10:20 | 1072755 | 3 | 100001 | |
2022/6/1 12:12 | 1078030 | 2 | 100002 | |
2022/6/1 12:36 | 1005093 | 5 | 100003 | 3 |
2022/6/1 13:21 | 1048655 | 1 | ||
2022/6/1 14:46 | 1037824 | 6 | ||
2022/6/1 15:19 | 1049626 | 4 | 100004 | 4 |
2022/6/1 16:00 | 1009296 | 5 | 100005 | 6 |
2022/6/1 16:39 | 1070713 | 2 | 100006 | |
2022/6/1 17:40 | 1090884 | 3 | 100007 |
Fields in table T:
Field name | Data type | Description |
---|---|---|
Time | Datetime | Time stamp of an event, accurate to milliseconds |
UserID | String | User ID |
EventTypeID | Integer | Event type ID |
ProductID | String | Product ID |
Quantity | Numeric | Quantity |
Dimension table EventType:
EventTypeID | EventType |
---|---|
1 | Login |
2 | Browse |
3 | Search |
4 | AddtoCart |
5 | Submit |
6 | Logout |
Dimension table Product:
ProductID | ProductName | Unit | Price | ProductTypeID |
---|---|---|---|---|
100001 | Apple | Pound | 5.5 | 1 |
100002 | Tissue | Packs | 16 | 2 |
100003 | Beef | Pound | 35 | 3 |
100004 | Wine | Bottles | 120 | 4 |
100005 | Pork | Pound | 25 | 3 |
100006 | Bread | Packs | 10 | 5 |
100007 | Juice | Bottles | 6 | 4 |
… | … | … | … | … |
Fields in dimension table Product:
Field name | Data type | Description |
---|---|---|
ProductID | String | Product ID |
ProductName | String | Product name |
Unit | String | Sales unit |
Price | Numeric | Unit price |
ProductTypeID | Integer | Product type ID |
Dimension table ProductType:
ProductTypeID | ProductType |
---|---|
1 | Fruits |
2 | Home&Personalcare |
3 | Meat |
4 | Beverage |
5 | Bakery |
… | … |
Relationship between tables:
Computing task:
Calculate the total sales amount, number of orders, search frequency and the number of distinct users performing search and ordering under each type of product within a specified time period.
1 Join tables through dimension table rather than generating a wide table. This can reduce the volume of data to be stored and increase retrieval speed.
2 Use a global variable to pre-load dimension tables and establish associations for later reuse.
1. According to our previous practices, we dump data from user events table T and store it in composite table T.ctx according to the order of Time field; and then dump data in those dimension tables as bin files EventType.btx, Product.btx and ProductType.btx.
2. Import each dimension table into memory, set primary key for them, open the composite table cursor, establish associations with dimension tables, and perform grouping &aggregation.
Suppose we need to summarize data that falls in between 2022-03-15 and 2022-06-16:
A | |
---|---|
1 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
2 | =file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 |
3 | >EventType=file("EventType.btx").import@b().keys@i(EventTypeID) |
4 | >ProductType=file("ProductType.btx").import@b().keys@i(ProductTypeID) |
5 | >Product=file("Product.btx").import@b().keys@i(ProductID) |
6 | >Product=Product.switch(ProductTypeID, ProductType:ProductTypeID) |
7 | =A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:EventTypeID) |
8 | =A7.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum) |
A2 Retrieve records within the specified time range and where the event type is “submit order” and “search” from the composite table file and create a cursor based on them.
A3 Load dimension data from bin file EventType.btx, and set primary key and create index on it.
A4 Load dimension data from bin file ProdcutType.btx and set ProductTypeID as its primary key.
A5 Load dimension data from bin file Prodcut.btx and set ProductID as the primary key.
A6 Establish association between Product and* ProductType*.
A7 Associate A2’s cursor with in-memory dimension tables Product and EventType respectively.
A8 Perform grouping calculation to generate small result sets from A7’s cursor, the joining result.
When using switch() function to join with a dimension table, you need to set primary key for the dimension table in advance and perform join operation through the key. The join amounts to adding references of dimension table records in the associated field of the original table. Then you can reference any field of the dimension table using the syntax "field of the original table.field of the dimension table".
With hierarchical dimension tables, like table T –* Product* table – ProductType table in this instance, you can use the dot operator (.) to reference levels of dimension tables one by one, such as " ProductID.ProductTypeID.ProductType ". It means that table T’s ProductID field references ProductTypeID field in its dimension table and then ProductType field in the dimension table’s dimension table.
3. Dimension tables are often used repeatedly. As they are generally not large, we can load them into the memory, establish associations and store each of them as a global variable. Then there is no need to load dimension tables and establish associations again for summarizations and just use the global variable directly. This way the above code can be regarded as including two parts. The first part is to load dimension tables as global variables at the startup of the server. The second part is the code for performing summarization.
Part one (execute once at the startup of the server):
A | |
---|---|
1 | =file("EventType.btx").import@b().keys@i(EventTypeID) |
2 | =file("ProductType.btx").import@b().keys@i(ProductTypeID) |
3 | =file("Product.btx").import@b().keys@i(ProductID) |
4 | >env(EventType,A1),env(ProductType,A2),env(Product,A3) |
5 | >Product.switch(ProductTypeID,ProductType:ProductTypeID) |
A4 Store each of the in-memory dimension tables as global variables for reference by the code for performing summarization.
Part two (summarization):
A | |
---|---|
1 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
2 | =file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5 |
3 | =A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:EventTypeID) |
4 | =A3.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum) |
Execution result:
EventTypeID | ProductTypeID | EventType | ProductType | Quantity | Num | iNum |
---|---|---|---|---|---|---|
3 | 1 | Search | Fruits | 0 | 499586 | 48735 |
3 | 2 | Search | Home&Personalcare | 0 | 508897 | 49872 |
3 | 3 | Search | Meat | 0 | 403213 | 39923 |
3 | 4 | Search | Beverage | 0 | 324567 | 29045 |
3 | 5 | Search | Bakery | 0 | 335498 | 30234 |
… | … | … | … | … | … | … |
5 | 1 | Submit | Fruits | 206938 | 103469 | 13523 |
5 | 2 | Submit | Home&Personalcare | 463188 | 154396 | 14656 |
5 | 3 | Submit | Meat | 94378 | 93366 | 8754 |
5 | 4 | Submit | Beverage | 217504 | 54376 | 5233 |
5 | 5 | Submit | Bakery | 339480 | 67896 | 5844 |
… | … | … | … | … | … | … |
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code