Skip to content

User Behavior Analysis in Practice 5:Using Dimension Table

esProcSPL edited this page Oct 17, 2024 · 1 revision

Target task:

We have a user events table T. Below is its structure and part of its data:

Time UserID EventTypeID ProductID Quantity
2022/6/1 10:20 1072755 3 100001
2022/6/1 12:12 1078030 2 100002
2022/6/1 12:36 1005093 5 100003 3
2022/6/1 13:21 1048655 1
2022/6/1 14:46 1037824 6
2022/6/1 15:19 1049626 4 100004 4
2022/6/1 16:00 1009296 5 100005 6
2022/6/1 16:39 1070713 2 100006
2022/6/1 17:40 1090884 3 100007

Fields in table T:

Field name Data type Description
Time Datetime Time stamp of an event, accurate to milliseconds
UserID String User ID
EventTypeID Integer Event type ID
ProductID String Product ID
Quantity Numeric Quantity

Dimension table EventType:

EventTypeID EventType
1 Login
2 Browse
3 Search
4 AddtoCart
5 Submit
6 Logout

Dimension table Product:

ProductID ProductName Unit Price ProductTypeID
100001 Apple Pound 5.5 1
100002 Tissue Packs 16 2
100003 Beef Pound 35 3
100004 Wine Bottles 120 4
100005 Pork Pound 25 3
100006 Bread Packs 10 5
100007 Juice Bottles 6 4

Fields in dimension table Product:

Field name Data type Description
ProductID String Product ID
ProductName String Product name
Unit String Sales unit
Price Numeric Unit price
ProductTypeID Integer Product type ID

Dimension table ProductType:

ProductTypeID ProductType
1 Fruits
2 Home&Personalcare
3 Meat
4 Beverage
5 Bakery

Relationship between tables:

Computing task:

Calculate the total sales amount, number of orders, search frequency and the number of distinct users performing search and ordering under each type of product within a specified time period.

Techniques involved:

1 Join tables through dimension table rather than generating a wide table. This can reduce the volume of data to be stored and increase retrieval speed.

2 Use a global variable to pre-load dimension tables and establish associations for later reuse.

Sample code

1. According to our previous practices, we dump data from user events table T and store it in composite table T.ctx according to the order of Time field; and then dump data in those dimension tables as bin files EventType.btx, Product.btx and ProductType.btx.

2. Import each dimension table into memory, set primary key for them, open the composite table cursor, establish associations with dimension tables, and perform grouping &aggregation.

Suppose we need to summarize data that falls in between 2022-03-15 and 2022-06-16:

A
1 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
2 =file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5
3 >EventType=file("EventType.btx").import@b().keys@i(EventTypeID)
4 >ProductType=file("ProductType.btx").import@b().keys@i(ProductTypeID)
5 >Product=file("Product.btx").import@b().keys@i(ProductID)
6 >Product=Product.switch(ProductTypeID, ProductType:ProductTypeID)
7 =A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:EventTypeID)
8 =A7.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

A2 Retrieve records within the specified time range and where the event type is “submit order” and “search” from the composite table file and create a cursor based on them.

A3 Load dimension data from bin file EventType.btx, and set primary key and create index on it.

A4 Load dimension data from bin file ProdcutType.btx and set ProductTypeID as its primary key.

A5 Load dimension data from bin file Prodcut.btx and set ProductID as the primary key.

A6 Establish association between Product and* ProductType*.

A7 Associate A2’s cursor with in-memory dimension tables Product and EventType respectively.

A8 Perform grouping calculation to generate small result sets from A7’s cursor, the joining result.

When using switch() function to join with a dimension table, you need to set primary key for the dimension table in advance and perform join operation through the key. The join amounts to adding references of dimension table records in the associated field of the original table. Then you can reference any field of the dimension table using the syntax "field of the original table.field of the dimension table".

With hierarchical dimension tables, like table T –* Product* table – ProductType table in this instance, you can use the dot operator (.) to reference levels of dimension tables one by one, such as " ProductID.ProductTypeID.ProductType ". It means that table T’s ProductID field references ProductTypeID field in its dimension table and then ProductType field in the dimension table’s dimension table.

3. Dimension tables are often used repeatedly. As they are generally not large, we can load them into the memory, establish associations and store each of them as a global variable. Then there is no need to load dimension tables and establish associations again for summarizations and just use the global variable directly. This way the above code can be regarded as including two parts. The first part is to load dimension tables as global variables at the startup of the server. The second part is the code for performing summarization.

Part one (execute once at the startup of the server):

A
1 =file("EventType.btx").import@b().keys@i(EventTypeID)
2 =file("ProductType.btx").import@b().keys@i(ProductTypeID)
3 =file("Product.btx").import@b().keys@i(ProductID)
4 >env(EventType,A1),env(ProductType,A2),env(Product,A3)
5 >Product.switch(ProductTypeID,ProductType:ProductTypeID)

A4 Store each of the in-memory dimension tables as global variables for reference by the code for performing summarization.

Part two (summarization):

A
1 >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd")
2 =file("T.ctx").open().cursor(UserID,EventTypeID,ProductID,Quantity;Time>=start && Time<=end && (EventTypeID==5
3 =A2.switch(ProductID,Product:ProductID;EventTypeID,EventType:EventTypeID)
4 =A3.groups(EventTypeID,ProductID.ProductTypeID;EventTypeID.EventType,ProductID.ProductTypeID.ProductType,sum(Quantity):Quantity,count(1):Num, icount(UserID):iNum)

Execution result:

EventTypeID ProductTypeID EventType ProductType Quantity Num iNum
3 1 Search Fruits 0 499586 48735
3 2 Search Home&Personalcare 0 508897 49872
3 3 Search Meat 0 403213 39923
3 4 Search Beverage 0 324567 29045
3 5 Search Bakery 0 335498 30234
5 1 Submit Fruits 206938 103469 13523
5 2 Submit Home&Personalcare 463188 154396 14656
5 3 Submit Meat 94378 93366 8754
5 4 Submit Beverage 217504 54376 5233
5 5 Submit Bakery 339480 67896 5844
Clone this wiki locally