-
Notifications
You must be signed in to change notification settings - Fork 339
Python vs. SPL 8 Ordered Grouping
We are naturally interested in order-related operations, and grouping operations may also involve the order. This article will compare the calculation abilities of Python and SPL in ordered grouping.
It is possible that the position information of members participates in the grouping and the grouping keys are related to positions (or sequence numbers) of the members. For example:
Here are the data:
time a b
0 0.5 -2.0
1 0.5 -2.0
2 0.1 -1.0
3 0.1 -1.0
4 0.1 -1.0
5 0.5 -1.0
6 0.5 -1.0
7 0.5 -3.0
8 0.5 -1.0
It is expected that every three rows are grouped together and the mode number is taken as the result of that group. And the expected result is as follows:
time a b
2 0.5 -2.0
5 0.1 -1.0
8 0.5 -1.0
Python
import pandas as pd file1="D:/data/position.txt" data1=pd.read_csv(file1,sep="\t") pos_seq=[i//3 for i in range(len(data1))] res1=data1.groupby(pos_seq).agg(lambda x:x.mode().iloc[-1]) print(res1) |
Derive a column according to positions, and group by the integer quotient divided by 3 Group and aggregate according to the derived column |
In order to group every three rows, we can divide the sequence number by 3 and take the integer quotient to get the grouping key, by which the grouping operation is performed. To get the mode number of each group, we can use the agg+lambda expression, which is actually an aggregation operation.
SPL
A | B | |
1 | D:\data\position.txt | |
2 | =file(A1).import@t() | |
3 | =A2.group((#-1)\3).new(~.max(time):time,~.mode(a):a,~.mode(b):b) | /group and aggregate |
In SPL, we can get the grouping key with “#” rather than calculating an extra column of grouping keys separately. Here the grouping key is an integer and can be calculated using the group@n() function in SPL. This function is an efficient grouping method for natural numbers, which can directly obtain the correct grouped subsets with sequence numbers rather than performing the comparison operation, so it calculates much faster. Here we can change the code slightly as:
A3=A2.group@n((#+2)\3)...
Sometimes the data are ordered originally, we want to group only the adjacent records with the same value together during grouping, for example:
Here are the data:
duration location user
0 10 house A
1 5 house A
2 5 gym A
3 4 gym B
4 10 shop B
5 4 gym B
6 6 gym B
It is expected to get the sum of “duration” when “user” and “location” are the same consecutively and to get the sum again when the values of “location” are changed. And the expected result is:
duration location user
15 house A
5 gym A
4 gym B
10 shop B
10 gym B
Python
file2="D:/data/value.txt" data2=pd.read_csv(file2,sep="\t") value_seq=(data2.location!=data2.location.shift()).cumsum() res2=data2.groupby(['user','location',value_seq], as_index=False, sort=False)['duration'].sum() print(res2) |
The derived column, the values of the column are the same if the adjacent “location” values are the same; otherwise +1 Group and aggregate according to [user, location, derived column] |
Python cannot group the adjacent members with the same value, but has to derive a column and take it as part grouping keys to complete the grouping operation. However, it takes great effort for us to come up with a relatively simple derived column.
SPL
A | B | |
… | … | |
4 | D:\data\value.txt | |
5 | =file(A5).import@t() | |
6 | =A6.groups@o(user,location;sum(duration)) | /group with changed values |
The groups@o() function in SPL scans the whole sequence in turn and adds the member to the current grouped subset if the grouping key value is the same as the grouping key of the last member; or generates a new grouped subset and adds the current member to it if the grouping values are not the same. Then after scanning, a batch of grouped subsets are obtained, thus the grouping operation is finally completed. This method in SPL does not need us to derive a new column and the syntax of it is also quite easy with only an additional @o option to the groups function. Such grouping operation only compares the adjacent values, so it executes faster, and this is exactly what the groupby function in Python cannot achieve.
There is another situation for the originally ordered data, that is, grouping the data when the conditions are changed. For example:
Here are the data:
ID code
333_c_132 x
333_c_132 n06
333_c_132 n36
333_c_132 n60
333_c_132 n72
333_c_132 n84
333_c_132 n96
333_c_132 n108
333_c_132 n120
999_c_133 x
999_c_133 n06
999_c_133 n12
999_c_133 n24
998_c_134 x
998_c_134 n06
998_c_134 n12
998_c_134 n18
998_c_134 n36
997_c_135 x
997_c_135 n06
997_c_135 n12
997_c_135 n24
997_c_135 n36
996_c_136 x
996_c_136 n06
996_c_136 n12
996_c_136 n18
996_c_136 n24
996_c_136 n36
995_c_137 x
It is expected to retrieve one row out of every two “x”es in code column randomly, and the expected result is:
333_c_132 n06
999_c_133 n12
998_c_134 n18
997_c_135 n36
996_c_136 n18
Python
file3="D:/data/condition.txt" data3=pd.read_csv(file3,sep="\t") cond_seq=data3.code.eq('x').cumsum() res3=data3[data3.code.ne('x')].groupby(cond_seq).apply(lambda x:x.sample(1)).reset_index(level=0,drop=True) print(res3) |
The derived column Group according to the derived column and then sample |
The logic of this operation in Python stays the same with the previous two examples, i.e., deriving a column of grouping keys on our own arduously and grouping according to that column. Also, Python provides the sample() function for sampling the data, making the sampling more effective.
SPL
A | B | |
… | … | |
9 | D:\data\condition.txt | |
10 | =file(A9).import@t() | |
11 | =A10.group@i(code=="x").conj((l=~.len(),if(l<2,,~.m(2:)(rand(l-1)+1)))) | /group with changed conditions |
The grouping key of the group@i() function in SPL is an expression, which generates a new grouped subset if the return result is true, that is, a new group will be generated when a certain condition is satisfied. This method also does not need to think of a derived column arduously and the calculation efficiency is excellent. But the sampling function is not provided in SPL, we have to write the sampling action manually, which can be easily completed using the rand() function in SPL.
In addition, @o and @i options are valid for both groups and group functions with the same effects.
As for processing logs, these three order-related grouping operations introduced in the article all work very well.
- The log with fixed number of lines
The form of the log is:
IP, TIME, GET, URL, and BROWER are in the first line;
MODULE is in the seconds line;
USERID, UNAME, and LOCATION are in the third line.
In this case, use group((#-1)\3) and then process the log of each group.
- The log with unfixed number of lines but each line has a mark
The form of such kind of log is as follows:
The number of lines of each user is different but each line is started with the ID of users.
In this case, use group@o(~.split(“\t”)(1)) and then process the log of each group.
- The log with unfixed number of lines but there is a starting mark
The form of the log is:
The number of lines of each user is different but each user has a staring mark “userid”.
In this case, use group@i(~.split(“:”)(1)==”userid”), and then process the log of each group.
Python is not able to use the ordered data to improve efficiency because it adopts the same method for ordered grouping and unordered grouping. It is always necessary to derive a column and make it satisfy the grouping conditions, wasting a lot of useful information.
While SPL can make full use of the ordered data to make the grouping operation even faster. As long as the grouping conditions are well written, we don’t have to derive a column by ourselves, and @n, @o or @i options can be added to change the form of functions, which is very easy and practical.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code