-
Notifications
You must be signed in to change notification settings - Fork 340
Python vs. SPL 7 Alignment Grouping & Enumeration Grouping
Grouping operation usually refers to equivalence grouping with the following features:
-
All the members of the original set are in and only in one unique group;
-
No group is empty set;
The grouping with the above features is also known as a complete partition in mathematics.
Given the name of the complete partition, is there any chance that an incomplete partition exists?
The answer is definitely positive. It is possible that both features of complete partition are not satisfied. For example, there may be empty sets in the groups, or some members of the original set may not be in any groups or in more than one group. So this article is focused on comparing the calculation abilities of Python and SPL in incomplete partition grouping.
Sometimes there is a situation that calculates the aggregation information of specified sets after grouping, for example:
To calculate the number of male employees in specified departments ['Administration', 'HR', 'Marketing', 'Sales'] of the company.
The employee information table is as follows:
Python
def align_group(g,l,by): d = pd.DataFrame(l,columns=[by]) m = pd.merge(d,g,on=by,how='left') return m.groupby(by,sort=False) file="D:\data\EMPLOYEE.csv" emp = pd.read_csv(file) emp_f=emp.query('GENDER=="M"') sub_dept = ['Administration', 'HR', 'Marketing', 'Sales'] res = align_group(emp_f,sub_dept,'DEPT').EID.count() print(res) |
User-defined function, alignment grouping Compare to dataframe Alignment operation Group Select Specified sequence Aggregate the aligned groups |
Because there may be more than four departments listed in the example and there may be no male employee in a certain department, that is, the subset of grouping may be empty. And the groupby function only groups by all the departments and there is no empty set after grouping. In this case, the member of the empty set cannot be calculated. The only method is to use the merge function to align the data and then group them.
SPL
A | B | |
1 | D:\data\EMPLOYEE.csv | |
2 | =file(A1).import@tc() | |
3 | =A2.select(GENDER=="M") | /select |
4 | [Administration, HR, Marketing, Sales] | /specified departments |
5 | =A3.align@a(A4,DEPT).new(A4(#):DEPT,~.len():NUM) | /aggregate the aligned groups |
The align@a() function in SPL is used to perform the alignment grouping and the result is in accordance with the group function, returning the sets after grouping. This function is more effective to write without the need of an extra merge function to align the data.
The align() function is more commonly used to sort the data in an unconventional order, for example:
When sorting all the provinces of China, we usually rank Beijing as the first rather than placing Anhui at the first place according to the order of unicode. At this point, we can just list the specified order of provinces and use the align() function to align them, which is written like:
province=[“Beijing”, “Tianjin”, “Heilongjiang”, ...]
A.align(province,LOCATION)
In this way, LOCATION will be sorted in the order of specified provinces, and @a option can be added in order to return the subsets of grouping.
A group of conditions are set in advance, and members of the to-be-grouped sets are taken as parameters to calculate the conditions; the member that satisfies the condition is stored in the subset that corresponds to that condition, and the subsets of the result set are corresponding to the specified conditions one by one. This kind of grouping is named as enumeration grouping, for example:
To calculate the average salary of each generation (Generation X: born between 1965 and 1980; Generation Y: born between 1980 and 1995; Generation Z: born between 1995 and 2010).
Python
#continue to use emp year_seg=["1965-01-01","1980-01-01","1995-01-01","2010-01-01"] generation=["X","Y","Z"] year_seg=pd.to_datetime(year_seg) birth=pd.to_datetime(emp["BIRTHDAY"]) genner=pd.cut(birth,bins=year_seg,right=False,labels=generation) gen_salary=emp.groupby(genner).SALARY.mean() print(gen_salary) |
Time interval Name of generation Marks of segments Group and aggregate |
Python provides the cut() function to segment the data. This function can segment the data either averagely or by the user-defined segment, is able to mark the segments and allows empty sets in the grouped Series subsets, which is very powerful.
SPL
A | B | |
… | /A2 is the employee information | |
7 | [?>=date("1965-01-01")&&?<date("1980-01-01"),?>=date("1980-01-01")&&?<date("1995-01-01"),?>=date("1995-01-01")&&?<date("2010-01-01")] | /grouping conditions |
8 | [X,Y,Z] | /names of groups |
9 | =A2.run(BIRTHDAY=date(BIRTHDAY)) | /convert the data type of date |
10 | =A2.enum(A7,BIRTHDAY).new(A8(#):GENERATION,avg(SALARY):SALARY) | /enumeration grouping |
In SPL, the conditions of different generations are written as strings to concatenate as a sequence, in which “?” indicates the grouping key value to be calculated. The enum function uses the grouping key of each member in the to-be-grouped set to calculate those conditions in turn, and store the member in the corresponding group if the return result is true. The final grouped subsets will also correspond to the conditions used as benchmarks in number and order, so only those subsets need to be processed after grouping. Moreover, in enumeration grouping, there may also be situations where empty sets exist or some members are not in any groups.
This kind of segment grouping is quite common and will be executed even faster with simpler syntax if SPL uses the pseg function to convert it to sequence-number grouping.
A | B | |
… | /A2 is the employee information | |
11 | ["1965-01-01","1980-01-01","1995-01-01","2010-01-01"] | /segment |
12 | =A11.(date(~)) | /convert the data type of date |
13 | =A2.align@a(A8,A8(A12.pseg(BIRTHDAY))).new(A8(#):GENERATION,avg(SALARY):SALARY) | /group and aggregate |
The pseg() function can work out which segment a certain date or number belongs to, and return the sequence number. Since there may be no employee belonging to Generation Z, we can use the align@a() function to align the groups and process the subsets in the same way, which is the same method as the cut function in Python.
Sometimes we may also encounter the situation that the member of a set exists repeatedly in multiple subsets. For example:
To group the employees by age and calculate the number of each group (when grouping conditions overlap, list all the employees satisfying the conditions. The grouping conditions are [year of service<5, 5<=years of service>=10, years of service>=10, years of service>=15]).
Python
#continue to use emp import datetime import numpy as np import math def eval_g(dd:dict,ss:str): return eval(ss,dd) employed_list=['Within five years','Five to ten years','More than ten years','Over fifteen years'] employed_str_list=["(s<5)","(s>=5) & (s<10)","(s>=10)","(s>=15)"] today=datetime.datetime.today() emp['HIREDATE']=pd.to_datetime(emp['HIREDATE']) employed=((today-emp['HIREDATE'])/np.timedelta64(1,'Y')).apply(math.floor) emp['EMPLOYED']=employed dd={'s':emp['EMPLOYED']} group_cond = [] for n in range(len(employed_str_list)): emp_g = emp.groupby(eval_g(dd,employed_str_list[n])) emp_g_index=[index for index in emp_g.size().index] if True not in emp_g_index: sum_emp=0 else: group=emp_g.get_group(True) sum_emp=len(group) group_cond.append([employed_list[n],sum_emp]) group_df=pd.DataFrame(group_cond,columns=['EMPLOYED','NUM']) print(group_df) |
Function, convert strings to expressions Grouping conditions Calculate the entry time Loop through the grouping conditions Group by the grouping conditions Grouping index Employees that do not satisfy the conditions Number of employees as 0 Employees that satisfy the conditions The total number of employees Aggregate the calculation results of each grouping condition |
This example is a bit special because “years of service>= 10” and “years of service >= 15” groups may have duplicate members, which can not be calculated successfully by the ordinary equivalence grouping or alignment grouping operations. Besides, the cut function in Python is not able to implement such a partition at once. Instead, we have to hard-code by ourselves: loop through the grouping conditions -- equivalence grouping by conditions -- process the grouped subsets -- union the results, then finally the operations can be completed after a big detour.
SPL
A | B | |
… | /A2 is the employee information and A10 is the name of generation | |
15 | [?<5,?>=5 && ?<10,?>=10,?>=15] | /grouping conditions |
16 | [Within five years,Five to ten years,More than ten years,Over fifteen years] | /names of groups |
17 | =A2.derive(age(HIREDATE):EMPLOYED) | /calculate the year of service |
18 | =A17.enum@r(A15,EMPLOYED).new(A16(#):EMPLOYED,~.len():NUM) | /enumeration grouping |
The operations in SPL are much easier. The @r option in the enum@r() function allows members repeatedly existing in different subsets, which saves a lot trouble compared with the big detour in Python.
The alignment grouping in Python makes a relatively small detour, which uses the merge function to align the data, resulting in bad efficiency. But thanks to the simple syntax, it is still acceptable for small data volume. However, the enumeration grouping in Python is much more troublesome with worse efficiency and complex syntax.
SPL, on the contrary, is much easier, in which the align function is used for alignment grouping and the enum function for enumeration grouping. And both of them return the same results as the group function, i.e., the grouped subsets, on which the subsequent operations are performed.
In terms of the function form, the functions in SPL for equivalence grouping, alignment grouping and enumeration grouping are in the same form: A.f(...).(y). The only difference lies in “f(...)”: the equivalence grouping is “group”, the alignment grouping is “align”, and the enumeration grouping is “enum”. And the subsequent processing on the grouped subsets of the three operations are exactly the same. However, these three grouping operations are performed completely differently in Python, which needs to be handled separately.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code