-
Notifications
You must be signed in to change notification settings - Fork 339
Ordered grouping
We know that SQL uses the concept of unordered sets in mathematics, so SQL grouping does not pay attention to the order of members in the set to be grouped. The equivalence grouping and none equivalence grouping we discussed earlier have not paid attention to this issue, and the grouping rules are based on the member values themselves. But if an ordered set is used as the object of operation, the influence of member order on grouping must be considered, and there are many applications of ordered grouping in real business.
Firstly, there is the sequence number grouping.
A simple example: Divide a class of students into three equal groups (assuming the number of students can be divided by three). According to our definition of grouping, this can also be seen as a type of grouping, but this operation cannot be directly written in SQL because the grouping criteria are not related to member values.
If we use # symbol in Lambda syntax in SPL, this problem can be easily solved.
A.group( (#-1)*3\A.len() ) // Divide into first 1/3, middle 1/3, and last 1/3 by sequence number
A.group( (#-1)%3 ) // Get one from every three by sequence number
This is also an operation that focuses on grouped subsets rather than aggregate values.
Implementing this operation in SQL can be quite cumbersome. You need to first create a sequence number using a subquery, and then perform similar actions.
In this example, we haven't really paid attention to the order of members yet, we just explained the role of sequence number. When the set members to be grouped are in other orders, we can also obtain usable results.
Let's take a look at more examples.
When processing text logs, the basic unit of some logs is not 1 line, but may be 3 lines, meaning that every event is always written in 3 lines of text, which is not a rare situation. When dealing with this type of log, it is necessary to split every 3 lines of text into a grouped subset, and then conduct detailed analysis and processing for each subset. In this case, the correctness of grouping depends on the order of the members (lines in the text log) in the set to be grouped.
After the entrance examination, students are made a snake shaped division into different classes based on their grade ranking, namely, No. 1, 4, 5, 8 are in one class, while No. 2, 3, 6, 7 are in another class, this can ensure that the average ranking of the two classes is the same. This grouping operation can be easily written using sequence numbers in SPL:
A.sort@z(score).group(#%4<2)
The grouping key value used here is no longer an ordinary numerical value, but a boolean variable, which is equivalent to dividing into two groups based on "true" and "false". The true value corresponds to the first class, and the false value corresponds to the other class. Essentially, this is still an equivalence grouping, only the grouping key values used can be any generic.
Obviously, the correctness of this grouping also heavily depends on the order of set members to be grouped.
By the way, this is also an example of only caring about grouped subsets and not about aggregate values.
In most cases, sequence number grouping only uses sequence number to calculate the grouping key value, and then it becomes a regular equivalence grouping. Is there a situation where it is not possible to simply convert it to equivalence grouping?
Value-change grouping is one.
There is a set of baby birth records sorted by birth order. We now want to know how many batches of continuous same-gender babies exceeding 5 are there?
A simple idea is to group first, and to put babies of the same gender who were born consecutively into a group, then calculate the count for each group, and then count how many are greater than 5. The last two steps are very simple, and the problem is how to group them?
Grouping babies directly by gender is certainly not correct, and it is necessary to consider the order. Scan the records in sequence, and generate a new group when the baby's gender changes. This type of grouping obviously cannot be directly implemented using equivalence grouping.
SPL provides an ordered grouping method that makes it easy to implement this grouping operation, which generates a new group when the grouping key value changes.
A.group@o(gender).count(~.len()>5) // @o option indicates to generate a new group when the grouping value changes.
Using SQL is a lot of more troublesome. You need to first create an intermediate flag to generate the group number, which is roughly like this:
SELECT COUNT(*) FROM
(SELECT ChangeNumber FROM
(SELECT SUM(ChangeFlag) OVER (ORDER BY birthday) ChangeNumber FROM
(SELECT CASE WHEN gender=LAG(gender) OVER ( ORDER BY birthday) THEN 0 ELSE 1 END ChangeFlag FROM A))
GROUP ChangeNumber HAVING COUNT(*)>5)
It's not easy to even understand such SQL. And it is necessary to use the birthday field to form order, and the aforementioned ordered grouping method does not require this information when the original data is ordered.
This situation can also occur in text analysis. The log for each user event may have multiple lines and the number of lines is uncertain, but when writing the log, the user ID is written at the beginning of each line. This way, we can perform an ordered grouping by this user ID, and when it changes, it indicates that it is another user's event.
Even for ordinary equivalence grouping, if it is known in advance that the original set has ordered grouping key values, it can also be implemented using this scheme, which will achieve higher performance, much faster than the commonly used HASH grouping scheme in databases, and is particularly suitable for big data traversal.
There is also the condition-change grouping.
We often use this as an example: What is the maximum consecutive rising days of a stock?
Of course, this problem can be directly solved through traversal, but we are now using a grouping approach to handle it. At least in the SQL system, this is the only solution (strictly speaking, this is the simplest and most feasible solution found so far).
Sort the closing price of a stock by date, and then divide the consecutive rising dates into the same group, so that you only need to consider which group has the most members. More specifically, when the stock rises in a day, it is grouped together with the previous day, and when the stock falls in a day, generate a new group.
To implement this idea using SQL, it is also necessary to use intermediate flags and variables to generate group numbers:
SELECT MAX(ContinuousDays) FROM
(SELECT COUNT(*) ContinuousDays FROM
(SELECT SUM(RisingFlag) OVER (ORDER BY TradingDate ) NoRisingDays FROM
(SELECT TradingDate,
CASE WHEN ClosingPrice>LAG(ClosingPrice) OVER (ORDER BY TradingDate THEN 0 ELSE 1 END) RisingFlag
FROM A))
GROUP BY NoRisingDays)
SPL has a specialized ordered grouping method and corresponding Lambda syntax, making this operation very simple:
A.sort(TradingDate).group@i(ClosingPrice<ClosingPrice[-1]).max(~.len()) // @i option indicates to generate a new group when the criteria satisfies
Although the implementation approach is exactly the same as SQL, writing it out is a step-by-step process rather than a multi-layered nested statement, making it much easier to write and understand.
Similarly, this situation can also occur in text analysis. In logs with an uncertain number of lines, sometimes a flag string is written at the beginning of an event. When this flag string is scanned, generate a new group. The criteria for ordered grouping can be set to be that the current scanned line is the same as the specified text, ensuring that the log information for the same event is in the same group.
In theory, the latter two ordered grouping cases can also be transformed to equivalence grouping (which is necessary when using SQL, which can also demonstrate the completeness of the SQL operation system from another perspective), but it is indeed quite troublesome, so we generally do not treat them as equivalence grouping any more.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code