-
Notifications
You must be signed in to change notification settings - Fork 339
Further Discussion on Ordered Grouping
When discussing ordered grouping, we only studied the possible impact of the order of the set members to be grouped on the grouping operation. However, since we need to consider the order of the set, and the result set is also a set, does its member order also have business significance?
It does have, but it is not as important as the order of the original set.
There are two aspects to the order of the grouping result set, one is the order of these grouped subsets, and the other is the order of the members of the grouped subsets.
When considering the equivalence grouping operation of ordered sets, we believe that it is most reasonable to maintain the original order of the grouped subsets by default, that is, the order in which each grouped subset member first appears in the original set. The reason for this is that other orders (such as order by grouping key values) can be obtained by sorting the result sets, while the original order is likely to be lost after grouping is done, or at least it will be difficult to obtain it again.
For example, if we want to count the number of repetitions of words in a textbook, this is a simple equivalence grouping operation. The default result set should be in the order in which new words appear in the book, which has business significance. When teaching the book to students, they can preview new words in that order. If this order is not returned after the grouping operation, it will be difficult to obtain. Each word needs to be artificially assigned a sequence number where it appears in the book. When grouping, the minimum value of the sequence number should also be calculated, and then sort by this value, and finally, this value is discarded. The calculation process is cumbersome and inefficient.
SQL based on unordered sets itself does not have a saying about the original order of sets, and of course, it cannot stipulate the order of grouped result sets. Returning result sets to ensure the original order is a meaningless proposition. In practice, databases generally implement grouping using the HASH method. In this case, the order of the result set is often the order of the HASH values, and the order of the HASH values is basically meaningless. When we care about the order, we need to sort again. In order to obtain the sorting basis, it is necessary to add sequence number information to the original set as mentioned in the example and participate in the grouping operation, which is cumbersome and inefficient.
The earlier example would be written as follows:
SELECT WORD,MIN(RN) NO,COUNT(*) WC
FROM (SELECT WORD, ROWNUMBER() RN FROM T )
GROUP BY WORD ORDER BY NO
SPL is designed based on ordered sets, and its grouping operation follows this principle, that is, the result set remains in its original order.
word.groups( ~:word; count(1):wc )
The order of the result set for alignment grouping and enumeration grouping should obviously be consistent with the criterion set. The order of the result set of ordered grouping is obviously the most reasonable according to the order each group is generated.
As we mentioned earlier, using the LEFT JOIN method in SQL can implement the effects of alignment and enumeration grouping, but whether it is the HASH method or sorting method, the result set will lose the order of the criterion set. And the order of the result set for alignment and enumeration grouping is very necessary. To obtain this order through sorting again, it is necessary to maintain a sequence number in the criterion set. This will turn the originally simple single valued member set into a multi field record set. Moreover, when the criterion set needs to insert/delete members, it is very troublesome to continue maintaining the number, as the member numbers after the modified members need to be adjusted. So implementing alignment and enumeration grouping in SQL is a tedious task.
As for the order of members in the grouped subset, it should also default to maintaining the original order, that is, the order of members in the original set. However, whether it is meaningful depends on the subsequent actions to be taken.
SQL does not care about this order at all. SQL will force aggregation after grouping, and only have regular aggregation operations such as SUM/COUNT, the execution result of which has nothing to do with the execution order. The concept of member order of grouped subset has no meaning at all in SQL.
But some further calculations may be related to order, for example, in a stock closing price table sorted by date, we want to calculate the longest consecutive rising days of each stock. This requires calculating the longest consecutive rising days for each grouped subset after grouping by stock. If the members in the grouped subset are still in order of dates, it can be convenient to calculate, otherwise they will have to be sorted again.
SPL supports ordered operations very well, and both the grouped result sets and the grouped subset members will follow these principles to maintain a reasonable order for further calculation. For example, calculating the longest consecutive rising days for each stock can be simply written as:
stocks.group( stock ).new( stock, ~.group@i(price<price[-1]).max(~.len()):MaxRisingDays)
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code