SPL: TopN and TopN in group
A TopN query looks up the first N / last N members of the data. It can return the values themselves, the full records in which those values occur, or the row numbers of those records. A TopN query can also be applied within groups to find the first N / last N members of each group.
We divide taking the first N / last N members into three types of requirement and describe each in detail: taking the values, taking the row numbers, and taking the records.
Take NASDAQ index data as an example. Some of the data are as follows:
Date | Open | Close | Amount |
---|---|---|---|
2019/01/02 | 6506.910156 | 6665.939941 | 2261800000 |
2019/01/03 | 6584.77002 | 6463.5 | 2607290000 |
2019/01/04 | 6567.140137 | 6738.859863 | 2579550000 |
2019/01/07 | 6757.529785 | 6823.470215 | 2507550000 |
2019/01/08 | 6893.439941 | 6897.0 | 2380290000 |
… | … | … | … |
[e.g. 1] Find the three highest trading volumes of the NASDAQ index in 2019.
The SPL script looks like this:
A | |
---|---|
1 | =T("IXIC.txt") |
2 | =A1.select(year(Date)==2019) |
3 | =A2.top(-3,Amount) |
A1: import NASDAQ index data
A2: select the data of 2019
A3: use the A.top(n,x) function to get the three highest volume values. If n is positive, the function takes the first n values in ascending order of x (the n smallest); if n is negative, it takes the last |n| values (the |n| largest). As a special case, if n is ±1, a single value is returned, which is similar to taking the minimum/maximum value.
We can also look up the four lowest volume values of the NASDAQ in 2019:
A | |
---|---|
3 | =A2.top(4,Amount) |
A3: employ the A.top(n,x) function to get the four lowest volume values
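For readers coming from a general-purpose language, the sign convention of n described above can be sketched in Python. This is only an analogue of the behaviour explained in the text, not SPL itself; the helper name `top` is hypothetical, and the volumes are the five sample rows from the table above rather than the full 2019 data.

```python
import heapq

def top(values, n):
    """Hypothetical analogue of the sign convention described above:
    n > 0 returns the n smallest values, n < 0 returns the |n| largest."""
    if n > 0:
        result = heapq.nsmallest(n, values)
    else:
        result = heapq.nlargest(-n, values)
    # With n = 1 or n = -1 a single value is returned, like min()/max().
    return result[0] if abs(n) == 1 else result

# Volumes from the five sample rows shown earlier.
amounts = [2261800000, 2607290000, 2579550000, 2507550000, 2380290000]

three_highest = top(amounts, -3)  # largest first
two_lowest = top(amounts, 2)      # smallest first
single_max = top(amounts, -1)     # one value, not a list
```

Using a heap-based selection rather than a full sort mirrors the idea that TopN only needs the n best candidates, not a completely ordered set.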
In an ordered set, we can perform inter-row calculations by taking the row numbers of the first N / last N members.
[e.g. 2] For the three trading days with the highest NASDAQ closing prices in 2019, calculate the growth rate of the trading volume relative to the previous trading day.
The SPL script looks like this:
A | |
---|---|
1 | =T("IXIC.txt") |
2 | =A1.select(year(Date)==2019).sort(Date) |
3 | =A2.ptop(-3,Close) |
4 | =A3.(A2(~).Amount/A2(~-1).Amount-1) |
A1: import NASDAQ data.
A2: select the data of 2019 and sort them by date.
A3: use the A.ptop(n,x) function to take the row numbers of the three highest closing prices.
A4: for each selected row number, calculate the growth rate by comparing the trading volume of that day with that of the previous day.
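The idea of taking positions first and then doing an inter-row calculation can be sketched in plain Python as follows. This is an illustrative analogue under stated assumptions, not the SPL implementation: the helper `top_positions` is hypothetical, positions are 0-based here (SPL row numbers start at 1), only the five sample rows are used, and a top row with no previous day is simply skipped.

```python
# Sample rows (Date, Close, Amount) from the table above, already in date order.
rows = [
    ("2019/01/02", 6665.939941, 2261800000),
    ("2019/01/03", 6463.5, 2607290000),
    ("2019/01/04", 6738.859863, 2579550000),
    ("2019/01/07", 6823.470215, 2507550000),
    ("2019/01/08", 6897.0, 2380290000),
]

def top_positions(rows, n, key):
    """Return the 0-based positions of the n largest rows by key (hypothetical helper)."""
    order = sorted(range(len(rows)), key=lambda i: key(rows[i]), reverse=True)
    return order[:n]

# Positions of the three highest closing prices.
positions = top_positions(rows, 3, key=lambda r: r[1])

# Volume growth relative to the previous trading day. The first row has no
# predecessor, so it is skipped (an assumption of this sketch).
growth = {
    rows[i][0]: rows[i][2] / rows[i - 1][2] - 1
    for i in positions
    if i > 0
}
```

Because the positions refer to the date-sorted sequence, the neighbouring row `i - 1` is always the previous trading day, which is why the data is sorted before the positions are taken.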
Sometimes we do not care about the specific first N / last N values; we care more about the records that contain them. Examples include finding the names of the top three students in a math final exam, or the names of the top five customers by sales in 2020.
[e.g. 3] Find the trading volumes of the five days with the lowest NASDAQ closing prices in 2019.
The SPL script looks like this:
A | |
---|---|
1 | =T("IXIC.txt") |
2 | =A1.select(year(Date)==2019) |
3 | =A2.top(5;Close) |
4 | =A3.new(Date,Amount) |
A1: import NASDAQ data.
A2: select data of 2019
A3: use the A.top(n;x) function (note the semicolon) to take the records of the five trading days with the lowest closing prices.
A4: take the dates and volumes from the records of the five days.
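Taking whole records instead of values can be sketched in Python by ranking on one field while keeping the full row. This is a plain-Python analogue for illustration only; it uses the five sample rows and takes two records rather than five, since the full data set is not reproduced here.

```python
import heapq

# Sample rows from the table above, kept as whole records.
rows = [
    {"Date": "2019/01/02", "Close": 6665.939941, "Amount": 2261800000},
    {"Date": "2019/01/03", "Close": 6463.5, "Amount": 2607290000},
    {"Date": "2019/01/04", "Close": 6738.859863, "Amount": 2579550000},
    {"Date": "2019/01/07", "Close": 6823.470215, "Amount": 2507550000},
    {"Date": "2019/01/08", "Close": 6897.0, "Amount": 2380290000},
]

# Rank by Close but return the complete records, lowest close first.
lowest = heapq.nsmallest(2, rows, key=lambda r: r["Close"])

# Project only the fields of interest, as the last step of the script does.
summary = [(r["Date"], r["Amount"]) for r in lowest]
```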
Querying the first N / last N members within each group is a very common requirement. Examples include finding the top two math scores in each class, or the three customers with the largest single sale in each month. In this section, we break down how to solve TopN-in-group problems.
We can regard the TopN query as a kind of aggregation operation. First, the data is grouped according to certain conditions, and then the TopN query is performed on each group. Let's discuss it in terms of values and records respectively.
Take the student score table as an example. Some of the data are as follows:
Class | StudentID | Subject | Score |
---|---|---|---|
1 | 1 | English | 95 |
1 | 1 | Math | 90 |
1 | 1 | PE | 80 |
1 | 2 | English | 75 |
1 | 2 | Math | 84 |
… | … | … | … |
[e.g. 4] Find the top two math scores in each class.
The SPL script looks like this:
A | |
---|---|
1 | =T("Score.txt") |
2 | =A1.select(Subject=="Math") |
3 | =A2.group(Class;~.top(-2,Score):TOP2) |
4 | =A3.new(Class,TOP2(1):First,TOP2(2):Second) |
A1: import score table.
A2: select the math grades.
A3: group by class, and use the A.top() function to get the top two math scores in each class.
A4: create the result table. The first column is the class, the second column is the highest score, and the third column is the second-highest score.
[e.g. 5] Check the information about the top three students of each subject in every class.
The SPL script looks like this:
A | |
---|---|
1 | =T("Score.txt") |
2 | =A1.group(Class,Subject;~.top(-3;Score):TOP3) |
3 | =A2.conj(TOP3) |
A1: import score table.
A2: group the records by class and subject, and take the records of the top three scores in each group.
A3: concatenate the records of the top three students of each subject in all classes.
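The group-then-TopN-then-concatenate pattern can be sketched in Python as below. It is an illustrative analogue, not SPL: the function name `top_n_per_group` is hypothetical, only the five sample score rows are used, and n is set to 1 so that the effect is visible on such a small sample.

```python
import heapq
from collections import defaultdict

# The five sample rows from the score table above.
scores = [
    {"Class": 1, "StudentID": 1, "Subject": "English", "Score": 95},
    {"Class": 1, "StudentID": 1, "Subject": "Math", "Score": 90},
    {"Class": 1, "StudentID": 1, "Subject": "PE", "Score": 80},
    {"Class": 1, "StudentID": 2, "Subject": "English", "Score": 75},
    {"Class": 1, "StudentID": 2, "Subject": "Math", "Score": 84},
]

def top_n_per_group(records, n):
    """Group by (Class, Subject), keep the n highest-scoring records of each
    group, then concatenate the groups into one flat list (hypothetical helper)."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["Class"], rec["Subject"])].append(rec)
    flattened = []
    for recs in groups.values():
        flattened.extend(heapq.nlargest(n, recs, key=lambda r: r["Score"]))
    return flattened

best = top_n_per_group(scores, 1)
best_summary = [(r["Subject"], r["StudentID"], r["Score"]) for r in best]
```

Groups appear in the order in which their first record is encountered, so the flattened result stays grouped by class and subject.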
When TopN is computed as an aggregation during accumulation, the grouped subsets themselves never need to be produced; this approach is often used when the amount of data is large. Again, let's discuss it in terms of values and records.
Take the sales table for example. Some of the data are as follows:
OrderID | Customer | OrderDate | SellerId | Amount |
---|---|---|---|---|
81182311 | VINET | 2013/07/04 | 5 | 2440.0 |
98807954 | TOMSP | 2013/07/05 | 6 | 1863.4 |
65550721 | HANAR | 2013/07/08 | 4 | 1813.0 |
37311312 | VICTE | 2013/07/08 | 3 | 670.8 |
80138612 | SUPRD | 2013/07/09 | 4 | 3730.0 |
… | … | … | … | … |
[e.g. 6] Check the top two sales amounts of each month in 2014.
The SPL script looks like this:
A | |
---|---|
1 | =file("Sales.txt").cursor@t().select(year(OrderDate)==2014) |
2 | =A1.groups(month(OrderDate):Month;top(-2,Amount):TOP2) |
3 | =A2.news(TOP2;Month,~:Amount) |
A1: create a cursor over the sales table and select the data of 2014.
A2: group the data by month and get the top two sales amounts of each month.
A3: create the result table. The first column is the month and the second column is the amount.
[e.g. 7] Find the records of the three largest sales in each month of 2014.
The SPL script looks like this:
A | |
---|---|
1 | =file("Sales.txt").cursor@t().select(year(OrderDate)==2014) |
2 | =A1.groups(month(OrderDate);top(-3;Amount):TOP3) |
3 | =A2.conj(TOP3) |
A1: create a cursor over the sales table and select the data of 2014.
A2: group them by month and get the records of the top three sales amounts per month.
A3: concatenate the records of the top three sales amounts of each month.
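To show why accumulation suits large data, the following Python sketch keeps only a bounded heap of n candidates per group while reading the records once, so no group is ever held in full. It is a minimal analogue under stated assumptions, not a description of how SPL implements cursors: the function `streaming_top_n` is hypothetical, a generator stands in for the cursor, and the sample rows are the 2013 rows from the table above with n set to 2.

```python
import heapq
from collections import defaultdict

def streaming_top_n(records, n, group_key, value_key):
    """One pass over an iterable; each group keeps at most n candidates."""
    heaps = defaultdict(list)  # group -> min-heap of (value, seq, record)
    for seq, rec in enumerate(records):
        # seq is unique, so tuples never fall through to comparing the dicts.
        entry = (value_key(rec), seq, rec)
        heap = heaps[group_key(rec)]
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # Larger than the smallest kept candidate: replace it.
            heapq.heapreplace(heap, entry)
    # Return the records of each group, largest value first.
    return {
        group: [e[2] for e in sorted(heap, key=lambda e: (-e[0], e[1]))]
        for group, heap in heaps.items()
    }

def read_sales():
    """Stand-in for a cursor: yields the sample rows one at a time."""
    yield {"OrderID": 81182311, "Customer": "VINET", "OrderDate": "2013/07/04", "Amount": 2440.0}
    yield {"OrderID": 98807954, "Customer": "TOMSP", "OrderDate": "2013/07/05", "Amount": 1863.4}
    yield {"OrderID": 65550721, "Customer": "HANAR", "OrderDate": "2013/07/08", "Amount": 1813.0}
    yield {"OrderID": 37311312, "Customer": "VICTE", "OrderDate": "2013/07/08", "Amount": 670.8}
    yield {"OrderID": 80138612, "Customer": "SUPRD", "OrderDate": "2013/07/09", "Amount": 3730.0}

top2_by_month = streaming_top_n(
    read_sales(),
    2,
    group_key=lambda r: int(r["OrderDate"].split("/")[1]),  # month number
    value_key=lambda r: r["Amount"],
)
```

Memory use in this sketch grows with the number of groups times n rather than with the number of records, which is the property that makes TopN usable as an aggregation on data too large to group in memory.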