-
Notifications
You must be signed in to change notification settings - Fork 339
Add order related operation capability to Lambda syntax
We will continue to discuss the Lamba syntax in set operations and introduce order related computational capabilities.
SQL uses the mathematical concept of unordered sets and does not pay attention to order when traversing sets. But computers can only execute step by step, and there will always be an order when looping. By fully utilizing this order, richer computational requirements can be easily expressed.
For example, we want to extract half of the members from a set to form a new set. This may seem like a filtering operation, but the filtering conditions are not related to the set members themselves, but are determined by the sequence number during the loop.
In SPL, it is not convenient to describe this type of operation using only the ~ notation. In this case, a symbol (identifier) is needed to represent the sequence number of the loop. This is the fourth rule of Lambda syntax.
In fact, most high-level languages have a loop variable to represent the sequence number when writing loop statements, which plays this role. However, this mechanism is not provided in Lambda syntax, and when encountering such operations, one can only write a loop or artificially create a sequence number to implement it, which is very cumbersome. SQL also does not have a scheme to represent the sequence number of traversals, and can only manually create a sequence number using a subquery before filtering it.
SPL uses # to represent the sequence number of traversal, so this operation is easy to write:
A.select(#<=A.len()/2) Get the first half of the members
A.select(#%2==0) Get members of even positions
Correspondingly, in filtering operations, we usually return members that meet the conditions, but sometimes we don't care about specific members and only care about their sequence numbers. SPL also designs a filtering function that returns sequence numbers:
A.pselect( ~>5 ) Return the sequence numbers of members greater than 5
Similarly, there may also be:
A.pmax() Return the sequence number of the maximum value
...
We will further discuss these operations related to positions in the future.
When considering the order of traversal on the set, we can further enhance the descriptive power of computation.
For example, if 12 months of sales data have been prepared in order, now we want to calculate which months have a growth rate exceeding 5%.
Writing SQL for this type of cross row calculation is quite cumbersome, requiring the use of JOIN or window functions to align last month's data with current month's data, and then calculate the growth rate, which inevitably requires subqueries. If the syntax of referencing adjacent members is provided, this calculation can be easily described.
In SPL, [i] is used to represent the member with a distance of i from the current member, and combined with the # notation mentioned earlier, the above calculation can be written as:
A.(if(~/~[-1]>1.05,#,0)).select(~>0)
~[-1] represents the previous member, which is the previous month's sales revenue. Identify the months with growth rates exceeding 5% (i.e. #), clear all other months to zero, and finally select these none zero months.
If the filtering function that returns the sequence number mentioned above is used, it can also be written in a simpler form:
A.pselect(~/~[-1]>1.05)
In addition to adjacent members, there may also be references of adjacent sets, such as in the set above, we want to calculate the moving average of sales for each month before and after.
SPL extends the [i] expression to the [a, b] notation to represent a set composed of adjacent members, and this operation is easy to describe:
A.(~[-1,1].avg())
A set of adjacent members may also have more complex situations, such as calculating cumulative sales for the current month.
SPL allows for the default representation of a missing meaning starting from the first member (equivalently, b missing can be understood as to the last member) in the [a, b] notation, then this operation can be written as
A.(~[,0].sum())
Similarly, for structured data calculations, field names can also be directly used. If the set in this example is a table composed of two fields: "month" and "sales", the above operations can be written as follows:
A.select(amount/amount[-1]>1.05) There is already a month field in the result set here, so there is no need to use #
A.derive(amount[-1,1].avg:move_average) Add a field to represent moving average
A.derive(amount[,0].sum():cum_amount)
When considering orderliness, the rules of Lambda syntax are more complex than regular set operations, and these ordered operations often occur. If Lambda syntax does not support them, it will make these calculations difficult to describe, and programmers will have to fall back to the state of writing multi line loop statements, or artificially create sequence numbers, which is cumbersome and affects readability.
SQL does not provide ordered Lambda syntax and often requires the use of subqueries and window functions to generate sequence numbers. Some complex ordered traversal operations cannot even be written, and they will have to be converted into multi line loop statements using stored procedures. In this sense, although SQL has set-oriented features, the set-orientation is not thorough enough.
Python provides partial sequence number capability, and adjacent references are also quite cumbersome. SPL may be the only language that provides a more complete ordered Lambda syntax.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code