-
Notifications
You must be signed in to change notification settings - Fork 339
General operations on SPL files
SPL has two file formats: bin file and composite table, with suffixes btx and ctx respectively. This article introduces the conventional calculations of SPL on these/ two types of files, similar to the simple calculations that SQL can express (excluding subqueries).
The basic concepts and generation methods related to these two types of files can be found in: How to use SPL storage for beginners.
Here, we use some data tables and fields from TPCH.
orders, order table (ctx), the used fields are O_ORDERKEY order number, primary key; O_CUSTKEY customer number; O_TOTALPRICE order amount; O_ ORDERDATE order date.
lineitem, online item table (order details table) (ctx), the used fields are L_ORDERKEY order number, primary key; L_LINENUMBER online product number, primary key; L_QUANTITY quantity; L_DISCOUNT discount; L_EXTENDEDPRICE extended Price.
Customer, customer table (ctx), with fields C_CUSTKEY customer number, primary key; C_Name customer name; C_NATIONKEY nation number; C_ACCTBAL customer value.
Nation, nation table (btx), with fields N_NATIONKEY nation number, primary key; N_ Name nation name; N_REGIONKEY region number; N_COMMENT comment.
Explanation of inter table relationships:
The order table is associated with the online item table on the order number, and the order table is associated with the customer table on the customer number.
The customer table is associated with the nation table on the nation number.
Filter out data that meets the criteria from the file.
Example: Filter data by criteria from tables of nation, customer, lineitem, and orders.
A | |
---|---|
1 | =T("nation.btx").select(N_NAME=="CHINA" && like(N_COMMENT,"*express*")) |
2 | =T("customer.ctx").select(C_ACCTBAL>=1000) |
3 | =file("lineitem.ctx").open().cursor@m(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT).select(L_EXTENDEDPRICE*L_DISCOUNT>=10000) |
4 | =file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;between(O_ORDERDATE,date(1996,1,1):date(1996,1,31))) |
A1 reads the bin file data into memory and filters it according to conditions.
A2 reads the composite table data into memory and filters it according to conditions.
A3 assumes that the composite table lineitem.ctx has a large amount of data and cannot be totally loaded into memory, a cursor is used to read in batches into memory, read one batch, and filter one batch according to the conditions.
Due to the fact that it is read first and then filtered, the fields for the filtering criteria must also appear in the field list, such as L_EXTENDEDPRICE and L_DISCOUNT in this case.
The addition of @m to a cursor refers to multithreaded parallel reading and computation, known as a multi cursor. In the following examples, unless otherwise specified, the cursor can be directly added with @m to become a multi cursor. Remove @m to represent an ordinary cursor and perform single threaded reading and calculation.
A4 assumes that the composite table orders.ctx has a large amount of data and cannot be totally loaded into memory. Unlike A3, here a cursor with filtering conditions is used for stream reading, which filters while reading, resulting in better performance. This method is called pre cursor filtering.
Due to the fact that it filters while reading, the filtering criteria may not appear in the field list, such as O_ORDERDATE here.
A3 and A4 return cursors, not result sets, and generally require subsequent calculations to be meaningful. This situation is common in SPL.
Explanation:
1. In general, it is recommended to use pre cursor filtering like A4. The A3 method is used for filtering after grouping, multi-purpose traversal, and other scenarios where pre cursor filtering cannot be used.
2. From the examples, it can be seen that whether it is a bin file or a composite table, all data is read into memory to form a table sequence, and the subsequent code writing is the same. In the case of in-memory situation, we will no longer provide examples of two types of files separately.
Aggregate the data in the file.
Example: Summarize the total number of nations, total number and average of customers, total number of orders, discounted prices of online items, and total price.
A | |
---|---|
1 | =T("nation.btx").count(1) |
2 | =T("customer.ctx").groups(;count(1),avg(C_ACCTBAL)) |
3 | =file("orders.ctx").open().cursor(O_ORDERKEY).total(count(1)) |
4 | =file("lineitem.ctx").open().cursor@m(L_EXTENDEDPRICE,L_DISCOUNT).total(sum(L_EXTENDEDPRICE*L_DISCOUNT),max(L_EXTENDEDPRICE)) |
A1: Read the file into memory and summarize the number of records.
A2: Read the file into memory and summarize the number of records and average value. groups can calculate multiple aggregation values, simplifying writing.
A3: Summarize the number of composite table records using a cursor. Just one primary key field is sufficient to minimize the number of retrieved fields, which is beneficial for performance.
A4: Summarize the composite table using a cursor, calculate the average and maximum values. Similarly, only the necessary fields are retrieved.
Filtering and aggregation can be used together to achieve aggregation after filtering.
Example:
Filter customers based on their numerical values first, and then calculate the average of their numerical values.
Filter online items first, then calculate sum and maximum.
A | |
---|---|
1 | =T("customer.ctx").select(C_ACCTBAL>=1000) |
2 | =A1.avg(C_ACCTBAL) |
3 | =file("lineitem.ctx").open().cursor(L_ORDERKEY,L_EXTENDEDPRICE,L_DISCOUNT;L_EXTENDEDPRICE*L_DISCOUNT>=10000) |
4 | =A3.total(sum(L_EXTENDEDPRICE*L_DISCOUNT),max(L_EXTENDEDPRICE)) |
A1: Read customer data into memory and filter according to conditions.
A2: Calculate the average value after filtering.
A3: Establish an online item table and define pre cursor filtering.
A4: Calculate filtering, sum and maximum value on the cursor.
Calculate a new column using multiple columns in the file data in two ways, in memory and cursor.
Example:
Concatenate the number and name of the nation table into a string.
Calculate the discounted price based on the online item table.
A | |
---|---|
1 | =T("nation.btx").derive(N_NATIONKEY/"-"/N_NAME:key&name) |
2 | =file("lineitem.ctx").open().cursor@m(L_EXTENDEDPRICE,L_DISCOUNT).derive(L_EXTENDEDPRICE*L_DISCOUNT:newPrice) |
A1: Read the file data into memory, and then calculate the new column key&name.
A2: Calculate the new column newPrice using the cursor method. What is returned here is a cursor and will not be calculated immediately. Subsequent calculations are required to obtain the result.
Group the data in the file according to some fields and aggregate in each group.
Example:
Group the nations by region, and calculate the number of nations in each region.
Group orders by date, and calculate the order amount and maximum amount for each date.
The group and aggregation can be divided into two cases:
1. The result set after grouping and aggregation is not large, and can be loaded in memory.
A | |
---|---|
1 | =T("nation.btx").groups(N_REGIONKEY;count(1):nationCount) |
2 | =file("orders.ctx").open().cursor(O_ORDERDATE,O_TOTALPRICE).groups(O_ORDERDATE;sum(O_TOTALPRICE):all,max(sum(O_TOTALPRICE)):max) |
A1: Load the file into memory and then group and aggregate it.
A2: Calculate the grouping and aggregation using a cursor method, and the result of groups is a table sequence.
2. The results after grouping and aggregation are large and cannot fit in memory.
A | |
---|---|
1 | =file("orders.ctx").open().cursor(O_ORDERDATE,O_TOTALPRICE).groupx(O_ORDERDATE;sum(O_TOTALPRICE):all,max(sum(O_TOTALPRICE)):max) |
A1: Calculate the grouping and aggregation using a cursor method, and the result of groupx is a cursor. groupx requires the use of external storage buffer, which takes a long time. Usually used for offline data preparation, such as summarizing order details into an order table based on order numbers, and saving a separate copy for subsequent queries.
Remove duplicates from the data in the file and keep only one value with the same value.
Example:
Filter first, then query all region numbers in the nation table to remove duplicates.
First filter out orders with amounts within the specified range, and then query all customer numbers in the order table to remove duplicates.
A | |
---|---|
1 | =T("nation.btx").select(like(N_NAME,"C*")).id(N_REGIONKEY) |
2 | =file("orders.ctx").open().cursor(O_CUSTKEY;between(O_TOTALPRICE,1000:5000)).id(O_CUSTKEY) |
A1: Load the file content into memory, then filter it first and then remove duplicates.
A2: Perform a pre cursor filtering and then remove duplicates. When the result set after deduplication is still large, use the groupx function. The code is:
=file("orders.ctx").open().cursor(O_CUSTKEY;between(O_TOTALPRICE,1000:5000)).groupx(O_CUSTKEY)
Both the id function of a table sequence and cursor allow for multiple fields to be deduplicated, with multiple fields separated by commas.
Remove duplicates from the file data and calculate how many unique values there are.
Example:
Filter out eligible nations first, and then calculate the number of unique region keys for these nations.
First filter out orders with customer numbers 2, 3, 5, and 8, and then calculate the number of unique customer numbers in the orders.
A | |
---|---|
1 | =T("nation.btx").select(N_NATIONKEY>10).icount(N_REGIONKEY) |
2 | =file("orders.ctx").open().cursor(O_CUSTKEY;[2,3,5,8].contain(O_CUSTKEY)).total(icount(O_CUSTKEY)) |
A1: Load the file content into memory, then filter it first and then count unique values.
A2: Using a cursor method, filter first and then count unique values.
After grouping the data in the file, count unique values in each group.
Example:
After grouping customers by nation, calculate the number of unique customer names for each group.
After grouping orders by date, calculate the number of unique customers in each group.
A | |
---|---|
1 | =T("customer.ctx").groups(C_NATIONKEY;icount(C_NAME)) |
2 | =file("orders.ctx").open().cursor(O_ORDERKEY,O_CUSTKEY,O_ORDERDATE).groups(O_ORDERDATE;icount(O_CUSTKEY)) |
A1: Load the file data into memory, and then calculate the grouping and count unique numbers.
A2: Calculate grouping and unique values in each group for file data using cursor method. If the data amount is large, use the groupx function.
Sort the data in the file in ascending/descending order.
Example:
Sort nation data in ascending order by region and descending order by name.
Sort the order data in ascending order of date.
A | |
---|---|
1 | =T("nation.btx").sort(N_REGIONKEY,N_NAME:-1) |
2 | =file("orders.ctx").open().cursor(O_ORDERDATE,O_ORDERKEY).sortx(O_ORDERDATE) |
A1: Sort file data after reading it into memory, N_NAME:-1 represents descending order.
A2: Sort file data using cursor method, only supports ascending order. In this situation, the data amount is generally large, and sorting requires the use of hard disk buffer, which takes a long time. It is generally used for offline data preparation, such as storing order data by date and saving it as another composite table, which can improve performance when querying.
If the data is ordered by the grouping field, grouping can only compare with adjacent records, which can improve grouping performance.
Example:
After sorting the nation data by region, perform an ordered grouping and aggregation.
Sort the order table by date in advance and save it as a composite table orders_order_date.ctx, then we can do an ordered grouping and aggregation by date here.
A | |
---|---|
1 | =T("nation.btx").sort(N_REGIONKEY) |
2 | =A1.groups@o(N_REGIONKEY;count(1):c) |
3 | =file("orders_order_date.ctx").open().cursor(O_ORDERDATE,O_TOTALPRICE) |
4 | =A3.group@s(O_ORDERDATE;count(1):c,sum(O_TOTALPRICE):s) |
A1: Read nation file into memory and sort it by region. The sorting can be done in advance and saved as a separate file.
A2: Using ordered data, use groups@o to do an ordered grouping and aggregation. groups@o requires that the data must be in order by the grouping field.
A3: Define a cursor using the order file sorted by date in advance.
A4: Use the group function to do an ordered grouping to the cursor, and returns a cursor. The group function requires that the data must be in order by the grouping field.
If the data is ordered by the deduplication field, only adjacent records need to be compared during deduplication, which can improve deduplication performance.
Example:
Filter first, then query all region numbers in the nation table to remove duplicates. Assuming that nation table is ordered by region number.
Filter first, then query all customer numbers in the order table to remove duplicates. Assuming that the order table is ordered by customer number.
A | |
---|---|
1 | =T("nation_order_regionkey.btx").select(like(N_NAME,"C*")).id@o(N_REGIONKEY) |
2 | =file("orders_order_custkey.ctx").open().cursor(O_CUSTKEY;between(O_TOTALPRICE,1000:5000)).id@o(O_CUSTKEY) |
A1: Load the file into memory, then filter it first and then remove duplicates.
A2: Perform a pre cursor filtering and then remove duplicates. If the result after deduplication is still large, group@1 should be used:
=file("orders_order_custkey.ctx").open().cursor(O_CUSTKEY;between(O_TOTALPRICE,1000:5000)).group@1(O_CUSTKEY)
It still returns a cursor. The group function requires that the cursor data is ordered by the deduplication field.
id@o function requires that the data is ordered by the deduplication field.
If the data is ordered by the deduplication field, only adjacent records need to be compared during deduplication, which can improve the performance of counting unique values.
Example:
Calculate the number of unique region numbers for the nation table. Assuming that nation file is ordered by region number.
Calculate the number of unique customer numbers in the orders file. Assuming that the orders file is ordered by customer number.
A | |
---|---|
1 | =T("nation.btx").icount@o(N_REGIONKEY) |
2 | =file("orders.ctx").open().cursor(O_CUSTKEY).total(icount@o(O_CUSTKEY) |
A1: Load the file into memory, then filter it first and then deduplicate and count it.
A2: Using a cursor method, filter first and then deduplicate and count.
icount@o function requires that the data is ordered by the deduplication field.
If the data is ordered by the deduplication field, only adjacent records need to be compared during deduplication, which can improve the performance of counting unique values in a group.
Example:
Group customers by nation and calculate the number of unique customer names in each group. Assuming customer data is ordered by customer name.
Group orders by date and calculate the number of unique customer numbers in each group. Assuming that the order data is ordered by customer number.
A | |
---|---|
1 | =T("customer.ctx").groups(C_NATIONKEY;icount@o(C_NAME)) |
2 | =file("orders.ctx").open().cursor(O_CUSTKEY,O_ORDERDATE).groups(O_ORDERDATE;icount@o(O_CUSTKEY)) |
A1: Load the file data into memory, and then group and count unique values.
A2: Using a cursor method, group the file data and count unique values.
If the data is ordered by the grouping field and deduplicated field, such as order data is ordered by O_ORDERDATE and O_CUSTKEY, ordered grouping and counting unique values in a group can be used together.
Find the topN and bottomN of the data in the file.
Example:
Filter according to the enumeration type criteria first, and then calculate the top 3 customer values.
First, exclude some customers (20,31,55,86) according to the enumeration type, and then calculate the order amount of the bottom 10.
A | |
---|---|
1 | =T("customer.ctx").select([2,3,5,8].contain(C_NATIONKEY)).top(3;C_ACCTBAL) |
2 | =file("orders.ctx").open().cursor(O_ORDERKEY,O_TOTALPRICE;![20,31,55,86].contain(O_CUSTKEY)).total(top(-10;O_TOTALPRICE)) |
A1: Load the file content into memory and calculate the top three after filtering.
A2: Use the cursor method to calculate the top -10 (bottom 10) after filtering, where top is an aggregation function like sum and count.
Calculate the TOP-N of each group after grouping the data in the file.
Example:
First filter customers values based on their integer multiples of 10, and then group them by nation to find the top 3 customers in each nation.
First filter orders by date of the year 2023, and then group them by customer, and calculate the amount that comes from the bottom 10 in each group.
A | |
---|---|
1 | =T("customer.ctx").select(C_ACCTBAL%10==0).groups(C_NATIONKEY;top(3,C_ACCTBAL)) |
2 | =file("orders.ctx").open().cursor(O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE;year(O_ORDERDATE)==2023).groups(O_CUSTKEY;top(-10,O_TOTALPRICE)) |
A1: Load the file content into memory, then filter, group, and calculate the top three.
A2: Filter, group and calculate the top -10 (bottom 10) using cursor method. When the grouping result set is not large, use the groups function; When the result set is large, use the groupx function.
The top function here has a different parameter separator compared to the previous example. Here, separated by a comma, returns a sequence composed of the first n values of the sorting expression; Separated by a semicolon in the previous example, returns a sequence composed of the first n records.
For the convenience of description, we can also regard the bin file and composite table as data tables.
Two data tables, where non primary key fields in Table A are associated with the primary key of Table B. B is called A's foreign key table and this association is called a foreign key association.
Example:
The customer table and nation table are associated on nation number, concatenate the customer name and nation name with a '-'.
The order table and customer table are associated on customer number, calculate the amount of each order plus customer value.
There are two methods to implement this single foreign key association, method 1:
A | B | |
---|---|---|
1 | =T("nation.btx").keys(N_NATIONKEY) | =T("customer.ctx") |
2 | =B1.switch(C_NATIONKEY,A1:N_NATIONKEY) | |
3 | =A2.derive(C_NAME/"-"/C_NATIONKEY.N_NAME) | |
4 | =file("orders.ctx").open().cursor(O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE) | |
5 | =A4.switch(O_CUSTKEY,B1) | |
6 | =A5.derive(O_CUSTKEY.C_ACCTBAL+O_TOTALPRICE:newValue) |
A1: Read nation data into memory and set the primary key.
B1: Read customer data into memory, and the primary key has already been defined during composite table generation, so there is no need to set it again.
A2: The customer table is associated with the nation table. Since the primary key has been defined in A1, here :N_NATIONKEY can be omitted.
A3: Calculate the associated customer table using its own field and nation table field.
A4: Define the cursor for the order table, retrieving only the required fields.
A5: The order table cursor is associated with the in-memory customer table.
A6: Add the customer's field to the cursor order field and returns a cursor.
Method 2:
A | B | |
---|---|---|
1 | =T("nation.btx").keys(N_NATIONKEY) | =T("customer.ctx") |
2 | =B1.join(C_NATIONKEY,A1:N_NATIONKEY,N_NAME) | |
3 | =A2.derive(C_NAME/"-"/N_NAME) | |
4 | =file("orders.ctx").open().cursor(O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE) | |
5 | =A4.join(O_CUSTKEY,B1,C_ACCTBAL) | |
6 | =A5.derive(C_ACCTBAL+O_TOTALPRICE:newValue) |
A1: Read nation data into memory and set the primary key.
B1: Read customer data into memory, and the primary key has already been defined during composite table generation, so there is no need to set it again.
A2: Use the join function to associate the nation number in B1 with the nation number in A1, when N_NATIONKEY is the primary key, it can also be omitted, while adding the N_NAME column data in A1 to form a new table sequence.
A3: Perform calculations on the new table sequence.
A5: Use the join function to associate the cursor of A4 with B1 through the customer number, while adding the C_ACCTBAL field of B1, returns a cursor.
A6: Add two fields of the cursor to obtain a new field.
For cases where multiple foreign key fields are associated, the switch function cannot be used, only the join function can be used for association. Assuming that both the primary and foreign keys in the above example become two, the code would be as follows:
A | B | |
---|---|---|
1 | =T("nation.btx").keys(N_NATIONKEY1, N_NATIONKEY2) | =T("customer.ctx") |
2 | =B1.join(C_NATIONKEY1:N_NATIONKEY2,A1:N_NATIONKEY1: N_NATIONKEY2,N_NAME) | |
3 | =A2.derive(C_NAME/"-"/N_NAME) | |
4 | =file("orders.ctx").open().cursor(O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE) | |
5 | =A4.join(O_CUSTKEY1: O_CUSTKEY2,B1,C_ACCTBAL) | |
6 | =A5.derive(C_ACCTBAL+O_TOTALPRICE:newValue) |
Multiple primary and foreign keys in A2 and A5 should be separated by colons.
The primary key of Table A is associated with the primary key of Table B, and A and B are called homo-dimension tables. The hono- dimension tables are a one-to-one relationship that can be logically treated as one table. Homo- dimension tables are always associated by primary keys, and corresponding records are uniquely corresponding.
Assuming the primary keys of national information table nation_info and nation table are both nation numbers.
The primary keys of customer table and customer information table customer_info are both customer numbers.
Example 1:
Concatenate the name field in the nation table with the population field in the national information table using "-".
Add the customer value in the customer table and the FUND in the customer information table.
A | B | |
---|---|---|
1 | =T("nation.btx").keys(N_NATIONKEY) | =T("nation_info.btx").keys(NI_NATIONKEY) |
2 | =join(A1:n,N_NATIONKEY;B1:ni,POPULATION) | |
3 | =A2.new(n.N_NAME/"-"/ni.POPULATION) | |
4 | =file("customer.ctx").open().cursor(C_CUSTKEY,C_ACCTBAL) | |
5 | =file("customer_info.ctx").open().cursor(CI_CUSTKEY,FUND) | |
6 | =joinx(A4:c,C_CUSTKEY;A5:ci,CI_CUSTKEY) | |
7 | =A6.new(c.C_ACCTBAL+ci.FUND:newValue) |
A1 and B1 read nation and national information into memory.
A2: Join nation table and national information table on primary key.
A3: Create a new table sequence, concatenating the nation name and population into a string field.
A4, A5: Define the cursors of customer table and customer information table.
A6: Join two cursors on the primary key. When using joinx, the cursor data participating in the association must be ordered by the associated fields.
A7: Fetch the fields of c and ci in A6, and add them to form a new cursor.
If using multiple cursors for A4 and A5, add @m to A4. A5 cannot directly add @m, but should be written as:
=file("customer_info.ctx").open().cursor(CI_CUSTKEY,FUND;;A4)
This is because during parallel segmentation, it is necessary to ensure that the same primary keys of two files are divided into corresponding segments.
There is also a primary key association in the form of a primary sub table. The primary key of Table A is associated on part of the primary keys of Table B, where A is called the primary table and B is called the sub table. The primary sub tables are a one-to-many relationship.
The order table and online item table are primary sub tables, while the order table is the primary table.
Assuming customer table and customer contact table customer_contacts are primary sub tables, where the primary keys of the latter are the customer number and contact number.
Example:
Query customers with female contacts.
Query the online item table for order information with a discount greater than 0.05.
A | |
---|---|
1 | =T("customer.ctx") |
2 | =T("customer_contacts.ctx").select(GENDER==1) |
3 | =join(A1:c,A1.C_CUSTKEY;A2:cc,CC_CUSTKEY) |
4 | =A3.conj(c) |
5 | =file("orders.ctx").open().cursor(O_ORDERKEY,O_CUSTKEY,O_TOTALPRICE) |
6 | =file("lineitem.ctx").open().cursor(L_ORDERKEY;L_DISCOUNT>0.05) |
7 | =joinx(A5:o,O_ORDERKEY;A6:l,L_ORDERKEY) |
8 | =A7.conj(o) |
A1: Read the customer table into memory.
A2: Read customer contacts into memory and filter by gender female (1).
A3: Join A1 and A2 on the primary key.
A4: Use column c to merge to obtain customer data.
A5: Define the cursor for the order table.
A6: Define the online item table cursor, using pre cursor filtering.
A7: Join two cursors on the primary key. When using joinx, the cursor data participating in the association must be ordered by the associated fields.
A8: Fetch o from A7 and merge to obtain the result cursor.
If using multiple cursors for A5 and A6, add @m to A5. A6 cannot directly add @m, it should be written as:
=file("lineitem.ctx").open().cursor(L_ORDERKEY;L_DISCOUNT>0.05;A5)
Moreover, the order of the primary and sub tables cannot be changed. The primary table A5 should add @m and be written in the front, and the sub table A6 should not add @m. The last ';' should be followed by the primary table A5.
This is because during parallel segmentation, it is necessary to ensure that the sub tables are segmented in the same way as the primary table, and the same primary keys are divided into corresponding segments.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code