-
Notifications
You must be signed in to change notification settings - Fork 340
Comparison of SQL and SPL:Complicated Static Transposition
【Abstract】 Data transposition aims to convert queried data into a specified format to display using front-end applications, such as reporting tools. There are row-to-column transposition, column-to-row transposition, and more complicated dynamic transposition. This essay focuses on solutions and basic principles of SQL and SPL, the two commonly used programming languages, in handling transposition scenarios, and tries to find the convenient and efficient way for you through sample programs in SQL and SPL.
Let’s take a look at how SQL and SPL handle complicated static transposition scenarios.
【Example 1】Based on the following punch-in data table, generate a new table recording the daily activities for each employee. Each person corresponds to seven records per day:
PER_CODE | IN_OUT | DATE | TIME | TYPE |
---|---|---|---|---|
1110263 | 1 | 2013-10-11 | 09:17:14 | In |
1110263 | 6 | 2013-10-11 | 11:37:00 | Break |
1110263 | 5 | 2013-10-11 | 11:38:21 | Return |
1110263 | 0 | 2013-10-11 | 11:43:21 | NULL |
1110263 | 6 | 2013-10-11 | 13:21:30 | Break |
1110263 | 5 | 2013-10-11 | 14:25:58 | Return |
1110263 | 2 | 2013-10-11 | 18:28:55 | Out |
… | … | … | … | … |
Below is the result of the expected layout:
PER_CODE | DATE | IN | OUT | BREAK | RETURN |
---|---|---|---|---|---|
1110263 | 2013-10-11 | 9:17:14 | 18:28:55 | 11:37:00 | 11:38:21 |
1110263 | 2013-10-11 | 9:17:14 | 18:28:55 | 13:21:30 | 14:25:58 |
… | … | … | … | … | … |
SQL solution:
Here multiple records should be combined to do the calculation and return multiple records according to the specified structure. We sort records every 7 rows by PER_CODE and DATE, get TIME field values from rows numbered [1,7,2,3] in turn from each group and make them first group of values under IN, OUT, BREAK, and RETURN fields, and then retrieve TIME field values from rows numbered [1,7,5,6] in turn from each group and make them the second groups of values under these fields. With databases that do not support PIVOT/UNPIVOT, SQL has the following query:
WITH CTE1 AS (
SELECT
PER_CODE,IN_OUT,"DATE","TIME",TYPE,
MOD(ROWNUM-1,7)+1 GROUP_ORDER
FROM DAILY_TIME
ORDER BY PER_CODE,"DATE","TIME"
)
SELECT * FROM (
SELECT
T_IN.PER_CODE,T_IN."DATE",T_IN."IN",T_OUT."OUT",
T_BREAK.BREAK,T_RETURN."RETURN"
FROM (
SELECT PER_CODE,"DATE","TIME" "IN"
FROM CTE1
WHERE GROUP_ORDER=1
) T_IN
LEFT JOIN (
SELECT PER_CODE,"DATE","TIME" "OUT"
FROM CTE1
WHERE GROUP_ORDER=7
) T_OUT
ON T_IN.PER_CODE=T_OUT.PER_CODE
AND T_IN."DATE"=T_OUT."DATE"
LEFT JOIN (
SELECT PER_CODE,"DATE","TIME" BREAK
FROM CTE1
WHERE GROUP_ORDER=2
) T_BREAK
ON T_IN.PER_CODE=T_BREAK.PER_CODE
AND T_IN."DATE"=T_BREAK."DATE"
LEFT JOIN (
SELECT PER_CODE,"DATE","TIME" "RETURN"
FROM CTE1
WHERE GROUP_ORDER=3
) T_RETURN
ON T_IN.PER_CODE=T_RETURN.PER_CODE
AND T_IN."DATE"=T_RETURN."DATE"
UNION ALL (
SELECT
T_IN.PER_CODE,T_IN."DATE",T_IN."IN",T_OUT."OUT",
T_BREAK.BREAK,T_RETURN."RETURN"
FROM (
SELECT PER_CODE,"DATE","TIME" "IN"
FROM CTE1
WHERE GROUP_ORDER=1
) T_IN
LEFT JOIN (
SELECT PER_CODE,"DATE","TIME" "OUT"
FROM CTE1
WHERE GROUP_ORDER=7
) T_OUT
ON T_IN.PER_CODE=T_OUT.PER_CODE
AND T_IN."DATE"=T_OUT."DATE"
LEFT JOIN (
SELECT PER_CODE,"DATE","TIME" BREAK
FROM CTE1
WHERE GROUP_ORDER=5
) T_BREAK
ON T_IN.PER_CODE=T_BREAK.PER_CODE
AND T_IN."DATE"=T_BREAK."DATE"
LEFT JOIN (
SELECT PER_CODE,"DATE","TIME" "RETURN"
FROM CTE1
WHERE GROUP_ORDER=6
) T_RETURN
ON T_IN.PER_CODE=T_RETURN.PER_CODE
AND T_IN."DATE"=T_RETURN."DATE"
)
)
ORDER BY PER_CODE,"DATE",BREAK
The query is hard to understand. Take ORACLE 11g as an example, we use PIVOT to simplify the LEFT JOIN part. Then the SQL query is as follows:
WITH CTE1 AS (
SELECT
PER_CODE,IN_OUT,"DATE","TIME",TYPE,
MOD(ROWNUM-1,7)+1 GROUP_ORDER
FROM DAILY_TIME
ORDER BY PER_CODE,"DATE","TIME"
)
SELECT *
FROM (
SELECT *
FROM (
SELECT
PER_CODE,"DATE","TIME",GROUP_ORDER
FROM CTE1
WHERE GROUP_ORDER IN (1,7,2,3)
)
PIVOT(
MIN("TIME") FOR GROUP_ORDER
IN (1 AS "IN",7 AS "OUT",2 AS BREAK,3 AS "RETURN")
)
UNION ALL
(
SELECT *
FROM (
SELECT
PER_CODE,"DATE","TIME",GROUP_ORDER
FROM CTE1
WHERE GROUP_ORDER IN (1,7,5,6)
)
PIVOT(
MIN("TIME") FOR GROUP_ORDER
IN (1 AS "IN",7 AS "OUT",5 AS BREAK,6 AS "RETURN")
)
)
)
ORDER BY PER_CODE,"DATE",BREAK
SPL solution:
Though the table structure after transposition is definite, it is still complicated to get this done using A.pivot() function. As an alternative, we create the target data structure instead and then populate data to it.
A | |
---|---|
1 | =create(PER_CODE,DATE,IN,OUT,BREAK,RETURN) |
2 | =T("DailyTime.txt").sort(PER_CODE,DATE,TIME) |
3 | =A2.group((#-1)\7).(~([1,7,2,3,1,7,5,6])) |
4 | >A1.record(A3.conj([ |
A1: Create an empty table according to the specified data structure.
A2: Import the daily punch-in records and sort them by employee code and date.
A3: Group A2’s records every seven rows and, for each group, return records by retrieving values in the order of [1,7,2,3,1,7,5,6].
A4: Concatenate all returned records according to the target order and insert them to A3’s table.
Let’s examine the SPL solution. First we create an empty table according to the target structure, sort original records by grouping them every seven rows, and in each group, get time values from rows [1,7,2,3,1,7,5,6] in turn for the future two records, and finally, populate values to the empty table in sequence. The SQL query is extremely complicated though it uses PIVOT function. PIVOT is not a suitable method for handling this case. Yet SQL has problems in dealing with the task using SPL’s way. SQL’s grouping operation cannot return the post-grouping subsets, which makes it unable to perform further computations on the subset as SPL does. A SQL set is unordered. The language can generate sequence numbers for members in each group through row numbers, but it is inconvenient for it to access members in turn using multiple sequence numbers. SPL, however, can do that.
【Example 2】Based on the following user payment detail table, generate a new table storing the payable amount per month for each user in the year 2014. Below is part of the source table:
ID | CUSTOMERID | NAME | UNPAID | ORDER_DATE |
---|---|---|---|---|
112101 | C013 | CA | 12800 | 2014/02/21 |
112102 | C013 | CA | 3500 | 2014/06/15 |
112103 | C013 | CA | 2600 | 2015/03/21 |
… | … | … | … | … |
Below is the result of the following layout:
NAME | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CA | 12800 | 12800 | 12800 | 12800 | 3500 | 3500 | 3500 | 3500 | 3500 | 3500 | 3500 | |
… |
SQL solution:
It is impossible for SQL to implement the general method of creating the target table structure and then populating data to the structure. The language needs to work out a solution for each case. In the last step, PIVOT function can be used to transpose month values into column names. Before that, we need to prepare the original data according to the following format:
NAME | ORDER_MONTH | UNPAID |
---|---|---|
CA | 1 | |
CA | 2 | 12800 |
CA | 3 | 12800 |
CA | 4 | 12800 |
CA | 5 | 12800 |
CA | 6 | 3500 |
CA | 7 | 3500 |
… | … | … |
Yet in the original table, not each month has records. What we expect is a table recording the monthly payable amounts for each user in the year 2014. We perform a cross product on the list of unique customers who have the payment records and the months from January to December to generate a table of the target structure. Then we left join this table with the user payment detail table to get the monthly payable amount for each user. Now the table is as follows:
NAME | ORDER_MONTH | UNPAID |
---|---|---|
CA | 1 | |
CA | 2 | 12800 |
CA | 3 | |
CA | 4 | |
CA | 5 | |
CA | 6 | 3500 |
CA | 7 | |
… | … | … |
There is more to do as we expect to populate January’s payable amount, 12,800, to months from March to May, and for customer CA, we want to insert June’s payable amount 3500 to months after July. To do those, we can use the subquery, where we select the previous records containing the payable amounts for a customer if the current month does not have a payable amount and get the one with the latest month. Below is the complete SQL query:
WITH CTE1 AS(
SELECT T1.NAME,T1.ORDER_MONTH,T2.UNPAID
FROM (
SELECT *
FROM (
SELECT DISTINCT CUSTOMERID, NAME
FROM PAYMENT
WHERE EXTRACT (YEAR FROM ORDER_DATE)=2014
)
CROSS JOIN (
SELECT 1 ORDER_MONTH FROM DUAL
UNION ALL SELECT 2 ORDER_MONTH FROM DUAL
UNION ALL SELECT 3 ORDER_MONTH FROM DUAL
UNION ALL SELECT 4 ORDER_MONTH FROM DUAL
UNION ALL SELECT 5 ORDER_MONTH FROM DUAL
UNION ALL SELECT 6 ORDER_MONTH FROM DUAL
UNION ALL SELECT 7 ORDER_MONTH FROM DUAL
UNION ALL SELECT 8 ORDER_MONTH FROM DUAL
UNION ALL SELECT 9 ORDER_MONTH FROM DUAL
UNION ALL SELECT 10 ORDER_MONTH FROM DUAL
UNION ALL SELECT 11 ORDER_MONTH FROM DUAL
UNION ALL SELECT 12 ORDER_MONTH FROM DUAL
)
) T1
LEFT JOIN (
SELECT
CUSTOMERID, NAME,
EXTRACT (MONTH FROM ORDER_DATE) ORDER_MONTH, UNPAID
FROM PAYMENT
WHERE EXTRACT (YEAR FROM ORDER_DATE)=2014
) T2
ON T1.NAME=T2.NAME AND
T1.ORDER_MONTH=T2.ORDER_MONTH
ORDER BY NAME,ORDER_MONTH
),
CTE2 AS (
SELECT
T1.NAME,T1.ORDER_MONTH,
NVL(T1.UNPAID,
(
SELECT
MIN(UNPAID) KEEP (DENSE_RANK FIRST ORDER BY ORDER_MONTH DESC)
FROM CTE1 T2
WHERE T1.NAME=T2.NAME AND
T2.UNPAID>0 AND
T2.ORDER_MONTH<T1.ORDER_MONTH
)
) UNPAID
FROM CTE1 T1
ORDER BY T1.NAME,T1.ORDER_MONTH
)
SELECT *
FROM CTE2
PIVOT(
MIN(UNPAID) FOR ORDER_MONTH
IN (
1 AS "1",2 AS "2",2 AS "3",
4 AS "4",5 AS "5",6 AS "6",
7 AS "7",8 AS "8",9 AS "9",
10 AS "10",11 AS "11",12 AS "12"
)
)
The SQL query is too complicated to understand even though we have given a detailed explanation. Now let’s look at how SPL handles the task:
SPL solution:
It is complicated to do this with A.pivot() function. According to the logic in the previous example, we create a target data structure and then populate data to it.
A | |
---|---|
1 | =create(NAME,${12.string()}) |
2 | =T("Payment1.txt").select(year(ORDER_DATE)==2014).group(CUSTOMERID) |
3 | >A2.((m12=12.(null), |
A1: Create an empty table of the target data structure.
A2: Import the user payment table containing records of the year 2014 and sort it by customer ID.
A3: Loop through each group and for each member in a group calculate the payable amount per month, and then insert the results to A1’s table along with customer names.
The SPL script is concise and, more importantly, clear in logic. For a complicated static transposition task, we can first create the target data structure and then populate data to it. As the SPL set is ordered, it is convenient to perform an inter-row calculation.
Presenting data in horizontal column groups is often used for data visualization, where data having same type of attributes is displayed in groups of columns for convenient viewing. Let’s see how SQL and SPL handle this type of transposition through an example.
【Example 3】List names and salaries of employees whose get paid over 10,000 in both sales department and R&D department. Below is part of the employee table:
ID | NAME | SURNAME | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | Moore | California | R&D | 7000 |
2 | Ashley | Wilson | New York | Finance | 11000 |
3 | Rachel | Johnson | New Mexico | Sales | 9000 |
4 | Emily | Smith | Texas | HR | 7000 |
5 | Ashley | Smith | Texas | R&D | 16000 |
… | … | … | … | … | … |
Below is result of the expected layout:
SALESNAME | SALARY | RDNAME | SALARY |
---|---|---|---|
Madeline | 15000 | Ashley | 16000 |
Jacob | 12000 | Jacob | 16000 |
Andrew | 12000 | Ryan | 13000 |
… | … | … | … |
SQL solution:
First, we perform conditional filtering according to the condition that a record has the department value of sales or R&D and the salary value above 10,000. To display data in groups of columns, we need to get the rows numbers for members in each group and then perform a full join by matching row numbers. Below is the SQL query:
SELECT
T1.NAME SALESNAME, T1.SALARY, T2.NAME RDNAME, T2.SALARY
FROM (
SELECT
NAME,SALARY, ROW_NUMBER()OVER (ORDER BY SALARY DESC) NO
FROM EMPLOYEE
WHERE DEPT='Sales' AND SALARY >10000
) T1
FULL JOIN (
SELECT
NAME,SALARY, ROW_NUMBER()OVER (ORDER BY SALARY DESC) NO
FROM EMPLOYEE
WHERE DEPT='R&D' AND SALARY >10000
) T2
ON T1.NO=T2.NO
SPL solution:
As it handles dynamic transpositions, SPL handles this type of transposition by creating the target structure first and then populating data to it. Below is the SPL script:
A | |
---|---|
1 | =T("Employee.csv").select(SALARY >10000).sort@z(SALARY) |
2 | =A1.select(DEPT:"Sales") |
3 | =A1.select(DEPT:"R&D") |
4 | =create('SALESNAME',SALARY,'RDNAME', SALARY) |
5 | =A4.paste(A2.(NAME),A2.(SALARY),A3.(NAME),A3.(SALARY)) |
A1: Import Employee data table, select records where salaries are above 10,000, and sort them by salary in descending order.
A2: Get records of sales department.
A3: Get records of R&D department.
A4: Create an empty table of the target data structure.
A5: Use A.paste() function to paste result values to corresponding columns.
We can see that SQL’s static transposition methods PIVOT and UNPIVOT have limited applications and are supported only by certain database products. The SQL query will often be too complicated when it tries to handle certain complex static transposition scenarios. Moreover, SQL lacks a standard method of dealing with them.
SPL provides a flexible and adaptable method for handling various complicated transposition tasks. More importantly, the language has a clear and stable logic. It will first create the target data structure and then populated the calculated result to the table.
In the next essay in the transposition series, we will introduce how SQL and SPL handle dynamic transpositions.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code