-
Notifications
You must be signed in to change notification settings - Fork 339
Comparison of SQL and SPL:Join Operations (Ⅰ)
The join operation is used to combine records of two or more tables. This essay explains the handling of join operations using SQL and SPL. By exploring solutions and basic principles of the two commonly used programming languages and providing sample programs written in them, we try to find the faster and more efficient way for you. Looking ${article} for details.
The independent relationship between data of tables is called table association. There are four types of table relationships – one-to-one, many-to-one, one-to-many and many-to-many. Two or more tables can be joined through the associative relationship for performing an associative query on them.
The one-to-one relationship exists between two tables where a record in a table corresponds to only one record in another table; and vice versa, and usually, that use the same primary key fields.
Suppose there are a student table and student_contacts table. Both use student ID as the primary key. The former stores student names, genders, birthdays, departments, and classes. The latter records contacts of students and their addresses. Each student record corresponds to a piece of contact information, and each contact corresponds to one student.
【Example 1】Based on STUDENT table and STUDENT_CONTACTS table, get names of students and their contacts and contacts’ addresses for students who have the contact information. Below is part of the source data:
STUDENT:
ID | NAME | DEPARTMENTID | CLASSID | GENDER | BIRTHDAY |
---|---|---|---|---|---|
1 | Rebecca | 1 | 1 | F | 2010/09/08 |
2 | Ashley | 1 | 1 | F | 2010/10/09 |
3 | Rachel | 1 | 1 | F | 2011/04/29 |
4 | Emily | 1 | 1 | F | 2010/11/24 |
5 | Ashley | 1 | 1 | F | 2011/03/03 |
… | … | … | … | … | … |
STUDENT_CONTACTS:
ID | CONTACTS | ADDRESS |
---|---|---|
1 | Mrs. Moore | 124 Guangming North Road |
2 | Mrs. Wilson | 116 Baishi Road |
3 | Mr. Johnson | No.8, Mingcheng Road, Haidian District |
4 | Mr. Smith | 12 Fuxing Road |
5 | Mr. Smith | 462 Shijingshan Road |
… | … | … |
SQL solution:
There could be some students who do not have contacts information. To find names and contact information of target students, we use the inner join to select records from STUDENT table where IDs exist in both tables. An inner join is often called a join simply. It will delete all rows that cannot match the other table from the result table. SQL INNER JOIN inner joins the first and the second table and ON statement defines the joining condition. Below is SQL statements:
SELECT
S.NAME,C.CONTACTS,C.ADDRESS
FROM
STUDENT S
INNER JOIN
STUDENT_CONTACTS C
ON S.ID=C.ID
SPL solution:
SPL calls two or more tables having one-to-one relationship between them the homo-dimension tables. One is the homo-dimension table of the other, and vice versa. SPL join() function is used to perform a join operation, an inner join operation, by default.
A | |
---|---|
1 | =T("Student.txt") |
2 | =T("StudentContacts.txt") |
3 | =join(A1:S,ID;A2:C,ID) |
4 | =A3.new(S.NAME,C.CONTACTS,C.ADDRESS) |
A1: Import Student table from the source file.
A2: Import StudentContacts table from the source file.
A3: Inner joins the two tables through their ID field.
A4: Create a new table sequence made up of fields storing student names, their contacts and addresses.
SPL supports retrieving a data table from the database. Suppose the data comes from database "db"’s "STUDENT" table, A1 in the above SPL script can be rewritten as follows:
A | |
---|---|
1 | =connect("db").query("select * from STUDENT") |
【Example 2】Based on EMPLOYEE table and MANAGER table, get salaries, including the allowance, of all employees, including managers. Below is part of the source data:
EMPLOYEE:
ID | NAME | BIRTHDAY | STATE | DEPT | SALARY |
---|---|---|---|---|---|
1 | Rebecca | 1974/11/20 | California | R&D | 7000 |
2 | Ashley | 1980/07/19 | New York | Finance | 11000 |
3 | Rachel | 1970/12/17 | New Mexico | Sales | 9000 |
4 | Emily | 1985/03/07 | Texas | HR | 7000 |
5 | Ashley | 1975/05/13 | Texas | R&D | 16000 |
… | … | … | … | … | … |
MANAGER:
ID | ALLOWANCE |
---|---|
18 | 7000 |
2 | 11000 |
4 | 7000 |
6 | 10000 |
7 | 9000 |
… | … |
SQL solution:
Managers are employees, too, but the MANAGER table stores allowance information. To query salaries of all employees, the left join is needed to get records of all employees, including managers. The left join is also called left outer join. It joins two tables based on the left table and lists all records in the left table and records in the right table that match the left table according to the specified condition in the result table. SQL LEFT JOIN is used to left join tables. Below is SQL statements:
SELECT
E.ID,E.NAME,E.SALARY+NVL(M.ALLOWANCE,0) INCOME
FROM EMPLOYEE E
LEFT JOIN
MANAGER M
ON E.ID=M.ID
SPL solution:
SPL join() function is used to perform the join operation. @1 option enables a left join.
A | |
---|---|
1 | =T("Employee.csv") |
2 | =T("Manager.txt") |
3 | =join@1(A1:E, ID; A2:M, ID) |
4 | =A3.new(E.ID, E.NAME, E.SALARY+M.ALLOWANCE:INCOME) |
A1: Import Employee table from the source file.
A2: Import Manager table from the source file.
A3: Left join the two tables through ID field based on the first table, the employee table.
A4: Create a new table sequence made up of fields storing employee IDs, names, and salaries.
The one-to-one relationship is the simplest among all relationships. Two tables are joined directly by matching their primary keys. Both SQL and SPL can handle this type of associative relationship effectively.
The one-to-many relationship exists between two tables where a record of one table corresponds to any one or more records in the other table. The table at the “one” end is called the primary table and the one at the “many” end is called the subtable (or subordinate table). Suppose there are orders table and order detail table, each order has the only ID but each order ID may correspond to multiple order detail records. We call the orders table the primary table and the order detail table the subtable.
【Example 3】Based on ORDERS table and ORDER_DETAIL table, calculate the total amount in each order. Below is part of the source data:
ORDERS:
ID | CUSTOMERID | EMPLOYEEID | ORDER_DATE | ARRIVAL_DATE |
---|---|---|---|---|
10248 | VINET | 5 | 2012/07/04 | 2012/08/01 |
10249 | TOMSP | 6 | 2012/07/05 | 2012/08/16 |
10250 | HANAR | 4 | 2012/07/08 | 2012/08/05 |
10251 | VICTE | 3 | 2012/07/08 | 2012/08/05 |
10252 | SUPRD | 4 | 2012/07/09 | 2012/08/06 |
… | … | … | … | … |
ORDER_DETAIL:
ID | ORDER_NUMBER | PRODUCTID | PRICE | COUNT | DISCOUNT |
---|---|---|---|---|---|
10814 | 1 | 48 | 102.0 | 8 | 0.15 |
10814 | 2 | 48 | 102.0 | 8 | 0.15 |
10814 | 3 | 48 | 306.0 | 24 | 0.15 |
10814 | 4 | 48 | 102.0 | 8 | 0.15 |
10814 | 5 | 48 | 204.0 | 16 | 0.15 |
… | … | … | … | … | … |
SQL solution:
SQL uses JOIN statement to handle the one-to-many relationship. Below is SQL statements:
SELECT
ID, SUM(PRICE*COUNT) AMOUNT
FROM (
SELECT
Orders.ID, Detail.PRICE, Detail.COUNT
FROM ORDERS Orders
INNER JOIN
ORDER_DETAIL Detail
ON Orders.ID=Detail.ID
)
GROUP BY ID
ORDER BY ID
SPL solution:
SPL uses join() function to join the primary table and the subtable through the subtable’s one key field and the primary table’s key when there is only one subtable.
A | |
---|---|
1 | =T("Orders.txt") |
2 | =T("OrderDetail.txt") |
3 | =join(A1:Orders,ID; A2:Detail,ID) |
4 | =A3.groups(Orders.ID; sum(Detail.PRICE*Detail.COUNT):AMOUNT) |
A1: Import Orders table.
A2: Import OrderDetail table.
A3: Join the two tables through their ID fields.
A4: Group A3’s joining result table and calculate the total amount for each order.
【Example 4】Suppose the ORDERS table has another subtable for recording the payment information. We want to find the orders for which not all payment has been received, that is, those where the accumulative payment is less than the total order amount. Below is part of the source data:
ORDERS:
ID | CUSTOMERID | EMPLOYEEID | ORDER_DATE | ARRIVAL_DATE |
---|---|---|---|---|
10248 | VINET | 5 | 2012/07/04 | 2012/08/01 |
10249 | TOMSP | 6 | 2012/07/05 | 2012/08/16 |
10250 | HANAR | 4 | 2012/07/08 | 2012/08/05 |
10251 | VICTE | 3 | 2012/07/08 | 2012/08/05 |
10252 | SUPRD | 4 | 2012/07/09 | 2012/08/06 |
… | … | … | … | … |
ORDER_DETAIL:
ID | ORDER_NUMBER | PRODUCTID | PRICE | COUNT | DISCOUNT |
---|---|---|---|---|---|
10814 | 1 | 48 | 102.0 | 8 | 0.15 |
10814 | 2 | 48 | 102.0 | 8 | 0.15 |
10814 | 3 | 48 | 306.0 | 24 | 0.15 |
10814 | 4 | 48 | 102.0 | 8 | 0.15 |
10814 | 5 | 48 | 204.0 | 16 | 0.15 |
… | … | … | … | … | … |
ORDER_PAYMENT:
ID | PAY_DATE | AMOUNT | CHANNEL | INSTALMENTS |
---|---|---|---|---|
10814 | 2014/01/05 | 816.0 | 3 | 0 |
10848 | 2014/01/23 | 800.25 | 2 | 1 |
10848 | 2014/01/23 | 800.25 | 0 | 0 |
10848 | 2014/01/23 | 800.25 | 3 | 1 |
10966 | 2014/03/20 | 572.0 | 2 | 1 |
… | … | … | … | … |
SQL solution:
Just JOINing the three tables is not the right way of doing this because there could be many-to-many relationship between ORDER_DETAIL table and ORDER_PAYMENT table. When the subtables are grouped by order ID, the field will become their unique, actual primary keys. Then the three tables can be joined through the ID field.
SELECT
Orders.ID,Detail.AMOUNT,Payment.PAY_AMOUNT
FROM ORDERS Orders
INNER JOIN
(
SELECT ID, SUM(PRICE*COUNT) AMOUNT
FROM ORDER_DETAIL
GROUP BY ID
) Detail
ON Orders.ID=Detail.ID
INNER JOIN
(
SELECT ID, SUM(AMOUNT) PAY_AMOUNT
FROM ORDER_PAYMENT
GROUP BY ID
) Payment
ON Orders.ID=Payment.ID
WHERE PAY_AMOUNT<Detail.AMOUNT
ORDER BY ID
SPL solution:
When the subtables are grouped by ID, the field will become the real primary key. Now we can treat them as tables with the one-to-one relationship in between (or the homo-dimension tables).
A | |
---|---|
1 | =T("Orders.txt") |
2 | =T("OrderDetail.txt") |
3 | =T("OrderPayment.txt") |
4 | =A2.groups(ID; sum(PRICE*COUNT):AMOUNT) |
5 | =A3.groups(ID; sum(AMOUNT):PAY_AMOUNT) |
6 | =join(A1:Orders,ID; A4:Detail,ID; A5:Payment,ID) |
7 | =A6.new(Orders.ID, Detail.AMOUNT, Payment.PAY_AMOUNT) |
8 | =A7.select(PAY_AMOUNT<AMOUNT) |
A1: Import Orders table.
A2: Import OrderDetail table.
A3: Import OrderPayment table.
A4: Group OrderDetail table and calculate the total amount in each order.
A5: Group OrderPayment table and calculate the total payment amount in each order.
A6: The join() function joins Orders table and the grouped & summarized OrderDetail table and OrderPayment table through their ID fields.
A7: Create a new table sequence consisting of fields of order ID, order amount and order payment amount.
A8: Select records where the payment amount is less than the order amount, that is, those that have not received all payment.
Without the support of stepwise coding, SQL will write the whole procedure in a single statement, which is sure to be complicated. SPL displays clear logic by using the step-by-step coding mode. There are two steps for performing a join operation. The first is to group every subtable by one key field (which is the primary table’s primary key) to make them have the same actual primary key as the primary table. The second is to join the two or more tables through their primary keys (or the actual primary keys). In essence, SQL and SPL implement the join operations in the same way, but SPL’s stepwise design makes easy coding.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code