Skip to content

Comparison of SQL and SPL:Join Operations (Ⅰ)

esProcSPL edited this page Apr 29, 2024 · 1 revision

The join operation is used to combine records of two or more tables. This essay explains the handling of join operations using SQL and SPL. By exploring solutions and basic principles of the two commonly used programming languages and providing sample programs written in them, we try to find the faster and more efficient way for you. Looking ${article} for details.


The independent relationship between data of tables is called table association. There are four types of table relationships – one-to-one, many-to-one, one-to-many and many-to-many. Two or more tables can be joined through the associative relationship for performing an associative query on them.

Ⅰ. One-to-one relationship

The one-to-one relationship exists between two tables where a record in a table corresponds to only one record in another table; and vice versa, and usually, that use the same primary key fields.

Suppose there are a student table and student_contacts table. Both use student ID as the primary key. The former stores student names, genders, birthdays, departments, and classes. The latter records contacts of students and their addresses. Each student record corresponds to a piece of contact information, and each contact corresponds to one student.

【Example 1】Based on STUDENT table and STUDENT_CONTACTS table, get names of students and their contacts and contacts’ addresses for students who have the contact information. Below is part of the source data:

STUDENT:

ID NAME DEPARTMENTID CLASSID GENDER BIRTHDAY
1 Rebecca 1 1 F 2010/09/08
2 Ashley 1 1 F 2010/10/09
3 Rachel 1 1 F 2011/04/29
4 Emily 1 1 F 2010/11/24
5 Ashley 1 1 F 2011/03/03

STUDENT_CONTACTS:

ID CONTACTS ADDRESS
1 Mrs. Moore 124 Guangming North Road
2 Mrs. Wilson 116 Baishi Road
3 Mr. Johnson No.8, Mingcheng Road, Haidian District
4 Mr. Smith 12 Fuxing Road
5 Mr. Smith 462 Shijingshan Road

SQL solution:

There could be some students who do not have contacts information. To find names and contact information of target students, we use the inner join to select records from STUDENT table where IDs exist in both tables. An inner join is often called a join simply. It will delete all rows that cannot match the other table from the result table. SQL INNER JOIN inner joins the first and the second table and ON statement defines the joining condition. Below is SQL statements:

SELECT 
    S.NAME,C.CONTACTS,C.ADDRESS 
FROM 
    STUDENT S 
INNER JOIN 
STUDENT_CONTACTS C
ON S.ID=C.ID

SPL solution:

SPL calls two or more tables having one-to-one relationship between them the homo-dimension tables. One is the homo-dimension table of the other, and vice versa. SPL join() function is used to perform a join operation, an inner join operation, by default.

A
1 =T("Student.txt")
2 =T("StudentContacts.txt")
3 =join(A1:S,ID;A2:C,ID)
4 =A3.new(S.NAME,C.CONTACTS,C.ADDRESS)

A1: Import Student table from the source file.

A2: Import StudentContacts table from the source file.

A3: Inner joins the two tables through their ID field.

A4: Create a new table sequence made up of fields storing student names, their contacts and addresses.

SPL supports retrieving a data table from the database. Suppose the data comes from database "db"’s "STUDENT" table, A1 in the above SPL script can be rewritten as follows:

A
1 =connect("db").query("select * from STUDENT")

【Example 2】Based on EMPLOYEE table and MANAGER table, get salaries, including the allowance, of all employees, including managers. Below is part of the source data:

EMPLOYEE:

ID NAME BIRTHDAY STATE DEPT SALARY
1 Rebecca 1974/11/20 California R&D 7000
2 Ashley 1980/07/19 New York Finance 11000
3 Rachel 1970/12/17 New Mexico Sales 9000
4 Emily 1985/03/07 Texas HR 7000
5 Ashley 1975/05/13 Texas R&D 16000

MANAGER:

ID ALLOWANCE
18 7000
2 11000
4 7000
6 10000
7 9000

SQL solution:

Managers are employees, too, but the MANAGER table stores allowance information. To query salaries of all employees, the left join is needed to get records of all employees, including managers. The left join is also called left outer join. It joins two tables based on the left table and lists all records in the left table and records in the right table that match the left table according to the specified condition in the result table. SQL LEFT JOIN is used to left join tables. Below is SQL statements:

SELECT 
    E.ID,E.NAME,E.SALARY+NVL(M.ALLOWANCE,0) INCOME
FROM EMPLOYEE E
LEFT JOIN
MANAGER M 
ON E.ID=M.ID

SPL solution:

SPL join() function is used to perform the join operation. @1 option enables a left join.

A
1 =T("Employee.csv")
2 =T("Manager.txt")
3 =join@1(A1:E, ID; A2:M, ID)
4 =A3.new(E.ID, E.NAME, E.SALARY+M.ALLOWANCE:INCOME)

A1: Import Employee table from the source file.

A2: Import Manager table from the source file.

A3: Left join the two tables through ID field based on the first table, the employee table.

A4: Create a new table sequence made up of fields storing employee IDs, names, and salaries.

The one-to-one relationship is the simplest among all relationships. Two tables are joined directly by matching their primary keys. Both SQL and SPL can handle this type of associative relationship effectively.

Ⅱ. One-to-many relationship

The one-to-many relationship exists between two tables where a record of one table corresponds to any one or more records in the other table. The table at the “one” end is called the primary table and the one at the “many” end is called the subtable (or subordinate table). Suppose there are orders table and order detail table, each order has the only ID but each order ID may correspond to multiple order detail records. We call the orders table the primary table and the order detail table the subtable.

【Example 3】Based on ORDERS table and ORDER_DETAIL table, calculate the total amount in each order. Below is part of the source data:

ORDERS:

ID CUSTOMERID EMPLOYEEID ORDER_DATE ARRIVAL_DATE
10248 VINET 5 2012/07/04 2012/08/01
10249 TOMSP 6 2012/07/05 2012/08/16
10250 HANAR 4 2012/07/08 2012/08/05
10251 VICTE 3 2012/07/08 2012/08/05
10252 SUPRD 4 2012/07/09 2012/08/06

ORDER_DETAIL:

ID ORDER_NUMBER PRODUCTID PRICE COUNT DISCOUNT
10814 1 48 102.0 8 0.15
10814 2 48 102.0 8 0.15
10814 3 48 306.0 24 0.15
10814 4 48 102.0 8 0.15
10814 5 48 204.0 16 0.15

SQL solution:

SQL uses JOIN statement to handle the one-to-many relationship. Below is SQL statements:

SELECT 
    ID, SUM(PRICE*COUNT) AMOUNT
FROM (
    SELECT 
        Orders.ID, Detail.PRICE, Detail.COUNT 
    FROM ORDERS Orders
    INNER JOIN 
    ORDER_DETAIL Detail
    ON Orders.ID=Detail.ID
)
GROUP BY ID
ORDER BY ID

SPL solution:

SPL uses join() function to join the primary table and the subtable through the subtable’s one key field and the primary table’s key when there is only one subtable.

A
1 =T("Orders.txt")
2 =T("OrderDetail.txt")
3 =join(A1:Orders,ID; A2:Detail,ID)
4 =A3.groups(Orders.ID; sum(Detail.PRICE*Detail.COUNT):AMOUNT)

A1: Import Orders table.

A2: Import OrderDetail table.

A3: Join the two tables through their ID fields.

A4: Group A3’s joining result table and calculate the total amount for each order.

【Example 4】Suppose the ORDERS table has another subtable for recording the payment information. We want to find the orders for which not all payment has been received, that is, those where the accumulative payment is less than the total order amount. Below is part of the source data:

ORDERS:

ID CUSTOMERID EMPLOYEEID ORDER_DATE ARRIVAL_DATE
10248 VINET 5 2012/07/04 2012/08/01
10249 TOMSP 6 2012/07/05 2012/08/16
10250 HANAR 4 2012/07/08 2012/08/05
10251 VICTE 3 2012/07/08 2012/08/05
10252 SUPRD 4 2012/07/09 2012/08/06

ORDER_DETAIL:

ID ORDER_NUMBER PRODUCTID PRICE COUNT DISCOUNT
10814 1 48 102.0 8 0.15
10814 2 48 102.0 8 0.15
10814 3 48 306.0 24 0.15
10814 4 48 102.0 8 0.15
10814 5 48 204.0 16 0.15

ORDER_PAYMENT:

ID PAY_DATE AMOUNT CHANNEL INSTALMENTS
10814 2014/01/05 816.0 3 0
10848 2014/01/23 800.25 2 1
10848 2014/01/23 800.25 0 0
10848 2014/01/23 800.25 3 1
10966 2014/03/20 572.0 2 1

SQL solution:

Just JOINing the three tables is not the right way of doing this because there could be many-to-many relationship between ORDER_DETAIL table and ORDER_PAYMENT table. When the subtables are grouped by order ID, the field will become their unique, actual primary keys. Then the three tables can be joined through the ID field.

SELECT 
    Orders.ID,Detail.AMOUNT,Payment.PAY_AMOUNT 
FROM ORDERS Orders
INNER JOIN
(
    SELECT ID, SUM(PRICE*COUNT) AMOUNT 
    FROM ORDER_DETAIL
    GROUP BY ID
) Detail
ON Orders.ID=Detail.ID
INNER JOIN
(
    SELECT ID, SUM(AMOUNT) PAY_AMOUNT 
    FROM ORDER_PAYMENT
    GROUP BY ID
) Payment
ON Orders.ID=Payment.ID
WHERE PAY_AMOUNT<Detail.AMOUNT
ORDER BY ID

SPL solution:

When the subtables are grouped by ID, the field will become the real primary key. Now we can treat them as tables with the one-to-one relationship in between (or the homo-dimension tables).

A
1 =T("Orders.txt")
2 =T("OrderDetail.txt")
3 =T("OrderPayment.txt")
4 =A2.groups(ID; sum(PRICE*COUNT):AMOUNT)
5 =A3.groups(ID; sum(AMOUNT):PAY_AMOUNT)
6 =join(A1:Orders,ID; A4:Detail,ID; A5:Payment,ID)
7 =A6.new(Orders.ID, Detail.AMOUNT, Payment.PAY_AMOUNT)
8 =A7.select(PAY_AMOUNT<AMOUNT)

A1: Import Orders table.

A2: Import OrderDetail table.

A3: Import OrderPayment table.

A4: Group OrderDetail table and calculate the total amount in each order.

A5: Group OrderPayment table and calculate the total payment amount in each order.

A6: The join() function joins Orders table and the grouped & summarized OrderDetail table and OrderPayment table through their ID fields.

A7: Create a new table sequence consisting of fields of order ID, order amount and order payment amount.

A8: Select records where the payment amount is less than the order amount, that is, those that have not received all payment.

Without the support of stepwise coding, SQL will write the whole procedure in a single statement, which is sure to be complicated. SPL displays clear logic by using the step-by-step coding mode. There are two steps for performing a join operation. The first is to group every subtable by one key field (which is the primary table’s primary key) to make them have the same actual primary key as the primary table. The second is to join the two or more tables through their primary keys (or the actual primary keys). In essence, SQL and SPL implement the join operations in the same way, but SPL’s stepwise design makes easy coding.

Clone this wiki locally