-
Notifications
You must be signed in to change notification settings - Fork 339
Foreign key pre join(JOIN Simplification and Acceleration Series 6)
Let’s move on to look at how to increase JOIN query performance by making use of certain features of JOINs. As this involves lots of details, we’ll just list some easy-to-understand cases. You can find the complete illustrations in Performance Optimization e-book and related articles in https://c.scudata.com.
Let’s begin from the in-memory foreign-key-based join operations: Here are two tables:
customer table
id
name
city
…
orders table
seq
date
custkey
amount
…
In which the cid field in the orders table is the foreign key pointing to id field, the customer table’s primary key.
To find out the total orders amount in each city (to make the discussion simpler, we won’t specify any condition), we can express the query in SQL as follows:
SELECT customer.city, SUM(orders.amount)
FROM orders
JOIN customer ON orders.cid=customer.id
GROUP BY customer.city
Generally, the database uses hash join algorithm to calculate hash values of the associative keys respectively in the two tables and match the values.
We can write the algorithm using the simplified JOIN syntax (DQL) introduced in previous section:
SELECT cid.city, SUM(amount)
FROM orders
GROUP BY cid.city
This implies that a better optimization plan is possible. Let’s look at how to do it.
Suppose all data can be loaded into the memory, we can optimize the query by switching foreign key values into addresses.
By switching values of the foreign key field, cid, in the fact table orders to addresses of corresponding records in the dimension table customer – that is, converting the cid values to records of the customer table – we can directly reference a field of the customer table for calculation.
As it’s inconvenient to describe the detailed computing process in SQL, we demonstrate it in SPL by using a file as the data source:
A | |
---|---|
1 | =file("customer.btx").import@b() |
2 | >A1.keys@i(id) |
3 | =file("orders.btx").import@b() |
4 | >A3.switch(cid,A1) |
5 | =A3.groups(cid.city;sum(amount)) |
In the above SPL script, A1 reads data from customer table; A2 sets the primary key for customer table and creates index on the key.
A3 reads orders table; and A4 replaces values of A3’s foreign key field cid with corresponding records in A1. After the script is executed, orders table’s cid field will be converted into the corresponding record of customer table. A2 creates an index to facilitate the subsequent switch. Since the fact table is generally far larger than the dimension table, the index can be reused.
A5 performs grouping and aggregation. Since each cid field value is now a record, we can directly reference a field using the operator during the traversal of orders table. cid.city can be correctly executed now.
After the switch action defined in A4 is finished, the cid filed of A3’s in-memory fact table already has values that are addresses of corresponding records in A1’s dimension table. The process is called foreign key address-alization. Now we can directly retrieve a field of the dimension table without looking it up in A1 by matching the foreign key value. It takes just some time of constant magnitude to get a field from the dimension table by avoiding hash value calculation and matching.
Usually, we still use the hash method to create the primary key index in A2 by calculating the key’s hash values, and to calculate cid’s hash values to match A2’s hash index table at A4’s switch action. In terms of computational amount, the foreign key address-alization strategy and the conventional hash partitioning method are almost neck and neck in handling one single JOIN.
If data can be wholly loaded into the memory, the advantage of foreign key address-alization strategy is that the addresses can be reused. This means that we just need to do the calculation and comparison of hash values once (A1-A4). Later the results are ready to be used by a join operation on the same two associated fields, significantly boosting performance.
The strategy can be devised because the foreign key has unique corresponding records in the dimension table – that is, one foreign key value matches only one record in the dimension table, so we can convert each cid into its corresponding record in A1. With SQL JOIN definition, we cannot assume that each foreign key points to a unique record in the dimension table and thus the strategy is unimplementable. Moreover, SQL does not have record address data type and each join needs hash value calculation and matching.
If there are multiple foreign keys in a fact table that refer to multiple dimension tables, the conventional hash join strategy can only handle one JOIN at a time. For a number of JOINs, each has to perform all the operations and needs to store its intermediate result for use in the next JOIN. The computing process is intricate, and data will be traversed many times. The foreign key address-alization strategy in that case, however, just traverses the fact table once without generating intermediate results, creating a clear computing process.
There is another point. Memory is parallelism-friendly, but the hash join algorithm isn’t easily paralleled. Even if we can segment tables to calculate hash values in parallel, the strength of the parallel processing will be much offset by the resource conflict that occurs when records with same hash values in two corresponding segments are being located and gathered for comparison. Under the foreign-key-based join model, two tables being joined are not equal by defining one as the fact table and the other as the dimension table, and we just need to segment the fact table to perform parallel processing.
By reforming it with the foreign key address-alization concept, the hash join algorithm’s ability of handling multi-foreign-key joins at one time and performing parallel processing can be improved in some degree. Some database products can do this kind of engineering optimization, but usually the optimization is valid only when two tables are involved. When more than two tables are JOINed and when multiple JOIN types are involved, it isn’t easy for the database to distinguish which table should be identified as the fact table to be traversed and processed in parallel and which tables should be used as the dimension tables on which the hash indexes should be built, making the optimization effect unguaranteed. So sharp decline in performance often occurs when the number of tables being JOINed increases (often when 4 or 5 tables are being JOINed but no significant increase in the total data amount). By introducing the foreign key concept to the JOIN model to specifically handle multi-foreign-key joins, the fact table and dimension tables can always be clearly identified. The increase of the number of tables to be JOINed will only lead to a linear decline in performance.
At present in-memory database technologies are hot. But our analysis above shows that it’s not easy for the SQL-based in-memory databases to handle JOIN operations fast.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code