-
Notifications
You must be signed in to change notification settings - Fork 340
The difficulty of SQL stems from relational algebra
In the field of structured data computing, SQL is still the most widely used working language, not only adopted by all relational databases, but also targeted by many new big data platforms.
For a certain computing technology, people usually care about two efficiencies. One is the descriptive efficiency of operations, and the other is the execution efficiency of operations. This is easy to understand. If the descriptive efficiency is too low, it means that the development cost is too high and it is difficult to write programs for calculation; If the execution efficiency is low, it takes a long time to obtain results, which greatly reduces its practical value. Simply put, it means writing simple and running fast.
However, we have previously analyzed some computational scenarios for structured data, and SQL actually does not perform well in both simple writing and fast running. When the situation is slightly more complex, it is difficult to handle, often resulting in thousands of lines of nested N layers of code and tens of gigabytes of computation requiring several hours to run. Two typical examples are calculating the number of days a stock has been continuously rising and the TopN for the entire set and within groups of a large set. The details will not be repeated here, and those interested can refer to previous posts.
How many days does a stock continuously rise for the longest time? Originally a simple task, due to the lack of ordered computing ability, it needs to be written in a multi-layered nested form, which is difficult to write and understand:
select max(ContinuousDays) from (
select count(*) ContinuousDays from (
select sum(UpDownTag) over (order by TradeDate) NoRisingDays from (
select TradeDate,case when Price>lag(price) over ( order by TradeDate) then 0 else 1 end UpDownTag from Stock ))
group by NoRisingDays )
To get the top 10 out of 100 million pieces of data, and to get the top 10 of each group after grouping, there is a significant difference in writing, and they all have ORDER BY in the statement, which means that high complexity large sorting has to be performed, and can only rely on the database optimizer to find low complexity algorithms.
SELECT TOP 10 * FROM Orders ORDER BY Amount DESC
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn FROM Orders )
WHERE rn<=10
We have also analyzed the reasons. SQL lacks discreteness, resulting in incomplete set orientation, difficult ordered calculations, and difficulty in describing high-performance algorithms....
However, there is a deeper reason behind this, as the fundamental difficulty of SQL actually stems from its theoretical foundation, namely relational algebra.
To explain this view, we need to analyze what using programs to implement calculations is actually doing.
Essentially, the process of writing a program is the process of translating problem-solving ideas to a precise formal language that can be executed by a computer. For example, just like elementary school students solving practical problems, after analyzing the problem and coming up with a solution, they also need to list four arithmetic expressions. The same goes for calculating with a program. Not only do we need to come up with a solution to the problem, but we also need to translate the solution to actions that the computer can understand and execute in order to implement the calculation.
For the formal language used to describe computational methods, its core lies in the algebraic system used. We cannot strictly define the concept of algebraic system here, we will only explain it in layman's terms. To solve a certain computational problem, people define some data types and a set of operational rules for these data types, ensuring the closeness and consistency of these operations, which can be called an algebraic system. For example, the familiar rational numbers and their four arithmetic operations are an algebraic system used to solve the numerical calculation needs in daily life. Closeness refers to the requirement that the calculation result must still be a defined data object, such as the result of the four arithmetic operations on rational numbers being still a rational number. And consistency refers to the fact that these operations cannot produce contradictory results. For example, we need to agree that they cannot be divided by 0, otherwise dividing a certain number by 0 into any number will lead to logical contradictions.
If the design of this algebraic system is not carefully considered, and the provided data types and operations are inconvenient, it will make it very difficult to describe the algorithm. At this point, a strange phenomenon occurs: the difficulty of translating the solution to the code far exceeds that of solving the problem itself.
For example, we have learned to use Arabic numerals for daily calculations since childhood, and implementing addition, subtraction, multiplication, and division is very convenient. Everyone naturally believes that numerical operations should be like this. In fact, not necessarily! I think most people know about something called Roman numerals. I'm not sure if the Roman numeral system also has familiar addition, subtraction, multiplication, and division operations (which cannot be easily implemented like Arabic numerals, and the definition of operations may be different). I've also been confused about how ancient Romans went shopping?
Then, we know that whether a program can be written simply is actually an algebraic problem behind programming languages.
And as we have mentioned before, running fast is essentially the same thing as writing simple, that is, it can make high-performance algorithms easy to write. In this way, running fast is still an algebraic problem.
Let's make another analogy:
Most students who have attended elementary school in China know the story of Gauss calculating 1+2+3+...+100. Ordinary people just add 100 times step by step. The Gauss child is very smart and has found that 1+100=101, 2+99=101, …, 50+51=101, and the result is 50 multiplied by 101, he quickly finished calculating and went home for lunch.
After hearing this story, we all feel that Gauss is very clever and can come up with such a clever method, which is simple and fast. This is not wrong, but it is easy to overlook one point: in the era of Gauss, multiplication already existed in the human arithmetic system (also an algebra)! As mentioned earlier, when we learn the four arithmetic operations from a young age, we may take multiplication for granted, but in fact, multiplication was invented later than addition. Although multiplication is defined by addition, its characteristics can be utilized to calculate using a 9x9 table, which greatly improves efficiency. If there were no multiplication in Gauss's era, even with clever Gauss, it is impossible to quickly solve this problem.
The mathematical foundation of SQL is relational algebra, which is an algebraic system used to perform batch structured data calculations. This is also why databases that use SQL are also called relational databases.
Relational algebra has been invented for fifty years, and the application requirements and hardware environment fifty years ago are vastly different from today. Due to the large number of existing users and the lack of mature new technologies, SQL based on relational algebra remains the most important database development language today. Although there have been some improvements in the past few decades, the foundation has not changed. Faced with contemporary complex requirements and hardware environments, relational databases are not as adept.
Relational algebra is too simple and lacks sufficient data types and operations. Therefore, when using SQL to describe the solution of a problem, it is necessary to find convoluted ways to implement. For example, the problem of stock price increases, because relational algebra applies the theory of unordered sets in mathematics and does not create the concept of order for SQL, turns a simple problem into a difficult problem, even if it takes a detour, it is difficult to write. This leads to the phenomenon that the difficulty of solving translation problems is greater than that of solving the problem itself, as mentioned earlier. The top 10 problem is also the same case, the aggregation operation designed for relational algebra does not include TOPN and it does not have a set data type. Therefore, this operation cannot be designed as an aggregation operation, and can only be described as a large sort.
Relational algebra is like an arithmetic system that only involves addition but has not yet invented multiplication, and many things are inevitable if not done well.
When the calculation is simple or the performance requirements are not high, using SQL is still relatively convenient, after all, there are many masters and related software is also very rich. But the data requirements in modern applications are becoming increasingly complex, and the amount of data is also increasing. Continuing to use SQL will seriously affect work efficiency. Moreover, unfortunately, this problem is theoretical and no matter how optimization in engineering is done, it is of no use and can only be improved to a limited extent, not eradicated. However, the vast majority of database developers would not have thought of this layer, or rather, to cater to the compatibility of existing users, they did not intend to think of this layer. So, the mainstream database industry has been circling in this circle.
Then what should we do? How to make calculations writing simpler and running faster?
Invent new algebra! Algebra with ‘multiplication’. This is the difference of esProc SPL. We gave the algebraic foundation of SPL a mathematical name: discrete datasets. SPL is the formal language of this algebra. The advantages of SPL have been discussed multiple times in the previous articles. With the support of new algebra, we are truly able to write simple yet run fast.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code