-
Notifications
You must be signed in to change notification settings - Fork 339
Data warehouse without using SQL
Currently the vast majority of data warehouses employ SQL to process data. Following decades of development, SQL has become the standard language in the database world, and amassed a large user population, so it is normal to support SQL for data warehouses. However, in the context of contemporary big data, as the business complexity keeps increasing, the abilities of SQL seem to be increasingly inadequate in the data warehouse scenario where computation is the primary task. A typical manifestation is that some data warehouses begin to integrate non-SQL languages like Python. Not to mention whether the two languages with very different styles can be well integrated and complementary, just judging from this trend, it is clear that the industry is doubtful of the abilities of SQL.
In this article, we will present a non-SQL-based data warehouse “esProc”. Since esProc does not use SQL as query language (SPL instead), we can regard it as a new type of data warehouse temporarily.
Why doesn't esProc use SQL?
To answer this question, we need to ascertain the reason why the data warehouses still introduce Python even when SQL is already available, and what problems does it want to solve?
We know that SQL does not provide good support for procedural computation. Even with CTE syntax, it is still very complicated to describe complex calculations in SQL, and often requires nesting multiple layers of code and associating repeatedly. In addition, the dataset of SQL is unordered, and SQL is very bad at ordered calculation. When handling order-related operations, coding in SQL is often cumbersome or even impossible to implement. The language characteristic of SQL itself determines that it is not good at implementing certain complex calculations., yet such calculations are common in the data analysis scenarios of data warehouse. For example, it is difficult to code in SQL when performing a funnel analysis for an e-commerce company (calculating the user churn rate of every step such as page browsing, adding to cart, placing order, and paying). In contrast, for the scenarios involving multiple sequential steps and repeated use of result, it is much easier to implement in a language (like Python) that supports stepwise and ordered calculations.
In other words, SQL lacks adequate abilities.
However, introducing third-party languages like Python to make up for the lack of abilities will make the technology stack complex. Even if we don't consider whether Python can make up for the lack, and nor do we consider whether the two languages can be well incorporated, the rise in system complexity brought about by multiple different-style technologies alone will inevitably lead to high development and O&M costs.
In addition to the lack of abilities, the SQL-based relational database has the problem of closedness.
Since the main function of database is transaction processing (TP), it requires many constraints to ensure data consistency, etc. For example, only the data that meet criteria can be loaded into database; only the data inside the database can be processed, which is what we called the closedness. The data warehouse is developed based on database, and inherits the closedness of database.
Closeness is very important for TP business but, it is meaningless and even very disadvantageous for AP business that mainly focuses on analysis and calculation. The closedness requires that the data can only be used after being loaded into the database, this will result in the inability to combine and calculate the data from multiple databases at will, thus greatly limiting the application scenarios of data warehouse.
In addition, the data sources of modern data application are diverse. In addition to different databases, we often face a variety of data sources and data types. Since the closed SQL-based database cannot compute the data outside the database, it has to import the data before computing, resulting in the addition of an ETL action. This action not only increases the workload of programmers and the burden of database, but losses the real-timeness of data. Usually, the data outside the database have irregular formats, and it is not easy to load them into databases having strong constraints and, even ETL action is performed, it first needs to load the raw data into database in order to utilize database’s computing ability. As a result, ETL is changed to ELT, which increases the burden of database.
Moreover, the closedness makes it inconvenient for users to freely utilize the method of trading space for time. We know that the storage resource is much cheaper than the computing resource. If we redundantly store data in multiple ways for different computing objectives, we may obtain better query experience. However, SQL needs to store the data to tables. Creating too many tables will make the metadata bigger, resulting in a significant increase in O&M costs and, too many tables will also bring capacity and performance issues to the data warehouse, facing the scaling pressure. Many large organizations store thousands of intermediate tables in their central data warehouses. These tables have been accumulated for years, but have to be retained out of an abundance of caution, thus causing immense burden on the capacity, performance, operation and maintenance of the database.
SQL doesn’t do a good job either in performance.
As we know, the execution efficiency of SQL depends on the optimization degree of database optimization engine, and a good database will choose more efficient execution path according to the computing objective of SQL (rather than its literally expressed logic). However, such auto-optimization mechanism works only for simple calculations. Once the calculation becomes slightly more complex, the engine will not work, and the code has to be executed according its literally expressed logic, resulting in a sharp decline in performance. For example, for the funnel analysis task mentioned above, someone wrote a three-step funnel calculation code in SQL and executed in the database, yet the speed was too slow to be feasible. We believe that you must have often encountered cases where SQL performs poorly in actual business and, it is common to see that it takes a couple of hours to run SQL codes in many batch job scenarios, all of which are caused by the low performance of SQL.
The lack of abilities, closedness (results in heavy in use) and low performance are the main problems the SQL-based data warehouse is facing.
Introducing Python based on SQL cannot solve problems either. In addition to high use and O&M costs caused by complex technology stack mentioned above, Python also cannot achieve high performance.
Python itself doesn’t provide good support for big data computation, and does not offer corresponding external storage computation types (such as cursor) for the calculation of data exceeding memory capacity, making it exceptionally complicated to process big data. Moreover, Python does not support true multi-thread parallel processing. The parallel processing of Python is fake, which is actually the serial processing for CPU, or even slower than serial processing, making it difficult to leverage the advantages of modern multi-core CPU.
More importantly, Python needs to calculate based on SQL database tables, but these tables (storage) cannot be intervened by the outside world as they are private to the database. Many high-performance algorithms, however, need to organize the data based on computing objective. For example, the efficient ordered merge algorithm can be utilized if the data are sorted by associated fields. Unfortunately, the high-performance algorithms cannot be utilized due to the failure to intervening the storage, and it naturally fails to ensure the performance. In addition, when Python reads the data of database, it will involve IO costs, which will also lead to low computing performance.
It seems that to solve the problems of SQL, we have to abandon SQL.
It is a fact that non-SQL computing technologies have always been present, and the typical representative is Spark. When Spark was born, Scala was used as the programming language and, relying on Spark’s large-scale distributed computing ability, there was a great tendency to replace SQL. Unfortunately, however, as Spark was used more deeply, we realized that Spark does not have the ability to replace SQL (because the implementation process is too cumbersome, and the performance is low). In addition, due to the difficulty to use Scala, the programmers have to resort to SQL again.
Next let's take a closer look at the abilities of esProc to see what the differences are.
The formal language of esProc-based data warehouse is SPL, not the SQL commonly used in the industry. The reason for abandoning SQL is that SQL has many problems, such as the lack of abilities, closedness, and low performance, while SPL can effectively solve these problems. Here below are some advantages of SPL.
First, SPL naturally supports procedural computation.
Procedural computation can effectively reduce the implementation difficulty of complex business. For the same code of 100 lines, the complexities of writing it as 100 statements and one statement are completely different. Although CTE syntax and stored procedure make SQL have the procedural computing ability to a certain extent, it is far from enough. In contrast, SPL naturally supports procedural computation, allowing us to divide complex calculation into multiple steps to reduce the implementation difficulty.
Second, SPL provides richer data types and algorithms.
Compared to SQL that doesn’t offer explicit record data type (SQL will treat a single record as a temporary table with only one record, i.e., a single-member set), SPL provides a specialized structured data object: table sequence and provides rich computing libraries based on the table sequence, thereby making SPL have complete and simple structured data processing ability.
Here below are part of common calculation codes written in SPL:
Orders.sort(Amount) // sort
Orders.select(Amount*Quantity>3000 && like(Client,"*S*")) // filter
Orders.groups(Client; sum(Amount)) // group
Orders.id(Client) // distinct
join(Orders:o,SellerId ; Employees:e,EId) // join
By means of the procedural computation and table sequence, SPL can implement more calculations. For example, SPL supports ordered operation more directly and thoroughly. For the grouping operation, SPL can retain the grouped subset, i.e., the set of sets, which makes it convenient to perform further operation on the grouped result. In contrast, SQL does not provide explicit set data type, and cannot return the data types such as set of sets. Since SQL cannot implement independent grouping, grouping and aggregating have to be bound as a whole.
In addition, SPL has a new understanding on aggregation operation. In addition to common single value like SUM, COUNT, MAX and MIN, the aggregation result can be a set. For example, SPL regards the common TOPN as an aggregation calculation like SUM and COUNT, which can be performed either on a whole set or grouped subsets.
In fact, SPL has many other features, making it more complete than SQL and richer than Python. For example, the discreteness allows the records that make up a data table to exist dissociatively and be computed repeatedly; the universal set supports the set composed of any data, and allows such set to participate in computation; the join operation distinguishes three different types of joins, allowing us to choose an appropriate one according to actual situation...
With complete computing capabilities, not only is it simple to code, but it also eliminates the need to resort to other computing capabilities. Therefore, the technology stack is simple, and all problems can be solved within one system.
Unlike the SQL-based database that requires loading data into database before calculation (closedness), SPL can directly calculate when facing diverse data sources, and hence it has good openness.
SPL does not have the concept of “warehouse” of traditional data warehouses, nor does it have the concept of metadata, let alone constraints. Any accessible data source can be regarded as the data of esProc and can be calculated directly. Importing the data into database is not required before calculation, and exporting the data out of database deliberately is also not required after calculation, as the result can be written to target data source through its interface.
SPL encapsulates access interfaces for common data sources such as various relational databases (JDBC data source), MongoDB, HBase, HDFS, HTTP/Restful, SalesForces and SAP BW. Logically, these data sources have basically the same status, and can be calculated separately or in a mixed way after being accessed, and the only difference is that different data sources have different access interfaces, and different interfaces have different performance.
In terms of data storage, SPL differs greatly from traditional data warehouses.
SPL has no metadata. The data of SPL is directly stored in files, and any type of open-format file can be processed. In order to ensure computing performance, SPL also designs a specialized binary file format.
Currently, SPL provides two file types: bin file and composite table. The bin file adopts the compression technology (faster reading due to less space occupation), and stores the data types (faster reading as a result of avoiding parsing data type). Since the bin file supports the double increment segmentation mechanism that can append data, it is easy to implement parallel computing by utilizing the segmentation strategy, and the computing performance is ensured. The composite table supports the columnar storage, which has great advantage when the number of columns (fields) involved in calculation is small. In addition, the composite table not only supports the double increment segmentation mechanism, but adopts the index technology, allowing us to utilize the advantage of columnar storage, and improve the performance more easily through parallel computing.
Since the binding of storage and computation is eliminated, it is easy to implement the separation of storage and computation and then implement elastic computing, and it also makes the cloud computing easier.
In addition, the cost of file storage is lower. In AP computing scenario, users can flexibly design a space-time trade-off scheme, which is nothing but storing a few more files. Even if the number of redundant files reaches up to ten thousand (it is easy for contemporary file systems to handle the data of such a scale), there isn’t any burden. Also, it is simple to manage data files by category under the file system’s tree structure. And the O&M costs are lower.
For more information, visit:Data warehouse running on file system
Based on flexible file storage, we can design the data organization form (storage) flexibly according to computing objective to achieve high performance. In addition to high-performance storage, SPL provides many high-performance computing mechanisms and algorithms for big data.
In order to cope with the big data computing scenario where the amount of data exceeds memory capacity, SPL offers cursor computing method.
=file("orders.txt").cursor@t(area,amount).groups(area;sum(amount):amount)
Moreover, SPL provides parallel computing support for both in-memory and external storage calculations. By adding just one @m option, parallel computing can be implemented and the advantages of multi-core CPU can be fully utilized, which is very convenient.
=file("orders.txt").cursor@tm(area,amount;4).groups(area;sum(amount):amount)
In addition to cursor computing and the parallel computing, SPL offers many built-in high-performance algorithms. For example, after SPL treats the TOPN as an ordinary aggregation operation, sorting action is avoided in the corresponding statement, so the execution is more efficient.
Similarly, SPL provides many such high-performance algorithms, including:
- In-memory computing: binary search, sequence number positioning, position index, hash index, multi-layer sequence number positioning...
- External storage search: binary search, hash index, sorting index, index-with-values, full-text retrieval...
- Traversal computing: delayed cursor, multipurpose traversal, parallel multi-cursor, ordered grouping and aggregating, sequence number grouping...
- Foreign key association: foreign key addressization, foreign key sequence-numberization, index reuse, aligned sequence, one-side partitioning...
- Merge and join: ordered merging, merge by segment, association positioning, attached table...
- Multidimensional analysis: partial pre-aggregation, time period pre-aggregation, redundant sorting, boolean dimension sequence, tag bit dimension...
- Cluster computing: cluster multi-zone composite table, duplicate dimension table, segmented dimension table, redundancy-pattern fault tolerance and spare-wheel-pattern fault tolerance, load balancing...
With the support of high-performance file storage and algorithms, esProc often achieves a performance improvement of several times to dozens of times, and even thousands of times in some cases, compared to traditional SQL-based data warehouses in practice.
Can we draw a conclusion that SPL have no disadvantages compared to SQL?
Of course not, because nothing is perfect in this world.
After decades of development, many SQL-based databases have owned powerful optimization engine. For simple operations that are suited to be implemented in SQL, the optimization engine can optimize the slow statements written by ordinary programmers and achieve better performance. In this sense, the requirements for programmers are relatively low. Certain scenarios, such as multidimensional analysis, have been optimized for years, some SQL engines can also handle them very well and obtain extreme performance.
In contrast, SPL did little in automatic optimization., and depends almost entirely on programmers to write low-complexity code to achieve high performance. In this case, programmers need receive some training to familiarize themselves with SPL’s philosophy and library functions before getting started with SPL. Besides, SPL is written in Java, which brings some benefits such as good compatibility, strong migration, and easy to adapt to the cloud environment. However, the CPU and memory resources cannot be fully utilized due to JVM limitations. For some simple scenarios, SPL is still not as good as a fully optimized SQL engine in terms of performance.
In conclusion, SQL is not the only option for data warehouse, we have a better alternative - SPL.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code