Can SPL and SQL be integrated?

 

SQL and SPL are both general-purpose processing technologies for structured data, and each has its own characteristics. Specifically, SQL is highly popularized and widely used, many users have a natural ability to query data with SQL, and it is easy for them to get started once the data engine supports SQL; it is relatively easy to migrate historical programs. SPL is concise and efficient, providing more agile syntax that can simplify complex calculations, while supporting the procedural computing and naturally supporting step-wise coding; the computing system of SPL is more open, making it possible to perform mixed computing for multiple data sources at the same time, and easily obtain higher computing performance with built-in high-performance storage and high-performance algorithms; it is more flexible to utilize, enabling it to be used independently or integrated into applications.

So, here comes a question: can we integrate SQL and SPL into a data computing system so as to select a right technology based on different scenarios to give play to their respective advantages?

To answer this question, we need to consider from several aspects.

First, see if the SQL-supported database supports SPL.

In principle, all RDBs can act as the data source of SPL, and SPL accesses the data through database-provided drivers. When reading the data, SQL is still used, yet subsequent calculations are to be performed in SPL. Although this is different from calculating in the database directly in SQL, for applications, the only difference is the location where the calculation occurs, and there isn’t significant difference in terms of application request and data use. Therefore, from this point of view, they can be integrated and work well.

However, we also need to answer why we add SPL based on SQL? To be sure, we want to solve some problems.

If it is a functional requirement, it is no problem, because some calculations are difficult or even impossible to code in SQL, while coding in SPL can simplify them. For example, if we want to calculate the maximum consecutive days that a stock keeps rising, coding in SQL is tortuous:

select max (consecutive_day)
from (select count(*) (consecutive_day
      from (select sum(rise_mark) over(order by trade_date) days_no_gain
            from (select trade_date,
                        case when closing_price>lag(closing_price) over(order by trade_date)
                             then 0 else 1 END rise_mark
                  from stock_price ) )
      group by days_no_gain)

As you can see that such statements are difficult to code, understand and debug, and are often mixed in multi-step complex process in practice, making development and maintenance more difficult. Yet, retrieving data first with SQL and then calculating with SPL is much simpler:

db.query("select closing_price from stock_price order by trade_date").group@i(closing_price<closing_price[-1]).max(~.len())

In addition, it is easy to debug, greatly reducing the development and maintenance costs.

With the help of SPL's characteristics like agile syntax and procedural calculation, it can often significantly simplify the implementation difficulty of algorithms. Therefore, SPL can be a useful supplement to SQL. Moreover, SPL provides JDBC driver and can be fully encapsulated, just like using a database.

undefined

However, if you want to solve the performance problem, it is not that simple. For the existing systems that are already functionally stable, most of the problems they are facing may be the performance, and the purpose of adding SPL is to speed up. Although SPL has rich high-performance algorithm libraries, can we make use of them here to improve performance?

Unfortunately, the probability is low! It is difficult to employ SPL on existing databases (SQL system) to improve performance, the reason is the data storage, i.e., the form of data organization.

We know that high-performance algorithms are closely related to data storage, and SPL is not an exception. To give full play to the effectiveness of SPL high-performance algorithms, you need to organize the data in a certain form. For example, SPL’s one-sided partitioning algorithm requires storing the data orderly; the ordered merge algorithm requires the data to be in order by associated key; the foreign key sequence-numberization requires converting the foreign key value to the sequence number of dimension table in advance, etc. To implement high-performance algorithms, storage is the foundation. Yet, the storage of database is transparent to the outside world and cannot be interfered with, let alone adjust it according to computing characteristics, as a result, for the database-based storage, it is usually impossible to employ SPL algorithms to implement high performance.

In general, the integration of SQL and SPL on the existing database system should depend on the actual needs. Although SPL can work based on SQL database, high performance cannot be obtained.

In this case, can we change ourthinking to let SPL support SQL? In other words, does providing two query syntaxes in one system work?

Currently SPL does provide certain SQL support, called simple SQL. In SPL, you can use simple SQL to query many data sources like text, NoSQL, WebService, and it supports filtering, association, sub-query and other operations (equivalent to the SQL92 standard), and it is convenient for those who are familiar with SQL. For example, you can embed the following code in SPL:

select e.Dept,sum(o.Amount) from d:/Orders.csv o
left join d:/Employees.xlsx e on o.SellerId=e.Eid
group by e.Dept

to implement operations on the file.

Unfortunately, this method still cannot guarantee high performance and only supports relatively simple scenarios, for example, it only works for the query of small dataset. There are two reasons. One is that the SPL team is not a professional SQL implementer, so it is not easy to make a very perfect and efficient SQL; the other is that SQL itself has some limitations, resulting in a difficult in making further breakthrough technically, and hence there is not much willingness to continue to improve SQL.

If that’s the case, then does it work to automatically convert SQL to SPL? The application/user is responsible for generating SQL and transferring it to SPL, and SPL is responsible for translating the received SQL into SPL native syntax before execution.

Theoretically, SPL is a superset of SQL, and any calculation implemented in SQL can indeed be done in SPL, and it is possible to interpret or migrate SQL. Although there is a lot of work to do, it is not very difficult. However, this method can only obtain syntax compatibility rather than high performance. SQL itself has limitations in implementing high-performance algorithms, and because SQL lacks some key data types (such as record type) and basic operations (like ordered computing), many high-performance algorithms cannot be described, and as a result, it has to implement such algorithms in a tortuous way, resulting in a very low performance, and a worse performance after translation, even it cannot handle big data at all. To get an efficient code, it needs to write according to the characteristics of operation model, yet SQL statements usually do not contain such information. For example, the following SQL code is to calculate TopN:

SELECT TOP 10 x FROM T ORDER BY x DESC

The execution logic expressed in the SQL statement is to do big sorting for all data, and then take the TOP 10. The sorting of big data is very slow, involving multiple times of swaps between memory and external storage. If this statement was executed according to its superficial meaning, the efficiency would be very low. Fortunately, traditional database is usually provided with strong optimization engine that can guess what the SQL code actually means and perform an effective optimization instead of executing according to its superficial meaning. For example, for the TopN operation, the optimization engine of database will adopt a more efficient execution logic to avoid big sorting, and this logic is also easy to implement in SPL syntax. But, if the execution logic of translating the SQL statement as-is is adopted, big sorting is unavoidable, and efficient algorithm cannot be employed, and as a result, a high performance cannot be ensured.

Relational database has been around for decades, and the optimization of database never stops. To get the said abilities, it requires extensive experience and long-term continuous commitment. In contrast, SPL team is far inferior to traditional database vendors in this regard, and basically unable to achieve this result. If it needs to translate, it usually has to hard copy SQL’s execution logic, resulting in a very low performance. To perform this operation task, it can only be done by traditional database vendors with their own experience after SPL is more popularized.

So, in practice, how should SQL and SPL share the work?

At present, the use of SQL in OLTP business is still relatively necessary (the OLTP function of SPL is under development). To analyze with SPL, you need to first synchronize the business data (usually cold data) to SPL storage, this allows you to get high performance. In addition, you can achieve real-time query for full data by virtue of T+0 query ability of SPL (visit: Create Easy and Efficient T+0 Queries with Open-source SPLfor details), and you can also work with OLTP database to implement HTAP requirements (for more information, visit: HTAP database cannot handle HTAP requirements), in this way, you can enjoy the convenience of both SQL and SPL at the same time.

For new OLAP system, you can directly utilize SPL's own storage and computing ability without involving SQL, such as the construction of data warehouse, data lake. In this way, you can enjoy not only the high performance of SPL, but a simpler SPL code than SQL.

The current Lakehouse is like a false proposition

When you want to improve the performance of existing OLAP system through SPL, you need to do according to actual situation. Usually, you need to first determine where the problem is, and then move the data involved in the problem to SPL, and finally make use of the high-performance computing power of SPL to speed up.

As the application progresses, you can gradually migrate things originally implemented in SQL to SPL to optimize the performance while improving the application framework. For more information, visit:

Open-source SPL Eliminates Tens of Thousands of Intermediate Tables from Databases

Goodbye, Stored Procedures - the Love/Hate Thing

Why ETL Becomes ELT or Even LET?

In summary, compatibility with SQL to a certain extent (relatively simple scenarios) is helpful for users, but if you want to further give play to the effectiveness of SPL to solve problems that SQL cannot solve, you need to employ SPL’s native syntax based on SPL storage. A further compatibility with SQL is neither practical nor necessary.