What can we do while facing the huge difficulty of writing too long, multiple-level nested SQL codes?

 

It is common for us to write SQL code to process data in our daily work. Whether the data processing in an application or the ad hoc query analysis, both can be done with SQL. Compared with other technologies (high-level languages such as Java), SQL is simpler. However, the simplicity of SQL is limited to dealing with simple requirements. For some complex computing scenarios, it is very difficult to write in SQL, which requires nesting multiple levels, up to hundreds and thousands of lines. When it comes to the length of SQL code, it is usually not measured by lines but by KBs. Such situation is not uncommon, and we believe that those who often write SQL code are no strangers to this situation.

Why does this happen? To find out why, visit: for details. Although SQL is simpler than other data processing technologies, some shortcomings still exist (such as, lack of support from ordered operation, incomplete set orientation, no object reference mechanism of high-level languages), which make it tedious to express complex calculation in SQL. Such complex and long SQL code will cause many problems.

Harms of long SQL code?

The first trouble that the complex and long SQL code brings us is that it is difficult to write.

In fact, the long code itself is not a big problem. For many tasks, long code is just caused by many steps, but is not difficult to write. The long SQL code, however, is often accompanied by difficulty, and will become extremely difficult with the increase of SQL code length. Using SQL to solve a task is very tortuous (the difficulty reflected in the way of thinking), for the reason that you cannot implement the algorithm according to normal way of thinking. For example, you have already come up with a good solution, and can solve the task in just a few simple steps, but if you use SQL to solve the same task, you have to do it in a tortuous way. It's like if you didn't nest several levels and write dozens or hundreds of lines, your professional level couldn’t be reflected. Moreover, in addition to nesting the sub-queries, the complex SQL code is accompanied by multi-table association and various filtering conditions. Therefore, you must keep a clear mind when writing SQL code, because it is difficult to locate once you write it wrong, sometimes it may crash the database. The reason for the difficulty is, of course, related to the fact that SQL does not advocate the smoothness of the coding process (although CTE syntax provides a certain degree of support). For a task that requires writing 100 lines of code, the complexities of dividing the code into 100 statements and just 1 statement are completely different.

In addition, code writing is inseparable from debugging, but for the debugging function of SQL, it can be said that it is extremely inconvenient to use. Since all major databases do not provide acceptable debugging functions until now, the development and debugging environments of SQL is simply unbearable compared with those of other high-level languages. If it is found that a long SQL statement nested in multiple levels is incorrectly written, the only thing we can do is to extract the subqueries from long statement level by level to debug and locate separately, which is time-consuming and laborious, resulting in a further increase of the difficulty of writing SQL.

The difficulty in writing means long development cycle. The purpose of writing SQL code is to serve the front-end business, if a computing requirement takes three to five days or even a week to code and debug, it probably loses the timeliness of the business. In this case, we have to bear the angry reproach from customers and could not complain.

Besides the difficulty in writing, complex SQL is also hard to maintain. The stability of data processing business is often very poor, it often needs to modify SQL code. However, it is not an easy thing to modify. Even the author himself will encounter the embarrassing situation that he could not understand the SQL code after a period of time, not to mention a novice who takes over the modifying task. Modifying a SQL code takes about the same time as rewriting, so it makes no sense.

Furthermore, complex SQL code also affects the migration of database. Although the migration of database doesn't occur fairly often, it would be fatal if it occurred. In recent years, with the rapid development of business and the rise of open-source databases, the replacement of database has occurred from time to time, this will cause SQL migration. We know that each kind of database has its own dialect, and the function used in one database may not work in another and, different databases have different levels of support for the SQL standard (for example, the window function of Oracle supports it very well, while the support of other databases is much worse). Once complex SQL code involves these"special" contents, it will be difficult to achieve database migration.

Anyway, complex SQL is the nightmare for data developers.

Solution

The problem of difficulty in using SQL has been going on for a long time, and everyone has been actively looking for a solution. In fact, many development methods no longer recommend writing complex SQL code. For example, the ORM technology can be regarded as replacing half of the functions of SQL, and more than that, the frameworks such as microservice only require SQL to perform basic data reading and writing, instead of implementing complex calculations and business processing. The idea of these methods is very clear, that is, the application of SQL is limited to basic reading and writing tasks, while the complex data processing and business logics are implemented in the application end. In this way, the above-mentioned problems of SQL can be avoided, and the architecture is more in line with the needs of modern application.

Now that SQL is not recommended, what technology should be used for those complex calculations and business logics?

Undoubtedly, other programming languages. Almost all application-level programming languages can command SQL to operate, and implementing this development architecture is not a problem. In this case, is the problem of difficulty in using SQL solved easily?

Let's examine some commonly used technologies.

Java

Java is definitely the first choice. After all, many applications are developed in Java. If the data processing is solved, it will have many advantages. Java naturally supports the procedural calculation. Although the code may be longer for complex computing, the algorithm can be implemented according to normal thinking. In this way, can we conclude that it can replace SQL?

No, it's not that simple.

Since Java lacks the specialized structured data objects and strong support from the base, it is not easy to implement the complex SQL-style calculations.

Since the structure of the return value calculated from structured data changes in the calculation process, and a large number of intermediate results are also of dynamic structure, it is difficult to define them in advance. Moreover, since Java is a strongly-typed language, the structure of the data object must be defined in advance (otherwise, only the data objects that are cumbersome in operation can be used such as map), which makes Java's structured data computing inflexible, and the lambda syntax are severely limited. Interpreted languages can simplify the definition of parameter, and the function itself can specify whether the parameter expression should be interpreted as a value parameter or a function parameter. However, Java is a compiled language and it is difficult to distinguish between different types of parameters, and hence only the design of a complex interface can the anonymous function be achieved (mainly referring to lambda syntax), which is not easy for even SQL programmers to grasp. The calculation of structured data can be significantly simplified by omitting the data object and directly referencing the field (such as writing as “unit price * quantity”), but Java lacks specialized structured data object, it is currently unable to support such seemingly simple but actually ingenious syntax, this makes Java code long and unintuitive (it can only write as “x. unit price *x. quantity”).

The lack of the structured data computing library will also lead to too long code. For the same grouping and aggregating operation, it can be written in one statement in SQL, but in Java, it will write dozens of lines, which is obviously not helpful for simplifying complex SQL. This problem has not been significantly improved even the Stream is added in Java8. Therefore, it is basically impossible to replace SQL with Java to simplify operation (the reason why some applications use Java for data processing is often due to architectural requirements, but their simplicities are far less than SQL).

Also, as a compiled language, it is difficult to for Java to achieve hot deployment, and hence it is difficult to respond to changing data computing scenarios in a timely manner.

Python

Since Java does not work well, what about the hot Python? Can we conclude that Python+SQL is much easier than Java+SQL?

Python (mainly referring to Pandas) does provide rich structured data computing library, and the implementation of computing is much simpler than Java. However, when we use Pandas in practice to process the data, especially involving complex operation, we will also encounter the situation where it is difficult to code. In fact, Pandas is not originally designed for structured data, and not the data table (the set consisting of many rows of data) that we are familiar with, but a matrix. When using Python to process structured data, it works passably good in performing some simple operations such as filtering and merging, but when it deals with more complex operations like grouping and ordered operations, it will be relatively troublesome, and we often have to solve in a tortuous way.

But that is not the main problem of Python. The difficulty of combining Python with application lies in its integration.

It is difficult for Python to be integrated in Java applications!

There are two main methods to integrate Python into Java. One method is service-based invocation. Since Python and Java are two systems, they are deployed separately and interact with each other through service (network communication). This method has the following disadvantages: i)it is more troublesome as two sets of applications need to be maintained; ii) some management problems are inevitable such as security authentication, system permission; iii)the performance of data exchange is also very poor. Therefore, this method will not be used unless it is the only option. Another method is to use an interpreter like Jython (Python on the JVM), for the reason that it is convenient to be integrated in applications due to being developed in Java, but Jython lacks sufficient computing libraries (such as Pandas), so it is difficult to cope with those complex calculations. Anyway, no matter which method is adopted, there will be huge limitation in terms of application, which makes it low feasibility to use Python to implement complex calculations in Java application.

In short, the problem cannot be solved unless the whole application is developed in Python. In that case, however, Java's various enterprise-level capabilities are useless. Therefore, it works only for some small applications.

Scala

Scala has the characteristics similar to those of Python. For example, Scala also provides DataFrame objects, and is also simpler in implementing simple structured data calculations, more simply in terms of set operation compared with Java and, Scala supports well for the characteristics of high-level language such as ordered operation and object attributization. Moreover, Scala is naturally easy to be integrated into Java because it runs on JVM. All these are the advantages of Scala.

The disadvantages of Scala include that it is relatively difficult to use, difficult to learn and more difficult to master, and it is especially disadvantageous compared with SQL when it is used for complex data processing. Therefore, we should give up the idea of simplifying complex SQL with Scala. This is probably why Spark returns to SQL. Moreover, Scala, as a compiled language, also does not support hot deployment.

SPL

These high-level languages all have shortcomings of one kind or another, and none of them are competent from the point of view of simplifying complex SQL. What else can we use to solve this problem?

In fact, we can clearly understand from the above analysis that the existing technologies (development languages) have their own advantages and disadvantages for the structured data computing, especially complex computing. For example, SQL is good at complex computing but does not support ordered operation and calculation process well. Conversely, Java supports ordered operation and calculation process but lacks of the set computing power. If the advantages of these technologies can be combined, the goal of simplifying complex SQL is achieved.

The open-source SPL is exactly such a product.

The full name of SPL is Structured Process Language, which is a programming language specially used for structured data computing.

Conventional computing power

Consistent data type

Compared with Java, SPL provides a more specialized structured data type, i.e., the table sequence. Like the data table of SQL, the table sequence is a set consisting of batch records, which has the general function of structured data type, and can seamlessly interact with SQL so as to use the return value of SQL.

The table sequence supports all structured data calculation functions, and the calculation result is also a table sequence, not a data type such as Map. For example, we can continue to perform structured data processing for the grouped and aggregated results.

Orders.groups(year(OrderDate):y; sum(Amount):m).new(y:OrderYear, m*0.2:discount)

Rich computing library

Based on the table sequence, SPL provides a wealth of calculation functions for structured data, such as filtering, sorting, grouping, deduplication, renaming, calculation of column, association, subquery, set computing, ordered computing, etc. These functions have powerful computing power and can perform the calculation independently, avoiding hard-coding, for example:

Combined query:

Orders.select(Amount>1000 && Amount<=3000 && like(Client,"\*bro\*"))

Inner join:

join(Orders:o,SellerId ; Employees:e,EId).groups(e.Dept; sum(o.Amount))

SPL supports the simple form of Lambda syntax, without defining the function name and function body, you can directly use expression as the parameter of function. When modifying the business logic, you don’t need to redefine the function, but simply modify the expression. Since SPL is an interpreted language, there is no need to explicitly define the type of parameter when using parameter expression, which makes the Lambda interface simpler. For example, to calculate the sum of squares, if you want to calculate the square in the process of sum, you can directly write: Orders.sum(Amount*Amount).

Moreover, SPL, as an interpreted-execution language, also naturally supports the dynamic data structure, and can dynamically generate new table sequence according to the structure of calculation result, which is especially suitable for calculations such as computed column, grouping and aggregating, and association. For relatively complex calculation, it usually needs to be split into multiple steps, and the data structure of each intermediate result is almost different. Since SPL supports dynamic data structure, there is no need to define the structure of these intermediate results in advance. For example, we want to find out the customer whose payment amount in every month is in the top 10 based on the customer payment record table of a certain year, the code is as follows:

Sales2021.group(month(sellDate)).(~.groups(Client;sum(Amount):sumValue)).(~.sort(-sumValue)).(~.select(#<=10)).(~.(Client)).isect()

Process control and SQL collaboration

In addition to providing the set operation capability comparable to SQL, SPL also provides good support for process control and stepwise computing. With flexible branch judgment statements, loop statements, combined with specialized structured data objects, SPL can easily implement various business logics. For example, we want to find out the top n major customers whose cumulative sales account for half of the total, we can achieve it step by step like this:


A

1

=db.query(“select client,sum(amount) amount from sales group by client order by amount desc”)

2

=A1. sum(amount)/2

3

=0

4

=A1.pselect((A3=A3+amount,A3>=A2))

5

=A1(to(A4))

First, use SQL to group and aggregate, and sort in descending order by the aggregated value. Then, SPL uses the calculation results of SQL to perform the subsequent calculation in a step-by-step manner. The effective combination of SPL and SQL achieves the goal of simplifying complex calculations to a large extent.

In practice, the interaction and collaboration between SPL and SQL are often needed, so as to give full play to the advantages of both, including the reading and writing of database, transaction processing, flow processing, and calling of stored procedure.

Writing to database (update/insert):

db.update@u(A7,sales;ORDERID)

Executing DML statement:

db.execute("insert into DEPARTMENT(DEPT, MANAGER) values(?,?)","TecSupport",9)

Calling the stored procedure

db.proc({db_proc(?,?),,:101:"o":table1,1:0:"i": })

Rich set computing capabilities, together with the procedural calculation and flow control (including command SQL), make SPL have the capabilities equivalent to SQL and Java, and the implementation is simpler than Java.

Capabilities that surpass SQL

SPL not only covers all computing power of SQL, but provides more powerful language capabilities. Based on these characteristics, the operations that are originally difficult to code in SQL can be done with ease. Simplifying complex calculations becomes an easy task.

Discreteness, and more thorough set orientation based on discreteness

Set orientation is a basic feature of SQL, that is, it supports the data in participating in operations in the form of set, but SQL is not good in discreteness, specifically, all set member must participate in the operation as a whole, and any member cannot be separated from the set. By contrast, the high-level languages like Java are good in discreteness, and the array members can be operated independently. However, a more thorough set orientation needs to be supported on discreteness so as to allow the set members to separate from the set to freely form a new set with other data to participate in the operation.

SPL combines SQL’s set orientation and Java’s discreteness, and hence a more thorough set orientation can be achieved.

It is easy for SPL to express “set of sets”, which is suitable for calculation after grouping. For example, find out the students whose score of every subject is in the top 10:


A

1

=db.query(“select * from score”) .group(subject)

2

=A2.(~.rank(score).pselect@a(~<=10))

3

=A1.(~(A3(#)).(name)).isect()

Support ordered computing

Ordered computing is the typical combination of discreteness and set orientation, for the following reasons: i)the order of members is meaningful only in a set, thus, the set orientation is required; ii) each member needs to be separated from adjacent members during the ordered computing, it requires the discreteness. Since SPL integrates the set orientation and discreteness, it naturally supports ordered computing.

Specifically, SPL can reference the member with absolute position. For example, to take the third order, we can write Orders(3); to take the first, third, and fifth records, we can write Orders([1, 3,5]).

SPL can also reference the member with relative position. For example, to calculate the amount growth rate of each record relative to the previous record:

Orders.derive(amount/amount[-1]-1)

In addition, SPL can use # to represent the sequence number of current record. For example, to divide the employees into two groups by sequence number, one group with odd sequence numbers, and another group with even sequence numbers:

Employees.group(#%2==1)

With the support of ordered computing, it is very convenient to deal with the order-related operations in structured data calculation (such as LRR, YOY, top 20%, ranking).

Object reference

The field of SPL table sequence can store the records or records set, which allows the association relationship to be expressed intuitively in the way of object reference, even if there are more relationships, they can be expressed intuitively. For example, find out the male employees whose manager is female according to the employee table:

Employees.select(gender:"male",department.manager.gender:"female")

More convenient function syntax

The provision of a large number of powerful structured data calculation functions is originally advantageous, but it will make it difficult to distinguish the functions with similar function, increasing the learning difficulty invisibly.

SPL provides the unique function option syntax. This syntax allows the functions with similar function to share one function name, and their difference can be distinguished just by the options. For example, the basic function of the select function is to filter, and if you only want to filter out the first record that meets the conditions, you can use the option @1:

Orders.select@1(Amount>1000)

When the amount of data is large, and the parallel computing is used to improve performance, you only need to change the option to @m:

Orders.select@m(Amount>1000)

When using the binary search to quickly filter the sorted data, you can use the option @b:

Orders.select@b(Amount>1000)

The options of function can also be used in a combined way, for example:

Orders.select@1b(Amount>1000)

Usually, the parameters of structured operation function are very complex. For example, SQL needs to use various keywords to separate the parameters of one statement into multiple groups, but this will involve many keywords and also make the statement structure inconsistent. SPL supports parameter layering, and divides the parameters into three layers from high to low by semicolons, commas, and colons, and simplifies the expression of complex parameter in a general way:

join(Orders:o,SellerId ; Employees:e,EId)

Convenient editing and debugging functions

Unlike the difficult-to-use editing and debugging function of SQL, SPL provides concise and easy-to-use development environment that is characterized by single step execution, breakpoint setting, WYSIWYG result preview window, etc., which make the development efficiency higher.

1png

SPL adopts the cell-style coding, the code is naturally aligned and clear in levels. When programming procedural calculation, the cell name can be directly used to reference the calculation result of the previous step without taking the trouble to define the variables (although SPL supports defining variables), therefore, it is very convenient. The easy-to-use development environment naturally yields twice the result with half the effort, further simplifying the difficulty of complex calculations.

Application integration, low coupling and hot swap

SPL provides a standard application interface (JDBC/ODBC/RESTful) through which it can be easily integrated into applications. Especially for Java applications, SPL can be integrated as an embedded engine, making the application itself have strong data computing capabilities.

Code example of calling SPL by JDBC:

Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement st = connection.();
CallableStatement st = conn.prepareCall("{call splscript(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result=st.execute();

Moreover, SPL adopts the interpreted-execution mechanism that naturally supports hot swap, which makes it very friendly to deal with the data processing requirements that are poor in stability and often need to be modified and added. Furthermore, SPL scripts can be executed independently from Java programs and placed outside the Java. Modification and maintenance of the script can also be done independently. After the modified script is uploaded, it will take effect in real time to ensure that the application can provide data services without interruption.

Using SPL to place algorithms outside can effectively reduce the coupling between the application and database, and the independent SPL module can further reduce the coupling between different modules of the application, making the application architecture clearer and more reasonable.

In conclusion, the reason why SPL is simpler is that SPL combines the advantages of SQL and other high-level languages like Java, and adds many features based on these advantages. Therefore, coding for complex calculations becomes simpler. Moreover, the perfect combination between SPL and application makes application architecture more reasonable. With the help of SPL, we believe that complex SQL code with thousands of lines will no longer exist one day in the future.