How we achieved performance improvements by orders of magnitude

We have worked on a number of performance optimization cases before. Although there are not many, none of them failed: after optimization, the speed increased by several times at least, dozens of times at most, and even hundreds of times in extreme cases. A speedup of an order of magnitude has basically become the norm. Here are some of the cases:

Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times 

Open-source SPL improves bank’s self-service analysis from 5-concurrency to 100-concurrency

Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times

Open-source SPL turns pre-association of query on bank mobile account into real-time association

Open-source SPL speeds up batch operating of bank loan agreements by 10+ times

Open-source SPL optimizes batch operating of insurance company from 2 hours to 17 minutes


How did we make it?
These cases have one thing in common: they were all originally implemented in SQL on various databases (or on Hadoop/Spark), ranging from hundreds of lines of SQL for queries to thousands of lines of stored procedures for batch jobs. After being re-implemented in esProc SPL, they achieved the effects listed above.

Is there anything magical about esProc SPL? Can it make any operation run faster?

Unfortunately, it cannot. esProc is also software developed in Java, and it is usually a little slower than a database written in C/C++ when performing the same operation.

Then, what happened?

The root cause is that we implemented different algorithms in SPL. Software cannot speed up hardware, but we can design algorithms of lower complexity that effectively reduce the amount of computation, and the speed rises accordingly. Suppose a computing task originally requires 100 million additions; if it can be reduced to 1 million operations, it will naturally run about 100 times faster. In that case, even if each individual operation is a little slower, the overall performance still improves enormously. So there is nothing magical here at all.
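
For a rough sense of the arithmetic (the timing figures here are illustrative assumptions, not measurements): at, say, one addition per nanosecond, 100 million additions take about 0.1 seconds; 1 million operations, even if each one costs twice as much, take about 0.002 seconds, still a 50-fold overall gain.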

As long as the high-performance algorithms and storage schemes can be implemented, which technology is used matters less. They can, of course, be implemented in C/C++ or Java. In fact, since esProc is written in Java, implementing these algorithms directly in Java would in principle be faster than SPL, and C/C++ would generally be faster than Java (memory allocation in Java costs more time).

However, although code written in Java or C++ runs faster than SPL, it is much longer (an estimated 50-100 times longer), which means an excessive development workload, and development cost is also a metric that has to be considered in practice. In this sense, running fast and being easy to write are two sides of the same goal: implementing high-performance algorithms efficiently.

esProc SPL enriches the data types for structured data and provides many basic high-performance algorithms. Coding in SPL is essentially combining these algorithms, which is of course far more convenient. If we have to say where the magic of SPL lies, this is it.

So, can't the same thing be achieved by continuing to use SQL?

No, it can’t. The reason is that SQL is designed at too coarse a level, and relational algebra, its theoretical basis, lacks many data types and basic operations, making it impossible to describe many high-performance algorithms; only slow algorithms can be expressed. Many databases and big data platforms do apply optimizations in practice, but they can only handle simple scenarios. Once the situation becomes complicated, the database optimizer gets “confused” and gives up. The root of the problem lies in the theory, and it cannot be solved at the engineering level.

The theoretical basis of SPL is no longer relational algebra but the discrete dataset model we invented. Because this system offers more data types and operations, more high-performance algorithms can be devised. SPL is an implementation of the discrete dataset model and encapsulates many ready-made algorithms. Of course, this algebraic system could also be implemented from scratch in Java or C++, so high-performance code can be written in those languages too; in SQL, however, it cannot.

Let's take a simple example: to get the top 10 out of 100 million rows, the SQL looks like this:

select top 10 x,y from T order by x desc

There is an "order by" in this statement. Executing it literally involves a big sort over all the data, which is very slow. In fact, there is an algorithm that avoids the big sort, but it cannot be described in SQL, so we can only count on the database optimizer. For the simple case in this statement, many commercial databases can indeed optimize it, usually achieving good performance without a full sort. However, when the situation gets more complicated, such as taking the top 10 within each group, SQL needs window functions and a subquery:

select * from
    (select *, row_number() over (partition by y order by x desc) rn from T) t
where rn <= 10

At this point the database optimizer gets confused. Unable to guess the purpose of this SQL statement, it has to execute the sorting logic honestly (the words “order by” are still there), and performance drops sharply.
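
To make the idea concrete, here is a minimal sketch in Java of such a sort-free top-N algorithm (our own illustration, not esProc's actual implementation): keep a small min-heap of N candidates while traversing the data once, so the cost is roughly O(n log N) comparisons instead of the O(n log n) of a full sort.

import java.util.PriorityQueue;
import java.util.Random;

public class TopN {
    // Return the n largest values in one traversal, without sorting all the data.
    static double[] top(double[] data, int n) {
        // Min-heap of size n: the smallest of the current candidates is on top.
        PriorityQueue<Double> heap = new PriorityQueue<>(n);
        for (double v : data) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {
                heap.poll();   // evict the smallest candidate
                heap.offer(v); // keep the larger value instead
            }
        }
        double[] result = new double[heap.size()];
        for (int i = result.length - 1; i >= 0; i--)
            result[i] = heap.poll(); // pop ascending, fill descending
        return result;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        double[] data = new double[1_000_000]; // scale up as memory allows
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextDouble();
        for (double v : top(data, 10)) System.out.println(v);
    }
}

Since the heap holds only N elements, memory use stays tiny no matter how large the table is, which is also why TopN can be treated as an aggregation rather than a sort.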

SPL, however, is different. The discrete dataset model has the concept of a universal set: a TopN operation is regarded as an aggregation of the same kind as SUM and COUNT, except that its return value is a set. As a result, the SPL statement for the task above contains no sorting action at all:

T.groups(;top(-10;x))

The grouped version is just as simple, and likewise involves no big sort:

T.groups(y;top(-10;x))
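
Under the hood, an implementation of the grouped version only needs to keep one such small heap per group. Here is a hedged Java sketch of the idea (again our own illustration, with hypothetical field names y and x):

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class GroupedTopN {
    // For each group key y, keep the n largest x values seen so far.
    static Map<String, PriorityQueue<Double>> topPerGroup(
            String[] ys, double[] xs, int n) {
        Map<String, PriorityQueue<Double>> heaps = new HashMap<>();
        for (int i = 0; i < xs.length; i++) {
            PriorityQueue<Double> heap =
                heaps.computeIfAbsent(ys[i], k -> new PriorityQueue<>(n));
            if (heap.size() < n) {
                heap.offer(xs[i]);
            } else if (xs[i] > heap.peek()) {
                heap.poll();       // evict the smallest candidate of this group
                heap.offer(xs[i]); // keep the larger value
            }
        }
        return heaps;
    }

    public static void main(String[] args) {
        String[] ys = {"a", "a", "a", "b", "b", "b"};
        double[] xs = {1, 5, 3, 9, 2, 7};
        topPerGroup(ys, xs, 2).forEach(
            (k, heap) -> System.out.println(k + ": " + heap)); // heap order, not sorted
    }
}

The single pass and the tiny per-group state are exactly what the window-function SQL above prevents the optimizer from discovering.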

Click here: Performance optimization skill: TopN for a more detailed test comparison on this problem.

Therefore, when optimizing performance we rewrite the code rather than keep SQL for the sake of compatibility. Understanding the original logic and re-implementing it takes considerable effort, but a performance gain of several times or dozens of times makes it worthwhile.

In addition, storage is also very important. A good algorithm can only take effect with a suitable storage mechanism. Therefore, to obtain high performance we should not keep the data in the database; we need to move it out and store it in a different way. After the storage scheme is changed, intermediate results that originally had to be buffered may no longer be needed, and an operation that originally traversed the data many times may need only one pass, or none at all. Reducing the amount of hard disk access is very effective for improving performance.
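
As a simple illustration of "traverse once instead of many times", the sketch below (our own example, not esProc code) computes several aggregates in a single pass over the data, where a naive plan would scan the table once per aggregate:

import java.util.Random;

public class OnePass {
    public static void main(String[] args) {
        Random rnd = new Random(1);
        double[] data = new double[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextDouble();

        // One traversal yields sum, count and max together,
        // instead of three separate scans over the data.
        double sum = 0, max = Double.NEGATIVE_INFINITY;
        long count = 0;
        for (double v : data) {
            sum += v;
            count++;
            if (v > max) max = v;
        }
        System.out.printf("sum=%.2f count=%d max=%.6f avg=%.6f%n",
                sum, count, max, sum / count);
    }
}

When the data sits on disk rather than in an array, collapsing three scans into one cuts the hard disk access by roughly two thirds, which is where much of the real-world gain comes from.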

From this principle we also know that if no better algorithm can be designed for a given computing target, no speedup is possible. For example, to sum a very simple large table: if SQL needs 100 million additions, SPL needs 100 million as well, so SPL cannot possibly be faster, and will generally be a bit slower (Java cannot catch up with C/C++). However, when the computing task is complex enough, say a SQL script of hundreds of lines nested several levels deep (slow SQL is usually not simple), we can almost always find enough points to optimize, which is why none of the cases we have worked on has failed. This is how, in practice, esProc written in Java greatly outperforms databases written in C/C++: thanks to the algorithms.

We even ran an ad looking for slow processes written in SQL, WANTED: Unbearably Slow Query and Batch Job, committing to speed them up by an order of magnitude.

Let's look at this speed-up principle from a different perspective: high performance depends not on the code but on the algebra; the code is just a means of implementation. What matters most is mastering and applying these algorithms, not the SPL syntax. SPL syntax is very simple, much easier than Java; you can basically pick it up in two hours and become fairly proficient in two or three weeks. The algorithms, however, are not that simple; it takes serious study and plenty of practice to master them. If these cases were handed directly to an inexperienced user, the result would often be poor, simply because the algorithms are not fully understood.

Conversely, once the algorithms are mastered, which syntax to use becomes a relatively minor issue (though a language as coarse as SQL still won't do). It is like treating a patient: once the cause of the disease is found, we can determine which medicines will work. The disease can then be cured either by buying ready-made medicine (using the algorithms already encapsulated in SPL) or by gathering herbs in the mountains (coding them directly in Java/C++); only the amount of trouble and the cost differ.

If you are interested in the high-performance algorithms that SPL provides beyond SQL, we recommend reading the performance optimization book on Raqforum at:

Performance Optimization - Preface

 

We have organized these algorithms into a systematic body of knowledge, some of which is pioneered in the industry and cannot be found in other textbooks or papers.

If you follow the courses in this book and master these algorithms, you will be able to write your own high-performance code and achieve order-of-magnitude speedups. Even if you never write the code yourself, understanding the principles will keep you from being fooled by big data products trumpeting “searching trillions of rows in seconds”.