Another culprit behind slow-running and crashing databases


That’s right, it’s the famous JOIN.

JOIN has always been a major challenge in database computing, and the industry has come up with many ways to calculate it. Without any optimization, a JOIN is computed by nested loops over the two associated tables, a multiplicative complexity that becomes unbearable once the data volume grows even moderately. Mature databases are certainly not so naive. For the most common equi-JOIN (where the association condition is equality of key values), the HASH JOIN algorithm is usually used, which reduces the computational complexity by a factor of K (K being the size of the HASH space; we won't go into detail here, as plenty of material is available elsewhere).
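To make the complexity difference concrete, here is a minimal Java sketch (illustrative only, not any database's internals; the Order/Customer tables are invented) contrasting the nested-loop approach with a HASH JOIN:

import java.util.*;

public class JoinSketch {
    record Order(int customerId, double amount) {}
    record Customer(int id, String name) {}

    // Nested-loop join: every order is compared with every customer,
    // O(N*M) comparisons -- the "multiplicative" complexity.
    static List<String> nestedLoop(List<Order> orders, List<Customer> customers) {
        List<String> out = new ArrayList<>();
        for (Order o : orders)
            for (Customer c : customers)
                if (o.customerId() == c.id())
                    out.add(c.name() + ": " + o.amount());
        return out;
    }

    // Hash join: build a hash table on one table once, then probe it while
    // scanning the other -- roughly O(N+M) instead of O(N*M).
    static List<String> hashJoin(List<Order> orders, List<Customer> customers) {
        Map<Integer, Customer> built = new HashMap<>();
        for (Customer c : customers) built.put(c.id(), c);
        List<String> out = new ArrayList<>();
        for (Order o : orders) {
            Customer c = built.get(o.customerId());          // O(1) probe
            if (c != null) out.add(c.name() + ": " + o.amount());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Customer> customers = List.of(new Customer(1, "Ann"), new Customer(2, "Bob"));
        List<Order> orders = List.of(new Order(1, 9.9), new Order(2, 5.0), new Order(1, 3.3));
        System.out.println(hashJoin(orders, customers));     // same result as nestedLoop
    }
}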

The HASH JOIN algorithm works well in memory. When the data is too large to fit in memory, however, it usually involves buffering HASH partitions to disk, and in unlucky cases a second round of HASHing is required, making performance unpredictable. Distributed computation is even more troublesome: the database industry has invented methods such as broadcast JOIN and shuffle JOIN to reduce the problem to single-machine JOINs (not elaborated here), which are cumbersome and cause a sharp decline in performance, even producing the phenomenon that adding cluster nodes makes things slower. Some databases implement only the in-memory algorithm in pursuit of speed, so it is no surprise that they crash when the data volume is large.
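The partition-buffering idea can be sketched roughly as follows (again illustrative Java, not any database's actual code):

import java.util.*;
import java.util.function.*;

public class GraceHashSketch {
    // Hash-partition rows into K buckets by the join key. A real database
    // would spill each bucket to disk; here a bucket is just an in-memory list.
    static <R> List<List<R>> partition(Iterable<R> rows, ToIntFunction<R> key, int k) {
        List<List<R>> buckets = new ArrayList<>();
        for (int i = 0; i < k; i++) buckets.add(new ArrayList<>());
        for (R r : rows)
            buckets.get(Math.floorMod(key.applyAsInt(r), k)).add(r);
        return buckets;
    }

    public static void main(String[] args) {
        // Partition both sides with the same hash; bucket i of one table can
        // only match bucket i of the other, so the pairs are joined independently.
        // If a bucket is still too big for memory it must be re-partitioned with
        // a different hash function -- the unpredictable "second HASH".
        System.out.println(partition(List.of(3, 7, 12, 25, 40), Integer::intValue, 4));
    }
}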


Is JOIN really so difficult to deal with?

If we implement JOIN strictly as SQL defines it, it is indeed quite difficult. Yet while everyone strives to solve JOIN performance issues, almost no one asks whether this definition is reasonable in the first place.

In reality, the vast majority of business-meaningful equi-JOINs involve primary keys (or logical primary keys) rather than arbitrary fields (if no primary key is involved, the code was most likely written incorrectly). By exploiting this feature, JOIN performance can be optimized. Unfortunately, SQL's definition of JOIN takes no notice of it.


We can divide equi-JOINs into two categories. One is the foreign-key association between an ordinary field of the fact table and the primary key of the dimension table, a many-to-one relationship (e.g., Orders.customer_id pointing to Customer.id). The other is the primary-key association between the primary key of the primary table and part of the sub table's primary key, a one-to-many relationship (e.g., Orders.id matched by OrderDetail.order_id), which may degenerate into a one-to-one homo-dimension table relationship. We can then use the respective features of each category to optimize performance.

Dimension tables are usually small and can be fully loaded into memory. Then, no matter how large the fact table is or how many dimension tables there are, a single traversal of the fact table suffices: no HASH partition buffering is needed, let alone a second round of HASHing. Once dimension tables are distinguished as such, they can simply be replicated across cluster nodes, with no broadcast or shuffle during computation.
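A rough Java illustration of this scheme, with invented dimension and fact tables:

import java.util.*;

public class DimJoinSketch {
    record Sale(int productId, int regionId, double amount) {}

    // Small dimension tables preloaded into memory once (hypothetical data).
    static final Map<Integer, String> PRODUCTS = Map.of(1, "Pen", 2, "Book");
    static final Map<Integer, String> REGIONS  = Map.of(10, "East", 20, "West");

    // One streaming pass over the fact table resolves any number of foreign
    // keys at once; the fact table is never partitioned, buffered, or shuffled.
    static void scan(Iterator<Sale> facts) {
        while (facts.hasNext()) {
            Sale s = facts.next();
            System.out.println(PRODUCTS.get(s.productId()) + "/"
                    + REGIONS.get(s.regionId()) + ": " + s.amount());
        }
    }

    public static void main(String[] args) {
        scan(List.of(new Sale(1, 10, 9.9), new Sale(2, 20, 5.0)).iterator());
    }
}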

If a dimension table is too large for memory, it can be stored (and distributed) ordered by its primary key. When the fact table is small, the JOIN then becomes a search against the ordered dimension table, again with no HASH partition buffering or second HASH. When the fact table is also large, it can be buffered by intervals of the dimension table's key: only one side needs partition buffering (reducing the buffered volume), and a second HASH is impossible by construction. This case is not extremely fast, but it still holds a large advantage over HASH JOIN. In a distributed setting, the dimension table can be loaded into the memory of the cluster nodes, and the fact table traversed separately at each node.
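The small-fact-table case can be sketched like this (illustrative, with invented keys):

import java.util.*;

public class OrderedDimSketch {
    // Dimension records stored ordered by primary key (on disk in reality;
    // the keys and names here are invented).
    static final int[]    DIM_KEYS  = {3, 7, 12, 25, 40};
    static final String[] DIM_NAMES = {"a", "b", "c", "d", "e"};

    // With a small fact table, the join degenerates into key lookups against
    // the ordered dimension table: binary search, no hashing, no buffering.
    static String lookup(int foreignKey) {
        int i = Arrays.binarySearch(DIM_KEYS, foreignKey);
        return i >= 0 ? DIM_NAMES[i] : null;
    }

    public static void main(String[] args) {
        System.out.println(lookup(12));   // prints "c"
    }
}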

The primary/sub (and homo-dimension) table case is even simpler. Once both tables are stored ordered by their primary keys, a very low-complexity merge algorithm can be used: the JOIN completes in a single traversal with only a little memory, involving no HASH partitioning, no second HASH, no shuffle, and not even any HASH computation. No matter how large the data volume, it will not crash.
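A minimal Java sketch of this ordered merge, assuming invented Order/Detail tables already sorted by key:

import java.util.*;

public class MergeJoinSketch {
    record Order(int id, String customer) {}
    record Detail(int orderId, double amount) {}

    // Both tables physically ordered by the shared primary key, so one
    // synchronized forward pass joins them with O(1) extra memory and
    // no HASH computation at all.
    static void mergeJoin(List<Order> orders, List<Detail> details) {
        int i = 0, j = 0;
        while (i < orders.size() && j < details.size()) {
            Order o = orders.get(i);
            Detail d = details.get(j);
            if (o.id() < d.orderId()) i++;          // order has no more details
            else if (o.id() > d.orderId()) j++;     // orphan detail, skip
            else { System.out.println(o.customer() + ": " + d.amount()); j++; }
        }
    }

    public static void main(String[] args) {
        mergeJoin(List.of(new Order(1, "Ann"), new Order(2, "Bob")),
                  List.of(new Detail(1, 9.9), new Detail(1, 3.3), new Detail(2, 5.0)));
    }
}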


Unfortunately, relational databases cannot adopt these algorithms, because they must faithfully implement the definitions of relational algebra. Some databases apply engineering optimizations, such as switching to a merge JOIN when the data is found to be ordered. But since relational algebra is founded on unordered sets, the physical order of data cannot be guaranteed (logical order alone cannot avoid the huge cost of random disk access), and in most cases these algorithms cannot be applied.


Well, esProc SPL can!

Strictly speaking, esProc SPL is not a database but a professional computing engine. It defines equi-JOINs in the two categories above and explicitly defines the concepts of fact table, dimension table, primary table, and sub table (terms we use casually when discussing databases, but which relational algebra never strictly defines). esProc SPL abandons relational algebra and SQL in favor of a discrete dataset theory built on ordered sets, together with a new programming language, SPL. SPL naturally supports physically ordered storage (both in memory and on external storage) and can implement the efficient algorithms described above; it also provides dedicated functions for the different categories of JOIN, letting programmers easily apply these algorithms to achieve high performance.


For a foreign-key association where the dimension table is small enough to be fully loaded into memory, a one-off JOIN in esProc SPL is not much different from a database HASH JOIN, since a HASH value still has to be computed to find the associated record. But the HASH values computed in SPL can be reused: the next time the same dimension table participates in an association (a very common occurrence), no recomputation is needed. Further, if the fact table also fits in memory, the association can be established beforehand, and no HASH computation or comparison is needed at JOIN time at all. All of this relies on the fact that the dimension table's primary key participates in the association, which SQL systems, blind to this feature, cannot exploit.
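The pre-association idea can be sketched like this (illustrative Java, not SPL's internal implementation):

import java.util.*;

public class PreAssociateSketch {
    static class Product { final String name; Product(String n) { name = n; } }
    static class Sale {
        final int productId; final double amount; Product product; // resolved later
        Sale(int pid, double amt) { productId = pid; amount = amt; }
    }

    // Done once at load time: resolve each foreign key to a direct object
    // reference. Every later "join" on this dimension table is then just a
    // pointer dereference -- no hashing, no key comparison.
    static void preAssociate(List<Sale> sales, Map<Integer, Product> products) {
        for (Sale s : sales) s.product = products.get(s.productId);
    }

    public static void main(String[] args) {
        Map<Integer, Product> products = Map.of(1, new Product("Pen"));
        List<Sale> sales = List.of(new Sale(1, 9.9), new Sale(1, 3.3));
        preAssociate(sales, products);
        for (Sale s : sales) System.out.println(s.product.name + ": " + s.amount);
    }
}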

Using SPL's orderliness, dimension table numbering can also be applied: foreign keys are converted into sequence numbers, so dimension table records can be located directly by sequence number, avoiding HASH computation and comparison altogether and further improving association performance. Unordered SQL systems cannot implement this algorithm either.
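Conceptually, the numbering scheme turns a key lookup into an array access, as in this illustrative sketch:

public class SeqNumberSketch {
    // After "numbering", the dimension table is an array whose index IS the
    // converted foreign key, so locating a dimension record is one array
    // access: no hash computation and no key comparison at all.
    static final String[] REGION_NAMES = {"East", "West", "North", "South"};

    public static void main(String[] args) {
        int regionSeq = 2;                            // foreign key already converted
        System.out.println(REGION_NAMES[regionSeq]);  // prints "North"
    }
}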

In other cases, such as when the data volume exceeds memory, the preceding analysis already demonstrates the enormous advantages of SPL's algorithms. Due to space limitations, the principles of how SPL executes JOIN are not explained in further detail here; interested readers can search the Scudata Forum for more information.


Let’s take a look at the test results of JOIN operations versus a wide table (unit: seconds):


                    4C16G                                    8C32G
                    esProc SPL  StarRocks  ClickHouse       esProc SPL  StarRocks  ClickHouse
Wide table          114.2       129.9      74.3             57.7        62.1       33.2
Two-table join      21.5        78.8       204.1            11.5        35.1       89.3
Seven-table join    55.6        152.5      Memory overflow  30.6        73.3       Memory overflow

As the table shows, esProc SPL's wide-table performance is not as good as ClickHouse's, but with the new algorithms (specifically, techniques such as multi-layer dimension table pre-association, dimension table primary-key numbering, and primary/sub table merging), its JOIN performance holds a significant advantage. ClickHouse's poor JOIN performance is no surprise either: the two-table join involves only a small dimension table, so it is merely slow but still manageable; the seven-table join involves large primary and sub tables, and it collapses.

The complete test report can be found in SPL computing performance test series: associate tables and wide table.


Now let’s look at an actual spatiotemporal collision case: identify the top 20 phones that appeared most frequently in the same time slots and locations as a given phone, over a data scale of approximately 25 billion rows. The SQL is written as follows:

WITH DT AS ( SELECT DISTINCT id, ROUND(tm/900)+1 as tn, loc FROM T WHERE tm<3*86400)
SELECT * FROM (
    SELECT B.id id, COUNT( DISTINCT B.tn ) cnt
    FROM DT AS A JOIN DT AS B ON A.loc=B.loc AND A.tn=B.tn
    WHERE A.id=a AND B.id<>a
    GROUP BY id )
ORDER BY cnt DESC
LIMIT 20

This query contains a self-JOIN. A single-node ClickHouse crashes outright, and a 5-node cluster takes more than 30 minutes to return the result. The SPL code below gets the result in under 6 minutes on a single node:

     A
1    =now()
2    >NL=100000,NT=3*96
3    =file("T.ctx").open()
4    =A3.cursor(tm,loc;id==a).fetch().align(NL*NT,(loc-1)*NT+tm\900+1)
5    =A3.cursor@mv(;id!=a && A4((loc-1)*NT+tm\900+1))
6    =A5.group@s(id;icount@o(tm\900):cnt).total(top(-20;cnt))
7    =interval@ms(A1,now())

There is no trace of JOIN in the SPL code, because after redefining JOIN, SPL can integrate it into other computational processes. Here A4 builds the dimension table, and the expression A4((loc-1)*NT+tm\900+1) in A5 achieves the filtering effect of an inner JOIN. It also uses the sequence numbering mechanism described earlier, effectively boosting JOIN performance.


Careful readers may notice that the effectiveness of the esProc SPL algorithms depends on the data being ordered by ID, yet data is usually generated in time order, not ID order. Does this mean the algorithms apply only to historical data sorted in advance, and fail on new data that cannot be sorted in with it in time?

esProc has taken this into account. SPL's multi-zone composite table performs incremental sorting as data enters, ensuring that reads see data ordered by ID in real time, so the ordered computation scheme also applies to the latest data. Moreover, statistics on fact tables usually involve time intervals, and SPL's pseudo table supports a two-dimensional ordering mechanism that quickly filters out data outside the time interval, further improving computational performance.
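The underlying idea, reduced to an illustrative sketch (not SPL's actual multi-zone mechanism), is a merge-on-read of sorted zones:

import java.util.Arrays;

public class ZoneMergeSketch {
    // New rows live in a small sorted delta zone; reads merge the large
    // sorted main zone with the delta on the fly, so callers always see
    // a single stream ordered by key.
    static int[] mergeOnRead(int[] mainZone, int[] deltaZone) {
        int[] out = new int[mainZone.length + deltaZone.length];
        int i = 0, j = 0, k = 0;
        while (i < mainZone.length && j < deltaZone.length)
            out[k++] = mainZone[i] <= deltaZone[j] ? mainZone[i++] : deltaZone[j++];
        while (i < mainZone.length) out[k++] = mainZone[i++];
        while (j < deltaZone.length) out[k++] = deltaZone[j++];
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                mergeOnRead(new int[]{1, 4, 9}, new int[]{2, 7})));  // [1, 2, 4, 7, 9]
    }
}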


esProc is pure Java software that can run in any JVM environment and be seamlessly embedded into Java applications, delivering the computing power of a data warehouse to all kinds of scenarios in a very lightweight way.
esProc provides a visual development environment that supports single-step execution, breakpoint setting, and WYSIWYG result preview. Developing and debugging are much more convenient than with SQL and stored procedures.

SPL also has comprehensive flow-control statements, such as for loops and if branches, and supports subroutine calls. It offers the procedural capability otherwise available only in stored procedures, and can fully replace SQL and stored procedures.


Finally, esProc SPL is open source and free. You can find it at https://github.com/SPLWare/esProc.