ClickHouse is fast, esProc SPL is faster

 

ClickHouse, as an open-source analysis database, is known for being fast. Is that true? Let's test and verify it through comparative test.

ClickHouse vs Oracle

First of all, we conduct a comparative test on ClickHouse (CH for short) and Oracle database (ORA for short) under the same hardware and software environment, and use the internationally recognized TPC-H as the test benchmark. The test is to execute the calculation requirements (Q1 to Q22) defined with 22 SQL statements for 8 tables. The test is performed on one machine within 12 threads, and the total data amount is around 100G. Since the SQL statements corresponding to TPC-H are relatively long, we do not list them here.

Q1 is a simple calculation, that is, traverse a single table to group and then aggregate. Test result is shown in the figure below:..

We can clearly see from the figure that CH performs better than ORA, showing that CH’s columnar storage works very well, and the speed of traversing a single table is very fast. The main cause that makes ORA much slower is that it uses the row-based storage.

However, if we increase the computational complexity, will CH perform as good as above? Let's check it out in tests for Q2, Q3, Q7 of TPC-H. Test results:

..

It can be seen that the performance of CH decreases significantly when the calculation becomes complicated. Specifically, Q2 involves a small amount of data, the effect of columnar storage is not obvious, and the performance of CH is almost the same as that of ORA; Q3 involves a large amount of data, due to the columnar storage, the performance of CH exceeds that of ORA; Q7 also involves a large amount of data, but the performance of CH is not as good as that of ORA because the calculation becomes complicated.

The speed of accomplishing complicated calculation depends mainly on the performance optimization engine. Although CH’s columnar storage has huge advantages, its performance is still overtaken by ORA that uses the row-based storage, which shows that the algorithm optimization ability of CH is far inferior to that of ORA.

TPC-H’s Q8 is a more complicated calculation, which needs to join multiple tables in sub-query. When this calculation is executed in CH, the result is still not obtained after running for over 2000 seconds, the most likely reason is that CH is crashed. In contrast, it takes 192 seconds to obtain the result in ORA. For Q9 calculation that adds like to sub-query of Q8, an error “out-of-memory” is directly prompted in CH, while the result is obtained after 234 seconds in ORA. Moreover, there are some other complicated operations that cannot be worked out in CH, so it is impossible to make an overall comparison.

Both CH and ORA are based on SQL, however, the statements that can be optimized in ORA cannot be executed normally in CH, which further proves that the optimization engine of CH is relatively poor.

It is said that CH is only good at performing single table traversal operation, and its performance is not even as good as MySQL when involving association operation. From the above test results, this statement does not seem to be false and groundless. Therefore, those who want to use CH should consider such a question: what is the application scope of single table traversal operation?

esProc SPL

Open-source esProc SPL also takes high performance as an advantage to publicize. Let's compare it with CH and ORA.

We still use TPC-H to test:

..

As can be seen that for the more complicated operations Q2, Q3, and Q7, SPL runs faster than CH and ORA. For Q8 and Q9 that CH can't obtain the result, it takes 37 and 68 seconds respectively in SPL, which are also faster than ORA. The reason is that SPL can use better algorithms, and their computational complexity is lower than that of optimized SQL in ORA, and far lower than SQL executed in CH. In addition, with the help of columnar storage, SPL developed in Java finally outperforms CH and ORA implemented in C++.

Now we can roughly draw a conclusion that esProc SPL performs very well in both simple and complicated calculations.

However, for simple operations like Q1, CH performs slightly better than SPL, which seems to further prove the previous conclusion that CH is particularly good at simple traversal operation.

Wait a second, this is not the final conclusion. SPL has a “secret weapon”.

SPL provides a columnar cursor mechanism in its enterprise edition. Let's conduct a comparative test on the performance of SPL (using columnar cursor) and CH. The test is to perform the simplest grouping and aggregating calculation, and the data amount is 800 million. (Since the configuration of machine used in this test is slightly lower than the one used in previous TPC-H tests, the test results will be different, but it is not important. Here we mainly focus on the relative value.)

For simple grouping and aggregating calculations, SQL code:

SQL1

SELECT mod(id, 100) AS Aid, max(amount) AS Amax
FROM test.t
GROUP BY mod(id, 100)

The test result is shown in the figure below:

..

After using the columnar cursor mechanism, SPL’s performance in simple traversing and grouping calculation is the same as CH. If the columnar cursor was used in Q1 test, SPL would also achieve the same performance as CH.

During the test, we found that when 800 million pieces of data are stored as text format, they occupy 15G of disk space, yet they occupy 5.4G space in CH, and 8G in SPL. It indicates that both CH and SPL compress the data, and the compression ratio of CH is higher, which further proves that the storage engine of CH is actually well. However, since SPL achieves the same performance as CH, it shows that both the storage engine and algorithm optimization of SPL are all doing well, and the high-performance computing ability is more balanced.

SPL of current version is written in Java. In Java, the speed of generating the objects for computation is very slow after reading data, while there is no such problem in CH developed in C++. For complicated operations, since the proportion of data-reading time is not high, the performance affected by slow generation of objects in Java is not obvious; for simple traversal operation, however, because the proportion of data-reading time is high, SPL runs slower than CH in the previous test. The columnar cursor optimizes the data reading scheme, and no longer generates many small objects, which makes the number of times of generating objects reduce greatly. In this way, the performance gap is narrowed. From the perspective of storage alone, there is no obvious difference between SPL and CH.

Next, let's look at the comparative test on the conventional calculation TopN. SQL code:

SQL2:
SELECT * FROM test.t ORDER BY amount DESC LIMIT 100

Comparative test result:

..

Looking at CH's SQL2 alone, the conventional method of calculating TopN is to take the first N data following the full sorting. When the amount of data is large, the performance would be very poor if full sorting was done. The test result of SQL2 shows that the optimization should already be done in CH just like SPL, and full sorting is avoided, therefore, both perform fast, and SPL is a little faster.

In other words, esProc SPL performs better no matter it is a simple or complicated operation.

Further gap

The gap is more than that.

As mentioned earlier, both CH and ORA use SQL, and both are based on the relational model, so they all face the problem of SQL optimization. TPC-H test proves that some scenarios that can be optimized in ORA cannot be optimized in CH, or even the calculation result cannot be obtained in CH. Therefore, for calculations that ORA can't optimize, it is more impossible for CH to optimize. For example, we change the simple grouping and aggregating calculation of SQL1 to two grouping and aggregating results and then join them. The code written in CH’s SQL is roughly like this:

SQL3:

SELECT *
FROM (
 SELECT mod(id, 100) AS Aid, max(amount) AS Amax
 FROM test.t
 GROUP BY mod(id, 100)
 ) A
 JOIN (
 SELECT floor(id / 200000) AS Bid, min(amount) AS Bmin
 FROM test.t
 GROUP BY floor(id / 200000)
 ) B
 ON A.Aid = B.Bid

In this case, the comparative test result shows that the computing time in CH is doubled, and that remains unchanged in SPL:

..

The reason for this result is that SPL uses not only the columnar cursor, but the multipurpose traversal mechanism. This mechanism allows SPL to calculate multiple grouped results through one traversal, which greatly reduces the access amount to hard disk. In contrast, CH's SQL cannot write such operations, and can only rely on the optimization ability of CH itself. Yet, the optimization ability of CH algorithm is very poor, and its optimization engine did not work in this test, and hence it has to traverse twice. As a result, the performance is decreased by half.

SPL code for implementing multipurpose traversal is very simple, roughly as follows:


A

B

1

=file("topn.ctx").open().cursor@mv(id,amount)

2

cursor A1

=A2.groups(id%100:Aid;max(amount):Amax)

3

cursor

=A3.groups(id\200000:Bid;min(amount):Bmin)

4

=A2.join@i(Aid,A3:Bid,Bid,Bmin)

Now we change the conventional calculation of TopN in SQL2 to calculating the in-group TopN after grouping. SQL code:

SQL4:

SELECT
    gid,
    groupArray(100)(amount) AS amount
FROM
(
    SELECT
        mod(id, 10) AS gid,
        amount
    FROM test.topn
    ORDER BY
        gid ASC,
        amount DESC
) AS a
GROUP BY gid

The result of comparative test on the grouped TopN calculation is as follows:

..

The speed of calculating grouped TopN in CH is 42 times slower than that of conventional TopN calculation, showing that CH is likely to perform the sorting action in this case. That is to say, after the calculation becomes complicated, the optimization engine of CH also did not work. Unlike SQL, SPL regards TopN as an aggregation operation, and use the same calculation logic as that used in operations like sum and count, that is, it only needs to traverse the original data once. In this way, calculating the in-group TopN after grouping are the same as summing and counting, which avoids sorting calculation. Therefore, the speed of calculating the grouped TopN in SPL is 22 times faster than CH.

Moreover, SPL code for computing the grouped TopN is not complicated:


A

1

=file("topn.ctx").open().cursor@mv(id,amount)

2

=A1.groups(id%10:gid;top(10;-amount)).news(#2;gid,~.amount)

Not only running fast

Let’s take a look at the common funnel operation in e-commerce system. SPL code is still very concise:


A

B

1

=["etype1","etype2","etype3"]

=file("event.ctx").open()

2

=B1.cursor(id,etime,etype;etime>=date("2021-01-10") && etime<date("2021-01-25") && A1.contain(etype) && …)

3

=A2.group(id).(~.sort(etime))

=A3.new(~.select@1(etype==A1(1)):first,~:all).select(first)

4

=B3.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime, null))))

5

=A4.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3)

CH's SQL cannot implement such calculation. Let's take ORA as an example to see the writing method in SQL for the three-step funnel:

with e1 as (
	 select gid,1 as step1,min(etime) as t1
	 from T
	 where etime>= to_date('2021-01-10', 'yyyy-MM-dd') and etime<to_date('2021-01-25', 'yyyy-MM-dd')
		    and eventtype='eventtype1' and …
	group by 1
),
with e2 as (
	select gid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2
	from T as e2
	inner join e1 on e2.gid = e1.gid
	where e2.etime>= to_date('2021-01-10', 'yyyy-MM-dd') and e2.etime<to_date('2021-01-25', 'yyyy-MM-dd') and e2.etime > t1
		    and e2.etime < t1 + 7
		    and eventtype='eventtype2' and …
	group by 1
),
with e3 as (
	select gid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3
	from T as e3
	inner join e2 on e3.gid = e2.gid
	where e3.etime>= to_date('2021-01-10', 'yyyy-MM-dd') and e3.etime<to_date('2021-01-25', 'yyyy-MM-dd') and e3.etime > t2
		    and e3.etime < t1 + 7
		    and eventtype='eventtype3' and …
	group by 1
)
Select
  sum(step1) as step1,
  sum(step2) as step2,
  sum(step3) as step3
from
  e1
  left join e2 on e1.gid = e2.gid
  left join e3 on e2.gid = e3.gid

It needs to write more than 30 lines of code when using ORA's SQL to implement the three-step funnel operation, which is quite difficult to understand. Moreover, this code is related to the number of steps of funnel, every extra step needs to add one sub-query. In contrast, SPL is much simpler, and the same SPL code can handle any number of steps.

For such complex SQL code, it is very difficult to write, let alone performance optimization.

CH's SQL is far inferior to ORA, and unable to code such complicated logic in general, and it can only write C++ code externally to implement. In other words, only the storage engine of CH can be used in this case. While external computing in C++ has the potential to obtain good performance, the development cost is very high. There are many similar examples, none of which can be directly implemented in CH.

In conclusion, using CH to calculate some simple scenarios (such as traversing a single table) is indeed very fast, and performs almost the same as SPL. However, for high-performance computing, we should not consider the speed of simple calculations only, but various scenarios. For complicated operations, SPL not only far outperforms CH in performance but is much simpler in coding. SPL is able to process all scenarios of high-performance data computing, and it can be said that SPL defeats CH completely.