Performance optimization skill: Pre-Joining

I Problem introduction & solving

SQL JOIN performance is a long-standing problem. Especially when there are a lot of to-be-joined tables, the performance plummets.

SQL handles JOINs using hash group approach. The method first calculates hash values of the join keys and then traverses and compares records with same hash values from two tables. The set of computations is necessary for each JOIN.

If the data size is comparatively small and data can fit into the memory, we can load data into the memory in advance and create the association using the in-memory pointer mechanism. Specifically, this approach calculates hash values and makes comparison while loading data and save the associations with pointers. The subsequent calculations can directly reference the associated records. This saves the trouble of calculating and comparing hash values and increases performance.

The problem is that SQL doesn’t support pointer data type. So the optimization remains an idea. Even if the data can be wholly loaded into the memory, we still cant take the advantage of the pre-joining approach. The SQL-based relational database products inherit the problem, too. By contrast, SPL can implement the optimization thanks to its support of pointer data type.

In the following part well test SQLs performance in achieving a two-table join and multi-table joins and SPLs performance in achieving them using the pre-joining trick.

 

II Test environment

Eight data tables, a total of 50G data size (should be fit into the memory), have been generated according to TPCH benchmark.

There are two test computers. Each has an Intel2670 16-core CPU, a 2.6 GHz processor, a 128GB RAM and an SSD.

The test table is orderdetail, which is generated by cutting off records of lineitem table. It has the same structure and can fit into the memory.

For the convenience of the test, I adopt the single-threaded processing and just leave the multicore ability alone.

 

III SQL query testing

Here I write the test SQL code in Oracle to calculate the yearly total orders amount over orderdetail table.

1. Two-table join

SQL statements:

select

l_year,

sum(volume) as revenue

from

(

select

extract(year from l_shipdate) as l_year,

(l_extendedprice * (1 - l_discount) ) as volume

from

orderdetail,

part

where

p_partkey = l_partkey

and length(p_type)>2

) shipping

group by

l_year

order by

l_year;

 

2. Six-table join

SQL statements:

select

l_year,

sum(volume) as revenue

from

(

select

extract(year from l_shipdate) as l_year,

(l_extendedprice * (1 - l_discount) ) as volume

from

supplier,

orderdetail,

orders,

customer,

part,

nation n1,

nation n2

where

s_suppkey = l_suppkey

and p_partkey = l_partkey

and o_orderkey = l_orderkey

and c_custkey = o_custkey

and s_nationkey = n1.n_nationkey

and c_nationkey = n2.n_nationkey

and length(p_type) > 2

and n1.n_name is not null

and n2.n_name is not null

and s_suppkey > 0

) shipping

group by

l_year

order by

l_year;

 

3. Test results


Two-table join

Six-table join

Query time (s)

26

167

 

Both queries contain nest queries. The auto-optimized Oracle query performs even better than a query without nest queries (The latter may compute group by and select repeatedly).

Both results are the best-performing among multiple executions. We found that the first Oracle query execution is always the slowest. That means that when the memory is sufficient to hold all data the database will buffer all data into the memory (Oracle has a big buffer pool). We select the fastest execution to get the computing time as pure as possible by getting rid of the external memory retrieval time.

By defining that the filtering condition is always true, which means no record is filtered away, we perform the query over the whole orderdetail table. So both tests have equal computing amount.

According to the test results, the six-table join is 167/26=6.4 slower than the two-table join. Besides the external memory retrieval time, the rapid performance drop originates from the time-consuming table joins and simple comparisons between join key values.

SQLs JOIN performance is amazingly bad.

 

IV SPL pre-joining solution test

1. Pre-joining

SPL script:


A

1

>env(region,  file(path+"region.ctx").create().memory().keys@i(R_REGIONKEY))

2

>env(nation,  file(path+"nation.ctx").create().memory().keys@i(N_NATIONKEY))

3

>env(supplier,  file(path+"supplier.ctx").create().memory().keys@i(S_SUPPKEY))

4

>env(customer,  file(path+"customer.ctx").create().memory().keys@i(C_CUSTKEY))

5

>env(part,  file(path+"part.ctx").create().memory().keys@i(P_PARTKEY))

6

>env(orders,file(path+"orders.ctx").create().memory().keys@i(O_ORDERKEY))

7

>env(orderdetail,file(path+"orderdetail.ctx").create().memory())

8

>nation.switch(N_REGIONKEY,region)

9

>customer.switch(C_NATIONKEY,nation)

10

>supplier.switch(S_NATIONKEY,nation)

11

>orders.switch(O_CUSTKEY,customer)

12

>orderdetail.switch(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier)

 

The first 7 lines read in seven composite tables respectively into the memory to generate memory tables and set them as global variables. The last 5 lines establish association between the seven tables. The pre-joining script will be executed at the start of the SPL server to make preparations for subsequent query.

Below is the structure of pre-joined table objects in the memory (Take the first record of orderdetail table as the starting point.):

undefined 

 

2. Two-table join

SPL script:


A

1

=orderdetail.select(len(L_PARTKEY.P_TYPE)>2)

2

=A1.groups(year(L_SHIPDATE):l_year;  sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue)

 

3. Six-table join

SPL script:


A

1

=orderdetail.select(len(L_PARTKEY.P_TYPE)>2  && L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_SUPPKEY>0 )

2

=A1.groups(year(L_SHIPDATE):l_year;

sum(L_EXTENDEDPRICE  * (1 - L_DISCOUNT)):revenue)

 

The SPL script is simple after table is pre-joined. We can directly reference a field of the associated table, which can be treated as the sub-attribute of the referencing attribute. That makes the code easier to understand.

 

4. Test results


Two-table join

Six-table join

Query time (s)

28

56

The six-table join query is only 2 times slower than the two-table join query due to the added computing time (spent in referencing associated field). Yet we dont spend time on the join thanks to the pre-join.

 

V Summary

Test results:

Query time (s)

Two-table join

Six-table join

Performance decline (times)

SQL

26

167

6.4

SPL pre-joining

28

56

2

 

A six-table SQL join is 6.4 times slower than a two-table SQL join. This means that a SQL join uses a lot of CPU resources, which results in really bad performance. With the pre-joining method, the number is 2 and the performance drop is kept in a very limited range.

Pre-joining is an effective trick to increase join query performance when there are a lot of tables involved. SPL can use the method to make a difference while Relational databases, including in-memory databases, cannot.