Performance optimization skill: Partial Pre-Association

 

I. Problem introduction & solving

  In Performance optimization skill: Pre-Association, we tested the technique of pre-loading data tables in memory and querying after pre-association. But if the size of dimension table(s) and fact table exceeds the memory, we can load the dimensional table(s) only in memory and index it (them) to achieve at least half of the hash computation, which is called partial pre-association.

Well use the lineitem table, which has the largest data volume and can not be loaded in memory, to test this solution by pre-loading the seven relatively small dimension files and leaving lineitem, the fact file, to be loaded in real time during the query.

 

II. SQL query test

  Here Oracle database is still taken as the representative for testing SQL code, which calculates the total revenue of parts orders every year over the lineitem table.

 

1. Two-table join

  The SQL query is as follows:

select
       l_year,
       sum(volume) as revenue
from
       (
              select
                     extract(year from l_shipdate) as l_year,
                     (l_extendedprice * (1 - l_discount) ) as volume
              from
                     lineitem,
                     part
              where
                     p_partkey = l_partkey
                     and length(p_type)>2
       ) shipping
group by
       l_year
order by
       l_year;

2. Six-table join

  The SQL query is as follows:

select
       l_year,
       sum(volume) as revenue
from
       (
              select
                     extract(year from l_shipdate) as l_year,
                     (l_extendedprice * (1 - l_discount) ) as volume
              from
                     supplier,
                     lineitem,
                     orders,
                     customer,
                     part,
                     nation n1,
                     nation n2
              where
                     s_suppkey = l_suppkey
                     and p_partkey = l_partkey
                     and o_orderkey = l_orderkey
                     and c_custkey = o_custkey
                     and s_nationkey = n1.n_nationkey
                     and c_nationkey = n2.n_nationkey
                     and length(p_type) > 2
                     and n1.n_name is not null
                     and n2.n_name is not null
                     and s_suppkey > 0
       ) shipping
group by
       l_year
order by
       l_year;

3. Test results


Two-table join Six-table join
Query time (s) 235 2669

  Both results are the best-performing among multiple executions.

  The six-table join is 2669/235=11.4 times slower than the two-table join, indicting the notable performance degradation.

 

III. SPL partial pre-association test

1. Partial pre-association

  SPL script:


A
1 >env(region,   file(path+"region.ctx").open().memory().keys@i(R_REGIONKEY))
2 >env(nation,   file(path+"nation.ctx").open().memory().keys@i(N_NATIONKEY))
3 >env(supplier,   file(path+"supplier.ctx").open().memory().keys@i(S_SUPPKEY))
4 >env(customer,   file(path+"customer.ctx").open().memory().keys@i(C_CUSTKEY))
5 >env(part,   file(path+"part.ctx").open().memory().keys@i(P_PARTKEY))
6 >env(orders,file(path+"orders.ctx").open().memory().keys@i(O_ORDERKEY))
7 >nation.switch(N_REGIONKEY,region)
8 >customer.switch(C_NATIONKEY,nation)
9 >supplier.switch(S_NATIONKEY,nation)
10 >orders.switch(O_CUSTKEY,customer)

  The first 6 lines read in six dimension tables respectively in memory to generate six in-memory tables, create indexes on them and set them as global variables. The last 4 lines establish association among the six in-memory dimension tables. The pre-association script will be executed at the start of the SPL server to make preparations for subsequent query.

 

2. Two-table join

  SPL script:


A
1 =file("/home/btx/lineitem.btx").cursor@tb(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE)
2 =A1.switch@i(L_PARTKEY,part).select(len(L_PARTKEY.P_TYPE)>2)
3 =A2.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE*(1 - L_DISCOUNT)):revenue)

  Temporary loading needs the cursor-style retrieval and association on cursor. The code after that is similar to that for all-in-memory processing.

 

3. Six-table join

  SPL script:


A
1 =file("/home/btx/lineitem.btx").cursor@tb(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE)
2 =A1.switch@i(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier)
3 =A2.select(len(L_PARTKEY.P_TYPE)>2 && L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME!=null &&  L_SUPPKEY.S_NATIONKEY.N_NAME != null && L_SUPPKEY.S_SUPPKEY>0 )
4 =A3.groups(year(L_SHIPDATE):l_year;sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue)

  Similar to the all-in-memory processing, the code after the cursor-style retrieval and association is quite concise and easy to understand.

 

4. Test results


Two-table join Six-table join
Query time (s) 266 472

  The six-table join works only 1.8 times slower than the two-table join. The extra time is spent in associating L_ORDERKEY field and L_SUPPKEY field to corresponding dimension tables respectively and performing filtering by specified conditions (i.e. referring fields of these associated tables). Theres no need to handle associations among dimension tables because we already pre-associated them. And the performance also became faster since we cut off half of the hash computation for joins between the fact table lineitem and the dimension tables by creating indexes on them in advance.

 

VI. Summary

  Test results:

Query time (s) Two-table join Six-table join Performance decline (times)
SQL 235 2669 11.4
SPL pre-association 266 472 1.8

  It takes SQL 11.4 times slower to execute the six-table join than to execute the two-table join, which shows that SQL consumes a lot of CPU to process JOIN, resulting in considerable decrease in performance. By contrast, SPL six-table join is only 1.8 times slower after pre-association. The latter is able to process even more table joins with a pretty stable performance.

  In conclusion, for a query involving a lot of tables joins, if the memory is large enough to hold all the dimension tables, the partial pre-association in SPL can improve the performance effectively. Unfortunately, for relational databases, the database engine cant optimize itself when many table joins are required, thus leading to inevitable, serious performance degradation.