Performance optimization skill: Pre-Joining of Dimensions

I. Problem introduction & solving

  During multidimensional analysis when both the dimension table(s) and the fact table can fit into the memory, we can load them in, pre-join them to make the subsequent query faster and more efficient. But if the size of data files exceeds the memory size, we can load in the dimensional table(s) only and index it (them) to achieve at least half of the hash computation. This is what I call pre-joining of dimensions. We’ll use our old data files (Those used in the above article) to test this solution by pre-loading the seven relatively small dimenson files and leaving lineitem, the fact file, to be loaded in real time during query.

 

II. SQL query testing

  Here I still write the test SQL code in Oracle to calculate the total yearly orders amount over lineitem table.

 

1. Two-table join

  The SQL query is as follows:

    select
        l_year,
        sum(volume) as revenue
    from
        (
            select
                extract(year from l_shipdate) as l_year,
                (l_extendedprice * (1 - l_discount) ) as volume
            from
                lineitem,
                part
            where
                p_partkey = l_partkey
                and length(p_type)>2
        ) shipping
    group by
        l_year
    order by
        l_year;

2. Six-table join

  The SQL query is as follows:

    select
        l_year,
        sum(volume) as revenue
    from
        (
            select
                extract(year from l_shipdate) as l_year,
                (l_extendedprice * (1 - l_discount) ) as volume
            from
                supplier,
                lineitem,
                orders,
                customer,
                part,
                nation n1,
                nation n2
            where
                s_suppkey = l_suppkey
                and p_partkey = l_partkey
                and o_orderkey = l_orderkey
                and c_custkey = o_custkey
                and s_nationkey = n1.n_nationkey
                and c_nationkey = n2.n_nationkey
                and length(p_type) > 2
                and n1.n_name is not null
                and n2.n_name is not null
                and s_suppkey > 0
        ) shipping
    group by
        l_year
    order by
        l_year;

3. Test results


Two-table   join

Six-table   join

Query   time (s)

235

2669

  Both results are the best-performing among multiple executions.

  The six-table join is (2669/235=11.4) times slower than the two-table join.

 

III. PL pre-joining-on-dimensions solution test

1. Pre-joining of dimensions

  SPL script:


A

1

>env(region,    file(path+"region.ctx").create().memory().keys@i(R_REGIONKEY))

2

>env(nation,    file(path+"nation.ctx").create().memory().keys@i(N_NATIONKEY))

3

>env(supplier,    file(path+"supplier.ctx").create().memory().keys@i(S_SUPPKEY))

4

>env(customer,    file(path+"customer.ctx").create().memory().keys@i(C_CUSTKEY))

5

>env(part,    file(path+"part.ctx").create().memory().keys@i(P_PARTKEY))

6

>env(orders,file(path+"orders.ctx").create().memory().keys@i(O_ORDERKEY))

7

>nation.switch(N_REGIONKEY,region)

8

>customer.switch(C_NATIONKEY,nation)

9

>supplier.switch(S_NATIONKEY,nation)

10

>orders.switch(O_CUSTKEY,customer)

  The first 6 lines read in six dimension tables respectively into the memory to generate six memory tables, create indexes for them and set them as global variables. The last 4 lines establish association between the six memory dimension tables. The pre-joining script will be executed at the start of the SPL server to make preparations for subsequent query.

 

2. Two-table join

  SPL script:


A

1

=file("/home/btx/lineitem.btx").cursor@tb(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE)

2

=A1.switch@i(L_PARTKEY,part).select(len(L_PARTKEY.P_TYPE)>2)

3

=A2.groups(year(L_SHIPDATE):l_year;  sum(L_EXTENDEDPRICE * (1 -   L_DISCOUNT)):revenue)

  An ad hoc loading needs the cursor-style retrieval and a pre-joining of cursors. The code after that is similar to that for a pure memory processing.

 

3. Six-table join

  SPL script:


A

1

=file("/home/btx/lineitem.btx").cursor@tb(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE)

2

=A1.switch@i(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier)

3

=A2.select(len(L_PARTKEY.P_TYPE)>2    &&   L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_NATIONKEY.N_NAME   != null  &&   L_SUPPKEY.S_SUPPKEY>0 )

4

=A3.groups(year(L_SHIPDATE):l_year;

sum(L_EXTENDEDPRICE  * (1 - L_DISCOUNT)):revenue)

  The code following the cursor-style retrieval and pre-joining is as concise and easy to understand as that for a pure memory processing.

 

4. Test results

  Two-table join Six-table join
Query time (s) 266 472

  The six-table join works only 1.8 times slower than the two-table join. It spends time linking L_ORDERKEY field and L_SUPPKEY field to corresponding dimension tables respectively and performing filtering by specified conditions (i.e. referencing fields of these associated tables). There’s no need to handle associations between dimension tables because we already pre-joined them. And it became faster because we cut off half of the hash computation for joins between the fact table lineitem and the dimension tables by creating indexes on them beforehand.

 

VI. Summary

  Test results:

Query time (s)

Two-table join

Six-table join

Performance decline (times)

SQL

235

2669

11.4

SPL pre-joining

266

472

1.8

  It takes SQL 11.4 times slower to execute the six-table join than to execute the two-table join. This shows that SQL needs a lot of CPU to process table joins, which results in considerably decrease in performance. By contrast, the figure is 1.8 for SPL’s pre-joining method. The latter is able to process more table joins with a more stable performance.

  The conclusion: with a query involving a lot of table joins, if the memory is large enough to hold all dimension tables, SPL’s pre-joining of dimension tables can increase the performance effectively. Unfortunately relational databases have the problem that the database engine can’t optimize a scenario requiring many table joins, and thus leading to an inevitable, seriously performance drop.