Performance optimization skill: Joining Big Fact Data and Big Dimension Data

 

I Problem introduction & Solution

In Performance Optimization: The Join between Small Fact Data and Big Dimension Data, the SPL algorithm makes the most of the memory by retrieving a big enough batch of join key values from the fact table, sorting them, and comparing them with the key values in the dimension table to find the eligible records. This avoids traversing the big dimension table. But what can we do if both the fact data and the dimension data exceed the memory space?
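
As a reminder, that batch-sort-merge approach corresponds to SPL's joinx() without the @u option. Below is a minimal sketch under the assumption that it runs against the same account.ctx and trade.ctx composite tables described later in this article, with an illustrative batch size of 4,000,000:

 	A
1	=file(path+"account.ctx").create()
2	=file(path+"trade.ctx").create().cursor(outid,amount)
3	=A2.joinx(outid,A1:accountid,state;4000000)
4	=A3.groups(state;sum(amount):amount)

A3 fetches a batch of fact records at a time, sorts them by the join key outid, and merges them with the key-ordered dimension table, so the dimension table is never traversed as a whole.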

 

SQL's solution is to divide both the fact table and the dimension table, according to hash values of the join key, into partitions small enough to fit into memory, write them to external storage one by one, and then read each pair of corresponding partitions back in to do an in-memory join. If a partition is still too large for the memory, a second round of hash partitioning is needed. The method requires buffering all data of both tables.

 

If the dimension table is ordered by the join key, we can divide it evenly, calculate the start and end key values of each segment, and then partition the fact table accordingly. As the dimension table can be read in directly by segments, only the fact table needs to be partitioned by hash values and buffered. We call this the hash-fact-table-only scheme. Since the dimension table is divided evenly, each of its segments is sure to fit into memory, and the hash partitioning needs to be done only once, which increases the overall performance.
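
To make the segmented reading concrete, below is a minimal sketch, assuming account.ctx is a composite table ordered by accountid and that we divide it into 4 segments; the k:n parameter of cursor() reads segment k of n directly:

 	A
1	=file(path+"account.ctx").create()
2	=A1.cursor(accountid,state;;1:4)
3	=A2.fetch()

A2 opens a cursor over the first of four even segments of the dimension table, and A3 loads that segment into memory; the segment's start and end accountid values then determine which buffered partition of the fact table should be joined with it.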

 

SPL uses the second solution. Now let's look at how fast it runs and compare it with the Oracle HASH JOIN algorithm.

 

II Test environment & computing scenario

The test computer has two Intel 2670 processors (16 cores in total, 2.6 GHz), 64GB RAM and an SSD, on which a 16-core virtual machine with 64GB RAM is running.

On the virtual machine we have created a dimension table named account, which consists of three fields (accountid, name and state) and holds a total of 10 billion records, and a fact table named trade, which contains 16 billion records with four fields: tradedate, outid (the account from which money is transferred), receiveid (the account to which money is transferred) and amount (the transferred amount). accountid is the foreign key that the fact table's outid and receiveid fields point to, and each of the two relationships is a one-to-many relationship.
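
For reference, the two composite tables could be built with a script like the one below. This is only a sketch of the data preparation, not the script actually used in the test; srcAccount and srcTrade stand for hypothetical source cursors that are already ordered by accountid and tradedate respectively, and the # prefix marks the ordered dimension field of each composite table:

 	A
1	=file(path+"account.ctx").create(#accountid,name,state)
2	=A1.append(srcAccount)
3	=file(path+"trade.ctx").create(#tradedate,outid,receiveid,amount)
4	=A3.append(srcTrade)

Storing account ordered by accountid is what later allows the dimension table to be read directly by segments.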

 

In Performance Optimization: The Join between Small Fact Data and Big Dimension Data, the fact table has two fields related to the accountid field in the dimension table account, which we call a dual-dimension join. According to that test result, it takes Oracle nearly 5 hours to execute the query when the fact table holds 15 million records, so for a 10-billion-record fact table the estimated time would be over 24 hours. Therefore we choose to relate only one field in the fact table to the dimension table, which we call a uni-dimension join. Our goal is to calculate the total transferred amount over a certain period of time.

 

Both uni-dimension and dual-dimension joins are used in the SPL tests, and 4-thread parallel processing is adopted in all tests to shorten the test time.

 

III Tests

1. With Oracle

SQL query:

select /*+ parallel(4) */
	state,
	sum(amount) as amount
from
	account,
	trade
where
	outid = accountid
	and tradedate >= date '2008-01-01' + interval '1500' day(4)
group by
	state
order by
	state;

/*+ parallel(4) */ defines 4 parallel threads.

 

2. In SPL

Test SPL script:

 


 	A
1	=now()
2	=elapse(date("2008-01-01"),1500)
3	=file(path+"account.ctx").create()
4	=file(path+"trade.ctx").create().cursor@m(outid,amount;tradedate>=A2;4)
5	=A4.joinx@u(outid,A3:accountid,state;4000000)
6	=A5.groups(state;sum(amount):amount)
7	=interval@s(A1,now())

 

The joinx() function works with the @u option to speed up a join when both the fact data and the dimension data are big. The function's last parameter defines the number of records read in for each sub-cursor when the fact table cursor is split; when there is enough available memory, the greater this value, the higher the performance.
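
For example, on a machine with more free memory one might enlarge the batch so that the fact table cursor is split into fewer, larger sub-cursors — a hypothetical variant of cell A5 (8000000 is illustrative, not a measured optimum):

=A4.joinx@u(outid,A3:accountid,state;8000000)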

 

3. Test results & explanations

Below are results of tests over different volumes of fact data (Unit: Sec):

 

Records count in the fact table after filtering	10 billion	12 billion	14 billion	15 billion	16 billion
Oracle	730	802	860	894	>10 hours
SPL	486	562	643	681	730

 

According to our measurement, a 10-billion-row table normally takes more than 8GB of space. With data compression, an optimal Oracle algorithm is able to load up to 15 billion rows into memory. But as we observed, when the data increased to 16 billion rows the memory usage reached its limit and a large amount of swap space was occupied, which resulted in a ridiculously slow execution: the query had not finished after 11 hours, so we had to terminate it. SPL's hash-fact-table-only algorithm, designed for scenarios where a large amount of data is stored in external storage, can handle data of any size and needs to hash-partition only one table, so its execution time grows linearly with the data size.

 

IV SPL tests on uni-dimension join & dual-dimension join

1. Uni-dimension

SPL query:

 


 	A
1	=now()
2	=elapse(date("2008-01-01"),1500)
3	=file(path+"account.ctx").create()
4	=file(path+"trade.ctx").create().cursor@m(outid,receiveid,amount;tradedate>=A2;4)
5	=A4.joinx@u(outid,A3:accountid,state;4000000)
6	=A5.groups(state;sum(amount):amount)
7	=interval@s(A1,now())

 

2. Dual-dimension

SPL query:

 


 	A
1	=now()
2	=elapse(date("2008-01-01"),1500)
3	=file(path+"account.ctx").create()
4	=file(path+"trade.ctx").create().cursor@m(outid,receiveid,amount;tradedate>=A2;4)
5	=A4.joinx@u(outid,A3:accountid,state:out_state;receiveid,A3:accountid,state:receive_state;4000000)
6	=A5.groups(out_state;sum(amount):amount)
7	=interval@s(A1,now())
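
If the totals were also wanted per receiving account's state, cell A6 could group by both result fields — a hypothetical variant that was not part of the measured test:

=A5.groups(out_state,receive_state;sum(amount):amount)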

 

3. Test results & explanations

Below are results of tests over different volumes of fact data (Unit: Sec):

 

Records count in the fact table after filtering	10 billion	12 billion	14 billion	16 billion
Uni-dimension	500	614	664	782
Dual-dimension	1146	1375	1501	1957

 

The dual-dimension join involves twice as much computation as the uni-dimension join, but its execution time is only a little more than double and still grows linearly. The ridiculously slow situation we saw with Oracle never occurs.