Performance optimization skill: Joining Big Fact Data and Big Dimension Data
I Problem introduction & Solution
In Performance Optimization: The Join between Small Fact Data and Big Dimension Data, the SPL algorithm makes the most of available memory: it retrieves as large a batch of join key values from the fact table as memory allows, sorts them, and compares them with the key values in the dimension table to find the eligible records. This avoids traversing the big dimension table. But what can we do if both the fact data and the dimension data exceed the memory space?
SQL’s solution is to divide both the fact table and the dimension table, according to hashed key values, into partitions small enough to fit in memory, write the partitions to external storage one by one, and then read each pair of matching partitions back in to do the in-memory join, as sketched below. If a partition is still too large for memory, a second round of hash partitioning is needed. The method requires buffering all data of both tables.
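Conceptually, the two-phase process looks like the following minimal Python sketch. It is an illustration under simplifying assumptions (rows are tuples of strings, partition files are comma-separated text, keys are strings), not how any particular database implements it:

# Minimal sketch of SQL's two-phase (partition, then join) hash join.
# Assumptions: rows are tuples of strings with the join key at a known index;
# real databases stream, compress and spill far more carefully than this.
P = 64  # partition count, chosen so that every partition fits in memory

def partition(rows, key, prefix):
    """Phase 1: route each row to one of P partition files by hashing its join key."""
    files = [open(f"{prefix}_{p}.tmp", "w") for p in range(P)]
    for row in rows:
        files[hash(row[key]) % P].write(",".join(row) + "\n")
    for f in files:
        f.close()

def join_partition(p):
    """Phase 2: join one matching pair of partitions entirely in memory."""
    payload_of = {}
    with open(f"dim_{p}.tmp") as f:        # build a hash table on the dimension side
        for line in f:
            key, payload = line.rstrip("\n").split(",", 1)
            payload_of[key] = payload
    with open(f"fact_{p}.tmp") as f:       # probe it with the fact side
        for line in f:
            key, amount = line.rstrip("\n").split(",", 1)
            if key in payload_of:
                yield payload_of[key], float(amount)

Note that every row of both tables is written to disk once and read back once (and twice that if an oversized partition forces re-partitioning), which is the main cost of this scheme.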
If the dimension table is ordered by its key, we can divide it evenly into segments small enough to fit in memory, calculate the start and end key values of each segment, and then partition the fact table by those boundaries. Since each dimension segment can be read directly in place, only the fact table needs to be partitioned and buffered; we call this the hash-fact-table-only scheme (see the sketch below). Because the dimension table is evenly divided, every segment is guaranteed to fit in memory, so the partitioning needs to be done only once, which improves overall performance.
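Here is a minimal Python sketch of this one-sided scheme, under the same simplifying assumptions as above. The helper read_dim_segment and the boundary list bounds are hypothetical stand-ins for reading an ordered dimension table by position; neither is a real API:

import bisect

def join_by_segments(fact_rows, bounds, read_dim_segment):
    """bounds[s] is the first key of dimension segment s (bounds[0] is the
    minimum key); read_dim_segment(s) yields the (key, payload) pairs of
    segment s only, which is possible because the dimension table is stored
    in key order. Both names are illustrative assumptions."""
    # One pass over the fact table: buffer each record into the file of the
    # segment whose key range covers its foreign key.
    files = [open(f"fact_seg_{s}.tmp", "w") for s in range(len(bounds))]
    for key, amount in fact_rows:
        files[bisect.bisect_right(bounds, key) - 1].write(f"{key},{amount}\n")
    for f in files:
        f.close()
    # Then join segment by segment: the dimension side is read in place and
    # never written out, so only the fact table is buffered to disk.
    for s in range(len(bounds)):
        payload_of = dict(read_dim_segment(s))   # one segment fits in memory
        with open(f"fact_seg_{s}.tmp") as f:
            for line in f:
                key, amount = line.rstrip("\n").split(",", 1)
                if key in payload_of:
                    yield payload_of[key], float(amount)

Unlike the two-sided hash scheme, routing here is by key range rather than by a generic hash, which is exactly what an ordered dimension table makes possible.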
SPL uses this second solution. Now let’s look at how fast it runs compared with Oracle’s HASH JOIN algorithm.
II Test environment & computing scenario
The test computer has two 16-core Intel 2670 processors clocked at 2.6 GHz, 64 GB of RAM and an SSD, and hosts a 16-core virtual machine with 64 GB of RAM where the tests run.
On the virtual machine we have created a dimension table named account, which consists of 3 fields (accountid, name and state) and holds a total of 10 billion records, and a fact table named trade, which contains 16 billion records with 4 fields: tradedate, outid (the account money is transferred from), receiveid (the account money is transferred to) and amount (the transferred amount). The fact table’s outid and receiveid fields are both foreign keys referencing the dimension table’s accountid field, and each of the two relationships is a one-to-many relationship.
In Performance Optimization: The Join between Small Fact Data and Big Dimension Data, the fact table has two fields relating to the accountid field of the dimension table account, which we call a dual-dimension join. According to that test, Oracle takes nearly 5 hours to execute the query when the fact table holds 15 million records, so for a 10-billion-record fact table the estimated time would exceed 24 hours. Here we therefore relate only one fact table field to the dimension table, which we call a uni-dimension join. Our goal is to calculate the total transferred amount over a certain period of time.
SPL is tested with both the uni-dimension and the dual-dimension join. All tests use 4-thread parallel processing to shorten the test time.
III Tests
1. With Oracle
SQL query:
select /*+ parallel(4) */
state,
sum(amount) as amount
from
account,
trade
where
outid = accountid
and tradedate >= date '2008-01-01' + interval '1500' day(4)
group by
state
order by
state;
The hint /*+ parallel(4) */ specifies 4 parallel threads.
2. In SPL
Test SPL script:
  | A
1 | =now()
2 | =elapse(date("2008-01-01"),1500)
3 | =file(path+"account.ctx").create()
4 | =file(path+"trade.ctx").create().cursor@m(outid,amount;tradedate>=A2;4)
5 | =A4.joinx@u(outid,A3:accountid,state;4000000)
6 | =A5.groups(state;sum(amount):amount)
7 | =interval@s(A1,now())
The joinx() function with the @u option speeds up the join when both the fact data and the dimension data are big. Its last parameter specifies the number of records read into each sub-cursor when the fact table cursor is split; when enough memory is available, the greater this value, the higher the performance.
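As a rough way to think about sizing that parameter, one can budget the join buffer per thread. The following Python arithmetic is purely illustrative; every number in it is an assumption rather than a measurement from this test:

# Illustrative sizing of joinx@u's per-sub-cursor record count (all assumptions).
usable_bytes = 32 * 1024**3   # suppose half of the 64 GB VM is usable as join buffer
bytes_per_record = 64         # assumed in-memory footprint of one fact record's join fields
threads = 4                   # the @m cursor runs 4 parallel ways sharing that buffer

batch = usable_bytes // (bytes_per_record * threads)
print(batch)  # 134217728, i.e. ~134 million; the script above uses a conservative 4000000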
3. Test results & explanations
Below are the results of tests over different volumes of fact data (unit: seconds):
Records count in the fact table after filtering | 10 billion | 12 billion | 14 billion | 15 billion | 16 billion
Oracle | 730 | 802 | 860 | 894 | >10 hours
SPL | 486 | 562 | 643 | 681 | 730
According to our measurement, a 10-billion-row table normally takes up more than 8 GB of memory. By adopting data compression, an optimized Oracle algorithm can still load 15 billion rows. But as we observed, when the data grew to 16 billion rows, memory usage reached its limit and a large amount of swap space was occupied, which slowed the query down drastically: it had not finished after 11 hours of execution, so we terminated it. SPL’s hash-fact-table-only algorithm, designed for scenarios where a large amount of data is stored in external memory, can handle data of any size and needs to partition only one table, so its execution time grows linearly with the data volume.
IV SPL tests on uni-dimension join & dual-dimension join
1. Uni-dimension
SPL query:
  | A
1 | =now()
2 | =elapse(date("2008-01-01"),1500)
3 | =file(path+"account.ctx").create()
4 | =file(path+"trade.ctx").create().cursor@m(outid,receiveid,amount;tradedate>=A2;4)
5 | =A4.joinx@u(outid,A3:accountid,state;4000000)
6 | =A5.groups(state;sum(amount):amount)
7 | =interval@s(A1,now())

This script differs from the one in Section III only in that A4 also reads the receiveid field, so that the uni-dimension and dual-dimension tests scan the same data.
2. Dual-dimension
SPL query:
  | A
1 | =now()
2 | =elapse(date("2008-01-01"),1500)
3 | =file(path+"account.ctx").create()
4 | =file(path+"trade.ctx").create().cursor@m(outid,receiveid,amount;tradedate>=A2;4)
5 | =A4.joinx@u(outid,A3:accountid,state:out_state;receiveid,A3:accountid,state:receive_state;4000000)
6 | =A5.groups(out_state;sum(amount):amount)
7 | =interval@s(A1,now())

In A5, joinx@u relates both outid and receiveid to accountid in a single pass, attaching the matching account’s state as out_state and receive_state respectively.
3. Test results & explanations
Below are the results of tests over different volumes of fact data (unit: seconds):
Records count in the fact table after filtering | 10 billion | 12 billion | 14 billion | 16 billion
Uni-dimension | 500 | 614 | 664 | 782
Dual-dimension | 1146 | 1375 | 1501 | 1957
The dual-dimension join involves twice as much computation as the uni-dimension join, yet its execution time is only a little more than double and still grows linearly. The runaway slowdown we saw with Oracle does not occur.