Performance optimization skill: Associating Big Fact Table with Big Dimension Table
I Problem introduction & Solution
In Performance Optimization Skill: Associating Small Fact Table with Big Dimension Table, the SPL algorithm utilizes the feature of a small fact table which can be loaded in memory, thus it sorts and collects all the association key values from the fact table, and searches in the dimension table to find the target records, which avoids traversing the big dimension table. But how can we improve the performance if both the fact table and the dimension table exceed the memory?
The solution provided in SQL is to respectively HASH partition the fact table and the dimension table into small parts that can be loaded in memory, write them on the external storage, and then load each of them in memory to do the in-memory association. Unfortunately, if a certain part is still too large for the memory, a second HASH partitioning is needed. Meanwhile, we need to do HASH partitioning on both tables, that is, buffering all the data of both tables.
If the dimension table is stored in order, we can segment it evenly, calculate the maximum and minimum values of each segment, and then partition the fact table according to the above values. As we can load the dimension table directly by segments, only the fact table needs to be partitioned and buffered, this method is thus called the one-side partition scheme. Moreover, since the dimension table can be evenly segmented, there is no possibility of bad luck which results in some over-large parts like in the case of HASH algorithm. One single partitioning will definitely implement the query and guarantee better performance.
SPL provides the above solution, so let’s test how fast it performs and compare it with the HASH JOIN algorithm in Oracle.
II Test environment & computing scenario
The test computer has two Intel2670 CPUs, 2.6G frequency, 16 cores in total, 64G memory and an SSD hard disk, where a 16-core virtual machine with 64G memory is set for testing.
On the virtual machine, we create a dimension table “account” which consists of 3 fields (accountid, name and state) with a total of 10 billion records, and a fact table ”trade” that contains 16 billion records with 4 fields – tradedate, outid (account from which money is transferred), receiveid (account to which money is transferred) and amount (transfer amount). the accountid in the “account” table is the foreign key for the outid field and received field of the fact table, both are in one-to-many relationships.
In Performance Optimization Skill: Associating Small Fact Table with Big Dimension Table, both outid and receiveid fields in the fact table have to be associated with the accounted field in the dimension table “account”, which is called the dual-dimension table. According to the test result, it takes Oracle nearly 5 hours to run the query when the record volume in the fact table is 15 million. So for a 1-billion-record fact table, the execution time is estimated to be over 24 hours. So we choose to associate only the outid field with the dimension table, which is called uni-dimension table. Our goal is to query the total transfer amount of each state in a certain period of time.
In SPL tests, both dual-dimension and uni-dimension will be executed to compare with each other. And we will adopt 4-thread parallel processing in all tests to shorten the test time.
III Tests
1. Oracle
SQL query statement:
select /*+ parallel(4) */
state,
sum(amount) as amount
from
account,
trade
where
outid = accountid
and tradedate >= date '2008-01-01' + interval '1500' day(4)
group by
state
order by
state;
/*+ parallel(4) */ defines 4 parallel threads.
2. SPL
Test SPL script:
A |
|
1 |
=now() |
2 |
=elapse(date("2008-01-01"),1500) |
3 |
=file(path+"account.ctx").open() |
4 |
=file(path+"trade.ctx").open().cursor@m(outid,amount;tradedate>=A2;4) |
5 |
=A4.joinx@u(outid,A3:accountid,state;4000000) |
6 |
=A5.groups(state;sum(amount):amount) |
7 |
=interval@s(A1,now()) |
The joinx function with @u option is used to associate the big fact table with the big dimension table. Here the function’s last parameter defines the number of records to be retrieved from the cursor at each association when the cursor is split into multi-cursor. When there is enough available memory, the greater this value, the better the performance.
3. Test results & explanations
Below are the results of tests over the fact table with different data volumes (Unit: Sec):
Record number of the filtered fact table |
1 billion |
1.2 billion |
1.4 billion |
1.5 billion |
1.6 billion |
Oracle |
730 |
802 |
860 |
894 |
>10 hours |
SPL |
486 |
562 |
643 |
681 |
730 |
It is measured that 1-billion-row table normally exceeds 8G of memory. And an optimal Oracle algorithm is able to load in 1.5-billion rows by adopting the data compression technique. But as we observe, the memory will reach its limit and a large amount of swap space will be occupied when there are 1.6 billion rows of data, resulting in extremely slow query execution. And the query has to be terminated after 11-hour execution without a result. While the one-side partition algorithm in SPL, originally designed for external storage, can process data of any size with only one partitioning, which leads to essentially linear increase in time.
IV SPL tests on dual-dimension and uni-dimension tables
1. Uni-dimension table
SPL query script:
A |
|
1 |
=now() |
2 |
=elapse(date("2008-01-01"),1500) |
3 |
=file(path+"account.ctx").open() |
4 |
=file(path+"trade.ctx").open().cursor@m(outid,receiveid,amount;tradedate>=A2;4) |
5 |
=A4.joinx@u(outid,A3:accountid,state;4000000) |
6 |
=A5.groups(state;sum(amount):amount) |
7 |
=interval@s(A1,now()) |
2. Dual-dimension table
SPL query script:
A |
|
1 |
=now() |
2 |
=elapse(date("2008-01-01"),1500) |
3 |
=file(path+"account.ctx").open() |
4 |
=file(path+"trade.ctx").open().cursor@m(outid,receiveid,amount;tradedate>=A2;4) |
5 |
=A4.joinx@u(outid,A3:accountid,state:out_state;receiveid,A3:accountid,state:receive_state;4000000) |
6 |
=A5.groups(out_state;sum(amount):amount) |
7 |
=interval@s(A1,now()) |
3. Test results & explanations
Below are the results of tests over the fact table with different data volumes (Unit: Sec):
Record number of the filtered fact table |
1 billion |
1.2 billion |
1.4 billion |
1.6 billion |
Uni-dimension table |
500 |
614 |
664 |
782 |
Dual-dimension table |
1146 |
1375 |
1501 |
1957 |
The dual-dimension table has twice as much computation amount as the uni-dimension table, and the execution time is only slightly doubled, which also increases linearly and does not lead to completely uncontrollable situations.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version