Performance optimization skill: Associating Big Fact Table with Big Dimension Table
I Problem introduction & Solution
In Performance Optimization Skill: Associating Small Fact Table with Big Dimension Table, the SPL algorithm utilizes the feature of a small fact table which can be loaded in memory, thus it sorts and collects all the association key values from the fact table, and searches in the dimension table to find the target records, which avoids traversing the big dimension table. But how can we improve the performance if both the fact table and the dimension table exceed the memory?
The solution provided in SQL is to respectively HASH partition the fact table and the dimension table into small parts that can be loaded in memory, write them on the external storage, and then load each of them in memory to do the inmemory association. Unfortunately, if a certain part is still too large for the memory, a second HASH partitioning is needed. Meanwhile, we need to do HASH partitioning on both tables, that is, buffering all the data of both tables.
If the dimension table is stored in order, we can segment it evenly, calculate the maximum and minimum values of each segment, and then partition the fact table according to the above values. As we can load the dimension table directly by segments, only the fact table needs to be partitioned and buffered, this method is thus called the oneside partition scheme. Moreover, since the dimension table can be evenly segmented, there is no possibility of bad luck which results in some overlarge parts like in the case of HASH algorithm. One single partitioning will definitely implement the query and guarantee better performance.
SPL provides the above solution, so let’s test how fast it performs and compare it with the HASH JOIN algorithm in Oracle.
II Test environment & computing scenario
The test computer has two Intel2670 CPUs, 2.6G frequency, 16 cores in total, 64G memory and an SSD hard disk, where a 16core virtual machine with 64G memory is set for testing.
On the virtual machine, we create a dimension table “account” which consists of 3 fields (accountid, name and state) with a total of 10 billion records, and a fact table ”trade” that contains 16 billion records with 4 fields – tradedate, outid (account from which money is transferred), receiveid (account to which money is transferred) and amount (transfer amount). the accountid in the “account” table is the foreign key for the outid field and received field of the fact table, both are in onetomany relationships.
In Performance Optimization Skill: Associating Small Fact Table with Big Dimension Table, both outid and receiveid fields in the fact table have to be associated with the accounted field in the dimension table “account”, which is called the dualdimension table. According to the test result, it takes Oracle nearly 5 hours to run the query when the record volume in the fact table is 15 million. So for a 1billionrecord fact table, the execution time is estimated to be over 24 hours. So we choose to associate only the outid field with the dimension table, which is called unidimension table. Our goal is to query the total transfer amount of each state in a certain period of time.
In SPL tests, both dualdimension and unidimension will be executed to compare with each other. And we will adopt 4thread parallel processing in all tests to shorten the test time.
III Tests
1. Oracle
SQL query statement:
select /*+ parallel(4) */
state,
sum(amount) as amount
from
account,
trade
where
outid = accountid
and tradedate >= date '20080101' + interval '1500' day(4)
group by
state
order by
state;
/*+ parallel(4) */ defines 4 parallel threads.
2. SPL
Test SPL script:
A 

1 
=now() 
2 
=elapse(date("20080101"),1500) 
3 
=file(path+"account.ctx").open() 
4 
=file(path+"trade.ctx").open().cursor@m(outid,amount;tradedate>=A2;4) 
5 
=A4.joinx@u(outid,A3:accountid,state;4000000) 
6 
=A5.groups(state;sum(amount):amount) 
7 
=interval@s(A1,now()) 
The joinx function with @u option is used to associate the big fact table with the big dimension table. Here the function’s last parameter defines the number of records to be retrieved from the cursor at each association when the cursor is split into multicursor. When there is enough available memory, the greater this value, the better the performance.
3. Test results & explanations
Below are the results of tests over the fact table with different data volumes (Unit: Sec):
Record number of the filtered fact table 
1 billion 
1.2 billion 
1.4 billion 
1.5 billion 
1.6 billion 
Oracle 
730 
802 
860 
894 
>10 hours 
SPL 
486 
562 
643 
681 
730 
It is measured that 1billionrow table normally exceeds 8G of memory. And an optimal Oracle algorithm is able to load in 1.5billion rows by adopting the data compression technique. But as we observe, the memory will reach its limit and a large amount of swap space will be occupied when there are 1.6 billion rows of data, resulting in extremely slow query execution. And the query has to be terminated after 11hour execution without a result. While the oneside partition algorithm in SPL, originally designed for external storage, can process data of any size with only one partitioning, which leads to essentially linear increase in time.
IV SPL tests on dualdimension and unidimension tables
1. Unidimension table
SPL query script:
A 

1 
=now() 
2 
=elapse(date("20080101"),1500) 
3 
=file(path+"account.ctx").open() 
4 
=file(path+"trade.ctx").open().cursor@m(outid,receiveid,amount;tradedate>=A2;4) 
5 
=A4.joinx@u(outid,A3:accountid,state;4000000) 
6 
=A5.groups(state;sum(amount):amount) 
7 
=interval@s(A1,now()) 
2. Dualdimension table
SPL query script:
A 

1 
=now() 
2 
=elapse(date("20080101"),1500) 
3 
=file(path+"account.ctx").open() 
4 
=file(path+"trade.ctx").open().cursor@m(outid,receiveid,amount;tradedate>=A2;4) 
5 
=A4.joinx@u(outid,A3:accountid,state:out_state;receiveid,A3:accountid,state:receive_state;4000000) 
6 
=A5.groups(out_state;sum(amount):amount) 
7 
=interval@s(A1,now()) 
3. Test results & explanations
Below are the results of tests over the fact table with different data volumes (Unit: Sec):
Record number of the filtered fact table 
1 billion 
1.2 billion 
1.4 billion 
1.6 billion 
Unidimension table 
500 
614 
664 
782 
Dualdimension table 
1146 
1375 
1501 
1957 
The dualdimension table has twice as much computation amount as the unidimension table, and the execution time is only slightly doubled, which also increases linearly and does not lead to completely uncontrollable situations.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version