Performance Optimization Exercises Using TPC-H
1 Preparing data
We use TPC-H tool to generate the original data of text file format. There are altogether 8 tables, with a total size of 100G (See table sizes below).
Then we generate SPL binary files for these text files. Each data table corresponds to a binary file.
Acomposite table has index information that always requires at least a minimum space utilization, so it is not suitable for storing small data tables. That’s why we store small data tables in row-wise storage bin files, which helps decrease the size of data stored and enables loading all data at one time. A large data table, on the other hand, is stored in a column-wise storage composite table that uses primary key as the dimension. When dumping data to the composite table, use sortx() function to sort the data table.
1.1 region/nation
Generate a bin file (Take region table as an example):
A |
|
1 |
=file("region.tbl").import(; , "|").new(_1:R_REGIONKEY, _2:R_NAME, _3:R_COMMENT).sort(R_REGIONKEY) |
2 |
=file("region.btx").export@b(A1) |
There is no need to explicitly create data structure for a bin file, we just write data to it directly.
1.2 customer/supplier/part/orders
Generate a composite table (Take customer table as an example):
A |
|
1 |
=file("customer.tbl").cursor(; , "|").new(_1:C_CUSTKEY, _2:C_NAME, _3:C_ADDRESS, _4:C_NATIONKEY, _5:C_PHONE, _6:C_ACCTBAL, _7:C_MKTSEGMENT, _8:C_COMMENT).sortx(C_CUSTKEY;15000000) |
2 |
=file("customer.ctx").create(#C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT) |
3 |
>A2.append(A1) |
Creating a composite table requires listing data structure, specifying primary key field(s) (by preceding them with #), and sorting data by primary key. The second parameter in sortx() function is determined by memory size, and can be omitted without impact.
1.3 partsupp/lineitem
partsupp and lineitem are sub tables of part and orders respectively. To ensure synced segmentation between sub table and primary table, use @p option at composite table creation. Take lineitem table as an example:
A |
|
1 |
=file("lineitem.tbl").cursor(; , "|").new(_1:L_ORDERKEY, _4:L_LINENUMBER, _2:L_PARTKEY, _3:L_SUPPKEY, _5:L_QUANTITY, _6:L_EXTENDEDPRICE,_7:L_DISCOUNT, _8:L_TAX, _9:L_RETURNFLAG, _10:L_LINESTATUS,_11:L_SHIPDATE, _12:L_COMMITDATE, _13:L_RECEIPTDATE,_14:L_SHIPINSTRUCT, _15:L_SHIPMODE, _16:L_COMMENT).sortx(L_ORDERKEY,L_LINENUMBER;4000000) |
2 |
=file("lineitem.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT) |
3 |
>A2.append(A1) |
A2 uses @p option to ensure that records having same L_ORDERKEY will be put into same segment, so that we can get result of parallel computation correctly.
2 Data table size
Below lists sizes of the original files and SPL files:
Table |
Row count |
tbl size |
ctx/btxsize |
region |
5 |
1K |
8K |
nation |
25 |
2K |
10K |
customer |
15M |
2.4G |
1.3G |
part |
20M |
2.4G |
1.0G |
supplier |
1M |
140M |
0.1G |
partsupp |
80M |
12.2G |
4.9G |
orders |
150M |
17.8G |
7.2G |
lineitem |
600M |
79.5G |
29.4G |
3 TPC-H exercises explained
Performance Optimization Exercises Using TPC-H – Q1
Performance Optimization Exercises Using TPC-H – Q2
Performance Optimization Exercises Using TPC-H – Q3
Performance Optimization Exercises Using TPC-H – Q4
Performance Optimization Exercises Using TPC-H – Q5
Performance Optimization Exercises Using TPC-H – Q6
Performance Optimization Exercises Using TPC-H – Q7
Performance Optimization Exercises Using TPC-H – Q8
Performance Optimization Exercises Using TPC-H – Q9
Performance Optimization Exercises Using TPC-H – Q10
Performance Optimization Exercises Using TPC-H – Q11
Performance Optimization Exercises Using TPC-H – Q12
Performance Optimization Exercises Using TPC-H – Q13
Performance Optimization Exercises Using TPC-H – Q14
Performance Optimization Exercises Using TPC-H – Q15
Performance Optimization Exercises Using TPC-H – Q16
Performance Optimization Exercises Using TPC-H – Q17
Performance Optimization Exercises Using TPC-H – Q18
Performance Optimization Exercises Using TPC-H – Q19
Performance Optimization Exercises Using TPC-H – Q20
Performance Optimization Exercises Using TPC-H – Q21
Performance Optimization Exercises Using TPC-H – Q22
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version