Learn performance optimization skills from TPCH tests
I、 Objection
TPCH is a set of benchmarks issued by the Transaction Processing Performance Council for the decision and support abilities of databases. By simulating complex queries related to business in database, TPCH examines the comprehensive processing ability of database and obtains the response time of database operation.
The TPCH benchmark model defines a database model whose capacity can be selected at 8 levels ranging from 1GB to 10000GB. The database model consists of eight data tables including CUSTOMER, LINEITEM, NATION, ORDERS, PART, PARTSUPP, REGION and SUPPLER, as well as 22 SQL query statements. It covers a wide range of contents and can test the operation performance of the database thoroughly.
The SQL of TPCH contains some complex queries with multi-layer nesting which results in poor performance. For these queries, if we can adopt more reasonable storage schemes, design low-complexity algorithms and cooperate with parallel skills, better performance can be obtained. Unfortunately, due to the limitation of the theory, many ideas can not be implemented in SQL. As a result, SQL programmers do not pay attention to these performance optimization methods either, and only endure the low-speed operation of database.
This series of articles will discuss the performance optimization techniques of data operation through these 8 tables and 22 SQL statements. Each statement will be carefully analyzed to find its operation and data characteristics, and more reasonable algorithms will be designed to achieve the goal. Because it is difficult for SQL to implement these algorithms and storage schemes, we will use esProc composite tables to store the data, and use SPL to implement these algorithms as well as to compare the performance with SQL on Oracle. It is hoped that by the following contents readers will be able to learn the high performance computing technologies that are applicable in various situations.
II、 Environment
CPU: 2 Intel 3014, main frequency 1.7G, 6 cores per CPU.
Hard Disk (SSD): 1T 561MB/s (Read) 523MB/s (Write) Interface: SATA 6.0Gb/s
Memory: 64G.
Operating system: Linux CentOS 7
III、 Data preparation
In order to avoid the differences in the ability of reading data from hard disk, we store both Oracle data table and SPL composite table files on the same SSD .
1. TPCH raw data generation
The original data of 8 tables are generated by TPCH official tool with the total data volume as 100G. The size of each file is shown in the following table.
2. Importing data into Oracle
It is relatively easy to import data into Oracle. Create a database named tpch in Oracle database, execute the dss.ddl and dss.ri files generated by TPCH tool to create 8 data tables as well as the primary and foreign key relationships of the tables. Then execute the ctl file to import the original data into each table. For example, the ctl file for importing lineitem table data is as follows:
LOAD DATA INFILE '/home/sjr/tpch_2_17_0/tbls/lineitem.tbl'
INTO TABLE LINEITEM
(L_ORDERKEY terminated by '|',
L_PARTKEY terminated by '|',
L_SUPPKEY terminated by '|',
L_LINENUMBER terminated by '|',
L_QUANTITY terminated by '|',
L_EXTENDEDPRICE terminated by '|',
L_DISCOUNT terminated by '|',
L_TAX terminated by '|',
L_RETURNFLAG terminated by '|',
L_LINESTATUS terminated by '|',
L_SHIPDATE date "yyyy-mm-dd" terminated by '|',
L_COMMITDATE date "yyyy-mm-dd" terminated by '|',
L_RECEIPTDATE date "yyyy-mm-dd" terminated by '|',
L_SHIPINSTRUCT terminated by '|',
L_SHIPMODE terminated by '|',
L_COMMENT terminated by '|')
Save the script as a lineitem.ctl file and execute it in the linux command line:
> sqlldr database username/password@tpch control="/home/sjr/oracle/ctl/lineitem.ctl"
You can view the execution time of importing data in the log. Data importing of other data tables is similar, and will not be further elaborated.
3. Generating SPL composite tables
SPL composite tables are also generated from the original data file tbl. Each table generates a composite table file, with the primary key of the table as the dimension of the composite table. For tables with small data like the region and nation table, better performance can be achieved by using bin files to process the data; for tables with big data, the sortx function is needed to sort the primary keys. For jvm configuration of the sortx function, refer to "The impact of JVM parameter adjustment on sortx".
Execute the SPL script to generate the required composite tables. The SPL script to generate each composite table is described below.
3.1 region/nation bin file
Take the region table as an example:
A |
|
1 |
=file(path+"region.tbl").cursor(;,"|").new(_1:R_REGIONKEY, _2:R_NAME, _3:R_COMMENT).sortx(R_REGIONKEY;4000000) |
2 |
=file(destinate+"region.btx").export@b(A1) |
Use the text file to create a cursor for fetching data and write them in the bin file after sorted by primary key. The second parameter can be sorted according to the size of memory, which can also take effect when omitted. The nation table is similar except that the field name is different.
3.2 customer/supplier/part/orders group tables
Take the customer table for example:
A |
|
1 |
=file(path+"customer.tbl").cursor(;,"|").new(_1:C_CUSTKEY, _2:C_NAME, _3:C_ADDRESS, _4:C_NATIONKEY,_5:C_PHONE, _6:C_ACCTBAL, _7:C_MKTSEGMENT, _8:C_COMMENT).sortx(C_CUSTKEY;15000000) |
2 |
=file(destinate+"customer.ctx").create(#C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT) |
3 |
>A2.append(A1) |
Use text file to create cursors and write to the composite table after sorted by the primary key. Note that the composite table is created by naming the primary key field (preceded by #) . The second parameter of sorting can be determined by the size of memory, which can still work when omitted.
Other tables are similar, as long as the file name, primary key name for sorting and data structure information are changed, which will no longer be repeated here.
3.3 partsupp/lineitem group tables
parsupp and lineitem are sub-tables of the part and orders table respectively. For the storage of primary and sub tables, composite tables have special requirements, as illustrated by the lineitem table.
The primary keys of the lineitem table are L_ORDERKEY and L_LINENUMBER. L_ORDERKEY corresponds to O_ORDERKEY in the orders table. L_LINENUMBER is the detailed item number of a certain order in the orders table. Such a table data structure is called the primary-sub table.
A |
|
1 |
=file(path+"lineitem.tbl").cursor(;,"|").new(_1:L_ORDERKEY, _4:L_LINENUMBER, _2:L_PARTKEY, _3:L_SUPPKEY, _5:L_QUANTITY, _6:L_EXTENDEDPRICE,_7:L_DISCOUNT, _8:L_TAX, _9:L_RETURNFLAG, _10:L_LINESTATUS,_11:L_SHIPDATE,_12:L_COMMITDATE, _13:L_RECEIPTDATE,_14:L_SHIPINSTRUCT,_15:L_SHIPMODE, _16:L_COMMENT).sortx(L_ORDERKEY,L_LINENUMBER;4000000) |
2 |
=file(destinate+"lineitem.ctx").create(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY,L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX,L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE,L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT;L_ORDERKEY) |
3 |
>A2.append(A1) |
Note that A2 creates a composite table differently from the previous example, with an extra parameter indicating that the composite table will be segmented by L_ORDREKEY and will not segment records with the same value of the L_ORDERKEY field into two segments, which ensures the correct result of the parallel calculation. We will see the effect of this storage scheme in later examples.
Similarly, partsupp can be regarded as a sub-table of the part table, and the code that generates the composite table is as follows:
A |
|
1 |
=file(path+"partsupp.tbl").cursor(;, "|").new(_1:PS_PARTKEY, _2:PS_SUPPKEY, _3:PS_AVAILQTY, _4:PS_SUPPLYCOST, _5:PS_COMMENT).sortx(PS_PARTKEY,PS_SUPPKEY;4000000) |
2 |
=file(destinate+"partsupp.ctx").create(#PS_PARTKEY, #PS_SUPPKEY, PS_AVAILQTY,PS_SUPPLYCOST,PS_COMMENT;PS_PARTKEY) |
3 |
>A2.append(A1) |
IV、 Data preparation time and file size
The time comparison of importing data into Oracle and generating composite tables is shown in the table below, in which only three tables with the biggest data are compared, respectively the lineitem, orders and partsupp table.
Data table name |
LINEITEM |
ORDERS |
PARTSUPP |
Oracle |
34859 seconds |
5494 seconds |
3542 seconds |
SPL composite table |
6000 seconds |
1068 seconds |
647 seconds |
The size of the original file and SPL composite table file are shown in the table below.
Data table name |
LINEITEM |
ORDERS |
PARTSUPP |
Tbl original file |
79.5G |
17.8G |
12.2G |
SPL composite table |
29.4G |
7.2G |
4.9G |
Oracle table data is stored in the database file without separate table file. However, after importing the data, the database file is larger than the sum of the original data, which infers that each table occupies more hard disk space than the original data file.
TPCH raw data takes more than five times longer to be imported into Oracle database tables than to be transferred to SPL composite tables (even with sorting time added); moreover, it occupies more than twice as much hard disk space as composite tables (composite tables are compressed and easier to compress after being ordered). Therefore, in order to obtain better query performance, it is very worthwhile to use composite tables to store data, which saves time and space, resulting in faster data preparation.
V、 Detailed analysis of TPCH test
Learning Performance Optimization Skills from TPCH Tests - Q1
Learning Performance Optimization Skills from TPCH Tests - Q2
Learning Performance Optimization Skills from TPCH Tests - Q3
Learning Performance Optimization Skills from TPCH Tests - Q4
Learning Performance Optimization Skills from TPCH Tests - Q5
Learning Performance Optimization Skills from TPCH Tests - Q6
Learning Performance Optimization Skills from TPCH Tests - Q7
Learning Performance Optimization Skills from TPCH Tests - Q8
Learning Performance Optimization Skills from TPCH Tests - Q9
Learning Performance Optimization Skills from TPCH Tests - Q10
Learning Performance Optimization Skills from TPCH Tests - Q11
Learning Performance Optimization Skills from TPCH Tests - Q12
Learning Performance Optimization Skills from TPCH Tests - Q13
Learning Performance Optimization Skills from TPCH Tests - Q14
Learning Performance Optimization Skills from TPCH Tests - Q15
Learning Performance Optimization Skills from TPCH Tests - Q16
Learning Performance Optimization Skills from TPCH Tests - Q17
Learning Performance Optimization Skills from TPCH Tests - Q18
Learning Performance Optimization Skills from TPCH Tests - Q19
Learning Performance Optimization Skills from TPCH Tests - Q20
Learning Performance Optimization Skills from TPCH Tests - Q21
Learning Performance Optimization Skills from TPCH Tests - Q22
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese Version