Learning Performance Optimization Skills from TPCH Tests

I、 Objective

TPCH is a benchmark suite issued by the Transaction Processing Performance Council (TPC) for evaluating a database's decision support capability. By simulating complex, business-oriented queries, TPCH examines the comprehensive processing ability of a database and measures the response time of its operations.

The TPCH benchmark defines a database model whose capacity can be chosen at 8 levels ranging from 1GB to 10000GB. The model includes eight tables (CUSTOMER, LINEITEM, NATION, ORDERS, PART, PARTSUPP, REGION and SUPPLIER) as well as 22 SQL query statements. It covers a wide range of scenarios and can test the operational performance of a database quite completely.

Many of TPCH's SQL statements are multi-layer nested complex queries, which perform poorly. For these queries, adopting a more reasonable storage scheme, designing lower-complexity algorithms, and applying parallelism can yield much better performance. Unfortunately, due to the limitations of its theoretical system, many of these ideas cannot be expressed in SQL, and SQL programmers who are unaware of such optimization methods often have no choice but to endure slow database operation.

This series of articles discusses performance optimization techniques for data operations through these 8 tables and 22 SQL statements: each statement is analyzed carefully to find its operational and data characteristics, and more reasonable algorithms are designed to achieve the same goal. Because SQL makes it difficult to implement these algorithms and storage structures, we use esProc group tables to store the data and SPL to implement the algorithms, while comparing the performance with SQL on Oracle. We hope readers can learn from this material high-performance computing techniques applicable in a variety of situations.

II、 Environment

CPU: 2 × Intel 3014, clock speed 1.7GHz, 6 cores per CPU.

Hard disk (SSD): 1TB, 561MB/s read, 523MB/s write, SATA 6.0Gb/s interface.

Memory: 64GB.

Operating system: Linux CentOS 7

 

III、 Data preparation

To avoid differences in hard disk read capability, we store the Oracle data tables and the SPL group table files on the same SSD.

1. TPCH raw data generation

The raw data of the 8 tables is generated by the official TPCH tool, with a total data volume of 100GB. The size of each file is shown in the table below.
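The generated .tbl files are plain text, with each field terminated by '|' (the loader scripts below rely on this). As a minimal illustration, here is a hypothetical Python sketch of parsing one such row; the sample line and helper name are for illustration only:

```python
# Minimal sketch: parsing one '|'-terminated row of a TPCH .tbl file.
# The sample line below is hypothetical; real files come from the TPCH tool.
def parse_tbl_line(line, field_names):
    # dbgen writes a trailing '|' after the last field, so drop the empty tail
    values = line.rstrip("\n").split("|")
    if values and values[-1] == "":
        values = values[:-1]
    return dict(zip(field_names, values))

row = parse_tbl_line("0|AFRICA|some comment|",
                     ["R_REGIONKEY", "R_NAME", "R_COMMENT"])
print(row["R_NAME"])  # AFRICA
```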

2. Data import into Oracle

Importing data into Oracle is relatively simple. Create a database named TPCH in Oracle, run the dss.ddl and dss.ri files generated by the TPCH tool to create the eight data tables and their primary/foreign key relationships, then run a CTL file to import the raw data into the tables. For example, the CTL file for importing the lineitem table is as follows:

LOAD DATA INFILE '/home/sjr/tpch_2_17_0/tbls/lineitem.tbl'
INTO TABLE LINEITEM
(L_ORDERKEY terminated by '|',
L_PARTKEY terminated by '|',
L_SUPPKEY terminated by '|',
L_LINENUMBER terminated by '|',
L_QUANTITY terminated by '|',
L_EXTENDEDPRICE terminated by '|',
L_DISCOUNT terminated by '|',
L_TAX terminated by '|',
L_RETURNFLAG terminated by '|',
L_LINESTATUS terminated by '|',
L_SHIPDATE date "yyyy-mm-dd" terminated by '|',
L_COMMITDATE date "yyyy-mm-dd" terminated by '|',
L_RECEIPTDATE date "yyyy-mm-dd" terminated by '|',
L_SHIPINSTRUCT terminated by '|',
L_SHIPMODE terminated by '|',
L_COMMENT terminated by '|')

Save the script as lineitem.ctl and execute it on the Linux command line:

> sqlldr username/password@tpch control="/home/sjr/oracle/ctl/lineitem.ctl"

The execution time of the import can be viewed in the log. Imports for the other tables are similar and will not be elaborated further.

3. Generating SPL group tables

SPL group tables are also generated from the original tbl data files. Each table generates one group table file, with the table's primary key as the dimension of the group table. For tables with large data volumes, the sortx function is needed to sort by primary key. For the JVM configuration relevant to sortx, refer to "The impact of JVM parameter adjustment on sortx" (http://c.raqsoft.com.cn/article/1562723138966).

Run the SPL script to generate the required group tables. The SPL script to generate each group table is described below.

3.1 region/nation/customer/supplier/part/orders tables

Take the region table as an example:


   A
1  =file(path+"region.tbl").cursor(; , "|").new(_1:R_REGIONKEY, _2:R_NAME, _3:R_COMMENT).sortx(R_REGIONKEY;4000000)
2  =file(destinate+"region.ctx").create(#R_REGIONKEY, R_NAME, R_COMMENT)
3  >A2.append(A1)

Create a cursor to fetch data from the text file, sort it by primary key, and then write it to the group table. Note that when creating the group table, the primary key fields must be marked by prefixing # to the field name. The second parameter of sortx can be set according to the available memory, and it can also be omitted.

The other tables are similar: only the file name, the primary key used for sorting, and the data structure information need to be changed, so they are not elaborated further.
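The cursor-sort-append pattern above is essentially an external sort: data too large for memory is sorted in fixed-size runs which are then merged. The following is a rough Python sketch of the idea only, not SPL's actual implementation (real sortx spills sorted runs to disk; here the runs stay in memory to keep the sketch short):

```python
import heapq

# Rough sketch of external sorting, the idea behind sortx: sort fixed-size
# chunks in memory, then k-way merge the sorted runs. A real implementation
# would write each run to disk before merging.
def external_sort(records, key, chunk_size):
    runs = []
    buf = []
    for rec in records:
        buf.append(rec)
        if len(buf) >= chunk_size:        # memory budget reached: emit a sorted run
            runs.append(sorted(buf, key=key))
            buf = []
    if buf:
        runs.append(sorted(buf, key=key))
    return list(heapq.merge(*runs, key=key))  # merge all runs in key order

data = [{"R_REGIONKEY": k} for k in (3, 1, 4, 0, 2)]
out = external_sort(data, key=lambda r: r["R_REGIONKEY"], chunk_size=2)
print([r["R_REGIONKEY"] for r in out])  # [0, 1, 2, 3, 4]
```

The second argument of sortx plays the role of `chunk_size` here: a larger memory budget means fewer, longer runs and less merging work.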

3.2 partsupp/lineitem tables

Partsupp and lineitem are sub-tables of the part and orders tables respectively. Group tables have special requirements for storing main tables and sub-tables; we take the lineitem table as an example.

The primary key of the lineitem table consists of L_ORDERKEY and L_LINENUMBER. L_ORDERKEY corresponds to O_ORDERKEY in the orders table, and L_LINENUMBER is the number of a detail item within an order. Such a data structure is called a main table and sub-table relationship.

 


   A
1  =file(path+"lineitem.tbl").cursor(; , "|").new(_1:L_ORDERKEY, _4:L_LINENUMBER, _2:L_PARTKEY, _3:L_SUPPKEY, _5:L_QUANTITY, _6:L_EXTENDEDPRICE, _7:L_DISCOUNT, _8:L_TAX, _9:L_RETURNFLAG, _10:L_LINESTATUS, _11:L_SHIPDATE, _12:L_COMMITDATE, _13:L_RECEIPTDATE, _14:L_SHIPINSTRUCT, _15:L_SHIPMODE, _16:L_COMMENT).sortx(L_ORDERKEY,L_LINENUMBER;4000000)
2  =file(destinate+"lineitem.ctx").create(#L_ORDERKEY, #L_LINENUMBER, L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT; L_ORDERKEY)
3  >A2.append(A1)

Note that A2 creates the group table differently from the previous one: there is an extra parameter indicating that the group table will be segmented by L_ORDERKEY, so records with the same L_ORDERKEY value will never be split across two segments, which guarantees correct parallel computation results. We will see the effect of this storage scheme in later examples.
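The effect of that extra parameter can be illustrated with a small sketch: when an ordered table is cut into segments for parallel processing, each tentative cut point is pushed forward until the key value changes, so rows sharing the same L_ORDERKEY never straddle two segments. A hypothetical Python version (the function name and shapes are illustrative, not SPL internals):

```python
# Sketch of key-aligned segmentation: cut a sorted list of rows into n parts
# for parallel work, but never split rows that share the same key value.
def segment_by_key(rows, key, n):
    size = len(rows)
    segments = []
    start = 0
    for i in range(1, n + 1):
        end = size * i // n              # tentative cut point
        # push the cut forward until the key value changes
        while 0 < end < size and end > start and key(rows[end]) == key(rows[end - 1]):
            end += 1
        if end > start:
            segments.append(rows[start:end])
        start = end
    return segments

# rows sorted by (orderkey, linenumber); order 2 would straddle a naive halfway cut
rows = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "a")]
print(segment_by_key(rows, key=lambda r: r[0], n=2))
# [[(1, 'a'), (1, 'b')], [(2, 'a'), (2, 'b'), (3, 'a')]]
```

Each segment can then be processed by a separate thread, and per-order aggregations remain correct because every order is wholly inside one segment.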

Similarly, partsupp can be seen as a sub-table of the part table, and the code that generates its group table is as follows:

 


   A
1  =file(path+"partsupp.tbl").cursor(; , "|").new(_1:PS_PARTKEY, _2:PS_SUPPKEY, _3:PS_AVAILQTY, _4:PS_SUPPLYCOST, _5:PS_COMMENT).sortx(PS_PARTKEY,PS_SUPPKEY;4000000)
2  =file(destinate+"partsupp.ctx").create(#PS_PARTKEY, #PS_SUPPKEY, PS_AVAILQTY, PS_SUPPLYCOST, PS_COMMENT; PS_PARTKEY)
3  >A2.append(A1)

 

IV、 Data preparation time and file size

The comparison of the time to import data into Oracle versus the time to generate group tables is shown in the table below. Only the three tables with the largest data volumes (lineitem, orders and partsupp) are compared.

 

Data table name    LINEITEM         ORDERS          PARTSUPP
Oracle             34859 seconds    5494 seconds    3542 seconds
SPL group table    6000 seconds     1068 seconds    647 seconds

 

The original file size and SPL group table file size are shown in the table below.

Data table name      LINEITEM    ORDERS    PARTSUPP
Tbl original file    79.5G       17.8G     12.2G
SPL group table      29.4G       7.2G      4.9G

Oracle table data is stored in the database files; there is no separate file per table. However, after the data is imported, the database files are larger than the sum of the original data, so it can be inferred that each table occupies more hard disk space than its original data file.

 

Importing the TPCH raw data into Oracle database tables takes more than five times as long as generating the SPL group tables (even counting the time spent sorting); moreover, the Oracle data occupies more than twice as much hard disk space as the group tables (group tables are compressed, and ordered data compresses better). Therefore, to obtain better query performance, it is well worth using group tables to store the data: it saves both time and space, and data preparation becomes faster.
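The "more than five times" and "more than twice" figures follow directly from the tables above; since Oracle's on-disk footprint is larger than the original tbl files, the tbl-to-group-table ratio is a lower bound for the space saving:

```python
# Verify the speedup and space ratios quoted above, using the measured figures
# from the two tables in this section.
oracle_secs = {"LINEITEM": 34859, "ORDERS": 5494, "PARTSUPP": 3542}
spl_secs    = {"LINEITEM": 6000,  "ORDERS": 1068, "PARTSUPP": 647}
tbl_gb      = {"LINEITEM": 79.5,  "ORDERS": 17.8, "PARTSUPP": 12.2}
ctx_gb      = {"LINEITEM": 29.4,  "ORDERS": 7.2,  "PARTSUPP": 4.9}

for t in oracle_secs:
    speedup = oracle_secs[t] / spl_secs[t]   # Oracle import time / group table time
    shrink = tbl_gb[t] / ctx_gb[t]           # original tbl size / group table size
    print(f"{t}: {speedup:.1f}x faster to prepare, at least {shrink:.1f}x smaller")
```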

V、 Detailed analysis of TPCH test

Learning Performance Optimization Skills from TPCH Tests - Q2

Learning Performance Optimization Skills from TPCH Tests - Q3

Learning Performance Optimization Skills from TPCH Tests - Q4

Learning Performance Optimization Skills from TPCH Tests - Q5

Learning Performance Optimization Skills from TPCH Tests - Q6

Learning Performance Optimization Skills from TPCH Tests - Q7

Learning Performance Optimization Skills from TPCH Tests - Q8

Learning Performance Optimization Skills from TPCH Tests - Q9

Learning Performance Optimization Skills from TPCH Tests - Q10

Learning Performance Optimization Skills from TPCH Tests - Q11

Learning Performance Optimization Skills from TPCH Tests - Q12

Learning Performance Optimization Skills from TPCH Tests - Q13

Learning Performance Optimization Skills from TPCH Tests - Q14

Learning Performance Optimization Skills from TPCH Tests - Q15

Learning Performance Optimization Skills from TPCH Tests - Q16

Learning Performance Optimization Skills from TPCH Tests - Q17

Learning Performance Optimization Skills from TPCH Tests - Q18

Learning Performance Optimization Skills from TPCH Tests - Q19

Learning Performance Optimization Skills from TPCH Tests - Q20

Learning Performance Optimization Skills from TPCH Tests - Q21

Learning Performance Optimization Skills from TPCH Tests - Q22