Learn performance optimization skills from TPCH tests - Q1

 

I、 Query Requirement

Q1 statement queries a pricing summary report of lineItem. In a single table lineitem query within a certain period of time, the payment has been made, has been shipped and other types of goods for statistics, including business volume billing, delivery, discounts, taxes, average prices and other information.

The characteristics of Q1 statement are: single table query operation with grouping, sorting and aggregation operations. This query results in 95% to 97% of the data on the table being read.

II、 Oracle Implementation

The query SQL statements written by Oracle are as follows:

select  /*+ parallel(n) */

l_returnflag,

l_linestatus,

sum(l_quantity) as sum_qty,

sum(l_extendedprice) as sum_base_price,

sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,

sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,

avg(l_quantity) as avg_qty,

avg(l_extendedprice) as avg_price,

avg(l_discount) as avg_disc,

count(*) as count_order

from

lineitem

where

l_shipdate <= date '1995-12-01' - interval '90' day(3)

group by

l_returnflag,

l_linestatus

order by

l_returnflag,

l_linestatus;

Where /*+ parallel(n) */ is Oracle's parallel query syntax, and n is the parallel number.

Script execution time, Unit: seconds

 

Number of parallel

1

2

4

8

12

Oracle

570

356

219

170

131

 

III、 SPL optimization

This is a routine grouping query, and the result set is not large. There is no special optimization technology, and the use of multiple cursors can make full use of parallel.            

The SPL script for writing Q1 queries is as follows:

 


A

1

=1

2

1995-12-01

3

=A2-90

4

=now()

5

=file(path+"lineitem.ctx").create()

6

=A5.cursor@m(L_SHIPDATE,L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT,  L_TAX, L_RETURNFLAG,L_LINESTATUS;L_SHIPDATE<=A3;A1)

7

=A6.groups(L_RETURNFLAG, L_LINESTATUS; sum(L_QUANTITY):sum_qty, sum(L_EXTENDEDPRICE):sum_base_price, sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):sum_disc_price, sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)):sum_charge, avg(L_QUANTITY):avg_qty, avg(L_EXTENDEDPRICE):avg_price, avg(L_DISCOUNT):avg_disc, count(1):count_order)

8

=interval@s(A4,now())

A1 is the number of parallel settings, which is agreed by other subsequent examples.            This code is conventional. A6 defines multiple cursors in parallel. Because the result set is small, Groups is used in A7 for small grouping.

Script execution time, Unit: seconds

Number of parallel

1

2

4

8

12

Oracle

570

356

219

170

131

SPL group table

336

174

91

46

38

It can be seen that the parallel effect of SPL is very good, close to linear speed-up.            SPL also performs better in single threading, mainly because group tables are stored in compressed columns.

This query involves a large amount of data and needs to read data from external storage. The hard disk access time is a factor that can not be ignored. When the calculation does not involve all data columns, the use of column storage can reduce the amount of reading. Moreover, the column storage mode is easier to compress, which further reduces the hard disk access time.

In fact, esProc SPL is currently developed in Java, which should be weaker than Oracle, which is developed in C++, if it compares the computing performance of CPU only. However, because compressed column storage reduces hard disk access time, this leads to slow running Java faster than fast C++.

However, column storage is not always effective. If mechanical hard disk is used, column storage will lead to more track-to-track seek time. Although the amount of reading is less, the time consumption caused by seek is likely to be more. This test uses SSD hard disk and has no problem of seeking time.

It's also worth noting that we put the filter conditions in A6, which is the statement of cursor creation. In this way, when SPL reads data, if it finds that the condition is not valid, it will give up reading related columns directly, and further reduce the time of hard disk access and record generation.

 

IV、 Further optimization

This SQL is grouping and statistics for a single table, with filtering and no join. Lineitem has a large amount of data and takes a long time to read. If the consumption of reading can be greatly reduced in the filtering process, the performance can be further improved. When business permits, if the filtering field l_shipdate is used as dimension field when designing group tables, the target data can be quickly selected and the query speed can be improved.

The SPL script for recreating the group table (which is only used for Q1 queries) is as follows:

 


A

1

=file(path+"lineitem.tbl").cursor(; , "|").new(_11:L_SHIPDATE, _1:L_ORDERKEY, _4:L_LINENUMBER, _2:L_PARTKEY, _3:L_SUPPKEY,  _5:L_QUANTITY, _6:L_EXTENDEDPRICE,_7:L_DISCOUNT, _8:L_TAX, _9:L_RETURNFLAG, _10:L_LINESTATUS, _12:L_COMMITDATE, _13:L_RECEIPTDATE,_14:L_SHIPINSTRUCT,  _15:L_SHIPMODE, _16:L_COMMENT).sortx(L_SHIPDATE;4000000)

2

=file(destinate+"lineitem_Q1.ctx").create( #L_SHIPDATE,L_ORDERKEY,L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX,  L_RETURNFLAG, L_LINESTATUS, L_COMMITDATE,  L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)

3

>A2.append(A1)

 

Using this group table for testing, the query time is compared as follows:

Script execution time, Unit: seconds

Number of parallel

1

2

4

8

12

Oracle

570

356

219

170

131

SPL group table

336

174

91

46

38

Optimized SPL group table

276

139

76

40

34

 

When creating a group table, it is not always sorted by the primary key. If you know the query condition beforehand or the query condition is commonly used, you can use the query field sorting to improve performance. In use, we can adopt the coexistence strategy of group table sorted by primary key and query field. When writing a query task, we can flexibly select a group table according to need.