Performance optimization case course: TPCH-Q5

 

select
	n_name,
	sum(l_extendedprice * (1 - l_discount)) as revenue
from
	customer,
	orders,
	lineitem,
	supplier,
	nation,
	region
where
	c_custkey = o_custkey
	and l_orderkey = o_orderkey
	and l_suppkey = s_suppkey
	and c_nationkey = s_nationkey
	and s_nationkey = n_nationkey
	and n_regionkey = r_regionkey
	and r_name = 'ASIA'
	and o_orderdate >= date '1995-01-01'
	and o_orderdate < date '1995-01-01' + interval '1' year
group by
	n_name
order by
revenue desc;

This is a grouping and aggregation operation after multiple tables are associated and filtered, which includes the association between the primary table orders and the sub table lineitem.

1. Data storage

Store the big tables orders and lineitem in order by the primary key orderkey, which makes it possible to merge them in order when joining. For other tables, there is no special sorting requirement, they are all stored in order by primary key.

So, we can continue to use orders.ctx, lineitem.ctx and customer.ctx in Q3.

For tables region, nation and supplier, we can directly use regin.btx, nation.btx and supplier.ctx in Q2.

Copy these tables to the main directory of this query.

2. General method

Calculation code:


A

B

1

=now()


2

1995-1-1

>name="ASIA"

3

=elapse@y(A2,1)


4

=file("region.btx").import@b().select(R_NAME==name)

5

=file("nation.btx").import@b().switch@i(N_REGIONKEY,A4:R_REGIONKEY).keys@i(N_NATIONKEY)

6

=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NATIONKEY;A5.find(S_NATIONKEY)).fetch().keys@im(S_SUPPKEY)

7

=file("customer.ctx").open().cursor@m(C_CUSTKEY,C_NATIONKEY;A5.find(C_NATIONKEY)).fetch().keys@im(C_CUSTKEY)

8

=file("orders.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=A2 && O_ORDERDATE < A3,O_CUSTKEY:A7)

9

=file("lineitem.ctx").open().news(A8,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:A6)

10

=A9.select(O_CUSTKEY.C_NATIONKEY==L_SUPPKEY.S_NATIONKEY)

11

=A10.groups@u(L_SUPPKEY.S_NATIONKEY:N_NAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)

12

=A11.sort@z(revenue).run(N_NAME=A5.find(N_NAME).N_NAME)

Here we use the techniques mentioned in the previous articles: filter at cursor creation, order-based primary-sub table association, and convert join field to foreign key table pointers.

Unlike Q3, the grouping field isn’t the ordered L_ORDERKEY and group@s() becomes unsuitable. Here we use groups@u, the result is no longer ordered by the grouping field, and the performance is better compared to not adding @u.

A11 performs grouping on S_NATIONKEY instead of N_NAME because integer comparison is faster than string comparison, and replace the field with N_NAME after the grouping operation is done.

Test result:

Test items

Execution time (seconds)

General method

18

3. Data conversion

Utilize the conversion methods mentioned in previous articles: convert primary keys of dimension tables to sequence numbers, convert enumeration string field to numbers, convert date to integers.

For region, nation and supplier, we can directly use region_2.btx, nation_2.btx and supplier_2.ctx converted in Q2.

For orders, we can use orders_5.ctx in Q3.

Copy these tables to the main directory of this query.

Convert customer:


A

1

=file("customer.ctx").open().import()

2

=file("nation.btx").import@b()

3

=A1.run(C_CUSTKEY=#,C_NATIONKEY=A2.pselect(N_NATIONKEY==C_NATIONKEY))

4

=file("customer_3.ctx").create(#C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT)

5

>A4.append(A3.cursor())

Here we use the customer.ctx in Q3 and the nation.btx in Q2.

Convert lineitem:


A

1

=file("supplier.ctx").open().import().keys@i(S_SUPPKEY)

2

=file("lineitem.ctx").open().cursor()

3

=A2.run(L_SUPPKEY=A1.pfind(L_SUPPKEY))

4

=file("lineitem_3.ctx").create@p(#L_ORDERKEY,#L_LINENUMBER,L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE,L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE,L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT)

5

>A4.append(A3)

Here we use the supplier.ctx in Q2 and the lineitem.ctx in Q3.

Calculation code:


A

B

1

=now()


2

1995-1-1

>name="ASIA"

3

=days@o(elapse@y(A2,1))

=days@o(A2)

4

=file("region_2.btx").import@b(R_NAME).(R_NAME==name)

5

=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)

6

=A5.(A4(N_REGIONKEY))


7

=file("supplier_2.ctx").open()

=A7.cursor@m().skip().(null)

8

=A7.cursor@m(S_SUPPKEY,S_NATIONKEY;A6(S_NATIONKEY)).fetch().(B7(S_SUPPKEY)=S_NATIONKEY)

9

=file("customer_3.ctx").open()

=A9.cursor@m().skip().(null)

10

=A9.cursor@m(C_CUSTKEY,C_NATIONKEY;A6(C_NATIONKEY)).fetch().(B9(C_CUSTKEY)=C_NATIONKEY)

11

=file("orders_5.ctx").open().cursor@m(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=B3 && O_ORDERDATE<A3,O_CUSTKEY:B9:#)

12

=file("lineitem_3.ctx").open().news(A11,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;L_SUPPKEY:B7:#)

13

=A12.select(O_CUSTKEY ==L_SUPPKEY)


14

=A13.groups@u(L_SUPPKEY:N_NAME;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)

15

=A14.sort@z(revenue).run(N_NAME=A5(N_NAME).N_NAME)

16

=interval@ms(A1,now())


B7 and B9 are aligned sequence and their members are true or false.

A11 and A12 assign the O_CUSTKEY of orders to C_NATIONKEY and assign the L_SUPPKEY of lineitem to S_NATIONKEY. The O_CUSTKEY==L_SUPPKEY in A13 is actually the comparison between C_NATIONKEY and S_NATIONKEY.

Test result:

Test items

Execution time (seconds)

General method

18

Data conversion

10

4. Column-wise computing


A

B

1

=now()


2

1995-1-1

>name="ASIA"

3

=days@o(elapse@y(A2,1))

=days@o(A2)

4

=file("region_2.btx").import@b(R_NAME).(R_NAME==name)

5

=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)

6

=A5.(A4(N_REGIONKEY))


7

=file("supplier_2.ctx").open()

=A7.cursor@m().skip().(null)

8

=A7.cursor@mv(S_SUPPKEY,S_NATIONKEY;A6(S_NATIONKEY)).fetch().(B7(S_SUPPKEY)=~)

9

=file("customer_3.ctx").open()

=A9.cursor@m().skip().(null)

10

=A9.cursor@mv(C_CUSTKEY,C_NATIONKEY;A6(C_NATIONKEY)).fetch().(B9(C_CUSTKEY)=~)

11

=file("orders_5.ctx").open().cursor@mv(O_ORDERKEY,O_CUSTKEY;O_ORDERDATE>=B3 && O_ORDERDATE<A3 && B9(O_CUSTKEY))

12

=file("lineitem_3.ctx").open().news(A11,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,O_CUSTKEY;B7(L_SUPPKEY))

13

=A12.derive@o(B7(L_SUPPKEY).S_NATIONKEY,B9(O_CUSTKEY).C_NATIONKEY).select@v(C_NATIONKEY==S_NATIONKEY)

14

=A13.groups@u(S_NATIONKEY;sum(L_EXTENDEDPRICE*(1-L_DISCOUNT)):revenue)

15

=A14.sort@z(revenue).new(A5(S_NATIONKEY).N_NAME,revenue)

16

=interval@ms(A1,now())


For column-wise computing, try to avoid using switch() function. Therefore, instead of using O_CUSTKEY:B9:# that performs association and filtering simultaneously, A11 uses B9(O_CUSTKEY) to filter data only. Because O_CUSTKEY is already sequence-numberized here, the customer record can be directly retrieved by row number in B9 without the need for association, which better manifests the advantages of column-wise computing.

A13 first uses O_CUSTKEY to find the corresponding customer record in B9, and then obtains the value of S_NATIONKEY. To achieve this, the aligned sequence calculated in B9 is no longer true or false, but the customer record when the condition is satisfied, or null when the condition is not satisfied.

The association between lineitem and supplier is handled similarly to that of customer.

In A13, when performing select filtering on the columnar cursor, adding @v option ensures the filtered cursor is still a columnar cursor.

Test result:

Test items

Execution time (seconds)

General method

18

Data conversion

10

Column-wise computing

4