Performance optimization case course: TPCH-Q20

 

select
    s_name,s_address
from
    supplier,nation
where
    s_suppkey in (
        select
            ps_suppkey
        from
            partsupp
        where
            ps_partkey in (
                select
                    p_partkey
                from
                    part
                where
                    p_name like 'bisque%'
            )
            and ps_availqty > (
                select
                    0.5 * sum(l_quantity)
                from
                    Lineitem
                where
                    l_partkey = ps_partkey
                    and l_suppkey = ps_suppkey
                    and l_shipdate >= date '1995-01-01'
                    and l_shipdate < date '1995-01-01' + interval '1' year
            )
    )
    and s_nationkey = n_nationkey
    and n_name = 'CHINA'
order by
    s_name;

The main query is simple, but the nested filtering condition is complex.

The filtering condition is mainly for the S_SUPPKEY field of the supplier table. Since this table is relatively small, it is easy to get S_NAME and S_ADDRESS as long as the SUPPKEY set meeting the condition is calculated. Calculating the SUPPKEY set in steps can make the problem-solving idea clearer.

The original SQL statement is to filter supplier with partsupp. Now we modify it to filter partsupp with supplier, the obtained PS _ SUPPKEY set is the same. The partsupp is relatively large, so regarding supplier and part as dimension tables to filter the fact table partsupp is more conducive to improving performance. Calculation steps:

1. Filter the nation table according to the condition, and then filter the supplier table with the result.

2. Filter the part table according to the condition.

3. Take the results of steps 1 and 2 as foreign key tables to match and filter partsupp table.

These three steps can be viewed as a view v1, expressed in SQL as follows:

select
	ps_suppkey
from
	partsupp,
	(
		select
			p_partkey
		from
			part
		where
			p_name like 'bisque%'
	),
	(
		select
			s_suppkey
		from
			supplier,nation
		where
			s_nationkey = n_nationkey
			and n_name = 'CHINA'
	)
where
	ps_partkey=p_partkey
	and ps_suppkey=s_suppkey

With v1, the original SQL statement can be modified as:

select
    s_name,s_address
from
    supplier,
	(
	    select
           distinct ps_suppkey
       from
            v1
       where
            ps_availqty > (
                select
                    0.5 * sum(l_quantity)
                from
                    lineitem
                where
                    l_partkey = ps_partkey
                    and l_suppkey = ps_suppkey
                    and l_shipdate >= date '1995-01-01'
                    and l_shipdate < date '1995-01-01' + interval '1' year
            )
	)  q_suppkey
where s_suppkey=ps_suppkey;

Observe the subquery q_suppkey, where the calculationsof v1 and lineitem fall under the situation of associating with the primary table according to the equivalence condition. We can use the method explained in previous articles to change the calculations to a JOIN computation between v1 and lineitem. After modification, the SQL statement can be rewritten as:

select
    s_name,s_address
from
    supplier,
	(
		select
			distinct ps_suppkey
		from
			v1,
			(select
				l_partkey,
				l_suppkey,
				0.5 * sum(l_quantity) quantity
			from
				lineitem
			where
				l_shipdate >= date '1995-01-01'
				and l_shipdate < date '1995-01-01' + interval '1' year
			group by
				l_partkey,
				l_suppkey
			)
		where
			ps_partkey = l_partkey
			and ps_suppkey = l_suppkey
			and ps_availqty > quantity
	)  q_suppkey
where s_suppkey=ps_suppkey;

Next, we need to continue to modify the subquery q_suppkey. Since the lineitem table is very large, it is necessary to first associate it with v1 to filter out records that do not satisfy the condition, then perform grouping and aggregation, and finally filter out the same L_SUPPKEY set. Rewrite the SQL statement as:

select
    s_name,s_address
from
    supplier,
	(
		select
			distinct l_suppkey
		from
			(select
				l_partkey,
				l_suppkey,
				ps_availqty,
				0.5 * sum(l_quantity) quantity
			from
				lineitem,v1
			where
				ps_partkey = l_partkey
				and ps_suppkey = l_suppkey
				and l_shipdate >= date '1995-01-01'
				and l_shipdate < date '1995-01-01' + interval '1' year
			group by
				l_partkey,
				l_suppkey,
				ps_availqty
			)
		where
			ps_availqty > quantity
	) suppkey
where s_suppkey=l_suppkey;

1. Data storage

There is no special requirement for tables involved in calculation, store them in order by primary key.

Continue to use lineitem.ctx from Q3, part.ctx, supplier.ctx, partsupp.ctx and nation.btx from Q2. Copy them to the main directory of this query.

2. General method

Calculation code:


A

1

=now()

2

1995-1-1

3

=elapse@y(A2,1)

4

>partname="bisque"

5

>nationname="CHINA"

6

=file("nation.btx").import@b().select@1(N_NAME==nationname).N_NATIONKEY

7

=file("part.ctx").open().cursor@m(P_PARTKEY;pos@h(P_NAME,partname)).fetch().keys@im(P_PARTKEY)

8

=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME,S_ADDRESS;S_NATIONKEY==A6).fetch().keys@im(S_SUPPKEY)

9

=file("partsupp.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;A7.find(PS_PARTKEY),A8.find(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

10

=file("lineitem.ctx").open().cursor@m(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A2 && L_SHIPDATE<A3)

11

=A10.join@i(L_PARTKEY:L_SUPPKEY,A9:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY)

12

=A11.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity)

13

=A12.select(PS_AVAILQTY*2>quantity).id(L_SUPPKEY)

14

=A8.join@im(S_SUPPKEY,A13:~)

15

=A14.new(S_NAME,S_ADDRESS).sort(S_NAME)

16

=interval@ms(A1,now())

A9 is equivalent to calculating the view v1 mentioned above.

A8 retrieves S_NAME and S_ADDRESS while retrieving S_SUPPKEY from the supplier table, which avoids reading them again in A14.

A13 is equivalent to calculating the subquery q_suppkey.

A12 uses the redundant grouping key introduced in previous articles, and PS_AVAILQTY is no longer used as a grouping field, which improves performance.

A13 changes sum(L_QUANTITY)*0.5 to PS_AVAILQTY*2. Both L_QUANTITY and PS_AVAILQTY are integers, which changes decimal multiplication into integer multiplication.

A14 uses join@im, where @i option deletes non-matching records and @m option enables using order-based merge to speed up computation when both S_SUPPKEY and A13 are ordered.

Test result:

Test items

Execution time (seconds)

General method

15

3. Data conversion

For this query, we need to use two optimization methods mentioned in previous articles: dimension table primary key sequence-numberization and date-integer conversion. Copy part_2.ctx, supplier_2.ctx, partsupp_2.ctx and nation_2.btx from Q2 and lineitem_14_4.ctx from Q14 to the main directory of this query.

Calculation code:


A

B

1

=now()

2

1995-1-1

3

=days@o(elapse@y(A2,1))

4

=days@o(A2)

5

>partname="bisque"

6

>nationname="CHINA"

7

=file("nation_2.btx").import@b().(N_NAME==nationname)

8

=file("part_2.ctx").open()

=A8.cursor().skip().(false)

9

=A8.cursor@m(P_PARTKEY;pos@h(P_NAME,partname)).fetch().run(B8(P_PARTKEY)=true)

10

=file("supplier_2.ctx").open()

=A10.cursor().skip().(null)

11

=A10.cursor@m(S_SUPPKEY,S_NAME,S_ADDRESS;A7(S_NATIONKEY)).fetch()

=A11.run(B10(S_SUPPKEY)=~)

12

=file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;B8(PS_PARTKEY) && B10(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

13

=file("lineitem_14_4.ctx").open().cursor@m(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A4 && L_SHIPDATE<A3)

14

=A13.join@i(L_PARTKEY:L_SUPPKEY,A12:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY)

15

=A14.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity)

16

=A15.select(PS_AVAILQTY*2>quantity).id(L_SUPPKEY)

17

=B10(A16)

18

=A17.new(S_NAME,S_ADDRESS).sort(S_NAME)

19

=interval@ms(A1,now())

A7, B8, B10 and B11 are all aligned sequences. The value of A10 is null or a supplier record, which enables the direct use of the record in A17, without having to read it again.

Test result:

Test items

Execution time (seconds)

General method

15

Data conversion

10

4. Column-wise computing

Calculation code:


A

B

1

=now()

2

1995-1-1

3

=days@o(elapse@y(A2,1))

4

=days@o(A2)

5

>partname="bisque"

6

>nationname="CHINA"

7

=file("nation_2.btx").import@b().select@1(N_NAME==nationname).N_NATIONKEY

8

=file("part_2.ctx").open()

=A8.cursor@m().skip().(false)

9

=A8.cursor@mv(P_PARTKEY;pos@h(P_NAME,partname)).fetch().run(B8(P_PARTKEY)=true)

10

=file("supplier_2.ctx").open()

=A10.cursor@m().skip().(null)

11

=A10.cursor@mv(S_SUPPKEY,S_NAME,S_ADDRESS;S_NATIONKEY==A7).fetch()

=A11.run(B10(S_SUPPKEY)=~)

12

=file("partsupp_2.ctx").open().cursor@mv(PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY;B8(PS_PARTKEY) && B10(PS_SUPPKEY)).fetch().keys@im(PS_PARTKEY,PS_SUPPKEY)

13

=file("lineitem_14_4.ctx").open().cursor@mv(L_PARTKEY,L_SUPPKEY,L_QUANTITY;L_SHIPDATE>=A4 && L_SHIPDATE<A3)

14

=A13.join@i(L_PARTKEY:L_SUPPKEY,A12:PS_PARTKEY:PS_SUPPKEY,PS_AVAILQTY)

15

=A14.groups@u(L_PARTKEY,L_SUPPKEY;PS_AVAILQTY,sum(L_QUANTITY):quantity)

16

=A15.select@v(PS_AVAILQTY*2>quantity).id(L_SUPPKEY)

17

=B10(A16)

18

=A17.new(S_NAME,S_ADDRESS).sort(S_NAME)

19

=interval@ms(A1,now())

Test result:

Test items

Execution time (seconds)

General method

15

Data conversion

10

Column-wise computing

6