"select * from ( select s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment from .."

chenhu RaqForum 81 No.
1 Reply • 22 View • 1 Days ago

Performance optimization case course: TPCH-Q2

select * from (
    select s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment
    from part,supplier,partsupp,nation,region
    where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 25
        and p_type like '%COPPER'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'ASIA'
        and ps_supplycost = (
            select
                min(ps_supplycost)
            from
                partsupp,
                supplier,
                nation,
                region
            where
                p_partkey = ps_partkey
               and s_suppkey = ps_suppkey
               and s_nationkey = n_nationkey
               and n_regionkey = r_regionkey
               and r_name = 'ASIA'
        )
        order by s_acctbal desc,n_name,s_name,p_partkey
)
    where rownum <= 100;

The subquery in where clause

select min(ps_supplycost) from…

involves four tables: partsupp, supplier, nation, and region.

The outer SQL statement

select s_acctbal,s_name,…

has one more table part compared to the previous subquery. The p_partkey=ps_partkey in the subquery indicates the field associated with the outer statement is p_partkey.

To make this statement easier to understand, we can create a view V:

select    *
from part,partsupp,supplier,nation,region
where
p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'ASIA'
    and p_size = 25
    and p_type like '%COPPER'

The original statement can be rewritten as:

select s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment
from V
where
    ps_supplycost = (
        select
            min(ps_supplycost)
        from
            V V1
        Where
            V.p_partkey = V1.p_partkey
    )

The original query is transformed into a single table query, which is equivalent to searching V for a certain record whose ps_supplycost value is the smallest among all records whose partkey values are the same as that of this record.

The essence of this operation is to group V by partkey and then aggregate each group so as to calculate the records with the smallest ps_supplycost in each group. However, this aggregation operation is not supported in SQL, which offers no other choice buta sub-query.

1. Data storage

This query is a traversal calculation. The part and partsupp are large and stored externally as columnar composite table; the Supplier is relatively large and also stored as composite table.

The part and partsupp are in primary-sub relationship and need to be sorted by the association primary key partkey. Since subsequent computations involve grouping larger result set, and the grouping key happens to be partkey, storing data in order by partkey not only solves the association issue but reduces the complexity of the grouping operation.

A composite table has index information that always occupies a minimum space, making it unsuitable for storing small data table. Therefore, small data tables nation and region are stored as row-based bin file, which helps decrease the disk space consumption and can be fully loaded during computation.

For smaller bin files and the composite table supplier, there is no special sorting requirement, so they are sorted by primary key by default.

Data conversion code:

	A
1	=file("region.tbl").import(; , "\|").new(_1:R_REGIONKEY, _2:R_NAME, _3:R_COMMENT).sort(R_REGIONKEY)
2	=file("region.btx").export@b(A1)
3	=file("nation.tbl").import(; , "\|").new(_1:N_NATIONKEY, _2:N_NAME, _3:N_REGIONKEY, _4:N_COMMENT).sort(N_NATIONKEY)
4	=file("nation.btx").export@b(A3)
5	=file("supplier.tbl").cursor(; , "\|").new(_1:S_SUPPKEY, _2:S_NAME, _3:S_ADDRESS, _4:S_NATIONKEY, _5:S_PHONE, _6:S_ACCTBAL, _7:S_COMMENT).sortx(S_SUPPKEY)
6	=file("supplier.ctx").create@y(#S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT)
7	>A6.append(A5)
8	=file("part.tbl").cursor(; , "\|").new(_1:P_PARTKEY, _2:P_NAME, _3:P_MFGR, _4:P_BRAND, _5:P_TYPE, _6:P_SIZE, _7:P_CONTAINER, _8:P_RETAILPRICE, _9:P_COMMENT).sortx(P_PARTKEY)
9	=file("part.ctx").create@y(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT)
10	>A9.append(A8)
11	=file("partsupp.tbl").cursor(; , "\|").new(_1:PS_PARTKEY, _2:PS_SUPPKEY, _3:PS_AVAILQTY, _4:PS_SUPPLYCOST, _5:PS_COMMENT).sortx(PS_PARTKEY,PS_SUPPKEY)
12	=file("partsupp.ctx").create@py(#PS_PARTKEY, #PS_SUPPKEY, PS_AVAILQTY,PS_SUPPLYCOST,PS_COMMENT)
13	>A12.append(A11)

The sub table partsupp must be stored in segments according to the primary table part. Therefore, A12 uses @p to create the composite table in order to ensure that parallel segments are not misaligned during association.

2. Solve association

Let’s first solve the association between tables, that is, calculate the view V mentioned earlier.

Small dimension table can be pre-loaded and pre-associated.

Script:

	A	B
1	=file("region.btx").import@b().keys@i(R_REGIONKEY)
2	=file("nation.btx").import@b().keys@i(N_NATIONKEY)
3	=file("supplier.ctx").open().import().keys@i(S_SUPPKEY)
4	=A2.switch(N_REGIONKEY,A1)	=A3.switch(S_NATIONKEY,A2)
5	>env(region,A1),env(nation,A2),env(supplier,A3)

A1-A3: Preload the dimension tables region, nation and supplier, and create the primary key with index.

A4, B4: Implement pre-association between dimension tables.

A5: Copy dimension tables to global variables for use in subsequent calculations.

The big tables part and partsupp are in primary-sub relationship and associated in an ordered merge manner.

Calculation code:

	A	B	C
1	=now()
2	>size=25	>type="COPPER"	>name="ASIA"
3	=file("part.ctx").open().cursor@m(P_PARTKEY,P_MFGR;P_SIZE==size && pos@zh(P_TYPE,type))
4	=file("partsupp.ctx").open().cursor(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_SUPPKEY:supplier;A3)
5	=A4.joinx@im(PS_PARTKEY,A3:P_PARTKEY,P_MFGR)
6	=A5.select(PS_SUPPKEY.S_NATIONKEY.N_REGIONKEY.R_NAME==name)
7	for A6,10000
8	=interval@ms(A1,now())

A3: Use the pre-cursor filtering technology mentioned in the previous article and apply the filter condition when creating cursor.

A4: Join the dimension table supplier onto the fact table partsupp, where PS_SUPPKEY:supplier is equivalent to switch@i.

A5: Merge part and partsupp in order, and join the fields of the primary table to the sub table.

A7: Loop through the cursor, retrieve data and calculate the time.

The complete solution to this query includes the loading of dimension table. In order to observe the total execution time, the following introduction will not use the preloading mechanism.

Calculation code that includes the loading of dimension table:

	A	B	C
1	=now()
2	>size=25	>type="COPPER"	>name="ASIA"
3	=file("region.btx").import@b(R_REGIONKEY,R_NAME).keys@i(R_REGIONKEY)
4	=file("nation.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY).keys@i(N_NATIONKEY)
5	=file("supplier.ctx").open().import(S_SUPPKEY,S_NAME,S_ADDRESS,S_NATIONKEY,S_PHONE,S_ACCTBAL,S_COMMENT).keys@i(S_SUPPKEY)
6	=A4.switch(N_REGIONKEY,A3)	=A5.switch(S_NATIONKEY,A4)
7	=file("part.ctx").open().cursor@m(P_PARTKEY,P_MFGR;P_SIZE==size && pos@zh(P_TYPE,type))
8	=file("partsupp.ctx").open().cursor(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_SUPPKEY:A5;A7)
9	=A8.joinx@im(PS_PARTKEY,A7:P_PARTKEY,P_MFGR)
10	=A9.select(PS_SUPPKEY.S_NATIONKEY.N_REGIONKEY.R_NAME==name)
11	for A10,10000
12	=interval@ms(A1,now())

When preloading, store the result as global variable, which may be used in multiple calculations, so all fields are taken out. Here, A3, A4 and A5 only retrieve the required fields.

Test result without preloading:

Test items	Execution time (seconds)
Solve association	12

3. Filtering of dimension table

The method in the previous section is to join the dimension table first and then filter. Because multiple records in fact table may correspond to the same record in dimension table, the filter condition on dimension table may be calculated repeatedly. If the dimension table is filtered first to retain only the required records, there is no need to repeatedly compute the filter condition on the dimension table, and the record in fact table that does not meet the condition can be directly filtered out from the cursor when associating the dimension able with the fact table, thereby reducing the amount of data read from hard disk and the time to generate objects.

Calculation code:

	A	B	C
1	=now()
2	>size=25	>type="COPPER"	>name="ASIA"
3	=file("region.btx").import@b(R_REGIONKEY,R_NAME).keys@i(R_REGIONKEY)
4	=file("nation.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
5	=A4.switch(N_REGIONKEY,A3)
6	=A4.select@o(N_REGIONKEY.R_NAME==name).keys@i(N_NATIONKEY)
7	=file("supplier.ctx").open().cursor@m(S_SUPPKEY,S_NAME,S_ADDRESS,S_NATIONKEY,S_PHONE,S_ACCTBAL,S_COMMENT;S_NATIONKEY:A6).fetch().keys@i(S_SUPPKEY)
8	=file("part.ctx").open().cursor@m(P_PARTKEY,P_MFGR;P_SIZE==size && pos@zh(P_TYPE,type))
9	=file("partsupp.ctx").open().cursor(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_SUPPKEY:A7;A8)
10	=A9.joinx@im(PS_PARTKEY,A8:P_PARTKEY,P_MFGR)
11	for A10,10000
12	=interval@ms(A1,now())

A6: Filter the dimension table nation of the supplier.

A7: Use the filtered nation to join and filter supplier, which can avoid reading inappropriate records in supplier.

A9: Use the filtered supplier to join and filter partsupp, which can avoid reading inappropriate records in partsupp.

Test result:

Test items	Execution time (seconds)
Solve association	12
Filtering of dimension table	7

4. Sequence-numberization of dimension table

If the primary key values of dimension table are natural numbers starting from 1 (which are equivalent to row numbers of records), we can locate a record in the dimension table directly through the key values, i.e., according to the row numbers, rather than by computing and comparing hash values. This method speeds up the association of a dimension table with the fact table.

Locating directly by sequence number eliminates the need to create index for dimension table, and occupies much less memory space.

Conversion code:

	A	B
1	=file("region.btx").import@b()	=A1.new(#:R_REGIONKEY,R_NAME,R_COMMENT)
2	=file("region_2.btx").export@b(B1)	=file("nation.btx").import@b()
3	=B2.new(#:N_NATIONKEY, A1.pselect(R_REGIONKEY==N_REGIONKEY):N_REGIONKEY,N_NAME,N_COMMENT)
4	=file("nation_2.btx").export@b(A3)
5	=file("supplier.ctx").open().import().keys@i(S_SUPPKEY)
6	=A5.new(#:S_SUPPKEY,S_NAME,S_ADDRESS,B2.pselect(N_NATIONKEY==S_NATIONKEY):S_NATIONKEY,S_PHONE,S_ACCTBAL,S_COMMENT)
7	=file("supplier_2.ctx").create@y(#S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT)
8	>A7.append(A6.cursor())
9	=file("part.ctx").open()	=A9.cursor().id(P_TYPE).sort()
10	=file("p_type.btx").export@b(B9)	=A9.cursor().run(P_TYPE=B9.pos@b(P_TYPE))
11	=file("part_2.ctx").create@y(#P_PARTKEY, P_NAME,P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT)
12	>A11.append@i(B10.cursor())
13	=file("partsupp.ctx").open().cursor()	=A13.run(PS_SUPPKEY=A5.pfind(PS_SUPPKEY))
14	=file("partsupp_2.ctx").create@py(#PS_PARTKEY, #PS_SUPPKEY,PS_AVAILQTY, PS_SUPPLYCOST,PS_COMMENT)
15	>A14.append(B13)

A1-A8: Convert the primary keys of the dimension tables region, nation, and supplier to sequence numbers, and convert the corresponding foreign key fields to sequence numbers as well.

B9, A10: Use the id function to get a list of the de-duplicated values of P_TYPE, etc., and then sort and store into a btx file to become a dimension table, thereby implementing the sequence-numberization of string fields.

B10-A12: Convert the foreign key values of part to sequence numbers.

A12-A15: Convert the foreign key values of partsupp to sequence numbers.

Calculation code:

	A	B
1	=now()
2	>size=25	>type="COPPER"
3	=file("region_2.btx").import@b(R_REGIONKEY,R_NAME)
4	=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
5	=file("p_type.btx").import@b().(_1)	=file("supplier_2.ctx").open().import()
6	=A4.switch(N_REGIONKEY,A3:#)	=B5.switch(S_NATIONKEY,A4:#)
7	=file("part.ctx").open().cursor@m(P_PARTKEY,P_MFGR;P_SIZE==size && pos@zh(P_TYPE,type))
8	=file("partsupp_2.ctx").open().cursor(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_SUPPKEY:B6:#;A7)
9	=A8.joinx@im(PS_PARTKEY,A7:P_PARTKEY,P_MFGR)
10	=A9.select(PS_SUPPKEY.S_NATIONKEY.N_REGIONKEY.R_NAME==name)
11	for A10,10000
12	=interval@ms(A1,now())

A3, A4, A5: For dimension tables whose primary key values have been sequence-numberized, there is no need to create index when loading.

A8: Use the sequence number of primary key of dimension table to implement foreign key join.

In order to observe the performance difference with “Solve association”, we still use the method of joining first and then filtering here.

Test result:

Test items	Execution time (seconds)
Solve association	12
Sequence-numberization of dimension table	10

5. Aligned sequence

For the foreign key that has been sequence-numberized, we can also use aligned sequence to handle filtering on dimension table.

Calculation code:

	A	B	C
1	=now()
2	>size=25	>type="COPPER"	>name="ASIA"
3	=file("region_2.btx").import@b(R_REGIONKEY,R_NAME)
4	=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
5	=file("p_type.btx").import@b().(pos@zh(_1,type))
6	=A4.switch(N_REGIONKEY,A3:#)	=A6.(if(N_REGIONKEY.R_NAME==name,~,null))
7	=file("supplier_2.ctx").open()	=A7.cursor@m().skip().(null)
8	=A7.cursor@m(;S_NATIONKEY:B6:#).fetch()	=A8.(B7(S_SUPPKEY)=~)
9	=file("part_2.ctx").open().cursor@m(P_PARTKEY,P_MFGR;P_SIZE==size && A5(P_TYPE))
10	=file("partsupp_2.ctx").open().cursor(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_SUPPKEY:B7:#;A9)
11	=A10.joinx@im(PS_PARTKEY,A9:P_PARTKEY,P_MFGR)
12	for A11,10000
13	=interval@ms(A1,now())

A5 will generate a sequence with the same length as the dimension table p_type. When the member of the sequence meets the filter condition, it is the sequence number in p_type, otherwise it is null. Since P_TYPE in part is already sequence-numberized and its value corresponds to the row number of record in p_type, A9 can use A5 (P_TYPE) to determine whether a row of data in part meets the filter condition.

B6 and B7 are also aligned sequence. When the filter condition is met, the member is the original record; otherwise, it is null.

Because there is no need to do actual search, aligned sequence can achieve very good performance and is very effective when handling dimension table filtering.

Test result:

Test items	Execution time (seconds)
Solve association	12
Filtering of dimension table	7
Sequence-numberization of dimension table	10
Aligned sequence	4

6. Understanding of aggregation

At this point, association issue has been handled. Next, we will group the association result and calculate the record with the smallest ps_supplycost value in each group.

Calculation code:

	A	B	C
1	=now()
2	>size=25	>type="COPPER"	>name="ASIA"
3	=file("region_2.btx").import@b(R_REGIONKEY,R_NAME)
4	=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
5	=file("p_type.btx").import@b().(pos@zh(_1,type))
6	=A4.switch(N_REGIONKEY,A3:#)	=A6.(if(N_REGIONKEY.R_NAME==name,~,null))
7	=file("supplier_2.ctx").open()	=A7.cursor@m().skip().(null)
8	=A7.cursor@m(;S_NATIONKEY:B6:#).fetch()	=A8.(B7(S_SUPPKEY)=~)
9	=file("part_2.ctx").open().cursor@m(P_PARTKEY,P_MFGR;P_SIZE==size && A5(P_TYPE))
10	=file("partsupp_2.ctx").open().cursor(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_SUPPKEY:B7:#;A9)
11	=A10.joinx@im(PS_PARTKEY,A9:P_PARTKEY,P_MFGR)
12	=A11.groups(PS_PARTKEY;P_MFGR,minp@a(PS_SUPPLYCOST):m)
13	=A12.news(m;PS_PARTKEY,m.PS_SUPPKEY.S_ACCTBAL,m.PS_SUPPKEY.S_NAME,m.PS_SUPPKEY.S_NATIONKEY.N_NAME, P_MFGR,m.PS_SUPPKEY.S_ADDRESS, m.PS_SUPPKEY.S_PHONE,m. PS_SUPPKEY.S_COMMENT)
14	=A13.top(100;-S_ACCTBAL,N_NAME,S_NAME,PS_PARTKEY)
15	=interval@ms(A1,now())

A12: Group by PS_PARTKEY, and use the aggregate function minp to find the record with the smallest value in each group, rather than the smallest value itself. Since the grouping result is not large, the groups function here directly calculates out the result table sequence in a cumulative way.

A14: Use the top() function to get the top 100. The top function is also an aggregate function, which can get the top 100 without full sorting.

Test result:

Test items	Execution time (seconds)
Understanding of aggregation	4.5

7. Ordered grouping

After merging part and partsupp in order, the result is still ordered by primary key, which makes it possible to implement ordered grouping. Ordered grouping can return a cursor and is suitable for scenarios where the result set is large.

Calculation code:

	A	B	C
1	=now()
2	>size=25	>type="COPPER"	>name="ASIA"
3	=file("region_2.btx").import@b(R_REGIONKEY,R_NAME)
4	=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
5	=file("p_type.btx").import@b().(pos@zh(_1,type))
6	=A4.switch(N_REGIONKEY,A3:#)	=A6.(if(N_REGIONKEY.R_NAME==name,~,null))
7	=file("supplier_2.ctx").open()	=A7.cursor@m().skip().(null)
8	=A7.cursor@m(;S_NATIONKEY:B6:#).fetch()	=A8.(B7(S_SUPPKEY)=~)
9	=file("part_2.ctx").open().cursor@m(P_PARTKEY,P_MFGR;P_SIZE==size && A5(P_TYPE))
10	=file("partsupp_2.ctx").open().cursor(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_SUPPKEY:B7:#;A9)
11	=A10.joinx@im(PS_PARTKEY,A9:P_PARTKEY,P_MFGR)
12	=A11.group@s(PS_PARTKEY;P_MFGR,minp@a(PS_SUPPLYCOST):m)
13	=A12.news(m;PS_PARTKEY,m.PS_SUPPKEY.S_ACCTBAL,m.PS_SUPPKEY.S_NAME,m.PS_SUPPKEY.S_NATIONKEY.N_NAME, P_MFGR,m.PS_SUPPKEY.S_ADDRESS, m.PS_SUPPKEY.S_PHONE,m. PS_SUPPKEY.S_COMMENT)
14	=A13.total(top(100;-S_ACCTBAL,N_NAME,S_NAME,PS_PARTKEY))
15	=interval@ms(A1,now())

A12: Use the order-based grouping function group() to get the result set as cursor. Since ordered grouping only needs to compare adjacent records, it consumes less memory space and avoids hash value computation & comparison. @s means to aggregate in a cumulative way.

group@s returns a cursor. Evenifthe result set is large,memory overflow will not occur.

Since the group() in A12 returns a cursor, A14 needs to use the total function of cursor together with the aggregate function top() to calculate results.

Test result:

Test items	Execution time (seconds)
Understanding of aggregation	4.5
Ordered grouping	4.5

Because the data is actually ordered by the grouping field, the effect of ordered grouping is not obvious when performing cumulative grouping.

8. Ordered grouping - small result set

When the grouping result is not large, we can use groups@o to implement ordered grouping.

Calculation code:

	A	B	C
1	=now()
2	>size=25	>type="COPPER"	>name="ASIA"
3	=file("region_2.btx").import@b(R_REGIONKEY,R_NAME)
4	=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
5	=file("p_type.btx").import@b().(pos@zh(_1,type))
6	=A4.switch(N_REGIONKEY,A3:#)	=A6.(if(N_REGIONKEY.R_NAME==name,~,null))
7	=file("supplier_2.ctx").open()	=A7.cursor@m().skip().(null)
8	=A7.cursor@m(;S_NATIONKEY:B6:#).fetch()	=A8.(B7(S_SUPPKEY)=~)
9	=file("part_2.ctx").open().cursor@m(P_PARTKEY,P_MFGR;P_SIZE==size && A5(P_TYPE))
10	=file("partsupp_2.ctx").open().cursor(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_SUPPKEY:B7:#;A9)
11	=A10.joinx@im(PS_PARTKEY,A9:P_PARTKEY,P_MFGR)
12	=A11.groups@o(PS_PARTKEY;P_MFGR,minp@a(PS_SUPPLYCOST):m)
13	=A12.news(m;PS_PARTKEY,m.PS_SUPPKEY.S_ACCTBAL,m.PS_SUPPKEY.S_NAME,m.PS_SUPPKEY.S_NATIONKEY.N_NAME, P_MFGR,m.PS_SUPPKEY.S_ADDRESS, m.PS_SUPPKEY.S_PHONE,m. PS_SUPPKEY.S_COMMENT)
14	=A13.top(100;-S_ACCTBAL,N_NAME,S_NAME,PS_PARTKEY)
15	=interval@ms(A1,now())

A12: Use groups@o, which means using the ordered grouping algorithm to compare only adjacent grouping fields to directly get a result table sequence.

A14: Use the top() function of table sequence to get the top 100.

Test result:

Test items	Execution time (seconds)
Understanding of aggregation	4.5
Ordered grouping	4.5
Ordered grouping - small result set	4

9. Associationandlocation

The part and partsupp are associated by primary key. Searching for a record in the associated table through the primary key of the filtered table can avoid traversing the whole table and achieving better performance. Let’s first look at the situation of associating andlocating the sub table through the primary table.

Calculation code:

	A	B	C
1	=now()
2	>size=25	>type="COPPER"	>name="ASIA"
3	=file("region_2.btx").import@b(R_REGIONKEY,R_NAME)
4	=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
5	=file("p_type.btx").import@b().(pos@zh(_1,type))
6	=A4.switch(N_REGIONKEY,A3:#)	=A6.(if(N_REGIONKEY.R_NAME==name,~,null))
7	=file("supplier_2.ctx").open()	=A7.cursor@m().skip().(null)
8	=A7.cursor@m(;S_NATIONKEY:B6:#).fetch()	=A8.(B7(S_SUPPKEY)=~)
9	=file("part_2.ctx").open().cursor@m(P_PARTKEY,P_MFGR;P_SIZE==size && A5(P_TYPE))
10	=file("partsupp_2.ctx").open().news(A9,P_PARTKEY,P_MFGR,PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_SUPPKEY:B7:#)
11	=A10.groups@o(PS_PARTKEY;P_MFGR,minp@a(PS_SUPPLYCOST):m)
12	=A11.news(m;PS_PARTKEY,m.PS_SUPPKEY.S_ACCTBAL,m.PS_SUPPKEY.S_NAME,m.PS_SUPPKEY.S_NATIONKEY.N_NAME, P_MFGR,m.PS_SUPPKEY.S_ADDRESS, m.PS_SUPPKEY.S_PHONE,m. PS_SUPPKEY.S_COMMENT)
13	=A12.top(100;-S_ACCTBAL,N_NAME,S_NAME,PS_PARTKEY)
14	=interval@ms(A1,now())

A10: Take the record in the associated sub table partsupp through the filtered primary table part. Because the primary table and sub tables are in one-to-many relationship, it needs to use news() here, indicating that each record taken from the primary table may correspond to multiple sub-table records. At this time, the field of primary table will be duplicated according to the number of the records of the associated sub tables, which is equivalent to executing joinx().

We can also associate and locate the primary table through sub table.

Calculation code:

	A	B	C
1	=now()
2	>size=25	>type="COPPER"	>name="ASIA"
3	=file("region_2.btx").import@b(R_REGIONKEY,R_NAME)
4	=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
5	=file("p_type.btx").import@b().(pos@zh(_1,type))
6	=A4.switch(N_REGIONKEY,A3:#)	=A6.(if(N_REGIONKEY.R_NAME==name,~,null))
7	=file("supplier_2.ctx").open()	=A7.cursor@m().skip().(null)
8	=A7.cursor(;S_NATIONKEY:B6:#).fetch()	=A8.(B7(S_SUPPKEY)=~)
9	=file("partsupp_2.ctx").open().cursor@m(PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;PS_SUPPKEY:B7:#)
10	=file("part_2.ctx").open().new@r(A9,P_PARTKEY,P_MFGR,PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;P_SIZE==size && A5(P_TYPE))
11	=A10.groups@o(PS_PARTKEY;minp@a(PS_SUPPLYCOST):m)
12	=A11.news(m;PS_PARTKEY,m.PS_SUPPKEY.S_ACCTBAL,m.PS_SUPPKEY.S_NAME,m.PS_SUPPKEY.S_NATIONKEY.N_NAME, P_MFGR,m.PS_SUPPKEY.S_ADDRESS, m.PS_SUPPKEY.S_PHONE,m. PS_SUPPKEY.S_COMMENT)
13	=A12.top(100;-S_ACCTBAL,N_NAME,S_NAME,PS_PARTKEY)
14	=interval@ms(A1,now())

The new() function in A10 takes the records in the associated primary table through the filtered sub table. The @r option means that when there are multiple records with the same primary key in the sub table, the record in the primary table will be duplicated.

Test result:

Test items	Execution time (seconds)
Understanding of aggregation	4.5
Ordered grouping	4.5
Ordered grouping – Cumulative	4
Association and location - Locate the sub table through primary table	3.5
Association and location - Locate the primary table through sub table	4

The condition of primary table will filter out more records, while sub table will filter out fewer. Therefore, the association and location that filters the sub table through primary table has better performance.

10. Column-wise computing

Column-wise computing mentioned in the previous query can also be used in this query.

Calculation code:

	A	B	C
1	=now()
2	>size=25	>type="COPPER"	>name="ASIA"
3	=file("region_2.btx").import@b(R_REGIONKEY,R_NAME)
4	=file("nation_2.btx").import@b(N_NATIONKEY,N_NAME,N_REGIONKEY)
5	=file("p_type.btx").import@b().(pos@zh(_1,type))
6	=A4.switch(N_REGIONKEY,A3:#)	=A6.(if(N_REGIONKEY.R_NAME==name,~,null))
7	=file("supplier_2.ctx").open()	=A7.cursor@m().skip().(null)
8	=A7.cursor(;S_NATIONKEY:B6:#).fetch()	=A8.(B7(S_SUPPKEY)=~)
9	=file("part_2.ctx").open().cursor@mv(P_PARTKEY,P_MFGR;P_SIZE==size && A5(P_TYPE))
10	=file("partsupp_2.ctx").open()
11	=A10.news(A9,P_PARTKEY,P_MFGR,PS_PARTKEY,PS_SUPPKEY,PS_SUPPLYCOST;B7(PS_SUPPKEY))
12	=A11.groups@o(PS_PARTKEY;minp@a(PS_SUPPLYCOST):m)
13	=A12.(m).conj().derive@o(B7(PS_SUPPKEY):s)
14	=A13.new(P_PARTKEY,s.S_ACCTBAL,s.S_NAME,s.S_NATIONKEY.N_NAME, P_MFGR,s.S_ADDRESS,s.S_PHONE,s.S_COMMENT)
15	=A14.top(100;-S_ACCTBAL,N_NAME,S_NAME,P_PARTKEY)
16	=interval@ms(A1,now())

In column-wise computing, it is advisable to avoid using switch() function. Therefore, instead of using PS_SUPPKEY: B7:# that performs association and filtering simultaneously, A11 uses B7 (PS_PARTKEY) to filter data only. The reason why there is no need for association here is that PS_PARTKEY has already been sequence-numberized and records can be directly retrieved from supplier by row number, which better reflects the advantages of column-wise computing.

A13 first finds the corresponding record in supplier through PS_SUPPKEY to generate a column s, and then A14 uses the new() function to search for the data of target column to be displayed through column s.

Test result:

Test items	Execution time (seconds)
Understanding of aggregation	4.5
Ordered grouping	4.5
Ordered grouping - Cumulative	4
Association and location – Associate the primary table through sub table	3.5
Association and location – Associate the sub table through primary table	4
Column-wise computing	2.5

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/cFTcUNs7

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

chenhu • 22 View • 1 Days ago

Performance optimization case course: TPCH-Q2

1. Data storage

2. Solve association

3. Filtering of dimension table

4. Sequence-numberization of dimension table

5. Aligned sequence

6. Understanding of aggregation

7. Ordered grouping

8. Ordered grouping - small result set

9. Associationandlocation

10. Column-wise computing

ToC