SQL Performance Enhancement: Segmentation-based Grouping & Aggregation

 

The essay analyzes the fundamental reason behind slow SQL statements and provides the solution to make improvements through code samples. Looking SQL Performance Enhancement: Segmentation-based Grouping & Aggregation for details.

 

Problem description

Suppose we divide data table T by field x according to set X={X1<X2<…<Xnand perform a specific aggregate operation by segment number. The beginning segment (the 0th one) meets the condition x<X1, the ith (0<i<n) segment meets the condition represented by the left-closed and right-open interval [Xi<x<Xi+1), and the nth segment satisfies the condition x>=Xn. Certain segments are defined by the right-closed and left-open intervals.

Usually, the SQL for performing segmentation and summarization uses CASE WHEN statement:

select xgroup,sum(x),avg(x) from

       (select     case

                     when x< X1 then 0

                     when x< X2 then 1

                     …

                     else n

              end as xgroup,

              x

       from T)

       group by xgroup

The CASE WHEN statement checks conditions one by one in turn and stops if the current record meets the condition; and moves on to the next condition if the current one does not satisfy it. The x field of each record needs to be matched multiple times to finally decide which segment it belongs to. The number of matches is one at least and n-1 at most. The average number of matches of the x field is n/2 and the degree of complexity is O(n). When the future number of segments is big and the total volume of data is large, the overall performance of this algorithm is bad.

When X1,X2,…,Xn is dynamically changed (a parameter passed in at the front-end, for instance), it is hard to express it in SQL, let alone performance optimization.

Solution

Since set X is ordered, we can use binary search to speed up the segmentation process. The degree of complexity is O(log2n). The algorithm is much faster than the order-based segmentation when n is relatively large. Yet, it can be slower when n is small because of its complexity. Usually when n is greater or equal to 8, the binary search algorithm is really efficient. There are generally several scores of members in set X. Even it is unordered, we can first perform an in-memory sorting and the overall performance is almost not affected.

Moreover, the binary search algorithm lets set X to be easily passed in as a parameter for dynamic segmentation.

Sample code

Define parameter argX first to pass in the sequence of segments, like  [0,100,230,450,480,566,798…], specified by X. Sort the sequence parameter if it is unordered.


A

1

=arg_X.sort()

2

=file("T.ctx").open().cursor(x)

3

=A2.groups(A1.pseg(x):xgroup;sum(x),avg(x))

A1: Pass in the defined parameter and sort it out.

A2: Open the specified composite table and create a cursor based on its x field.

A3: Since the number of segments is small, we use the small result set method to perform grouping and sum and calculate the average on x field. The pseg function is used as the grouping expression where the binary search is employed to calculate the segment number in which x field falls in A1.