SQL Performance Enhancement: Multiple Grouping & Aggregate Operations on a Big Table

 

Abstract
The essay analyzes the fundamental reason behind slow SQL statements and provides the solution to make improvements through code samples.

 

Problem description

A full-table traversal is needed to group and summarize a data table. There will be n traversals when n grouping & aggregate operations are specified. When a table is too large, it takes a lot of time to load data from an external storage device. This results in I/O bottleneck from which any big data processing algorithm will suffer.

For instance, the following three grouping & aggregate operations require three SQL statements:

select sum(x) from T group by a,b

select max(y) from T group by c,d

select avg(y),min(z) from T group by a,c

Even if they are UNIONed, three traversals on T table are needed.

Solution

If the multiple grouping & aggregate operations are completed in a single traversal, the amount of external storage access will be reduced and processing becomes faster by reusing the retrieved data during the traversal.

A big data grouping & aggregate operation is achieved through the cursor. During the traversal of data in a cursor, each segment of retrieved data is grouped and summarized and accumulated to the result set. For the multiple grouping & aggregate calculations, each will be performed and accumulated to the corresponding result set. In this way, we can get multiple grouping & aggregate operations done in one traversal.

Sample code


A

B

1

=file("T1.ctx").open()


2

=A1.cursor(a,b,c,d,x,y,z)


3

cursor A2

=A3.groups(a,b;sum(x))

4

cursor

=A4.groups(c,d;max(y))

5

cursor

=A5.groupx(a,c;avg(y),min(z))

6



A1: Open a composite table.

A2: Create a cursor and be ready to retrieve fields needed in the multiple grouping & aggregate calculations.

A3: Define a channel for A2’s cursor using key word cursor. B3 configures the first grouping & aggregate operation on A3’s channel.

In the same way, A4 and A5 define two channels respectively and B4 and B5 configure the second and the third grouping & aggregate operations on A4 and A5’s channels respectively.

A6: Until all cursor statements or code blocks are executed that SPL considers that the whole set of channels are defined and begins traversal of the cursor. Data retrieved each time will be sent to the three channels respectively to perform the three grouping & aggregate operations separately. When the traversal is over, the result in each channel will be stored in the cell containing the corresponding cursor statement.

A5’s channel is special. It has a relatively large grouping result set and thus employs the external grouping algorithm and returns a cursor.