SQL Performance Enhancement: Select the Earliest Records in each Group

 

[Abstract]

This essay analyzes the fundamental reason why such SQL statements run slowly and provides a solution for improving their performance, illustrated with code samples. Click SQL Performance Enhancement: Select the Earliest Records in each Group for details.

 

Problem description

Selecting the earliest records in each group is an in-group time-series operation. For example, when analyzing user behaviors, we need to take the earliest record for each user ID and session number; when analyzing stocks, we need to retrieve each stock's records on its first trading day. The total data volume involved in such in-group time-series calculations is usually quite large and needs to be stored externally, while the data of each single group is relatively small and can be held in memory.

 

Suppose the grouping field of table T is gid and the time field is time1. The calculation requirement is: group the records of table T that satisfy a certain condition (the condition varies with the specific situation and may include given parameter variables) by gid, and find the earliest record of each group.

 

SQL1: for databases that do not support window functions

select T1.gid, T1.time1, …
from T T1
where not exists (select 1 from T T2
        where T2.gid = T1.gid and T2.time1 < T1.time1 and …)
    and …

The database usually converts EXISTS to a JOIN to avoid repeated traversal, which performs well. However, the EXISTS in SQL1 involves a non-equivalence comparison (time1 <), so it cannot be converted to a JOIN. While traversing the filtered data of table T, the database has to search for records with the same gid and an earlier time for every single record, which results in poor performance.
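
For reference, the usual manual workaround is to rewrite the query as a join with a min() aggregation, a sketch of which follows (column lists abbreviated with … as in SQL1). Note that the aggregation subquery still has to group the whole filtered table by gid, which runs into the same big-group problem discussed under SQL2 below.

select T1.gid, T1.time1, …
from T T1
join (select gid, min(time1) mintime
      from T
      where …
      group by gid) T2
  on T1.gid = T2.gid and T1.time1 = T2.mintime
where …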

 

SQL2: for databases that support window functions

select * from
    (select gid, time1,
        row_number() over (partition by gid order by time1 asc) rn
     from T
     where …) t
where rn = 1

Generally speaking, the result set produced by grouping the filtered table T by gid tends to be big. Grouping a big result set requires buffering on external storage, which causes poor computation performance.

 

Solution

1. Presorting and order-based algorithm

We first presort the original data by the grouping field. The algorithm then traverses the ordered data that satisfies the condition, reads each group into memory in turn, and retrieves the record with the earliest time field (see code sample 1 below). This avoids both searching for every record and grouping a big result set, thus improving performance greatly.

 

We can also sort the original data by both the grouping field and the time field beforehand, which spares the re-sorting of each group. This brings a smaller extra performance gain but noticeably simplifies the code (see code sample 2 below).

 

In fact, the grouping fields of in-group time-series calculations are usually fixed (such as the user ID or the account number) rather than randomly chosen. Sorting the data by these fixed fields therefore benefits many kinds of in-group time-series calculations. Other in-group time-series calculations differ only in their specific algorithms, which will be introduced separately.

 

Presorting is slow, but it is a one-off cost, and it requires only one form of storage, without redundant data.

 

SQL, being based on unordered sets, can guarantee neither that the data of each group is stored contiguously in order nor that the records of each group can be read directly in time order. Therefore, order-based algorithms cannot be applied directly; extra computation is needed to ensure the retrieved data is in time order.

 

2. Newly-added data

Newly-added data is not always ordered by the grouping field, so it cannot simply be appended to the end of the already ordered data. Yet directly re-sorting all the existing data together with the new data would be very time-consuming.

 

Instead, we can sort the newly-added data first and then merge it with the already ordered data using a low-cost order-based MERGE algorithm to generate a new ordered data table. The overall complexity amounts to reading and writing the whole data set once, which avoids the frequent temporary external buffering of an ordinary full sort and thus improves performance greatly, as the sketch below illustrates.
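
A minimal SPL sketch of this order-based merge might look as follows (assuming the new data is in T_new.btx, as in code sample 4 below; the output name T_merged.ctx is hypothetical, and field lists are abbreviated with … as in the code samples):

     A
1    =file("T_new.btx").cursor@b().sortx(gid,time1)
2    =file("T.ctx").open().cursor()
3    =[A2,A1].mergex(gid,time1)
4    =file("T_merged.ctx").create(#gid,#time1,…)
5    =A4.append@i(A3)

A1 sorts the small batch of new data; A3 merges the two ordered cursors at low cost; A4 and A5 write the merged result into a new ordered composite table, so the whole job is one pass of reading and writing.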

 

Furthermore, we can additionally keep a small ordered table (hereinafter called the patch data) that absorbs the newly-added data through merging, while the old ordered data stays unchanged. Only after a while, when the patch data has accumulated to a suitable size, is it merged with the old ordered data. When performing an in-group time-series calculation, the algorithm reads from the old ordered data and the patch data respectively, merges them, and then calculates, as sketched below. This is slightly slower than handling a single ordered table, but the orderliness of the data still makes the computation much faster.
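
A sketch of this read-merge-calculate flow in SPL, assuming the patch data is kept in a separate composite table with the hypothetical name T_patch.ctx and both tables are ordered by gid and time1 (code sample 4 below instead lets one composite table manage the patch zone through append@a and reset):

     A
1    =file("T.ctx").open().cursor(gid,time1,…;…)
2    =file("T_patch.ctx").open().cursor(gid,time1,…;…)
3    =[A1,A2].mergex(gid,time1)
4    =A3.group@1(gid)

Because A3 yields a cursor that is still ordered by gid and time1, A4 can take the first record of each group directly, as in code sample 2.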

 

When to merge the patch data into the old ordered data depends on how often new data arrives. If new data is added every day, we can do the merge once a month: the patch data then holds at most one month of data, while the old ordered data contains all data older than one month. That is, the patch data may be far smaller than the old ordered data, so the amount of data merged every day is small and the daily append is fast. As the whole-data order-based merge runs only once a month, it is fairly acceptable that it takes a little longer to complete.

 

Code sample

1. Retrieve the earliest record of each group when the data is ordered by the grouping field but not by the time field.

 


     A
1    =file("T.ctx").open().cursor(gid,time1,…;…)
2    =A1.group(gid).(~.minp(time1))
3    …

A1: open the composite table T.ctx and create a cursor on it. The where condition is written after the semicolon.

A2: group by gid, retrieve the record with the smallest time1 in each group, and return the result as a cursor.

A3: continue to do other calculations on the cursor.

 

Here the aggregation function minp is used; when the data of each group is not big, it gives better performance and simpler code than sorting each group and taking the first record.
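
For comparison, a sort-based version of A2 would sort each group by time1 and take its first record, a sketch of which is:

=A1.group(gid).(~.sort(time1)(1))

This sorts every group in full only to keep one record, whereas minp scans each group just once.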

 

2. Retrieve the earliest record of each group when the data is ordered by both the grouping field and the time field.


     A
1    =file("T.ctx").open().cursor(gid,time1,…;…)
2    =A1.group@1(gid)
3    …

A2: group by gid, take the first record of each group, and return the result as a cursor.

A3: continue to do other calculations on the result cursor.

 

Compared with the foregoing code, this code is noticeably simpler, thanks to the data being ordered by both the grouping field and the time field in advance.

 

3. Presort the data and store the ordered result set.


     A
1    =file("T-original.ctx")
2    =A1.open().cursor().sortx(gid,time1).new(gid,time1,…)
3    =file("T.ctx").create(#gid,#time1,…)
4    =A3.append@i(A2)

A1: define the original composite table T-original.ctx.

A2: open the composite table, create a cursor on it, sort the cursor by the fields gid and time1, and return a new cursor with gid and time1 in the leading positions. The sortx can be written as sortx(gid) if the data only needs to be ordered by gid.

A3: create a new composite table T.ctx. The pound sign # preceding a field name means the composite table is ordered by that field, here by gid and time1. The create can be written as create(#gid,time1,…) if the table is only ordered by gid.

A4: write the ordered data into the composite table T.ctx.

 

4. Append the newly-added data

Suppose the daily newly-added data for table T is stored in T_new.btx, with the same field names in the same order as T.ctx.


     A                                                    B
1    =file("T.ctx").open()
2    =file("T_new.btx").cursor@b().sortx(gid,time1)
3    if (day(now())==1)                                   >A1.reset()
4    >A1.append@a(A2)

A1: open the composite table T.ctx.

A2: define a cursor on T_new.btx and sort it. As the daily volume of new data is small, the sorting is a fast in-memory operation even though the sortx function is used. The time1 in sortx should be removed if the data is only ordered by gid.

A3: check whether the current date is the first day of the month. If not, only A4 executes, and append@a merges the new records with just the patch data; if so, B3 executes first, and the reset function merges the existing ordered data with the patch data to generate a new ordered data table.