Performance Optimization - 5.3 [Traversal technology] Program cursor

 

Performance Optimization - 5.2 [Traversal technology] Ordered grouped subsets

Let’s continue to use above example. Now we want to find out those records where transactions have occurred for n consecutive days in each month, and then count the transaction amount by the day of the week of the occurrence date.

The latter task is very simple, which is a common grouping and aggregating operation. However, the former task is a little troublesome. Even if the data table has been sorted by account and date, to perform this complex operation, we still need to take out the grouped subsets first, and then write a few lines of code to filter out the result. After that, these records will be stored in memory, then how can we proceed to the next step to perform the grouping and aggregating operation?

An easy way to think of is to gradually write the calculated data into a buffer file, and then group and aggregate this file:

A B
1 =file(“trades.ctx”).open().cursor(id,dt,amount)
2 for A1;id =A2.align@a(31,day(dt)).group@o(~==[])
3 =B2.select(~.len()>=n ).conj().conj()
4 =file(“temp.btx”).export@ab(B3,dt,amount)
5 =file(“temp.btx”).cursor@b().groups(day@w(dt);sum(amount))

A2 takes out each grouped subset; B2 aligns the subset to 31 days according to the transaction date, and then splits them into continuously empty and non-empty subsets using the ordered grouping operation; In B3, find out the subsets whose time span exceeds n, at this point, we can get the records that transactions have occurred or have not occurred for n consecutive days, and then union these records, by this time, we can finally obtain the record that transactions have occurred for n consecutive days (transactions not occurred in n consecutive days are empty sets, which will not change the union result). Note that we need to conj twice here, because the result of align@a is a sequence of the sequences.

After the calculation, write the result into a temporary file. We only need to write two fields, and finally perform the grouping and aggregating operation.

Obviously, this calculation process will be very slow as the result of an action of writing and reading, because we need to write the intermediate data into a buffer file. In fact, these data can be directly used for grouping and aggregating operation, and there is no need to write them into external storage. However, since the grouping functions can only be based on table sequence or cursor, if we hard-code each batch of data to achieve grouping and aggregating, it will be too troublesome.

SPL provides the program cursor that allows us to implement this mechanism, that is, simulate the data generated in the loop as a cursor.

A B C
1 Func =file(“trades.ctx”).open().cursor(id,dt,amount)
2 for A1;id =A2.align@a(31,day(dt)).group@o(~==[])
3 return B2.select(~.len()>=n ).conj().conj()
4 =cursor@c(A1).groups(day@w(dt);sum(amount))

By defining a subprogram, the required records can be calculated and returned in the loop of this subprogram. The cursor@c()will collect the returned values and concatenate them into a cursor. When we fetch data from the cursor (such as groups() here), the cursor() function will execute the subprogram and collect the returned values. Once the collected values are sufficient to meet the number requested in this fetch, the execution of the subprogram will be suspended, and return result to this fetch, but the subprogram will not be closed. When we need to fetch data next time, the subprogram will continue running until the whole loop is over, at this point, the cursor() function will also return the signal indicating the cursor ends.

This process can concatenate the data continuously calculated in the loop as a cursor, and the intermediate data do not have to be written into a file, as a result, the operation of this complex process can also obtain higher performance. This kind of cursor is called program cursor.

As we discussed earlier, SPL provides a hash method for big grouping, but it does not provide a similar sort algorithm. We can use the mechanism of program cursor to implement a rough one, such as sorting the order table by order amount:

A B C
1 func =file(“orders.btx”).cursor@b() =100.(file(~))
2 =B1.groupn(int(amount/100)+1;C1) >B1.skip()
3 for C1 return B3.import@b().sort(amount)
4 return cursor@c(A1)

In B2, the orders are split into 100 parts by amount (here, we assume that the order amounts are basically evenly distributed in the range of 0-10000, and you can adjust the split method according to actual situation. It is necessary to ensure that the split expression and the field values to be sorted are monotonically nondecreasing or monotonically nonincreasing, and the number of records corresponding to each split value is small so that it can be loaded into memory). Then, we just need to return the sorting results of each part in order, the cursor@c() function will collect these returned values and concatenate them into a cursor.


Performance Optimization - 5.4 [Traversal technology] First-half ordered grouping
Performance Optimization - Preface