Performance Optimization - 4.2 [Traversal technology] Multipurpose traversal

 

We know that the data reading time accounts for a large proportion in the traversal operation of the external storage data table. Since reading is unavoidable, we hope to get as many things as possible during one time reading, which means, the data read out in the traversal process can be used to the maximum extent.

For example, when we intend to group and aggregate the orders, hoping to count sales by product, then finding out the maximum order amount in each region, such two tasks with different grouping keys cannot be written in one grouping operation. A simple method to write will be:

A
1 =file(“orders.ctx”).open()
2 =A1.cursor(product,amount).groups(product;sum(amount))
3 =A1.cursor(area,amount).groups(area;max(amount))

This method will traverse the data table twice, and the amount field will be repeatedly read (assuming it is columnar storage, and more contents will be repeatedly read in case of row-based storage).

Actually, we can calculate both grouped results in one traversal with some techniques.

A
1 =file(“orders.ctx”).open()
2 =A1.cursor(area,product,amount).groups(area,product;sum(amount):samount,max(amount):mamount)
3 =A2.groups(product;sum(samount))
4 =A2.groups(area;max(mamount))

In this way, more amount of computation will be performed by CPU, and more memory will be occupied as it needs to calculate and keep a more detailed grouped result set. One advantage of this method is that a better performance can usually be obtained due to much less traversal. However, if we need to do different grouping and aggregating for more fields, the code will be much more cumbersome. If we want to do some further non-simple-grouping operations to the cursor, such as filtering followed by grouping, it is almost impossible to write with this technique.

Using SQL in the database will face this problem.

SPL provides multipurpose traversal technology to solve this kind of problem. The above operation can be written as follows:

A
1 =file(“orders.ctx”).open()
2 =A1.cursor(product,area,amount)
3 =channel(A2).groups(area;max(amount))
4 =A2.groups(product;sum(amount))
5 =A3.result()

A3 uses the channel() function to define a channel synchronized with cursor A2, on which an operation (also a group) is attached. When traversing the cursor in A4 to calculate the group, the read data will be concurrently sent to the just-mentioned operation (a group on the channel attached by A3). After traversal, the corresponding calculation results will be kept in the channel and can be taken out in A5.

In comparison with the previous code traversed twice, the computation amount of CPU in this code is the same, and only two small grouping are performed for both codes. However, the reading amount of hard disk is much less in this code, and the amount field is read only once.

A cursor can define multiple synchronous channels and attach multiple sets of operations at the same time. Moreover, such operations can be written at will. Besides the grouping, it can also be written in multiple steps. For example, A3 can be written as:

=channel(A2).select(amount>=50).groups(area;max(amount))

It can count the orders with an amount of over 50.

This mechanism works for all cursors, not limiting to composite table.

SPL also provides a statement-pattern channel syntax, such code looks more neat:

A B
1 =file(“orders.ctx”).open()
2 =A1.cursor(product,area,amount)
3 cursor A2 =A3.groups(area;max(amount))
4 cursor =A4.groups(product;sum(amount))
5 cursor =A5.select(amount>=50).total(count(1))
6

After the cursor is created, use the cursor statement to create a channel for it, and attach operations on the channel. Multiple channels can be created. If the cursor parameters are not written in the subsequent statements, it indicates the same cursor will be used. After all cursor statements (code blocks) are written, SPL will consider that all channels have been defined completely, and will start traversing the cursor, calculate the operation results of each channel and store them in the cell where the cursor statement is located.

Here, B3 and B4 respectively define corresponding operation targets, and B5 adds the number of orders with a calculated amount of more than 50. These calculation results will be placed in A3, A4, and A5 respectively (note that it is not B3, B4, B5).

The idea of multi-purpose traversal can also be applied to data split. For example, there is a large text file from which we hope to pick out the compliant data (meeting the given conditions) to perform further analysis, in this case, using the select()function of the cursor is OK. Furthermore, we may also hope to know what non-compliant data (failing to meet the conditions) are, in order to prevent this from happening again. Since one-time filtering principle cannot separate the records that meet and do not meet the conditions at the same time, the multi-purpose traversal technique can now be used.

A
1 =file(“data.txt”).cursor@t()
2
3 =channel(A1).select(!(${A2})).fetch()
4 =A1.select(${A2})
5 >file(“result.btx”).export@b(A4)
6 =A3.result()

When the filter condition is filled in A2, the channel in A3 will filter out and fetch the records that do not meet the conditions, and the cursor in A4 will filter out the records that meet the conditions, then in A5, the cursor will be traversed to write the records that meet the conditions to a new file, at last, A6 can take out the channel results. In this case, we assume there are few records that do not meet the conditions, and the memory is capable of storing such records.

However, this method still needs to calculate this condition twice (records that meet and do not meet the condition to be calculated separately). SPL provides a method directly in its select() function, which can take out the records that do not meet the condition concurrently, but such records can only be written to another file in the format of bin file.

A
1 =file(“data.txt”).cursor@t()
2
3 =A1.select(${A2};file(“error.btx”))
4 >file(“result.btx”).export@b(A3)

Similarly, it is also possible to split the big data table into multiple groups. For example, the order records will be spitted into multiple files by region for distribution purpose, in this case, the channel can also be used. However, the number of channels needs be predetermined in the code. We should know how many channels are needed in advance, and it cannot temporarily generate a new channel in the process of grouping.

To solve this problem, SPL attaches a function that can split and write-to-files when performing serial number group on the cursor. For example, the following code divides the orders into 12 months (take months as example due to its characteristics of easy serialization, and in other cases, you can do serialization yourself).

A
1 =file(“orders.txt”).cursor@t()
2 =12.(file(“order”/~/“.btx”))
3 =A1.groupn(month(dt);A2)
4 =A1.skip()

The groupn()function in A3 is a delayed cursor, which merely records the action, and will be actually calculated during cursor traversal. We should prepare a corresponding number of file objects (A2) in advance. Similar to select(), groupn() can only write data to bin files.