SPL Programming - 11.3 [Big data] Ordered cursor

 

Let’s look at the group() function on the cursor.

As mentioned earlier, for a cursor, we can’t keep all the grouped subsets after grouping in memory to continue calculation, but putting them in external storage will be very troublesome and seriously affect the performance, and the losses often outweigh the gains.

However, there is a case where a grouping subset can be placed in memory for processing. If the data in the cursor is in order to the grouping key values (always from small to large or from large to small), and each grouping subset is not large, we can always keep a grouping subset in memory. After completing the relevant summary operation for it, the current grouping subset can be discarded, and then read in the next grouping subset, and so on. In this way, we can complete some operations that must rely on grouping subsets without large memory.

This is the group() function on cursor, which requires that the cursor is ordered to grouping key values.

Moreover, this function returns a deferred cursor!

A
1 =file(“data.txt”).cursor@t()
2 =A1.group(dt)
3 =A2.fetch(2)

When generating data before, we ensured that the data in the cursor is orderly to the dt field. Now look at the result of A3:

imagepng

The fetch()result is not a tale sequence, but a sequence with two members (because of fetch(2)). Double click one of them to see:

imagepng

This looks like a table sequence. It’s one group, a record sequence composed of records taken from cursor data. In fact, it is a grouping subset.

Different from the sequence, the group()and groups() of the cursor do not correspond. The group()function requires that the cursor data is ordered to the grouping key values, and returns a deferred cursor, it does not calculate immediately; groups() will immediately traverse the cursor for calculation.

A cursor with ordered data for a key value is called an ordered cursor.

Using this grouping subset, we will calculate the order amount and number of orders in the region with the largest order amount every day. This requires to do another group aggregation to the grouping subset and get the record where the maximum value is located. Using the code we’ve learned before, it’s easy:

A
1 =file(“data.txt”).cursor@t()
2 =A1.group(dt)
3 =A2.(~.groups(area; sum(amount):S,count(amount):C).maxp@a(S))
4 =A3.conj().fetch()

It can be seen that functions such as A.()can also be used for cursor, and it is also a deferred cursor. Moreover, in the next round of calculation based on deferred cursor, ~ can be used to represent the current member like in a loop function, but it can’t use # or []. Because the relative position information of the cursor is complex, SPL has not implemented it yet, the subsequent data has not been read out, and forward adjacent reference can’t be implemented.

maxp@a()will return a sequence (actually a record sequence here), so A3 will actually be a cursor equivalent to a two-layer sequence. It takes another conj()to become a single-layer sequence. conj() also returns a deferred cursor. It also needs a fetch()to get the result. A fetch() without parameter will fetch all the data. We know that the data has only 1000 days, the final result set will not be very large and can be fetched completely.

Moreover, from the group()itself and here, we can also see that cursor does not always read out a table sequence. It may fetch() out a two-layer sequence. In fact, the cursor corresponds to a sequence, not always a table sequence. The result of the fetch() is also a sub sequence, and its member may be any data that can be member of a sequence.

group()also has the @i option, and same as the group@i() of a record sequence, it is interpreted as to divide into a new group when the condition is true, and it also returns a deferred cursor. In addition, the group(…;…) function of the cursor will also be interpreted as group(…).new(…) like an in-memory record sequence, and will also return a deferred cursor.

groups also has the @o and @i options, which can perform ordered grouping to improve running performance, but groups() always calculate immediately, and will traverse immediately even for ordered cursor.

A
1 =file(“data.txt”).cursor@t(dt,amount)
2 =A1.groups@o(month@y(dt);sum(amount):S,count(amount):C)

The result of this code is the same as that without @o, but with @o, the performance is better when using order.

Similar to the previous discussion on ordered grouping, ordered cursor can also be used to assist in log parsing. Usually, the log is really large and cannot be read into memory, so it is more necessary to use ordered cursor. SPL also extends the for statement to directly support fetching one grouping subset from an ordered cursor at a time.

Similar to the case of small data, we also discuss it in three cases:

1) Each N line corresponds to an event:

A B
1 =file(“S.log”).cursor@si()
2 =create(…r
3 for A1,3
>A2.insert(…)

Note that when reading text with a cursor, add the @si option to the cursor() function, so that it will read out a sequence composed of strings. If there is no option, it will try to interpret it as a table sequence, but the log data is often messy, and it is likely to make errors when directly parsing it as a table sequence.

2) The number of lines is not fixed. There will be a fixed starting string before each event:

3 for A1.group@i(~==“—start—”),1

To add parameter 1, it means that one grouping subset is read out at a time, or directly use the extended for statement:

3 for A1;~==“—start—”:0

Note that it is a semicolon between the cursor and the following condition.

3)The number of lines is not fixed. Each line of the same event has the same prefix (such as the user ID of the log):

3 for A1.group(left(~,6)),1

Read in one grouping subset at a time, or use the extended for statement:

3 for A1;left(~,6))

You can refer to the previous code that uses ordered grouping to process small data.