SPL Programming - 9.3 [Grouping] Order-related grouping

 

As we said, record sequences and sequences are ordered, and the positioning information of members often participates in the calculation. Grouping is no exception, and grouping keys may also be related to sequence numbers of members.

For example, we want to divide the personnel table into groups of every three records:

A
1 =100.new(string(~):name,if(rand()<0.5,“Male”,“Female”):sex,50+rand(50):weight,1.5+rand(40)/100:height)
2 =A1.group((#-1)\3)

The grouping in A2 does not use any information of records in A1, but only the sequence number, which is also a reasonable and even common grouping. After all, the essence of grouping is to split a large set into small sets. As long as there is a clear splitting method, it is reasonable.

To divide the personnel table into three groups, we can use the step() function mentioned before or directly use group.

A
1
2 =A1.group(#%3)

By the way, the grouping key values of these two examples are integers. We can also modify the serial number grouping mentioned earlier (need to ensure that the grouping key value is a natural number starting from 1):

A
1
2 =A1.group@n((#+2)\3)
2 =A1.group@n(#%3+1)

In addition to directly using the sequence number as the grouping key value, the grouping may also be related to adjacent members.

For example, given a string composed of letters and numbers, it should be split into consecutive letters and numbers, such as abc1234wxyz56mn098pqrst, and into small strings abc, 1234, wxyz, 56, mn, 098, pqrst.

A
1 abc1234wxyz56mn098pqrst
2 =A1.split().group@o(isdigit(~)).(~.concat())

The group@o() function will scan the whole sequence in turn. When the grouping key value is the same as that of the previous member, the member will be added to the current grouping subset. If the grouping key value changes, a new grouping subset will be generated and the current member will be added to it. After scanning, a batch of grouping subsets will be obtained to complete the grouping operation.

isdigit(x) returns true when x is a numeric character, otherwise false. We have learned A1.split()and it will split this string into a sequence of single characters. Thus, the grouping key value expression isdigit(~) will be calculated successively for this single character sequence as

false fasle false true true true true false …

That is, it is false when encountering a letter and true when encountering a number. Changing from number to letter or from letter to number will cause isdigit(~) to change, resulting in a new grouping subset, so the return value of group@o() will be

[[a,b,c],[1,2,3,4],[w,x,y,z],[5,6],….]

That is, divide the adjacent letters or numbers into a group, and then use concat() to concatenate each group into a string.

This grouping is difficult to describe with the grouping operation we learned earlier.

This is another example of doing grouping directly to a sequence.

If we know that the grouping key value itself is orderly, the ordinary equivalence grouping can also be done with @o. For example, if the order data is orderly by date, and we want to group and summarize it by month, we can write this:

A
1 =T(“data.xlsx”)
3 =A2.groups@o(month@y(OrderDate);sum(Amount),count(Amount))

@o also works for groups(). There is no difference in the result whether we use @o or not here, but if we use @o when the data is orderly, groups() only need to compare the grouping key value with adjacent member, and the calculation speed is much faster.

@o has another variant @i, with which the grouping key is a logical expression, and a new grouping subset is generated whenever true is calculated out.

Let’s review the previous problem of calculating the maximum consecutive days of stock rise. Another way of thinking is to sort the transaction data by date (often already in order), and then scan these data in turn for grouping. If the price of one day is higher than that of the previous day, it will continue to be divided into the same group as that of the previous day; If there is no rise, a new group is generated. After scanning, we will get some subsets. In each subset, the stock price rises continuously. Then we just need to see which subset has the most members.

A
1 =100.new(date(now())-100+~:dt,rand()*100:price)
2 =A1.select(day@w(dt)>1 && day@w(dt)<7)
3 =A2.group@i(price<=price[-1]).max(~.len())

By the way, the easiest way for SQL to complete this task is also to use this idea, but the code is like this (only corresponding to code in A3):

SELECT max(consecutive_day)
FROM (SELECT count(*) consecutive_day FROM
 (SELECT sum(rise\_or\_fall) OVER(ORDER BY dt)
 day\_no\_gain FROM
 (SELECT dt, CASE
 when price>lag(price)
 OVER(ORDER BY dt)
 then 0 else 1 end rise\_or\_fall
 FROM T ) )
GROUP BY day\_no\_gain)

You can feel the difference.

The grouping operation that needs to use order and position information is called order-related grouping. Order-related grouping is very useful in parsing text.

Some text files are not in very neat TXT or CSV format, but have regular text strings, in which there will be structured data, which needs to be parsed by programming if it needs to be reused, such as:

imagepng

Generally speaking, a batch of events will be described in the log, and each event will occupy several lines of text. Structure analysis is to extract the field values of each event.

Using order-related grouping, these log texts can be easily divided into small segments with events as units. The log is usually divided into events in the following ways:

1)Each N line corresponds to an event, which can be split by group((#-1\N)) :

A B
1 =file(“S.log”).read@n()
2 =create(…)
3 for A1.group((#-1)\3)
>A2.insert(…2.

2) The number of lines is not fixed. There will be a fixed starting string before each event, which can be split with group@i :

3 for A1.group@i(~==“—start—”)

3)The number of lines is not fixed. Each line of the same event has the same prefix (such as the user ID of the log), which can be used split with group@o :

3 for A1.group@o(left(~,6))

After splitting, the code can concentrate on parsing the field values from the string of each segment. Of course, this is usually not a simple task.