SPL Programming - 9.2 [Grouping] Enumeration and alignment

 

Members with equal grouping keys are divided into a group, which is called equivalence grouping. Equivalence grouping meets the following two characteristics:

1) All members of the original set are in and only in a unique group;

2) No group is empty;

Grouping satisfying these characteristics is also called complete partition in mathematics.

Then is there an incomplete partition?

Yes, the two requirements of complete partition may not be satisfied. For example, there may be an empty subset in the result, or some members of the original set may not be in any group or in multiple groups.

In this section, we will talk about this kind of grouping.

We have made statistics on grouping personnel table by gender. Generally speaking, we expect the returned table sequence to have two records, that is, one for male and one for female, and even ensure the order, so that we can know exactly which record to use to obtain the summary result of male or female.

However, equivalence grouping can not guarantee this result.

Because there are not necessarily two genders in this personnel table. If there is only one gender, there is only one record in the returned table sequence. In this case, we must compare again to know whether this record is male or female.

This is very inconvenient. We want to make sure that there are two records returned, and the order is also determined. If there is only one gender in the data to be grouped, let the summary value of the other gender be 0.

This kind of grouping is called alignment grouping. It is not completely divided because empty set may appear.

A
1 =100.new(string(~):name,if(rand()<0.5,“Male”,“Female”):sex,50+rand(50):weight,1.5+rand(40)/100:height)
2 [Male,Female]
3 =A1.align@a(A2,sex)
4 =A3.new(A2(#):sex,~.avg(height):avg_height,~.max(weight):max_height)

Before alignment grouping, there need to be a criterion sequence, such as A2 here. The align@a function in A3 divides the set to be grouped (A1) into several subsets, and the grouping key of each subset corresponds to the criterion sequence members one by one, that is, the records(members of the sequence to be grouped) with the same grouping key value as the criterion sequence member are divided into the same group, so that the number and order of the obtained grouping subsets are determined by the criterion sequence. If a criterion sequence member does not correspond to the grouping key value of any sequence member to be grouped, it will correspond to an empty subset.

With the grouped subset sequence, it is very easy to calculate the table sequence composed of grouping key values and summary values. Note that for the new()function in A4, A2(#) should be used to obtain the grouping key value, because some grouped subset may be empty and this information cannot be obtained from the grouped subsets.

SPL only provides the method of generating grouped subsets, and does not further provide simplified functions to calculate the summary values at one time. On the one hand, it is not difficult to write. On the other hand, alignment grouping is less commonly used than equivalence grouping.

Let’s take another example: in the previous order data, we want to calculate the sales amount by month in the last two years, regardless of whether there are no orders in a certain month or not.

A
1 =T(“data.xlsx”)
2 =to(202001,202012)|to(202101,202112)
3 =A1.align@a(A2,month@y(OrderDate))
4 =A3.new(A2(#):ym,~.sum(Amount),~.count(Amount))

In this grouping, not only empty sets may appear, but also some records of the original set may not be divided into any subset (those orders that are not in the past two years).

Careful readers may find that the align() function has an option @a, so what is the operation when there is no option?

It is used for sorting.

Similar to alignment grouping, we sometimes need to arrange a batch of data in the order we want, rather than the conventional alphabetic or coding order.

In this order table, we want to calculate sales amount by region, but the results should be arranged in the order of East, West, North, South, Center. Of course, we can use align@a that we just learned.

A
1 =T(“data.xlsx”)
2 [East,West,North,South,Center]
3 =A1.align@a(A2,Area)
4 =A3.new(A2(#):Area,~.sum(Amount),~.count(Amount))

If we know for certain that there is at most one item in each group, we can write it in another way:

A
1 =T(“data.xlsx”)
2 [East,West,North,South,Center]
3 =A1.groups(Area;sum(Amount),count(Amount))
4 =A3.align(A2,Area)

In the table sequence obtained after group aggregation in A3, there is at most one record (maybe no record) in each region. The align() function without option can rearrange these records in A2 order.

Aligned sorting is more common than alignment grouping. People have many conventional fixed sorting orders. For example, China’s provincial rankings usually rank Beijing first, rather than Anhui first according to the unicode order. Americans do not have such a fixed notion of sorting regions, but they also have conventions such as Sunday, Monday, Tuesday,… and Low, Medium, High that cannot be sorted directly in alphabetic order.

There is also a very common alignment grouping. The grouping key values are positive integers starting from 1, and the target order is to arrange the numbers from small to large. For example, the months are numbers 1-12, and weekdays are also integers of 1-7. This situation is also called serial number grouping. SPL provides an option @n to support serial number grouping in the group()and groups() functions.

A
1 =T(“data.xlsx”)
2 =A1.groups@n(month(OrderDate):M;sum(Amount),count(Amount))

When used, it is very similar to groups() without @n, but it is an alignment grouping, allowing the generation of emplty sets. If the grouping key value is less than 1, it will also be discarded. Try the stock data we generated earlier:

A
1 =100.new(now()-100+~:dt,rand()*100:price)
2 =A1.select(day@w(dt)>1 && day@w(dt)<7)
3 =A2.groups@n(day@w(dt):wday;min(price):min_p,max(price):max_p)

The table sequence returned by A3 will have 6 records, of which the first one corresponds to the grouped subset of Sunday is an empty set (its grouping key value should be 1), because the data of Sunday has been filtered out in A2, and the largest grouping key value in the remaining data is 6 (Saturday data has been filtered out, and 7 cannot be calculated out), so the result table sequence has 6 records. The first record corresponds to an empty grouped subset, and both aggregation fields are null, but the grouping key value field still has a value.

The grouping key value here does not decrease by 1, and will be 2,…, 6, corresponding to Monday,…, Friday, which is different from the previous calculation result.

We can also change this to subtract 1 from the grouping key.

A
1 =100.new(now()-100+~:dt,rand()*100:price)
2 =A1.select(day@w(dt)<7)
3 =A2.groups@n(day@w(dt)-1:wday;min(price):min_p,max(price):max_p)

Only Saturday data is filtered out in A2, and the grouping key value is subtracted by 1 when grouping in A3, thus the grouping key value of Sunday data will be calculated as 0 and discarded. A3 will calculate 5 records (without Saturday data, the maximum is 5), and this is the same result as before.

It can be seen from these two examples that serial number grouping is not an ordinary grouping with serial number as the grouping key value. Its behavior is more like alignment grouping. Empty sets may appear and members without corresponding grouping subset will be discarded.

The effect of using @n for group()is similar to that of groups(), and we won’t elaborate here.

Another advantage of serial number grouping is fast calculation speed, because the correct grouping subset can be found directly with serial number without comparison operation.

In addition to regular alignment grouping, conditional grouping will also occur, and the most common is grouping by segment. For example, people are divided into several groups according to age, or orders are divided into different groups according to amount.

For example, we divide the previous personnel table into four groups according to BMI value: <20, UnderWeight; 20-25, Normal; 25-30: OverWeight; >30, Obesity. Then calculate the average height of each group.

A
1 =100.new(string(~):name,if(rand()<0.5,“Male”,“Female”):sex,50+rand(50):weight,1.5+rand(40)/100:height)
2 [?<20,?>=20 && ?<25,?>=25 && ?<30,?>=30]
3 [UnderWeight,Normal,OverWeight,Fat]
4 =A1.enum(A2,weight/height/height)
5 =A4.new(A3(#):Grade,~.avg(height):avg_height)

The conditions of different levels are written into strings to form a sequence, in which ? represents the grouping key value to be substituted into the calculation. The enum() function will use the grouping key of each member in the set to be grouped to calculate these conditions in turn. If it gets true, the member will be divided into the corresponding group. The final grouping subsets will also correspond to these criterion conditions in quantity and order.

This grouping is called enumeration grouping. Obviously, enumeration grouping may also get empty sets, and some members may not be divided into any groups.

Similarly, SPL only provides the enum()function to return the grouped subsets. To get the summary value, we need to further calculate it ourselves. Unlike alignment grouping, enumeration grouping usually has another sequence to identify the name of each group (A3 here).

Because segment grouping is very common, this case can also be converted into serial number grouping with the pseg() function described earlier, which is simpler to write and faster to execute.

A
1
2 [20,25,30]
3 [UnderWeight,Normal,OverWeight,Fat]
4 =A1.group@n(A2.pseg(weight/height/height)+1)
5 =A4.new(A3(#):Grade,~.avg(height):avg_height)

group@n()is used here to calculate the grouped subsets first. Because the pseg() function will return 0 for the case that is less than the first member, it needs to add 1 as the grouping key value, otherwise this segment will be discarded.

It is also OK to calculate the summary value directly with groups(), but we need another step to fill in the segment ID correctly, otherwise we will see the grouping key value itself, that is, 1, 2, 3,…

A
1
2 [20,25,30]
3 [UnderWeight,Normal,OverWeight,Fat]
4 =A1.groups@n(A2.pseg(weight/height/height)+1:Grade;avg(height):avg_height)
5 =A4.run(Grade=A3(Grade))

In fact, alignment grouping can be regarded as a special enumeration grouping. All alignment grouping can be written by enumeration grouping.

A
1
2 [?==“Male”,?==“Female”]
3 [Male,Female]
4 =A1.enum(A2,sex)
5 =A4.new(A3(#):sex,~.avg(height):avg_height,~.max(weight):max_height)

The enum() function also supports repeated grouping (that is, a member may be divided into multiple grouping subsets), but we won’t talk about it because it is less common. You can use help when necessary.

SQL has only equivalence grouping, and it is very troublesome when encountering the calculation requirements of the enumeration grouping (including alignment grouping).


SPL Programming - Preface
SPL Programming - 9.1 [Grouping] Grouping and aggregation
SPL Programming - 9.3 [Grouping] Order-related grouping