SPL Programming - 8.4 [Data table] Loop functions

 

Since record sequences (a table sequence is also a record sequence) can be regarded as sequences, we should also be able to use loop functions for these objects. We have used new()and derive(). Let’s try the loop functions we learned before and continue to use the table sequence of 100 records created with new().

Calculate the average height, maximum weight and minimum BMI of these people.

A B
1 =100.new(string(~):name,if(rand()<0.5,“Male”,“Female”):sex,50+rand(50):weight,1.5+rand(40)/100:height)
2 =A1.(height).avg() =A1.avg(height)
3 =A1.(weight).max() =A1.max(weight)
4 =A1.min(weight/height/height)

Taking a record sequence as a sequence, the A.(x) operation can be performed normally. It is often used to get the sequence composed of values of a field. Structured data has multiple fields, so it is easy to form some expressions with business significance. Therefore, aggregation functions based on record sequence will often be calculated directly with an expression (B2, B3 and A4 here).

The selection function for structured data will also be more business meaningful than for conventional data.

Find the person with the largest weight and the person with the smallest BMI:

A B
1
2 =A1.maxp(weight) =A1.maxp@a(weight)
3 =A1.minp(weight/height/height)

A2 only finds the first and return, and B2 uses @a to find all the records that maximize the weight.

This is often what we are more concerned about, that is, the record corresponding to the maximum and minimum value, rather than the maximum and minimum value itself.

The records selected by conditions can also perform various set operations, as well as further aggregation and selection operations:

A B
1
2 =A1.select(height>=1.7) =A1.select(sex==“Female”)
3 =A2^B2 =A1.select(height>=1.7 && sex==“Female”)
4 =A2\B2 =A1.select(height>=1.7 && sex!=“Female”)
5 =A2.maxp@a(weight) =B2.minp(weight/height/height)
6 =A3.avg(height) =A4.max(weight)

A2 selects the persons whose height is not less than 1.7 and B2 selects the females. A3 calculates the intersection of the two, B3 uses logical operation to express the same calculation result as A3. It is the similar case with A4 and B4. A5 and B5 do further selection to the selected record sequence, and A6 and B6 continue to do aggregation to the selected record sequence.

Sorting is also a common operation:

A B
1
2 =A1.sort(height)
3 =A1.select(sex==“Female”) =A3.sort(-weight)
4 =A1.sort(height,-weight)
5 =A1.top(-3,weight) =A1.top(-3;weight)
6 =A1.ranks(height) =A1.derive(A6(#):hrank)

We know that the default sorting direction of the sort()function is from small to large, and we can use @z to reverse the order. However, when it comes to structured data, there are often multiple fields to sort, and the sorting directions of these parameters are different. In this case, there is no way to use only one @z to control. The method given by SPL is to write the parameters as opposite numbers (plus a negative sign), so that if we continue to sort from small to large, we will realize the reverse order of the original values. The method of writing negative signs into opposite numbers is also applicable to string and date time data.

In this way, A2 sorts by height from low to high, and B3 sorts females by weight from large to small. A4 will sort by height from small to large first, and those with the same height will be sorted by weight from large to small.

Using negative numbers in the top()function is another way to represent the reverse order, which is equivalent to getting the last few after sorting (if positive numbers, the first few). A5 and B5 will calculate the three largest weights and the three individuals with largest weights respectively. The calculation result of A5 is a sequence composed of three numbers, while B5 returns a record sequence composed of three records. SPL also supports ranking function. A6’s ranks function can calculate everyone’s ranking by height.

We feel that using Male and Female to represent gender is too long. Just use one letter M and F. In the future, it will be shorter when writing comparative expressions. It can be done with the run() function.

A
1
2 >A1.run(sex=left(sex,1))

When using the loop function to assign a value to a field, we can also directly use the field name to represent the field of the current record instead of writing as ~.sex.

derive()can be used to append fields to generate a new table sequence. Sometimes we need to append multiple fields, and the field to be appended later needs to be calculated from the field to be appended first. For example, we need to add a BMI field, and then add a flag field for obesity according to the value of BMI. With derive(), it will be written as follows:

| | A |
| 1 | … |
| 2 | =A1.derive(weight/height/height:bmi) |
| 3 | =A2.derive(if(bmi>25,“FAT”,“OK”):flag) |

However, the calculation of derive()is very complex. It needs to create new records and table sequence and copy the original data. The performance is poor. In principle, it should be used as little as possible. A better way is to use derive() and run() to complete the task:

A
1
2 =A1.derive(weight/height/height:bmi,:flag)
3 =A2.run(flag =if(bmi>25,“FAT”,“OK”))

Append both fields at one time in A2, and then use run()in A3 to calculate the value of flag field. The derive() function is executed only once, and the actions of creating and copying table sequence are reduced, and the operation performance can be improved a lot.

From these examples, we can realize once again that multiple fields of structured data are easy to form many calculation expressions with business significance. There are many operations on field expressions in loop functions, but they are relatively uncommon in single value sequence calculation.

Combined with the functions of reading and writing Excel files, we can now use program code to merge, filter, add calculation results, sort and perform other operations on a batch of Excel files.

Similar to the loop function of a sequence, the record sequence will also have multi-layer nesting.

For example, we want to calculate the minimum height difference between males and females in this group:

A B
1
2 =A1.select(sex==“Male”) =A1.select(sex==“Female”)
3 =A2.min(B2.min(abs(height-A2.height)))

When the inner layer needs to reference the record field of the outer loop, the corresponding variable name of the outer loop function should also be written to represent the current record, still it is not necessary to write ~.

To find out which pairs of male and female make the minimum height difference true, it is more troublesome.

A B
1
2 =A1.select(sex==“Male”) =A1.select(sex==“Female”)
3 =A2.conj(B2.([A2.~,~])) =A3.minp@a(abs(~(1).height-~(2).height))

Here, we need to use ~ to keep the record, form a male and female pair, and then find it with the selection function. A3 in B3 is no longer a record sequence, and it is meaningless to reference field names if ~ is omitted.

The members of a record sequence are records with multiple fields, and the information content is relatively rich. The result obtained by the selection function is also a record sequence composed of these records. Therefore, the information contained in the position returned by positioning functions such as pselect, pmax and pmin is seldom needed, but the position still needs to be obtained when it comes to order related calculations.

Now let’s discuss the loop functions related to order and regenerate a date related table sequence.

A
1 =100.new(date(now())-100+~:dt,rand()*100:price)
2 =A1.select(day@w(dt)>1 && day@w(dt)<7)

Randomly generate a price table of a stock from 100 days ago to today. dt field is the date and price field is the price. Because there is no transaction on weekends, filter out the weekend dates after generating the data, and use this A2 in the future. The day@w() function returns the weekday of a date, but note that its return value is 1 on Sunday and 7 on Saturday. For historical reasons, the computer system uses the habit of Westerners. Sunday is the first day of the week.

First, we want to calculate the daily increase and moving average price:

A
3 =A2.derive(price-price[-1]:gain)
4 =A2.derive(price[-1:1].avg():mavg)

In the loop functions of a record sequence, we can add [±i] after the field to refer to the field of an adjacent record, and add [a:b] to refer to the sequence composed of the field values of adjacent records. These are the same as the loop functions of the sequence.

Calculate the maximum consecutive rising days of this stock:

| | A |
| … | … |
| 3 | =0 |
| 4 | =A2.(if(price>price[-1],A3+=1,A3=0)).max() |

According to natural thinking, first fill in 0, add 1 if it rises one day, and clear to 0 if it does not rise. We can calculate the days of continuous rise to one day, and then get the maximum value.

Calculate the increase on the day when the stock price is the highest:

A
3 =A2.pmax(price)
4 =A2(A3).price-A2.m(A3-1).price

Here, use the positioning function pmax() to calculate the sequence number where the maximum value is located, and then use this sequence number to calculate the increase. If we need to consider that the highest price may appear on multiple dates, use the @a option.

A
3 =A2.pmax@a(price)
4 =A3.new(A2(~).dt,A2(~).price-A2.m(~-1).price:gain)

First calculate the sequence composed of the record sequence numbers where the maximum value is located, and then generate a two field sequence with new()based on this sequence, taking the date and gain as the fields. In this case, new() can also omit the field name. You can see what the field names of the generated table sequence will be.

Similarly, calculate the average increase on the days when the stock price exceeds 90 yuan:

A
3 =A2.pselect@a(price>90)
4 =A3.new(A2(~).dt,A2(~).price-A2.m(~-1).price:gain)

For this cross-row calculation for a certain position, SPL provides a positioning calculation function. The previous code can also be written as follows:

A
3 =A2.pmax(price)
4 =A2.calc(A3,price-price[-1])

The positioning calculation function calc()allows the syntax of ~, #, [] and so on in the loop function to be used in the non loop function.

A
3 =A2.pmax@a(price)
4 =A2.calc(A3,price-price[-1])
5 =A3.new(A2(~).dt,A4(#):gain)

The calc() function can also be used for sequences, but there are few business-meaningful scenarios when structured data is not involved, so we give some examples here.


SPL Programming - Preface
SPL Programming - 8.3 [Data table] Generation of table sequence
SPL Programming - 8.5 [Data table] Calculations on the fields