SPL Programming - 8.3 [Data table] Generation of table sequence

 

In order to further learn the operations of record sequence, we need a table sequence with a little more records, such as the table sequence with four fields we used earlier: name, sex, weight, height. It is much more convenient to generate hundreds of records and do experiments again.

Obviously, it is too troublesome to manually fill in the field values of many records. We use a program to generate them. The specific values of the fields do not affect learning. They can be generated randomly.

We can already do this with the code technology we have learned now.

A B C
1 =create(name,sex,weight,height)
2 for 100 =string(A2) =if(rand()<0.5,“Male”,“Female”)
3 =50+rand(50) =1.5+rand(40)/100
4 >A1.insert(0,B2,C2,B3,C3)
5 =A1.len()

The meaningful name is troublesome, so we use the number string instead. C2, B3, C3 are easy to understand, and then in B4, insert the generated fields into the empty table sequence created in A1. After the code is executed, we can get a table sequence with 100 records.

There’s nothing wrong with this code, but it’s still a little troublesome. We’ve learned loop functions, and it’s easy to think that there should be a loop function to solve this task. However, the loop function 100.(x) used to generate a new sequence can only return a sequence, which is not what we need.

SPL provides a loop function to return a new table sequence. The above code can be simplified into one statement:

A
1 =100.new(string(~):name,if(rand()<0.5,“Male”,“Female”):sex,50+rand(50):weight,1.5+rand(40)/100:height)

It’s a little long, so it’s written in two lines. The statements related to structured data are often very long. This is also the advantage of using a cell to write code. It won’t let too long code in a cell to write out of the cell and affect the code reading on the right and below.

Similar to A.(x) returning a sequence with the same length as A, A.new(…) will return a table sequence with the same number of records as A.len(). It also generates a record for each member of A, and these records will form a table sequence and return.

The new() function is a little different in the way the parameters are written.

It is obvious that each field is separated by a comma. The calculation formula of the field value is before the colon and the field name is after the colon. Is this colon a special provision of the new() function?

Yes, and no.

After we are dealing with structured data, we often face more complex function parameters. For example, to execute the new()function correctly, we need to know the value calculation formula and field name of each field in the table sequence to be generated, and the field value calculation formula and field name are a pair and cannot be misplaced. With the current new() function, the field value formula and field name are always a group of two. If we still use the comma separated writing method, we can still use the position of the parameter to determine the corresponding relationship, but it already looks dizzy. If the number of parameters in each group is uncertain, there is no way to correspond at all. In fact, the new() function does allow that field names are omitted. In many cases, it can automatically identify a reasonable field name. If the field names are to be written every time, it will be very troublesome.

In this case, one way is to use the set as a parameter. For example, this new function can be written as:

=100.new({string(~),name},{if(rand()<0.5,"Male","Female"),sex},{50+rand(50),weight},{1.5+rand(40)/100,height})

The parameters here are divided into four groups. Each field is a group, and each group has two members, namely, the calculation formula and the field name. In this way, the corresponding relationship between parameters can be described clearly. If the field name is not written, the result is that a group has only one member and will not be misaligned with the parameters of other groups.

This is the concept of multi-level parameters. The parameters written in the function should be divided into multi-level groups to ensure a clear description of the corresponding relationship between these parameters.

Theoretically, there may be more levels, so {} may be nested, which still looks a little messy.

SPL does not adopt the writing method of nested {}, but the convention only supports three-level parameters, separated by semicolons, commas and colons respectively. Semicolons are the first level. The semicolon-separated parameter is a group. If there are lower-level parameters in this group, they are separated by commas, and the next-level parameters are separated by colons. There are three levels in total.

Practice shows that three levels are basically enough, and there is rarely any parameter relationship that cannot be described clearly with this method. While using commas only is often ambiguous.

In fact, we have seen semicolons when we talked about the top() function, but we haven’t analyzed them carefully.

Multi-level parameter syntax is also the invention of SPL. It is not designed only for structured data calculation. It will also be involved in conventional numerical operations, but it often appears due to the complexity of structured data processing.

If you have learned SQL, you can compare it. Each part of SQL separated by keywords can also be understood as multi-level parameters. Only that pretending to be English will have better readability, but its universality is much worse. You should select special keywords for each statement. The complexity of structured data will force multi-level parameters to appear (in different forms).

After understanding the multi-level parameters, the new()function above is easy to understand. The colon is indeed required for the new() function parameters, because it has two levels of parameters, but it is not deliberately specified for the new() function, but a general rule in SPL.

The insert() function of the table sequence also has multi-level parameters. Field names can be specified during insertion. Fields not listed in the parameters will be filled with null after insertion.

For example, we need to add 100 records just created with the new function to the previous table sequence with 4 manual records, but only fill in weight and height, and let the others be null.

A B C D
5 =create(name,sex,weight,height) =A5.record([A1:D4])
6 for 100 >A5.insert(0,50+rand(50):weight,1.5+rand(40)/100:height)

In practice, structured data often has many fields, and it will be more troublesome if the default writing method is not allowed.

Similar to the sequence, the insert() of the table sequence can also insert multiple records in batch at one time:

A B C D
5 =create(name,sex,weight,height) =A5.record([A1:D4])
6 >A5.insert(0:100,50+rand(50):weight,1.5+rand(40)/100:height)

Multi-level parameters can easily express various intentions in one pattern.

Now we have a table sequence with 100 records, and each record corresponds to one person. We want to get a new table sequence with one more field to store the BMI of each person.

With the new() function, it is easy:

A
1 =100.new(string(~):name,if(rand()<0.5,“Male”,“Female”):sex,50+rand(50):weight,1.5+rand(40)/100:height)
2 =A1.new(~.name:name,~.sex:sex,~.weight:weight,~.height:height,~.weight/~.height/~.height:bmi)

As a loop function, ~ in A1.new() is defined as the current member when the sequence loops, that is, a record of table sequence A1. Then we can use the syntax of ~.F to refer to the field of record ~.

However, A2 written in this way is a little long.

SPL stipulates that in the loop function of a table sequence or a record sequence, the field name can be directly used to refer to the field of the current member without writing ~., that is, name is ~.name, sex is ~.sex, … Moreover, as we said earlier, the new() function can sometimes omit the field name, and it can be omitted when the current formula and the field name of the target table sequence are the same. In this way, A2 can be written as:

A
1
2 =A1.new(name,sex,weight,height,weight/height/height:bmi)

This makes it look much clearer and less error prone.

This kind of appending fields to the end is very common in structured data processing. SPL simply provides a function to process, and the code can be simplified to:

A
1
2 =A1.derive(weight/height/height:bmi)

The derive() function will append new fields based on the data structure of the original table sequence, and all the fields of the original table sequence will be copied.

It should be noted that the derive()function will create a new table sequence without changing the original table sequence. It will copy all the fields and values of the original table sequence, and the original table sequence is still there. In fact, the derive() function can be executed for a record sequence, not necessarily for a table sequence.

The new() function will of course create a new one, and it can even be executed for a sequence rather than a record sequence.

Now, we can restore the problem of merging Excel to the previous requirement: append the file name to the end.

A
1 =directory@p(“data/*.xlsx”)
2 =A1.(file(~).xlsimport@t().derive(filename@n(A1.~):File))
3 >file(“all.xlsx”).xlsexport@t(A2.conj())

The statement is a little long, and we wrote one more line. Actually all actions can be written in one statement.

For the table sequence read out by each file, use the derive()function to append a File field to the end, and the value is the current file name. It should be noted that derive() is a loop function, and the ~ written directly in its expression refers to the current record of the table sequence (record sequence). Here, when adding a file name, we should use the upper layer ~, that is, A1.~.


SPL Programming - Preface
SPL Programming - 8.2 [Data table] Table sequence and record sequence
SPL Programming - 8.4 [Data table] Loop functions