SPL Practice: Structured Text computing

 

Download Test Data

Structured Text Computing Requirements

Structured text files are a common data storage method, such as this score.txt file, which records the scores of students in all classes. The first row is the column name, and each subsequent row is the data of a student, separated by tab within each row.

..

This structured text also has various computational and processing requirements. For example, selecting the girls with the English score of 90 or above, to calculate the total score of each student and sort them in descending order, to calculate the average Chinese language score for each class.

Importing into a database and using SQL to calculate is certainly not a problem, but the process is complicated and there may not always be a database available. Writing Java code is also very tedious, but using SPL is very simple.

Read structured text file

First, let's see how to load the file. The import() function of A1 can directly read the structured text file into an in-memory data table. The @t option represents the first row is field name. For comma separated CSV files, the option @c can be used to read in, and @tc represents reading the CSV text with the first row being column names.

Read score.txt and score.csv respectively in A1/A2. A3/A4 can also read score.txt and score.csv using the T() function, which can automatically load based on the file extension name.

There are other situations when reading text data, such as not having a title row, only reading partial columns, custom separators, and text with quotation marks and carriage returns. For details on these functions, please refer to the official documentation.


A

1

=file("score.txt").import@t()

2

=file("score.csv").import@tc()

3

=T("score.txt")

4

=T("score.csv")

Try filtering

Let's first try the filtering operation, for example, to select girls who score 90 or above in English. A1 uses the T()function to load text data into an in-memory data table, and A2 uses the select() function to filter the data, it can be seen that only girls with scores equal to or above 90 remain in the result. The dot operator in A3 can combine loading data and filtering operation into one statement, with the same calculation result.

If this file is regarded as a table, the corresponding SQL statement for the operation is as follows:

SELECT * FROM score

  WHERE English>=90 AND Gender=2

Sometimes it is desired to dynamically pass in the values of the filtering criteria. The default values can be predefined as 80 points and male students, which can be modified during runtime. The select function in A4 can use these two parameter variables.

..

In A5, use the export()function to write the result table sequence of A2 back into a text file named 'filter. txt'. Similar to the import() function, the @t option indicates that the first row will be written as a title.


A

1

=T("score.txt")

2

=A1.select(English>=90 && Gender==2)

3

=T("score.txt").select(English>=90 && Gender==2)

4

=T("score.txt").select(English>=argEng && Gender==argGen)

5

=file("filter.txt").export@t(A2)

Then try aggregation

Then try some aggregation operations. Calculate the highest and average scores for both girls and boys separately, which is consistent with the SQL aggregation functions.


A

1

=T("score.txt")

2

=A1.select(English>=90 && Gender==2)

3

=T("score.txt").select(English>=90 && Gender==2)

4

=T("score.txt").select(English>=argEng && Gender==argGen)

5

=file("filter.txt").export@t(A2)

SQL

SELECT max(Math), avg(Math)

  FROM score WHERE English>=90 AND Gender=2

SELECT max(Math), avg(Math)

  FROM score WHERE English>=90 AND Gender=1

Integrate into Java applications

After the initial calculation attempts, let's take a look at how to integrate these codes into Java applications. SPL offers standard JDBC, which can provide data querying and computing capabilities to Java applications like executing SQL or calling stored procedures.

  [JDBCConection].prepareStatement(SPL statement)

[JDBCConection].prepareCall(SPL script)

Import the jar files from the installation directory \ esProc \ lib into the application:

  esproc-bin-xxxx.jar

icu4j-60.3.jar

Then copy raqsoftConfig. xml from the installation directory \ esProc \ config to the application's class path. raqsoftConfig. xml is the core configuration file of SPL, and its name cannot be changed.

JDBC executes SPL code

The driver class is InternalDriver, the connection URL is jdbc: esproc: local.

The created JDBC connection can execute SPL code like an SQL query. Load score.txt here to filter students with high scoring English. Use JDBC parameters to dynamically set the minimum score, with 90 set here. Run this Java program and, like other regular JDBC uses, return the result table sequence as a ResultSet object.

Class.forName("com.esproc.jdbc.InternalDriver");

Connection con= DriverManager.getConnection("jdbc:esproc:local://");

PreparedStatement ps = con.prepareStatement("=T(\"score.txt\").select(English >= ? )");

ps.setInt(1, 90);

ResultSet rs = ps.executeQuery();

JDBC calls SPL script

Save the SPL program as a .splx script file, here saved as scoreCalc.splx:


A

1

=T("score.txt").select(English>=90)

2

return A1

Then put the script file into the main directory of esProc, which is configured as the root directory of D drive in raqsoftConfig.xml.

..

Then you can call these scripts like using JDBC to call stored procedures:

PreparedStatement ps = con.prepareCall("call scoreCalc()");

ps.execute();

ResultSet rs = ps.getResultSet();

Executing call scoreCalc() calls and executes scoreCalc.splx. Return the result set to ResultSet.

JDBC executes SPL and returns single or multiple values

In addition to returning a table sequence of multiple rows and columns to the JDBC ResultSet object, it can also return a single value of one row and one column, or a sequence of multiple values of multiple rows and one column.


A

1

=T("score.txt").sum(Math)

2

return A1

A single value with 1 row and 1 column, with the column name fixed as "Field". For example, A2 returns the total math score of 70360.

..


A

1

=T("score.txt").sum(Math)

2

=T("score.txt").count(Math)

3

=T("score.txt").avg(Math)

4

=T("score.txt").min(Math)

5

return [A1, A2, A3, A4]

Multiple values with multi rows and 1 column, with the column name still being "Field", such as A5 returning the total score, number of people, average score, and lowest score in mathematics.

..

Using variable

Next, let's continue to look at more computing tasks and implement common SQL operations using SPL.

For the convenience of discussion, we assume that all A1 cells of the code are to load the file, and then assign the table sequence to the score variable. Afterwards, there is no need to write A1 anymore, only calculate on score.

In addition, for simplicity, SPL code no longer uses parameters, and you can modify it to have parameters by yourself.


A

1

>score=T("score.txt")

2

=score.select(English>=90 && Gender==2)

3

……

Calculate columns

Calculate a new field using multiple field values of the same row, for example, calculate the total score and average score of each student's three subjects.

SPL uses the new() function to define calculated columns, and the expression in the calculated column can use the existing calculated columns, such as calculating total first and then using total to calculate avg.


A

2

=score .new(

  Gender, Name

  ,(English+Chinese+Math):total

  ,round(total/3, 2):avg

)

In the new() function, some original columns can also be selected and mixed with new computed columns or adjust the order of the columns. Here, the Gender is adjusted before the Name.

..

SQL

SELECT

  Gender, Name

  ,(English+Chinese+Math) as total

  ,(English+Chinese+Math)/3 as avg

FROM score

Add calculated columns

If you want to add new calculated columns directly to the existing table sequence, you can use the derive() function. Add the total and average scores of the three subjects here.


A

2

=score .derive(

  (English+Chinese+Math):total

  ,round(total/3, 2):avg

)

SQL

SELECT

  *

  ,(English+Chinese+Math) as total

  ,(English+Chinese+Math)/3 as avg

FROM score

Sort

Sort all students in descending order of total score, and for students with equal total scores, sort them in ascending order of math scores. SPL uses the sort() function to implement sorting, with a negative sign before the sorting field 'total' indicating descending order and 'Math' indicating ascending order.


A

2

=score.derive(

  (English+Chinese+Math):total

).sort(-total, Math)

In the results, it is observed that there are multiple students with a total score of 281, with students with lower math scores ranking higher.

..

Fields of string and date types also support sorting. For example, to sort in reverse order by Name will compare the characters one by one from the first character to the next.

..

SQL

SELECT

  (English+Chinese+Math) as total

  ,*

FROM score

ORDER BY

  (English+Chinese+Math) DESC

  ,Math DESC

Group and aggregation

We want to group by class and gender, and calculate the number of people and average English scores. SPL uses the groups() function to implement grouping and aggregation. Define the grouping fields Class and Gender before the semicolon in parameter. When there are multiple grouping fields, separate them with commas; Define the aggregation expression after the semicolon, supporting aggregations such as sum, count, avg, min, max, etc.


A

2

=score.groups(

  Class , Gender

  ; count(Name):cnt

  , round(avg(English),2):engAvg

)

SQL

SELECT

  Class,Gender

  , COUNT (Name) AS cnt

  , ROUND(AVG (English),2) AS engAvg

FROM score

GROUP BY Class,Gender

Filter after grouping and aggregation

Before grouping, use the select() function to filter students who score 60 or above in mathematics;

After grouping and aggregation, use the select() function to filter groups with an average English score exceeding 70. SPL is defined and calculated step by step, and intermediate steps can be filtered at any time without using WHERE/HAVING to distinguish whether it occurred after the grouping.


A

2

=score.select(Math>60)

  .groups(

   Class,Gender

   ;count(Name):cnt, round(avg(English),2):engAvg

  ).select(engAvg > 70)

SQL

SELECT

  Class, Gender, COUNT (Name) AS cnt

  , ROUND(AVG (English),2) AS engAvg

FROM score

WHERE Math>60

GROUP BY Class,Gender

HAVING ROUND(AVG (English),2) > 70

Remove duplicate values

In SPL, using the id() function can remove duplicate values. After removing duplicate names for one thousand students, we see 598 names.


A

2

=score.id(Name)

SQL

SELECT DISTINCT Name FROM score

Multiple columns can also be combined to remove duplicates, which means that the values of multiple columns are considered duplicate only when they are all the same. For example, if a thousand students combine Name and Math to remove duplicates, there will be 992 left.


A

2

=score.id([Name,Math] )

SQL

SELECT DISTINCT Name,Math FROM score

Count distinct

In SPL, the icount () function is used to implement count distinct, resulting in 598 names after deduplication among 1000 students.


A

2

=score.icount(Name)

SQL

SELECT COUNT(DISTINCT Name) Name FROM score

icount also supports multi column count distinct, with 992 remaining after deduplication for names and math scores that are all the same.


A

2

=score.icount([Name,Math] )

SQL

SELECT

  COUNT(*)

FROM (

  SELECT DISTINCT Name, Math FROM score

)

Count distinct within groups after grouping

The icount() function can also be used within the groups function, allowing for count distinct within the groups. There are 54 names for 56 people in Class 2 and 53 names for 55 people in Class 3.

..


A

2

=score.groups(

  Class

  ; icount(Name):nameCnt

  , count(Name):cnt

)

SQL

SELECT

  Class

  , COUNT(distinct Name) AS nameCnt

  , COUNT(Name) AS cnt

FROM score

GROUP BY Class

TOP-N

It is also common to calculate the top few records, and the top() function gets the top N records.The first parameter is a negative number, indicating reverse order. -5 will get the top 5 records with the highest total. If it is a positive number of 5, it will get the bottom 5 records with the lowest total.

When getting the top N, duplicate values will occupy the ranking by default, such as the 4th and 5th both scoring 281.


A

2

=score.derive((English+Chinese+Math):total)

  .top(-5;total)

SQL

SELECT

  TOP 5

  *, (English+Chinese+Math) as total

FROM score

ORDER BY (English+Chinese+Math) DESC

If you want duplicate values not to occupy the ranking, you can use the @i option. Among the top 5, if the 4th has duplicate values and there is a total of 3 people, then a total of 7 people will be found in the top 5.

***.top@i(-5;total)

..

To find the maximum and minimum records can also use the maxp and minp functions, which are equivalent to top 1 and top -1, respectively.

Max ***.maxp(total) = top(1,total)

..

Min ***.minp(total) = top(-1,total)

..

TOP-N within groups

To find out the top 5 students in each class, you can use the top function in the group() function,


A

2

=score.derive(

  (English+Chinese+Math):total

).group(Class;~.top(-5;total))

The top 5 students from each class as shown in the figure are a series of subsets. Clicking on Class 3 will reveal the top 5 students from Class 3.

..

The groups()and group() functions are both used for grouping. Simply put, the former calculates some aggregation values directly after grouping, while the latter obtains subsets of each group after grouping. The subsets can be flexibly defined for more calculations and may not always be aggregated.

SQL

SELECT

  * , (English+Chinese+Math) AS total ,

  RANK OVER(

    PARTITION BY Class ORDER BY

    English+Chinese+Math DESC

  ) AS rank

FROM score WHERE rank <= 5

More calculation functions:Date calculation

SPL provides many related calculation functions for different types of data.

Here listed some calculations based on the send and receive date fields, to obtain the year, month, day, and day of the week, to calculate how many days and weeks there is a difference between two dates, and to calculate the new date after a date.

Based on send (Sending Time) 2023-12-03 15:12:23receive(Receiving Time)2024-03-15 11:12:30

year(send) = 2023

month(send) = 12

day(send) = 3

day@w(send) = 1 , Sunday

interval(send, receive) = 103103 days in between

interval@w(send, receive) = 1414 weeks in between

elapse(send,50) = 2024-01-22 15:12:2350 days after

……

More calculation functions:String calculation

There are also many calculation functions for string type data. Multiple strings can be concatenated together using concat, to determine if all characters are lowercase letters, to convert to lowercase, to truncate left, right, or middle substrings, to match strings with wildcard characters, to replace strings, and to randomly generate new strings of specific lengths using certain characters.

Based on first(First Name)Johnlast(Last Name)Smith to calculate

concat(first,last) = JohnSmith

concat(first, " ", last) = John Smith

islower(last) = false

lower(last) = smith

left(last,3) = Smi

mid(last,2,4) = mith

like(last, "*th") = true

replace(last, "th", "??") = Smi??

rands(“abcdefg12345“,5) = 4d32d

……

More calculation functions:Convert to date type

SPL provides conversion functions between date and time, various numerical values, and string types. The target type is used as the function name, and the function automatically determines the original data type and correctly converts it to the target type. For example, the date() function can pass in integers such as year, month, day, string date, and can also specify the parsing format of the string date, ultimately obtaining a date.

date(1982 , 8 , 9) = 1982-08-09

date("1982-08-09") = 1982-08-09

date("1982-08-09 12:15:30") = 1982-08-09

date("08/09/1982" , "MM/dd/yyyy") = 1982-08-09

date("9, Aug 1982" , "d MMM yyyy" : "en") = 1982-08-09

datetime("1982-08-09 13:15:30") = 1982-08-09 13:15:30

datetime("9 Aug 1982 1:15 PM" ,

"d MMM yyyy h:mm a" : "en") = 1982-08-09 13:15:00

time("13:15:30") = 13:15:30

time("13/15/30” ,"HH/mm/ss") = 13:15:30

time("1:15 PM", "h:mm a" : "en") = 13:15:00

……

More calculation functions:Convert to numerical type

There are many numerical types, including integer int, long integer, double precision floating-point number float, real number, and large floating-point number decimal. In addition to converting between various numerical types, string and date types can also be converted to numerical values.

int("68") = 68

int("68.73") = 68

int(68.73) = 68

int(3.1*3.2) = 9

float("68.73") = 68.73

float(68) = 68.0

number("68f") = 68.0

number("$1,100.05", "$#,###.##") = 1100.05

long("68.73") = 68

long(now()) = 1727677043202

decimal(1232456523427854423905234234)

……

More calculation functions:Convert to string type

When converting date, time, and numerical values into strings, the format can also be specified.

string(123) = "123"

string(now(), "MMM dd, yyyy") = "Sep 30, 2024"

string(1100.05 , "$#,##0.00") = "$1,100.05"

string(datetime("1982-08-09 13:15:30"), "d MMM yyyy h:mm a" : "en") = "9 Aug 1982 1:15 PM“

…….

There are more calculation functions available, which can be viewed in detail in the official online documentation https://doc.scudata.com.

Types of associations

Let's take a look at operations involving multiple file associations.

SPL regards associations differently from SQL, as it distinguishes between foreign key associations and primary key associations.

For example, if the City foreign key in the student table is linked to the primary key CityID in the city table, this is a foreign key association. The ordinary fields Class and School in the student table are respectively linked to the primary keys ClassID and SchoolID in the class table. This two field and multi field association is also a foreign key association.

The student table is associated with the primary key StudentID of the student health table using the primary key, which is called primary key association. These two tables have a one-to-one relationship, also known as a homo-dimensional table.

The student table is associated with the student payment table's partial primary key StudentID using the primary key, which also belongs to primary key association. These two tables have a one to many relationship, also known as the primary sub tables.

Based on different association characteristics, SPL uses different functions to distinguish and implement associations.

..

Association operations involve multiple files. Similarly, in A1, multiple texts will be loaded as SPL table sequence variables to facilitate subsequent calculations.


A

1

>student=T("student.txt"),

 city=T("city.txt"),

 class=T("class.txt"),

 stu_pay=T("stu_pay.txt"),

 stu_health=T("stu_health.txt")

Foreign key association calculation

There are class teacher ID, English teacher ID, and other information in class.txt, while teacher.txt records the teacher's name, gender, and subject being taught. We need to find out the classes where the class teacher is a Chinese teacher, and it is necessary to associate the ClassTeacher in class.txt with the teacherId in teacher.txt.

SPL uses the switch()method to join two tables. During the calculation process of A2, four table sequences are involved, with class.txt loaded as the table sequence class and teacher.txt loaded as the table sequence teacher, then use the switch() function to replace ClassTeacher field value of the class with the relevant record of the teacher, and calculate the CT table sequence. The field values of the SPL table sequence can be complex data types such as records or even sub table sequences. CT is a multi-level nested table sequence.

It is not difficult to obtain data from nested sub records. For example, when CT2 with a simple structure is generated by CT new(), ClassTeacher.Subject obtains the subject taught by the class teacher, and finally uses the select() function to find the Chinese subject.


A

2

>CT=class.switch(

  ClassTeacher ,teacher:teacherId

 )

 ,CT2=CT.new(

  Class

  , ClassTeacher.Name

  , ClassTeacher.Subject

 ).select(Subject=="Chinese")

..

SQL

SELECT

  C.Class ,T.Name ,T.Subject

FROM

  class C JOIN teacher T

    ON C.ClassTeacher=T.treachId

WHERE T.Subject="Chinese"

Multi level foreign key association

The FromCity field of the student table is foreign key to the city table, the ClassID field is foreign key to the class table, and the English, Chinese, and Mathematics teacher fields in the class table are all foreign key to the teacher table.

The student table has two foreign key fields, and the class table has three. There is also a multi-level association between the student table and the class table, and between the class table and the teacher table.

To query students from City4 and whose teacher is Teacher8, four tables need to be associated together. The switch() function supports associating multiple foreign key tables simultaneously, for example, in A2, first associate the class table with the three subject teachers, and then associate the city table and A2's class table in A3, and the result of A3 association is a more complex multi-level table sequence, which can visually display the association results.

..

Finally, in A4, use the new() function to extract the required information from the table sequence at each level and filter out students from City4 who are taught by Teacher8.

..


A

2

=class.switch(

  EnglishTeacher, teacher:TeacherID

  ; ChineseTeacher, teacher:TeacherID

  ; MathTeacher, teacher:TeacherID

)

3

=student.switch(

  ClassID, A2 : ClassID

  ; FromCity, city : CityID

)

4

=B2.new(StudentID, Name

,FromCity.Name : City ,ClassID.EnglishTeacher.Name : EnTeacher

,ClassID.ChineseTeacher.Name : CnTeacher,ClassID.MathTeacher.Name : MaTeacher

).select(City=="City4" &&

(EnTeacher=="Teacher8" || CnTeacher=="Teacher8" || MaTeacher=="Teacher8"))

SQL

SELECT

  s.StudentID, s.Name, c.Name AS city, en.Name AS EnTeacher,

    cn.Name AS CnTeacher, ma.Name AS MaTeacher

FROM

  student s JOIN city c JOIN class cl

    JOIN teacher en JOIN teacher cn JOIN ma

  ON

    s.FromCity=c.CityID AND s.ClassID=cl.ClassID AND cl.EnglishTeacher=en.TeacherID

    AND cl.ChineseTeacher=cn.TeacherID AND cl.MathTeacher=ma.TeacherID
WHERE

  City="City4" AND

  (EnTeacher="Teacher8" OR CnTeacher=“Teacher8” OR MaTeacher=“Teacher8”)

Multi field foreign key association

The switch()function is suitable for single field foreign key associations, and when multiple fields are associated, the join() function should be used.

Add a School column to the score table to record the scores of 30 classes in 5 schools, with 6 classes in each school. The class table records these 30 classes, with the primary keys being the Class and School fields. The classes are classified into Type 1, Type 2, and Type 3.

Now we need to find out the scores of all students in Type3 classes. In the parameters of the join() function, before the first comma are multiple foreign key fields, Class and School; Before the second comma is the associated table class, along with its multiple primary key fields Class and School; After that are the target field selected from the associated table, which can be multiple, and we only selected one ClassType field here.

The switch() function is for single field association that supports multiple tables, resulting in nested multi-level table sequences.

The join() function can support multi field association, but not multi table. The result is to directly join the fields of two tables to form a single-layer structured table sequence.

..


A

2

>class=class.select(ClassType==3)

3

>CT=score.join(

  Class:School

  ,class:Class:School

  ,ClassType

)

SQL

SELECT

  s.*, c.ClassType

FROM

  score s JOIN class c

    ON s.Class=c.Class

    AND s.School=c.School

WHERE c.ClassType=3

Homo-dimensional table - primary key association

The primary keys of the student table, family table, and health table are all Student IDs, and the three tables have a one-to-one homo-dimensional relationship. The latter two tables provide supplementary information on different aspects of students.

The primary key association requires the use of a new association function, such as join (A, B, C), which defines information for multiple associated tables in the parameters: the associated table (student), the field name(stu) in the table sequence after association, and the associated primary key (StudentID).

Observing the association result of A2, the three table records with Student ID 2 are associated together.

..

After association in A2, A3 filters out students and parents with myopia degree exceeding 500 degrees.

..


A

2

=join(

  student : stu , StudentID;

  family : stu_family , StudentID;

  health : stu_health , StudentID

)

3

=A2.new(stu.Name, stu_family.Father

  , stu_family.Mother, stu_health.EyeVision)

.select(EyeVision>500)

SQL

SELECT

  stu.Name, stu_family.Father

  , stu_family.Mother, stu_health.EyeVision

FROM

  student stu

    JOIN family stu_family

    JOIN health stu_health

  ON stu.StudentID=stu_family.StudentID

    AND stu.StudentID=stu_family.StudentID

WHERE sty_health.EyeVision>500

Primary sub table - primary key association

Another type of primary key association is the primary sub table, where the student payment table 'pay' is a sub table. It has two primary keys, Student ID and PayDate, and uses the Student ID as a foreign key to the primary tables student primary key.

The primary sub table also use the join() function to associate. A student has 5 payment records, and each payment record is associated with the student record in the primary table.

..

After association in A2, A3 uses the groups() function to calculate the total payment amount for each student.

..


A

2

=join(

  student : stu,StudentID;

  family : stu_family,StudentID;

  pay : stu_pay,StudentID

)

3

=A2.groups(

  stu.Name, stu_family.Father;

  sum(stu_pay.PayMoney):allPay

)

SQL

SELECT

  stu.Name, stu_family.Father,

  sum(stu_pay.PayMoney):allPay

FROM

  student stu

    JOIN family stu_family

    JOIN pay stu_pay

  ON stu.StudentID=stu_family.StudentID

    AND stu.StudentID=stu_pay.StudentID

GROUP BY stu.Name, stu_family.Father

And more…

The above is the content of structured text calculation in SPL, which provides a simple and powerful calculation solution for data outside of databases.

The examples involved here are all small data that can be read into memory. For files larger than memory, SPL also provides a cursor solution for calculation. Interested friends can refer to SPL's official documentation.

Of course, SPL's capabilities go far beyond that. SPL also supports various text formats such as JSON and XML, cross database queries, and supports mixed computing between databases and other non databases, supports for big data or complex computing, and SPL's computing power can also be used to optimize SQL performance. We will gradually introduce these contents in later topics.

SPL is an open source software that can be obtained from GitHub. https://github.com/SPLWare/esProc

Download Test Data