SPL concepts for beginners
As a programming language, SPL has some inherent characteristics. If you do not understand them, you may not have a problem in hands-on learning, but it is difficult to fully utilize its functionality in architecture design or coding. Here we attempt to establish a basic SPL conceptual framework for beginners, and you can refer to relevant documents for more information based on actual needs.
These contents are not too few and require several hours of careful reading and understanding.
I Application Structure
Firstly, it is necessary to understand the operating environment and application structure of SPL.
1. Script and Development
SPL is an interpreted executing programming language, and the code is usually not large and is commonly referred to as a script.
SPL scripts are generally submitted as files with the extension of .splx, not in text format. SPL also supports script files in text format with the extension .spl.
SPL provides a specialized IDE for writing SPL scripts, which supports regular debugging functions. IDE is a client program that can run on Windows, Linux, and Mac.
2. Environment and Integration
SPL software can be divided into three parts: IDE, JDBC, and Server. All are developed in pure Java and can run in any JVM environment with JDK1.8 or higher versions.
SPL JDBC is not an independent process, it consists of jars and can be imported into Java applications and provide standard JDBC interfaces to applications.
Comparison of JDBC drivers between SPL JDBC and traditional RDB JDBC:
1) SPL JDBC includes a complete computing library, which can provide computing services independently and no longer rely on a server to work.
2) The accepted statement is SPL instead of SQL, and executing an .splx script is similar to using SQL to execute a stored procedure in RDB.
SPL Server is an independent process that can provide external services based on the HTTP protocol, which can be invoked by non-java applications. It can also be accessed and invoked by SPL JDBC.
Multiple SPL servers can form a cluster. Many big data solutions emphasize the role of clusters, but SPL's multiple practices have shown that a single machine can solve the vast majority of tasks. Unless there is a high concurrency requirement, clusters are rarely used, and beginners do not need to pay attention to clusters.
3. Data sources and external libraries
SPL supports diverse data sources, and any data source with a Java interface can be accessed and computed.
SPL has built-in text files, Excel files, JDBC driven relational databases, and HTTP/RESTful access interfaces, allowing direct access to these data sources.
SPL provides various external libraries for other data sources. After importing the corresponding external libraries, the corresponding data sources can be accessed in SPL scripts, and the specific read and write capabilities are determined by the data source itself. SPL has already provided corresponding external libraries for common data sources in the industry .
SPL abstracts data sources into two forms: table sequences and cursors, table sequences will be read into memory at one time for processing; If the data source supports it, SPL can also use cursors to gradually read in data and process it.
SPL does not require all external data to be read into memory for calculation. Don't understand esProc as an in-memory computing engine. Of course, it can maintain a certain amount of memory data for use as an in-memory computing engine.
SPL does not have the concept of a "(data)base", and any accessible data source is logically equivalent, with only differences in the functionality and performance of the data source itself. esProc is not responsible for the management and security of the data itself, only for computation.
II Script and syntax
The basic logic and concepts of SPL code are similar to most programming languages and are very simple for experienced programmers, but there are still some SPL specific contents that need attention.
1. Variable
SPL can use regular variable names, but it is more recommended to directly use cells as variable names.
The cell names referenced in the expression will automatically transition during editing , but those in the string will not. When using $[...] to represent the string, the cells involved will also transition during editing.
SPL variables have genericity and do not need to declare data types, so what is calculated is what it is.
Variables can also be used temporarily in SPL expressions.
2. Function
SPL has a large number of syntax for function options to distinguish between slightly different running states of a function. Options can be used in combination, and special attention should be paid to the relevant options when understanding function functionalities.
replace (s, s1, s2) Replace substring
replace@1 (s, s1, s2) Replace only the first one
replace@c (s, s1, s2) None case sensitive during matching
replace@c1( s, s1, s2 ) None case sensitive during matching and only the first one is replaced
SPL uses cascade parameters to describe complex structured parameters, with parameters at each level separated by colons, commas, and semicolons. This is significantly different from using only commas to separate parameters in regular languages. When understanding function definitions, it is important to pay attention to the cascade parameter hierarchy.
if (a, b, c) Regular one layer parameters, comma
between (x, a: b) Two-layer parameters, comma and colon
hash (xi,...; n) Two-layer parameters, comma and semicolon
case (x1: y1, x2: y2,...; y) Three-layer parameters
3. Object
SPL has the concept of object, and many functions are encapsulated on objects.
SPL does not have important object-oriented mechanisms for inheritance and overloading. Strictly speaking, SPL is not an object-oriented language. In this sense, SPL is simpler than Java or even Python. The main function of SPL objects is to encapsulate a series of related functions.
4. Lambda Syntax
SPL supports Lambda syntax, and can be said to be a functional language in a sense (it also supports process).
SPL does not support defining variable names for Lambda syntax, but instead uses fixed symbols to represent variables in Lambda syntax. ~ represents the current member, # represents the sequence number of the current member, and ~[i] and ~[a: b] can also be used to represent adjacent members and subsets.
A.sum( ~*~ ) Calculate the sum of squares
A.sum( if(#%2==0,~) ) Calculates the sum of members in even positions
A.max( ~-~[-1] ) Calculate the maximum difference value
A.( ~[-1,1].avg() ) Calculate moving average
Whether a certain parameter can be written in Lamdba syntax depends on the function itself, and attention should be paid when understanding the function definition.
Refer to Lambda syntax in SPL
5. Macro
SPL is a dynamic language that can use macros to temporarily change statements during execution.
Like SQL, when it comes to table and field names in SPL statements, they can mostly be written directly without the need to be written in strings, such as:
file("employee.txt").import@t( ID, Name, Gender, Department )
The ID, Name, Gender, and Department fields can be read from employee.txt to form a data table, making coding more convenient and readable.
However, if you want the parameters of the import function to change (possibly changing the fields to be read based on the code context), it is not very convenient to describe it. Write it as
file("employee.txt").impport@t( "ID, Name, Gender, Department" )
The import function will process the entire 'ID, Name, Gender, Department' as a field name, which is obviously not what we want.
Using macros can solve this problem:
S="ID,Name,Gender,Department"
file("employee.txt").import@t(${S})
The essence of a macro is a string, which can be concatenated into an executable statement to obtain a new executable statement.
Refer to Macro in SPL
III Data object
There are three most important data objects in SPL, which are the foundation for processing data with SPL.
1. Sequence
A sequence is an ordered set in memory.
The sets in SPL are all ordered, which means a sequence member can be accessed by its sequence number. The members have a clear order in the set, which is consistent with arrays in Java.
The sequence of SPL has generalization, and it is not required that the sequence members have the same data type. The members of the sequence can be of any data type, and the following are legal sequences:
[1,2,3]
["1","2","3"]
[1,"2",3]
[1,[2,3],4,5]
Specifically, when looping through set members, you can directly write
for A
Members of A can be directly referenced in the loop body
It is not recommended to write it as
for i,1,A.len()
// Using A(i) to refer to the members of A within the loop
Don’t write it as
i=1
for i<=A.len()
// Using A(i) to refer to the members of A within the loop
i+=1
Unlike SQL, SPL provides a data object without data structures, i.e., sequence, which can only be represented as a single field table in SQL.
The aggregation functions of SPL are often designed for sequences, such as filtering, aggregation, grouping, and other operations, without necessarily targeting structured data tables.
Sequences are more fundamental data objects than data tables, and learning SPL requires getting used to using lighter sequences instead of always generating structured data tables.
2. Table sequence
A table sequence is a data table in memory.
SPL inherits the concepts of records, fields, and data tables in SQL, and a table sequence can be understood as a set of records with the same data structure.
The sets in SPL are always ordered, and the table sequence is no exception as a set of records, hence it is called a table sequence.
A table sequence is also a sequence, and members can also be accessed by sequence numbers. It is recommended to use the aforementioned syntax for loops of table sequence members.
The table sequence of SPL also has generalization, and the data structure of records in the table sequence requires to be the same, but there are no restrictions on the values of fields. The values of the same field for different records in the table sequence can be of different data types (although it is not common).
Specifically, the field values can even be another table sequence. This way, it is easy to represent JSON style multi-layer data structures in SPL.
A SPL record is also a type of data object that can be extracted from the table sequence `and used for operations. This is different from the concept of SQL, which does not have a record data type. A single record is actually a data table with only one record.
SPL also allows records in the table sequence to form a new set, and the sequence composed of records is called a record sequence.
Specifically, a subset filtered out by certain criteria from a certain table sequence (the WHERE operation in SQL, the select function in SPL) is a record sequence. The member records that make up this record sequence are the same object as the records in the original table sequence. Changing the field values in these records will cause the corresponding record values in the original table sequence to also change. SPL will not copy these records, which can achieve better computational performance and occupy less memory space. This is very different from SQL, as the result set of the WHERE operation in SQL has nothing to do with the original data table and is a new record that has been replicated.
Record sequence also support genericity, and the records that make up a record sequence can come from different table sequences. That is, records with different data structures may be put together for operation, however this situation is not common.
In SPL, it is common for records to exist separately or form a set after being separated from the table sequence and participate in operations, known as discreteness. This is one of the essential differences between SPL and SQL. This is actually consistent with Java's object referencing mechanism, and Java programmers can easily understand this data type.
In SPL, the field values of a record are allowed to be records from another table sequence, which can achieve association and reference between multiple tables. such as
D=file("department.txt").import@t()
E=file("employee.txt").import@t()
E.switch( DeptID, D:ID) Change the DeptID field value in Table E to the corresponding record in Table D
D.switch( Manager, E:ID ) Change the value of the Manager field in Table D to the corresponding record in Table E
E.select( DeptId.Manager.Gender=="Female" ) Now it is possible to directly reference the field values of records stored in the field values
Similarly, field values can also be another record sequence, which can achieve the structure of the primary and sub tables.
D=file("department.txt").import@t()
E=file("employee.txt").import@t()
D.derive( E.select( DeptID==D.ID ):Employee ) Add the Employee field to the D table, the value is a record sequence composed of the department’s employee records
D.select( Employee.len()>10 ) Use this field whose value is a record sequence
This type of inter table reference relationship achieved through records and record sequences is common in SPL and not found in SQL, but it is easy to achieve this effect in Java.
3. Cursor
External storage is not as suitable for random access as memory, and often can only provide streaming access mechanisms. Therefore, calculation schemes need to be designed under this premise. There are also many operations that do not require all source data to be loaded into memory, as long as the data is gradually read in for cumulative calculations, such as summation. This allows for processing larger amounts of data with smaller memory.
In SPL, data objects that can be streamed in are abstracted as cursors, and the basic conceptual features of SPL cursors are similar to database cursors.
Like database cursors, SPL cursors are also unidirectional and can only read data forwards until all data is retrieved, and cannot be rolled back.
The data read by SPL cursors is usually organized into table sequences or sequences (table sequences are more common), unlike SQL, which can only read one record at a time, SPL provides a method for batch reading data from cursors.
cs=file("Orders.txt").cursor@t()
cs.fetch( 100 ) Read 100 records to form a table sequence
cs.fetch( ;TradeDate<=date("2022-12-31") ) Read records that consistently satisfy the conditions (note that is not to filter all records by the critera)
cs.fetch( ;UserID ) Read the records until the UserID field changes
The latter reading method is very useful in many scenarios and is a unique mechanism of SPL.
For debugging convenience, SPL also provides options
cs.fetch@0( 1 ) reads one record, but the cursor position does not move. You can still read this record again next time.
This way, you can first check if the data in the cursor is normal without disrupting the overall code execution.
The SPL cursor not only has a reading function, but can also be used for additional operations, such as
cs.select( ... ) Filter
cs.groups( ... ) Group
The operation functions on a cursor are very similar in syntax to the corresponding functions on a table sequence (record sequence), ensuring that the codes for in-memory and external storage operations are as consistent as possible. But it should still be noted that these two are not equal.
The operation functions on the cursor can be divided into two categories. One is called delayed calculation. When executing such functions, it is just a mark and does not actually traverse the cursor. Another type is called immediate computation, which essentially traverses the cursor and calculates the result.
cs2=cs.select( ... ) Delayed calculation, only registering on the cursor that there is a filtering action, and the cursor will not be immediately traversed, returning a new cursor
cs2.groups(...) Immediate computation, performs traversal of the cursor and returns the calculation result. Delayed calculations previously registered on the cursor will also be executed.
This is different from table sequence (record sequence), where the operation functions on the table sequence (record sequence) will immediately execute and return the corresponding results.
Cursors can only traverse once and become invalid after completion, unlike in-memory table sequences that can be calculated over and over. Some cursor-based operations require ordered data, such as intersection, union, and difference operations of sets.
SPL provides a mechanism for multi-purpose traversal, which can calculate multiple results in a single traversal process.
SPL also provides a unique multi-cursor that can split data into multiple segments (related to storage schemes) and traverse them in parallel, fully utilizing the concurrency capabilities of multi-core CPUs and SSDs. Database cursors do not have direct parallel capabilities, making writing parallel code very difficult.
IV Understanding operations
SPL provides a rich class library for structured data computing, including conventional set operations such as intersection, union, and subtraction, as well as filtering, grouping, and join of structured data. Some of these operations have a unique understanding and style of SPL.
1. Loop function
Most operations and even processing actions on sets (sequences/table sequences/record sequences) can be solved using loop functions without the need for complex loop statements. This way, the writing is simple and the performance is also better
p=directory("*.csv") List all CSV files in the directory
p.conj( file(~).import@tc() ) Read these files and concatenate them
Develop this good habit.
Using the formula e=1+1/1!+1/2!+1/3!+… to calculate the base e of natural logarithm (use the first 20 items)
=1+20.run(~=~*if(#>1,~[-1],1)).sum(1/~)
Understanding this statement is of great help in understanding the loop calculation logic of SPL.
The Lambda syntax mentioned earlier can be used in loop functions. In the Lamdba syntax of table sequences and record sequences, field names can be directly referenced without the need to write ~.
2. Position and alignment
SPL provides many position-related functions.
T.pmax(age) returns the sequence number of the record with the highest age field value in the table sequence T.
T.pselect@a( age>50 ) returns the sequence numbers of all records with age>50 in table sequence T.
Making good use of it can make the code more concise.
T=file("stock.txt").import@t().sort(Date)
S=T.ptop( 10; -Price ) The positions of the highest 10 days stock price
T.calc( S, Price-Pirce[-1] ) The price increase in those 10 days
And the function of alignment by position, which is equivalent to sorting data in a specified order
T.align( 31, Day ) assigns transaction records to 31 days of the month, and fills in null for dates without transactions to ensure that the result is 31 members.
T.align( ["Sun","Mon","Tue","Web","Thu","Fri","Sat"], WeekDay )
3. Understanding aggregation
Unlike SQL, aggregation operations in SPL are not limited to operations such as SUM/COUNT/MAX/MIN that return a single numerical value. Any smaller scale values calculated from a set can be considered an aggregation calculation.
The result of an aggregation operation may be a small set, such as top, which returns the top N members.
The result of an aggregation operation may also be an object, such as maxp, that returns the record where the maximum value is located.
All aggregation calculations can be used for grouping.
Some examples:
T.top( -3, salary ) Top three highest salaries
T.top( -3; salary ) Top three employees with the highest salary
T.maxp( age ) The oldest employee
T.groups( Dept; maxp( age ) ) The oldest employee in each department
These aggregation operations can also be applied to grouping and aggregation on cursors.
Programmers can also use the iterate function to customize aggregation operations.
4. Grouping
SQL grouping is accompanied by mandatory aggregation for the subsets of the grouping, while SPL does not have this requirement. Grouping and aggregation can be divided into two steps, and the subsets of the grouping can be retained to continue participating in the calculation.
T.group( Birthday ).select( ~.len()>1 ).conj() Someone with the same birthday as others
SQL only has one equivalence grouping, which is to group members with the same key value into the same group. SPL provides more grouping methods besides equivalence grouping:
Ordered grouping
T.sort( Date ).group@i( Price<Price[-1] ).max( ~.len() ) Scan data in order, group when prices fall, and calculate the number of max consecutive rising days
Sequence number grouping
T.groups@n( Day; sum( Amount) ) Calculate the daily sales revenue
Alignment grouping
T.align@a( ["Male","Female"], Gender ) Group by gender to ensure two results
Equivalence grouping is a complete partition, where all original members are divided and only assigned to one grouping subset, without the appearance of empty grouping subsets. SPL can support incomplete grouping, allowing some original members to be discarded, allowing empty subsets of the grouping, and even achieving repeatable grouping, where certain original members are divided into multiple grouping subsets.
The de duplication operation (DISTINCT) is essentially grouping (without aggregation operation), which needs to be solved using grouping thinking.
5. Join
SQL defines JOIN as Cartesian products and then filtering, without distinguishing between equivalence and non-equivalence JOINs, and does not require correlation with primary keys. SPL also provides this free and flexible JOIN operation, but it is rarely used.
Equivalence join, which is the most common form of join operations, is a join that filters based on the equivalence of the corresponding fields in the associated tables. SPL further divides equivalence join into two types:
1) Foreign key join: Equivalence join between a field in the fact table and the primary key in the dimension table
2) Primary key join: Equivalence join between the primary key of Table A and the primary key or partial primary key of Table B
Both joins have primary keys involved. Equivalence joins without primary key participation are mostly caused by errors in business logic and data.
The two types of join operations need to be implemented using completely different functions. When applying join operations, it is necessary to clearly distinguish the type of join and find the participating (logical) primary key.
T.switch( C, D:K ) Perform a foreign key join between the C field of Table T and the primary key K of Table D, and convert K into a reference of the record in Table D
T.join( C, D:K, x:F ) Perform a foreign key join between the C field of Table T and the primary key K of Table D, calculate x to the associated record of Table D as F field and join to table T.
join(T1,K1; T2,K2 ) Perform a primary key join between the K1 field of Table T1 and the K2 field of Table T2, and return a table sequence with the associated T1 and T2 records as fields
The dimension table during foreign key join needs to be randomly accessed and can only be stored in memory in principle. SPL provides specialized functions for huge dimension tables that cannot be loaded in memory.
Cursor based primary key join operations require all participating cursors to be ordered for the primary key.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL