Optimization Skills for SPL Code
SPL is a programming language especially designed for structured data calculation, and esProc is the implementation of SPL in java, which provides the IDE for coding and debugging in a grid-style programming. Also, it is easier to understand and more efficient to develop than Java and SQL due to its concise syntax. This article will illustrate some skills for improving the calculation performance from the implementation principle of esProc.
1. Data types
1.1 Number
There are various number types in SPL, such as Integer, Long, Double, BigDecimal, among which the BigDecimal can represent numbers of arbitrary precision. Its drawbacks, however, are also obvious, like dragging down the calculation and occupying much larger memory compared with other number types. Therefore, we should use other types instead of BigDecimal to boost the calculation efficiency when they can meet the precision requirement.
In practice, the JDBC of some databases will still return BigDecimal for some low-precision numbers when fetching data through JDBC. In that case, we can examine whether they can be converted to other types so as to improve the performance.
1.2 String
The “String” object of Java takes up relatively large space, to be more specific, one string of length 0 occupies more than 40 bytes, while Integer and Long only occupy 16 bytes. And the comparison and hash operations of String are also slower than Integer and Long.
What’s more, during loading data from hard disk and then generating java objects, the size of data as Java object is often several times or even ten times more than the size of the same data in hard disk (the difference can be even larger if the compression technology is used for the hard disk storage). This situation may directly lead to memory overflow even the relatively small data are loaded and converted to java objects, at which point the only way is to perform the external storage calculation if we can not decrease the memory occupation. But the calculation on external storage is generally far more complex than in-memory calculation, resulting in performance degradation.
So is there any method that can not only occupy less memory but also improve the computation efficiency?
A very common way is to sequence-numberize the enumeration strings, for example, the data of a fact table are as follows:
Region |
Gender |
Name |
Beijing |
Male |
Jack |
Beijing |
Male |
Michael |
Shanghai |
Male |
Tom |
Beijing |
Female |
Amy |
Shanghai |
Female |
Anne |
For the enumeration fields like gender and region, we can create a corresponding table to convert the values of gender and region into sequence numbers, 1, 2, ..., so that the gender and region fields in the fact table can be stored as the corresponding sequence numbers. And the converted data are as follows:
1 |
Beijing |
2 |
Shanghai |
… |
… |
1 |
Male |
2 |
Female |
Region |
Gender |
Name |
1 |
1 |
Jack |
1 |
1 |
Michael |
2 |
1 |
Tom |
1 |
2 |
Amy |
2 |
2 |
Anne |
In this way, we can both decrease the memory occupation and improve the efficiency because operations such as comparison and grouping on numbers are much faster than those on strings. We can also convert the sequence numbers back to strings as needed when outputting the results, that is, use the sequence number to locate the corresponding record in the code table directly for replacement.
2. Table sequence structure
2.1 Append rows
A table sequence is similar to the table in the database but in order, and the data of it are stored as a continuous array in memory. In general, more memory will be reserved for the table sequence to cope with the possible data addition, in case that we need to reallocate memory every time new data are appended. But still, we cannot reserve too much space and waste memory.
If we append records to the table sequence frequently, the array will become longer and longer, and the memory originally allocated for the array will have to be expanded. However, it is not that easy to expand the allocated memory, we have to allocate even larger space and copy the data from the original space to the new space. In this processing, both finding new space and copying the data will take up CPU time and usually consume more resources than the calculation itself.
Therefore, if we know the number of rows in advance and create the table sequence all at once, we only need to allocate memory once at the beginning. Even if the field values in the table sequence need some steps to be calculated, we should still create the table sequence with the new function first and then modify the field values of the records, instead of calculating and appending rows one by one. As for modifying the field values of the records, SPL offers a lot of methods.
Suppose we want to generate a table sequence of Fibonacci sequence with 20 rows and 2 columns, the first column “key” is the row number, i.e. 1, 2, 3, ...; the second column “value” is the value. The rule of Fibonacci table sequence is: the 1st and 2nd rows take the value 1, and from the 3rd row onwards, the value is the sum of the previous two rows. This operation needs to be performed step by step, so it is natural to append the data dynamically:
A |
B |
|
1 |
=create(key,value) |
|
2 |
>a=0 |
>b=1 |
3 |
for 20 |
>A1.insert(0,A3,b) |
4 |
>b=a+b,a=b-a |
However, the performance will be better if the table sequence is generated all at once, even the calculation itself still needs to be performed step by step:
A |
B |
|
1 |
=20.new(~:key,:value) |
|
2 |
>a=0 |
>b=1 |
3 |
for A1 |
>A3.value=b |
4 |
>b=a+b,a=b-a |
2.2 Append columns
To expand a table sequence, in addition to appending the data by row, we may also change the data structure by adding fields to each row, also known as appending columns. Column appending is even more complicated than row appending since the table sequence is a big array itself where each row is not only one record but also an array physically. Because the data structure rarely changes, we will not reserve space when generating the array for each row during creating the table sequence, otherwise too much memory will be wasted (for space should be reserved for each row). Based on this principle, the aforementioned situation, reallocating memory, will occur when appending columns, and the operation has to be done on each row of record. After that, we will copy over the original records. All in all, the whole processing will consume considerable time, and even far more than the to-do calculation after appending that column.
SPL offers the functionality of appending columns to a table sequence, which is very convenient, but should be used with caution as far as performance is concerned. If we have no choice but to do so, it is necessary to add all the to-be-appended columns at once rather than appending them one by one. For columns whose field values cannot be calculated at that time, set them as null first and then modify them using other functions.
After retrieving the table sequence from the database, if we know in advance that a new column xxx is needed to be generated with the derive function, we can add “null as xxx” in the SQL statement. In this way, the required fields are generated directly in the query without another derive.
For example, to retrieve fields ORDERDATE and AMOUNT from the data table sales, and sort them by field ORDERDATE, then append one column to calculate the cumulate value of AMOUNT, the natural syntax of loading data first and appending column later is:
A |
|
1 |
=demo.query("select ORDERDATE,AMOUNT from sales order by ORDERDATE") |
2 |
=A1.derive(CUMULATE[-1]+AMOUNT:CUMULATE) |
And the script that use SQL statement to generate the column first is as follows:
A |
|
1 |
=demo.query("select ORDERDATE,AMOUNT, null as CUMULATE from sales order by ORDERDATE") |
2 |
>A1.run(CUMULATE=CUMULATE[-1]+AMOUNT) |
2.3 Reference records
For the optimization logic of adjusting the structure of the table sequence in the above two sections, the starting points are both to decrease the actions of coping field values in the new and derive functions. Apart from those skills, SPL also supports object references where the field value can be another record. Thus most situations in SPL don’t need to copy the fields in the new result set as in SQL, instead, we can use reference in order to keep the whole record together for calculation, which improves the performance as well as occupies less memory.
The above requirement that appends the cumulate value AMOUNT with the derive function can be performed by the new function. It creates a new table sequence where the SRC field references the original records and the CUMULATE field stores the cumulate values, and the script is as follows:
A |
|
1 |
=demo.query("select ORDERDATE,AMOUNT from sales order by ORDERDATE") |
2 |
=A1.new(~:SRC,CUMULATE[-1]+AMOUNT:CUMULATE) |
3. Loop function
3.1 Replace the loop statement with loop function
The grid-style program of SPL provides the loop statement “for” and branch statement “if” to implement some complex calculation ideas. At runtime, since the order of cell execution is interpreted dynamically, the extensive use of loops will result in too many executing cells, thus a lot of time will be spend on the dynamic interpretation of the cells.
Apart from the loop statement, SPL also provides the loop function, suitable for most scenarios that require the “for” statement. The operations with less complex computation steps and high performance requirements should be done with the loop function as much as possible. Similarly, scenarios that can be handled with “if” function should not use “if” statement.
The example given in Section 1.2 for calculating the Fibonacci sequence can be rewritten as follows:
= |
A |
1 |
=20.new(~:key,:value) |
2 |
=A1.run(value=if(#>2,value[-2]+value[-1],1)) |
In the script, # indicates the record currently looped, i.e., the # corresponding to the first record is 1, and the # increases in ascending order. value[-1] represents the value of the previous record, and value[-2] represents the value of the record before the previous one.
Every time being executed, the “eval” function will first parse the expression string specified by parameters into an expression. But if it is executed in a loop, the parsing process mentioned above will take up too much time. And if the expression string is not variable, we can use macro substitution instead of “eval” function.
3.2 Place constant variables outside the loop
It can also be helpful for the performance improvement to place the generation of constant variables outside the loop. For example, to select the sales records of Beijing, Shanghai and Shenzhen, the more “natural” script is:
A |
|
1 |
=demo.query("select * from Suppliers") |
2 |
=A2.select(["Beijing","Shanghai","Shenzhen"].contain(Cities)) |
Because the sequence in SPL can be modified, the expression ["Beijing", "Shanghai", "Shenzhen"] will generate a new sequence every time it is calculated. Therefore, if we put ["Beijing", "Shanghai", "Shenzhen"] in the loop function “select” like above, it will generate sequences as many as A2’s length when being executed. Too many loops will waste a lot of time, so the performance-oriented script should be as follows:
A |
|
1 |
=["Beijing","Shanghai","Shenzhen"] |
2 |
=demo.query("select * from Suppliers") |
3 |
=A2.select(A1.contain(Cities)) |
3.3 Avoid loop in a loop
We should also be wary of executing the loop function within a loop function. The code may look very simple, but the actual computation will be magnified geometrically over several levels of loops, which is essentially common sense but may be overlooked sometimes. So what can be done outside the loop should never be placed in the loop. In particular, we should also be aware of those extremely time-consuming actions such as reading files and accessing databases in the loop.
4. Code habits
4.1 Release memory
The performance of Java will degrade dramatically when memory is running low, so it is necessary to release memory in time. SPL does not have such a statement of deleting variables to release memory, but we can just set the value of the variable or cell as null, or clear an area of cells using the “clear” statement. For example:
A |
|
1 |
=file("PART").cursor@b(P_PARTKEY,P_BRAND,P_CONTAINER) |
2 |
>A1.select(P_BRAND == "Brand#33" && P_CONTAINER == "LG DRUM") |
3 |
=A1.fetch() |
4 |
=file(path+"LINEITEM").cursor@b(L_PARTKEY,L_QUANTITY) |
5 |
>A4.join@i(L_PARTKEY,A3:P_PARTKEY) |
6 |
=A4.groups@u(L_PARTKEY;avg(L_QUANTITY):quantity) |
7 |
>A3=null |
8 |
…… |
The cell started with = is the calculation cell, where the return value of the expression will be saved; the cell started with > is the executable cell, and the return value of the expression will not be saved. cs.select and cs.join is used to add operations on the cursor, which will not create a new cursor so the return value can be discarded. A7 is to release the retrieved PART data, and we can also use the “clear” statement to clear the cell values from A1 to A5, which can be done with replaced code in A7 as follows:
A |
|
7 |
clear A1:A5 |
4.2 Compact code
The code blocks of “for” and “if” can be written directly in the same row instead of different rows like in Java. The grid of SPL is able to split these statements out clearly. Also, it takes time for the interpreter to scan the blank cells. Therefore, for the programs containing the loop statement with a particularly large number of loops, we should compact the code by removing blank rows and columns to decrease the number of cells, thus improving the efficiency of the interpreter.
The following is an example for introducing the rules of code block in SPL: to obtain the first sales record of each day. And “sales” is the sales record cursor parameter, ordered by ORDERDATE.
A |
B |
C |
D |
E |
F |
|
1 |
=[] |
|||||
2 |
for |
if A1.len()>1 |
return A1(1) |
>A1=A1.to(2;) |
||
3 |
else |
=sales.fetch(1000) |
if C3==null || C3.len()==0 |
if A1.len()==1 |
return A1(1) |
|
4 |
break |
|||||
5 |
>A1 = (A1|C3).group@1(ORDERDATE) |
The code blocks of the cell are the rows where the cell is located and the cells directly below it and the lower left cells of it are all blank. So the code block of “for” in A2 is [B2:F5]. The code block of “if” in B2 is [C2:F2], the next row of the “if” code block and the cell B3 in the same column with “if” cell are both “else”, and the cells to the left of B3 are all blank cells, then B3 cell is the “else” branch of B2 cell. So the code block of cell B3 is [C3 : F5]. “else” can also be written in the same row with the corresponding “if”, i.e., the right cell of “if”.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version