Performance Optimization Skills: Program Cursor
Abstract
Many algorithms will lead to an intermediate result of data buffering on HDD, especially for big data amount, thus degrading the performance greatly. In this case, we can use the program cursor to avoid data buffering, and here we’ll introduce it with specific examples.
Calculation aim: The orders table has already sorted by time. We need to DISTINCT these records by date and product and then count the number of records.
The result set after DISTINCT is too large to load in memory, therefore we usually use the groupx function to DISTINCT. The SPL script is as follows:
A |
||
1 |
=file("sales.btx").cursor@b() |
/load the orders records from the bin file |
2 |
=A1.groupx(date(ORDERDATE),PRODUCTID) |
/group to DISTINCT the records by date and product |
3 |
=A2.skip() |
/count the number on cursor |
However, the groupx function will lead to data buffering, resulting in bad performance.
To keep the intermediate result of DISTINCT from buffering, we can generate a program cursor beforehand. And the distinctProduct.dfx script is:
A |
B |
|
1 |
=file("sales.btx").cursor@b() |
|
2 |
for A1;date(ORDERDATE) |
=A2.id(PRODUCTID) |
3 |
return B2 |
A2: Loop through the orders table. Fetch a batch of records with the same date each time because the data has been sorted by date.
B2: DISTINCT the records by product. Note that although the full amount of data is big, the data fetched by date are relatively small, which can be loaded in memory, thus DISTINCT them by id.
B3: Return the DISTINCT result of this batch. Please note that rather than returning all the data in a time, the program cursor waits the caller’s request after each loop, and proceeds to the next loop if the caller continues to request for data, during which the data will not be buffered.
The main program can call the program cursor using the cursor function, and the program cursor is used in a similar way like the ordinary cursor:
A |
||
1 |
=cursor("distinctProduct.dfx") |
/call the program cursor |
2 |
=A1.skip() |
It is obvious that the program cursor can decrease data buffering and improve the performance.
Set operation
The database requires the data to be sorted first when doing set operations. But the sorting will take a lot of time without getting a result quickly if the data amount is too large. In this case, it is more suitable to use the program cursor to generate an ordered intermediate result set without buffering, thus enabling quick output.
For example, there are two tables of the same structure: callRecordA and callRecordB. Both create index on the time field “callTime” with tens of thousands of data per second. Now we want to do a union operation on the data of 2015-01-01, which requires a quick output of the first 500 records (e.g., presenting them on a report quickly).
This time, the program cursor and the calling code are written in one script as follows:
A |
B |
C |
|
1 |
func |
=connect("orcl") |
|
2 |
for 60*60*24 |
=elapse(datetime("2015-01-01 00:00:00"),A2-1) |
|
3 |
=A1.query("select CALLTIME,OUTID,INID,DURATION,CHARGE from"+A1+"where callTime=?",B2) |
||
4 |
=C3.sort(OUTID,INID,DURATION,CHARGE) |
||
5 |
return C4 |
||
6 |
=B1.close() |
||
7 |
=cursor@c(A1,"callRecordA") |
||
8 |
=cursor@c(A1,"callRecordB") |
||
9 |
=[A7,A8].mergex@u() |
||
10 |
=A9.fetch@x(500) |
A1: Use the func to define a program cursor, and the corresponding calling syntax is cursor@c.
B2: Loop through every second of the day.
C3: Query the data of one second from the database, which executes very fast and has a little impact on the database because the data are fetched through the cursor. Please note that A1 is the variable of the table name; the data can be fetched through the program cursor from either callRecordA or callRecordB.
C4: Perform in-memory sorting on the data of one second to generate an ordered result set. And since the data are in the same second, the sorting can be performed only on other fields.
A7A8: Taking the table name as parameter, retrieve 2 program cursors.
A9: Perform ordered merge on two cursors, and @u is used to union. Similarly, @i and @d are used to perform intersection and difference operations.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version