Multi-purpose traversal
Reducing external storage (hard disk) access has always been an eternal topic for improving big data computing performance. We have also discussed methods such as columnar storage and compression that directly reduce access and even storage. In addition to these storage level methods, methods can also be found to reduce external storage access in the algorithm and computational implementation stages.
Traversing is an essential part of big data computing. Sometimes, we may find that in a computing task, there are two (or more) traversal actions involving the same batch of data. If we can find a way to merge two traversals into one, then the total computational load (CPU action) will not differ, but the access to the hard disk will be reduced by half. This can still improve computational performance, and the improvement effect on data intensive computing is quite significant.
In the data structure of simplified account table T, the following fields are included: account A, date D, place of occurrence P, and amount M
Now we want to calculate the balance of accounts a1 and a2, and write it in SQL as follows:
SELECT SUM(M) FROM T WHERE A=a1
SELECT SUM(M) FROM T WHERE A=a2
The calculation of these two statements will result in traversing table T twice. If table T is very large, the calculation efficiency will be very low.
If we write this SQL as follows:
SELECT SUM(CASE WHEN A=a1 THEN M ELSE 0 END),
SUM(CASE WHEN A=a2 THEN M ELSE 0 END)
FROM T
One statement that calculates both of these statistical values makes the statement more complex, and the total computational load of the database slightly increases (with the same number of judgments, more cumulative times, and many more zeros added). However, table T only needs to be traversed once, resulting in much higher computational efficiency.
As a database programmer, you need to learn this skill.
However, not all operations can be handled with CASE WHEN.
We want to separately calculate the total amount for each day and the total amount for each location. The SQL statement is:
SELECT D,SUM(M) FROM T GROUP BY D
SELECT P,SUM(M) FROM T GROUP BY P
Different WHERE can be wrapped using CASE WHEN, but different GROUP BYs cannot be merged anymore, and table T can only be traversed twice.
In theory, SQL can also traverse once and be written as
SELECT D,P,SUM(M) FROM T GROUP BY D,P
Then program and calculate the operations for GROUP BY D and GROUP BY P based on this intermediate result. But it’s really too complicated to implement, and these are just two field grouping, if there are more fields, the intermediate result set may be too large to fit in memory, and it will actually be much slower.
The SQL system cannot solve this problem anymore. We need to design new concepts and syntax to implement multi-purpose traversal operations.
The concept of channel is introduced in the cursor of SPL. When a cursor traverses data and performs a certain operation, it pushes the data into a channel, and another operation can be defined on the channel. This way, One data traversal can obtain two calculation results of the cursor itself and the attached channel. The above operation can be written as follows:
cs = T.cursor(...)
ch = channel(cs).groups( P; sum(M) )
dg = cs.groups( D; sum(M) )
pg = ch.result()
channel(cs) binds a channel ch to the cursor cs and defines a grouping operation by P. Then, the cursor cs continues to traverse and implement the grouping operation by D. After the traversal is completed, the relevant results are taken from the channel ch.
The task of aggregation on different conditions mentioned earlier can also be written using cursor and channel mechanisms:
cs = T.cursor()
ch = channel(cs).select( A==a2 ).sum(M))
m1 = cs.select( A==a1 ).sum(M)
m2 = ch.result()
The code structure is the same.
Of course, multiple channels can also be attached to one cursor. For example, the two tasks mentioned earlier (conditional aggregation and different grouping) can also be done in one traversal:
cs = T.cursor()
ch1 = channel(cs).select( A==a2 ).sum(M))
ch2 = channel(cs).groups( P; sum(M) )
ch3 = channel(cs).groups( D; sum(M) )
m1 = cs.select( A==a1 ).sum(M)
m2 = ch1.result()
dg = ch2.result
pg = ch3.result()
Let’s take another example of calculating the median.
When calculating the median, sorting is required, but in general, sorting operations only focus on the sorting itself and do not care about counting. After sorting is completed, it is not even known how much data there is in total. In this case, to find the median, it is necessary to do another COUNT to traverse the data, which wastes time. If there is a channel mechanism, we can implement the counting while sorting.
cs = T.cursor()
ch = channel(cs).count()
s = cs.sortx(M) // Implement the counting on the channel during the traversal sorting process
k = ch.result()
m = s.skip( (k-1)\2 ).fetch@x(2-k%2).avg(M) //Find the one or two number in the middle
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version