Performance Optimization - 1.1 [In-memory search] Binary search
Performance Optimization - Preface
The table sequence T in memory has a field K, given the search value k, find the record with the value of field K in T as k. Field K is called the to-be-searched (‘TBS’ for short) key, and the found record is called the target value or target record.
Conventional sequential search will traverse each member of T in turn for comparison. If there are N records in T, the average comparison number is N/2 and the complexity is O(N). If T is just ordered for K (sequence T.(K) or T.(-K) is an increasing sequence), binary search can be used.
Before introducing the algorithm further, let’s explain: most of the algorithms discussed in this book are general for table sequence and record sequence, and even valid for sequence. But sometimes it is relatively convenient to use a certain term to describe it. If you understand the principle, you can extend it to other similar scenarios. When applying these algorithms, it is not necessary to strictly abide by the premises and scenarios described in the book, instead, you need to understand the principle and then confirm whether the facing scenario is suitable for the algorithm, and sometimes it needs to make a few modifications to the algorithm.
Binary search is a common algorithm. Taking T.(K) in ascending order as an example, the basic logic is as follows:
A | B | C | |
---|---|---|---|
1 | =1 | =T.len() | |
2 | for | =(A1+B1)\2 | =T(B2) |
3 | if k==C2.K | return C2 | |
4 | else if k>C2.K | >A1=B2+1 | |
5 | else | >B1=B2-1 | |
6 | if A1>B1 | return null |
Compare k with the middle member of T, if they are not equal, exclude half of the members according to the comparison result, and continue to perform this action in the subset composed of the remaining half of the members until the target value is found or the remaining subset is empty. Because half of the members can be excluded each time, you can find the target value (or confirm that it cannot be found) by comparing log2N times at most. The complexity is O(logN). When N is large, it will have a very obvious advantage over sequential search. When N is small, because the process of binary search is relatively complex, it may be slower than sequential search. Usually, when N>=8, binary search will have an advantage.
If T.(K) has duplicate values, there may be multiple target values. After finding one target value using binary search, you need to compare more records continuously forward and backward. When the average number of repetitions is M, each search will make an average of M more comparisons, and the total complexity is O(logN+M). In most cases, there are only occasional duplicate values, that is, M≪log2N is satisfied, the complexity can still be considered as O(logN). If there are many duplicate values, then M cannot be ignored when analyzing complexity. The complexity of sequential search is always O(N) whether there are duplicate values or not. When there are duplicate values, the complexity of binary search is generally still lower, but in practice, a larger N is needed to show the advantage over sequential search.
SPL provides the binary search option, which can be used directly:
A | B | |
---|---|---|
1 | =10000.new(string(~,"00000"):id, ~\3:K).keys(id) | |
2 | =A1.find@b("01234") | |
3 | =A1.select@b(K:1234) | =A1.select@b(K-1234) |
4 | =A1.pselect@b(K:3210) | =A1.pselect@b(K-3210) |
find@b()can only search according to primary key, so it needs to establish a primary key (keys(id)), Since the primary key is unique, find@b will return once it finds one record and does not consider duplicate values. select@b()doesn’t require the establishment of a primary key, and it will handle as if there are duplicate values; pselect@b() will also consider duplicate values, but it only returns one by default, and will find the first one.
It should be noted that logical expressions cannot be used in the parameters of select@b()and pselect@b(), but a colon separator or expressions that return a numeric value should be used, because logical expressions can only calculate out true and false values, and cannot determine the search direction of the next step in case of inequality; A numerical expression of 0 indicates that it is found, and <0 and >0 are used to indicate the search direction of the next step.
You can use SPL to compare the performance difference between binary search and sequential search.
It should also be reminded that the use of binary search requires order in advance. Sorting is a very slow action, and you can’t sort first just to temporarily use a binary search, it will only be even slower. Usually, we need to search in a table sequence many times. In this case, we can sort the table sequence first, and then the following multiple searches can get better performance.
Binary search can also support interval search, that is, find the records where the TBS key value falls within a continuous interval:
A | B | |
---|---|---|
1 | >=10000.new(string(~,"00000"):id, ~\3:K) | |
2 | =A1.select@b(between(K,123:1234)) | |
3 | =A1.pselect@b(between(K,321:3210)) |
When the TBS key is smaller than the left limit of the interval, the between function will return -1, when it is larger than the right limit of the interval, it will return 1, and when it is within the interval, it will return 0, which just meets the requirements of select@b and pselect@b.
Performance Optimization - 1.2 [In-memory search] Sequence number positioning
Performance Optimization - Preface
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL