Performance Optimization - 3.1 [Search in external storage] Binary search

 

Performance Optimization - 2.6 [Dataset in external storage] Data update and multi-zone composite table

If the records of the data table stored in the file of external storage are in order to the to-be-searched key (TBS key), as long as the file format can support relatively arbitrary segmentation, we can implement external storage binary search to avoid traversing the whole file sequentially.

First divide the file into two segments, get the first record of the second segment, compare it with the search key, and then decide whether to continue searching in the first half or the second half according to the comparison result. Then divide the file into four segments, get the first record of the 2nd or 4th segment, continue the comparison, and then divide the file into eight segments, … , finally find the target value.

If there are duplicate values, it will be more complex. To search forward, we can directly continue to traverse; To search backward, we need to continue to do binary search in the previous part until we find the first searched value, which is much more troublesome than in-memory search.

Binary search is also applicable to interval search, that is, to find the records of the TBS key in a certain range. Just find the first starting searched value, and then traverse forward until all the ending searched values are found. There is no need to search on either side.

This process is not difficult to understand, but it is cumbersome to implement. SPL has done the encapsulation and it can be used directly:

A B
1 =file("data.txt")
2 for 1000 =10000.sort(rand()).to(9000).sort()
3 >A1.export@at(B2.new(A2*10000+~-1:ID,…))
4 =A1.iselect@t(123456,ID)
5 =A1.iselect@t(123456:234567,ID)

A1-B3 generates 9 million lines of text file with orderly ID. Using iselect() function, we can find the target value in an ordered file using binary search, and it also supports the searched value in a certain interval. You can compare the speed difference with traversing using a cursor.

Bin files can also use this function.

A
1 =file(“data.btx”))
2 =A1.export@b(file(“data.txt”).cursort@t())
3 =A1.iselect@b(123456,ID)
4 =A1.iselect@b(123456:234567,ID)

Composite tables are OK, of course, but will use another function.

A B
1 =file("data.ctx") =file("data.btx").cursor@b()
2 =A1.create(#ID,…) >A2.append(B1)
3 =A1.open()
4 =A3.find(123456)
5 =A3.cursor(;ID>=123456 && ID<=234567)

When using find(), the TBS key is required to be the primary key of the composite table, that is, it is unique in the composite table. If you perform an interval search (multiple target values will be returned), you can directly generate a cursor and add conditions. SPL will automatically jump to the appropriate data block to read the data according to the order of the key, instead of actually traversing.


Performance Optimization - 3.2 [Search in external storage] Hash index
Performance Optimization - Preface