Performance Optimization - 3.2 [Search in external storage] Hash index


When using binary search, we still need to read the original file many times to locate the target value, and many of these reads are redundant. If we could obtain the physical position of the target value efficiently, we could read the target value directly.

We can establish an association table between the to-be-searched key (TBS key) values and the physical positions of the target values. With it, the physical position can be obtained quickly from the search value, and then the target value can be read directly. This association table is the external storage index.
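The idea can be sketched in Python (an illustrative toy format, not SPL's actual storage): while writing fixed-length records, build an association table from each key to the byte offset of its record, then seek straight to the target.

```python
import os
import struct
import tempfile

# Toy record layout (an assumption for illustration): 4-byte int key + 8-byte payload.
RECORD = struct.Struct("<i8s")

def write_table(path, rows):
    index = {}                      # key -> physical position (byte offset)
    with open(path, "wb") as f:
        for key, payload in rows:
            index[key] = f.tell()   # remember where this record starts
            f.write(RECORD.pack(key, payload))
    return index

def read_by_key(path, index, key):
    with open(path, "rb") as f:
        f.seek(index[key])          # jump straight to the record: one read
        _, payload = RECORD.unpack(f.read(RECORD.size))
        return payload

path = os.path.join(tempfile.mkdtemp(), "data.bin")
rows = [(k, str(k).zfill(8).encode()) for k in (42, 7, 999, 123)]
idx = write_table(path, rows)
print(read_by_key(path, idx, 999))  # direct positioning, no scanning
```

Of course, this keeps the whole association table in memory; the rest of this section deals with what happens when the index itself is too large for that.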

There are many TBS key values in the index, so searching the index requires the same kind of search action as searching the data table. The difference is that the index structure can be specially designed to facilitate fast search.

The hash index calculates the hash value of each TBS key value in the original data table, and stores the TBS key value together with the physical position of its corresponding record, organized by hash value, so that entries with the same hash value are stored together.

Usually, the original data table is very large, and the index is also too large to fit in memory. In that case, a small index should be built for the index itself. The range of hash values is known in advance and can be divided into several segments for storage. For example, hash values 1-100 are stored as one segment and 101-200 as another, so that each segment is small enough to be loaded into memory. A small, fixed-length index saved in the index header records the hash value range stored in each segment, together with that segment's physical position and length.

Given a search value, calculate its hash value, read the small index first, and find out which segment to search next. From here on, in-memory search technology can be used: the hash values in each segment are generally stored in order, so binary search applies; if each segment stores a fixed number of hash values, sequence number positioning also works. Then read out that segment, find the physical position corresponding to the target value (also an in-memory search), and finally read the target value from the original data table.
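The segmented layout and the lookup procedure can be sketched as a toy Python model (the names `NRANGE`, `SEG`, `hash_fn` and the entry format are assumptions for illustration; a real index format differs):

```python
import bisect
import os
import struct
import tempfile

NRANGE, SEG = 400, 100                 # hash range 0..399, split into ranges of 100
ENTRY = struct.Struct("<iiq")          # hash value, key, data offset

def hash_fn(key):
    return key % NRANGE                # toy hash function

def build_index(path, entries):
    """entries: list of (key, data_offset); writes segments sorted by hash."""
    items = sorted((hash_fn(k), k, off) for k, off in entries)
    header = {}                        # small header index: segment id -> [pos, count]
    with open(path, "wb") as f:
        for h, k, off in items:
            seg = h // SEG
            if seg not in header:
                header[seg] = [f.tell(), 0]   # segment's physical position
            header[seg][1] += 1               # and its length (entry count)
            f.write(ENTRY.pack(h, k, off))
    return header

def lookup(path, header, key):
    h = hash_fn(key)
    seg = header.get(h // SEG)         # step 1: consult the small header index
    if seg is None:
        return None
    pos, count = seg
    with open(path, "rb") as f:        # step 2: read the whole index segment
        f.seek(pos)
        raw = f.read(count * ENTRY.size)
    seg_entries = [ENTRY.unpack_from(raw, i * ENTRY.size) for i in range(count)]
    i = bisect.bisect_left(seg_entries, (h,))   # in-memory binary search by hash
    while i < len(seg_entries) and seg_entries[i][0] == h:
        if seg_entries[i][1] == key:
            return seg_entries[i][2]   # step 3 would read the target at this offset
        i += 1
    return None

path = os.path.join(tempfile.mkdtemp(), "index.bin")
hdr = build_index(path, [(k, k * 16) for k in range(1000)])
print(lookup(path, hdr, 123))          # physical offset of key 123's record
```

Here the header index stays in memory, so each search costs one segment read plus one target read, matching the three-step count described next.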

To sum up, there are three steps: first, read the small index in the header; second, read the index segment where the search value is located; third, read the target value at the physical position found. The target value can thus be found after a total of three reads, which gives the index an obvious advantage over binary search. Moreover, the small index in the header can stay resident in memory after it is read once; it does not need to be read again for subsequent searches, further reducing the amount of reading.

When the amount of data is very huge, the small index in the header may itself be too large to be loaded into memory. In that case, an even smaller index may need to be created for the small index; that is, the index may be divided into multiple levels. We use the loading order to denote the index level: the minimum index loaded first is called the first-level index; it locates the appropriate second-level index, which in turn locates the third and fourth levels (four levels are large enough in practice). The last level gives the physical position of the target value. The maximum number of reads per search is the total number of index levels, plus one read for the target value itself. The first-level index can also reside in memory to avoid repeated reading.
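The read cost is easy to estimate with a back-of-the-envelope sketch (`fanout`, standing for how many entries one memory-sized index block holds, is an assumed parameter):

```python
def levels_needed(n_entries, fanout):
    """Smallest number of index levels L with fanout**L >= n_entries."""
    levels = 1
    while fanout ** levels < n_entries:
        levels += 1
    return levels

def max_reads(levels, first_level_cached=False):
    """One read per index level, plus one read for the target value itself;
    caching the first-level index in memory saves one read."""
    return levels + 1 - (1 if first_level_cached else 0)

# If each index block holds 10,000 entries, three levels already cover
# 10**12 entries, so a search costs at most 4 reads (3 with the first
# level resident in memory).
print(levels_needed(10**12, 10**4))
print(max_reads(3), max_reads(3, first_level_cached=True))
```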

Another problem with the hash index is that the hash function may have bad luck: too many records may correspond to one slice of hash values to fit in a single index segment. In that case a second round of hashing is needed, using a different hash function to redistribute these records. With bad luck, there may be a third round, a fourth round, and so on. As a result, the number of levels in a hash index is not fixed, and some search values may have to traverse more levels than others.
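The rehashing idea can be modeled with a toy in-memory sketch (the hash functions `h1`/`h2` and the segment capacity are made up for illustration; `h1` is deliberately poor so that one slice overflows):

```python
CAPACITY = 4                           # assumed: max entries per index segment

def h1(key):
    return key % 8                     # first-round hash (deliberately poor)

def h2(key):
    return (key // 8) % 8              # second-round hash, a different function

def build(keys):
    table = {}
    for k in keys:
        table.setdefault(h1(k), []).append(k)
    for b, ks in table.items():
        if len(ks) > CAPACITY:         # overflow: redistribute with h2
            sub = {}
            for k in ks:
                sub.setdefault(h2(k), []).append(k)
            table[b] = sub             # this slice now carries an extra level
    return table

def levels_for(table, key):
    """How many rounds of hashing a search for this key goes through."""
    bucket = table[h1(key)]
    return 2 if isinstance(bucket, dict) else 1

keys = list(range(0, 80, 8)) + [1, 2, 3]   # 0, 8, 16, ... all collide under h1
t = build(keys)
print(levels_for(t, 16), levels_for(t, 1))
```

In the sketch a sub-bucket could itself overflow, which is exactly why the level count of a hash index is not fixed in advance.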

The whole process is not simple; an external storage search is indeed much more troublesome than an in-memory one.

SPL provides indexing function for composite tables:

	A	B
1	=file("data.ctx").create(ID,…)	
2	=1000.sort(rand())	
3	for A2	=10000.sort(rand())
4		>A1.append(B3.new(A3*10000+~-1:ID,…).cursor())
5	=file("data.ctx").open()	
6	=A5.index(IDH:1000000;ID)	
7	=A5.icursor(;ID==123456).fetch()	

The first four rows generate a disordered composite table. Note that # must not be added before the ID field in A1, since that would declare ID as an ordered dimension.

A6 creates an index on composite table A5; the TBS key is ID, and IDH is the name given to this index. When creating a hash index, we also need to give the range of hash values, here 1000000. A7 uses this index to search. There is no need to specify the index name; SPL automatically finds an appropriate index to use. The index name is mainly used for maintenance and deletion.

Unlike the previous binary search, index search on a composite table always assumes that the amount of returned data may be large, so it returns a cursor.

Theoretically, we could also build an index from record sequence numbers to physical positions, logically achieving the effect of sequence number positioning. However, the above analysis shows that the time cost of external storage search lies mainly in handling index segments and multi-level indexes. Even with sequence number positioning we would go through the same process, and the time it saves in the in-memory part can no longer be felt, so sequence number positioning is not specifically discussed for external storage search.