Performance Optimization - 3.3 [Search in external storage] Sorting index


Sorting index is more common in external storage.

The hash index can only do equivalence search for the search value, that is, the judgment condition is equal, not interval search, that is, the judgment condition is that the to-be-searched key (TBS key) is in a specified interval. Moreover, the hash index may have two rounds of hash when it is unlucky, and its performance is not stable enough. Sorting index can solve these problems.

The principle of sorting index is to store the TBS key values in order in the index, and then use binary search to find the physical position of the target value corresponding to the searched value. This scheme can do equivalence search or interval search.

In practice, we cannot simply do binary search to the index directly. It is still similar to hash index, we need to divide the index into several segments and establish a small index for these segments. First read out the small index, find the index segment corresponding to the searched value using binary search, and then read out the index segment to find the physical position of the target value. The small index can also reside in memory to reduce the amount of reading.

Similarly, when the amount of data is huge, the index will be divided into levels. However, the sorting index can determine the number of records stored in each index segment in advance, and the second round of hash caused by bad luck with hash index will not happen here. If 1K values are stored in each level, four levels can correspond to the huge scale of 1K*1K*1K*1K=1T records, which is more than enough. Generally, three levels are enough.

Sorting index is usually a little slower than hash index for equivalence search, but the difference is not great. Because it has a wider adaptability, most external storage indexes are sorting indexes. Hash index will be used only in individual scenarios that have extreme performance requirements and only need to realize equivalence search. The external storage index mentioned later in this book refers to the sorting index if not stated otherwise.

The SPL default index is the sorting index.

1 =file(“data.ctx”).open()
2 =A1.index(IDS;ID)
3 =A1.icursor(;ID==123456).fetch()
4 =A1.icursor(;ID>=123456 && ID<=654321)

If the hash range is not specified in the index() function, it is considered to establish the sorting index. Interval search can also be performed based on the sorting index, and it is not necessary to specify which index to use.

Why do we make a sorting index instead of sorting the original data table directly by the TBS key?

This is mainly a size problem. There are only the TBS key and physical position in the index, which is equivalent to a data table with two fields. The original data table often has dozens or even hundreds of fields, which is much larger than the index. If the whole table is sorted, it will occupy at least twice as much storage space.

Moreover, even if the data is ordered, the performance of performing binary search to the original data table is still not as good as that of using index. As mentioned earlier, direct binary search has no multi-level index, cannot locate precisely, and has many futile reading actions. Simple binary search is usually used in scenarios where performance requirements are not very high and you do not want to maintain indexes. Although the index is smaller than the original data, it still occupies considerable storage space, and it needs to be updated at the same time when the data is appended.

The essence of index is sorting, which is also the principle in database. Therefore, it is very slow to build an index for a big data table in the database, and the addition, deletion and modification of a table with index will be much slower, because an orderly index should be maintained at the same time. After understanding this principle, indexes should be consciously established and avoided in database project.

Sometimes we need to use two fields as the TBS key. Theoretically, this is no different from a single field, only the comparison action is a little complicated. However, after understanding the essence of index is sorting, it can also guide the establishment of this index.

For example, there are two fields A and B as the TBS key, so how do we build the index? Do we need to build an index for A and B respectively, or do we build a joint index of (A,B)? The cost of building an index is not low, so we should build as few indexes as possible.

If it is ordered for A, it is generally not ordered for B. While if it is ordered for (A,B), it must be ordered for A, not necessarily for B. We can draw the following conclusions:

1)Building separate indexes for A and B will be helpful for (A,B) joint search, but only one index can be used. For example, using the index of A will traverse the conditions of B in the records that meet the relevant conditions of A, but cannot use the index of B after meeting the relevant conditions of A.

2)The joint index of (A,B) can also be used for the search of A, but it cannot be used for the search of B, let alone for the search of (B,A). Indexing is built in a clear order.

If we are allowed to build two indexes, we should build indexes for (A,B) and (B,A). In this way, the search conditions for A, B, (A,B) and (B,A) can all use the indexes. The effect of building indexes only for A and B is much worse.

SPL implements these strategies to automatically find the best index. These principles are also valid for databases. Good commercial databases can all intelligently find the most appropriate index.