Performance Optimization - 3.7 [Search in external storage] Search that returns a set
Sometimes it is necessary to find multiple target values for one search value, such as the user’s transaction records through the user ID.
Establishing index for user ID can avoid full traversal and improve search performance, but it is still not enough. If the storage positions of multiple target values in the external storage data table are discontinuous, even if the positions of the target values can be quickly found by using the index, the actual reading will still jump through different positions in the hard disk. Moreover, because there is a minimum unit for hard disk reading, usually this unit is much larger than the space occupied by one record, resulting in most of the read contents being wasteful (even more serious in case of columnar storage). Moreover, too much jumping on the mechanical hard disk will also consume a lot of time. In particular, this search is often accompanied by concurrency, which will further aggravate the jumping of the hard disk.
The solution is to sort and store the data according to the to-be-searched key (TBS key) in advance to ensure that the target values corresponding to the same search value are physically continuously stored. In this way, almost all contents of the data block read out are the required target values, and it will not cause the hard disk to jump, and the performance improvement is quite obvious.
Regenerate a composite table with ordered IDs. There may be several records under each ID. After creating an index, the search method is no different from that before, but it will obtain much better performance, especially in concurrency. Using row-based storage or valued index will also have better performance than using columnar storage directly. You can try to generate a disordered composite table to compare the performance difference.
Review again: even for a data table that is ordered by the TBS key, it still makes sense to establish an index. The index can directly locate the target position, unlike binary search of external storage, which will read out adjacent data to try.
The generation order of such transaction data is usually time, while the TBS key is usually not time. This requires adding new data generated continuously (in time order) to the existing data table, and maintaining the order by the TBS key, so as to obtain high-performance search results. This will use the method discussed in the previous chapter of using supplementary file to implement appending data to form an ordered composite table.