Performance Optimization - 1.4 [In-memory search] Hash index

 

Performance Optimization - 1.3 [In-memory search] Position index

Hash index can be understood as an extension of sequence number positioning.

Use a function to calculate the value of to-be-searched key (TBS key) to a natural number between 1… M, which is called the hash value of the record. This function is called the hash function. After grouping the records of table sequence T according to the hash value, a two-layer sequence H with length M (i.e. group@n() operation of SPL, where there may be empty members) can be obtained. Now we can use the sequence number positioning for H, after calculating the hash function of the search value, we can locate the corresponding grouped subset, and then use the sequential or binary search in this grouped subset to further find the target value. The result of this grouping operation to be done in advance is called hash index.

A
1 =20000.sort(rand()).to(10000)
2 =A1.new(~:id, …)
3 =A2.group(id%100+1).(~.sort(id))
4 =A3(2345%100+1).select@b1(id-2345)

In A3, the remainder plus 1 is used as the hash function to divide 10000 records into 100 groups. In A4, use sequence number positioning first and then use binary search, the amount of data involved is only 1/100 of the whole data set, and the comparison times will be log2100 times less.

The remainder is also a common hash function for searched keys of integer types.

If you are lucky, the number of members of a member in H (a member of H is a set) is relatively average (close to N/M). Since the complexity of hash value calculation and sequence number positioning is O(1), the search complexity after establishing hash index will be reduced by M times, which is equivalent to search a table sequence with length of N/M. If you are unlucky, the distribution of the number of members of a member in H is very uneven, and the hash index search may not improve the performance in extreme cases, which depends on the design of the hash function and the distribution of the TBS key values in the original dataset.

However, luck is usually not too bad. In most cases, hash index will get better performance improvement. It can even have more advantages than using binary search after sorting when the original data is out of order.

The disadvantage of hash index is that it needs to occupy memory space to store the index. If you want to speed up, you need a larger index and it will occupy more space.

SPL provides a hash index for the primary key, which will automatically use the system’s built-in hash function.

A
1 =20000.sort(rand()).to(10000)
2 =A1.new(~:id, …).keys@i(id)
3 =A2.find(12345)
4 =A1.new(~:id, …).keys@i(id;100)
5 =A4.find(23456)

When setting the primary key, you can use @i to create an index at the same time. If there is an index in place, the find()function will use it automatically. When creating an index, you can also use parameters to control the length of the hash index (i.e., M value). You can try the impact of different hash index lengths on performance.


Performance Optimization - 1.5 [In-memory search] Multi-layer sequence number positioning
Performance Optimization - Preface