Performance Optimization - 1.5 [In-memory search] Multi-layer sequence number positioning

 

Performance Optimization - 1.4 [In-memory search] Hash index

As mentioned earlier, we cannot directly use the sequence number positioning to locate the ID number. But in some specific cases, there are flexible means.

We still use the ID number to find people as an example. If the people to be searched have some common characteristics, for example, the birth dates of children who are about to enter school will fall within two or three years, and their regions are also close. These characteristics will lead to the relatively centralized distribution of ID numbers, and not cover the range of all 18-digit natural numbers. Using this characteristic, a multi-layer sequence number positioning can be constructed to reduce the space occupied in the implementation of sequence number positioning.

A B C
1 =T.align@a(999999,int(mid(ID,9,6)))
2 =A1.run(~=if(~.len()>0,~.align@a(999999,int(left(ID,6))),null))
3 =A2.run(~.run(~=if(~.len()>0,~.align@a(999,int(mid(ID,15,3))),null)))
4 ’110108200101231234
5 =int(mid(A4,9,6)) =int(left(A4,6)) =int(mid(A4,15,3))
6 =A3(A5)(B5)(C5)

Let T be the table sequence of people to be searched, and ID is the ID number field. A1 divides the members of T into 999999 grouped subsets according to the birth year, month and day digits of their ID. We only use last two digits of the year digits, and there will be no duplication; A2 then divides each non empty grouped subset into 999999 smaller grouped subsets according to the first 6 digits of the ID, that is, the region number; A3 further divides the grouped subset of the grouped subset of A2 into 999 subsets according to the last three digits of the ID (excluding the 18th checking digit). These three lines of code look a little complex, but it is not difficult to understand. The result is a three-layer sequence.

A4 is the search value. In the same way, get the three parts of numbers, that is, A5,B5,C5, and then directly use the three-layer sequence number positioning to locate the target value. The complexity of three times sequence number positioning is still O(1), and it can run very fast.

What is the difference between this multi-layer method and the sequence number positioning mentioned earlier? We take this three-layer sequence as an example to discuss the space it occupies.

In the first layer, if the birth dates of this population are concentrated in two or three years, only about 1000 sequences among the sequences with a length of 999999 are non-empty, while for empty members, there is no next layer sequence, that is, there are only about 1000 sequences in the second layer. The length of each grouped subset in the second layer is also 999999, so the total number of members is about 1000*999999; Similarly, because the regions are close, only about 1000 grouped subsets of each grouped subset in this layer are non-empty, thus, only about 1000*1000 grouped subsets in the third layer will be non-empty. The length of each grouped subset in the third layer is 999 and the total number of members is about 1000*1000*999. The total number of members of the three-layer sequence is less than 2G, and the occupied space is a few G or 10G, which can be loaded in modern computers. As we have discussed before, current computer memory cannot bear single layer sequences.

To sum up: take the multi-layer sequence as a tree structure. If the data meets the above-mentioned characteristics of concentrated distribution, it seems that most branches of the tree will not extend to the deepest level, but break at the shallower level, so it will no longer occupy space. The space occupied by the whole tree may be small enough to be loaded into memory.

Multi-layer sequence number positioning can only be used when there are many empty values in the first few layers after layering. If there are few or no empty values, the space occupied by multi-layer sequence number structure is larger than that of single layer.

In addition to using multi-layer sequence as above, SPL also provides a data type for processing multi-layer sequence numbers, called serial byte, which can support up to 16 layers of sequence numbers. The sequence number range of each layer is 0-255, that is, each byte of two long represents one layer.

A B
1 func return k(int(left(A1,2)),int(mid(A1,3,2)),int(mid(A1,5,2)),int(mid(A1,7,4))-1900,int(mid(A1,11,2)),int(mid(A1,13,2)),int(mid(A1,15,2),int(mid(A1,17,1))))
2 =T.run(ID=func(A1,ID)).keys@is(ID)
3 ’110108200101231234
4 =A2.find(func(A1,A3))

A1 defines a function that can convert the ID number to a serial byte. You can adjust the order of layers in the serial byte according to the known data distribution characteristics (here, we get the digits according to the ID number from the left to the right directly, not necessarily the best scenario). A2 converts the ID numbers in the table to the serial byte and then creates an index using @s, that is, to construct the above-mentioned tree structure. Then you can use it to find() at high speed.

The use conditions of multi-layer sequence numbers are relatively complex. When there are many layers or the operation of generating a serial byte is complex, sometimes it is not more advantageous than ordinary hash index.


Performance Optimization - 2.1 [Dataset in external storage] Text file segmentation
Performance Optimization - Preface