Performance Optimization - 3.9 [Search in external storage] Full-text searching

 

In structured data query, we often query records where a string field contains a certain substring. If the condition is like(X,“abc*”), that is, the substring is located in the front of the searched field, we can use the index of the field to locate quickly. Because records that meet the condition like(X,“abc*”) must be stored continuously in the sorting index of X, we can quickly locate the first record using the substring abc (or we can quickly judge that none of them exist), and then continue to traverse until the original condition is no longer satisfied. This mechanism is implemented in SPL, when this condition is found, it will check whether X has a sorting index and take advantage of it.

If the substring is located in the middle of the field to be searched, that is, the condition in the form of like(X,“*abc*”), we can no longer use the various indexes mentioned above, and we need to establish an index for text.

Search oriented full-text retrieval can certainly solve this problem, but its index is too large (because it also faces a larger amount of data), and it may also involve very professional natural language understanding and word segmentation technology, and is not suitable to be introduced into structured data computing tasks.

The scale of structured data is also much smaller than the web pages faced by search, and the search scope will generally be limited to a certain field, so the amount of data involved will be smaller. Moreover, the string as the field content is usually a few characters to a sentence, and it is rarely a whole article like a web page. In this case, a simpler technology can be used.

Take apart the characters of the field (string type) and get all combinations of the two (or three) characters that have appeared. For example, from the string abcd, we can get the combination of ab, bc and cd. Then, establish a sorting index for these extracted combinations. The to-be-searched key value is these character combinations, corresponding to the record of the field where the character combination has appeared.

The index built in this way will not be very large. There are about thousands of commonly used characters (Chinese characters and English characters). If only two characters are taken, the maximum number of such combinations is thousands by thousands, about one million to ten million, which is not very large. If we get three characters, it will be several billion to ten billion, and modern better servers can support it.

Let’s look at how to do the search. Take the combination of two characters as an example, and continue to use the symbols in the previous section.

It is easy to prove that S(like(X,“*abc*”))ÍS(like(X,“*ab*”))ÇS(like(X,“*bc*”)). Using this index, we can easily find S(like(X,“*ab*”)) and S(like(X,“*bc*”)). Then we can find S(like(X,“*ab*”))ÇS(like(X,“*bc*”)) by using the index merging technology in the previous section, and then further check whether the condition like(X,“*abc*”) is true in this set to find the final target values (the target values are included in this set, and they not necessarily equal). But in any case, this can make the computation much less than hard traversal, and does not need high-cost full-text retrieval technology.

This simplified full-text indexing is implemented in SPL:

A
1 =file(“data.ctx”).open()
2 =A1.index@w(IDW;X)
3 =A1.icursor(like(X,“*abc*”))

Using index@w() will establish this index. At present, the two-character scheme is used, which is also sufficient in many scenarios. When searching, you can directly use like as a condition.