Performance Optimization - 4.1 [Traversal technology] Cursor filtering

 

Records can be searched at a high speed by using index or order, however, the cost of establishing and maintaining indexes, as well as keeping data in order is not low. Since it is impossible for us to pre-build indexes for all query conditions, we will use, if necessary, the sequential search, i.e., the traversal.

While searching the external storage data table in traversal manner, the easy way to think of is to read out the data and generate records one by one, and then calculate whether the search conditions are satisfied according to the records, so as to decide to keep or discard the records. Regrettably this simple method will produce some unnecessary actions. For example, when we want to find the information of a person over the age of 18 including his/her name, gender, age and address, we will first read out these fields and generate a record, then judge whether the person is over the age of 18, as a result, the actions of reading fields and generating records for those people under the age of 18 are not essential because these records will be discarded. In contrast, we can omit a lot of actions (especially in cases of many records that do not meet the conditions) if we are able to immediately judge whether the condition is true following the read of the age field. The reason is, we will read other fields and generate records only when the condition is true, and skip directly and no longer read other fields and generate record in case of false condition.

SPL implements this mechanism for the composite table. While creating a cursor on the composite table, a filter condition can be attached, in this way, SPL will read out the field values used to calculate the condition only. If the condition is not satisfied, the next step will be canceled. Only when the condition is met will other required fields be continuously read out, and the record be created.

Using select() function on a created cursor cannot achieve this effect. Since SPL does not know the data source of this cursor, and cannot apply the filter condition before the cursor creates records, the only way is to judge after the records are fetched out.

A
1 =file(“persons.ctx”).open()
2 =A1.cursor(sex,age;age>=18).groups(sex;avg(age))
3 =A1.cursor(sex,age).select(age>=18).groups(sex;avg(age))

The calculation results of A2 and A3 are same, while A2 uses the above-mentioned pre-cursor filtering technology, its performance is much better.

SPL has not yet implemented this optimization to text files and bin files.

Sometimes there may be multiple filter conditions. For example, the filter condition for searching female persons over 18 years old is:

age>=18 && sex==“Female”

As the commutative law is a law of && operation, the writing order of these two conditions does not affect the operation result but the operation performance. We should consider which condition is more likely to be false and put it in the first place. The reason for doing so is that, if the former condition is calculated as false, the subsequent condition does not need to be calculated, and the entire logical expression is surely determined as false, therefore, this record can be skipped directly, and the field used to calculate the subsequent condition is not necessarily read out any more.

For example, if we know roughly that less than half of people in this data is female, and most of people are over 18 years old, then the condition of women is easier to be calculated as false, in this case, we should write it as:

sex==“Female” && age>=18

This writing method will have better calculation performance than the previous one.

The writing method is exactly opposite for ||-related conditions. In this situation, we should put the condition that is easily calculated as true in the first place. If true is determined after calculation, the subsequent condition needs not to be calculated.

Normally the amount of data during traversal is very large, and the filter conditions may be calculated many times, in this case, the calculation speed is quite important, thus we should try to choose a syntax that makes it calculation faster while writing the code.

For the example given above, and according to the knowledge we discussed earlier, we can get to know that sex==“female” is a string operation, and its operation speed is not as fast as that of age>=18. In case that the possibility of false result calculated by these two conditions is almost the same or it cannot be judged, then we should put the calculation speed section in the first place enabling the conditions to calculate a definite result as soon as possible.

Of course, a better way is to use the data type conversion method mentioned earlier to convert the sex field into a small integer, and judge it with an integer operation like sex==1.

In addition, try not to put some unnecessary calculation expressions in the filter conditions, but try to calculate them in advance.

A
1 =file(“persons.ctx”).open()
2 =A1.cursor(;year(now())-year(birthday)>=18)

The filter condition in A2 is not satisfactory. Since now()returns an instant information, and a prior calculation can not be preformed, year(now()) will be calculated provisionally during every operation, resulting in a relatively poor performance. If written as:

A
1 =file(“persons.ctx”).open()
2 =year(now())
3 =A1.cursor(;A2-year(birthday)>=18)

It will be much better. In this case, year(now()) only needs to be calculated only once. A better writing method is:

A
1 =file(“persons.ctx”).open()
2 =year(now())-18
3 =A1.cursor(;year(birthday)<=A2)

In this way, subtraction can also be omitted. Furthermore, using the conversion data type discussed in the previous chapter to convert birthday into a small integer, the performance will be further improved.

Sometimes it is necessary to determine whether a field value is in a certain set, for example, specify a name list and find out the people with these names.

A
1 =file(“persons.ctx”).open()
2 [John,Alice,Mary,Steven,…]
3 =A1.cursor(;A2.contain(name))

Usually the sequential search method is used by contain(), but the performance is not very well. If the list is a little long (10 or more), the binary search can be used to improve the speed.

A
1 =file(“persons.ctx”).open()
2 [John,Alice,Mary,Steven,…]
3 =A2.sort()
4 =A1.cursor(;A3.contain@b(name))

In this way, the traversal speed will be much faster.

In Chapter 8, we will introduce one more efficient method used for performing this set membership judgment in traversal.