How to Speed up Queries Containing a WHERE Clause That Has Multiple IN Conditions?
The IN condition in a WHERE clause is used to get records where values of a specified field are included in a given set of enumerated values. There are finding customers coming from certain cities and getting orders of certain types, to name a few.
Databases handle IN condition filtering operations by comparing each value of the target field with all members of the given set. By comparing them in their orders, there will be one to n comparisons (n is the length of the given set of values). Even when the given set of values is ordered and the binary search can be used, a lot number of comparisons is needed. The larger the amount of data and the bigger the size of the give set of values, the more the number of comparisons and the slower the execution of IN condition filtering operation.
Performance will be significantly increased if we are able to scrap the comparison actions.
First, we nail down a list of possible values the IN condition can have. Generally, the number of possible values is limited and values are usually stored in an item list. If there isn’t such an item list, we can traverse the original records to get all the possible values and save them as an item list, convert values of the field specified in the IN condition in the original records into the ordinal numbers (positions) of the records corresponding to values of the item list, and save the ordinal numbers as a new list.
To check whether each substitute meets the IN condition, we generate a set of boolean values with the same length as the item list. The ith value in the set of booleans is determined by whether the item list’s ith member is included in the set of possible IN values. It is true if the ith member is included, and it is false if it isn’t.
Then we traverse the new data and get members of the set of booleans corresponding to the IN field values (which are the IN field ordinal numbers). If the boolean value is true, the corresponding record meets the filtering condition, and if it is false the latter is ineligible.
The approach eliminates comparisons by completely converting “comparison of values of sets” into “references of ordinal numbers”, increasing performance considerably. The computation time is irrelevant to the sizes of sets of values and won’t increase as the number of enumerated IN values grows.
Yet, the optimization simply remains an idea with SQL because the language does not support getting members of a set according to their ordinal numbers (positions).
esProc SPL supports ordinal-number-based record reference. We can use it to achieve the optimization method conveniently.
1. Preprocess data to convert values of the target field into corresponding ordinal numbers.
The code is =cs.run(dim1.pos@b(f1):f1). It traverses the original records to get the ordinal number of the record whose f1 field matches the corresponding value in item list dim1 using pos function, replace f1 field with the ordinal number, and save records with replacements as a new table. dim1 is already ordered by corresponding f1 values, so we can perform searching using binary search to speed up the preprocessing process.
2. Perform IN filtering on the preprocessed data.
Suppose the set of values passed in is arg_F1, the code for generating the set of boolean values is as follows:
b1=dim1.(arg_F1.contain@b(~)), where arg_F1 is ordered and binary search is used.
Now we can perform the optimized IN condition filtering:
The code is =file("T.ctx").open().cursor(…;b1(f1) && …). It gets a corresponding member of the set of booleans according to the ordinal number in f1. The current record satisfies the filtering condition if the member is true, otherwise it is ineligible. As comparisons are eliminated, performance is notably increased.
Tests show that, under the same hardware environment, SPL is nearly one hundred times faster than Spark SQL in achieving the optimization algorithm.
IN Find detailed information in SQL Performance Enhancement: IN Operator in WHERE Clause.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version