SQL Performance Enhancement: Queries on Highly Concurrent Accounts
[Abstract]
This article analyzes the fundamental reasons why such SQL statements run slowly and presents a faster solution, illustrated with code samples.
Problem description
Highly concurrent account queries arise in many scenarios, for example, querying transaction flows in mobile banking, querying service details in an online business hall, or querying the charge records of a mobile game account. These scenarios share several features: the historical data covers a huge number of accounts and the total volume is large (tens or even hundreds of millions of rows), so it has to be stored externally; the data per account is small (from a few rows to a few thousand) and only needs simple queries without much computation; the large number of accounts and the high access frequency make the queries highly concurrent; and they all demand extreme performance, with responses in seconds or even faster.
Suppose the account detail table T contains the fields account number “id”, date “tdate”, amount “amt”, and so on. Then the SQL for the account query is:
select id,amt,tdate,… from T
where id='10100'
and tdate>= to_date('2021-01-10', 'yyyy-MM-dd')
and tdate<to_date('2021-01-25', 'yyyy-MM-dd')
and …
To improve the response speed, an index is usually created on the field id of table T as follows:
create index index_T_1 on T(id)
The database first finds, from the index, the positions in the original table of the records for the given account id, and then retrieves those records from the table. The first step is usually very fast, whereas the second is not. This is because the database cannot guarantee that the records of the same account id are physically stored contiguously; they are very likely scattered. The hard disk, however, has a minimum read unit that is generally far larger than the space a single record occupies, so when reading discontinuous data the database reads much more than it actually needs, which slows the query down. Although each account id has only a few to a few thousand records, the overall performance becomes poor when every query is a little slower under highly concurrent access.
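To get a rough sense of the cost, suppose a record takes about 100 bytes and the disk reads in 8KB blocks (both figures are purely illustrative). If one account's 1,000 records are scattered across 1,000 different blocks, the query reads about 8MB from disk to obtain roughly 100KB of useful data, an amplification of around 80 times; with many such queries running concurrently, the disk quickly becomes the bottleneck.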
Therefore, scenarios such as highly concurrent account queries, which pursue extreme performance, call for a faster solution.
Solution
1. Presorting and row-based storage
We need to sort the original data by account id and store it on the hard disk in row-based format. In this way, the records of the same account id are stored contiguously, and almost all the blocks read from the disk contain the desired target data, which brings a significant performance improvement.
The reason for giving up columnar storage is that each column is stored contiguously, so the field values of a given account id are scattered across different columns, which still forces the hard disk to read discontinuous data. The problem is especially severe under concurrent tasks: each additional concurrent query against columnar storage results in several more concurrent disk access requests (one per column involved), whereas with row-based storage it results in only one more.
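To put rough numbers on it: for a table with 20 columns (an illustrative figure), 10 concurrent queries against columnar storage can trigger on the order of 200 concurrent disk access requests, while the same 10 queries against row-based storage trigger only about 10.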
Moreover, when building an index on row-stored data, the position of a whole record can be represented by a single number in the index. With columnar storage, each column of a record has its own position, and if all of them were recorded, the index would be almost as large as the original table, resulting in expensive access and heavy space consumption.
2. Newly-added data
The preprocessed data is sorted by account id, while newly arrived data may contain any account id. If the new data is simply appended to the end of the old ordered data, the table is no longer ordered by account as a whole. On the other hand, fully re-sorting the existing data together with the new data is very time-consuming.
Instead, we can sort only the newly-added data and then merge it with the already ordered data using a low-cost order-based MERGE algorithm to generate a new ordered data table. The overall cost is roughly that of reading and writing the full data once, which avoids the repeated temporary writes to external storage that an ordinary full sort requires, and thus improves performance greatly. The index should also be re-created after the new ordered data table is generated.
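As a rough comparison: merging, say, 1 million newly sorted rows into a 100-million-row ordered table takes a single sequential pass over about 101 million rows, whereas fully re-sorting the combined data would need multiple passes plus temporary results written to and read back from external storage.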
Furthermore, we can additionally keep a small ordered table (hereinafter called the patch data). New data is merged into the patch data while the old ordered data stays unchanged; only after the patch data has accumulated to a suitable size over time is it merged into the old ordered data. The indexes must be updated whenever either the ordered data or the patch data changes. When searching with the index, the old ordered data and the patch data have to be read separately, which degrades performance slightly compared with handling a single ordered table, but it is still much faster than handling discontinuous (or even disordered) data.
When to merge the patch data into the old ordered data depends on how often new data arrives. If new data is added every day, the merge can be done once a month: the patch data then holds at most one month of data, and the ordered table holds all data older than one month. The patch data is thus far smaller than the old ordered data, so the daily append handles little data and is fast. Since the whole-data order-based merge happens only once a month, it is acceptable for it to take a bit longer.
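For example, at 1 million new records per day (an illustrative volume), the patch data never exceeds roughly 30 million rows while the old ordered table may hold hundreds of millions, so the daily append touches only the small patch and stays fast; only the monthly merge rewrites the full table.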
3. Index with values
Highly concurrent account queries are search-style computations, but sometimes the system also has to care about traversal performance, and for traversal columnar storage outperforms row-based storage. In that case, the account details can be stored twice, once in columnar form and once in row-based form, to serve the two kinds of computation, at the cost of considerably more disk space. If the high-concurrency queries only read a limited number of fields, we can instead combine columnar storage with an index with values to achieve high-performance traversal and searching at the same time.
An index with values stores the required field values in the index itself, in row-based form, when the index is generated. As long as the fields involved in a query are saved in the index, the search no longer needs to read the original ordered data, so the ordered data itself can be kept in columnar storage to improve traversal performance. An index with values gives better search performance, but it occupies more space than an ordinary index and is less suitable when a search involves a large number of fields.
Code sample
1. Pre-process the data: row-based storage and ordinary index
 | A | B |
---|---|---|
1 | =file("T-original.ctx").open().cursor(id,tdate,amt,…) | |
2 | =A1.sortx(id) | =file("T.ctx") |
3 | =B2.create@r(#id,tdate,amt,…).append@i(A2) | |
4 | =B2.open().index(file("T.idx");id) | |
A1: open the original composite table T-original.ctx and create a cursor based on it.
A2: sort the cursor of A1 on external storage by id.
B2: define a new composite table T.ctx.
A3: create the new composite table, where @r means row-based storage and # means the table is ordered by id, and append the ordered cursor of A2 to it.
A4: open the new composite table and create an index on id; the index file is T.idx.
2. Pre-process the data: columnar storage and index with values
 | A | B |
---|---|---|
1 | =file("T-original.ctx").open().cursor(id,tdate,amt,…) | |
2 | =A1.sortx(id) | =file("T.ctx") |
3 | =B2.create(#id,tdate,amt,…).append@i(A2) | |
4 | =B2.open().index(file("T.idx");id;tdate,amt,…) | |
A1: open the original composite table T-original.ctx and create a cursor based on it.
A2: sort the cursor of A1 on external storage by id.
B2: define a new composite table T.ctx.
A3: create the new composite table in columnar storage (without the @r option), with # meaning the table is ordered by id, and append the ordered cursor of A2 to it.
A4: open the new composite table and create the index with values on id; the fields listed after the last semicolon are the redundant field values stored in the index.
Either row-based storage + ordinary index or columnar storage + index with values can be chosen as needed, and the subsequent code does not need to change.
3. Pre-load the index
When the system starts or the index has changed, the index should be loaded into memory so that it does not have to be loaded again for every query.
 | A | B |
---|---|---|
1 | if !ifv(T) | =file("T.ctx").open().index@3(file("T.idx")) |
2 | | =env(T,B1) |
A1: check whether the global variable T already exists; if it does, the index has already been loaded.
B1: if the global variable T does not exist, open the composite table and load the level-3 index. @2 or @3 loads the level-2 or level-3 index respectively; the level-3 index performs better but requires more memory. Which level to load depends on the size of the index and the amount of available memory.
B2: assign global variable T as B1.
4. Query the account
The query code is quite simple because of the pre-loaded composite table with index:
=T.icursor(;id==10100 && tdate>=date("2021-01-10") && tdate<date("2021-01-25") && …).fetch()
icursor creates a cursor that uses the index. In practice, conditions such as the account id and the time period are passed in as parameters.
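For instance, assuming the incoming parameters are named arg_id, arg_begin and arg_end (hypothetical names used only for illustration), the same icursor call can be written with the constants replaced by those parameters:
=T.icursor(;id==arg_id && tdate>=arg_begin && tdate<arg_end && …).fetch()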
5. Append new data
Suppose the newly-added data for table T is stored in T_new.btx, with the same field names in the same order as T.ctx.
 | A | B |
---|---|---|
1 | =file("T.ctx").open() | |
2 | =file("T_new.btx").cursor@b().sortx(id) | |
3 | if(day(now())==1) | >A1.reset() |
4 | >A1.append@a(A2) | |
5 | =A1.index(file("T.idx");id;tdate,amt,…) | |
6 | =file("T.ctx").open().index@3(file("T.idx")) | =env(T,A6) |
A1: open the composite table T.ctx.
A2: define a cursor on T_new.btx and sort it by id. As the daily volume of new data is small, the sort is fast and effectively performed in memory even though the external-sort function sortx is used.
A3: check whether the current date is the 1st of the month: if not, A4 executes and append@a merges the new records with the patch data only; if so, B3 executes and the reset function merges the existing ordered data with the patch data to generate a new ordered data table. Daily new data and a monthly merge are used here as an example; the condition can be modified for other time cycles.
A5: recreate the index file.
A6, B6: the composite table and index preloaded in memory need to be reloaded manually and the global variable T re-assigned.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL