Performance Optimization - 2.5 [Dataset in external storage] Order and supplementary file
Even if ordered storage does not reduce the amount of storage, it is of great significance for search and traversal. Later sections will gradually show how to use order to improve operation performance.
When creating a composite table, prefixing field names with # in the parameter indicates that the composite table is ordered by those fields (in the order they are listed, and they must be the leading fields). However, SPL does not check this when appending data; programmers need to ensure that the written data is indeed in order.
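As a sketch (file names and fields here are illustrative, not from the original text), creating an ordered composite table and appending pre-sorted data might look like this in SPL:

```spl
	A	B
1	=file("orders.ctx").create(#id, odate, amount)	/ordered by id; the #-fields must come first
2	=file("orders_sorted.btx").cursor@b()	/new data assumed already sorted by id
3	>A1.append(A2)	/SPL does not verify the order here
```

If the cursor in A2 were not actually sorted by id, the composite table would silently become disordered, since no check is performed on append.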
The trouble with ordered storage mainly lies in data appending, because newly added data cannot always be attached after the original data in order. For example, the current composite table is sorted by the ID field, while the ID values in newly added data usually interleave with the existing ones. To keep the whole composite table ordered by ID, we have to re-sort all the data by ID; we cannot simply append the new data to the end. Sorting big data, however, is a very time-consuming operation.
SPL provides two ways to obtain a wholly ordered new composite table.
append@m() indicates that a merge algorithm will be used to combine the data of the original composite table with the new data in the cursor and generate a new composite table. The original composite table is already ordered, so if the cursor of new data is also ordered, the low-cost ordered merge algorithm can be used. The overall complexity is equivalent to reading and writing all the data once, and it avoids the many temporary files produced by an ordinary big sort, thus achieving better performance.
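A hedged sketch of this merge-and-rewrite approach (file and field names are assumptions for illustration):

```spl
	A	B
1	=file("orders.ctx").open()	/the existing ordered composite table
2	=file("orders_new.btx").cursor@b().sortx(id)	/sort only the new batch, which is small
3	=A1.append@m(A2)	/merge two ordered sequences into a new composite table
```

Only the new batch needs a real sort; the bulk of the work is the single sequential read/write pass over the merged data, with no large temporary files.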
However, for big data, even reading and writing everything once is quite time-consuming. SPL therefore also provides a supplementary file mechanism to obtain a logically ordered new composite table.
SPL stores each ordered composite table in two files: one is the main file and the other is the supplementary file. When append@a() is used to append ordered data, the appended data is merged with the supplementary file only, and the main file remains unchanged. When reset() is called, the main file and the supplementary file are merged into a larger main file, and the supplementary file is then emptied.
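The daily routine described next might be sketched roughly as follows (the schedule test and all names are assumptions; check the actual reset()/append@a() usage in the SPL documentation):

```spl
	A	B
1	=file("orders.ctx").open()
2	=file("orders_new.btx").cursor@b().sortx(id)
3	if day(now())==1	>A1.reset()
4	else	>A1.append@a(A2)
```

On the 1st, reset() performs the one big merge of the supplementary file into the main file; on every other day only the small supplementary file is rewritten, so the daily append stays fast.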
In this scheme, reset() is executed when the date is the 1st, merging the main file and the supplementary file; on other dates, append@a() merges the new data into the supplementary file. If such an append is performed once a day, the supplementary file holds at most one month of data, and all data older than one month is stored in the main file. In other words, the main file may be very large while the supplementary file stays relatively small. The daily merge amount is therefore not large and the append completes quickly, while the more time-consuming sort of the full data happens only once a month.
When SPL accesses the composite table, it reads data from the main file and the supplementary file respectively and then merges the two streams before returning them (ensuring that the returned result set is still ordered), so that logically it still looks like a single composite table. This is slightly slower than accessing a single file, but the advantage of order is preserved.
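Reading is transparent to the caller. Opening such a composite table and fetching from a cursor might look like this (names assumed), with the main/supplementary merge happening internally:

```spl
	A	B
1	=file("orders.ctx").open()
2	=A1.cursor(id, amount).fetch()
```

The cursor returns records already merged in id order from both files, so downstream code sees one ordered table and can still exploit the order.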