Performance Optimization - 2.6 [Dataset in external storage] Data update and multi-zone composite table
So far we have only discussed how to append data to an external storage data table, not how to modify it.
We have been trying to store data as compactly and continuously as possible in order to reduce both hard disk usage and reading time. Once data is stored compactly, it cannot be updated in place: the modified data may not fit into the space occupied by the original data (because of the various compression encodings, values of the same type do not necessarily occupy the same number of bytes). If the new data is larger, it cannot be stored in the original location; if it is smaller, it leaves a gap. Either case makes the data no longer continuous and compact, and insertion and deletion make it worse. Columnar storage exacerbates the problem, because the data blocks of every field face it.
To accommodate frequent modification, OLTP databases usually adopt uncompressed row storage, which is unsuitable for OLAP services dominated by reading and computation. Even so, many discontinuous data blocks can still accumulate after repeated writes, causing a serious decline in access performance.
Data modifiability and high-performance computing are fundamentally at odds. It is impossible to achieve high-performance operations on a huge amount of data while also allowing frequent, large-scale modification.
This book focuses on high-performance computing, so we sacrifice support for frequent and large-scale modification.
Text files and bin files can both be regarded as not supporting modification (modifying them means rewriting the whole file), while composite tables support a small amount of modification.
When the amount of modification is small, SPL writes the modified data (including insertions and deletions) to a separate data area called the supplementary area. When reading, the supplementary area is merged with the normal data, so the existence of the supplementary area cannot be felt during access. To keep this merge fast, the amount of data in the supplementary area must not be too large. All of this resembles the supplementary file described in the previous section: it cannot grow too large, it is transparent to users, and it has some impact on performance.
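The merge-on-read idea can be sketched in plain Python (this is an illustrative model, not SPL's actual implementation): a key-ordered base area is combined with a small supplementary area holding pending upserts and deletions, so the caller sees one logical table.

```python
# Hypothetical sketch of merging a compact, key-ordered base area with a
# supplementary area at read time. The supplementary area maps a primary
# key to ("put", value) for a modification/insertion or ("del", None)
# for a deletion. Names and structures are illustrative, not SPL's API.

def read_merged(base, supp):
    """base: list of (key, value) pairs sorted by key (compact storage).
    supp: dict key -> ("put", value) or ("del", None).
    Returns the logical table, sorted by key, as if no supplement existed."""
    base_keys = {k for k, _ in base}
    out = []
    for key, value in base:
        op = supp.get(key)
        if op is None:
            out.append((key, value))      # untouched record
        elif op[0] == "put":
            out.append((key, op[1]))      # modified record
        # op[0] == "del": skip the deleted record entirely
    # records whose keys are absent from the base area are insertions
    out += [(k, v) for k, (o, v) in supp.items()
            if o == "put" and k not in base_keys]
    return sorted(out)
```

Because the base area stays compact and only the small supplementary area is consulted per record, the read path stays close to the unmodified case as long as the supplement remains small.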
To modify data, there must be a primary key. SPL requires that the primary key of a composite table be ordered (which is why there is a distinct insert action; relational databases do not distinguish between insert and append).
Because large-scale modification is not allowed, SPL only accepts an in-memory record sequence as the parameter for modification and deletion, rather than a cursor as in appending. During modification, if a primary key value does not exist in the composite table, the record is inserted.
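This upsert behavior can be modeled in a few lines of Python (an illustrative stand-in, not SPL syntax): modifications arrive as an in-memory record sequence, and a record whose primary key is absent is treated as an insertion.

```python
# Hypothetical model of SPL's modification semantics: the table is keyed
# by primary key; a supplied record either overwrites an existing row or,
# if its key is new, is inserted. Names here are illustrative only.

def apply_updates(table, records):
    """table: dict primary key -> row. records: iterable of (key, row).
    Existing keys are modified; new keys are inserted (upsert).
    Returns (inserted_count, modified_count)."""
    inserted, modified = 0, 0
    for key, row in records:
        if key in table:
            modified += 1
        else:
            inserted += 1
        table[key] = row
    return inserted, modified
```

In the real composite table these writes land in the supplementary area rather than the base data, but the visible semantics are the same.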
After several rounds of modification, the supplementary area grows larger and its impact on performance increases. Like the supplementary file, the supplementary area needs regular cleaning: its data is merged into the original data area to restore compact, continuous storage, and the supplementary area is then cleared.
reset@q() merges the supplementary area into the normal storage area and rewrites only part of the data (starting from the smallest primary key value found in the supplementary area; data before that point is left unchanged), keeping the whole composite table compactly and continuously stored.
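The key saving is that everything before the smallest affected primary key is already compact and need not be touched. A plain-Python sketch of that idea (not SPL's actual implementation) might look like this:

```python
# Illustrative sketch of the idea behind reset@q(): only the base-area
# suffix from the smallest primary key in the supplementary area onward
# is rewritten; the prefix stays on disk untouched. Structures are
# hypothetical stand-ins for the on-disk composite table.
import bisect

def reset(base, supp):
    """base: list of (key, value) sorted by key.
    supp: dict key -> ("put", value) or ("del", None), pending changes.
    Returns (kept_prefix, rewritten_tail) and clears the supplement."""
    if not supp:
        return base, []
    start_key = min(supp)                       # smallest modified key
    cut = bisect.bisect_left(base, (start_key,))
    prefix = base[:cut]                         # untouched, never rewritten
    tail = {k: v for k, v in base[cut:]}
    for k, (op, v) in supp.items():
        if op == "put":
            tail[k] = v
        else:
            tail.pop(k, None)
    supp.clear()                                # supplementary area emptied
    return prefix, sorted(tail.items())
```

If modifications mostly touch recent (large) primary keys, the rewritten tail is short and the reset is cheap; modifying the smallest key forces a near-full rewrite, which is why frequent scattered updates are discouraged.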
The historical data used by OLAP services generally does not undergo frequent, large-scale updates, but batch deletion of historical data is sometimes necessary. Data from many years ago has lost its value for query analysis; keeping it in the data table occupies a lot of space and hurts query performance. However, we require data to be stored continuously, so even if the data itself is in chronological order, deleting the earliest period would force all subsequent data to be rewritten, which is very time-consuming.
To address this, SPL provides the multi-zone composite table, which allows multiple files to form one composite table. These files are called the zones of the multi-zone composite table.
When creating a composite table, we can specify that it is composed of multiple physical files. Each file is assigned a number called the zone number, that is, the to(12) part of the parameter in A1, indicating that the composite table is divided into 12 files with zone numbers 1, 2, …, 12. At the same time, an expression for calculating the zone number from the data is given, here month(dt). When append@x() is used to append data, SPL calculates month(dt) for each appended record and, according to the result, appends the record to the file with the corresponding zone number.
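The routing step can be sketched in plain Python (illustrative only, not SPL syntax): a zone expression such as month(dt) is evaluated for each record, and the record goes to the zone whose number matches the result.

```python
# Hypothetical sketch of zone-number routing in a multi-zone composite
# table: records are distributed across zone files by evaluating a zone
# expression per record. All names here are illustrative.
import datetime

def route(records, zone_expr, zone_count):
    """Distribute records over zones 1..zone_count using zone_expr(record)."""
    zones = {n: [] for n in range(1, zone_count + 1)}
    for rec in records:
        zones[zone_expr(rec)].append(rec)
    return zones

recs = [{"dt": datetime.date(2024, m, 1), "amount": m * 10}
        for m in (1, 1, 3, 12)]
# month(dt) as the zone expression, 12 zones as in the example above
zones = route(recs, lambda r: r["dt"].month, 12)
```

Here two January records land in zone 1, one March record in zone 3, and one December record in zone 12; the other zones stay empty until matching data arrives.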
When using the composite table, we can select only some of the zones to form a logical composite table. The selected zone numbers can be discontinuous but must be in order. When early data needs to be deleted, simply delete the file of the corresponding zone and then open the composite table without including that zone in the list; there is no need to rewrite the data in any other zone.
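A small Python model (again illustrative, not SPL's API) shows why this deletion is cheap: dropping a zone file and omitting its number from the open list removes that period's data without touching the remaining zones.

```python
# Hypothetical model of forming a logical composite table from a subset
# of zones. Lists of rows stand in for the physical zone files.

def open_logical(zone_files, zone_list):
    """zone_files: dict zone number -> list of rows.
    zone_list: ordered (possibly discontinuous) zone numbers to include."""
    assert list(zone_list) == sorted(zone_list), "zone numbers must be ordered"
    rows = []
    for z in zone_list:
        rows.extend(zone_files[z])       # untouched zones are read as-is
    return rows

zone_files = {1: ["jan"], 2: ["feb"], 3: ["mar"]}
del zone_files[1]                        # drop the file holding the oldest data
table = open_logical(zone_files, [2, 3]) # open without zone 1; nothing rewritten
```

Deleting a month of history is just one file deletion plus a shorter zone list, in contrast to the full rewrite a single continuous file would require.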
More uses of the multi-zone composite table will be introduced later in the discussion of cluster operations; this mechanism lets us deliberately control how data is distributed across cluster nodes.