Performance Optimization - 2.4 [Dataset in external storage] Composite table and columnar storage

 

Text files and bin files store each record in turn. This method is called row-based storage.

Most operations do not use all fields of the data table. Because the hard disk must be read in blocks, almost the whole record must be read out no matter how many fields are read in the row-based storage mode. The optimization that can be done is not to process (such as generating corresponding data objects) after reading, but the reading time of the hard disk is indispensable.

If columnar storage is used, redundant reading can be avoided.

The so-called columnar storage is to continuously store the values of the same field of each record in the data table. In this way, if only a few fields are used in the operation, only the data corresponding to these fields will be read, which can reduce the amount of reading and effectively improve the performance.

However, considering that the data will be appended, columnar storage will be much more troublesome than row-based storage. Data is usually appended by records. In case of row-based storage, we only need to append records to the end. In case of columnar storage, we can’t do so simply. We need to append the field values behind their respective areas, which requires that there is a reserved space in the field storage area, otherwise it is difficult to ensure the continuity of data of the same field. However, we don’t know how many records there are in total, let alone the space occupied by each field, so we can’t reserve appropriate space.

The general processing method adopts blocking, and the whole data area is composed of some data blocks with a certain size. The values of the same field will be stored in the same data block. When the data block is full, a new data block will be generated to continue storing. In this way, a field will occupy several data blocks, and it is discontinuous between the blocks, but the field values inside the data block are stored continuously. As long as the data block is large enough (more than 1M of modern hard disk is basically enough), the excess reading of hard disk caused by discontinuity between data blocks can be ignored.

Now that blocks are used, we can conveniently make a minmax index on the data block, that is, record the maximum and minimum values of the field values in the block to the block header, which does not take up much space. When performing filtering, if it is found that the field value to be compared is not within the maximum and minimum value range of the current block, the whole block can be directly skipped without reading and parsing, which can further reduce the amount of reading and calculation.

Another trouble with columnar storage is the synchronization of segments.

The number of field values stored in each data block is different. Except for the first data block of each field, any other data block of two different fields cannot guarantee that they store the fields of the same batch of records. The columnar storage realized by simply using data blocks cannot realize segmentation.

The commonly used method in the industry is to batch the data by records. For example, every 1M records are used as a batch, and the columnar storage is used in the above block mode. The segmentation can only be based on the whole batch, and it cannot be carried out within the batch.

This method is relatively effective when the amount of data is very large. Because a batch must be large enough so that the data blocks in columnar storage can continuously store enough data. The number of batches should also be large enough, otherwise the segmentation will be limited, because the segmentation can only be based on each batch. This method is suitable only when the number of records reaches 100 million or even greater.

This problem can be solved by using the double increment segmentation method.

The data is no longer batched, but the a fore mentioned index area is established for the data area (composed of multiple data blocks) of each field, and the starting data block of each segment and the position in the block are recorded in the index area. When adding records, all index areas will be filled and double-increased synchronously to ensure consistency. The number of records (actually the number of field values) corresponding to all the index blocks in the index area is the same. In this way, it can also ensure that the segment with the same serial number of different fields will always correspond to the field values of the same records. Good segmentation effect can be obtained without a large amount of data.

The columnar storage file of SPL is called the composite table, which implements the above data storing, minmax index and double increment segmentation mechanism.

A B C
1 =file(“data.ctx”) =file(“date.btx”)
2 =A1.create(…) =A2.append(B1.cursor@b())
3 =A1.open()
4 =A3.cursor(…;;4:10) =A3.cursor(…;;5:10) =A3.cursor(…;23:100)
5 =A4.fetch(1) =B4.fetch(100) =C4.fetch()

Unlike bin files, a composite table needs to be created before appending data (A2 creates, B2 appends data). When creating a composite table, we need to first specify the data structure (part… in A2), which is somewhat similar to the need to CREATE TABLE for tables in the database. A4, B4 and C4 generate the cursors of different segments. Note that one more semicolon should be written. When creating a composite table, columnar storage will be used by default. When reading, specify the field names to be used in the parameters (… in A4, B4, C4), so as to take advantage of columnar storage to reduce the amount of reading.

The encoding method of structured data is generally not very compact (even if the optimization method mentioned in the previous section is used), so there is often some room for compression. After the data is compressed, the space occupation of the hard disk can be reduced, so as to reduce the reading time, but decompression will increase the amount of CPU computing and consume more computing time. The compression effect is also related to the algorithm used. Algorithms with high compression rate usually occupy more CPU time during decompression. Therefore, whether to compress or not is not a definite choice. It can only be determined according to the actual situation or even after certain tests.

Columnar storage is more conducive to data compression than row-based storage. In structured data, the data type of the same field is generally the same, and even the values are very close in some cases. The data blocks composed of such a batch of data usually have a good compression rate.

Columnar storage and compression are used by default when creating a composite table. SPL provides options for the user to choose row-based storage or no compression.

When column storage is adopted, if a field in the data table is orderly, it means that the field values of adjacent records are more likely to be the same. In this way, only the number of duplicates and one value can be stored instead of storing the same value many times, and the saved space is considerable.

In order to use this scheme, we can sort the data in advance and then store it in the file. However, a data table can only have one sort scheme. When it is ordered by field A, it cannot be ordered by field B at the same time. Which field should be preferred in sorting?

If we do not need to consider other factors, we can generally choose the field with more repetition to be listed first. To be precise, if T.id(A).len()is less than T.id(B).len(), then sort by A first and then by B, the space occupied is usually less.

When the ordered data is appended to the columnar storage composite table, SPL will automatically execute the above scheme, only record the value once and number of repetitions, without user intervention.