Performance Optimization - 2.2 [Dataset in external storage] Bin file and double increment segmentation
Text files use characters to encode data. Although the universality is good, the performance is very poor. To convert characters into computable values, more computation is needed, and date and time data also need a very complex analysis and judgment process. If the data is directly stored in its form in memory, there is no need to do conversion when reading it out, and much better performance can be obtained.
This is the binary format.
The disadvantage of binary format is that it cannot be viewed directly, and there is no common binary file format standard in the industry.
SPL provides a simple binary format called bin file.
A text file can be read out and converted into a bin file. Using the @a option can append data to the existing bin file.
You can try to compare the performance of the same operation using a text file and a bin file, such as simply making a row count:
A bin file cursor does not need to use the @t option, it must have field names.
Similar to text files, parallel computing after segmentation is also an important means to improve the computational performance of binary files. However, unlike text files, binary files do not have separators between each row, and any value bytes may appear in the stored field values. There is no special value that can be used to separate fields or records. In fact, no separator is also an advantage of binary files, because the number of bytes occupied by many data types is fixed, and there is no need to waste storing a separator (an integer may occupy 4 bytes, and if a separator is added to occupy 1 byte, the storage capacity will be expanded by as much as 25%).
So how do we do the segmentation at this time?
Early binary files used fixed length records, that is, each record had the same length, so that the bytes position of the nth record could be calculated by multiplication. This method is also used in early databases, so users are required to specify the length of each field when creating a data table in order to reserve the space. Because the actual number of fields occupied by different records may vary greatly, this method requires to reserve space according to the largest one, which is a serious waste. It is rarely used now. At present, modern databases mostly use variable length varchar instead of char.
Another way is to imitate the text file and artificially set a separator to separate two records. However, any byte value may be actual data. Simply using a character as a separator is likely to make an error. Therefore, usually a series of bytes are used, for example, 16 consecutive 255 indicate the end of the record. Generally, this is not the case for normal data, so it will not be misaligned. But this will also cause a lot of waste, and cannot completely eliminate the possibility of mistakes.
At present, the mainstream method is to divide into blocks. Every N records or enough bytes are written as a block. The data in the block is no longer split, and segmentation is performed in all blocks. As long as the number of blocks is enough, the effect of segmentation and parallel computing will not be poor, and there will be many blocks in the case of large amount of data; If the number of blocks is relatively small, it indicates that the amount of data is small, and there is no need for segmentation and parallel computing.
The trouble of using blocks mainly lies in the need for block index information, recording the total number of blocks, where each block starts, and so on. With the addition of data, the number of blocks will continue to increase, and the length of this index will also change. If it is a database system or big data platform, it can have a set of index management mechanism. However, if it is implemented in an ordinary file system, there needs to be a separate file to store the index, which is inconvenient to use.
SPL adopts a method called double increment segmentation to realize the appendable blocking scheme using single file in the file system.
1024 (or other number) block index information will be stored in the reserved space in the file header, and then the actual data will be appended. Initially, it is considered that one record occupies a block, and each additional record means an additional block. After addition, the block information should be filled in the index area. When 1024 records are appended, all index blocks are filled.
If there is another action of adding records, it is considered that two records occupy one block, halve the 1024 index block, merge the first and second blocks into a new first block, merge the third and fourth blocks into a new second block,…, merge the 1023th and 1024th blocks into a new 512th block, and clear all the indexes from 513th to 1024th block. Because the actual data is stored continuously, just change the starting position of the data block in each index block.
Now there are 512 empty index blocks, and 512*2 new records can be added. After appending to a total of 2048 records, all index blocks are filled again. If you want to add more records, make another adjustment as just now, change it to every four records per block, and also merge the current blocks 1 and 2 into a new block 1…. Another 512 blocks index can be vacated to continue appending.
The appending can be carried out on and on, and the block size will continue to double, and the data in the block is always continuous, and there is no redundant separator to waste space. The total number of blocks varies from 512 to 1024 (except when the number of records is less than 512 at the beginning), which also meets the requirement of sufficient number of blocks. The space reserved for the index at the file header is fixed and will not become larger with the addition of data. A single file can realize the mechanism of blocking and segmentation. Moreover, it is easy to count the total number of records (after multiplication, just count the number of records in the last block).
In fact, SPL will not use double increment segmentation immediately. If the total number of records is small (such as less than 1024), it will not do the segmentation. The segmentation scheme will be implemented only when the number of records exceeds a certain number. In this way, if several records are written to the bin file, there will not be an extra file header that looks a little big. When the amount of data is large, the extra file header is not obvious by comparing to data itself.
For the application programmer, these are transparent. After simple appending, a bin file can implement segmented calculation.
The access syntax is basically the same as that of the text file, but if the number of segments exceeds 1024, it is meaningless for the bin file.