Performance Optimization - 2.1 [Dataset in external storage] Text file segmentation


Performance problems usually involve large amounts of data, and big data often cannot be loaded into memory, so we need to consider how to operate on data in external storage. A database is probably the most common form of external data storage, but we cannot implement our own optimized storage methods and algorithms inside a database, so that scenario is not studied here.

For the sake of engineering feasibility, we discuss only external storage data kept in the file system.

The text file is a very common file format. Because it is simple and universal, it is often used as a medium for exchanging data between different data systems. A text file that stores structured data usually has a title row that identifies the field names, and each subsequent row is a record. The fields within a row are separated by a tab or a comma, and the rows are separated by a line break (a carriage return and/or line feed).

The storage layout of a text file is fixed, so parallel computing is essentially the only means of improving its computing performance. Modern computers usually have multiple CPUs. If a file can be processed in parallel, the performance gain can approach a linear multiple of the number of CPUs.

To compute in parallel, we must be able to split a file into segments and let each thread (CPU) process one segment separately. Because the rows of a text file differ in length, we cannot segment by record sequence number (i.e., row number) as we would for in-memory data. Reaching the nth row requires traversing the previous n-1 rows, which defeats the purpose of improving performance. Moreover, we do not even know in advance how many rows the file contains.

A text file should therefore be segmented by byte position. The operating system can directly return the total number of bytes in a file, and it also provides a way to jump quickly to a specified byte position. However, a given byte position is not necessarily the beginning of a row (and most likely is not). If we start reading directly from it, the first thing we get is part of a row rather than a complete record.
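The two operating system facilities mentioned here can be tried out with a few lines of Python. This is only a minimal sketch: the sample file, its contents, and the chosen offset are made up for illustration. It shows that the file size is available without reading the file, and that reading from an arbitrary byte position first yields a fragment of a row.

```python
import os
import tempfile

# Hypothetical sample file: a title row plus rows of different lengths.
rows = ["id\tname"] + [f"{i}\t{'x' * (i % 7 + 1)}" for i in range(1, 21)]
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as f:
    f.write(("\n".join(rows) + "\n").encode("ascii"))

size = os.path.getsize(path)   # total bytes, obtained without reading the file
offset = size // 2             # an arbitrary byte position

with open(path, "rb") as f:
    f.seek(offset)             # jump directly, no rows are traversed
    fragment = f.readline()    # whatever remains of the row containing the offset
    next_row = f.readline()    # the first complete row after the offset

print(size, offset, fragment, next_row)
```

With this data the offset falls inside a row, so `fragment` is only the tail of that row, while `next_row` is a whole record.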

A line break separates the rows (i.e., the records) of a text file, and a line break does not appear inside a row (record) itself. Using this property, we can read a text file from arbitrary positions by “discarding the head and complementing the tail”. Starting from the specified byte position, a record is considered to begin only after the first line break has been read, and a complete row (record) is obtained when the next line break is read. The characters read before the first line break are discarded; this is discarding the head. If a segment is required to end at a specified byte position, reading actually continues past that position until another line break appears, so that the last row (record) is complete; this is complementing the tail.
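The procedure just described can be sketched in Python as follows. This is an illustration under stated assumptions, not SPL's actual implementation: the function name `read_segment`, its parameters, the sample data, and the boundary convention (a row belongs to the segment in which its first byte falls after the segment start and up to and including the segment end) are all choices made for this sketch.

```python
import os
import tempfile

def read_segment(path, seg_no, seg_total, has_title=True):
    """Return the complete rows of segment seg_no (1-based) out of seg_total."""
    size = os.path.getsize(path)
    start = size * (seg_no - 1) // seg_total   # nominal start byte
    end = size * seg_no // seg_total           # nominal end byte
    rows = []
    with open(path, "rb") as f:
        f.seek(start)
        if seg_no > 1:
            f.readline()                       # discard the head (partial row)
        # A row is read if it starts at or before `end`; the last such row
        # may run past `end`, which is complementing the tail.
        while f.tell() <= end:
            line = f.readline()
            if not line:                       # end of file
                break
            rows.append(line.decode("utf-8").rstrip("\r\n"))
    if has_title and seg_no == 1:
        rows = rows[1:]                        # drop the title row
    return rows

# Hypothetical sample data with rows of different lengths.
data = [f"{i}\t{'x' * (i % 13)}" for i in range(1, 201)]
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as f:
    f.write(("id\tname\n" + "\n".join(data) + "\n").encode("utf-8"))

segments = [read_segment(path, k, 5) for k in range(1, 6)]
merged = [row for seg in segments for row in seg]
print([len(seg) for seg in segments], merged == data)
```

Because every later segment discards everything up to its first line break, and every segment reads on to the end of its last row, each record ends up in exactly one segment and none is cut in half.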

SPL has a built-in method for reading a file by segment; you only need to specify the segment number and the total number of segments.

     A                     B                     C
1    =file("data.txt")
2    =A1.cursor@t(;4:10)   =A1.cursor@t(;5:10)   =A1.cursor@t(;23:100)
3    =A2.fetch(1)          =B2.fetch(100)        =C2.fetch()

A2, B2 and C2 each define a cursor. A2 divides the file into 10 segments and takes the 4th segment; B2 takes the 5th of 10 segments; C2 divides the file into 100 segments and takes the 23rd segment. A3, B3 and C3 then fetch records from the respective cursors.
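To show how segment reading serves parallel computing, here is a rough Python sketch in which each segment is handled by its own worker and the partial results are combined. It is not SPL code. The function `segment_sum`, the `amount` column, and the sample file are hypothetical. A thread pool is used only to keep the example short; in standard CPython a process pool would typically be needed to gain real CPU parallelism.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def segment_sum(path, seg_no, seg_total):
    """Sum the second column over the rows of one byte-range segment (1-based)."""
    size = os.path.getsize(path)
    start = size * (seg_no - 1) // seg_total
    end = size * seg_no // seg_total
    total = 0
    with open(path, "rb") as f:
        f.seek(start)
        # Segment 1 skips the title row here; every later segment skips
        # the partial row it landed in (discarding the head).
        f.readline()
        while f.tell() <= end:                 # the last row may pass `end`
            line = f.readline()
            if not line:
                break
            fields = line.decode("ascii").rstrip("\r\n").split("\t")
            total += int(fields[1])
    return total

# Hypothetical sample file with a title row and 500 records.
amounts = [(i * 37) % 1000 for i in range(1, 501)]
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as f:
    body = "".join(f"{i}\t{a}\n" for i, a in enumerate(amounts, start=1))
    f.write(("id\tamount\n" + body).encode("ascii"))

seg_total = 4
with ThreadPoolExecutor(max_workers=seg_total) as pool:
    partials = list(pool.map(lambda k: segment_sum(path, k, seg_total),
                             range(1, seg_total + 1)))
parallel_total = sum(partials)
print(partials, parallel_total == sum(amounts))
```

Each worker opens the file independently and seeks to its own start position, so the workers share no state and the final result is simply the sum of the partial sums.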

Segmenting a text file with this “discarding the head and complementing the tail” method cannot guarantee that every segment contains the same number of records; it can only ensure that the number of bytes in each segment is roughly equal.
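A small calculation makes this concrete. The sketch below uses made-up row lengths (100 rows of 10 bytes followed by 20 rows of 50 bytes, title row ignored) and assumes a row belongs to the segment in which its first byte falls after the segment start and up to and including the segment end. It needs no file at all, only the row lengths.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical row lengths in bytes, each including its line break.
row_lengths = [10] * 100 + [50] * 20
size = sum(row_lengths)                       # 2000 bytes in total
seg_total = 4
bounds = [size * k // seg_total for k in range(seg_total + 1)]

# Byte offset at which each row starts.
row_starts = [0] + list(accumulate(row_lengths))[:-1]

counts = [0] * seg_total                      # records per segment
seg_bytes = [0] * seg_total                   # bytes actually read per segment
for p, length in zip(row_starts, row_lengths):
    # Segment k (0-based) owns rows with bounds[k] < p <= bounds[k + 1];
    # the very first row (p == 0) belongs to the first segment.
    k = 0 if p == 0 else bisect_left(bounds, p) - 1
    counts[k] += 1
    seg_bytes[k] += length

print(counts)     # [51, 50, 10, 9]
print(seg_bytes)  # [510, 540, 500, 450]
```

The byte counts all stay close to the nominal 500 bytes per segment, while the record counts differ by more than a factor of five because the rows in the second half of the file are longer.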