Performance Optimization - 2.5 [Dataset in external storage] Order and supplementary file
Even if ordered storage does not reduce the amount of storage, it is of great significance for search and traversal. Later sections will gradually show how to use order to improve operation performance.
When creating a composite table, prefixing field names with # in the parameter indicates that the composite table is ordered by those fields (in the order they are listed, and they must be the leading fields). However, SPL does not check this when appending data; programmers need to ensure that the written data is indeed in order.
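As a sketch (file names and fields here are illustrative, not from the original text), creating an ordered composite table and appending pre-sorted data might look like this in SPL:

```spl
	A	B
1	=file("orders.ctx").create(#id, odate, amount)	/ordered by id; the #-fields must come first
2	=file("orders_sorted.btx").cursor@b()	/new data assumed already sorted by id
3	>A1.append(A2)	/SPL does not verify the order here
```

If the cursor in A2 were not actually sorted by id, the composite table would silently become disordered, since no check is performed on append.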
The trouble with ordered storage mainly lies in data appending, because newly added data cannot always be attached after the original data in order. For example, the current composite table is sorted by the ID field, while the ID values in newly added data usually interleave with the existing ones. To keep the whole composite table ordered by ID, we have to re-sort all the data by ID; we cannot simply append the new data to the end. Sorting big data, however, is a very time-consuming operation.
SPL provides two ways to obtain a wholly ordered new composite table.
append@m() indicates that a merge algorithm will be used to combine the data of the original composite table with the new data in the cursor and generate a new composite table. The original composite table is already ordered, so if the cursor of new data is also ordered, the low-cost ordered merge algorithm can be used. The overall complexity is equivalent to reading and writing all the data once, and it avoids the many temporary files produced by an ordinary big sort, thus achieving better performance.
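A hedged sketch of this merge-and-rewrite approach (file and field names are assumptions for illustration):

```spl
	A	B
1	=file("orders.ctx").open()	/the existing ordered composite table
2	=file("orders_new.btx").cursor@b().sortx(id)	/sort only the new batch, which is small
3	=A1.append@m(A2)	/merge two ordered sequences into a new composite table
```

Only the new batch needs a real sort; the bulk of the work is the single sequential read/write pass over the merged data, with no large temporary files.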
However, for big data, even reading and writing everything once is quite time-consuming. SPL therefore also provides a supplementary file mechanism to obtain a logically ordered new composite table.
SPL stores each ordered composite table in two files: one is the main file and the other is the supplementary file. When append@a() is used to append ordered data, the appended data is merged with the supplementary file only, and the main file remains unchanged. When reset() is called, the main file and the supplementary file are merged into a larger main file, and the supplementary file is then emptied.
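The daily routine described next might be sketched roughly as follows (the schedule test and all names are assumptions; check the actual reset()/append@a() usage in the SPL documentation):

```spl
	A	B
1	=file("orders.ctx").open()
2	=file("orders_new.btx").cursor@b().sortx(id)
3	if day(now())==1	>A1.reset()
4	else	>A1.append@a(A2)
```

On the 1st, reset() performs the one big merge of the supplementary file into the main file; on every other day only the small supplementary file is rewritten, so the daily append stays fast.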
In this scheme, reset() is executed when the date is the 1st, merging the main file and the supplementary file; on other dates, append@a() merges the new data into the supplementary file. If such an append is performed once a day, the supplementary file holds at most one month of data, and all data older than one month is stored in the main file. In other words, the main file may be very large while the supplementary file stays relatively small. The daily merge amount is therefore not large and the append completes quickly, while the more time-consuming sort of the full data happens only once a month.
When SPL accesses the composite table, it reads data from the main file and the supplementary file respectively and then merges the two streams before returning them (ensuring that the returned result set is still ordered), so that logically it still looks like a single composite table. This is slightly slower than accessing a single file, but the advantage of order is preserved.
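Reading is transparent to the caller. Opening such a composite table and fetching from a cursor might look like this (names assumed), with the main/supplementary merge happening internally:

```spl
	A	B
1	=file("orders.ctx").open()
2	=A1.cursor(id, amount).fetch()
```

The cursor returns records already merged in id order from both files, so downstream code sees one ordered table and can still exploit the order.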