# Performance Optimization - 2.3 [Dataset in external storage] Data types

After using binary files, we can adopt more optimized coding schemes.

An integer may occupy 4 or 8 bytes on the computer. In principle, it is the same size stored in the file. However, in fact, a considerable number of integers are not large. For example, a person’s age will not exceed 200. If it is expressed as 4 bytes, 3 bytes are 0. If they are all stored as 4 bytes, it will be a waste.

Don’t underestimate this, because the amount of data is often large. If we change the encoding method and store small integers in another encoding method, we can save 2 bytes. If there are 1 billion rows of data, we can save 2G of space and reduce a lot of hard disk access time.

Similarly, there is date time. A complete datetime type will occupy 8 bytes. Often, we only care about the date part, or even the date of recent decades. If the encoding method can be changed, it may be stored in half the space.

However, the coding rules should not be too complex, otherwise decoding will occupy more CPU time. There are many specific codings, which can be designed by yourself after understanding the basic principles.

SPL adopts optimized encoding for small integers, recent dates and short strings when saving bin files. You can try how different is the space occupied by the same number of large integers and small integers.

The storage space occupied by data in external storage is highly correlated with the data type. The simpler the data type, the smaller the space occupied, and the simplest operation in decoding. For example, an integer is a simple value, the occupied space is relatively determined, and the reading and decoding action is very simple; A string also has a length information, which needs more storage; The date and time itself has no other information, but it will involve relatively complex decoding actions.

If we can change the data type without losing data information, it is possible to save a lot of storage space, reduce parsing actions and achieve good performance improvement.

Among all data types, integer is the simplest and has the best storage and parsing performance. If possible, try to convert other data types to integers.

**Date** can be converted to the number of days from a certain day, which does not affect the comparison. 50000 days can represent a time period of more than 100 years, which is sufficient in most scenarios. When it is hoped that the components of year, month and day can be obtained more conveniently (some operations need to group by year and month), simply put, it can be converted into an 8-digit decimal integer of yyyymmdd; SPL also provides a method that saves more space, that is, convert the year and month to the number of months from 1970, and represent the day by 5 binary bits (a month has a maximum of 31 days,and a 5-digit binary number can represent any number between 0 and 31), and it is equivalent to ((yyyy-1970)*12+(mm-1))*32+dd. In this way, you can use small integer to represent a date between 1970 and 2140, which basically meets the requirements.

A | B | C | |
---|---|---|---|

1 | =date(2021,2,12) | =date(2000,1,1) | =year(B1) |

2 | =interval@d(B1,A1) | ||

3 | =year(elapse@d(B1,A2)) | =month(elapse@d(B1,A2)) | =day(elapse@d(B1,A2)) |

4 | =month@y(A1)*100+day(A1) | ||

5 | =A4\10000 | =A4\100%100 | =A4%100 |

6 | =days@o(A1) | ||

7 | =year(A6) | =month(A6) | =day(A6) |

8 | =((year(A1)-1970)+month(A1)-1)*12+day(A1) | ||

9 | =A8\384+1970 | =A8\32%12+1 | =A8%32 |

A1 is the date to be converted and B1 is a base date. A2, A4 and A6 convert A1 into integers in three ways respectively. Lines 3, 5 and 7 inversely calculate the year, month and day components after three coding methods respectively. The integer obtained by A2 method is the smallest, but the amount of inverse calculation is relatively large; The inverse calculation of A4 method and A6 method is less, and the integer obtained by A6 method is smaller. Moreover, SPL provides more convenient basic functions for conversion, and lines 8 and 9 explain the actual execution logic of A6 method.

There are two kinds of **strings**. The text itself generally does not have much room for optimization, such as name, address, etc.; Another kind of string is often a coding method, and its value range is very small, such as gender, educational background, abbreviation of country or region, etc., in this case, it is only necessary to convert it into the serial number of the coding table:

A | B | C | |
---|---|---|---|

1 | […,CN,…,JP,…,UK,…,US,…] | ||

2 | CN | =A1.pos@b(A2) | =A1(B2) |

B2 converts A1 into an integer and C2 calculates it back. Search is required during conversion, which is relatively time-consuming (binary search can be used), and the inverse calculation is equivalent to sequence number positioning, which has very good performance. In this way, it takes more CPU time for conversion during storage, but storage and coding have more advantages, and the efficiency of operation is much higher.

It is mentioned earlier that under the appropriate coding mode, small integers will occupy less storage space than large integers. In addition, for SPL, small integers have greater significance. Because SPL is developed in Java, and Java generates objects very slowly, SPL generates all 16-bit integers (0-65535) in advance. If the read integer is found to be within this range, no new object will be generated, which can reduce a lot of computation.

The third method of date conversion just mentioned can convert dates within more than 50 years to 16-bit integers, and the amount of inverse calculation is not large. Although the second method also has a small amount of inverse calculation, the range of converted integers is much larger. It is no longer a small integer, and the storage and operation performance will be poorer.

In fact, the conversion of these data types is also meaningful for in-memory operations.

Integer operation is also less complex than the operation of date and string. It is faster when comparing. Including the calculation of hash function, the calculation of integer is much faster than that of string.

In the previous chapter, we did not pay attention to data types in search operation. There is no difference in complexity analysis, but there is a great difference in practice. When searching, if the to-be-searched key (TBS key) can be converted to integer to store first, for each search, the additional conversion cost of the search value is much less than the benefit of comparison and hash calculation due to the simplification of data type in the search process, and this is also a common engineering means in performance optimization.

The optimization of data types is also reflected in the case of multiple fields.

When we discussed the search algorithms in the previous chapter, all searches were for the TBS key of one field. In fact, these algorithms are also suitable for multi field search. Similarly, there is no difference in algorithm complexity between multi field and single field in theory, but there are great differences in practice. Even two fields need to be represented by a set, and there will be length information. The space occupied and the amount of computation will not be twice that of a single field (assuming that the data types of the fields are the same), but more.

Therefore, if possible, we can consider spelling multiple fields into one field. However, whether this means is effective or not requires specific analysis and measurement. Combining multiple fields into one field can avoid set operation, but it will make the integer larger, or even convert the integer into a string (to be spelled with another string). Sometimes, the performance loss caused by these cannot offset the benefits of avoiding set operation.

Simply put, if two strings that cannot be converted into integers are put together, or the combination result of two small integers is still not too large (for example, combining two 2-digits will get a 4-digit), merging fields will bring benefits. More complicated situations are not necessarily the case.