RaqForum 23 No.
1,142 View •
What You Possibly Don’t Know About Column Storage
The column storage technique has been proven effective in handling many computing scenarios and is thus commonly adopted by data warehousing products. The technique is usually a synonym of high-performance in the industry.
Yet the strategy has its own weaknesses. A google shows that criticisms surrounding it are mainly about data modification. There are few discussions of its application to the read-only data analysis and computing, which will be taken care of in the following.
The idea behind columnar storage is simple. To retrieve data stored row-wise on disk, all columns will be scanned. As a result, a query involving only a few columns will retrieve a lot of irrelative data. Plus, the disk respond time suffers as head jumps between tracks picking up data. The column storage strategy enables only the retrieval of the useful columns, significantly reducing the amount of to-be-accessed data on most of the occasions. For the big data computing particularly, the disk scanning is time-consuming and a decrease in the amount of data to be accessed will greatly speed up the overall computing process. Moreover, chances are big that there are a lot of duplicate values in one column. It’s easier to compress data into a smaller size under the columnar storage to enhance performance.
According to its working principle, the columnar storage increases performance by reducing disk accesses. But it can’t reduce the amount of computations. If data is already in the memory, the technique is unnecessary. The structured data processing is row-based, and storing in-memory data column-wise complicates the construction of records and thus hinders performance. Except for professional vector computing (which is column-based and commonly used in data mining), columnar storage strategy isn’t the right choice in handling of relational-style in-memory computing (including in-memory databases).
Columnar storage stores data column by column, making the simultaneous accesses to multiple columns random and discontinuous. The more columns accessed, the more random the accesses. Random accesses to the HDD will seriously affect performance due to time-consuming head jumps, even worse than the performance with row-based storage, which enables continuous column accesses, when a lot of columns are being accessed or the total number of columns is small. Concurrent tasks (and parallel processing) will further exacerbate the random access problem. Though the disk retrieval suffers from head jumps with concurrency computing in row-based storage, the degree is much smaller. With columnar storage, one concurrent task will generate multiple concurrent access requests (the number of involved columns); with row-based storage, one concurrent task only generates one concurrent access request.
One solution is to increase the buffer area for storing the retrieved data to reduce the proportion of the seek time. But setting buffer area for each column consumes a large amount of memory space if there are a lot of involved columns. The other is to add more hard disks and store columns on different hard disks. As columnar storage is generally applied to scenarios involving a large number of columns, the number of columns is usually far more than that of hard disks that can be installed on a machine and disk access conflict often occurs.
Yet the columnar storage is more friendly with the SSD that hasn’t the seek time problem.
The commonly-used indexing technique is intended to locate desired records from a large data set according to key values. The nature of indexing is sorting, as mentioned in a previous article. An index records addresses of both the ordered key values and their corresponding records in the original data table. With row-based storage, the position of a record can be represented by one number. But with columnar storage, each column of a record has its own position and, in principle, should be all recorded, making an index table that is almost as large as the original table. Accesses become much more complicated and space consumption is large. It’s no better than the method of copying the original table and then sorting it.
You might say we can store only the address of one column for a record and then calculate the addresses of the rest of the columns. Sizes of field values of certain data types, like string, are unfixed, and those of values of other data types, like integer and date, which generally have fixed sizes, may turn unfixed thanks to compression technique often used with columnar storage. If all field values are stored in fixed sizes, the index becomes simple and access speeds up, but the data amount increases, leading to a less cost-effective traversal, which is the operation for which the columnar storage strategy is mainly used.
A commonly-used real-life method is dividing the original table into a number of data blocks. For each block, data is stored in a column-based format. The index is created on the blocks to quickly locate the block where the data sits. Then only the in-block data scanning is needed.
This block-based index has a lower performance than the row-based index because of the extra in-block scanning. If the original data table is ordered by index key values (usually the index key is the original table’s primary key), it’s easy to locate the blocks (most of the time only one block) holding the target data, with bearable performance loss. This index type applies to scenarios where records are located according to unique key values. If the original data table is unordered by the index key, the block-based index is useless. It’s possible that the target data falls in nearly all blocks, causing a similar performance to the full table scanning.
Multithreaded parallel processing is a must to make full use of the capability of multi-core CPU. Having data segmented is necessary for performing parallel processing.
Two basic requirements for data segmentation: almost equal data amount in each segment (making a balanced task allocation among threads); dynamic segmentation (because the number of threads can’t be predicted). It’s easier to segment data with the row-based storage. We can make a mark at the end of each record (or of every N records) to evenly divide data into K segments according to the number of bytes. The ending mark in each segment is the starting point of the next segment. This is unfeasible with columnar storage because the sizes of field values may be unfixed. Segmenting points within columns may not fall in the same record, resulting in mismatched data.
Block-based segmentation strategy is a common solution to the columnar storage segmentation problem. The unit of segmentation is block, where data won’t be further divided to be processed in parallel. Here is a dilemma. On one hand, the number of blocks should be large enough to enable a dynamic segmentation (because you can’t get 10 segments from only 5 blocks). As a contemporary computer generally has many CPU cores, nearly 100 blocks are needed to achieve a flexible, balanced segmentation. On the other hand, too many blocks means the columnar data is physically divided into many discontinuous blocks, making the traversal code very complicated and causing the retrieval of useless data between two blocks. For an HDD, there’s also the seek time problem. The more the blocks data is divided into, the more serious these problems become. Only when the space the columnar data in a block used is much larger than the buffer area for storing retrieved data, the proportion of the time spent in retrieving useless data and the seek time will become relatively small. This requires that there should be a sufficient large number of records in each block. In other words, data amount is crucial for processing data in columnar storage format in parallel. With an HDD (including a disk array that contains a large group of HDDs), normally there should be at least one billion records in one table on a single computer, with the data amount reaching above 100G. The parallel processing won’t give a noticeable increase to the performance if the data amount isn’t large enough. This is the problem the multidimensional analysis, for which the columnar storage strategy is particularly suitable, faces. Besides, the size of each block is pre-determined, but neighboring blocks can’t be physically combined while data is continuously added. Consequently, the number of blocks is ever- increasing, bringing management challenges, which demand a special scalable space to store the index of the blocks.
These problems, however, can’t deny the great advantage of the columnar storage in handling read-only computing scenarios, but any impetuous use of the technique need to be avoided. For data warehousing products, it’s appropriate to allow the system administer, or the user, to decide whether or not the columnar storage is used, how to divide columnar data into blocks, which part of data needs to be stored in a columnar format, and which part of data needs to be handled with both row-based storage and columnar storage to achieve higher performance through data redundancy.