Is Columnar Storage Always Faster?

 

The columnar storage brings higher performance in many scenarios and is utilized by not a few data warehouse products. When large data sets are involved in computations, disk scanning and reading accounts for a good part of the computation time. Columnar storage enables reading the desired columns only when there are a great many columns, but the computation involves a very small number of them, minimizing disk accesses and boosting performance. Yet sometimes the technology is not as fast as expected; quite the contrary, it leads to lower performance than row-oriented storage.

 

The indexing technique is often used to achieve high-performance searching. An index table stores pairs of key-value and positions of the corresponding record in the original table. The position of the row-oriented storage data record can be represented by a numerical value. When columnar storage is used, each column in a row has its own position. Storing all column positions in the index table will result in costly accesses and heavy space usage. And reading each record of the original table involves multiple column data zones. Hard disks have the smallest reading unit, implying that the column data to be read greatly outnumbers data when it is stored row by row. This causes lower search performance of columnar storage than that of row-oriented storage.

 

Multithreaded processing can give full play to a multi-CPU (core) machine’s computational ability. Dividing the original table into multiple segments is thus necessary. The popular way of dividing columnar data for data analysis is partition-based strategy – which splits data into partitions where data is stored in the columnar structure and that data is then segmented based on partitions. The number of partitions should be as many as possible to try to ensure an even division, and each partition should be as big as possible to allow columnar storage full play. This creates a contradiction. Once the segmentation is not as successful as expected, performance of columnar storage will be damaged during the multithreaded processing, poorer than that of row-oriented storage.

 

In view of this, we’d better choose a storage format suited to the conditions of the current computational scenario. Using columnar storage or row-oriented storage uniformly is not a wise choice. It’s a pity that many data warehouse products use too transparent mechanism to let users have their own choices.

 

Faced with complicated needs in real-world data analysis work, we have the open-source esProc SPL, which is highly flexible. It supports both columnar storage and row-oriented storage and lets users select the most suitable storage format – row-oriented storage to achieve high-performance search and columnar storage when only a small number among many columns needs to be traversed. SPL supports creating the value-attached index table, which stores certain fields redundantly, to make use of both storage formats. The original table using the columnar storage is used for traversal and the index table in row-wise storage for data searching.

SPL also offers an innovative segmentation technique to divide a table stored in a columnar structure conveniently, making parallel processing on such a able easily and giving full play to the performance advantages of multi-CPU(core) machines.

Let’s look at some typical SPL code examples:

1. To generate columnar data and row-oriented data:

Columnar data: file("T-c.ctx").create(…).append(cs)

The unique segmentation technique used at data generation makes it convenient to read the columnar data for multithreaded processing

2  Row-oriented data: file("T-r.ctx").create@r(…).append(cs)

2. To perform traversal operations:

Use row-oriented storage for traversal when a computation involves many columns or a relatively small number of columns:

=file("T-r.ctx").open().cursor@m().groups(f1,f2,f3,…;sum(amt1),avg(amt2),max(amt3+amt4),…) The code creates cursor for row-oriented data, perform grouping & aggregation using multithreaded processing.

Use columnar storage when the number of columns is large while only several of them are involved in a computation:

=file("T-c.ctx").open().cursor@m(f1,amt1).groups(f1;sum(amt1))

SPL’s original segmentation technique allows easy multithreaded processing on the columnar data.

3. To perform index-based searching:

Use row-oriented storage in most cases:

Create the index: file("T-r.ctx").open().index(index_id;id)

Pre-load the index table: T-r=file("T-r.ctx").open().index@3(index_id)

Loading index in advance makes query lighter and increase performance.

Use the index to search desired data: =T-r.icursor(…;id=10232 && …,index_id)

Use columnar storage and value-attached index table for performing searching and traversal respectively when only a few fields will be retrieved:

Create the index: file("T-c.ctx").open().index(index_id1;id;tdate,amt)

The code stores tdate field and amt field into the index table.

Pre-load the index table: T-c=file("T-c.ctx").open().index@3(index_id1)

Use the index to search desired data: =T-c.icursor(tdate,amt;id=10232 && …,index_id1)

Obtain high performance by retrieving tdate and amt directly from the index table at query without the need of accessing the original table. The original table in columnar structure is used for traversal and there is no need to generate row-oriented data, reducing hard disk usage.