Which Columnar Storage Scheme Is Best Suited to Parallel Processing?
There will be time-consuming hard disk scanning and reading when the volume of data to be processed is large. Columnar storage, used when there are a lot of columns but only a few will be retrieved for the computation, can reduce hard disk accesses and enhance performance. That’s why many data warehouse products use columnar storage.
Yet there is the issue of non-synchronized segmentation with columnar data when we are trying to handle it with multithreaded processing. Dividing a table into almost even segments is a prerequisite for parallel processing. It is simple to segment a table using row-oriented storage. We just divide it according to the data size as evenly as possible and mark the border between two segments with the record ending marker. It is a different thing with a table using the columnar storage, where columns are individually stored and thus need to be divided separately. Moreover, the ending markers of corresponding segments for different columns may not locate the same record because field values have indefinite lengths and data is compressed, causing mismatched data retrieval.
The popular way of segmenting columnar data for data analysis is partition-based strategy – which divides data into partitions where data is stored in columnar structure and won’t be further divided, and that data is then divided based on partitions. The number of partitions should be as many as possible to try to ensure an even and flexible division that is critical to high-performance multithreaded processing, and each partition should be as big as possible to avoid splitting each column into a lot of discontinuous partitions, which will cause reading the extra small amounts of useless data between partitions and more HDD seek time; and these problems will become more serious as the number of partitions increases. This creates a contradiction or dilemma. It is the inability to dissolve the contradiction that keeps many data warehouse products and big data platforms from making good use of parallel processing to boost performance.
The open-source esProc SPL offers double increment segmentation technique and applies it in its column-wise stored composite table to successfully get out of the above dilemma, allowing parallel processing to give full play to its performance advantage.
The technology aims to transform the static (physical) segmentation into a dynamic (logical) one. There are a set of steps. Create an index area of fixed size (say there are 1024 index partitions) for each column, where each index partition stores the start position of a record and defines one partition per record. Then append records to the index partitions until the index area is filled up and rewrite the index area by discarding partitions at the even positions, pushing those at the odd positions forward and emptying the second half. This is equivalent to halve the number of partitions to 512 with two records constituting one . We then repeat the same actions to rewrite the index area again and again. As the volume of data increases, the size of partition (the number of records in each partition) doubles. Index areas of all columns will be concurrently filled and rewritten to stay consistent forever. In essence, the double increment segmentation technique divides a table using a columnar storage based on the number of records instead of bytes, ensuring concurrent partitioning and avoiding mismatching even if columns are segmented separately.
The dynamic segmentation technique can be as flexible as expected by maintaining the number of dynamic partitions between 512 and 1024 (except that the number of records is below 512). The dynamic partitions for columns always contain the same number of records, making even segmentation achievable. The technique can achieve desired segmentation effect whether the original data table is large or small.
SPL gives built-in support for the double increment segmentation technique on its composite tables. It is rather simple to generate the segmented columnar data in SPL:
file("T.ctx").create(…).append(cs)
It is also simple to code multithreaded processing on columnar data:
=file("T.ctx").open().cursor@m(f1,amt1).groups(f1;sum(amt1))
The statement opens the table in columnar structure, creates a multicursor and performs grouping & aggregation with parallel processing.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version