A Brief Overview of Data Segmentation Techniques

Contemporary computers often feature multiple CPU cores, and the increasingly popular SSDs also offer strong concurrency. These hardware capabilities provide a solid foundation for parallel computing. But a convenient data segmentation technique is equally important for performing parallel computing. The technique divides the to-be-computed data into a number of segments to be handled separately by different threads (or processes; here we discuss the segmentation strategy in a multi-threaded environment, and the multi-process case is similar).

There are 4 targets to hit when designing a data segmentation strategy:

1. Almost the same amount of data in each segment.

The time taken to execute a computation in parallel is determined by the slowest thread. Since threads on one machine have almost equal processing power, dividing data as evenly as possible gives each thread nearly the same computing time.

2. Flexible and dynamic number of segments.

At the data preparation stage, the number of CPUs of the machine where the computation will be executed is often unknown. Even if it is known in advance, the number of threads can't be decided simply from the number of CPU cores, because hard disk concurrency is not as good as CPU concurrency. Besides, the number of CPU cores that will actually handle the computing task can't be known beforehand. It would be best to be able to set the number of computing threads dynamically, possibly from a few to scores, according to the computing scenario being handled, which requires a technique that can split data into any number of segments the computation needs.

3. Contiguously and integrally stored data for each segment.

Because hard disks respond slowly to random, high-frequency accesses (even SSDs perform poorly on random, high-frequency, small-amount accesses), it would be best if the data to be processed by each thread is stored as contiguously as possible, avoiding frequent seek actions of a hard disk drive.

4. Data appending allowed.

Data isn't changeless but grows over time. It is convenient if there's no need to reorganize all the data every time new data arrives, but simply to append it to the existing data.

Storing data in text file format can achieve all 4 targets. We just need to divide a file into multiple segments according to the number of bytes and have each thread read one of them.
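As a rough sketch of this idea (the class and method names below are our own, for illustration only), the segmentation points can be computed as plain byte offsets:

    import java.io.File;

    public class SegmentBounds {
        // Returns segment boundaries: segment i covers bytes
        // [offsets[i], offsets[i + 1]) of the file.
        public static long[] bounds(File file, int segments) {
            long size = file.length();
            long[] offsets = new long[segments + 1];
            for (int i = 0; i <= segments; i++) {
                // Multiplying before dividing spreads the remainder evenly,
                // so segment sizes differ by at most one byte (target 1),
                // and `segments` can be any number the computation needs
                // (target 2).
                offsets[i] = size * i / segments;
            }
            return offsets;
        }
    }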

Text files use the carriage return to separate records (rows). This causes no ambiguity, because the data in a text file can't itself contain a carriage return character. When a file is divided by bytes, a segmentation point is likely to fall in the middle of a record. We handle this with the method of "complementing the tail record but discarding the head record": the retrieval of a segment doesn't stop at the segmentation point but continues until it hits the first carriage return after it, and the retrieval of the next segment likewise starts from the first carriage return after the segmentation point. This ensures that each segment contains only complete records (rows). The method is also commonly used by Hadoop.
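A minimal sketch of this adjustment in Java, assuming records terminated by line breaks (the class name is hypothetical, and RandomAccessFile.readLine is used for brevity, though it treats bytes as characters):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    public class SegmentReader {
        // Reads the records belonging to the byte range [start, end).
        // Head rule: if the segment doesn't start at byte 0, discard
        // everything up to the first line break at or after start - 1;
        // the (possibly partial) head record belongs to the previous segment.
        // Tail rule: the last record begun before `end` is read to
        // completion, even if it extends past `end`.
        public static List<String> readSegment(RandomAccessFile f,
                long start, long end) throws IOException {
            List<String> records = new ArrayList<>();
            if (start > 0) {
                f.seek(start - 1);
                f.readLine(); // discard head record (or just the line break)
            } else {
                f.seek(0);
            }
            String line;
            while (f.getFilePointer() < end && (line = f.readLine()) != null) {
                records.add(line); // each record complete; tail may cross `end`
            }
            return records;
        }
    }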

The problem is that text parsing is extremely slow, so binary storage needs to be considered.

Binary data doesn't have a special character, like the carriage return, that can separate records. Any byte position may be the data itself, making it impossible to know where a record ends. If we were to create a separator deliberately, it would have to be sufficiently long to avoid duplicating real data. But adding a number of bytes to each record introduces a lot of meaningless data and reduces performance. Worse, the method can only reduce the risk of misidentification; it can't eliminate it.
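To see why, a back-of-the-envelope calculation (with assumed figures) estimates the expected number of accidental matches of an n-byte separator in uniformly random data:

    public class SeparatorRisk {
        public static void main(String[] args) {
            long dataBytes = 1L << 40; // 1 TB of data, an assumed figure
            for (int n : new int[] {4, 6, 8}) {
                // Each byte position matches a random n-byte separator with
                // probability 256^-n, so over the whole file we expect about:
                double expected = dataBytes / Math.pow(256, n);
                System.out.printf("%d-byte separator: ~%.1e accidental matches%n",
                        n, expected);
            }
        }
    }

However long the separator, the expectation never reaches zero, and real data is rarely uniformly random, so the risk remains.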

A better alternative is the block-based segmentation strategy, which stores data in a number of fixed-size blocks and divides it by blocks. If there are enough blocks, their distribution among threads will be relatively balanced, which covers targets 1 and 2 but not target 3. The size of blocks is decided before data is stored in them, so a block is unlikely to precisely match the length of a certain number of complete records. Making each block store only complete records may waste space (the leftover space at the end of a block is wasted when it can't accommodate a complete record). When blocks are small but records have many fields, the waste becomes serious, undermining the integrity that target 3 covers. If, instead, a record is allowed to be split and stored in two blocks, the block-based segmentation becomes invalid, because a segment may then process only part of a record.
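For the balancing part, a minimal sketch (names are hypothetical) of dealing contiguous runs of blocks to threads:

    public class BlockSegments {
        // Returns, for each segment, the index of its first block;
        // segment i covers the contiguous run [first[i], first[i + 1]).
        public static int[] firstBlocks(int blockCount, int threads) {
            int[] first = new int[threads + 1];
            for (int i = 0; i <= threads; i++) {
                // With many more blocks than threads, segment sizes differ
                // by at most one block, keeping workloads balanced (target 1)
                // while `threads` stays freely adjustable (target 2).
                first[i] = (int) ((long) blockCount * i / threads);
            }
            return first;
        }
    }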

In this case, we can borrow the "complementing the tail record but discarding the head record" adjustment used for text data and allow one record to be split and stored in two blocks. When retrieving the first block in a segment, skip the first record if it is incomplete; when retrieving the last block, continue reading into the next block until the last record of the current block is retrieved completely. This ensures the data integrity of each segment. The method requires marks indicating whether the first record in a block is the continuation of the last record in the previous block, and whether the last record in the block is complete. The marks don't occupy much space, but data traversal is frequently interrupted by checking them, which is troublesome and affects performance.
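A minimal sketch of these marks and of the adjusted traversal; the Block type and its fields are hypothetical stand-ins for the storage format:

    import java.util.List;

    class Block {
        boolean firstRecordContinued; // head bytes continue the previous block's tail record
        boolean lastRecordComplete;   // false if the tail record spills into the next block
        byte[] data;
    }

    public class MarkedBlockReader {
        // Processes the blocks assigned to one segment, blocks[first .. last - 1],
        // applying the "discard the head, complement the tail" rule at both edges.
        public static void readSegment(List<Block> blocks, int first, int last) {
            if (first >= last) return; // empty segment
            // If the segment's first block opens mid-record, that head record
            // was already completed (and processed) by the previous segment.
            boolean skipHead = blocks.get(first).firstRecordContinued;
            int b = first;
            while (b < last
                    || (b < blocks.size() && !blocks.get(b - 1).lastRecordComplete)) {
                Block block = blocks.get(b);
                // A real implementation would decode block.data here, dropping
                // the partial head record when skipHead is true and, in blocks
                // past last - 1, consuming only the tail record's continuation.
                // It is exactly these per-block checks that keep interrupting a
                // plain sequential traversal and cost performance.
                skipHead = false;
                b++;
            }
        }
    }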

Databases also use the block-based segmentation strategy. But since they store the data of all tables together as a whole, their block allocation algorithm can't ensure the continuity of blocks belonging to one table. To improve data continuity, blocks need to be bigger, which conflicts with the rule of dividing data into as many blocks as possible. And taking the appendability target into account, an ever-growing index table is needed to manage these blocks; when blocks are numerous, it isn't easy to ensure data continuity for the index table itself (the length of the index table can't be predicted, as it grows dynamically while data is continuously appended).