An Easy Way of Handling Large Text Files with Parallel Processing

Keywords: large text file, parallel processing

Multicore CPUs in contemporary computers provide the hardware to speed up large file processing through parallelism, but writing a parallel program in a general-purpose programming language is not easy.

Parallel processing requires dividing the source file so that each thread handles one part. In a text file a line usually makes a record, but line lengths vary. Dividing by number of lines is therefore impractical: locating the Nth line requires a traversal from the beginning of the file, which eats into the performance gain. Dividing by bytes needs no traversal but raises another problem. A section boundary may fall in the middle of a line, so the line is split across two sections and the data becomes inconsistent. One solution is a segmentation method that automatically hands the partial head line of a section back to the previous section. That is, the ending line of each section is retained whole, and the partial beginning line of the next section is given up. This ensures that each section covers only complete lines and that the data stays consistent.
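The segmentation rule described above can be sketched in Python. This is only an illustration of the idea, not how SPL implements it; the helper names segment_bounds and read_segment are hypothetical, and the sketch assumes newline-terminated records read in binary mode so that byte offsets are exact.

```python
def segment_bounds(size, n):
    """Split the byte range [0, size) into n ranges of nearly equal length."""
    return [(size * i // n, size * (i + 1) // n) for i in range(n)]


def read_segment(path, start, end):
    """Return the complete lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and discard through the next newline.
            # If the boundary cut a line, its tail is given up here
            # (the previous section has already read that line whole).
            # If the boundary sits exactly on a line start, only the
            # preceding newline is consumed and nothing is lost.
            f.seek(start - 1)
            f.readline()
        # A line that begins before `end` is read to its end,
        # even when it runs past the section boundary.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\r\n"))
    return lines
```

Reading every section returned by segment_bounds in order and concatenating the results should reproduce the file's lines once each, with no line split or duplicated, whatever the number of sections.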

Thread control and management is another problem. Mismanagement can easily result in errors such as out-of-bounds access.

The division and thread management problems go away if we use esProc SPL to do the job. SPL (Structured Process Language) encapsulates the multithreaded algorithm, producing short, easy-to-understand programs. It delivers high performance while letting programmers focus on the overall computation rather than being distracted by technical details. Below is an example of an SPL parallel processing program:


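|   | A                  | B                                                              | C                                                            |
|---|--------------------|----------------------------------------------------------------|--------------------------------------------------------------|
| 1 | =file("data.txt")  | /Source file                                                   |                                                              |
| 2 | fork 4             | =A1.cursor@t(amount;A2:4)                                      | /Divide the file into 4 sections and create a cursor on each |
| 3 |                    | =B2.groups(;sum(amount):am)                                    | /Traverse the cursor to sum amounts                          |
| 4 | =A2.conj().sum(am) | /Concatenate the results of the threads and calculate the total |                                                              |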
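For comparison, here is a rough Python sketch of what the script above has the runtime do: split the data into byte ranges, let each worker sum the amount column over the complete lines in its range, then combine the partial sums. It is an analogy under stated assumptions, not SPL's implementation. The names sum_segment and parallel_sum are hypothetical, and the file is assumed to be tab-separated with a title line, as the @t option suggests.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def sum_segment(path, start, end, col):
    """Sum field `col` over the lines whose first byte lies in [start, end)."""
    total = 0.0
    with open(path, "rb") as f:
        # start is never 0 here because the title line precedes the data.
        # Discard the tail of any line cut by the boundary.
        f.seek(start - 1)
        f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if line.strip():  # skip blank lines
                fields = line.rstrip(b"\r\n").split(b"\t")
                total += float(fields[col].decode())
    return total


def parallel_sum(path, column, workers=4):
    """Sum one named column of a tab-separated file using `workers` threads."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.readline().rstrip(b"\r\n").decode().split("\t")
        data_start = f.tell()  # first byte after the title line
    col = header.index(column)
    span = size - data_start
    bounds = [data_start + span * i // workers for i in range(workers + 1)]
    # The pool creates, schedules and joins the threads for us.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(
            lambda i: sum_segment(path, bounds[i], bounds[i + 1], col),
            range(workers)))
    return sum(partials)  # combine the per-thread results
```

A call such as parallel_sum("data.txt", "amount", 4) mirrors the four-way split in the script. A thread pool is used here for brevity; in CPython, CPU-bound parsing would more likely benefit from a process pool, which is a design choice outside the scope of this sketch.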

Parsing a file often takes much longer than processing it, so parallelizing the parsing takes priority. SPL provides a built-in option to retrieve data with parallel processing, which makes code for order-irrelevant operations, such as grouping and summing, rather easy to write:


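|   | A                                     | B                                                                                               |
|---|---------------------------------------|-------------------------------------------------------------------------------------------------|
| 1 | =file("orders.txt").cursor@mt()       | /The @m option automatically chooses the number of threads according to the system configuration |
| 2 | =A1.select(month(Date)==10)           | /Filtering                                                                                      |
| 3 | =A2.groups(ID;sum(COST*WEIGHT):VALUE) | /Group and aggregate with serial processing                                                     |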
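Why grouping and summing count as order-irrelevant can be shown with a small Python sketch: each chunk of rows produces a partial grouped sum, and the partial results can be merged in any order. The function names and the row layout (dictionaries with ID, Date, COST and WEIGHT keys, dates in ISO format) are assumptions made for illustration; only the field names come from the script above.

```python
from datetime import date


def october_value_by_id(rows):
    """Partial result for one chunk: ID -> sum(COST * WEIGHT) over October rows."""
    partial = {}
    for r in rows:
        # Same test as month(Date)==10: the year is ignored.
        if date.fromisoformat(r["Date"]).month == 10:
            partial[r["ID"]] = partial.get(r["ID"], 0) + r["COST"] * r["WEIGHT"]
    return partial


def merge(partials):
    """Add up partial grouped sums; the order of the partials does not matter."""
    total = {}
    for p in partials:
        for key, value in p.items():
            total[key] = total.get(key, 0) + value
    return total
```

Because addition is commutative and associative, merging the partial results of the chunks gives the same groups and totals as one serial pass. This is exact for integer values; with floating-point values the last digits may vary with the order of addition.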

Real-world businesses have many large file processing scenarios, and esProc SPL handles them conveniently. More examples can be found in Structured Text Computations with esProc.

esProc is a file processor that conveniently handles data loading, database export and mixed computations over various types of files, including TXT, Excel, XML, JSON, CSV and INI. The desktop tool is ready to use, simple to configure and convenient to debug: it supports breakpoints and step-by-step execution, during which you can view the result of each step. With a powerful yet simple syntax that matches the human way of thinking, esProc is more convenient to use than high-level languages. Read Data File Processor for details.

SPL integrates easily with Java programs. Read How to Call an SPL Script in Java for details.

To learn how to work with esProc, read Getting Started with esProc.