Process a Large File in Parallel with a Two-line Script in Java

Keywords: Java, file retrieval, parallel processing

To speed up file retrieval in Java, BufferedReader is a good choice. MappedByteBuffer can be faster still, but only slightly.
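As a rough illustration of the MappedByteBuffer route, the sketch below maps a file into memory and counts its newline bytes. The class name MappedLineCount and the method countLines are made up for this example; the file is mapped in windows of at most Integer.MAX_VALUE bytes because a single mapping cannot be larger than that.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedLineCount {

    /** Counts the '\n' bytes in a file by scanning memory-mapped windows. */
    public static long countLines(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long lines = 0;
            long pos = 0;
            while (pos < size) {
                // One mapping is limited to Integer.MAX_VALUE bytes.
                long len = Math.min(Integer.MAX_VALUE, size - pos);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') {
                        lines++;
                    }
                }
                pos += len;
            }
            return lines;
        }
    }
}
```

Whether this beats a BufferedReader with a large buffer depends on the file and the machine, so it is worth measuring on real data before committing to it.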

The bottleneck lies in I/O. Reading data from the hard disk is always time-consuming, and converting the data to objects takes time too.

Parallel processing can speed this up, but writing multithreaded code in Java is inconvenient. You also need to work out how to divide the file so that each segment contains only complete records.
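One way to get segments that hold only complete records is to start from evenly spaced byte offsets and push each one forward to the start of the next line. The sketch below assumes one record per line terminated by '\n'; the class SegmentBoundaries and the method alignedOffsets are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class SegmentBoundaries {

    /**
     * Returns parts + 1 offsets; segment i covers bytes [offsets[i], offsets[i + 1]).
     * Every interior offset sits at the start of a line (or at end of file),
     * so no record is cut in half. Some segments may be empty for tiny files.
     */
    public static long[] alignedOffsets(String path, int parts) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            long[] offsets = new long[parts + 1];   // offsets[0] is 0
            offsets[parts] = length;
            for (int i = 1; i < parts; i++) {
                // Evenly spaced guess, never before the previous boundary.
                long guess = Math.max(offsets[i - 1], i * (length / parts));
                if (guess == 0) {
                    offsets[i] = 0;
                    continue;
                }
                // Start one byte early so a guess that already sits on a
                // line start is kept instead of skipping a whole line.
                raf.seek(guess - 1);
                int b;
                while ((b = raf.read()) != -1 && b != '\n') {
                    // scan forward to the end of the current line
                }
                offsets[i] = raf.getFilePointer();
            }
            return offsets;
        }
    }
}
```

For a 12-byte file containing "aaa\nbb\ncccc\n" split into two parts, the midpoint guess of 6 lands on the newline that ends "bb" and the boundary moves to 7, the first byte of "cccc".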

An example: sum the order amounts for each customer. Below is a sample of the source data:

O_ORDERKEY   O_CUSTKEY   O_ORDERDATE   O_TOTALPRICE
10262        RATTC       1996-07-22    14487.0
10263        ERNSH       1996-07-23    43818.0
10264        FOLKO       2007-07-24    1101.0
10265        BLONP       1996-07-25    5528.0
10266        WARTH       1996-07-26    7719.0
10267        FRANK       1996-07-29    20858.0
10268        GROSR       1996-07-30    19887.0
10269        WHITC       1996-07-31    456.0
10270        WARTH       1996-08-01    13654.0
...

Expected result: one row per customer, holding O_CUSTKEY and the sum of that customer's O_TOTALPRICE values as AMOUNT.
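For reference, a single-threaded version of this computation can be sketched with BufferedReader. It assumes the file starts with one header line and that the columns are separated by whitespace (tabs or spaces); the class OrderSumBaseline and the method sumByCustomer are illustrative names only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Map;
import java.util.TreeMap;

public class OrderSumBaseline {

    /** Sums O_TOTALPRICE (4th column) per O_CUSTKEY (2nd column). */
    public static Map<String, Double> sumByCustomer(Reader source) throws IOException {
        Map<String, Double> totals = new TreeMap<>();   // sorted by customer key
        try (BufferedReader in = new BufferedReader(source, 1 << 16)) {
            in.readLine();                              // assumption: first line is the header
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.trim().split("\\s+");
                if (f.length < 4) {
                    continue;                           // skip blank or malformed lines
                }
                totals.merge(f[1], Double.parseDouble(f[3]), Double::sum);
            }
        }
        return totals;
    }
}
```

Passing the nine sample rows above through sumByCustomer gives eight customers, with WARTH totalling 21373.0 (7719.0 + 13654.0). A file can be fed in with Files.newBufferedReader.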

Hand-written Java handles the multithreaded processing along these lines:

...
final int DOWN_THREAD_NUM = 8;
CountDownLatch doneSignal = new CountDownLatch(DOWN_THREAD_NUM);
RandomAccessFile[] outArr = new RandomAccessFile[DOWN_THREAD_NUM];
try {
    long length = new File(OUT_FILE_NAME).length();
    long numPerThread = length / DOWN_THREAD_NUM;   // bytes handled by each thread
    long left = length % DOWN_THREAD_NUM;           // leftover bytes, given to the last thread
    for (int i = 0; i < DOWN_THREAD_NUM; i++) {
        outArr[i] = new RandomAccessFile(OUT_FILE_NAME, "rw");
        ...
        if (i == DOWN_THREAD_NUM - 1) {
            // last thread: its range also covers the leftover bytes
            new ReadThread(i * numPerThread, (i + 1) * numPerThread + left, outArr[i], keywords, doneSignal).start();
            ...
        } else {
            new ReadThread(i * numPerThread, (i + 1) * numPerThread, outArr[i], keywords, doneSignal).start();
            ...
        }
    }
}
...
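The fragment above hands each thread a raw byte range, so every worker still has to cope with records that straddle a range boundary. The sketch below is one possible way to do that, not the ReadThread from the fragment: each worker discards the partial line at the start of its range and processes every line that begins inside it. It assumes a header line and whitespace-separated columns, and uses a thread pool instead of CountDownLatch; the names ParallelOrderSum, sumRange and sumByCustomer are hypothetical. RandomAccessFile.readLine is unbuffered and is used here only to keep the example short.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelOrderSum {

    /** Sums O_TOTALPRICE per O_CUSTKEY for every line that starts in [start, end). */
    public static Map<String, Double> sumRange(String path, long start, long end) throws IOException {
        Map<String, Double> totals = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (start == 0) {
                raf.readLine();          // assumption: the first line is the header
            } else {
                raf.seek(start - 1);     // step back one byte so a line that begins
                raf.readLine();          // exactly at 'start' is not thrown away
            }
            String line;
            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                String[] f = line.trim().split("\\s+");
                if (f.length < 4) {
                    continue;            // skip blank or malformed lines
                }
                totals.merge(f[1], Double.parseDouble(f[3]), Double::sum);
            }
        }
        return totals;
    }

    /** Splits the file into 'threads' byte ranges and merges the partial sums. */
    public static Map<String, Double> sumByCustomer(String path, int threads) throws Exception {
        long length = new File(path).length();
        long chunk = length / threads;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, Double>>> parts = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                final long start = i * chunk;
                // the last range runs to the end of the file
                final long end = (i == threads - 1) ? length : start + chunk;
                parts.add(pool.submit(() -> sumRange(path, start, end)));
            }
            Map<String, Double> result = new HashMap<>();
            for (Future<Map<String, Double>> part : parts) {
                part.get().forEach((k, v) -> result.merge(k, v, Double::sum));
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because a line belongs to the range in which its first byte falls, each record is counted exactly once regardless of how many threads are used.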

Parallel processing is much easier to program with esProc. esProc encapsulates Java's multithreading and offers large-file segmentation directly, so parallel code is simple to write and demands comparatively little skill. For the computing goal above, a two-line esProc script is enough. Users don't need to set the number of parallel tasks, because the parallel processing option @m uses the number of cores by default.


     A
1    =file("/workspace/orders.txt").cursor@mt()
2    =A1.groups(O_CUSTKEY;sum(O_TOTALPRICE):AMOUNT)

With esProc SPL, users can conveniently process large files in parallel from Java: not only retrieval, but also grouping, sorting, joins, and so on. Find more examples in:

Structured Text Computing with esProc

Structured Text File Processing in SPL (II)

esProc is integration-friendly. Read How to Call an SPL Script in Java to see how easily an SPL script can be embedded in a Java program.

Read Getting Started with esProc to download and install esProc, get a free license, and find related documentation.