Process a Large File in Parallel with a Two-line Script in Java
Keywords: Java, file retrieval, parallel processing
To speed up file retrieval, BufferedReader is a good choice. MappedByteBuffer can make it faster still, but only slightly.
The bottleneck lies in IO. Reading data from a hard disk is always time-consuming, and converting the data to objects takes time too.
Parallel processing can speed it up. But writing multithreaded code in Java is inconvenient, and you also need to work out how to divide the file so that each segment contains only complete records.
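The segmentation problem can be sketched as follows: cut the file at evenly spaced byte offsets, then push every interior offset forward to just past the next line break so that no record is split. The class and method names here (SegmentSplitter, splitAtLineBreaks) are hypothetical, and the sketch assumes one record per line terminated by '\n'.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

final class SegmentSplitter {
    /**
     * Returns parts + 1 byte offsets; segment i covers [bounds[i], bounds[i + 1]).
     * Each interior offset is moved forward to just past the next '\n',
     * so every segment starts at the beginning of a line.
     * Assumes parts >= 1 (hypothetical helper, not from the original article).
     */
    static long[] splitAtLineBreaks(Path file, int parts) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            long[] bounds = new long[parts + 1];
            bounds[parts] = length;
            for (int i = 1; i < parts; i++) {
                // Naive cut point, never before the previous boundary.
                long guess = Math.max(bounds[i - 1], length * i / parts);
                raf.seek(guess);
                int b;
                do {
                    b = raf.read(); // scan forward to the end of the current line
                } while (b != -1 && b != '\n');
                bounds[i] = raf.getFilePointer();
            }
            return bounds;
        }
    }
}
```

Segments may come out slightly uneven, or even empty for very small files, which is harmless as long as each worker only reads its own [start, end) range.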
An example: sum the order amounts for each customer. Below is a sample of the source data:
O_ORDERKEY  O_CUSTKEY  O_ORDERDATE  O_TOTALPRICE
10262       RATTC      1996-07-22   14487.0
10263       ERNSH      1996-07-23   43818.0
10264       FOLKO      2007-07-24   1101.0
10265       BLONP      1996-07-25   5528.0
10266       WARTH      1996-07-26   7719.0
10267       FRANK      1996-07-29   20858.0
10268       GROSR      1996-07-30   19887.0
10269       WHITC      1996-07-31   456.0
10270       WARTH      1996-08-01   13654.0
...
Expected result:
Java handles the multithreaded processing in the following way:
    ...
    final int DOWN_THREAD_NUM = 8;
    CountDownLatch doneSignal = new CountDownLatch(DOWN_THREAD_NUM);
    RandomAccessFile[] outArr = new RandomAccessFile[DOWN_THREAD_NUM];
    try {
        long length = new File(OUT_FILE_NAME).length();
        long numPerThred = length / DOWN_THREAD_NUM;
        long left = length % DOWN_THREAD_NUM;
        for (int i = 0; i < DOWN_THREAD_NUM; i++) {
            outArr[i] = new RandomAccessFile(OUT_FILE_NAME, "rw");
            ...
            if (i == DOWN_THREAD_NUM - 1) {
                new ReadThread(i * numPerThred, (i + 1) * numPerThred + left, outArr[i], keywords, doneSignal).start();
                ...
            } else {
                new ReadThread(i * numPerThred, (i + 1) * numPerThred, outArr[i], keywords, doneSignal).start();
                ...
            }
        }
    }
    ...
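The fragment above leaves out what each worker does with its byte range. The following is a minimal sketch of that part, not the original ReadThread: each task sums O_TOTALPRICE per O_CUSTKEY within its own range, and the partial maps are merged at the end. The names SegmentSum, sumRange and parallelSum are hypothetical. The sketch assumes whitespace-delimited ASCII data with a header on the first line, and offsets in bounds (at least two of them) that already fall on line starts.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SegmentSum {
    /** Sums the 4th column per 2nd column for the lines that start in [start, end). */
    static Map<String, Double> sumRange(Path file, long start, long end) throws IOException {
        Map<String, Double> totals = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(start);
            if (start == 0) {
                raf.readLine(); // assumption: the first line of the file is a header
            }
            String line;
            // RandomAccessFile.readLine is unbuffered; fine for a sketch, slow for real loads.
            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                String[] f = line.trim().split("\\s+");
                if (f.length < 4) {
                    continue; // skip blank or short lines
                }
                totals.merge(f[1], Double.parseDouble(f[3]), Double::sum);
            }
        }
        return totals;
    }

    /** Runs one task per segment and merges the partial sums, ordered by customer key. */
    static Map<String, Double> parallelSum(Path file, long[] bounds)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(bounds.length - 1);
        try {
            List<Future<Map<String, Double>>> parts = new ArrayList<>();
            for (int i = 0; i + 1 < bounds.length; i++) {
                final long start = bounds[i];
                final long end = bounds[i + 1];
                parts.add(pool.submit(() -> sumRange(file, start, end)));
            }
            Map<String, Double> result = new TreeMap<>();
            for (Future<Map<String, Double>> part : parts) {
                part.get().forEach((k, v) -> result.merge(k, v, Double::sum));
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

Giving each task its own map and merging afterwards avoids locking on a shared structure, at the cost of one extra pass over the partial results.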
Parallel processing is much easier to program with esProc. esProc encapsulates Java multithreading and offers large-file segmentation directly, so parallel processing takes less code and less expertise. For the above computing goal, a two-line esProc script gets it done. Users don’t need to set the parallel task count because esProc’s parallel processing option @m uses the number of cores as the default value.
     A
1    =file("/workspace/orders.txt").cursor@mt()
2    =A1.groups(O_CUSTKEY;sum(O_TOTALPRICE):AMOUNT)
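To make the meaning of the two lines concrete, here is a rough plain-Java analogue built on parallel streams. It is only an illustration of the grouping semantics, not how esProc works internally. It assumes that the @t option treats the first line as column names and that fields are separated by whitespace. StreamGroupSum and sumByCustomer are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StreamGroupSum {
    /** Groups rows by the 2nd column (O_CUSTKEY) and sums the 4th (O_TOTALPRICE). */
    static Map<String, Double> sumByCustomer(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines
                .skip(1)                                   // assumption: first line is the header
                .parallel()                                // let the stream framework split the work
                .map(line -> line.trim().split("\\s+"))
                .filter(f -> f.length >= 4)                // ignore blank or short lines
                .collect(Collectors.groupingBy(
                    f -> f[1],
                    TreeMap::new,                          // keep the result sorted by customer key
                    Collectors.summingDouble(f -> Double.parseDouble(f[3]))));
        }
    }
}
```

A call such as StreamGroupSum.sumByCustomer(Path.of("/workspace/orders.txt")) would return one total per customer key.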
With esProc SPL, users can conveniently process a large file in parallel from Java: not only retrieval, but also grouping, sorting, joins and more. Find more examples in:
Structured Text Computing with esProc
Structured Text File Processing in SPL (II)
esProc is integration-friendly. Read How to Call an SPL Script in Java to see how easily an SPL script can be embedded into a Java program.
Read Getting Started with esProc to download and install esProc, get a license for free and find related documentation.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL