Process a Large File by Group with the Cursor

Question

I have a large (~100GB) text file structured like this:

A,foobar
A,barfoo
A,foobar
B,barfoo
B,barfoo
C,foobar

Each line is a comma-separated pair of values. The file is sorted by the first value in the pair. The lines vary in length. A group is the set of lines that share the same first value: all lines starting with "A," form one group, and all lines starting with "B," form another.

The entire file is too large to fit into memory, but all the lines of any individual group will always fit into memory.

I have a routine for processing a single such group of lines and writing the result to a text file. My problem is that I don't know how best to read the file one group at a time. The groups are of arbitrary, unknown size. I have considered two ways:

1) Scan the file using a BufferedReader, accumulating the lines of a group in a String or array. Whenever a line is encountered that belongs to a new group, hold that line in a temporary variable and process the previous group. Then clear the accumulator, add the temporary line, and continue reading the new group starting from its second line.

2) Scan the file using a BufferedReader. Whenever a line is encountered that belongs to a new group, somehow reset the cursor so that the next call to readLine() starts from the first line of the group instead of the second. I have looked into mark() and reset(), but these require knowing the byte position of the start of the line.

I'm going to go with (1) at the moment, but I would be very grateful if someone could suggest a method that smells less.

 

Answer

The logic of the algorithm is straightforward: while (thereAreGroupsRemaining) { String s = readNextGroup(); process(s); }. In Java, however, you have to handle many details yourself, and the code is not simple. Here is the algorithm expressed in SPL:

     A                                      B
1    =file("e:\\bigfile.txt").cursor@c()
2    for A1;_1
3                                           =A2.select(like(_2,"foo*"))

A1: Open bigfile.txt as a cursor, so the file is read incrementally instead of being loaded all at once. The @c option uses the comma as the column separator.

A2: Loop over the cursor one group at a time. Each iteration reads into memory all consecutive records that share the same value in column 1. _1 represents column 1.

B3: Process the current group. As an example query, select the records in the group whose second column (_2) starts with "foo".

The code expresses the algorithm clearly and concisely. A2 is equivalent to while (thereAreGroupsRemaining) together with readNextGroup(). B3 is equivalent to process(s), where s is A2, the records of the current group.
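
For comparison, approach (1) from the question could be sketched in plain Java roughly as follows. This is only a minimal sketch under stated assumptions, not code from the question or from SPL: the class name GroupReader, the helpers keyOf and forEachGroup, and the inline sample data are all hypothetical, and the handler simply counts the lines whose second field starts with "foo".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical helper class; the names are illustrative only.
class GroupReader {

    // Returns the text before the first comma, or the whole line if there is no comma.
    static String keyOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }

    // Reads a file sorted by key and hands each complete group to the handler.
    // Only one group is held in memory at a time.
    static void forEachGroup(BufferedReader reader,
                             BiConsumer<String, List<String>> handler) throws IOException {
        String currentKey = null;
        List<String> group = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no data and are skipped
            }
            String key = keyOf(line);
            if (currentKey != null && !key.equals(currentKey)) {
                handler.accept(currentKey, group); // previous group is complete
                group = new ArrayList<>();         // start accumulating the new group
            }
            currentKey = key;
            group.add(line);
        }
        if (currentKey != null) {
            handler.accept(currentKey, group); // flush the last group
        }
    }

    public static void main(String[] args) throws IOException {
        // Inline sample standing in for the large file.
        String sample = "A,foobar\nA,barfoo\nA,foobar\nB,barfoo\nB,barfoo\nC,foobar\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            forEachGroup(reader, (key, lines) -> {
                long matches = lines.stream()
                        .filter(l -> l.substring(l.indexOf(',') + 1).startsWith("foo"))
                        .count();
                System.out.println(key + ": " + lines.size() + " lines, "
                        + matches + " with a second field starting with foo");
            });
        }
    }
}
```

For a real file, the StringReader would be replaced by something like Files.newBufferedReader on the file's path. The line that opens a new group is added directly to a fresh list, so no separate temporary variable or cursor reset is needed.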

The SPL script is easy to integrate into a Java program. Details are explained in How to Call an SPL Script in Java.