Text File Grouping & Aggregation

Question

I have a file in the following format:

45  a

36  a

58  a

21  a

31  b

12  b

23  b

21  c

32  c

45  c

I need to first count lines containing value a, output the result to a target file and write values of the first field in these lines after that. Then I go on to handle lines containing value b and then, lines containing value c, in the same way.

Expected result:

a 4

45

36

58

21

b 3

31

12

23

c 3

21

32

45

What I did is to traverse the file to count lines respectively containing a, b, and c, hold the results in a temporary file, and then re-traverse the file and read the temporary file to write the desired values to target file in the specified format. I am wondering if there’s any simple method to do this, like traversing the source file once only.

 

Answer

It’s simple grouping & aggregation. High-level languages lack class libraries that handle SQL-like calculations. So it’s very inconvenient to write the code directly based on a text file. It’s easy to handle such a calculation in SPL (Structured Process Language), however.

If your file can fit into the memory, we just need to traverse it once. Below is the SPL script:

A

B

1

=file("E:\\source.txt")

=file("E:\\result.txt")

2

=A1.import().group(_2)

3

for A2

>B1.write@a(A3._2+"\t"+string(A3.len()))

4

>B1.export@a(A3,_1)

A1,B1: Open the to-be-handled file and the target export file.

A2: Import source.txt and group and sort records by the second column.

undefined

A3-B4: Loop through each member of A2’s sequence (which is A3’s current value) to count its records, generate a new record to write it to result.txt, and export values of A3’s first column to result.txt.

If the file is too large to be loaded into memory, you need a temporary file to hold the intermediate result. SPL has the built-in mechanism to auto-generate a temporary file. The script for handling a large file is similar:

A

B

1

=file("E:\\source.txt")

=file("E:\\result.txt")

2

=A1.cursor().sortx(_2)

3

for A2;_2

>B1.write@a(A3._2+"\t"+string(A3.len()))

4

>B1.export@a(A3,_1)

A1,B1: Open the to-be-handled file and the target export file.

A2: Import source.txt as a cursor and sort its records by the second column. sortx() function will generate a temporary file to hold the sorting result.

A3-B4: Loop through A2’s sequence to read in records having same second column value at a time, generate a new record to write it to result.txt, and export values of A3’s first column to result.txt.