Structured Data Processing over Text Data - Data Search, Grouping & Concatenation

Question

Really struggling to work this out…

I have a text file with data like this (17000 lines of it):

45226 1

45226 1

45226 1

45226 3

45226 5

23470 1

45226 5

45226 5

29610 4

37417 2

37417 3

37948 1

 

What I want to do is sort the text file (using Java) so that the left numbers are grouped together when the right value is 1, and grouped separately when the right value is not 1 (so any other number). For example (but it doesn’t have to be like this):

 

3 x 45226 1

4 x 45226 MIXED

1 x 23470 1

1 x 29610 MIXED

2 x 37417 MIXED

1 x 37948 1

 

Do I need to use an array? Or some kind of sort? I just can’t work it out. Any help, code or suggestions will be greatly appreciated!

 

Answer

This is a simple structured computation. First, split the records into those whose second column is 1 and those whose second column is not 1. Second, group each set by the number in the first column and count the records in each group. Third, add a “1” column to the groups from the first set and a “MIXED” column to the groups from the second set. Finally, concatenate the two result sets.

Simple as the algorithm is, it takes a fair amount of hand-written code in Java, because the standard library has no table-oriented data structure for this kind of structured computation. You have to code the filtering, grouping, column-appending and concatenation steps yourself. An alternative is to handle it in SPL (Structured Process Language), as shown below:

     A                                                         B
1    =file("E:/lines.txt").import()
2    =A1.select(_2==1)                                         =A1.select(_2!=1)
3    =A2.groups(_1;count(_1):value)                            =B2.groups(_1;count(_1):value)
4    =A3.new(concat(value)+"x"+concat(_1):value,"1":tag)       =B3.new(concat(value)+"x"+concat(_1):value,"MIXED":tag)
5    =A4|B4
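
For comparison, here is a minimal plain-Java sketch of the same four steps (split on the second column, group by the first, tag, concatenate). The class name GroupLines and the method summarize are hypothetical names chosen for illustration. The sketch assumes each non-empty line holds two whitespace-separated integers, prints groups in ascending order of the left number, and uses the "3 x 45226 1" layout from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the names are illustrative only.
final class GroupLines {

    // Takes lines of the form "<id> <flag>" and returns one summary line per group:
    // first the ids whose flag is 1, then the ids whose flag is anything else (MIXED).
    static List<String> summarize(List<String> lines) {
        // TreeMap keeps ids in ascending order (an assumption about the desired order).
        Map<Integer, Integer> ones = new TreeMap<>();
        Map<Integer, Integer> mixed = new TreeMap<>();

        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            int id = Integer.parseInt(parts[0]);
            int flag = Integer.parseInt(parts[1]);
            // Steps 1 and 2: pick the target set by flag, then count per id.
            (flag == 1 ? ones : mixed).merge(id, 1, Integer::sum);
        }

        // Steps 3 and 4: append the tag and concatenate both result sets.
        List<String> result = new ArrayList<>();
        ones.forEach((id, n) -> result.add(n + " x " + id + " 1"));
        mixed.forEach((id, n) -> result.add(n + " x " + id + " MIXED"));
        return result;
    }
}
```

To run it against the file, something like GroupLines.summarize(Files.readAllLines(Paths.get("E:/lines.txt"))) should work, with java.nio.file.Files and java.nio.file.Paths imported. Counting in two maps during a single pass avoids holding the filtered copies of the 17000 lines in memory.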

 

esProc SPL is intended to process structured data, and an SPL script can integrate with a Java application via esProc JDBC. For more information, see How to Call an SPL Script in Java.