Group & Aggregate & Union

Question

I am trying to select top values in a column based on the variable/field in another column. It is a very large tab delimited file.

Input:

Names col2 col3 col4

A    A1   def  10

A    A1   BBB  10

A    A1   CED  10

A    A1   fff  7.5

B    B1   def  7.5

B    B1   dff  5.5

B    B1   fff  4.5

C    C1   ggg  6.5

C    C3   iii  6.5

C    C2   hhh  5.0

C    C4   toi  6.5

D    D1   xyz  10.0

D    D2   ikj  7.5

D    D3   abc  7.5

... 

 

Output:

Names col2 col3 col4 

A    A1   def  10  

A    A1   BBB  10 

A    A1   CED  10 

B    B1   def  7.5 

C    C1   ggg  6.5 

C    C3   iii  6.5 

C    C4   toi  6.5 

D    D1   xyz  10.0 

 

Basically, I want all the rows with values 10 and the top values for each of the names in column1. Any inputs to solve this by Perl, Awk or sed are well appreciated. Thanks.

 

Answer

Your question is one of the typical structured computations. Shell takes a roundabout route to do this. A faster and more direct way is using SPL (Structured Process Language):

A

1

=file("file.txt").import@t()

2

=A1.group(Names)

3

=A2.((a=~.max(col4),~.select(col4==a||   col4==10)))

4

=A3.union()

 

A1: Read in content of file.txt.

undefined

A2: Group records by Names.

 

undefined

A3: A2.(…) runs a loop to process each group of data in A2. (a=~.max(col4),~.select(col4==a|| col4==10)) calculates expression in order and returns result of the last expression. a=~.max(col4) returns the top values in col4 in each group to variable a. ~.select(col4==a|| col4==10) selects records where col4 value is a or 10, which are the final result.

undefined

A4: Union groups in A3.

undefined

You can also execute the SPL script at the command line under Unix/Linux.