"【 Question 】 I am trying to select top values in a column based on the variable/field in another .."

blackduckie RaqForum 28 No.
540 View • 4 Years ago

Group & Aggregate & Union

text(125)

【Question】

I am trying to select top values in a column based on the variable/field in another column. It is a very large tab delimited file.

Input:

Names col2 col3 col4

A A1 def 10

A A1 BBB 10

A A1 CED 10

A A1 fff 7.5

B B1 def 7.5

B B1 dff 5.5

B B1 fff 4.5

C C1 ggg 6.5

C C3 iii 6.5

C C2 hhh 5.0

C C4 toi 6.5

D D1 xyz 10.0

D D2 ikj 7.5

D D3 abc 7.5

...

Output:

Names col2 col3 col4

A A1 def 10

A A1 BBB 10

A A1 CED 10

B B1 def 7.5

C C1 ggg 6.5

C C3 iii 6.5

C C4 toi 6.5

D D1 xyz 10.0

Basically, I want all the rows with values 10 and the top values for each of the names in column1. Any inputs to solve this by Perl, Awk or sed are well appreciated. Thanks.

【Answer】

Your question is one of the typical structured computations. Shell takes a roundabout route to do this. A faster and more direct way is using SPL (Structured Process Language):

	A
1	=file("file.txt").import@t()
2	=A1.group(Names)
3	=A2.((a=~.max(col4),~.select(col4==a\|\| col4==10)))
4	=A3.union()

A1: Read in content of file.txt.

undefined

A2: Group records by Names.

undefined

A3: A2.(…) runs a loop to process each group of data in A2. (a=~.max(col4),~.select(col4==a|| col4==10)) calculates expression in order and returns result of the last expression. a=~.max(col4) returns the top values in col4 in each group to variable a. ~.select(col4==a|| col4==10) selects records where col4 value is a or 10, which are the final result.

undefined