Group & Get the Record Holding the Max Value

Question

I have an input file:

1A   Traes_1AS_6052071D9.1      99.01      101  99.0

1A   Traes_1DS_6BA87D1DA.1   96.04      101  99.0

1A   Traes_1BS_480915AD0.1     94.06      101  99.0

1B   Traes_1AS_49D585BA6.2    99.01      101  72.0

1B   Traes_1BS_47F027BBE.2     98.02      101  89.0

1B   Traes_1DS_3F816B920.1      97.03      101  92.0

1C   Traes_1AS_3451447E0.1      99.01      101  97.0

1C   Traes_1BS_9F243CEA6.2     92.93      99    97.0

1C   Traes_1DS_2A6443F45.1     89.90      99    97.0

 

I need to:

1. Group and iterate inside each line[0];

2. Sort line[4] from lowest to highest value and take the highest value;

3. If they are similar, print the results choosing the one that has highest value in line[2], so that my output file looks like this:

Required output:

1A  Traes_1AS_6052071D9.1   99.01   101 99.0   

1B  Traes_1DS_3F816B920.1   97.03   101 92.0   

1C  Traes_1AS_3451447E0.1   99.01   101 97.0   

 

This is my try, but it only takes according to the highest line[4]:

 

import csv

from itertools import groupby

from operator import itemgetter

with open('my_file','rb') as f1:

with open('out_file', 'wb') as f2:

    reader = csv.reader(f1, delimiter='\t')

    writer1 = csv.writer(f2, delimiter='\t')

    for group, rows in groupby(reader, itemgetter(0)):

        seen = set()

        rows = sorted(rows, key=lambda r: float(r[4]))

        for row in rows:

            max(rows, key=lambda r: float(r[4]))

            writer1.writerow(row)

 

A solution:

Just have the key function for max return a tuple of (r[4], r[2])

Slightly simplified example (without output file)

with open('data.txt','rb') as f1:

    reader = csv.reader(f1, delimiter='\t')

    for group, rows in groupby(reader, itemgetter(0)):

        best = max(rows, key=lambda r: (float(r[4]), float(r[2])))

        print best

 

Answer

Try using SPL (Structured Process Language) to do this:

A

1

=file("e:\\data.txt").import().group(#1).(~.maxp([#5,#3]))

A1: #1 represents the first column; the sign ~ represents a group after records are divided by #1; maxp() function returns the record containing the max value in column 5 or, if column 5 has same values, the max value in column 3.