Group & Filter & Concatenate

Question

I have a file with a large number of rows. Each row contains 5 columns delimited by tabs. I want to find all rows that have same values for the first 4 columns but different values for the 5th column.

name       age  address    phone      city

eric  5     add1       1234       City1

jerry       5     add1       1234       City2

eric  5     add1       1234       City3

eric  5     add1       1234       City4

jax   5     add1       1234       City5

jax   5     add1       1234       City6

niko 5     add1       1234       City7

 

The result for this table should be

eric      5      add1      1234     City1

eric      5      add1      1234     City3

eric      5      add1      1234     City4

jax       5      add1      1234     City5

jax       5      add1      1234     City6

 

I tried using uniq -u -f4 after sort but that ignores the first 4 fields which in this case would return all the rows.

 

As I want to get it done with Ruby, the following solution isn’t what I need.  

use strict;

use warnings;

use Text::CSV_XS qw(csv);

 

my @csv_files = @ARGV;

 

# Parse all the CSV files into arrays of arrays.

my $data1 = csv(in => $csv_files[0], sep_char => "\t" );

 

# Parse the other CSV files into hashes of rows keyed on the columns we're going to search on.

my $data2 = csv(in             => $csv_files[1],

                 sep_char       => "\t",

                 headers        => ["code", "num1", "num2"],

                 key => "code"

            );

my $data3 = csv(in             => $csv_files[2],

                 sep_char       => "\t",

                 headers        => ["CODE"],

                 key            => "CODE"

            );

 

for my $row1 (@$data1) {

    my $row2 = $data2->{$row1->[0]};

    my $row3 = $data3->{$row1->[1]};

 

    if($row2 && $row3) {

        print join "\t", $row1->[0], $row1->[1], $row2->{num1}, $row2->{num2};

        print "\n";

    }

}

 

Answer

For the question you just need to group the rows by the first columns, find groups having more than one member, and then concatenate them. But since both Awk and Ruby lacks functions for processing structured data, the code is complicated and runs relatively slow. Here I use SPL (Structured Process Language) to express the algorithm. The code is simple and easy to understand:

A

1

=file("d:/file1.txt").import@t()

2

=A1.group(name,age,address,phone).select(~.len()>1).conj()

A1: Read in content of file1.txt.

undefined

A2: Group rows by name,age,addres,phone, find groups where member count is greater than 1, and concatenate rows in these groups.

undefined