"【 Question 】 I have a file with a large number of rows. Each row contains 5 columns delimited by .."

blackduckie RaqForum 28 No.
358 View • 4 Years ago

Group & Filter & Concatenate

text(125)

【Question】

I have a file with a large number of rows. Each row contains 5 columns delimited by tabs. I want to find all rows that have same values for the first 4 columns but different values for the 5th column.

name age address phone city

eric 5 add1 1234 City1

jerry 5 add1 1234 City2

eric 5 add1 1234 City3

eric 5 add1 1234 City4

jax 5 add1 1234 City5

jax 5 add1 1234 City6

niko 5 add1 1234 City7

The result for this table should be

eric 5 add1 1234 City1

eric 5 add1 1234 City3

eric 5 add1 1234 City4

jax 5 add1 1234 City5

jax 5 add1 1234 City6

I tried using uniq -u -f4 after sort but that ignores the first 4 fields which in this case would return all the rows.

As I want to get it done with Ruby, the following solution isn’t what I need.

use strict;

use warnings;

use Text::CSV_XS qw(csv);

my @csv_files = @ARGV;

# Parse all the CSV files into arrays of arrays.

my $data1 = csv(in => $csv_files[0], sep_char => "\t" );

# Parse the other CSV files into hashes of rows keyed on the columns we're going to search on.

my $data2 = csv(in => $csv_files[1],

sep_char => "\t",

headers => ["code", "num1", "num2"],

key => "code"

);

my $data3 = csv(in => $csv_files[2],

sep_char => "\t",

headers => ["CODE"],

key => "CODE"

);

for my $row1 (@$data1) {

my $row2 = $data2->{$row1->[0]};

my $row3 = $data3->{$row1->[1]};

if($row2 && $row3) {

print join "\t", $row1->[0], $row1->[1], $row2->{num1}, $row2->{num2};

print "\n";

}

【Answer】

For the question you just need to group the rows by the first columns, find groups having more than one member, and then concatenate them. But since both Awk and Ruby lacks functions for processing structured data, the code is complicated and runs relatively slow. Here I use SPL (Structured Process Language) to express the algorithm. The code is simple and easy to understand: