Group & Filter & Concatenate
【Question】
I have a file with a large number of rows. Each row contains 5 columns delimited by tabs. I want to find all rows that have same values for the first 4 columns but different values for the 5th column.
name age address phone city
eric 5 add1 1234 City1
jerry 5 add1 1234 City2
eric 5 add1 1234 City3
eric 5 add1 1234 City4
jax 5 add1 1234 City5
jax 5 add1 1234 City6
niko 5 add1 1234 City7
The result for this table should be
eric 5 add1 1234 City1
eric 5 add1 1234 City3
eric 5 add1 1234 City4
jax 5 add1 1234 City5
jax 5 add1 1234 City6
I tried using uniq -u -f4 after sort but that ignores the first 4 fields which in this case would return all the rows.
As I want to get it done with Ruby, the following solution isn’t what I need.
use strict;
use warnings;
use Text::CSV_XS qw(csv);
my @csv_files = @ARGV;
# Parse all the CSV files into arrays of arrays.
my $data1 = csv(in => $csv_files[0], sep_char => "\t" );
# Parse the other CSV files into hashes of rows keyed on the columns we're going to search on.
my $data2 = csv(in => $csv_files[1],
sep_char => "\t",
headers => ["code", "num1", "num2"],
key => "code"
);
my $data3 = csv(in => $csv_files[2],
sep_char => "\t",
headers => ["CODE"],
key => "CODE"
);
for my $row1 (@$data1) {
my $row2 = $data2->{$row1->[0]};
my $row3 = $data3->{$row1->[1]};
if($row2 && $row3) {
print join "\t", $row1->[0], $row1->[1], $row2->{num1}, $row2->{num2};
print "\n";
}
}
【Answer】
For the question you just need to group the rows by the first columns, find groups having more than one member, and then concatenate them. But since both Awk and Ruby lacks functions for processing structured data, the code is complicated and runs relatively slow. Here I use SPL (Structured Process Language) to express the algorithm. The code is simple and easy to understand:
A |
|
1 |
=file("d:/file1.txt").import@t() |
2 |
=A1.group(name,age,address,phone).select(~.len()>1).conj() |
A1: Read in content of file1.txt.
A2: Group rows by name,age,addres,phone, find groups where member count is greater than 1, and concatenate rows in these groups.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL