Get Rows Having Values That Are Included in a Column in Another Data Set

Question

I need to extract all hits from one list (List.txt) that can be found in one of the columns of another file (the column named here in Data.txt) and write the matching rows to a third file (output.txt).

Data.txt (tab-delimited):

some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2
T 3 Whizz 13 3

List.txt:

Gee
Whiz
Lol

Ideally output.txt looks like:

some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E 4 Btm;Lol 16 2

So I tried a shell script:

for ids in List.txt
do
    grep $ids Data.txt >> output.txt
done

except that, in place of List.txt, I typed out (cut and pasted, actually) every entry of List.txt in the script itself.

Unfortunately, the resulting output.txt included the last line of Data.txt, I assume because ‘Whizz’ contains ‘Whiz’.

I also tried cat Data.txt | egrep -F "List.txt", which resulted in the error "grep: conflicting matchers specified"; I suppose that was too naive of me. The actual List.txt contains a sorted list of 985 words, and Data.txt has 115576 rows with 17 columns.

Answer

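To get this done, you can split the here field of each row in Data.txt into a set of words and intersect that set with the words in List.txt; a row is a hit when the intersection is not empty. Because whole words are compared, ‘Whizz’ no longer matches ‘Whiz’. Set operations are a little complicated to handle in shell, so here I do this in SPL (Structured Process Language):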

    A
1   =file("/Data.txt").import@t()
2   =file("/List.txt").read@n()
3   =A1.select(here.array(";")^A2!=[])

The “^” sign computes the intersection of two sets, and “[]” represents an empty set, so A3 keeps only the rows whose here field shares at least one word with List.txt.
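For readers who do not have SPL at hand, the same set-intersection idea can be sketched in Python. This is only an illustration under stated assumptions: filter_rows is a hypothetical helper name, the input is assumed to be tab-delimited with a header row, and the sample is cut down to three columns instead of reproducing Data.txt in full.

```python
import csv
import io


def filter_rows(data_text, list_text, column="here", sep=";"):
    """Keep rows whose `column`, split on `sep`, shares a word with the list."""
    # One word per line in the list; blank lines are ignored.
    wanted = {w.strip() for w in list_text.splitlines() if w.strip()}
    reader = csv.DictReader(io.StringIO(data_text), delimiter="\t")
    kept = []
    for row in reader:
        words = set((row.get(column) or "").split(sep))
        if words & wanted:  # non-empty intersection, like "^ ... != []" in SPL
            kept.append(row)
    return kept


# Cut-down, hypothetical sample: only three columns are reproduced here.
DATA = (
    "some_data\there\tetc\n"
    "A\tGee;Whiz;Hello\t12\n"
    "A\tGee;Whizz;Hi\t32\n"
    "E\tBtm;Lol\t2\n"
    "T\tWhizz\t3\n"
)
LIST = "Gee\nWhiz\nLol\n"

for row in filter_rows(DATA, LIST):
    print("\t".join(row.values()))
```

Because whole words are compared rather than substrings, the row containing only Whizz is dropped, while the second row is still kept because it contains Gee.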