Find Difference between Two Text Files - Case 1

 

Question
I have a 15k-line text file and a for loop that greps a second file, millions of lines long, for each of those lines and returns a count. It takes 5-6 seconds per line.

I should have run this in a screen.

 

Answer
Running grep once per line in a loop performs poorly, because every iteration rescans the whole second file; the larger that file is, the longer the job takes. In this case, try using esProc SPL (Structured Process Language) to handle the data. Suppose file1.txt is large and can't be loaded into memory all at once, while file2.txt is relatively small and can be imported into memory in full. To find the lines that are in file1.txt but do not exist in file2.txt, you can use the following SPL script:

 

    A
1   =file("e:\\file1.txt").cursor()
2   =file("e:\\file2.txt").import().keys(_1).index()
3   =A1.select(!A2.find(~._1))
4   =file("E:\\result.txt").export(A3)
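For readers without esProc, the same idea can be sketched in Python. This is not the SPL script itself, only a rough analogue under a few assumptions: the files are UTF-8 text, matching is on whole lines, and the function name write_lines_missing_from is made up for illustration. The small file is loaded into a set, and the large file is streamed line by line so it never has to fit in memory.

```python
def write_lines_missing_from(big_path, small_path, out_path):
    """Write the lines of big_path that do not occur in small_path.

    Hypothetical helper: the name and signature are illustrative only.
    Returns the number of lines written to out_path.
    """
    # Load the small file into a set for constant-time membership checks.
    with open(small_path, encoding="utf-8") as small:
        known = {line.rstrip("\r\n") for line in small}

    written = 0
    # Stream the large file one line at a time instead of reading it whole.
    with open(big_path, encoding="utf-8") as big, \
            open(out_path, "w", encoding="utf-8") as out:
        for line in big:
            text = line.rstrip("\r\n")
            if text not in known:
                out.write(text + "\n")
                written += 1
    return written
```

Calling write_lines_missing_from("file1.txt", "file2.txt", "result.txt") would read each file once, rather than rescanning the large file once per line of the small one.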

Generally, a file with millions of lines can be loaded into memory in full. esProc SPL offers a rich library of functions for processing in-memory data, such as association computations, multi-file queries and merge queries, which lets users implement complicated algorithms and logic. To learn more about esProc, you can refer to the Tutorial.
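Since the question asked for a count per line rather than a difference, here is a minimal Python sketch of a single-pass count. It assumes exact whole-line matches (grep would also match substrings), UTF-8 text, and that the 15k lines fit comfortably in memory; the function name count_line_occurrences is hypothetical.

```python
def count_line_occurrences(patterns_path, big_path):
    """Count how often each line of patterns_path occurs in big_path.

    Hypothetical helper: counts exact whole-line matches only.
    Returns a dict mapping each pattern line to its count.
    """
    # Start every pattern line at zero so lines that never match still appear.
    with open(patterns_path, encoding="utf-8") as patterns:
        counts = {line.rstrip("\r\n"): 0 for line in patterns}

    # Read the large file once, updating counts by dictionary lookup.
    with open(big_path, encoding="utf-8") as big:
        for line in big:
            text = line.rstrip("\r\n")
            if text in counts:
                counts[text] += 1
    return counts
```

The large file is read once in total, so the running time no longer scales with the number of lines in the smaller file.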