Fast Read & Group over a CSV in SPL

Question

I am writing a script in Perl, but got stuck in one part. Below is a sample of my CSV file.

 

"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"

"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"

"MH","919921990805","20150806125003","prepaid","prepaid","2G",

"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"

"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"

"MH","919921990805","20150806125003","prepaid","prepaid","2G",

"MP","918120197922","20150806125004","prepaid","prepaid","2G",

"MUM","919904303790","20150806125005","prepaid","prepaid","2G","3G"

"MUM","918652624178","20150806125005","prepaid","prepaid","2G","3G"

"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"

 

Now I need to take unique records on the basis of 2nd column (i.e. mobile numbers) but considering only the latest value of 3rd column (i.e. timestamp). Eg: for mobile number "918120197922":

 

"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"

"MP","918120197922","20150806125004","prepaid","prepaid","2G"

"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"

 

It should have selected the 3rd record as it has the latest value of timestamp (20150806125005). Please help.

Additional Info: Sorry for inconsistency in data. I have rectified it now. Yes data is in order, which means latest timestamp will appear in the latest rows. One more thing, my file has the size of more than 1 GB so is there any way to do this efficiently? Will Awk work faster than Perl in this case?

 

Answer

The logic is simple, that is, to find the record having the latest value of timestamp (maximum value). Sine text parsing is slow you’d better do it with multiple threads. Besides, a hash method will make the grouping faster than a traverse of all records. It’s complicated to do this in Perl. So I choose to handle your question in SPL. The Structured Process Language generates simple and easy code:

A

1

=file("d:\\source.csv").cursor@qmc()

2

=A1.groups(#2;top(-1;#3):a)

3

=A2.(a).conj()

4

=file("d:\\result.csv").export@c(A3)

A1: Read in content of the source.csv without quotations and return a multicursor.

undefined

A2: Group each subcursor of A1’s multicursor by the 2nd column and then get the record from each group containing the maximum value in the 3rd colulmn.

undefined

A3: Concatenate the selected records from all groups.

undefined

A4: Export A3’s result set to the target file result.csv.