Sample Programs of Performing Distinct on a File

 

【Abstract】

This article explains methods of performing distinct operation on small files and big files, and offers sample program in esProc SPL. Those methods include removing or retaining duplicates by rows or by key filed.Sample Programs of Performing Distinct on a File!

Sometimes during data processing we need to remove duplicate data from a file or retain only duplicates in a file. Now lets look at how we can get these done over small files or big files through sample programs in esProc SPL. The methods include row-wise deletion or deletion by key field. esProc is the professional data computing engine. The SPL language it uses boasts a set of all-round function libraries for handling set-based operations. That makes it easy and simple to perform distinct over files.

 

1. Small files

1.1 Row-wise DISTINCT

In a text file, each row is a string. We want to retain only one of each set of duplicate rows. To do this, we read in each row in the file as string to form a set of strings and perform distinct over the set.

Example: paint.txt records the IDs and names of students who register in the painting class. Some students have registered more than once, so we need to delete the extra duplicate records and write data to paint1.txt. Below is part of paint.txt:
  20121102-Joan
  20121107-Jack
  20121113-Mike
  20121107-Jack

esProc SPL script:


A

Comment

1

=file("e:/txt/paint.txt").read@n()

Read in each row of paint.txt to form a set

2

=A1.id()

Delete duplicates from A1s set

3

=file("e:/txt/paint1.txt").write(A2)

Write A2, the deduplicated set, into paint1.txt

 

1.2 Comparison by key field

We have a file that has columns of data. The first row contains column headers and detailed data starts from the second row. We want to compare values in the key field, and delete rows containing duplicate key values or retain them only.

Below is a part of the orders table of 2018 (order_2018.xlsx):

 undefined

1.2.1. Removing duplicates

Example 1: Get the unique IDs of all customers that place an order in 2018 and write them into 2018c.xlsx.

esProc SPL script:


A

Comment

1

=file("e:/txt/order_2018.xlsx").xlsimport@t()

Import orders data of 2018

2

=A1.id(CustomerId)

Get IDs of all different customers

3

=file("e:/txt/2018c.xlsx").xlsexport(A2)

Export unique IDs to 2018c.xlsx

Example 2: Find the different products each customer bought in 2018, and write key fields CustomerId and ProductId to 2018c_p.xlsx.

esProc SPL script:


A

Comment

1

=file("e:/txt/order_2018.xlsx").xlsimport@t(CustomerId,ProductId)

Import key fields of orders table of 2018

2

=A1.group@1(CustomerId,ProductId)

Group by the key fields; @1 option enables getting the first record of each group only

3

=file("e:/txt/2018c_p.xlsx").xlsexport@t(A2)

Export A2s result to 2018c_p.xlsx

1.2.2. Retaining duplicates only

Example: Get records where the customer bought same product many times and export them to 2018c_rebuy.xlsx.

esProc SPL script:


A

Comment

1

=file("e:/txt/order_2018.xlsx").xlsimport@t()

Import orders data of 2018

2

=A1.group(CustomerId,ProductId)

Put records where same customer buys same product into one group

3

=A2.select(~.count()>1).conj()

Get groups where the number of records is greater than 1 and concatenate their records together

4

=file("e:/txt/2018c_rebuy.xlsx").xlsexport@t(A3)

Export A3s result to 2018c_rebuy.xlsx

 

2. Big files

A big file cannot be wholly loaded into the memory. So we cant import all records to make comparison to find the duplicates. Instead we need to do the retrieval and comparison batch by batch. esProc SPL has the cursor mechanism that makes it convenient to handle big file processing, including the distinct operation.

2.1 Row-wise DISTINCT

In a big text file, each row is a string. We want to retain only one of each set of duplicate rows. To do this, we read in each row in the file as string to form a set of strings with the cursor and perform distinct over the cursor.

Example: all.txt records information of real property owners nationwide, including their IDs and names. Below is part of the file:
  510121198802213364-Joan
  110113199203259852-Jack
  201264197206271113-Mike

Some owners have more than one house in different states and so they appear multiple times in the file. Now we want to remove the duplicates and retain only one registration record and export the result to all2.txt.

esProc SPL script:


A

Comment

1

=file("e:/txt/all.txt").cursor@s()

Create the cursor; @s option enables generating a one-field table sequence with values being rows

2

=A1.groupx(_1)

Group the one field in the cursor to remove the duplicate rows

3

=file("e:/txt/all2.txt").export(A2)

Export the result to all2.txt.

 

2.2 Comparison by key field

We still take the orders table as an example. It is orders.xlsx, a big file that contains orders of all years.

2.2.1. Removing duplicates

Example 1: Get the unique IDs of all customers that place an order and write them into customers.xlsx.   

esProc SPL script:


A

Comment

1

=file("e:/txt/orders.xlsx").xlsimport@tc()

Create cursor of the orders table

2

=A1.groupx(CustomerId)

Group by CustomerId to get different customers

3

=file("e:/txt/customers.xlsx").xlsexport@t(A2)

Write customer IDs to customers.xlsx

Find the different products each customer bought, and write key fields CustomerId and ProductId to c_p.xlsx.

esProc SPL script:


A

Comment

1

=file("e:/txt/orders.xlsx").xlsimport@tc()

Create cursor of the orders table

2

=A1.groupx(CustomerId,ProductId)

Group by key field to get the IDs and products of different customers

3

=file("e:/txt/c_p.xlsx").xlsexport@t(A2)

Export result to c_p.xlsx

 

2.2.2. Retaining duplicates only

Example: Get records where the customer bought same product many times and export them to c_rebuy.xlsx.

esProc SPL script:


A

Comment

1

=file("e:/txt/orders.xlsx").xlsimport@tc().sortx(CustomerId,ProductId)

Create cursor of the orders table and sort it by key field

2

=A1.group(CustomerId,ProductId)

Put records where same customer buys same product into one group

3

=A2.select(~.count()>1).conj()

Get groups where the number of records is greater than 1 and concatenate their records together

4

=file("e:/txt/c_rebuy.xlsx").xlsexport@t(A3)

Export A3s result to c_rebuy.xlsx 

 

Find more examples in SPL CookBook