Remove Duplicate Data from a CSV File

Problem description & analysis

Below is CSV file csv.csv:

a,123,value1,a@email.com

a,123,value1,a@email.com

a,123,value1,a@email.com

a,123,Value7,a@email.com

b,567,Value5,b@email.com

b,567,Value6,b@email.com

b,567,Value6,b@email.com

We are trying to delete the duplicate records from the CSV file. Below is the desired result:

a,123,Value7,a@email.com

a,123,value1,a@email.com

b,567,Value5,b@email.com

b,567,Value6,b@email.com

Solutions:

Method: Through table sequence

Write the following script p1.dfx in esProc:

A

1

=file("csv.csv").import@c()

2

=A1.group@1(#1,#2,#3,#4)

3

=file("result.csv").export@c(A2)

Explanation:

A1   Import the CSV file as a table sequence.

A2  Group records by all columns, get the first record of each group, and return all the eligible records as a record sequence. This is equivalent to a distinct operation by all columns.

A3  Export result to result.csv.

Method 2: Through strings

Write the following script p1.dfx in esProc:

A

1

=file("csv.csv").read@n()

2

=A1.id()

3

=file("result.csv").export(A2)

Explanation:

A1   Read each row of the CSV file as a string and return a sequence of strings.

A2  Perform distinct operation on the sequence of strings using id() function.

A3  Export result to result.csv.

Method 3: Through sequence of sequences

Write the following script p1.dfx in esProc:

A

1

=file("csv.csv").import@w()

2

=A1.id()

3

=file("result.csv").export@c(A2)

Explanation:

A1   Import the CSV file as a sequence of sequences.

A2  Perform distinct operation on the sequence of sequences using id() function.

A3  Export result to result.csv.

Read How to Call an SPL Script in Java to learn about the integration of an SPL script with a Java program.

Q & A Collection

https://stackoverflow.com/questions/62313801/compare-two-columns-in-csv-and-write-only-unique-values-to-another-csv