Remove duplicates from a CSV file
Sometimes during data processing we need to remove duplicate data from a file, or keep only the duplicated records. This can be done row by row or by comparing a key field. The esProc SPL language boasts an all-round set of functions for set-based operations, which makes it easy to perform distinct over files.
Row-wise DISTINCT
In a text file, each row is a string, and we want to retain only one of each set of duplicate rows. To do this, we read in each row of the file as a string to form a set of strings and perform distinct over that set.
Example: paint.txt records the IDs and names of students who registered for the painting class. Some students have registered more than once, so we need to delete the extra duplicate records and write the result to paint1.txt. Below is part of paint.txt:
20121102-Joan
20121107-Jack
20121113-Mike
20121107-Jack
esProc SPL script:
+ | A | B |
---|---|---|
1 | =file("e:/txt/paint.txt").read@n() | Read in each row of paint.txt to form a set |
2 | =A1.id() | Delete duplicates from A1's set |
3 | =file("e:/txt/paint1.txt").write(A2) | Write A2, the deduplicated set, into paint1.txt |
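For the four sample lines shown above (the full file may contain more), paint1.txt would keep each student only once; the repeated 20121107-Jack line is dropped:

20121102-Joan
20121107-Jack
20121113-Mike

The three cells can also be chained into a single expression. Below is a minimal sketch that only recombines the functions already used above, with the same file paths:

+ | A | B |
---|---|---|
1 | =file("e:/txt/paint1.txt").write(file("e:/txt/paint.txt").read@n().id()) | Read, deduplicate and write in one expression |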
Comparison by key field
We have a file organized into columns: the first row contains the column headers and the detail data starts from the second row. We want to compare values on a key field (or fields) and either remove the duplicate rows or retain only the duplicated ones.
Below is part of the 2018 orders table (order_2018.xlsx):
+ | A | B | C | D | E |
---|---|---|---|---|---|
1 | ID | CustomerId | ProductId | OrderDate | Amount |
2 | 10248 | 20108 | 1 | 2018-7-4 | 428.00 |
3 | 10249 | 20806 | 3 | 2018-7-5 | 1842.00 |
4 | 10250 | 30712 | 2 | 2018-7-8 | 1523.49 |
5 | 10251 | 10588 | 4 | 2018-7-8 | 624.94 |
6 | 10252 | 20108 | 1 | 2018-7-9 | 3559.49 |
7 | 10253 | 20806 | 2 | 2018-7-10 | 1428.00 |
Removing duplicates
Example 1: Get the unique IDs of all customers that placed an order in 2018 and write them to 2018c.xlsx.
esProc SPL script:
+ | A | B |
---|---|---|
1 | =file("e:/txt/order_2018.xlsx").xlsimport@t() | Import orders data of 2018 |
2 | =A1.id(CustomerId) | Get IDs of all different customers |
3 | =file("e:/txt/2018c.xlsx").xlsexport(A2) | Export unique IDs to 2018c.xlsx |
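To illustrate, take only the six sample orders shown earlier: the distinct customer IDs are 20108, 20806, 30712 and 10588 (the repeated 20108 and 20806 are kept once each), so 2018c.xlsx would hold roughly the following single column of values; the row order may differ, and the real file will of course contain many more IDs:

+ | A |
---|---|
1 | 20108 |
2 | 20806 |
3 | 30712 |
4 | 10588 |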
Example 2: Find the different products each customer bought in 2018, and write the key fields CustomerId and ProductId to 2018c_p.xlsx.
esProc SPL script:
+ | A | B |
---|---|---|
1 | =file("e:/txt/order_2018.xlsx").xlsimport@t(CustomerId,ProductId) | Import key fields of the orders table of 2018 |
2 | =A1.group@1(CustomerId,ProductId) | Group by the key fields; the @1 option gets only the first record of each group |
3 | =file("e:/txt/2018c_p.xlsx").xlsexport@t(A2) | Export A2's result to 2018c_p.xlsx |
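Again taking only the six sample orders, customer 20108 bought product 1 twice, so A2 contains five distinct (CustomerId, ProductId) pairs and 2018c_p.xlsx would look roughly like this (row order may differ; the real file contains more pairs):

+ | A | B |
---|---|---|
1 | CustomerId | ProductId |
2 | 20108 | 1 |
3 | 20806 | 3 |
4 | 30712 | 2 |
5 | 10588 | 4 |
6 | 20806 | 2 |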
Retaining duplicates only
Example: Get the records where a customer bought the same product more than once and export them to 2018c_rebuy.xlsx.
esProc SPL script:
+ | A | B |
---|---|---|
1 | =file("e:/txt/order_2018.xlsx").xlsimport@t() | Import orders data of 2018 |
2 | =A1.group(CustomerId,ProductId) | Put records where the same customer buys the same product into one group |
3 | =A2.select(~.count()>1).conj() | Get the groups that have more than one record and concatenate their records |
4 | =file("e:/txt/2018c_rebuy.xlsx").xlsexport@t(A3) | Export A3's result to 2018c_rebuy.xlsx |
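In the sample orders, only customer 20108 bought the same product (product 1) more than once, in orders 10248 and 10252, so for those sample rows 2018c_rebuy.xlsx would contain just these two records:

+ | A | B | C | D | E |
---|---|---|---|---|---|
1 | ID | CustomerId | ProductId | OrderDate | Amount |
2 | 10248 | 20108 | 1 | 2018-7-4 | 428.00 |
3 | 10252 | 20108 | 1 | 2018-7-9 | 3559.49 |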
The above scripts assume the file is small enough to fit into memory. If it is too big, see Sample Programs of Performing Distinct on a File to learn how to handle large files.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_Desktop/
SPL Learning Material 👉 https://c.scudata.com
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProcDesktop
Linkedin Group 👉 https://www.linkedin.com/groups/14419406/