* Remove Duplicates from a Big File by Lines


It’s simple to remove duplicate lines with SQL’s SELECT DISTINCT … FROM statement. But SQL can’t query a file directly, so you first need to load the data into a database table. Alternatively, you can write a program that works on the file itself: open the file and read it line by line, comparing each line with the unique values held in a buffer. Discard the duplicates and append the rest to the buffer; once all lines have been read and compared, write the unique records in the buffer back to the original file.
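The buffer-based procedure can be sketched in Python roughly as follows. This is only an illustration, not esProc code: the function name `dedupe_lines_in_memory` and both path parameters are made up for the example, a `set` stands in for the buffer, and the result is written to a separate output file instead of overwriting the original, so the source stays intact if something goes wrong.

```python
def dedupe_lines_in_memory(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first occurrence of each line.

    Hypothetical helper for illustration. Every distinct line is held in
    memory, so this only suits files whose unique lines fit in RAM.
    """
    seen = set()  # the "buffer" of unique lines met so far
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            key = line.rstrip("\n")  # compare lines without the line break
            if key not in seen:      # a new line: remember it and write it out
                seen.add(key)
                dst.write(key + "\n")
            # a duplicate is simply skipped
```

Calling `dedupe_lines_in_memory("urls.txt", "urls_distinct.txt")` would keep the lines in their original order, since each one is written the first time it is met.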

It sounds easy, but this approach only suits relatively small files. To handle a file that doesn’t fit in memory, you need to create an external file buffer or sort the file before deleting duplicates. Neither external buffering nor sorting a huge file is simple to implement.
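To give a feel for the extra work involved, here is a minimal Python sketch of one possible external-buffer approach. It is an assumption about how such a program might look, not a reference implementation: the lines are first spread over temporary bucket files by a checksum, so that identical lines always land in the same bucket, and each bucket is then deduplicated on its own. The function name `dedupe_lines_partitioned` and the `buckets` parameter are hypothetical, and the sketch assumes each individual bucket fits in memory.

```python
import os
import tempfile
import zlib


def dedupe_lines_partitioned(src_path, dst_path, buckets=64):
    """Deduplicate the lines of a large file using temporary bucket files.

    Hypothetical helper for illustration. The output does not keep the
    original line order.
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.txt") for i in range(buckets)]
        files = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            # Pass 1: route every line to a bucket chosen by its checksum.
            with open(src_path, "r", encoding="utf-8") as src:
                for line in src:
                    key = line.rstrip("\n")
                    idx = zlib.crc32(key.encode("utf-8")) % buckets
                    files[idx].write(key + "\n")
        finally:
            for f in files:
                f.close()

        # Pass 2: duplicates can only occur within one bucket, so each
        # bucket is deduplicated independently with a small in-memory set.
        with open(dst_path, "w", encoding="utf-8") as dst:
            for p in paths:
                seen = set()
                with open(p, "r", encoding="utf-8") as bucket:
                    for line in bucket:
                        if line not in seen:
                            seen.add(line)
                            dst.write(line)
```

Even this short version needs two passes over the data, temporary file management, and a sensible bucket count, and it gives up the original line order. A sort-based variant would need an external merge sort instead.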

With esProc SPL, this trouble goes away. A one-liner is enough:


You can also run a SQL query directly on the file with esProc:

$select distinct #1 from d:/urls.txt


Besides the distinct operation on a file, esProc supports many more operations, most of which can be written in SQL. More examples can be found in Structured Text Computations with esProc.

esProc is a file processor that conveniently handles data loading, database export and mixed computations over various types of files, including TXT, Excel, XML, JSON, CSV and INI. The desktop tool is ready to use, simple to configure and convenient to debug: you can set breakpoints and step through execution, viewing the result of each step. With a powerful yet simple syntax that follows the natural way of thinking, esProc is more convenient to use than high-level languages. For details, read Data File Processor.

SPL integrates easily with Java programs. Read How to Call an SPL Script in Java for details.

To learn how to work with esProc, read Getting Started with esProc.