How to Compare Two Large CSV Files in Java

Question

Source: https://stackoverflow.com/questions/69357566/how-to-compare-two-large-csv-file-in-java

I need to compare two large CSV files and find the differences.

The first CSV file looks like this:

c71f55b6c18248b8915d8a26
64b7d2d4eab74d7999a967c0
ceb792ad21054fe0a27ec410
95319566f9424c57ba2145f9
682a4fe26c154050b8f5c6f1
88e0209e2af74049ad9bf2bd
5c462b42763d41d7bb67029f
0ee74c227fc84e39a9ecc1da
66f7ab6f56374ba08d2fb92d
3ed793e35f9441b58562c9ba
baad81ac8ba54188afe63fb8
...

Each row holds a single ID, and the file has approximately 5 million rows. The second CSV file has the same format, with about 3 million rows.

I need to remove the IDs in the second CSV from the first CSV and store the result in MongoDB. When I load all lines into memory and then compare the two files, I get an out-of-memory error. I have 512 MB of memory and will receive at least 30 requests a day. The row count of each CSV varies between 1 million and 10 million. Two requests can arrive at the same time and must be processed simultaneously.

Is there another way to do this?

Thanks.

Answer

You need to delete from the first CSV file the rows that also exist in the second CSV file. Because both files are very large, they cannot be loaded into memory in full, and handling that in plain Java takes a fairly long piece of code.

This is fairly simple in SPL, an open-source Java package. A single line of code is enough:
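For reference, a plain-Java route that stays within a small heap usually starts with an external sort: read the file in chunks that fit in memory, sort each chunk, spill it to a temporary file, and then merge the sorted chunks. The following is only a minimal sketch under assumptions, not a tuned implementation. The class name `ExternalSort`, the method `sort`, and the `maxLinesInMemory` parameter are hypothetical names chosen for illustration, and the input is assumed to contain one ID per line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {

    /** One open chunk file plus the line it is currently positioned on. */
    private static final class ChunkReader implements Comparable<ChunkReader> {
        final BufferedReader reader;
        String current;

        ChunkReader(Path chunk) throws IOException {
            reader = Files.newBufferedReader(chunk);
            current = reader.readLine();
        }

        /** Moves to the next line; returns false when the chunk is exhausted. */
        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }

        @Override
        public int compareTo(ChunkReader other) {
            return current.compareTo(other.current);
        }
    }

    /** Sorts the lines of input into output, buffering at most maxLinesInMemory lines. */
    public static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.trim();
                if (id.isEmpty()) continue;          // skip blank lines
                buffer.add(id);
                if (buffer.size() >= maxLinesInMemory) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
        }
        mergeChunks(chunks, output);
    }

    /** Sorts the buffered lines in memory and spills them to a temporary file. */
    private static Path writeSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, buffer);
        return chunk;
    }

    /** K-way merge: the heap always yields the chunk whose current line is smallest. */
    private static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkReader> heap = new PriorityQueue<>();
        for (Path chunk : chunks) {
            ChunkReader r = new ChunkReader(chunk);
            if (r.current != null) heap.add(r); else r.reader.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                ChunkReader smallest = heap.poll();
                out.write(smallest.current);
                out.newLine();
                if (smallest.advance()) heap.add(smallest); else smallest.reader.close();
            }
        }
        for (Path chunk : chunks) Files.deleteIfExists(chunk);
    }
}
```

A call such as `ExternalSort.sort(Paths.get("csv1.csv"), Paths.get("csv1.sorted"), 200_000)` would keep only one chunk in memory at a time. The chunk size here is an illustrative guess; the right value depends on the JVM and on how many requests run concurrently, so it is worth measuring.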

     A
1    =file("result.csv").export([file("csv1.csv").cursor@i().sortx(~),file("csv2.csv").cursor@i().sortx(~)].mergex@d())
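Read left to right, the expression sorts both files and then merges them while keeping only the rows of the first file that are missing from the second. The merge step can be sketched in plain Java as a single pass over two sorted inputs. This is a minimal illustration of the general sorted-merge difference technique, not SPL's actual implementation. The class `SortedDiff` and the method `difference` are hypothetical names, and both inputs are assumed to be already sorted in Java's natural `String` order with one ID per line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SortedDiff {

    /**
     * Writes every line of sortedA that does not occur in sortedB.
     * Both inputs must be sorted in String.compareTo order.
     * Returns the number of lines written.
     */
    public static long difference(Path sortedA, Path sortedB, Path output) throws IOException {
        long written = 0;
        try (BufferedReader a = Files.newBufferedReader(sortedA);
             BufferedReader b = Files.newBufferedReader(sortedB);
             BufferedWriter out = Files.newBufferedWriter(output)) {
            String lineA = a.readLine();
            String lineB = b.readLine();
            while (lineA != null) {
                // Once B is exhausted, every remaining line of A is kept.
                int cmp = (lineB == null) ? -1 : lineA.compareTo(lineB);
                if (cmp < 0) {            // lineA is absent from B: keep it
                    out.write(lineA);
                    out.newLine();
                    written++;
                    lineA = a.readLine();
                } else if (cmp > 0) {     // B is behind A: advance B
                    lineB = b.readLine();
                } else {                  // present in both: drop it from A
                    lineA = a.readLine();
                }
            }
        }
        return written;
    }
}
```

Only two lines are held in memory at any moment, so the pass itself is independent of file size. The output could then be read line by line and written to MongoDB in batches rather than all at once.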

 

SPL offers a JDBC driver that Java can invoke. Store the SPL script above as diff.splx and call it from Java the same way you would call a stored procedure:

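import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

Class.forName("com.esproc.jdbc.InternalDriver");   // load the SPL JDBC driver
try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
     CallableStatement st = con.prepareCall("call diff()")) {
    st.execute();                                   // runs the stored diff.splx script
}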

Or execute the SPL string directly in a Java program, the way you would execute a SQL statement:

// con is an open connection obtained from jdbc:esproc:local://
PreparedStatement st = con.prepareStatement("==file(\"result.csv\").export([file(\"csv1.csv\").cursor@i().sortx(~),file(\"csv2.csv\").cursor@i().sortx(~)].mergex@d())");
st.execute();
