How to Query and Compute TXT/CSV Files in Java?

There are many class libraries for parsing txt/csv files on GitHub. You can parse your target file using one of them, import the parsed data to the database using Hibernate or compose an INSERT statement manually to load data in, and perform queries in the database. The method makes use of SQL’s powerful computational capability, but it has a very complicated process and ridiculously low realtime capabilities. Indeed, there are tools that have simple frameworks and strong realtime capabilities. Tablesaw simulates features of Python Pandas, and CSVJDBC can execute SQL-style statements directly on files. But both are poorly-developed because they do not support even many basic computations.

As a Java open-source library, esProc SPL becomes an ideal choice for its great computational capabilities and the advantage of being able to compute files directly.

An SPL script can be invoked through the JDBC driver, which is easy to learn. To group and summarize the orders stored in a txt file, for instance, we have the following SPL code:

…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
String str="=T(\"D:/data/Orders.txt\").groups(year(OrderDate);sum(Amount))";
ResultSet result = statement.executeQuery(str);
…

SPL has rich library functions to achieve common query computations directly. Here are some examples:

//Conditional query

str="=T(\\"D:/data/Orders.txt\\").select(Amount>1000 && Amount<=3000 && like(Client,\\"\*S\*\\"))

//Sorting

str ="=T(\\"D:/data/Orders.txt\\").sort(Client,-Amount)";

//Distinct

str="=T(\\"D:/data/Orders.txt\\").id(Client)";

//Grouping & aggregation

str ="=T(\\"D:/data/Orders.txt\\").groups(year(OrderDate);sum(Amount))";

//Join

str ="=join(T (\"D:/data/Orders.txt\"):O,SellerId; T(\"D:/data/Employees.txt\"):E,EId).new(O.OrderID,O.Client,O.SellerId,O.Amount,O.OrderDate,   E.Name,E.Gender,E.Dept)";

SPL also supports SQL syntax in an effort to reduce the costs of learning for database programmers. So, the above grouping & aggregation operation can be rewritten as follows:

str="$SELECT  year(OrderDate),sum(Amount) from D:/data/Orders.txt group by year(OrderDate)"

SPL allows saving an algorithm in a separate script file independent of the Java program, making it unnecessary to recompile the code when there are any modifications. This is really suitable for computations that are complicated multi-step ones or whose code may undergo frequent changes, by creating a loosely-coupled environment. Here is one example. We are trying to find employees in every other department whose ages are below the average age of the current department. First, we save the SPL algorithm as a script file:


A

1

=T("Employee.csv")

2

=A1.group(DEPT; (a=~.avg(age(BIRTHDAY)),~.select(age(BIRTHDAY)

3

=A2.conj(YOUNG)

Then we call the script file from Java as we call a stored procedure:

…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("call getYoung()");
...

 

Besides those commonly seen computations in our daily analysis work, there are the less common, peculiar and thorny ones. With SPL’s convenient-to-use functions and flexible syntax, we can deal with them effortlessly.

SPL can parse various special separators in a simple and intuitive way. To perform conditional query on a CSV file separated by the double horizontal line (--), SPL has the following statement:

= file("D:/Orders.txt").import@t(;,"--").select(Amount>1000 && Amount<=3000)

It is convenient and simple to handle text files of uncommon formats in SPL. Suppose we have a text file where every three lines correspond to one record and line 2 contains multiple fields, and are trying to arrange the file into a standard two-dimensional table sorted by the 3rd and 4th fields. We can handle the task using the following SPL code:


A

1

=file("D:\\data.txt").import@si()

2

=A1.group((#-1)\3)

3

=A2.new(~(1):OrderID, (line=~(2).array("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate   )

4

=A3.sort(_3,_4)

Sometimes the computing logic is so complicated that it is hard to achieve even in database stored procedure. SPL, with its powerful computing ability, can realize the algorithms conveniently. For example, the dates when goods are moved out and moved in are usually not continuous, and we are trying to get the inventories of a number of products in a specified continuous time period. To do this, SPL only needs a short piece of code, as shown below:


A

1

=file("d:/inout.txt").import@t()

2

=periods(argBeginDate,argEndDate)

3

=A1.group(product;~.align(A2,date):g)

4

=A3.news(g;A3.product,A2(#):date,ifn(in,0):in,   ifn(out,0):out, stock[-1]+in-out:stock)

SPL has a professional IDE with a complete set of debugging functionalities and letting users to observe the intermediate result of each step in the form of a table, making it particularly suitable for achieving computations with complex logics:

undefined 

SPL supports various data sources, including Excel files, XML data, databases, NoSQL databases, and RESTful data, as well as high-performance big file processing. There are many more SPL features waiting for you to find out.