Is There Any Java Open-source Library for SQL-style Computations?

The so-called SQL-style computations are actually, in a more professional way, structured data computations. They include filtering, grouping, sorting and joins. There are many open-source Java libraries to handle these operations, but few are convenient to use. Stream is a Java insider that, unfortunately, has unsatisfactory computing ability. Scala is better at this, but, intended for big data handling, it has complicated framework and thus is difficult to be called by Java. There is also a pile of free libraries that either copy Python Pandas or that make users write SQL directly on files. In short, none is mature and stable.

Sufficiently mature and stable, powerful computational capability, simple framework, and invocation-friendly. It seems that such an open-source library can only be esProc SPL.

It is convenient for SPL to compute structured data. To read all records from Orders.txt, for example:

…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
String str="=T(\"D:/data/Orders.txt\")";
ResultSet result = statement.executeQuery(str);
…

The core code is just a single T(…).

For basic structured data computations, SPL offers a wealth of ready to use functions:

//Conditional query

str="=T(\\"D:/data/Orders.txt\\").select(Amount>1000 && Amount<=3000 && like(Client,\\"\*S\*\\"))";

//Sorting

str ="=T(\\"D:/data/Orders.txt\\").sort(Client,-Amount)";

//Grouping & aggregation

str ="=T(\\"D:/data/Orders.txt\\").groups(year(OrderDate);sum(Amount))";

//Join

str ="=join(T (\"D:/data/Orders.txt\"):O,SellerId; T(\"D:/data/Employees.txt\"):E,EId).new(O.OrderID,O.Client,O.SellerId,O.Amount,O.OrderDate,   E.Name,E.Gender,E.Dept)";

SPL has its own syntax to achieve these operations. For users accustomed to SQL, esProc is considerate enough to provide corresponding SQL syntax. You can handle a grouping & aggregation operation using the following SQL statement:

str="$SELECT  year(OrderDate),sum(Amount) from Orders.txt group by year(OrderDate)"

All these examples use a file as the data source, though you can access all types of relational databases, NoSQL databases, even Restful data source in SPL.

 

A SPL script, like a SQL query, can be embedded into a Java program as the above shows. It can also be saved as a separate script file (which is like a stored procedure) independent of the Java program when the computation is a complicated multi-step one, may undergo frequent modifications, and needs a loosely-coupled environment. Here is an example. We are trying to find the first n big customers whose orders amount takes up at least half of the total and sort the records of them by amount in descending order. First, we save the SPL script as a file:


A

B

1

= T("D:/data/sales.csv").sort(amount:-1)

Retrieve records and sort them in descending order

2

=A1.cumulate(amount)

Get the sequence of cumulative amounts

3

=A2.m(-1)/2

The final cumulative amount is the total

4

=A2.pselect(~>=A3)

Get the position of the records where the amount takes up at least half of the total

5

=A1(to(A4))

Get the records according to A4’s position

Now we call the script file from Java in the way of calling a stored procedure:

…

Class.forName("com.esproc.jdbc.InternalDriver");  
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");  
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("call bigCustomer ()");

...

 

Sometimes the computing logic is so complicated that it is hard to express even in Scala or with the database. SPL, with its rich computing functions and syntax, can significantly simplify the computing logic. To calculate the largest number of days when a given stock rises in a row, for example, SPL just needs two lines of code:


A

B

1

=T("d:/AAPL.xlsx")

Retrieve records from the Excel file, reading the first row as column headers

2

=a=0,A1.max(a=if(price>price[-1],a+1,0))

Count the longest consecutive rising days

 

Apart from reducing couplingness, an independent and separate script file enables users to edit and debug it on a special IDE. The SPL IDE is equipped with a complete set of debugging functionalities, and lets users observe the result of each step, making it convenient to achieve algorithms with rather complex logic:

undefined 

Apparently, esProc SPL outperforms all the other open-source libraries in computing structured data. The tool is also excellent in many other aspects, such as data source support, diverse/multiple sources mixed computing, big data computing and parallel processing. You can find relative discussions in other essays.