Is There Any Java Open-source Library for Data Query & Analysis

The established Python Pandas used for data query and analysis never lacks imitators. Many are based on Java. Among which Tablesaw and Joinery are famous, but they are not nearly so good as the Python library in a lot of aspects, such as maturity, computational ability, and richness of functions. Besides, the Java libraries, need to be recompiled whenever their code is modified (while Pandas and SQL do not). It would be better to create a table in SQLite, read and import the source data, perform the query analysis in SQL when the size of source data is relatively small and if you don’t mind the inconveniences. The whole process could be performed in Java.

Yet as an open-source Java library written purely in Java, esProc SPL is much easier and simpler to use than SQLite.

SPL uses easy to master JDBC driver. To perform a conditional query on Orders.txt, for instance, SPL has the following code:

…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
String str="=T(\"D:/data/Orders.txt\").select(Amount>1000 && Amount<=3000 && like(Client,\"*S*\"))";
ResultSet result = statement.executeQuery(str);
…

That is convenient. SPL also boasts a great number of functions for performing basic computations, Here are some examples:

//Sorting

str ="=T(\\"D:/data/Orders.txt\\").sort(Client,-Amount)";

//Distinct

str="=T(\\"D:/data/Orders.txt\\").id(Client)";

//Grouping & aggregation

str ="=T(\\"D:/data/Orders.txt\\").groups(year(OrderDate);sum(Amount))";

//Join

str ="=join(T (\"D:/data/Orders.txt\"):O,SellerId; T(\"D:/data/Employees.txt\"):E,EId).new(O.OrderID,O.Client,O.SellerId,O.Amount,O.OrderDate,   E.Name,E.Gender,E.Dept)";

For database programmers, esProc supplies corresponding SQL syntax. They can handle a grouping & aggregation operation using the following SQL statement:

str="$SELECT  year(OrderDate),sum(Amount) from Orders.txt group by year(OrderDate)"

SPL supports a variety of data sources, including text files, relational databases, NoSQL databases, and RESTful data.

A SPL script, like a SQL query, can be embedded into a Java program as the above shows. It can also be saved as a separate script file (which is like a stored procedure) independent of the Java program when the computation is a complicated multi-step one, may undergo frequent modifications, and needs a loosely-coupled environment. Here is an example. We are trying to find employees in every other department whose ages are below the average age of the current department. First, we save the SPL script as a file:


A

1

=T("Employee.csv")

2

=A1.group(DEPT; (a=~.avg(age(BIRTHDAY)),~.select(age(BIRTHDAY)<a)):YOUNG)

3

=A2.conj(YOUNG)

Now we call the script file from Java as we call a stored procedure:

…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("call getYoung()");
…

 

Sometimes the computing logic is so complicated that it is hard to phrase even in database stored procedure. SPL, with its rich computing functions and syntax, can significantly simplify the computing logic. To calculate the largest number of days when a given stock rises in a row, for example, SPL just needs two lines of code:


A

B

1

=T("d:/AAPL.xlsx")

Retrieve records from the Excel file, reading the first row as column headers

2

=a=0,A1.max(a=if(price>price[-1],a+1,0))

Count the longest consecutive rising days

 

The independent and separate script file can be edited and debugged on a special SPL IDE. The SPL IDE is equipped with a complete set of debugging functionalities and lets users to observe the result of each step, making it particularly suitable for achieving algorithms with rather complex logics:

undefined 

Obviously, esProc SPL outperforms all the other open-source libraries in handling data query and analysis. It also shows outstanding performance and capability in data source support, diverse/multiple source mixed computing, big data computing and parallel processing. There are detailed discussions of them in relative essays.