How to Query and Compute XLS in Java?

POI does not have further query and computing capabilities after it parses xls. Programmers need to import data and hardcode a specific computing task. Embedded databases are exactly the opposite with their query and computing abilities and the inability to parse. Besides, they boast time-consuming data loading process and heavy frameworks. A fast and lightweight solution is to execute SQL directly on an xls file. But there are only a few open-source, rudimentary libraries for achieving the solution. Their practical uses are rather limited. DataFrame-style open-source libraries are more mature, having no need to interpret SQL and at the same time no possibility of decoupling, but they do not have as good computing capabilities as SQL.

To have an open-source library with powerful computing capabilities that can process xls directly and that decouples computing code from Java code, esProc SPL is the best choice.

SPL provides easy-to-use parsing functions to read xls conveniently and effortlessly. Suppose we have an Excel file where each row is an order record and where the first row contains column headers. To parse the file as a to-dimensional table, SPL just needs an extremely simple line of code:

=T("D:/data.xls")

The convenient T function reads data from the first sheet by default. If your parsing task is more complicated, like reading the sheet named "Orders", you can use xlsimport. The function has more parsing functionalities.

=file("D:/data.xlsx").xlsimport@t(;"Orders")

It can skip the title row, read rows from the Nth to Mth, and import data from a big file to name a few.

SPL also offers a variety of library functions to handle common computations, for example:

//Conditional query

=T("D:/Orders.xls").select(Amount>1000   &&   Amount<=3000 && like(Client,"\*S\*"))

//Sorting

=T("D:/Orders.xls").sort(Client,-Amount)";

//Distinct

=T("D:/Orders.xls").id(Client)";

//Grouping & aggregation

=T("D:/Orders.xls").groups(year(OrderDate);sum(Amount))";

//xlsx joins txt

=join(T("D:/Orders.xlsx"):O,SellerId;   T("D:/data/Employees.txt"):E,EId).new(O.OrderID,O.Client,O.SellerId,O.Amount,O.OrderDate,     E.Name,E.Gender,E.Dept)";

SPL offers JDBC driver to be conveniently invoked by Java. Take a conditional query as an example:

…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
String str="=T(\"D:/Orders.xls\").select(Amount>1000 && Amount<=3000 && like(Client,\"*S*\"))";
ResultSet result = statement.executeQuery(str);
…

So, SPL is an interpreted computational language. Like SQL, it can decouple itself from Java code. Actually, SPL supports SQL statements for what they are in order to reduce data programmers’ costs of learning. The above conditional query can be rewritten as:

str="$select * from   d:/Orders.xls where Client like'%S%'or (Amount>1000 and   Amount<=2000)"

The SPL code can be saved separately independent of the Java code, and does not need to be recompiled when there are any changes, loose couplings at root. The library is particularly suitable for handling computational scenarios that are complicated or that may undergo frequent modifications. In contrast, embedded libraries usually do not support stored procedures and large-scale databases need to compile stored procedures. One example is to find employees in every other department whose ages are below the average age in the current department. First, we save the SPL algorithm as a script file:


A

1

=T("Employee.xls")

2

=A1.group(DEPT; (a=~.avg(age(BIRTHDAY)),~.select(age(BIRTHDAY)<a)):YOUNG)

3

=A2.conj(YOUNG)

Then we invoke the above SPL script file from Java as we call a stored procedure:

…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("call getYoung()");
...

With xls files of irregular content, SPL has agile syntax and rich functions to handle them conveniently while other class libraries are unable to deal with. For example, an Excel sheet contains many strings in the form of “key=value” and we are trying to arrange them into a standard two-dimensional table for further computations. SPL has the following code to do this:


A

1

=file("D:/data/keyvalue.xlsx").xlsimport@w()

2

=A1.conj().(~.split("="))

3

=A2.new(~(1),~(2))

It is even hard for both SQL and the stored procedure to achieve some computations of complex logic. SPL can implement them effortlessly with its exceptional computational capabilities. To calculate the largest number of days when a given stock rises in a row, for instance, SPL only needs two lines of code:


A

1

=T("d:/AAPL.xlsx")

2

=a=0,A1.max(a=if(price>price[-1],a+1,0))

SPL’s professional IDE, equipped with a complete set of debugging functionalities and letting users observe the intermediate result of each step, is fit for programming computations with complex logic.

In summary, esProc SPL stands out from other open-source libraries with its outstanding computational capabilities and loosely-coupled, lightweight framework.