What open source Java packages can do structured data processing?

Almost every Java programmer creates and processes collections; they are a basic building block of programming. But Java is awkward for SQL-like operations. Take “find the classes with an average English score of less than 70”. In a database this needs only one simple declarative statement, “select CLASS, avg(English) as avg_En from students_scores group by CLASS having avg(English)<70”, whereas in Java you have to write loops over and over to implement what SQL expresses in a single statement. You also need to consider how to handle very large collections on a single machine, and how to use a multi-core architecture to speed up processing, even though parallel code is difficult to write and error-prone. With Open-esProc, these problems become much easier. Open-esProc differs from a general Java package, though: it encapsulates the data types and calculation methods in a scripting language called SPL, and the Java program calls the SPL script, which returns a ResultSet object.

For example: find classes with an average English score of less than 70. The SPL code is as follows:


A
1 =file("E:/txt/Students_scores.csv").import@tc()
2 =A1.groups(CLASS;avg(English):avg_En)
3 =A2.select(avg_En<70)
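For comparison, here is a minimal sketch of the same query in plain Java, using only the standard library. It assumes the CSV has a header row containing the columns CLASS and English and no quoted fields; the class and method names are hypothetical and chosen only for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LowAverageClasses {
    /** Returns class -> average English score for classes whose average is below the threshold. */
    static Map<String, Double> lowAverageClasses(BufferedReader csv, double threshold) {
        try {
            // Assumption: first line is a header, fields are comma separated without quoting.
            List<String> header = Arrays.asList(csv.readLine().split(","));
            int classCol = header.indexOf("CLASS");
            int englishCol = header.indexOf("English");

            // Group step: class -> {sum of scores, number of rows}
            Map<String, double[]> acc = new TreeMap<>();
            String line;
            while ((line = csv.readLine()) != null) {
                if (line.isEmpty()) continue;
                String[] fields = line.split(",");
                double[] a = acc.computeIfAbsent(fields[classCol], k -> new double[2]);
                a[0] += Double.parseDouble(fields[englishCol]);
                a[1] += 1;
            }

            // "Having" step: keep only the groups whose average is below the threshold.
            Map<String, Double> result = new TreeMap<>();
            for (Map.Entry<String, double[]> e : acc.entrySet()) {
                double avg = e.getValue()[0] / e.getValue()[1];
                if (avg < threshold) result.put(e.getKey(), avg);
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for Students_scores.csv.
        String sample = "CLASS,English\nClass 1,82\nClass 1,75\nClass 2,61\nClass 2,66\n";
        System.out.println(lowAverageClasses(new BufferedReader(new StringReader(sample)), 70));
    }
}
```

The grouping, averaging, and filtering each have to be spelled out by hand, which is the part the three SPL cells express directly.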

Written directly in Java, this task takes far more code. In SPL, structured calculations such as filtering, sorting, grouping, and joining are simple. Script files (such as condition.dfx) are deployed together with the Java application and called from Java through the JDBC interface, much like stored procedures. Because the SPL calculation is kept separate from the Java code, it is easy to change when requirements change.

…
 ResultSet result = statement.executeQuery("call condition.dfx");
…

SPL also supports SQL-like usage without script files, directly embedding it in Java.

…
 ResultSet result = statement.executeQuery(
     "=file(\"E:/txt/Students_scores.csv\").import@tc()"
     + ".groups(CLASS;avg(English):avg_En).select(avg_En<70)");
…
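Both calls above assume that a statement object already exists. The following is a minimal sketch of the surrounding JDBC code, not a definitive setup. The driver class name com.esproc.jdbc.InternalDriver and the URL jdbc:esproc:local:// are assumptions here and should be checked against the esProc version you deploy. The class name SplJdbcSketch and the query helper are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SplJdbcSketch {
    // Assumptions: driver class and URL of esProc's embedded JDBC driver.
    static final String DRIVER = "com.esproc.jdbc.InternalDriver";
    static final String URL = "jdbc:esproc:local://";

    /** Runs one statement and copies every row into a column-label -> value map. */
    static List<Map<String, Object>> query(Connection con, String sql) throws SQLException {
        List<Map<String, Object>> rows = new ArrayList<>();
        try (Statement statement = con.createStatement();
             ResultSet result = statement.executeQuery(sql)) {
            ResultSetMetaData meta = result.getMetaData();
            while (result.next()) {
                Map<String, Object> row = new LinkedHashMap<>();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.put(meta.getColumnLabel(i), result.getObject(i));
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(DRIVER);
        } catch (ClassNotFoundException e) {
            System.err.println("esProc JDBC driver not found on the classpath");
            return;
        }
        try (Connection con = DriverManager.getConnection(URL)) {
            // Same call as in the text; condition.dfx must be where esProc can find it.
            for (Map<String, Object> row : query(con, "call condition.dfx")) {
                System.out.println(row);
            }
        }
    }
}
```

Apart from the driver name and URL, this is ordinary JDBC, so the embedded SPL expression can be passed to the same helper in place of the call statement.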

SPL also lets you query data with SQL directly, which is convenient for programmers who already know SQL. For example, suppose state, department, and employee information are stored in three text files (the query works the same way if the three files are replaced with three SPL table sequence objects), and you want to find the employees in New York state whose manager is in California.


A
1 $select e.NAME as ENAME
  from E:/txt/EMPLOYEE.txt as e
    join E:/txt/DEPARTMENT.txt as d on e.DEPT=d.NAME
    join E:/txt/EMPLOYEE.txt as emp on d.MANAGER=emp.EID
  where e.STATE='New York' and emp.STATE='California'

When the data is large, memory becomes the processing bottleneck. SPL can read data through cursors, which, like database cursors, fetch it in small batches. Calculations such as sorting, joining, and grouping are then attached to the cursors, so big data can be processed with a small amount of memory.


A
1 =file("E:/txt/Employees.txt").cursor@t().sortx(EId)
2 =file("E:/txt/Orders.txt").cursor@t().sortx(SellerId)
3 =joinx(A2:O,SellerId; A1:E,EId)
4 =A3.groups(E.Dept;sum(O.Amount))
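To make the idea behind the last two cells concrete, here is a rough plain-Java sketch of joining two inputs that are already sorted by the join key and summing amounts per department, holding only one row from each side in memory at a time. It illustrates the general merge-join technique and is not SPL's implementation. The Iterator arguments stand in for cursors, the Java class and field names are hypothetical, and each EId is assumed to appear only once on the employee side.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

class MergeJoinSketch {
    static final class Employee {
        final int eid;
        final String dept;
        Employee(int eid, String dept) { this.eid = eid; this.dept = dept; }
    }

    static final class Order {
        final int sellerId;
        final double amount;
        Order(int sellerId, double amount) { this.sellerId = sellerId; this.amount = amount; }
    }

    /**
     * Inner merge join followed by a grouped sum.
     * Both iterators must already be sorted ascending by their key (eid / sellerId).
     */
    static Map<String, Double> amountByDept(Iterator<Employee> employees, Iterator<Order> orders) {
        Map<String, Double> totals = new TreeMap<>();
        Employee current = employees.hasNext() ? employees.next() : null;
        while (orders.hasNext() && current != null) {
            Order order = orders.next();
            // Advance the employee side until it catches up with this order's key.
            while (current != null && current.eid < order.sellerId) {
                current = employees.hasNext() ? employees.next() : null;
            }
            // Keys match: add the amount to the employee's department total.
            if (current != null && current.eid == order.sellerId) {
                totals.merge(current.dept, order.amount, Double::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample rows; in practice the iterators would read from files in batches.
        Iterator<Employee> employees = Arrays.asList(
                new Employee(1, "Sales"), new Employee(2, "HR")).iterator();
        Iterator<Order> orders = Arrays.asList(
                new Order(1, 120.0), new Order(2, 30.0), new Order(2, 45.0)).iterator();
        System.out.println(amountByDept(employees, orders));
    }
}
```

Only the per-department totals grow with the data, which is why the sorted inputs can be far larger than memory.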

Big data processing often uses parallel computing to improve efficiency: each thread processes one part of the data, and the per-thread results are combined at the end.

A
1 =file("E:/txt/user_info_reg.csv").cursor@tcm(;4)
2 =A1.groups(id_province;count(~):cnt)

Parallel speedup is very easy to use in SPL. The @m option means parallel computing, and the parameter 4 means 4-way parallel. Compared with the single-threaded code, there is only one more cursor option and one more parameter, which makes parallelism very convenient to use.
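For a sense of what the extra option saves, here is a hand-written sketch of the same pattern in plain Java: split the data into parts, count each part on its own thread, then combine the per-thread results. It is an illustration of the general technique under simple assumptions (the keys are already in memory as a list), not a description of how SPL implements @m. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelGroupCount {
    /** Counts occurrences of each key using up to `ways` worker threads. */
    static Map<String, Long> countByKey(List<String> keys, int ways) {
        ExecutorService pool = Executors.newFixedThreadPool(ways);
        try {
            int chunk = (keys.size() + ways - 1) / ways; // ceiling division
            List<Future<Map<String, Long>>> parts = new ArrayList<>();
            for (int start = 0; start < keys.size(); start += chunk) {
                List<String> slice = keys.subList(start, Math.min(start + chunk, keys.size()));
                // Each task counts its own slice into a private map, so no locking is needed.
                parts.add(pool.submit(() -> {
                    Map<String, Long> local = new HashMap<>();
                    for (String key : slice) local.merge(key, 1L, Long::sum);
                    return local;
                }));
            }
            // Combine step: merge the per-thread counts into one result.
            Map<String, Long> total = new TreeMap<>();
            for (Future<Map<String, Long>> part : parts) {
                part.get().forEach((key, n) -> total.merge(key, n, Long::sum));
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel count failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Made-up province values standing in for the id_province column.
        List<String> provinces = Arrays.asList("Hebei", "Hunan", "Hebei", "Yunnan", "Hunan");
        System.out.println(countByKey(provinces, 4));
    }
}
```

The thread pool, the slicing, and the merge step are all code that the single cursor option replaces in the SPL version.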

Using SPL can greatly simplify the calculation of structured data in Java programs. Examples are summarized as follows:

Loop operations

Accessing members of data set by sequence numbers

Locate operations on ordered sets

Alignment operations between ordered sets

TopN operations

Existence checking

Membership test

Unconventional aggregation

Alignment grouping

Select operation

More calculation examples: Use SPL in applications