Is There Any Open-source Library That Can Achieve Cross-database Computations?

 

Some databases intrinsically support cross-database computations, such as DBLink and Linked Server, but they are not open-source and complicated to configure. And on most occasions, data should be loaded to the local for computation, resulting in low performance. Among the open-source databases that support cross-database computations, Scala can make most use of databases’ computing abilities and offer excellent performance, but it is heavy-weighted, produces lengthy code, and is error-prone when data cannot fit into the memory. Both Calcite and Tablesaw are simple to configure and light-weighted, but they are underdeveloped and support too few functions.

SPL, the Java-based open-source library that also supports cross-database computations, becomes the best alternative.

SPL offers concise code, is simple to configure and light-weighted, and has integration-friendly JDBC interface. SPL also provides special optimizations for handling data that cannot fit into the memory and scenarios that involve sub-databases, greatly simplifying code while ensuring stability and efficiency.

 

SPL can make full use of databases’ computing abilities. The remote database will first execute SQL or the stored procedure, then SPL retrieves the smaller result set to the local and performs cross-database computation. To perform a join between MySQL where data is grouped and summarized, and Oracle, for instance:


A

1

=orcl.query("select EId,Name from employees")

2

=mysql1.query("select SellerId, sum(Amount) subtotal from Orders group by SellerId")

3

=join(A1:O,SellerId; A2:E,EId).new(O.Name:name, O.Dept:dept, E.subtotal:amt)

SPL offers a rich library of functions to achieve a computation with simple and intuitive code after the cross-database computation. To get records according to page number after a cross-database computation, for instance, SPL has the following code, where parameter p_Size is the number of records per page and parameter p_No is the page number:


A

3

…// Cross-database computation

4

=A3.to(p_Size*(p_No-1),p_Size*p_No)

SPL offers JDBC driver to be conveniently invoked by a Java program. For instance:

…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("{call pageQuery(?, ?)}");
statement.setObject(1, 10);
statement.setObject(2, 3);
statement.execute();
…

More examples:


A

B

3


4

=A2.select(amt>1000 && like(dept,\"*S*\"))

//Search

5

=A2.sort(dept,-amt)"

//Sort

6

=A2 .id(dept)

//Distinct

7

=A2.groups(dept;sum(amt))

//Grouping & aggregation

 

SPL provides cursor mechanism to handle data whose size exceeds the available memory space with simple and stable code. For instance, Employees and Orders are large tables and stored in different databases. We first perform a cross-database join and then grouping & aggregation:


A

1

=orcl.cursor("select EId, Dept from Employees order by EId")

2

=mysql1.cursor("select SellerId, Amount from Orders order by   SellerId")

3

=joinx(A2:O,SellerId; A1:E,EId)

4

=A3.groups(E.Dept;sum(O.Amount))

The algorithm performs the join using the order-based merge to obtain higher performance.

 

With computations between sub-databases of same structure, SPL provides parallel processing of simple and concise syntax:


A

B

C

1

=[connect("mysql1"),connect("mysql2"),connect("mysql3")]


/ Connect to multiple mysql databases

2

select * from orders where   amount>=10000


/SQL

3

fork A1

=A3.query@x(A2)

/Execute SQL with parallel processing

4

=A3.conj()


/Concatenate result sets

Apart from databases, SPL also supports file sources, such as CSV and XLS, and various NoSQL sources, as well as mixed computations between different types of data sources or databases.