Cross-database Computing Methods
【Abstract】
If data to be processed is divided to be stored in multiple databases, of different types sometimes, the mixed cross-database computing is needed. This essay compares four methods, DBLink, esProc SPL, Scala and Calcit, for performing cross-database computing in terms of cross-base coding design, deployment and configuration, and computing performance.
For some reasons, data is sometimes divided into segments to be stored into different databases. This will involve mixed cross-database computing. There are various methods based on different theories for handling such a computation. Now let’s look at them through the aspects of coding design, deployment and configuration and computing performance.
DBLink
Since Oracle DBLink handles cross-database computations using Oracle SQL, it generates simple and easy to understand code.
We have an example here. The order table Orders is stored in MySQL. We need to first sum the sale amounts for each SellerId and then join the result set with employee table Employees in Oracle database to correspond each salespeople’s name to their total sales amount.
You just need to write the following SQL with DBlinks: select e.name, o. subtotal from Employees e, (select SellerId, sum(Amount) subtotal from Orders@mysql1 group by SellerId) o where o.Sellerid=e.EId
mysql1 in the above SQL is the MySQL data source name. The relationship between a table and the source holding it is simply represented by the at sign, @。
Though DBLink coding is simple, configurations are complicated. Users need to install components from different vendors, make sure their versions are compatible, and do a lot of other configurations. Except for a few ones, you need to edit files for all other configuration items. Below are the main steps:
1. Configure Oracle Client and MySQL ODBC to ensure that both databases can be accessed independently;
2. Install MySQL database gateway, that is, Oracle Database Gateway for ODBC;
3. Launch Oracle Heterogeneous Service (HA) and related agents;
4. Create heterogeneous database access ODBC configuration file and enter MySQL ODBC-related information into it;
5. Configure MySQL data source in listener.ora and tnsnames.ora respectively. Make sure their configurations are consistent.
6. Use DDL command to create a DB link pointing to MySQL (let’s name it mysql1, for example).
But the performance of DBLink is unsatisfactory. The primary cause is that the method cannot make good use of other databases’ computing capabilities. Probably to ensure syntax compatibility, Oracle designers prescribe that remote table computations be translated into two parts. One part covers three types of SQL queries to be executed in a remote server. They are non-functional conditional queries, calculated column generation and full data set retrieval. The other part includes most of the computations to be executed locally. They are functional queries, calculated column generation, sorting, grouping & aggregation, etc. Many computations that can be efficiently handled remotely are transformed to the local, yet the execution of them is slower. Even a grouping & aggregation operation that returns a result set smaller than the source data size takes up transmission resources to retrieve data to the local for computation.
In the above join example, the subquery select SellerId, sum(Amount) subtotal from Orders@mysql1 group by SellerId involves the computation of remote tables. A small result set will be transmitted to the local if the computation is performed remotely. But, since Oracle translates the query into two parts, only select SellerId, Amount from Orders, the whole source data retrieval, is executed on the remote machine. The remotely retrieved data is stored locally in a temporary table ($Orders, for instance) and the second part of the query, select SellerId, sum(Amount) subtotal from $Orders group by SellerId, is then executed on the local machine.
Similar cross-database computation handling methods are also employed by DB2 (Federated Database) and MSSQL (Linked Server). All work similarly and has the same pros and cons.
Scala
Scala was initially intended to be a general-purpose programming language, yet it is now mostly employed to compute structured data. The language supplies three data types (DataFrame, DataSet and RDD) and supports two types of syntax (DSL and SQL). Both DSL and SQL can handle cross-database computations and switch from one to the other. For example, the above example computing goal can be achieved with the following Scala program:
package test
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
object testJoin {
def main(args: Array[String]): Unit = {
//create spark session on local
val spark = SparkSession.builder()
.master("local")
.appName("example")
.getOrCreate()
//load Employees from oracle
val Employees = spark.read
.format("jdbc")
.option("url", "jdbc:oracle:thin:@127.0.0.1:1521:ORCL")
.option("query", "select EId,Name from employees")
.option("user", "scott")
.option("password", "password")
.load()
//load Orders group from MySql
val O = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://127.0.0.1:3306/mysql1")
.option("query", "select SellerId, sum(Amount) subtotal from Orders group by SellerId")
.option("user", "root")
.option("password", "password")
.load()
//join using dataframe
val join=O.join(Employees,O("SellerId")===Employees("EId"),"Inner")
.select("Name","subtotal")
join.show()
}
}
Though the program is long, the core code is simple. Scala runs on JVM, so Java and Scala can call each other with simple code.
The configurations for Scala’s cross-database computing method are much easier than those for Oracle’s. Users just need to place the database driver JARs in the class path. This shows that Scala’s method for handling cross-database computations is more open and professional. Oracle’s, on the other hand, is closed and immature.
The performance of Scala’s method is better, too because it can make good use of database’s computing ability. In the above Scala program, the remote table computations are actually executed remotely, such as the query select SellerId, sum(Amount) subtotal from Orders group by SellerId, which is executed on the MySQL side. Rather than whole source data, the remote computations return a few result sets and thus transmission speed becomes much faster.
Here the program is for pure in-memory computations in cases when the size of result sets returned by database computations is relatively small. If the returned data set size is relatively large, O. persist()function should be used to start the auto-interchange between memory and external disk storage. Though Scala is competent to handle in-memory computing, it does not have a well-developed external memory computing mechanism and malfunctions are not uncommon. It would be better that O. persist() function is not used.
Calcite
Calcite is an open-source project licensed by Apache Software Foundation. Its query planning engine is intended to be “one size fits all” for all data computing platform and data sources. The framework supports various SQL styles (not the SQL dialects executed actually). To implement the above cross-database computation in MySQL style, for example:
package Test;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class testCalcite {
public static void main(String[] args)throws Exception {
Properties config = new Properties();
config.put("model", "d:\\ds.json");
//using built-in lexical policy of the MySQL style
config.put("lex", "MYSQL");
String sql ="select e.name, o. subtotal from orcl.Employees e, (select SellerId, sum(Amount) subtotal from mysql1.Orders group by SellerId) o where o.Sellerid=e.EId";
Connection con = DriverManager.getConnection("jdbc:calcite:", config));
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(sql));
if(con!= null) con.close();
}
}
In the program, both orcl and mysql1 are data source names that point to Oracle and MySQL. Calcite syntax is simple and easy to understand and is seemingly similar to the ordinary JDBC.
The Calcite configurations are simple. Besides putting database jars in the class path, users only need to edit a data source file, which is ds.json in the above program.
Same as Oracle, Calcite cannot use the computing abilities of databases of different structures. So it’s performance is unsatisfactory, too. A Calcite style SQL query will be ultimately divided into two parts after multi-layer translations. One part includes the translation of a SQL dialect (such as MySQL’s SQL implementation) and related execution in the database. The other part is the local computations after result sets of database computations are retrieved. The dialect translation is hard and is done roughly because Calcite needs to cater for different types of databases. For example, group by is translated into record retrieval instead of the actual grouping and aggregation. The database computing ability is thus wasted. The translation performed locally is relatively manageable and does not need to take the compatibility issue into account, so an operation can be translated into what it truly is.
There is another problem about Calcite. The framework only supports pure in-memory computations. Memory overflow will happen if the data set size is too large.
esProc SPL
esProc is a specialized open-source structured data computation engine and is also made to be “one size fits all”. It is lighter, thinner, and simpler. For example, to handle the above cross-database computation with esProc SPL, we have the following script file (cross.dfx):
A |
|
1 |
=orcl.query("select EId,Name from employees") |
2 |
=mysql1.query("select SellerId, sum(Amount) subtotal from Orders group by SellerId") |
3 |
=join(A1:O,SellerId; A2:E,EId) |
4 |
=A3.new(O.Name,E.subtotal) |
The script can be executed or debugged on esProc IDE, and can be embedded in to a Java program through JDBC, as shown below:
package Test;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class test1 {
public static void main(String[] args)throws Exception {
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("call cross()");
if(connection != null) connection.close();
}
}
Users are allowed to write a query with the Java program if the computation is simple, without editing an extra script file, as the following shows:
ResultSet result = statement.executeQuery("=join(orcl.query(\"select EId,Name from employees\"):O,SellerId; mysql1.query(\"select SellerId, sum(Amount) subtotal from Orders group by SellerId\"):E,EId).new(O.Name,E.subtotal)");
In terms of performance, SPL and Scala are equal. Both can make most use of the database computing ability to achieve high performance. In the above code, MySQL just needs to return a few small result sets instead of the whole source data.
esProc also offers the cursor mechanism to handle computations where the size of source data or the result sets is too large. The mechanism, well-developed and stable, is made for handling huge data volume. To Group Employees table by Dept filed and summarize Amount field of Orders table, for example:
A |
|
1 |
=orcl.cursor("select EId, Dept from Employees order by EId") |
2 |
=mysql1.cursor("select SellerId, Amount from Orders order by SellerId") |
3 |
=joinx(A2:O,SellerId; A1:E,EId) |
4 |
=A3.groups(E.Dept;sum(O.Amount)) |
The above SPL code increases performance by doing the join through an order-based merge.
For special cross-database computing scenarios, such as those involving sub-databases, SPL= can use parallel processing to considerably enhance the performance.
A |
B |
C |
|
1 |
=[connect("mysql1"),connect("mysql2"),connect("mysql3")] |
/ Connect to multiple mysql databases |
|
2 |
select * from orders where amount>=10000 |
/SQL |
|
3 |
fork A1 |
=A3.query(A2) |
/ Execute SQL with parallel processing |
4 |
=A3.conj() |
/ Concatenate result sets returned from the multiple threads |
Configurations for SPL cross-database computation handling method is simple, too. Users only need to add the database jars in the class path and set database data source name.
In a nutshell, all the four methods have simple syntax; all, except for DBLink, boasts convenient configurations; Scala and SPL have more excellent performance; and esProc SPL outperforms the other three by supporting more class libraries and having more mature computing mechanisms.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version