Java Embedded Data Engine: From SQLite to SPL

 

Though a lot of embedded data engines for Java are available, the competent and effective ones are rare. Redis has poor computing ability and can only handle simple query scenarios. Spark has complex and heavy framework and is inconvenient to deploy and maintain. Embedded databases like H2\HSQLDB\Derby have lightweight frameworks but they lack sufficient computing abilities – they even do not support the basic window functions.

SQLite achieves the best balance between a good framework and satisfactory computing ability. It is now a widely-used Java embedded data engine.

SQLite suits regular basic computing scenarios

SQLite has lightweight framework. The library encapsulates its C-based core well and presents itself as a compact Jar package, which can be integrated in a Java program conveniently. It offers JDBC driver for invocation by Java:

Connection connection = DriverManager.getConnection("jdbc:sqlite::memory:");
Statement st = connection.createStatement();
st.execute("restore from d:/ex1");
ResultSet rs = st.executeQuery("SELECT * FROM orders");

SQLite offers standard SQL syntax and can handle regular data processing and computing tasks. More importantly, it supports window functions to be able to achieve many intra-group computations conveniently. In short, SQLite has stronger computing ability than other embedded databases.

SELECT x, y, row_number()OVER (ORDER BY y) AS row_number FROM t0 ORDER BY x;

SELECT a, b, group_concat(b, '.') OVER (ORDER BY a ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS group_concat FROM t1;

SQLite is not strong enough for handling complex scenarios

With noticeable advantages, SQLite also has its disadvantages when used to deal with complex computing tasks.

A Java application faces various data sources, such as csv, RDB, Excel and Restful. SQLite only takes care of simple scenario, that is, offering program for loading text data like csv from command line:

.import --csv --skip 1 --schema temp /Users/scudata/somedata.csv tab1

It does not provide any driver for loading data from most of other data sources. Hardcoding data loading needs to invoke the command line many times. The process is complicated and non-real-time.

Take loading data from an RDB as an example. The general way is to execute the command line in Java to convert the RDB table to csv; access SQLite through JDBC and create the desired table structure; execute command line in Java to export csv to SQLite; and finally, create index on the new table in an effort to increase performance. The approach is too rigid to write code easily when we want to define table structure and name flexibly or obtain to-be-loaded data through computation.

Same complicated conversion process with other data sources since SQLite also cannot load data from them directly.

SQL is more like the natural language. This requires a relatively low costs of learning and makes it easy to achieve simple computations. But SQL is not good at handling complex computations, such as complicated set-oriented computations, order-based computations, joins and multistep computations. As SQLite uses SQLto perform computations, it inherits both SQL’s merits and demerits and produces complicated and unreadable code if we try to use it to handle complex computing tasks.

To count the longest days when a stock rises continuously, SQL has the following code:

select max(continuousDays)-1
from (select count(*) continuousDays
from (select sum(changeSign) over(order by tradeDate) unRiseDays
from (select tradeDate,
case when price>lag(price) over(order by tradeDate) then 0 else 1 end changeSign from AAPL) )
group by unRiseDays)

This is not a difficulty unique to SQLite. In fact, all SQL database are not good at doing these computations due to incomplete set-orientation, and lack of sequence numbers and object reference mechanism.

Business logic consists of structured data computation and procedural control. SQLite has structured data computing ability thanks to its support of SQL, but, without stored procedures, it does not have procedural control ability and cannot implement general business logics independently. Usually, it turns to the Java main program’s branching and loop statements. Since Java does not have specialized structured data object to hold SQLite data tables and records, the SQLite coding involves inconvenient conversion and uneven procedure that result in low efficiency.

As we mentioned above, C programs constitute the core of SQLite. This enables the library’s integration in a Java application but the integration is not seamless. As a result, it needs time-consuming conversion in order to exchange data with the Java main program. When a large amount of data is involved or data exchanges are frequent, performance noticeably decreases. Still because of SQLite’s C-language core, the Java framework becomes less consistent and robust.

In view of this, esProc SPL native to JVM is a better partner for Java applications.

SPL supports various data sources

esProc SPL is the open-source embedded data engine under JVM. It has simplistic framework, can load any data source directly, can be integrated in and invoked by Java applications through JDBC driver, and handle subsequent computations conveniently.

With simplistic frameworkand without special service, SPL can be deployed in Java environment by importing the corresponding Jars.

SPL loads any data source directlyand in real-time with concise code and simple procedure. To load data stored in Oracle, for instance:


A

1

=connect("orcl")

2

=A1.query@x("select OrderID,Client,SellerID,OrderDate,Amount from orders order by OrderID")

3

>env(orders,A2)

With a CSV file that SQLite is good at loading, SPL loads it directly using its built-in function rather than from the external command line. The loading is stable and efficient and done with extremely short code:

=T("/Users/scudata/somedata.csv")

SPL supports various data sources. Except for RDB and csv, SPL also supports files like txt and xls, NoSQL like MongoDB, Hadoop,Redis,ElasticSearch,Kafkaand Cassandra, and hierarchical data such as WebService XML and Restful JSON. To load a HDSF file to the memory, for instance:


A

1

=hdfs_open(;"hdfs://192.168.0.8:9000")

2

=hdfs_file(A1,"/user/Orders.csv":"GBK")

3

=A2.cursor@t()

4

=hdfs_close(A1)

5

>env(orders,A4)


SPL offers integration-friendly JDBC driver. The amount of data loaded is generally large and JDBC only runs once at the startup of the application. Then we just need to save the above data loading process as a SPL script file and invoke the file name in Java as we call a stored procedure:

Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("{call init()}");
statement.execute();

SPL has powerful computational capability

SPL offers a wealth of computing functions that enable it to achieve daily computations effortlessly. It supports a variety of advanced syntax and provides a large number of date functions and string functions that make it convenient to accomplish complex computations, including order-based computations, set-oriented computations, multistep computations, joins and business logics containing procedure control, which are hard to deal with in SQL.

Rich computing functionsenable SPL to handle various daily computations effortlessly:


A

B

1

=Orders.find(arg_OrderIDList)

// Multi-key-based search

2

=Orders.select(Amount>1000 && like(Client,\"*S*\"))

// Fuzzy query

3

= Orders.sort(Client,-Amount)

// Sort

4

= Orders.id(Client)

// Distinct

5

=join(Orders:O,SellerId; Employees:E,EId)
.new(O.OrderID, O.Client,O.Amount,E.Name,E.Gender,E.Dept)

// Join

SPL offers SQL syntax of SQL-92 standard. To perform grouping & aggregation, for instance:

$select year(OrderDate) y,month(OrderDate) m, sum(Amount) s,count(1) c
from {Orders}
Where Amount>=? and Amount<? ;arg1,arg2

SPL designs convenient function syntax like function options and hierarchical parameters that power SPL with excellent computational capability. It allows functions with similar functionalities to use same name and distinguishes them with different options. The basic functionality of select function is to filter data. If we need to get the first eligible record, we use @1 option:

T.select@1(Amount>1000)

When the size of data to be filtered is large, we use @m option to implement parallel processing to increase performance:

T.select@b(Amount>1000)

We use @b option to perform order-based grouping, which groups records ordered by the grouping field by placing neighboring ones with same grouping field values to the same group:

T.groups@b(Client;sum(Amount))

Function options can work together:

Orders.select@1b(Amount>1000)

Usually, parameters in a structured computation function are complicated. SQL, for instance, uses a lot of keywords to divide parameters of a statement into multiple groups, causing inconsistent statement structure. SPL has hierarchical parameters. It employs semicolon, comma and colon to identify parameters to three levels:

join(Orders:o,SellerId ; Employees:e,EId)

SPL boasts rich date and string functions. Besides regular functions like those for getting the date before or after a specified dateand string truncation, SPL also supplies richer date functions and string functions, far more than SQL and much more powerful than it. This enables SPL to implement same computations with significantly shorter code. For example:

Get the date before or after a specified number of quarter: elapse@q("2020-02-27",-3) // Return 2019-05-27

Get the date N workdays after: workday(date("2022-01-01"),25) // Return 2022-02-04

String functions:

Check whether a string all consists of letters: isdigit("12345") // Return true

Get a string before a specified substring: substr@l("abCDcdef","cd") // Return abCD

Split a string into an array of substrings by vertical bar: "aa|bb|cc".split("|") // 返回 ["aa","bb","cc"]

SPL also offers functions to get a date before or after a number of years, get which quarter the date belongs to, split a string according to a regular expression, get the where or select part of a SQL statement, get words from a string, split HTML by the specific marker, etc.

SPL simplifies the traditional complex code for order-based computations. The inter-row computations are usually not easy, like the calculations of link relative ratio and YOY. SPL uses "field[relative position]" to reference another row or a value in it. This produces concise code. It can also automatically handle special cases, such as out-of-range values in an array. All these make SPL more convenient than SQL window functions. To append a computed column rate containing amount growth rate for each order, for instance:

=T.derive(AMOUNT/AMOUNT[-1]-1: rate)

Through the cooperation of position-oriented expressions and order-oriented functions, SPL can accomplish order-based computations that SQL finds hard to handle. To find students who attend school for at least 7 hours each day in the continuous 4 weeks according to the attendance table:


A

1

=Student.select(DURATION>=7).derive(pdate@w(ATTDATE):w)

2

=A1.group@o(SID;~.groups@o(W;count(~):CNT).select(CNT==7).group@i(W-W[-1]!=7).max(~.len()):weeks)

3

=A2.select(weeks>=4).(SID)

SPL streamlines conventionally complex code for set-oriented computations. SPL has thorough set-orientation, which, along with agile syntax and powerful set-oriented functions, enables it to express complex logics of set-oriented computations with unprecedentedly simple code. To find employees who are below the average age of their department for each department, for instance:


A

1

=Employees.group(DEPT; (a=~.avg(age(BIRTHDAY)),~.select(age(BIRTHDAY)<a)):YOUNG)

2

=A1.conj(YOUNG)

To count the longest days when a stock rises continuously:


A

1

=a=0,AAPL.max(a=if(price>price[-1],a+1,0))

SPL streamlines traditionally complex code for associative computations. SPL uses object reference to represent association, allowing developers to intuitively access the associated table through a dot. This avoids mess and complexity caused by the use of JOIN operator and is particularly fit for handling complex multilevel associations and self-associations. To find male employees whose managers are females according to the employee table, for instance:

=employees.select(gender:"male",dept.manager.gender:"female")

It is convenient to implement stepwise coding in SPL. With thorough set-orientation, SPL can use a variable to represent a set. This makes it fit for implementing stepwise programming. It is really easy to achieve computations that SQL needs nested queries to accomplish. To find n big customers whose order amounts to at least half of the total and sort the corresponding records by amount in descending order, for instance:


A

B

2

=sales.sort(amount:-1)

/Sort records by order amount in descending order (can be done in SQL)

3

=A2.cumulate(amount)

/Get a sequence of cumulative amounts

4

=A3.m(-1)/2

/Calculate the final cumulative amount, i.e., the total

5

=A3.pselect(~>=A4)

/Get position where the amount reaches half of the total

6

=A2(to(A5))

/Get records by position

SPL has procedural control syntax. It offers procedural control statements, and by working with SPL’s built-in structured data object, can implement various business logics conveniently.

Banching statements:


A

B

2


3

if T.AMOUNT>10000

=T.BONUS=T.AMOUNT*0.05

4

else if T.AMOUNT>=5000 && T.AMOUNT<10000

=T.BONUS=T.AMOUNT*0.03

5

else if T.AMOUNT>=2000 && T.AMOUNT<5000

=T.BONUS=T.AMOUNT*0.02

Loop statements:


A

B

1

=db=connect("db")


2

=T=db.query@x("select * from sales where SellerID=? order by OrderDate",9)

3

for T

=A3.BONUS=A3.BONUS+A3.AMOUNT*0.01

4


=A3.CLIENT=CONCAT(LEFT(A3.CLIENT,4), "co.,ltd.")

5


 

Similar to loop in Java, SPL also uses the break keyword to exit (terminate) the current loop body and the next keyword to skip (ignore) the current round of loop.

SPL boasts great computing performance. For in-memory computations, SPL, besides regular primary-key- and index-based algorithms, also offers many high-performance data structures and algorithms such as pre-join strategy, efficient and convenient parallel processing and pointer-based data reuse. This enables it to have noticeably higher performance than many SQL-based memory databases and use less memory space as well.

The optimized framework

SPL supports JDBC driver. It allows placing code outside of the Java program to reduce system coupling and embedding the code in Java to enable simple invocation. SPL supports interpreted execution and hot swap to make code easy to migrate, manage and maintain. It also supports mixed computations involving both in-memory data and disk data.

Storing computing code separately from the Java program can have loose coupling. SPL places code outside of the Java program and invoke it by the file name. This is both database-independent and Java-independent, naturally decoupling business logic from the front-end application.

With a short piece of computing code, we can combine lines of code into one statement and write it within the Java program:

Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = conn.createStatement();
String arg1="1000";
String arg2="2000"
ResultSet result = statement.executeQuery(=Orders.select(Amount>="+arg1+" && Amount<"+arg2+").groups(year(OrderDate):y,month(OrderDate):m; sum(Amount):s,count(1):c)");

Interpreted execution and hot swap. Business logics are various, complex and volatile. An ideal system framework should be able to handle changeable business logics. SPL is Java-based interpreted language that executes in real-time without recompilation. Any changes of the script take effect instantly with no downtime. The hot-swappable feature makes SPL particularly suitable for dealing with volatile business logics.

Ease of migration. SPL retrieves data from the database according to the source name. To migrate code from one data source to another, you just need to change the data source configurations in SPL’s configuration file rather than modifying the code. SPL supports dynamic data source, which can switch from one to another through a parameter or macro, to achieve ease of migration. To make the migration more convenient, SPL also provides non-database-specific, standard SQL syntax. You can use sqltranslate function to translate standard SQL into a mainstream SQL dialect and then use query function to execute.

Ease of administration and operation. As SPL supports outside-database computations, SPL code can be managed by a third-party tool, which facilitates teamwork. SPL scripts can be stored in different file directories, which is convenient and flexible and helps to achieve cost-effective management. SPL requires database privileges similar to Java, which won’t affect data security.

Support of computations between in-memory data and disk data. When there is too much data to be wholly loaded into the memory and, at the same time, an in-memory table is involved, we can perform the mixed computation between memory and disk in SPL. Suppose the primary table orders is already loaded to the memory and the large detail table orderdetail is stored in a text file, we are trying to perform a join between the primary table and the detail table:


A

1

=file("orderdetail.txt").cursor@t()

2

=orders.cursor()

3

=join(A1:detail,orderid ; A2:main,orderid)

4

=A3.groups(year(main.orderdate):y; sum(detail.amount):s)

SQLite is simple and convenient to use, but it has complicated data loading process and insufficient computing ability. SPL is also lightweight, supports various data sources, as well as has remarkable computational capability. It offers a rich number of computing functions, which enable to conveniently handle complex computations that SQL is not good at handling. SPL also supplies various framework optimization strategies. It allows placing code outside the Java program or embedding it into it, supports interpreted execution and hot swap for ease of migration, administration and maintenance, and can achieve mixed computations between in-memory data and disk data.