Is There Any Alternative Technology to Spark?
HANA is a popular in-memory database and could in theory replace Spark, but it is not open source, which keeps many users away. SQLite, which can also run as an in-memory database, is open source, but it only supports embedded calls, which severely limits data size and computing performance. Redis is open source, fast, and able to handle huge volumes of data, but its computing capabilities are very weak, and in-memory computations on it require a great deal of hand-written code.
The best Spark alternative is esProc SPL.
SPL is an open-source, in-memory big data computing library that uses fewer memory resources, offers higher performance and more powerful computational capabilities, and achieves mixed computations involving data in both external storage and the memory in a convenient and stable way.
SPL provides transparent and intuitive syntax for in-memory big data computing. After starting all cluster nodes, we load data from the database to the memory using the following SPL script:
| | A | B |
|---|---|---|
| 1 | =["192.168.1.11:8281","192.168.1.12:8281"] | |
| 2 | fork [[0,2000000],[2000000,4000000]];A1 | =connect("orcl").query@x("select OrderID,Amount,Client,OrderDate from Orders where OrderID>=? And OrderID<?", A2(1),A2(2)).keys(OrderID) |
| 3 | =env(Order,B2) | |
| 4 | return | |
With the data loaded, in-memory computations use equally simple and intuitive syntax:
| | A |
|---|---|
| 1 | =["192.168.1.11:8281","192.168.1.12:8281"] |
| 2 | =Orders=memory(A1,Order) |
| 3 | =Orders.select(Amount>=arg1 && Amount<arg2) |
| 4 | =A3.groups(year(OrderDate):y,month(OrderDate):m; sum(Amount):s,count(1):c) |
Besides cluster computing, SPL also supports single-machine services and embedded calls.
SPL offers a JDBC driver so that it can be integrated into a Java program conveniently. Because the data is generally large, loading is usually performed once, when the application starts. In-memory computations, by contrast, run repeatedly under high concurrency. A Java program invokes such a computation by calling the script file by name, using the same syntax as a stored procedure call. For instance:
…
// Load the esProc JDBC driver and open a connection
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://onlyServer=true");
// Call the script file MemoryQuery by name, passing its two parameters
CallableStatement statement = conn.prepareCall("{call MemoryQuery(?, ?)}");
statement.setObject(1, 1000);
statement.setObject(2, 2000);
statement.execute();
...
SPL supplies a wealth of library functions to implement in-memory computations conveniently. Here are more examples:
| | A | B |
|---|---|---|
| 1 | =Orders.find(arg_OrderIDList) | //Multi-key-value-based search |
| 2 | =Orders.select(Amount>1000 && like(Client,"*S*")) | //Fuzzy query |
| 3 | =Orders.sort(Client,-Amount) | //Sort |
| 4 | =Orders.id(Client) | //Distinct |
| 5 | =join(Orders:O,SellerId; Employees:E,EId) | //Join |
SPL has outstanding computational capabilities that simplify computations with complex logic, including stepwise, order-based and post-grouping computations, and it handles with ease computations that are hard to express in SQL. Here is an example: find the top n customers whose combined sales amount reaches at least half of the total, and list them by amount in descending order.
| | A | B |
|---|---|---|
| 1 | =Orders.groups(Client;sum(Amount):subtotal) | /Group and summarize by customer |
| 2 | =A1.sort(subtotal:-1) | /Sort by amount in descending order |
| 3 | =A2.cumulate(subtotal) | /Get the sequence of cumulative totals |
| 4 | =A3.m(-1)/2 | /The last cumulative total is the final total |
| 5 | =A3.pselect(~>=A4) | /Get position of record where the cumulative total reaches at least half of the total |
| 6 | =A2(to(A5)) | /Get values by position |
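For readers more used to Java than to SPL, the same stepwise logic can be sketched as follows. This is only an illustration of the algorithm, not esProc code: the class `TopCustomers`, the `Order` type and the method `topHalf` are hypothetical names introduced here, and the sample data is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration of the SPL steps above in plain Java; not part of esProc.
final class TopCustomers {

    // An order reduced to the two fields this computation needs.
    static final class Order {
        final String client;
        final double amount;

        Order(String client, double amount) {
            this.client = client;
            this.amount = amount;
        }
    }

    // Returns the biggest customers, largest first, whose combined sales
    // first reach at least half of the grand total.
    static List<String> topHalf(List<Order> orders) {
        // Step 1: group by customer and sum the amounts.
        Map<String, Double> subtotals = orders.stream()
                .collect(Collectors.groupingBy(o -> o.client,
                        Collectors.summingDouble(o -> o.amount)));

        // Step 2: sort the customers by subtotal in descending order.
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(subtotals.entrySet());
        sorted.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        // Step 4: half of the grand total.
        double half = 0;
        for (Map.Entry<String, Double> e : sorted) {
            half += e.getValue();
        }
        half /= 2;

        // Steps 3, 5 and 6: accumulate the subtotals and stop at the first
        // position where the running total reaches half of the grand total.
        List<String> result = new ArrayList<>();
        double running = 0;
        for (Map.Entry<String, Double> e : sorted) {
            running += e.getValue();
            result.add(e.getKey());
            if (running >= half) {
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data: subtotals are A=400, B=300, C=200, D=100.
        List<Order> orders = Arrays.asList(
                new Order("A", 250), new Order("B", 300), new Order("A", 150),
                new Order("C", 200), new Order("D", 100));
        System.out.println(topHalf(orders));
    }
}
```

With this sample data the total is 1000, so the method returns customers A and B, whose combined 700 is the first running total to reach 500.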
In addition to regular primary key and index-based methods, SPL offers many high-performance data structures and algorithms, which give it much better performance than SQL-based in-memory databases while occupying less memory. A single machine running SPL can often accomplish computations that would otherwise need a HANA or Spark cluster.
SPL supports pointer-style data reuse, which remarkably reduces memory usage. SPL records take part in computations as pointers (references), so one table can be reused across the steps of a stepwise algorithm, across different algorithms and across concurrent computations. SQL, by contrast, needs to copy the desired records for each computation.
SPL supports the pre-join technique to increase computing performance. SQL achieves pre-joins by copying records, which consumes too much memory. SPL implements them through pointer-style data reuse, which takes little extra memory during the loading phase. During an in-memory computation, the dot operator "." then references a related field value directly, with no time-consuming join operation, as shown below:
=callRecord.sum(OUTID.BRANCHID.OUTCOST+INID.BRANCHID.INCOST+OUTID.AGENTID.OUTCOST+INID.AGENTID.INCOST)
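The idea behind the pre-join can be sketched in Java with ordinary object references. This is a minimal illustration under assumptions, not SPL's internal implementation: the classes `Branch`, `Account` and `CallRecord`, the method names and the reduced set of fields are all hypothetical. Foreign keys are resolved once at load time, and the computation afterwards only follows references.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: foreign keys are resolved to object references once, at load time.
final class PreJoinSketch {

    static final class Branch {
        final int id;
        final double outCost;
        final double inCost;

        Branch(int id, double outCost, double inCost) {
            this.id = id;
            this.outCost = outCost;
            this.inCost = inCost;
        }
    }

    static final class Account {
        final int id;
        final Branch branch; // a reference to the shared Branch, not a copy of its row

        Account(int id, Branch branch) {
            this.id = id;
            this.branch = branch;
        }
    }

    static final class CallRecord {
        final Account outAcc;
        final Account inAcc;

        CallRecord(Account outAcc, Account inAcc) {
            this.outAcc = outAcc;
            this.inAcc = inAcc;
        }
    }

    // Load phase: each key is looked up once and replaced by a reference.
    // A key missing from the map yields a null reference in this sketch.
    static CallRecord link(int outId, int inId, Map<Integer, Account> accounts) {
        return new CallRecord(accounts.get(outId), accounts.get(inId));
    }

    // Compute phase: plain field navigation, with no lookup and no join.
    static double totalBranchCost(List<CallRecord> calls) {
        double sum = 0;
        for (CallRecord c : calls) {
            sum += c.outAcc.branch.outCost + c.inAcc.branch.inCost;
        }
        return sum;
    }

    public static void main(String[] args) {
        Branch b1 = new Branch(1, 1.0, 0.5);
        Branch b2 = new Branch(2, 2.0, 1.0);
        Map<Integer, Account> accounts = new HashMap<>();
        accounts.put(10, new Account(10, b1));
        accounts.put(20, new Account(20, b2));

        List<CallRecord> calls = new ArrayList<>();
        calls.add(link(10, 20, accounts));
        calls.add(link(20, 10, accounts));
        System.out.println(totalBranchCost(calls));
    }
}
```

Every `CallRecord` that involves the same account shares one `Account` and one `Branch` object, which is why the linking step adds only a reference per foreign key instead of a copied row.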
Apart from regular pre-joins with dimension tables, SPL can also pre-join a primary table with its sub table, and it gets higher performance here by exploiting the fact that the data is ordered. The dot operator "." again references the related fields directly during in-memory computations, as shown below:
=relation.groups(year(main.orderdate):fieldyear, month(main.orderdate):fieldmonth; sum(sub.price*sub.amunt):subtotal,count(1):quantity)
SPL designs the composite table, a high-performance storage format that supports compression and columnar storage, is naturally ordered, and has built-in primary keys and indexes. Compared with regular storage formats, it is information-dense and fast to read, and thus substantially speeds up data loading. For instance, to start the SPL service on a single machine and load data from a composite table into memory:
| | A |
|---|---|
| 1 | =file("order.ctx").open().cursor@m(OrderID,Amount,Client,OrderDate) |
| 2 | =A1.memory(OrderID) |
| 3 | >env(Orders,A2) |
SPL supports memory compression, which allows more data to be loaded into memory and decompresses it at computation time. The loading code is the same as before except for the @z option:
| | A |
|---|---|
| 1 | =file("order.ctx").open().cursor@m(OrderID,Amount,Client,OrderDate) |
| 2 | =A1.memory@z(OrderID) |
| 3 | >env(Orders,A2) |
SPL supports parallel processing, which makes full use of multi-core CPUs to boost the performance of in-memory computations. To enable it, just add the @m option to a function:
| | A |
|---|---|
| 1 | =Orders.select@m(Amount>=arg1 && Amount<arg2) |
| 2 | =A1.groups(year(OrderDate):y,month(OrderDate):m; sum(Amount):s,count(1):c) |
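As a point of comparison, the same filter-then-group computation can be sketched with Java parallel streams. This only illustrates the general technique of multi-core filtering and aggregation; it says nothing about how the @m option works internally. The class `ParallelFilterGroup`, the `Order` type and the year*100+month key are hypothetical choices made for this sketch.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch of a multi-core filter followed by a grouped aggregation.
final class ParallelFilterGroup {

    static final class Order {
        final double amount;
        final LocalDate date;

        Order(double amount, LocalDate date) {
            this.amount = amount;
            this.date = date;
        }
    }

    // Keeps orders with lo <= amount < hi, then groups them by year and month.
    // The key is year * 100 + month (for example 202401); each value holds
    // both the sum and the count of the amounts in that month.
    static Map<Integer, DoubleSummaryStatistics> monthlyStats(List<Order> orders,
                                                              double lo, double hi) {
        return orders.parallelStream()
                .filter(o -> o.amount >= lo && o.amount < hi)
                .collect(Collectors.groupingBy(
                        o -> o.date.getYear() * 100 + o.date.getMonthValue(),
                        TreeMap::new,
                        Collectors.summarizingDouble(o -> o.amount)));
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<Order> orders = Arrays.asList(
                new Order(1500, LocalDate.of(2024, 1, 5)),
                new Order(1200, LocalDate.of(2024, 1, 20)),
                new Order(500, LocalDate.of(2024, 1, 21)),
                new Order(1800, LocalDate.of(2024, 2, 1)),
                new Order(2000, LocalDate.of(2024, 2, 2)));
        monthlyStats(orders, 1000, 2000).forEach((month, stats) ->
                System.out.println(month + " sum=" + stats.getSum() + " count=" + stats.getCount()));
    }
}
```

The stream version needs an explicit collector and key encoding, whereas the SPL script above expresses the same intent by adding one option to `select`.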
SPL supports various external data sources, including RDBs, files such as txt/csv/xls, NoSQL and big data sources such as MongoDB, Hadoop, Redis, ElasticSearch, Kafka and Cassandra, and multilevel data such as WebService, XML, RESTful and JSON, which avoids the trouble of data loading and format conversion. To load an HDFS file into memory, for instance:
| | A |
|---|---|
| 1 | =hdfs_open(;"hdfs://192.168.0.8:9000") |
| 2 | =hdfs_file(A1,"/user/Orders.csv":"GBK") |
| 3 | =A2.cursor@t() |
| 4 | =hdfs_close(A1) |
| 5 | =A3.memory(OrderID).index() |
| 6 | >env(orders,A5) |
Mixed computations over data held both on external storage and in memory are convenient in SPL, which makes it possible to work with data much larger than the memory capacity. These mixed computations are simple and stable because the language uses one uniform data type. Suppose the primary table orders is already loaded into memory and the large detail table orderdetail is stored as a composite table (or in any other source). To associate the two tables, we write the following SPL code:
| | A |
|---|---|
| 1 | =file("orderdetail.ctx").open().cursor(orderid,product,amount) |
| 2 | =orders.cursor() |
| 3 | =join(A1:detail,orderid ; A2:main,orderid) |
| 4 | =A3.groups(year(main.orderdate):y;sum(detail.amount)) |
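The general shape of such a mixed computation can be sketched in Java as a streaming pass over the external detail rows, with each row matched against the in-memory table. This is an assumption-laden illustration, not a description of how SPL executes the join: the class `MixedJoinSketch`, the `Detail` type and the use of a hash map lookup are all choices made for this sketch, and an `Iterator` stands in for the external cursor.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: stream external detail rows past an in-memory main table.
final class MixedJoinSketch {

    static final class Detail {
        final int orderId;
        final double amount;

        Detail(int orderId, double amount) {
            this.orderId = orderId;
            this.amount = amount;
        }
    }

    // orderYearById: the in-memory main table, reduced here to orderId -> order year.
    // details: a forward-only stream of detail rows; only one row is held at a time,
    // so the detail table may be far larger than memory.
    static Map<Integer, Double> amountByYear(Map<Integer, Integer> orderYearById,
                                             Iterator<Detail> details) {
        Map<Integer, Double> totals = new TreeMap<>();
        while (details.hasNext()) {
            Detail d = details.next();
            Integer year = orderYearById.get(d.orderId);
            if (year == null) {
                continue; // inner join: skip detail rows without a matching order
            }
            totals.merge(year, d.amount, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        Map<Integer, Integer> orderYearById = new HashMap<>();
        orderYearById.put(1, 2023);
        orderYearById.put(2, 2024);
        orderYearById.put(3, 2024);

        Iterator<Detail> details = Arrays.asList(
                new Detail(1, 10), new Detail(1, 5), new Detail(2, 20),
                new Detail(3, 7), new Detail(9, 100)).iterator();
        System.out.println(amountByYear(orderYearById, details));
    }
}
```

Memory use in this sketch is bounded by the main table plus one total per year, regardless of how many detail rows pass through.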
Hot and cold data routing (temperature stratification) is also easy to achieve in SPL. It delivers high-performance big data computing on the existing hardware at the cost of an occasional performance decrease. Because SPL uses a uniform data type to represent external data, in-memory data and data from any source, it can preload the new, hot data into memory, keep the historical, cold data in a file or database, and then compute over hot data, cold data or a mix of both for a specified time period. For example:
| | A |
|---|---|
| 1 | =connect@l("orcl").query@x("select * from orders where orderdate>=? and orderdate<? Order by orderdate",min(pBeginDate,hotcoldLine),min(pEndDate,hotcoldLine)) |
| 2 | =hotData.select@b(orderdate>=max(pBeginDate,hotcoldLine) && orderdate<max(pEndDate,hotcoldLine)) |
| 3 | =A1\|A2 |
| 4 | =A3.groups@o(year(orderdate):y;sum(amount):s) |
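The min/max expressions in the script above split the requested period at the hot/cold boundary. A minimal Java sketch of just that routing arithmetic follows. The class `HotColdRouter`, the `Range` type and the method names are hypothetical; the parameter names mirror pBeginDate, pEndDate and hotcoldLine from the script, and ranges are treated as half-open, as the `>=` and `<` comparisons there suggest.

```java
import java.time.LocalDate;

// Hypothetical sketch of splitting a query period at the hot/cold boundary.
final class HotColdRouter {

    // A half-open date range [begin, end).
    static final class Range {
        final LocalDate begin;
        final LocalDate end;

        Range(LocalDate begin, LocalDate end) {
            this.begin = begin;
            this.end = end;
        }

        // True when the range contains no dates, so that side can be skipped.
        boolean isEmpty() {
            return !begin.isBefore(end);
        }
    }

    static LocalDate min(LocalDate a, LocalDate b) {
        return a.isBefore(b) ? a : b;
    }

    static LocalDate max(LocalDate a, LocalDate b) {
        return a.isAfter(b) ? a : b;
    }

    // The part of [begin, end) that lies before the boundary (cold storage).
    static Range coldPart(LocalDate begin, LocalDate end, LocalDate hotColdLine) {
        return new Range(min(begin, hotColdLine), min(end, hotColdLine));
    }

    // The part of [begin, end) that lies on or after the boundary (hot, in memory).
    static Range hotPart(LocalDate begin, LocalDate end, LocalDate hotColdLine) {
        return new Range(max(begin, hotColdLine), max(end, hotColdLine));
    }

    public static void main(String[] args) {
        LocalDate line = LocalDate.of(2024, 1, 1);
        Range cold = coldPart(LocalDate.of(2023, 11, 1), LocalDate.of(2024, 3, 1), line);
        Range hot = hotPart(LocalDate.of(2023, 11, 1), LocalDate.of(2024, 3, 1), line);
        System.out.println("cold: " + cold.begin + " to " + cold.end);
        System.out.println("hot:  " + hot.begin + " to " + hot.end);
    }
}
```

When the requested period falls entirely on one side of the boundary, the other range collapses to an empty one, so the same code covers hot-only, cold-only and mixed queries without branching.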
SPL also offers a series of other techniques for high-performance big data computing, including double increment segmentation, bi-directional indexes, pre-aggregation, automatic load balancing and a “spare tire”-style fault tolerance plan.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL