Is There Any Lightweight Big Data Computing Technology?

 

The popular big data computing technologies, including Hadoop, Storm, Hive and Spark, are built around large clusters, which suits large enterprises with massive volumes of data. These technologies did, in fact, originate at IT giants. Yet many real-world scenarios involve so-called “big data” that is far smaller than what those giants handle, and a small cluster, or even no cluster at all, is enough for them. Smaller companies also have neither abundant hardware nor a large maintenance team. For them, a lightweight big data computing technology is the right fit.

Among the few such technologies, esProc SPL is the leading one.

SPL is an open-source Java class library with strong big data computing capabilities. It features concise code, a lightweight framework and ease of integration, along with high-performance storage strategies and efficient single-machine and multi-machine cluster computing that make full use of a small cluster’s hardware.

 

SPL’s framework is lightweight: it has no complicated computing framework and can run stand-alone. When no cluster is needed, users can run SPL computations just by importing the necessary jars, without starting an SPL server. SPL cluster computing has no heavy central system either. You only need a number of nodes, which can be PCs, Linux machines, servers, workstations or notebooks with different configurations or operating systems. Start the SPL service on each of them, then run simple SPL code on any node to perform the cluster computation:

     A
1    =["192.168.1.11:8281","192.168.1.12:8281","192.168.1.13:8281","192.168.1.14:8281"]
2    =file("Orders.ctx":[1,2,3,4],A1)
3    =A2.open().cursor@m(Client, Amount, OrderDate; OrderDate>=arg1 && OrderDate<arg2)
4    =A3.groups(year(OrderDate),Client;sum(Amount))

The code above performs cluster-based grouping and aggregation. Splitting the task and combining the partial results is a light workload compared with a node’s capacity, so this coordinating step can run on any node or in the IDE.
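To illustrate why the combining step is so light, here is a minimal sketch in plain Java. It is not SPL's cluster implementation, which this article does not show; the class, method and field names are hypothetical. Each node reduces its own partition to a small per-client map, and the coordinator only merges those maps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch in plain Java; this is not SPL's cluster implementation.
class PartialAggregation {
    // Hypothetical row type: one order with its client and amount.
    static final class Order {
        final String client;
        final double amount;
        Order(String client, double amount) {
            this.client = client;
            this.amount = amount;
        }
    }

    // Work done on each node: sum Amount per Client over the local partition.
    static Map<String, Double> aggregatePartition(List<Order> rows) {
        Map<String, Double> partial = new TreeMap<>();
        for (Order row : rows) {
            partial.merge(row.client, row.amount, Double::sum);
        }
        return partial;
    }

    // Work done by the coordinator: merge the small per-node results.
    static Map<String, Double> combine(List<Map<String, Double>> partials) {
        Map<String, Double> total = new TreeMap<>();
        for (Map<String, Double> partial : partials) {
            for (Map.Entry<String, Double> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> node1 = aggregatePartition(Arrays.asList(
                new Order("ACME", 100.0), new Order("Bolt", 40.0)));
        Map<String, Double> node2 = aggregatePartition(Arrays.asList(
                new Order("ACME", 25.0)));
        System.out.println(combine(Arrays.asList(node1, node2))); // {ACME=125.0, Bolt=40.0}
    }
}
```

The coordinator touches one entry per group per node rather than one entry per order, which is why it does not need a dedicated machine.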

 

SPL offers a lightweight JDBC driver for convenient integration into a Java program. Store the above algorithm as an SPL script file and invoke it from Java code the same way you would call a stored procedure:

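import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

// Load the esProc JDBC driver and open a local connection
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
// Call the SPL script groupQuery; the two parameters feed arg1 and arg2 in the script
CallableStatement statement = conn.prepareCall("{call groupQuery(?, ?)}");
statement.setObject(1, "2021-01-01");
statement.setObject(2, "2021-12-31");
statement.execute();
...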

SPL provides a range of high-performance storage strategies and algorithms for big data computing, which give it much higher performance than most SQL-based big data computing platforms. A single machine running SPL can accomplish computations that usually need a HANA or Spark cluster.

One of SPL’s high-performance storage formats is the composite table, a specially designed high-density, high-performance storage scheme. Composite tables are compressed by default, which makes them well suited to storing big data, particularly when field values repeat. Besides row-oriented storage, composite tables support column-oriented storage, which significantly increases compression ratio and performance when a computation touches only a few fields of a wide table, as shown below:

     A
1    =file("Orders.ctx")
2    =A1.open().cursor(Client,Amount, OrderDate; OrderDate>=arg1 && OrderDate<arg2)
3    =A2.groups(year(OrderDate),Client;sum(Amount))
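As a rough intuition for why a column with repeated values compresses well, the sketch below applies run-length encoding to one column in plain Java. This is an assumption made for illustration only: the article does not say which encoding composite tables use, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: run-length encoding is a stand-in technique here,
// not a description of SPL's composite table format.
class ColumnRunLength {
    // One run: a value and how many consecutive times it occurs.
    static final class Run {
        final String value;
        final int count;
        Run(String value, int count) {
            this.value = value;
            this.count = count;
        }
    }

    // Collapse consecutive equal values of a column into runs.
    static List<Run> encode(String[] column) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j].equals(column[i])) {
                j++;
            }
            runs.add(new Run(column[i], j - i));
            i = j;
        }
        return runs;
    }

    // Expand the runs back into the original column.
    static String[] decode(List<Run> runs) {
        List<String> out = new ArrayList<>();
        for (Run run : runs) {
            for (int k = 0; k < run.count; k++) {
                out.add(run.value);
            }
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] clientColumn = {"ACME", "ACME", "ACME", "Bolt", "Bolt", "Cine"};
        List<Run> runs = encode(clientColumn);
        System.out.println(clientColumn.length + " values stored as " + runs.size() + " runs");
    }
}
```

Storing each field contiguously keeps equal values next to each other more often than a row layout does, so an encoding of this kind has more to work with.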

 

SPL also supports parallel processing on composite tables: just add the @m option to the cursor function. This lets SPL make full use of a multi-core CPU:

     A
1    =file("Orders.ctx")
2    =A1.open().cursor@m(Client,Amount, OrderDate; OrderDate>=arg1 && OrderDate<arg2)
3    =A2.groups(year(OrderDate),Client;sum(Amount))
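The general idea behind segment-parallel processing can be sketched in plain Java as follows. This is not how the @m option is implemented, which the article does not describe. It only shows the pattern of cutting the data into segments, aggregating each segment on its own thread, and adding up the partial results. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Conceptual sketch of segment-parallel aggregation; not SPL's implementation.
class SegmentedParallelSum {
    static double parallelSum(final double[] amounts, int segments) {
        if (segments < 1) {
            throw new IllegalArgumentException("segments must be >= 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(segments);
        try {
            List<Future<Double>> parts = new ArrayList<>();
            for (int s = 0; s < segments; s++) {
                // Segment s covers the index range [from, to).
                final int from = (int) ((long) amounts.length * s / segments);
                final int to = (int) ((long) amounts.length * (s + 1) / segments);
                Callable<Double> task = () -> {
                    double sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += amounts[i];
                    }
                    return sum;
                };
                parts.add(pool.submit(task));
            }
            // Combine the per-segment sums on the calling thread.
            double total = 0;
            for (Future<Double> part : parts) {
                total += part.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("segment task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] amounts = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(parallelSum(amounts, 4)); // 55.0
    }
}
```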

 

Traversal is a time-consuming part of a big data computation. SPL offers cursor-based traversal reuse, which makes it possible to accomplish several computing goals in a single traversal:

     A
1    =file("Orders.ctx").open()
2    =A1.cursor(Client, Amount, OrderDate)
3    =channel(A2).groups(year(OrderDate);max(Amount))
4    =A2.groups(Client;sum(Amount))
5    =A3.result()
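The benefit of traversal reuse can be seen in a small plain-Java sketch that computes both results (the maximum amount per year and the total amount per client) while reading the data once. It mirrors the intent of the SPL script rather than its mechanics; the class and field names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch: two aggregations fed by a single pass over the data.
class OnePassTwoResults {
    // Hypothetical row type for this example.
    static final class Order {
        final String client;
        final int year;
        final double amount;
        Order(String client, int year, double amount) {
            this.client = client;
            this.year = year;
            this.amount = amount;
        }
    }

    final Map<Integer, Double> maxAmountByYear = new TreeMap<>();
    final Map<String, Double> totalByClient = new TreeMap<>();

    // Each order is read once and updates both results.
    static OnePassTwoResults scan(Iterable<Order> orders) {
        OnePassTwoResults result = new OnePassTwoResults();
        for (Order o : orders) {
            result.maxAmountByYear.merge(o.year, o.amount, Math::max);
            result.totalByClient.merge(o.client, o.amount, Double::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
                new Order("ACME", 2020, 100.0),
                new Order("Bolt", 2020, 250.0),
                new Order("ACME", 2021, 80.0));
        OnePassTwoResults r = scan(orders);
        System.out.println(r.maxAmountByYear); // {2020=250.0, 2021=80.0}
        System.out.println(r.totalByClient);   // {ACME=180.0, Bolt=250.0}
    }
}
```

When the data lives on disk, reading it once instead of twice is where most of the saving comes from.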

 

Like many OLAP servers, SPL composite tables support pre-summarization: the results of frequently used aggregate operations are computed and cached in advance. During an actual computation, SPL either returns the cached result directly or computes further on top of it, which improves performance. The code below, for instance, uses pre-summarized data to achieve high-performance computing:

     A
1    =file("fact.ctx").open()
2    =A1.cgroups(dim1,dim2;sum(fact1),sum(fact2))
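The following plain-Java sketch illustrates the principle of pre-summarization. It is not SPL's cube mechanism. It assumes a cached summary keyed by (dim1, dim2) and shows how a coarser query, grouped by dim1 only, can be answered from that cache without rescanning the fact rows. All names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of answering a query from a pre-summarized cache.
class PreSummarizedCube {
    // Built once, ahead of time: dim1 -> dim2 -> sum(fact).
    static Map<String, Map<String, Double>> build(String[] dim1, String[] dim2, double[] fact) {
        Map<String, Map<String, Double>> cube = new TreeMap<>();
        for (int i = 0; i < fact.length; i++) {
            cube.computeIfAbsent(dim1[i], k -> new TreeMap<>())
                .merge(dim2[i], fact[i], Double::sum);
        }
        return cube;
    }

    // Query time: roll the cached sums up to dim1; no fact row is read again.
    static Map<String, Double> sumByDim1(Map<String, Map<String, Double>> cube) {
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, Map<String, Double>> e : cube.entrySet()) {
            double sum = 0;
            for (double v : e.getValue().values()) {
                sum += v;
            }
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] dim1 = {"East", "East", "West", "East"};
        String[] dim2 = {"A", "B", "A", "A"};
        double[] fact = {10, 20, 5, 1};
        Map<String, Map<String, Double>> cube = build(dim1, dim2, fact);
        System.out.println(cube);            // {East={A=11.0, B=20.0}, West={A=5.0}}
        System.out.println(sumByDim1(cube)); // {East=31.0, West=5.0}
    }
}
```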

 

In some association scenarios, a small dimension table is joined with a large fact table. SPL’s way to speed these up is to load the whole dimension table into each node’s memory, store the fact table across multiple nodes as a cluster composite table, and then join the in-memory dimension table with the external fact table, as shown below:

     A
1    =["192.168.1.11:8281","192.168.1.12:8281","192.168.1.13:8281","192.168.1.14:8281"]
2    =file("Orders.ctx":[1,2,3,4],A1)
3    =A2.open().cursor@m(SellerId,Amount)
4    =file("Employees.ctx",A2).open().memory()
5    =A3.join(SellerId,A4,Name,Dept)
6    =A5.groups(Dept;sum(Amount))
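The reason a small in-memory dimension table makes this join cheap can be shown with a plain-Java sketch. It is not SPL's join implementation; it assumes the dimension table fits in a hash map keyed by seller ID, so every fact row costs one lookup while the fact table is streamed. In this sketch, fact rows with no matching seller are skipped, which is a choice made for the example. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch: stream the fact rows and look each one up in an in-memory dimension map.
class DimensionLookupJoin {
    static Map<String, Double> amountByDept(Map<Integer, String> deptBySeller,
                                            int[] sellerIds, double[] amounts) {
        Map<String, Double> out = new TreeMap<>();
        for (int i = 0; i < sellerIds.length; i++) {
            String dept = deptBySeller.get(sellerIds[i]);
            if (dept == null) {
                continue; // no matching dimension row; skipped in this sketch
            }
            out.merge(dept, amounts[i], Double::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> deptBySeller = new HashMap<>();
        deptBySeller.put(1, "Sales");
        deptBySeller.put(2, "R&D");
        int[] sellerIds = {1, 2, 1, 3};
        double[] amounts = {10, 20, 5, 99};
        System.out.println(amountByDept(deptBySeller, sellerIds, amounts)); // {R&D=20.0, Sales=15.0}
    }
}
```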

 

In other cases, the association is between a large primary table and a large subtable. SPL’s approach here is to store both tables across multiple nodes as cluster composite tables ordered by the join field, and then perform the join with an order-based merge at computation time:

     A                                     B
1    =["192.168.1.11:8281","192.168.1.12:8281","192.168.1.13:8281","192.168.1.14:8281"]
2    =file("orders.ctx":[1,2,3,4],A1)      =file("orderdetail.ctx",A2)
3    =A2.open().cursor@m()                 =B2.open().cursor(;;A3)
4    =joinx(A3:m,ID;B3:c,ID)
5    =A4.groups(m.Client;sum(c.Amount))
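An order-based merge join can be sketched in plain Java as below. This illustrates the general technique, not SPL's joinx internals. It assumes both inputs are already sorted by order ID, that IDs are unique in the primary table, and that the subtable may hold several rows per ID. All names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of a merge join over two inputs sorted by the join key.
class OrderedMergeJoin {
    static Map<String, Double> amountByClient(int[] orderIds, String[] clients,
                                              int[] detailOrderIds, double[] detailAmounts) {
        Map<String, Double> out = new TreeMap<>();
        int m = 0; // position in the primary table
        int c = 0; // position in the subtable
        while (m < orderIds.length && c < detailOrderIds.length) {
            if (orderIds[m] < detailOrderIds[c]) {
                m++; // the remaining details belong to a later order
            } else if (orderIds[m] > detailOrderIds[c]) {
                c++; // detail row without a matching order
            } else {
                out.merge(clients[m], detailAmounts[c], Double::sum);
                c++; // stay on the same order; it may have more detail rows
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] orderIds = {1, 2, 4};
        String[] clients = {"ACME", "Bolt", "ACME"};
        int[] detailOrderIds = {1, 1, 2, 3, 4};
        double[] detailAmounts = {10, 5, 7, 99, 1};
        System.out.println(amountByClient(orderIds, clients, detailOrderIds, detailAmounts));
        // {ACME=16.0, Bolt=7.0}
    }
}
```

Because each input is read once from start to end, the join needs neither a hash table for the large table nor random access to it.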

 

SPL also supports cluster computing on a large dimension table, allowing a user-defined task size, a specified number of parallel threads and a custom, efficient execution path, and supplying fault-tolerance plans for both external storage and memory. What’s more, SPL supports a variety of data sources, including RDBs, files, NoSQL databases and big data sources, as well as mixed computations across them, which avoids complicated and time-consuming data retrieval, loading and format conversion.