Open-source SPL: The Sharp Tool for Processing Retrieved Webservice/Restful Data

 

WebService/Restful is widely used for communication between applications represented by microservices, data exchange and public/proprietary data services. Its popularity stems from the use of hierarchical structured data, like XML and JSON, for data communication. XML and JSON allows data to be stored not only in a text format but in a hierarchical structure at the same time. This enables them to carry more general and richer information. The problem is that hierarchical data, compared with traditional two-dimensional data, is complicated and hard to process after it is retrieved.

In the days when there were no specialized JSON/XML processing technologies, JAVA developers usually turned to hardcoding or loaded JSON/XML data to the database and use SQL to further process it. Hardcoding involves very heavy workload, has almost zero computing ability and extremely slow development efficiency. Whereas SQL can provide ability to achieve the rest part of the processing, it triggers noticeable framework fault by creating undesired tight coupling and results in system delay. Moreover, databases are not good at handling hierarchical structured data though they excel at computing two-dimensional tables.

The situation was improved after specialized JSON/XML processing technologies, including XPath/JsonPath,Python Pandasand Scala, were invented. These technologies ensure better framework and provide a lot of computing abilities, but each has their problems. XPath and JSONPath support conditional queries and simple aggregation only and do not support common daily computations such as sorting, distinct, grouping & aggregation, joins and intersection operation. They also do not have their own hierarchical data objects and offer unsatisfactory computing abilities. Python Pandas supports general daily computations and offers dataFrame object to describe a two-dimensional table, but it is inconvenient for computing and processing hierarchical data and not integration-friendly with Java applications. Scala provides similar dataFrame object that can describe hierarchical structure but it is also not convenient for data computing and processing. Both Scala and Pandas give bad support for XML processing by requiring manual type conversion or the use of a third-party class library, which causes inefficient development.

Yet now a better alternative is available. It is SPL.

Specialized hierarchical data object

SPL is an open-source structured data/hierarchical data processing language under JVM. It offers built-in specialized hierarchical data object and hierarchy access methods that can conveniently express complex hierarchical relationships and give solid support for higher-level computing abilities.

SPL offers table sequence – the specialized hierarchical data objectthat can represent hierarchical XML/JSON data intuitively.

To read hierarchical XML string and parse it to a table sequence, for instance:


A

1

=file("d:\\xml\\emp_orders.xml").read()

2

=xml(A1,"xml/row")

Click cell A2 and you can see a hierarchical table sequence structure, where Eid field and State field store data of simple types and Orders field stores sets of records (two-dimensional tables). Click a row under Orders and we can expand the value to view details:

undefined

Atable sequence can be used to represent hierarchical JSON data loaded from a file (having same structure as the previous XML file):


A

1

=file("d:\\xml\\emp_orders.json").read()

2

=json(A1)

Click cell A2 and we can see that this table sequence has the same structure as the previous one. Actually, table sequences represent hierarchical data from any sources, including XML, JSON, files and WebService, in a consistent way. This is hard to be done with other technologies.

SPL offers convenient hierarchical data access methods. It accesses data on a certain level through the dot and in a certain position through the subscript.

Get a set of values of a single column on level 1: A2.(Client)

Get a set of values of multiple columns on level 1: A2.([Client,Name])

Get a set of all records on level 2: A2.conj(Orders)

Get the 10threcord on level 1: A2(10)

Get Orders field of the 10threcord on level 1(all the record’s child records): A2(10).Orders

Get a set of values of a single field under Orders field of the 10threcord on level 1: (A2(10).Orders).(Amount)

Get the 5threcord of Orders field in the 10threcord on level 1: (A2(10).Orders)(5)

Get records from 10thto 20thon level 1: A2(to(10,20))

Get the last three records on level 1: A2.m([-1,-2,-3])

SPL table sequences are so competent that they can represent complex hierarchical relationships. To represent hierarchical JSON data containing multiple subdocuments, for instance:

[
   {
      "race": {
          "raceId":"1.33.1141109.2",
          "meetingId":"1.33.1141109"
      },
      ...
      "numberOfRunners": 2,
      "runners": [
        {     "horseId":"1.00387464",
              "trainer": {
                  "trainerId":"1.00034060"
              },
          "ownerColours":"Maroon,pink,dark blue."
          },
          {   "horseId":"1.00373620",
              "trainer": {
                  "trainerId":"1.00010997"
              },
          "ownerColours":"Black,Maroon,green,pink."
          }
      ]
   },
...
]

Another instance is to perform grouping & aggregation on different levels of data (group data by trainerID and count members of ownerColours in each group). It is hard to handle the computation using the familiar way, but it becomes easy to handle in SPL:


A

1

2

=A1(1).runners

3

=A2.groups(trainer.trainerId; ownerColours.array().count():times)

Powerful computational capability

Based on table sequence, SPL provides a wealth of built-in computing functions, string functions and date functions, which are essentials for powerful computational capability. Depending on high-level syntax like function options and hierarchical parameters, SPL achieves greater computing ability than SQL.

Withrich built-in computing functions, SPL achieves a basic computation with one-liner code. To perform a conditional query on hierarchical JSON data, for instance:


A

1

…//Skip data retrieval and parsing

2

=A2.conj(Orders)

3

=A3.select(Amount>1000 && Amount<=2000 && like@c(Client,"*business*"))


A

B

1


2

= A3.sum(Salary)

Aggregation

3

=A2.groups(State,Gender;avg(Salary),count(1))

Grouping & aggregation on level 1

4

=A2.conj(Orders).groups(Client;sum(Amount))

Grouping & aggregation on level 2

5

=A1.new(Name,Gender,Dept,Orders.OrderID,Orders.Client,Orders.Client,Orders.SellerId,Orders.Amount,Orders.OrderDate)

Association

6

=A1.sort(Salary)

Sorting

7

=A1.id(State)

Distinct

8

=A2.top(-3;Amount)

Get topN

9

=A2.groups(Client;top(3,Amount))

Get TopN in each group (window function)

The code shows that SPL has a comprehensive support for conditional queries, covering comparison operators, logical operators, regular expressions and string functions such as like for fuzzy match. SPL also supports using mathematical operators (functions), position functions and date functions in conditional queries.

There are more examples:

SPL supplies a large number of built-in date functions and string functions, far more than each of other technologies, including SQL, provides and much more powerful than them. This enables SPL to implement same computations with significantly shorter code. For example:

Time functions:

Get the date before or after a specified date: elapse("2020-02-27",5) //Return 2020-03-03

Find the day of the week of a specified date: day@w("2020-02-27") //Return 5, which is Thursday

Get the date N workdays after: workday(date("2022-01-01"),25) //Return 2022-02-04

String functions:

Check whether a string all consists of letters: isdigit("12345") //Return true

Get a string before a specified substring: substr@l("abCDcdef","cd") //Return abCD

Split a string into an array of substrings by vertical bar: "aa|bb|cc".split("|") //Return ["aa","bb","cc"]

SPL also offers functions to get a date before or after a number of years, get the ordinal number of a date in the year, get which quarter the date belongs to, split a string according to a regular expression, get the where or select part of a SQL statement, get words from a string, split HTML by the specific marker, etc.

SPL designs convenient function syntax like function options and hierarchical parameters that power SPL with excellent computational capability. It allows functions with similar functionalities to use same name and distinguishes them with different options. The basic functionality of select function is to filter data. If we need to get the first eligible record, we use @1 option:

Orders.select@1(Amount>1000)

When the size of data to be filtered is large, we use @m option to implement parallel processing to increase performance:

Orders.select@m(Amount>1000)

We use @b option to perform a quick filtering using the binary search if the original data is ordered:

Orders.select@b(Amount>1000)

Function options can work together:

Orders.select@1b(Amount>1000)

Usually, parameters in a structured computation function are complicated. SQL, for instance, uses a lot of keywords to divide parameters of a statement into multiple groups, causing inconsistent statement structure. SPL has hierarchical parameters. It employs semicolon, comma and colon to identify parameters to three levels, and writes complicated parameters in a general and simple way. For example:

join(Orders:o,SellerId ; Employees:e,EId)

The optimized framwork

SPL has built-in integration-friendly JDBC driver, which helps reduce system coupling effectively and achieve hot swap. SPL supports diverse hierarchical data sources and computes data coming from them with consistent and uniform way. This makes SPL code easy to migrate.

SPL offers general-purpose JDBC driverthrough which SPL code can be conveniently integrated by Java code. For instance, we can save the previous SPL code as a script file and invoke the file name in a Java program in the way of calling a stored procedure:

Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("call groupBy()");

Placing the script file outside a Java program separates computing logic from application and thus effectively reduces system coupling. SPL, as an interpreted language, can achieve hot-swappable codeby executing in real-time after any changes without restarting the Java application. The ability ensures system stability and makes maintenance convenient.

SPL supports hierarchical files retrieved from WebService and RESTful, too. To read multilevel XML data from WebService and perform a conditional query, for instance:


A

1

=ws_client("http://127.0.0.1:6868/ws/RQWebService.asmx?wsdl")

2

=ws_call(A1,"RQWebService":"RQWebServiceSoap":"getEmp_orders")

3

=A2.conj(Orders)

4

=A3.select(Amount>1000 && Amount<=2000 && like@c(Client,"*business*"))

Similarly, to read multilevel JSON data from RESTful and perform the same conditional query:


A

1

=httpfile("http://127.0.0.1:6868/restful/emp_orders").read()

2

=json(A1)

3

=A2.conj(Orders)

4

=A3.select(Amount>1000 && Amount<=2000 && like@c(Client,"*business*"))

SPL can directly retrieve and compute hierarchical data stored in certain data sources like MongoDB, ElasticSearch and SalesForce.

To retrieve multilevel JSON data from MongoDB and perform a conditional query, for instance:


A

1

=mongo_open("mongodb://127.0.0.1:27017/mongo")

2

=mongo_shell@x(A1,"data.find()")

3

=A2.conj(Orders)

4

=A3.select(Amount>1000 && Amount<=2000 && like@c(Client,"*business*"))

SPL also supports databases, file source like txt, csv and xls, and NoSQL databases like Hadoop, Redis, Kafka and Cassandra. No matter what source data is retrieved from, SPL consistently reads it as a table sequence, enabling the uniform way of computing hierarchical data. The consistent and uniform syntax makes SPL code easy to migrate.

Each popular technology for processing Webservice/Restful has its shortcomings that lead to inefficient development. SPL offers built-in specialized hierarchical data object and convenient hierarchy access methods that make it excel at handling complicated hierarchical data. It also provides rich library functions that enable computational capability better than SQL’s. It has easy to use JDBC driver, adopts the strategy of placing computing code outside the application, and supports hierarchical data stored in various types of files and network services. All these features help increase development efficiency tremendously after WebService/Restful data is retrieved.