Looking for the Best Class Library for Computing JSON Data
It is more difficult to compute JSON data than two-dimensional data, so a class library is both necessary and useful for handling JSON-formatted data. In this essay, we will compare four class libraries – JsonPath, SQLite, Scala and esProc SPL – for computing JSON data in terms of syntactic expressiveness, deployment configurations, and data source support.
JSON is a lightweight, flexible format that uses a multilevel structure to represent relationships between data. The multilevel structure is more complicated than the two-dimensional structure and harder to compute. JSON class libraries were thus created to facilitate the computation. This essay examines several common JSON class libraries in terms of syntactic expressiveness, deployment configurations, and data source support. Some libraries, such as Gson, Fastjson and Jackson, focus on parsing and maintenance rather than data computation; they are not targets of our comparison.
JsonPath
JsonPath aims to be the “XPath” for JSON data. Though it still has some way to go, the query language already plays a role in many real-world projects, often alongside a JSON class library used for maintenance.
The following example shows the class library’s syntactic expressiveness. EO.json stores information about employees and their orders. Below is part of the source data:
[{
"_id": {"$oid": "6074f6c7e85e8d46400dc4a7"},
"EId": 7,"State": "Illinois","Dept": "Sales","Name": "Alexis","Gender": "F","Salary": 9000,"Birthday": "1972-08-16",
"Orders": [
{"OrderID": 70,"Client": "DSG","SellerId": 7,"Amount": 288,"OrderDate": "2009-09-30"},
{"OrderID": 131,"Client": "FOL","SellerId": 7,"Amount": 103.2,"OrderDate": "2009-12-10"}
]
},
{
"_id": {"$oid": "6074f6c7e85e8d46400dc4a8"},
"EId": 8,"State": "California", ...
}]
The task is to find, from the JSON file, all orders where the amount is between 500 and 2000 and the client name contains “bro”. Below is the Java code for doing this:
package org.example;
import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.JsonPath;
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
public class App1 {
    public static void main(String[] args) throws Exception {
        // Read the whole JSON file into a string and parse it into a document
        String str = file2str("D:\\json\\EO.json");
        Object document = Configuration.defaultConfiguration().jsonProvider().parse(str);
        // Interval condition on Amount plus a case-insensitive fuzzy condition on Client
        ArrayList l = JsonPath.read(document, "$[*].Orders[?(@.Amount>500 && @.Amount<2000 && @.Client =~ /.*?bro.*?/i)]");
        System.out.println(l);
    }
    public static String file2str(String fileName) throws Exception {
        // Load the file's bytes and decode them as UTF-8
        File file = new File(fileName);
        Long fileLength = file.length();
        byte[] fileContent = new byte[fileLength.intValue()];
        FileInputStream in = new FileInputStream(file);
        in.read(fileContent);
        in.close();
        return new String(fileContent, "UTF-8");
    }
}
In the above code, @.Amount>500 && @.Amount<2000 is an interval query condition and @.Client =~ /.*?bro.*?/i is a fuzzy query condition. Conciseness is one of the merits of JsonPath queries; the demerit is that they are not mature enough. Unlike SQL’s like, JsonPath’s fuzzy query requires a regular expression rather than a convenient function. In fact, the language supports only the simplest operations, namely conditional queries and aggregate operations. It does not support most of the commonly used operations, including grouping & aggregation, joins, and set-oriented operations.
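The missing operations therefore have to be made up for in the host language. Below is a hedged sketch of grouping the returned orders by client in plain Java; the Client and Amount field names come from the sample above, and the rest is illustrative:
// requires: import java.util.List; import java.util.Map; import java.util.stream.Collectors;
List<Map<String, Object>> orders = JsonPath.read(document, "$[*].Orders[*]");
// group by client and total the amounts, since JsonPath itself cannot
Map<String, Double> totalByClient = orders.stream()
    .collect(Collectors.groupingBy(
        o -> (String) o.get("Client"),
        Collectors.summingDouble(o -> ((Number) o.get("Amount")).doubleValue())));
System.out.println(totalByClient);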
The weak data source support is another indication that JsonPath is far from a fully developed query language: even access to the basic file source has to be hardcoded, let alone other data sources.
Convenient deployment is JsonPath’s only real advantage: users just need to add json-path to Maven, as sketched below.
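A typical dependency entry looks like this (the version number is an assumption; use a current release):
<dependency>
    <groupId>com.jayway.jsonpath</groupId>
    <artifactId>json-path</artifactId>
    <!-- the version below is an assumption -->
    <version>2.7.0</version>
</dependency>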
There are other class libraries built on the same principles, with slightly different features. Fastjson, for instance, adds a like function to JsonPath’s existing features; it is easier to use but less stable. In a word, all of them are only partially developed.
SQLite
SQLite is a lightweight, easy-to-integrate, embedded in-memory database. It is small but capable, and one of its abilities is handling JSON data.
The previous conditional query, for example, can be handled using the following Java code:
package test;
import java.io.File;
import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class Main {
    public static void main(String[] args) throws Exception {
        Connection connection = DriverManager.getConnection("jdbc:sqlite:ex1");
        Statement statement = connection.createStatement();
        // Store the whole JSON file as one record of the datatable table
        statement.execute("create table datatable ( path string , data json1)");
        String sql = "insert into datatable values('1', json('" + file2str("D:\\json\\EO.json") + "'))";
        statement.execute(sql);
        // json_tree flattens the multilevel document; json_extract reads its fields
        sql = "select value from(" +
            " select value" +
            " from datatable, json_tree(datatable.data, '$')" +
            " where type = 'object' and parent != 0" +
            " ) where json_extract(value,'$.Amount') > 500 and json_extract(value,'$.Amount') < 2000 and json_extract(value,'$.Client') like '%bro%'";
        ResultSet results = statement.executeQuery(sql);
        printResult(results);
        if (connection != null) connection.close();
    }
    public static void printResult(ResultSet rs) throws Exception {
        int colCount = rs.getMetaData().getColumnCount();
        System.out.println();
        for (int i = 1; i < colCount + 1; i++) {
            System.out.print(rs.getMetaData().getColumnName(i) + "\t");
        }
        System.out.println();
        while (rs.next()) {
            for (int i = 1; i < colCount + 1; i++) {
                System.out.print(rs.getString(i) + "\t");
            }
            System.out.println();
        }
    }
    …
}
The Java code creates the datatable table in SQLite, reads the JSON file and inserts its content into the table as one record, then performs the conditional query with a SQL statement. The json_tree function parses the multilevel JSON document into a two-dimensional (table-like) structure, and the json_extract function retrieves the target fields from that structure.
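To inspect the intermediate two-dimensional form yourself, you can query json_tree’s output columns directly; key, value, type, id and parent are part of SQLite’s documented json_tree schema, and the limit clause is only for brevity:
select key, value, type, id, parent
from datatable, json_tree(datatable.data, '$')
limit 10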
SQLite can implement grouping & aggregation, too:
select strftime('%Y',OrderDate), sum(Amount) from(
select json_extract(value,'$.OrderDate') OrderDate, json_extract(value,'$.Amount') Amount
from datatable, json_tree(datatable.data, '$')
where type = 'object' and parent != 0
) group by strftime('%Y',OrderDate)
It can perform a join between employees and orders using the following SQL statements:
with base as (
select value, id, parent, type
from datatable, json_tree(datatable.data, '$')
), emp_orders as (
select orders.value o, emp.value e from base ordersArr, base orders, base emp
where ordersArr.parent = emp.id and orders.parent = ordersArr.id and emp.parent = 0 and emp.type = 'object'
) select json_extract(o,'$.OrderID'), json_extract(o,'$.Client'), json_extract(o,'$.Amount'), json_extract(o,'$.OrderDate'), json_extract(e,'$.Name'), json_extract(e,'$.Gender'), json_extract(e,'$.Dept')
from emp_orders
As the above code shows, SQLite has syntactic expressiveness strong enough to handle common computations. But the code is long and hard to understand and use. Unlike a familiar SQL statement, the unusual “select…from table name, function where…” pattern is roundabout, and a join query in particular requires long and complicated code.
The direct cause of the lengthy, difficult code is that JSON data is multilevel; the root cause is that SQL is good at computing two-dimensional data but cannot compute JSON data directly. To do the computation, the language must first convert the multilevel JSON data into a two-dimensional structure using the json_tree function (or the json_each function). You cannot expect the code to be easy and concise when you compute multilevel data with a language and method designed for two-dimensional data.
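As a minimal sketch of the json_each alternative: when the target array’s path is known in advance, json_each can flatten just that array. The '$[0].Orders' path below is an assumption that follows the EO.json sample and lists only the first employee’s orders:
select json_extract(value,'$.Client') Client, json_extract(value,'$.Amount') Amount
from datatable, json_each(datatable.data, '$[0].Orders')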
SQLite presents rather weak data source support: even access to the basic file source is hardcoded, and data can be computed only after it is retrieved as a table and imported into the database.
It is simple to deploy SQLite: users just need to introduce one jar.
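With Maven, for instance, the commonly used org.xerial driver is a single entry (the version number is an assumption):
<dependency>
    <groupId>org.xerial</groupId>
    <artifactId>sqlite-jdbc</artifactId>
    <!-- the version below is an assumption -->
    <version>3.36.0.3</version>
</dependency>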
Scala
Scala is a fairly popular structured data computation language, as well as one of the earliest languages to support JSON data computation. It first reads JSON data from the source and stores it as a DataFrame object (or an RDD), then computes the data using the DataFrame’s standard computing abilities.
We can handle the previous conditional query using the following Scala code:
package test
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object JTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .getOrCreate()
    // Read the JSON file as a multilevel DataFrame
    val df = spark.read.json("D:\\data\\EO.json")
    // explode flattens the Orders array into one row per order
    val Orders = df.select(explode(df("Orders"))).select("col.OrderID", "col.Client", "col.SellerId", "col.Amount", "col.OrderDate")
    val condition = Orders.where("Amount>500 and Amount<2000 and Client like '%bro%'")
    condition.show()
  }
}
Scala reads the JSON data as a multilevel DataFrame object, uses the explode function to retrieve all orders, and performs the conditional query through the where function.
Scala achieves grouping & aggregation in the following way:
val groupBy=Orders.groupBy(year(Orders("OrderDate"))).agg(sum("Amount"))
It implements the join between employees and orders in the following way:
val df1=df.select(df("Name"),df("Gender"),df("Dept"),explode(df("Orders")))
val relation=df1.select("Name","Gender","Dept","col.OrderID","col.Client","col.SellerId","col.Amount","col.OrderDate")
Scala has great syntactic expressiveness for handling common computing scenarios and, compared with SQLite, generates short and easy code. Notably, it does not need an explicit join function (though it has one) to perform the join operation; it retrieves the target values directly from the multilevel data instead, which makes the logic simple and the code much shorter. Scala code is short and easy to understand because DataFrame supports multilevel data and can therefore represent the JSON structure conveniently, and its DataFrame-based functions are well suited to computing multilevel data.
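For contrast, below is a hedged sketch of what the explicit-join alternative would look like; the EId/SellerId pairing is taken from the sample data, and the rest is illustrative:
val emp = df.select("EId", "Name", "Gender", "Dept")
val ord = df.select(explode(df("Orders"))).select("col.SellerId", "col.OrderID", "col.Amount", "col.OrderDate")
// an explicit equi-join on the employee id; direct field access makes this step unnecessary
val joined = emp.join(ord, emp("EId") === ord("SellerId"))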
Scala offers excellent data source support. It has special functions to retrieve JSON data from many data sources, including files, MongoDB, Elasticsearch, WebService, etc.
Scala’s basic class libraries support JSON data computation, so no special deployment is needed (unless you need to retrieve data from certain other data sources, such as MongoDB).
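For a Spark-based setup like the one above, the deployment typically amounts to one sbt entry (a sketch; the version is an assumption and should match your Spark installation):
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0" // version is an assumption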
esProc SPL
esProc SPL is a professional open-source structured data computation language. Like Scala, it computes data from any data source using uniform syntax and data structures, yet it is lighter, has simpler syntax, and offers a loosely coupled JDBC interface.
SPL handles the previous conditional query in the following way:
| | A |
|---|---|
| 1 | =json(file("D:\\data\\EO.json").read()) |
| 2 | =A1.conj(Orders) |
| 3 | =A2.select(Amount>500 && Amount<=2000 && like@c(Client,"*bro*")) |
SPL reads the JSON data as a multilevel table sequence object (similar to a Scala DataFrame), concatenates all orders via the conj function, and performs the conditional query through the select function.
This block of code can be debugged or executed in esProc IDE, or stored as a script file (like condition.dfx) and invoked from a Java program through the JDBC interface. Below is the code for the invocation:
package Test;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class test1 {
    public static void main(String[] args) throws Exception {
        // Register esProc's JDBC driver and connect to the local engine
        Class.forName("com.esproc.jdbc.InternalDriver");
        Connection connection = DriverManager.getConnection("jdbc:esproc:local://");
        Statement statement = connection.createStatement();
        // Invoke the saved script much like a stored procedure
        ResultSet result = statement.executeQuery("call condition()");
        printResult(result);
        if (connection != null) connection.close();
    }
    …
}
This is similar to calling a stored procedure. SPL also supports embedding the code directly into a Java program, SQL-style, without storing it as a script file. Below is the code for embedding:
…
ResultSet result = statement.executeQuery("=json(file(\"D:\\data\\EO.json\").read()).conj(Orders).select(Amount>500 && Amount<=2000 && like@c(Client,\"*bro*\"))");
…
SPL achieves grouping & aggregation and join operations in the following way:
| | A | B |
|---|---|---|
| 1 | =json(file("D:\\data\\EO.json").read()) | |
| 2 | =A1.conj(Orders) | |
| 3 | =A2.select(Amount>1000 && Amount<=3000 && like@c(Client,"*s*")) | /Conditional query |
| 4 | =A2.groups(year(OrderDate);sum(Amount)) | /Grouping & aggregation |
| 5 | =A1.new(Name,Gender,Dept,Orders.OrderID,Orders.Client,Orders.SellerId,Orders.Amount,Orders.OrderDate) | /Join operation |
As the above code shows, esProc SPL has the most powerful syntactic expressiveness of the four: it handles the common operations with concise, easy-to-understand code and is easy to integrate. Its operators can retrieve values directly from multilevel data during a join, which compresses the code further.
The outstanding syntactic expressiveness simplifies computations over multilevel JSON data. Let’s look at an example. In JSONstr.json, the runners field is a subdocument consisting of three fields – horseId, ownerColours and trainer. The trainer field has a subfield trainerId, and ownerColours holds comma-separated values. Below is part of the source data:
[
{
"race": {
"raceId":"1.33.1141109.2",
"meetingId":"1.33.1141109"
},
...
"numberOfRunners": 2,
"runners": [
{ "horseId":"1.00387464",
"trainer": {
"trainerId":"1.00034060"
},
"ownerColours":"Maroon,pink,dark blue."
},
{ "horseId":"1.00373620",
"trainer": {
"trainerId":"1.00010997"
},
"ownerColours":"Black,Maroon,green,pink."
}
]
},
...
]
The task is to group the data by trainerId and count the members of ownerColours in each group. Below is the SPL script for doing this:
| | A |
|---|---|
| 1 | =json(file("/workspace/JSONstr.json").read()) |
| 2 | =A1(1).runners |
| 3 | =A2.groups(trainer.trainerId; ownerColours.array().count():times) |
SPL provides great data source support. It has special functions to retrieve JSON data from a variety of data sources, including files, MongoDB, Elasticsearch, WebService, etc.
Reading and writing JSON data is one of SPL’s basic features, so users do not need any specific deployment (unless they need to retrieve data from certain data sources, such as MongoDB).
In a nutshell, esProc SPL has the most powerful syntactic expressiveness and simplifies multilevel JSON data computations; Scala has good syntactic expressiveness and handles the common operations well; SQLite has enough expressive ability, but its code is hard to write and read; JsonPath’s expressiveness is too weak for the common computations. In terms of data source support, both esProc SPL and Scala provide a wealth of choices, while JsonPath is weak and SQLite weaker still. As for deployment, SQLite is the simplest, and the other three are easy too.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL