Looking for the Best Class Library for Computing XML Data

 

It is inconvenient to compute XML data, so a class library is necessary and useful to handle it. In this essay, we will compare four types of XML class libraries – dom4j, MySQL, Scala and esProc SPL, in the aspects of syntactic expressiveness, deployment configurations, and data source support. Looking Looking for the Best Class Library for Computing XML Data for details.

 

The advantage of XML format is its flexibility in representing data, and the disadvantage is the inconvenience of being computed. We need a class library to facilitate its computation. Now let’s examine several common XML class libraries to compare their syntactic expressiveness, deployment configurations, and data source support.

dom4j

XML is a veteran data format. All languages provide class libraries for computing XML data. Java alone has a dozen of XML class libraries, including dom4j, JDOM, Woodstox, XOM,  Xerces-J , and Crimson, among which dom4j is the most developed. Below is an example that shows the class library’s syntactic expressiveness.

The file Employees_Orders.xml stores information of employees and their orders. Below is part of the source data:

<?xml version="1.0"   encoding="UTF-8"?>

<xml>

<row>

         <EId>2</EId>

         <State>"New   York"</State>

         <Dept>"Finance"</Dept>

         <Name>"Ashley"</Name>

         <Gender>"F"</Gender>

         <Salary>11000</Salary>

         <Birthday>"1980-07-19"</Birthday>

<Orders>[]</Orders>

</row>

<row>

         <EId>3</EId>

         <State>"New   Mexico"</State>

         <Dept>"Sales"</Dept>

         <Name>"Rachel"</Name>

         <Gender>"F"</Gender>

         <Salary>9000</Salary>

         <Birthday>"1970-12-17"</Birthday>

         <Orders>

                  <OrderID>32</OrderID>

                  <Client>"JFS"</Client>

                  <SellerId>3</SellerId>

                  <Amount>468.0</Amount>

                  <OrderDate>"2009-08-13"</OrderDate>

         </Orders>

         <Orders>

                  <OrderID>39</OrderID>

                  <Client>"NR"</Client>

                  <SellerId>3</SellerId>

                  <Amount>3016.0</Amount>

                  <OrderDate>"2010-08-21"</OrderDate>

                  </Orders>

         <Orders>

</row>

…

<xml>

Use dom4j to find orders whose amounts are between 1000 and 3000 and whose clients contain “bro”. Below is Java code for doing this:

package   org.example;

import   org.dom4j.Document;

import   org.dom4j.Node;

import   org.dom4j.io.SAXReader;

import   java.util.List;

public class   App

{

      public   static void main (String\[\] args )throws Exception

      {

           SAXReader   saxReader = SAXReader.createDefault();

           Document   doc =   saxReader.read("file:\\\D:\\\xml\\\Employees_Orders.xml");

           List<Node>   list=doc.selectNodes("/xml/row/Orders\[Amount>1000   and Amount<=3000   and contains(Client,'bro')\]")

           int   i=0;

            System.out.println("--------------count of the current     resultSet="+list.size());

            for(Node n:list){

                String OrderID=n.selectSingleNode("./OrderID").getText();

                String Client=n.selectSingleNode("./Client").getText();

                String SellerId=n.selectSingleNode("./SellerId").getText();

                String Amount=n.selectSingleNode("./Amount").getText();

                String OrderDate=n.selectSingleNode("./OrderDate").getText();

                System.out.println(++i+":"+OrderID+"\\t"+Client+"\\t"+SellerId+"\\t"+Amount+"\\t"+OrderDate);

          }

      }

}

In the above code, /xml/row/Orders defines the query range; Amount>1000 and Amount<=3000 and contains(Client,'bro') defines the query condition (which is equivalent to the predicate in a sentence). The query syntax is part of the XPath (XQuery is a superset of it) and has a history of over two decades. XPath is concise, easy to understand and learn and boasts a wealth of functions that cater to a variety of conditional query needs. The mathematical functions abs and floor, string functions compare and substring, and date functions year-from-date and timezone-from-time are among the most commonly seen functions.

dom4j(XPath) has sufficiently strong syntactic expressiveness in handling conditional queries but it does not support handling sorting, distinct operation, grouping operation, aggregate operation, set-oriented operations and join operations. A whole data computation process always involves more than one type of operation. The class library is not an all-round player because it shows excellent performance on conditional queries only.

The class library gives a not so satisfactory support for data sources. It supports retrieving data from files only, yet the most seen XML data sources are WebService and HTTP.

It is convenient to deploy the dom4j (XPath). Users just need to add dom4j and jaxen to Maven.

MySQL

Most long-lasting relational databases, including DB2, Oracle, MSSQL and MySQL, support XML data computation. Among them, MySQL is the most used in real-world business situations.

We can handle the previous conditional query using the following SQL+JAVA code:

package   org.example;

import   java.io.File;

import   java.io.FileInputStream;

import   java.sql.*;

public class   App

{

      public   static void main(String\[\] args) throws Exception, ClassNotFoundException {

            Class.forName("com.mysql.cj.jdbc.Driver");

            Connection conn = DriverManager.getConnection("jdbc:mysql://127.0.0.1:3307/test?&useSSL=false&serverTimezone=UTC", "root", "runqian");

            Statement statement = conn.createStatement();

            statement.execute("drop table if exists testtable");

            statement.execute("CREATE TABLE testtable (testxml MEDIUMTEXT)     ENGINE=InnoDB DEFAULT CHARSET=UTF8");

            statement.execute("insert into testtable     values('"+readFile("D:\\\xml\\\Employees_Orders.xml")     +"')");

            String   conditionSQL="" +

                    "with recursive old as (" +

                    "select extractvalue(testxml,'/xml/row/Orders\[Amount>1000 and     Amount<=3000 and contains(Client,\\"bro\\")\]/OrderID')   oneLine1,  " +

                    "    extractvalue(testxml,'/xml/row/Orders\[Amount>1000   and   Amount<=3000 and contains(Client,\\"bro\\")\]/Client')   oneLine2,  " +

                    "      extractvalue(testxml,'/xml/row/Orders\[Amount>1000 and     Amount<=3000 and contains(Client,\\"bro\\")\]/SellerId')     oneLine3, " +

                    "      extractvalue(testxml,'/xml/row/Orders\[Amount>1000 and Amount<=3000     and contains(Client,\\"bro \\")\]/Amount')   oneLine4, " +

                    "      extractvalue(testxml,'/xml/row/Orders\[Amount>1000 and     Amount<=3000 and contains(Client,\\"bro\\")\]/OrderDate') oneLine5    " +

                    "  from testtable" +

                    ")," +

                    "N as ( " +

                    "  select 1 as n " +

                    "  union select n + 1 from   N, old" +

                    "  where n <=   length(oneLine1) -   length(replace(oneLine1,' ',''))" +

                    ")" +

                    "select substring\_index(substring\_index(oneLine1,' ', n),' ',     -1) OrderID," +

                    "    substring\_index(substring\_index(oneLine2,' ', n),'   ', -1) Client,  " +

                    "    substring\_index(substring\_index(oneLine3,' ', n),'   ', -1) SellerId,  " +

                    "    substring\_index(substring\_index(oneLine4,' ', n),'   ', -1) Amount,  " +

                    "    substring\_index(substring\_index(oneLine5,' ', n),'   ', -1) OrderDate  " +

                    "from N, old";

            ResultSet results = statement.executeQuery(conditionSQL);

            printResult(results);

            if   (conn != null)

                conn.close();

      }

      public   static void printResult(ResultSet rs) throws Exception{

            int   colCount=rs.getMetaData().getColumnCount();

            System.out.println();

            for(int i=1;i<colCount+1;i++){

                System.out.print(rs.getMetaData().getColumnName(i)+"\\t");

          }

            System.out.println();

            while(rs.next()){

                for (int i=1;i<colCount+1;i++){

                  System.out.print(rs.getString(i)+"\\t");

              }

                System.out.println();

          }

      }

      public   static String readFile(String fileName)throws Exception{

          File   file = new File(fileName);

          Long   fileLength = file.length();

          byte\[\]   fileContent = new byte\[fileLength.intValue()\];

            FileInputStream in = new FileInputStream(file);

            in.read(fileContent);

            in.close();

          return   new String(fileContent, "UTF-8");

      }

}

The above code is written in its logic. It creates testtable table in MySQL, reads in XML strings from Employees_Orders.xml, inserts each string to testtable as a record, and query the table using SQL. Below is part of the result set:

OrderID     Client       SellerId      Amount   OrderDate      

49      "SPLI"         5       1050.6       "2010-09-03" 

122    "SPL"          8       2527.2       "2009-12-02" 

140    "OFS"         8       1058.4       "2010-12-18" 

The SQL query part in the above Java code is the most difficult. extractvalue function is used to parse XML data. The function supports XPath query syntax and can join query results (such as all order dates) into a large space-separated string. To split the large string into smaller strings (each record corresponds to one order date, for instance), we need the complicated with statement.

The code does not split the XML data when implementing the conditional query. We can also implement the query by splitting the data. We split the XML file into the employee part and the order part, then split each part into records and import them into the database, and finally, perform the conditional query on the orders table. The SQL query can thus be significantly simplified but, at the same time, XML’s flexibility in representing data becomes insignificant.

Only SQL statements are used in the above code to achieve the conditional query. Actually, we can bring Java in to do the computation. The specific way is to parse XML using SQL and convert each record into N records with Java. This way, the SQL query also becomes simplified, but the hardest part is still there, only being transferred to Java. The high-level language, however, is not good at conditional query handling. It needs a second database write before doing the query, which adds extra processing workload.

Though it generates complicated code, MySQL has enough syntactic expressiveness to deal with most of the common computations. To group orders by year and sum order amounts in each group, for instance, MySQL has the following code:

with recursive   old as (

           select  extractvalue(testxml,'/xml/row/Orders/OrderID') oneLine1,

                      extractvalue(testxml,'/xml/row/Orders/Client')   oneLine2,

                      extractvalue(testxml,'/xml/row/Orders/SellerId')   oneLine3,

                      extractvalue(testxml,'/xml/row/Orders/Amount')   oneLine4,

                      extractvalue(testxml,'/xml/row/Orders/OrderDate')   oneLine5

  from     testtable

),

N as (

           select   1 as n

           union   select n + 1 from N, old

  where n     <= length(oneLine1) - length(replace(oneLine1, '',''))

),

query as(

           select  substring\_index(substring\_index(oneLine1,   '', n),' ', -1) OrderID,

                      substring\_index(substring\_index(oneLine2,   '', n),' ', -1) Client,

                      substring\_index(substring\_index(oneLine3,   '', n),' ', -1) SellerId,

                      substring\_index(substring\_index(oneLine4, '', n),' ', -1) Amount,

                     STR\_TO\_DATE(substring\_index(substring\_index(oneLine5, '', n),' ',     -1),'"%Y-%m-%d"') OrderDate

           from   N, old)

select   year(OrderDate),sum(Amount) from query group   by year(OrderDate)

Here’s another instance. To join the employee table and the orders table and retrieve certain fields, MySQL produces more complicated code (as shown below) that involves recursive queries, which reduces efficiency.

with recursive   oldOrders as (

           select  extractvalue(testxml,'/xml/row/Orders/OrderID') oneLine1,

                      extractvalue(testxml,'/xml/row/Orders/Client')   oneLine2,

                      extractvalue(testxml,'/xml/row/Orders/SellerId')   oneLine3,

                      extractvalue(testxml,'/xml/row/Orders/Amount')   oneLine4,

                      extractvalue(testxml,'/xml/row/Orders/OrderDate')   oneLine5

  from     testtable

),

N as (

           select   1 as n

           union   select n + 1 from N, oldOrders

  where n     <= length(oneLine1) - length(replace(oneLine1, '',''))

),

Orders as(

           select  substring\_index(substring\_index(oneLine1,   '', n),' ', -1) OrderID,

                      substring\_index(substring\_index(oneLine2,   '', n),' ', -1) Client,

                      substring\_index(substring\_index(oneLine3,   '', n),' ', -1) SellerId,

                      substring\_index(substring\_index(oneLine4,     '', n),' ', -1) Amount,

                     STR\_TO\_DATE(substring\_index(substring\_index(oneLine5, '', n),' ',     -1),'"%Y-%m-%d"') OrderDate

           from   N, oldOrders),

oldEmp as (

           select  extractvalue(testxml,'/xml/row/EId') oneLine1,

                      extractvalue(testxml,'/xml/row/Dept')   oneLine2,

                      extractvalue(testxml,'/xml/row/Name')   oneLine3,

                      extractvalue(testxml,'/xml/row/Gender')   oneLine4

  from     testtable),

N1 as (

           select   1 as n

           union   select n + 1 from N1, oldEmp

  where n     <= length(oneLine1) - length(replace(oneLine1, '',''))

),

Emp as(

           select  substring\_index(substring\_index(oneLine1,   '', n),' ', -1) EId,

                      substring\_index(substring\_index(oneLine2,   '', n),' ', -1) Dept,

                      substring\_index(substring\_index(oneLine3,   '', n),' ', -1) Name,

                      substring\_index(substring\_index(oneLine4,     '', n),' ', -1) Gender

           from   N1, oldEmp)

select   Orders.OrderID,Emp.Name  from Orders,Emp where     Orders.OrderID=Emp.EId

MySQL has weak data source support by denying retrieval from Webservice and HTTP source and by hardcoding even the basic file data retrieval. It also needs to write the file data into the database for further computation.

It is convenient to deploy and configure MySQL. Users only need to introduce the driver jar.

Scala

Scala is a popular, widely used, excellent structured data computation language. It thus boasts plenty of third-party library functions. Spark and databricks are the two function libraries to compute XML data.

To achieve the previous conditional query, Scala has the following code:

package test

import   com.databricks.spark.xml.XmlDataFrameReader

import   org.apache.spark.sql.SparkSession

import   org.apache.spark.sql.functions._

object xmlTest   {

  def     main(args: Array\[String\]): Unit = {

      val spark   = SparkSession.builder()

          .master("local")

          .getOrCreate()

      val df =   spark.read

          .option("rowTag", "row")

          .option("inferSchema","true")

          .xml("D:\\\xml\\\Employees_Orders.xml")

     val Orders   =     df.select(explode(df("Orders"))).select("col.OrderID","col.Client","col.SellerId","col.Amount","col.OrderDate")

      val   condition=Orders.where("Amount>1000 and Amount<=3000   and Client   like'%S%' ")

        condition.show()

  }

}

The code reads in the XML file as a multilevel DataFrame object, gets all orders using explode function, and performs the conditional query using where function.

You can also use Scala to group and summarize XML data. The code is as follows:

// Remove   extra quotation marks from both sides of each OrderDate value

 val     ordersWithDateType= Orders.withColumn("OrderDate",     regexp_replace(col("OrderDate"), "\\"",""))  
 val     groupBy=ordersWithDateType.groupBy(year(ordersWithDateType("OrderDate"))).agg(sum("Amount"))

To join the employee table and the orders table, Scala uses the following code:

val   df1=df.select(df("Name"),df("Gender"),df("Dept"),explode(df("Orders")))

val     relation=df1.select("Name","Gender","Dept","col.OrderID","col.Client","col.SellerId","col.Amount","col.OrderDate")

Scala has rather strong syntactic expressiveness in handling common computation, produces short and easy to understand code, and compared with MySQL, is easy to learn. When implementing a join operation, Scala can get target values directly from the multilevel XML data without creating two two-dimensional tables in advance. The logic is considerably simplified, the code becomes much shorter, and execution is efficient.

Scala code is short and easy to understand because DataFrame supports multilevel data handling and thus can represent XML structure conveniently. Its DataFrame-based functions are more convenient for computing multilevel data.

Scala offers excellent data source support. It has a special function to retrieve XML data from many data sources, including files,Webservice, HTTP, etc.

To deploy and configure Scala for XML data computation, users just need to introduce databricks and Spark (without the need of deploying Spark service) function libraries.

esProc SPL

esProc SPL is the professional open-source structured data computation language. It computes data coming from any data source using uniform syntax and data structure as Scala does. Yet it is lighter and has simpler syntax.

esProc SPL handles the previous conditional query in the following way:


A

1

=xml(file("D:\\xml\\Employees_Orders.xml").read(),"xml/row")

2

=A1.conj(Orders)

3

=A2.select(Amount>100 && Amount<=3000 && like@c(Client,"*bro*"))

SPL reads in XML data as a multilevel table sequence object (which is similar to Scala DataFrame), concatenates all orders via conj function, and performs the conditional query through select function.

This block of code can be debugged or executed in esProc IDE, or stored as a script file (like condition.dfx) for invocation from a Java program through the JDBC interface. Below is the code for invocation:

    package Test;  
    import java.sql.Connection;  
    import java.sql.DriverManager;  
    import java.sql.ResultSet;  
    import java.sql.Statement;  
    public class test1 {  
        public static void main(String\[\]   args)throws   Exception {  
            Class.forName("com.esproc.jdbc.InternalDriver");  
            Connection connection = DriverManager.getConnection("jdbc:esproc:local://");  
            Statement statement = connection.createStatement();  
            ResultSet result = statement.executeQuery("call condition()");  
            printResult(result);  
            if(connection != null)     connection.close();  
        }

…

}

SPL achieves grouping and aggregation using the following code:

=A2.groups(year(OrderDate);sum(Amount))

Or implements join operations using code below:

=A1.new(Name,Gender,Dept,Orders.OrderID,Orders.Client,Orders.SellerId,Orders.Amount,Orders.OrderDate)

As the above code shows, SPL has the most powerful syntactic expressiveness that enables handling common operations, generates concise and easy to understand code, and facilitates loosely coupled integration with a Java program. The programming language has table sequence object to support multilevel data handling and gives intuitive support for operators to be able to retrieve values directly from multilevel data during a join, which further compresses the code.

With the powerful syntactic expressiveness, SPL can simplify multilevel XML data computation in most cases. Here is one example:

The file book1.xml stores book information, where the author node has author name and country attributes and where certain books have more than one author. Below is part of the XML file:

<?xml   version="1.0"?>

<library>

      <book   category="COOKING">

            <title>Everyday Italian</title>

            <author name="Giada De Laurentiis" country="it"     />

            <year>2005</year>

            <info>Hello Italian!</info>

        </book>

      <book   category="CHILDREN">

            <title>Harry Potter</title>

            <author name="J K. Rowling" country="uk"/>

            <year>2005</year>

            <info>Hello Potter!</info>

        </book>

      <book   category="WEB">

            <title>XQuery Kick Start</title>

          <author name="James   McGovern" country="us"   />

          <author name="Per   Bothner" country="us"/>

            <year>2005</year>

            <info>Hello XQuery</info>

        </book>

      <book   category="WEB">

            <title>Learning XML</title>

            <author name="Erik T. Ray" country="us"/>

            <year>2003</year>

            <info>Hello XML!</info>

        </book>

</library>

The code rearranges XML data into a structured two-dimensional table, where each author field value is displayed in the format of “author name[country] and where comma is used to separate information of multiple authors, and query the table to get information of books published in the year 2005. Below is the result set:

title

category

year

author

info

Everyday Italian

COOKING

2005

Giada De Laurentiis[it]

Hello Italian!

Harry Potter

CHILDREN

2005

J K. Rowling[uk]

Hello Potter!

XQuery Kick Start

WEB

2005

James McGovern[us],Per Bothner[us]

Hello XQuery

The computing task is not easy, but SPL makes it easier using the following code:



1

=file("D:\\xml\\book1.xml")

2

=xml@s(A1.read(),"library/book").library

3

=A2.new(category,book.field("year").ifn():year,book.field("title").ifn():title,book.field("lang").ifn():lang,book.field("info").ifn():info,book.field("name").select(~).concat@c():name,book.field("country").select(~).concat(","):country)

4

=A3.new(title,category,year,(lang,name.array().(~+"[")++country.array().(~+"]")).concat@c():author,info)

5

=A4.select(year==2005)

 

SPL provides great data source support. It has the special function to retrieve XML data from a variety of data sources, including files, WebService, and HTTP, etc.

XML data read and write is one of SPL’s basic features, so users do not need to make specific deployment. To integrate an SPL script into a Java program, you just need to introduce the relevant jars, and that is convenient.

In a nutshell, esProc SPL has the most powerful syntactic expressiveness that can simplify multilevel XML data computations; Scala has good syntactic expressiveness that can handle common operations; MySQL has enough expressive ability, but the code is too complicated, except for scenarios when the XML file has relatively simple structure and can be split and written into the database; dom4j has a too weak expressiveness ability to handle common computations; it is only suitable for dealing with pure conditional queries. In the aspect of data source support, both esProc SPL and Scala provides a wealth of choices. About deployment, both dom4j and MySQL are simple, yet the other two are also easy.