Looking for the Best Class Library for Computing MongoDB Data

 

MongoDB's built-in JSON-style query expressions are inconvenient to use on certain occasions, so we need to retrieve data from the MongoDB database and compute it with an external class library. Below is a comparison of four such libraries – MongoDB Connector, Calcite, Scala and esProc SPL – in terms of syntactic expressiveness and deployment configurations.

MongoDB Connector

MongoDB Connector is the official class library MongoDB offers. It simulates a MySQL service to translate SQL into JSON-style expressions: it receives SQL requests from above through ODBC or JDBC, and accesses MongoDB data below using JSON-style expressions.

Connector supports basic SQL syntax. Let's look at an example. There is a collection named test1 in the MongoDB database. Most of its fields contain values of simple data types and store employee information. The Orders field, an array type, stores the multiple orders of the current employee. Below is part of the source data:

    [{
        "_id": {"$oid": "6074f6c7e85e8d46400dc4a7"},
        "EId": 7, "State": "Illinois", "Dept": "Sales", "Name": "Alexis", "Gender": "F", "Salary": 9000, "Birthday": "1972-08-16",
        "Orders": [
            {"OrderID": 70, "Client": "DSG", "SellerId": 7, "Amount": 288, "OrderDate": "2009-09-30"},
            {"OrderID": 131, "Client": "FOL", "SellerId": 7, "Amount": 103.2, "OrderDate": "2009-12-10"}
        ]
    },
    {
        "_id": {"$oid": "6074f6c7e85e8d46400dc4a8"},
        "EId": 8, "State": "California", ...
    }]

The following code embeds SQL into a Java program to perform a conditional query on the orders table.

    package mon;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    public class Main {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            Connection connection = DriverManager.getConnection("jdbc:mysql://127.0.0.1:3307?source=mongo&mechanism=PLAIN&useSSL=false&authenticationPlugins=org.mongodb.mongosql.auth.plugin.MongoSqlAuthenticationPlugin");
            Statement statement = connection.createStatement();
            String str = "SELECT * FROM mongo.test1_orders where Orders.Amount>1000 and Orders.Amount<=3000 and Orders.Client like '%S%'";
            ResultSet result = statement.executeQuery(str);
            …
            if(connection != null) connection.close();
        }
    }
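The elided part is ordinary JDBC result-set handling. Below is a minimal sketch continuing the program above; the flattened column labels (such as Orders.Client) follow the naming used in the query and are an assumption here:

    // Sketch of the elided part: iterate the JDBC result set and read
    // the flattened columns (the labels are an assumption).
    while (result.next()) {
        System.out.println(result.getInt("Orders.OrderID") + "\t"
                + result.getString("Orders.Client") + "\t"
                + result.getDouble("Orders.Amount"));
    }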

We can modify the SQL statement to achieve grouping & aggregation or a join query:

str="SELECT year(Orders.OrderdateOrders.Orderdate) y,sum(Orders.AmountOrders.Amount) s FROM   mongo.test1_orders  group by year(Orders.OrderdateOrders.Orderdate)";

str="SELECT o.Orders.OrderIDOrders.OrderID,o.Orders.ClientOrders.Client,o.Orders.Sel≤rIdOrders.Sel≤rId,o.Orders.AmountOrders.Amount,o.Orders.OrderDateOrders.OrderDate,e.Name,e.Gender,e.Dept     from mongo.test1_Orders o, mongo. test1 e where   o.Orders.Sel≤rIdOrders.Sel≤rId=e.EId";

In the above code, Orders.Orderdate is the default name of a subdocument field. Though the dot notation makes it look like a primary table and sub table relationship (the names can be redefined in the metadata file), it isn't one, and the data will not be parsed hierarchically, because SQL does not support multilevel data types. In fact, Connector identifies collection test1 as two separate tables: table test1, which contains no subdocuments, and table test1_Orders, which contains only the subdocuments. SQL cannot make use of the existing primary table and sub table relationship and has to create an extra join instead, which reduces computing efficiency.

Besides the usual SQL disadvantage of not supporting multilevel data computation, Connector is among the weakest SQL implementations in terms of syntactic expressiveness. One piece of evidence is that it does not support window functions. In fact, MongoDB Connector was originally intended to cater to the basic needs of certain BI tools.
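For example, a ranking query like the hypothetical one below, written in the same style as the statements above, cannot be expressed through Connector because of the missing window functions:

    str="SELECT Orders.Client, Orders.Amount, rank() over (partition by Orders.Client order by Orders.Amount desc) r FROM mongo.test1_orders";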

As one of MongoDB's built-in components, the class library has simple integration and deployment configurations. Once it is installed, users only need to execute the following command at the command line to start the service:

mongosqld --mongo-uri "mongodb://localhost:27017/?connect=direct" --addr "127.0.0.1:3307"

Calcite

Calcite aims to compute data coming from any source, including MongoDB, using SQL. Unfortunately, documents on Calcite's MongoDB support are few and rough, and there are no detailed illustrations for certain features, so the following description may not be entirely accurate.

Calcite can only retrieve data by whole collection. If a collection is large, the retrieval probably results in memory overflow (as Calcite only supports in-memory computing). The class library also cannot retrieve values directly from a multilevel collection like the above test1, which contains multilevel subdocuments. To cater to Calcite, we rearrange the test1 data into two single-level collections, Employees and Orders.

Below is Calcite code for performing a conditional query on collection Orders:

    package org.example;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Properties;
    public class App {
        public static void main(String[] args) throws Exception {
            Properties config = new Properties();
            config.put("model", "d:\\mongo-model.json");
            config.put("lex", "MYSQL");
            Connection con = DriverManager.getConnection("jdbc:calcite:", config);
            Statement stmt = con.createStatement();
            String sql = "select * from orders where Amount>1000 and Amount<=3000";
            ResultSet rs = stmt.executeQuery(sql);
            …
            if(con != null) con.close();
        }
    }

Unlike with the other class libraries, there is no fuzzy-query condition in the statement, because Calcite does not support fuzzy queries. The library also lacks a function for getting the year and functions for converting between strings and dates, so it cannot achieve the previous grouping & aggregation task. We have to rewrite the code as follows:

sql="select Client,sum(Amount)   from orders group by Client";

Calcite is not join-friendly, either. The class library cannot retrieve just some of the fields; it has to use the asterisk sign * to get all of them. So it cannot implement the previous join query directly, and we rewrite the code as follows:

sql= "SELECT * from   Orders,Employees where   Orders.SellerId=Employees.EId";

There are two steps to deploying Calcite: first, introduce calcite-mongodb to Maven; then create the metadata file mongo-model.json. The metadata file is as follows:

    {
        "version": "1.0",
        "defaultSchema": "dSchema",
        "schemas": [
            {
                "type": "custom",
                "name": "alias",
                "factory": "org.apache.calcite.adapter.mongodb.MongoSchemaFactory",
                "operand": {
                    "host": "localhost:27017",
                    "database": "mongo"
                }
            },
            {
                "name": "dSchema",
                "tables": [
                    {
                        "name": "orders",
                        "type": "view",
                        "sql": "select cast(_MAP['OrderID'] AS integer) AS OrderID, cast(_MAP['Client'] AS varchar(40)) AS Client, cast(_MAP['SellerId'] AS integer) AS SellerId, cast(_MAP['Amount'] AS float) AS Amount, cast(_MAP['OrderDate'] AS varchar(20)) AS OrderDate from \"alias\".\"Orders\""
                    },
                    {
                        "name": "employees",
                        "type": "view",
                        "sql": "select cast(_MAP['EId'] AS integer) AS EId, cast(_MAP['State'] AS varchar(40)) AS State, cast(_MAP['Dept'] AS varchar(40)) AS Dept, cast(_MAP['Name'] AS varchar(40)) AS Name, cast(_MAP['Gender'] AS varchar(40)) AS Gender, cast(_MAP['Salary'] AS float) AS Salary, cast(_MAP['Birthday'] AS varchar(20)) AS Birthday from \"alias\".\"Employees\""
                    }
                ]
            }
        ]
    }

In the above configurations, "alias"."Orders" is a physical table name and orders is the corresponding view name. In theory, there is no need to configure a view, and users could query the target physical table directly. In practice, the direct query produces many SQL errors (such as on grouping & aggregation), probably because Calcite is not mature enough.
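For illustration, a direct query on the physical table would look like the hypothetical statement below; statements of this kind, especially grouping & aggregation ones, are what produced the SQL errors just mentioned:

    sql = "select _MAP['Client'], sum(cast(_MAP['Amount'] AS float)) from \"alias\".\"Orders\" group by _MAP['Client']";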

Scala

Scala is a commonly used structured data computation language, and it has a relatively long history of supporting MongoDB data computation. The computing principle is this: Scala reads a collection from the MongoDB database, stores it as a DataFrame object (or RDD), and computes it using its uniform computing ability.

Yet the Scala class library has some intrinsic weaknesses. It can only retrieve data by whole collection and does not support retrieving data using MongoDB's JSON-style query expressions, so if a collection contains a large volume of data, it takes a long time to get the job done. Nor can it get values directly from a multilevel collection; users have to rearrange a multilevel collection in the MongoDB database into single-level collections for further processing. In the previous example, the test1 collection is split into two single-level collections, Orders and Employees.

Below is the Scala code for performing a conditional query on collection Orders:

    package test
    import org.apache.spark.sql.SparkSession
    import com.mongodb.spark.config._
    import com.mongodb.spark.sql.toSparkSessionFunctions
    object Mon {
        def main(args: Array[String]): Unit = {
            val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
            val spark = SparkSession.builder()
                .master("local")
                .appName("MongoDB Test")
                .getOrCreate()
            val Orders = spark.loadFromMongoDB(ReadConfig(
                Map("uri" -> "mongodb://127.0.0.1:27017/mongo.Orders")
            ))
            val condition = Orders.where("Amount>1000 and Amount<=3000 and Client like '%S%'")
            condition.show()
        }
    }

We can also perform grouping & aggregation and join operations using Scala:

    // Grouping & aggregation
    // (requires: import org.apache.spark.sql.functions.{year, sum})
    val groupBy = Orders.groupBy(year(Orders("OrderDate"))).agg(sum("Amount"))
    // Join operations
    val Employees = spark.loadFromMongoDB(ReadConfig(
        Map("uri" -> "mongodb://127.0.0.1:27017/mongo.Employees")
    ))
    val join = Orders.join(Employees, Orders("SellerId") === Employees("EId"), "inner")
        .select("OrderID","Client","SellerId","Amount","OrderDate","Name","Gender","Dept")

Though a multilevel collection must be split into single-level collections for retrieval, the DataFrame offers rather powerful computations once the data is in. That is the advantage of a uniform data structure.

Scala is easy to deploy: programmers just need to add org.mongodb.spark to Maven.

esProc SPL

esProc SPL is a professional open-source structured data computation language. Like Calcite, it computes data coming from any data source using uniform syntax and a uniform data structure, yet it is lighter, has simpler syntax, and supports MongoDB data computation better.

esProc SPL handles the previous conditional query on collection test1 in the following way:


|   | A |
|---|---|
| 1 | =mongo_open("mongodb://127.0.0.1:27017/mongo") |
| 2 | =mongo_shell(A1,"test1.find()") |
| 3 | =A2.conj(Orders) |
| 4 | =A3.select(Amount>1000 && Amount<=3000 && like@c(Client,"*s*")).fetch() |
| 5 | =mongo_close(A1) |

As A2 shows, SPL supports MongoDB's JSON-style query expressions (find, count, distinct and aggregate). To do an interval-based query, the SPL is like this: =mongo_shell(A1,"test1.find({'Orders.Amount':{$gt:1000,$lt:3000}})").

This method reduces the volume of data retrieved, which prevents memory overflow when a collection is large and the JSON-style expression is simple, and it speeds up the query when an index is available. Even if the retrieved data is still too large, SPL can do the job effortlessly: it returns a cursor-type result set to handle data that cannot fit into memory.

This block of code can be executed in the esProc IDE, or stored as a script file (select.dfx, for example) and invoked from a Java program through the JDBC interface. Below is the invocation code:

    package Test;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    public class test1 {
        public static void main(String[] args) throws Exception {
            Class.forName("com.esproc.jdbc.InternalDriver");
            Connection connection = DriverManager.getConnection("jdbc:esproc:local://");
            Statement statement = connection.createStatement();
            ResultSet result = statement.executeQuery("call select()");
            …
            if(connection != null) connection.close();
        }
    }
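If the script declares input parameters, they can presumably be passed through the standard JDBC parameter mechanism. Below is a minimal sketch under that assumption; the parameterized select(?) script and its argument are hypothetical:

    // Hypothetical sketch: passing one argument to an SPL script via a
    // PreparedStatement (requires import java.sql.PreparedStatement).
    PreparedStatement ps = connection.prepareStatement("call select(?)");
    ps.setObject(1, 1000);   // e.g. a lower bound for Amount
    ResultSet r = ps.executeQuery();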

SPL achieves grouping and aggregation using the following code:


|   | A |
|---|---|
| 1 | =mongo_open("mongodb://127.0.0.1:27017/mongo") |
| 2 | =mongo_shell(A1,"test1.find()") |
| 3 | =A2.conj(Orders).groups(year(OrderDate);sum(Amount)) |
| 4 | =mongo_close(A1) |

Or implements a join operation using the code below:


|   | A |
|---|---|
| 1 | =mongo_open("mongodb://127.0.0.1:27017/mongo") |
| 2 | =mongo_shell(A1,"test1.find()") |
| 3 | =A2.new(Orders.OrderID,Orders.Client,Name,Gender,Dept).fetch() |
| 4 | =mongo_close(A1) |

SPL produces simpler join code than any other class library, including MongoDB Connector, because it does not perform an actual join: thanks to the language's intrinsic multilevel data structure, it can handle a multilevel collection like test1 and express the primary table and sub table relationship without an extra join. The other class libraries are built on single-level data structures, so it is hard for them to handle multilevel collections.

SPL supports joining two single-level collections, of course:


|   | A |
|---|---|
| 1 | =mongo_open("mongodb://127.0.0.1:27017/mongo") |
| 2 | =mongo_shell(A1,"Orders.find()").fetch() |
| 3 | =mongo_shell(A1,"Employees.find()").fetch() |
| 4 | =mongo_close(A1) |
| 5 | =join(A2,SellerId;A3,EId) |
| 6 | =A5.new(_1.OrderID,_1.Client,_2.Name,_2.Gender,_2.Dept) |

SPL has rich expressiveness, with procedural syntax and support for SQL syntax. As SQL does not support multilevel data (see Calcite and Connector above), esProc's SQL syntax only supports joining two single-level collections, as shown by the code below:


|   | A |
|---|---|
| 1 | =mongo_open("mongodb://127.0.0.1:27017/mongo") |
| 2 | =mongo_shell(A1,"Orders.find()").fetch() |
| 3 | =mongo_shell(A1,"Employees.find()").fetch() |
| 4 | =mongo_close(A1) |
| 5 | $select o.OrderId,o.Client,e.Name,e.Gender,e.Dept from {A2} o join {A3} e on o.SellerId=e.EId |

One of MongoDB's features is multilevel data, yet it is difficult to handle multilevel data with JSON-style expressions. If we use esProc SPL as the computing class library, the process becomes much simpler. In the following task, we need to sum the numbers under the income field and those under the output field respectively:

| _id | income | output |
|-----|--------|--------|
| 1 | {"cpu":1000, "mem":500, "mouse":"100"} | {"cpu":1000, "mem":600, "mouse":"120"} |
| 2 | {"cpu":2000, "mem":1000, "mouse":"50", "mainboard":500} | {"cpu":1500, "mem":300} |

Below is the code using JSON-style expressions. It is complicated.

    var fields = ["income", "output"];
    db.computer.aggregate([
        {
            $project: {
                "values": {
                    $filter: {
                        input: { "$objectToArray": "$$ROOT" },
                        cond: { $in: ["$$this.k", fields] }
                    }
                }
            }
        },
        { $unwind: "$values" },
        {
            $project: {
                key: "$values.k",
                values: {
                    "$sum": {
                        "$let": {
                            "vars": { "item": { "$objectToArray": "$values.v" } },
                            "in": "$$item.v"
                        }
                    }
                }
            }
        },
        { $sort: { "_id": -1 } },
        {
            "$group": {
                "_id": "$_id",
                "income": { "$first": "$values" },
                "output": { "$last": "$values" }
            }
        }
    ]);

SPL handles it in a simple and elegant way:


|   | A |
|---|---|
| 1 | =mongo_open("mongodb://127.0.0.1:27017/raqdb") |
| 2 | =mongo_shell(A1,"computer.find()").fetch() |
| 3 | =A2.new(_id:ID,income.array().sum():INCOME,output.array().sum():OUTPUT) |
| 4 | >A1.close() |

To deploy esProc, just enable the MongoCli external library; the deployment can be done through a GUI.

Here is the summary. In terms of syntactic expressiveness, esProc SPL offers the strongest support for multilevel data computation; Connector provides adequate support by identifying a multilevel collection as multiple tables; Scala is expensive to use since it needs the multilevel data rearranged into single-level data; and Calcite is both high-cost and immature. In terms of deployment and configuration, Connector is the simplest; esProc SPL and Scala are simple, too; and Calcite is again at the bottom.