MongoDB Data Grouping & Aggregationate

 

The document-based NoSQL database MongoDB is popular worldwide thanks to its powerful processing ability. Besides the basic insert, delete, update and query, it also supports the relatively complicated grouping and aggregate operation. Yet in computing field there is always a better option. Here we’ll compare MongoDB method and SPL (Structured Process Language) way of grouping and summarizing data stored in MongoDB in several scenarios.

1. Nested array query

Task : Calculate average score of all subjects and each student’s total score.
Sample data:

_id name sex Scroe
1 Tom F [{“lesson”:"Physics",“mark”:60  },{“lesson”:"Chemical",“mark”:72}]
2 Jerry M [{“lesson”:"Physics",“mark”:92  },{“lesson”:"Math",“mark”:81}]

Expected result:

Physics 76
Tom 132
Chemical 72
Jerry 173
Math 81

Mongodb script:

1png

The score field contains arrays of [{subject, mark},…] structure. They need to be split to map on to the student to be further processed using $unwind operator and group() method.
SPL script (student.dfx):

A
1 =mongo_open(“mongodb://127.0.0.1:27017/raqdb”)
2 =mongo_shell(A1,“student.find()”).fetch()
3 =A2.conj(scroe).groups(lesson:LESSON;avg(mark):AVG)
4 =A2.new(name:NAME,scroe.sum(mark):TOTAL)
5 >A1.close()

Average score for each subject:

LESSON AVG
Chemical 72.0
Math 81.0
Physics 76.0

Total score of each student:

NAME TOTAL
Tom 132
Jerry 173

Script explanation:
A1: Connect to mongodb database.
A2: Read in data from student collection.
A3: Split score field and union it into a table sequence, and group the table by subject to calculate the average score.
A4: Calculate each student’s total score and return a new table sequence consisting of NAME and TOTAL fields. new() function creates a new table sequence.
A5: Close mongodb connection.

2. Nested document query

Task: Calculate sum of quantities in every income value and output value in each document.
Sample data :

_id income output
1 {“cpu”:1000, “mem”:500,   “mouse”:“100”} {“cpu”:1000, “mem”:600 ,“mouse”:“120”}
2 {“cpu”:2000, “mem”:1000,  “mouse”:“50”,“mainboard”:500 } {“cpu”:1500, “mem”:300}

Expected result

_id income output
1 1600 1720
2 3550 1800

Mongodb script

var fields = [  "income", "output"];
db.computer.aggregate([ 
   { 
      $project:{ 
         "values":{ 
            $filter:{ 
               input:{ 
                    "\$objectToArray":"$$ROOT"
               },
               cond:{ 
                  $in:[ 
                     "$$this.k",
                     fields
                  ]
               }
            }
         }
      }
   },
   { 
      \$unwind:"$values"
   },
   { 
      $project:{ 
         key:"$values.k",
         values:{ 
            "$sum":{ 
               "$let":{ 
                  "vars":{ 
                     "item":{ 
                          "\$objectToArray":"$values.v"
                     }
                  },
                    "in":"$$item.v"
               }
            }
         }
      }
   },
   {$sort: {"_id":-1}},
   { "$group": {
    "_id": "$_id",
    'income':{"\$first":   "$values"},
    "output":{"\$last":   "$values"}
    }},
]);

filter operator stores income field and output field as arrays. $unwind operator split each array into a record. $sum operator calculates sum of members. $group operator groups data by _id and joins up values.
SPL script:

A
1 =mongo_open(“mongodb://127.0.0.1:27017/raqdb”)
2 =mongo_shell(A1,“computer.find()”).fetch()
3 =A2.new(_id:ID,income.array().sum():INCOME,output.array().sum():OUTPUT)
4 >A1.close()

Result:

ID INCOME OUTPUT
1 1600.0 1720.0
2 3550.0 1800.0

Script explanation:
A1: Connect to mongodb database.
A2: Read in data from computer collection.
A3: Split each income value and output value into a sequence and perform sum and join results up with corresponding ID to form a new table sequence.
A4: Close database connection.
SPL gets field values from each subdocument and calculates their sum. This is much simpler. Beside, a nested document looks like a nested array at first sight, so be careful when you write the script.

3. Group and aggregate by segments

Task: Divide the following data by specified intervals of sales and count records in each segment.

Sample data:

_id NAME STATE SALES
1 Ashley New York 11000
2 Rachel Montana 9000
3 Emily New York 8800
4 Matthew Texas 8000
5 Alexis Illinois 14000

Grouping conditions: [0,3000);[3000,5000);[5000,7500);[7500,10000);[10000, ∞)

Expected result :

Segment number
3 3
4 2

MongoDB script:

var a_count=0;
var b_count=0;
var c_count=0;
var d_count=0;
var e_count=0;
db.sales.find({
}).forEach(
    function(myDoc) {
        if (myDoc.SALES <3000)   {
            a_count += 1;
        }
        else if (myDoc.SALES <5000)   {
            b_count += 1;
        }
        else if (myDoc.SALES   <7500) {
            c_count += 1;
        }
        else if (myDoc.SALES   <10000) {
            d_count += 1;
        }
        else {
            e_count += 1;
        }       
    }
    );
   
print("a_count="+a_count)
print("b_count="+b_count)
print("c_count="+c_count)
print("d_count="+d_count)
print("e_count="+e_count)

MongoDB lacks related API to handle grouping according to specified intervals and the hardcoding process is roundabout. The above script is just one of the Mongo solutions.
SPL script:

A
1 [3000,5000,7500,10000,15000]
2 =mongo_open(“mongodb://127.0.0.1:27017/raqdb”)
3 =mongo_shell(A2,“sales.find()”).fetch()
4 =A3.groups(A1.pseg(int(~.SALES)):Segment;count(1):   number)
5 >A2.close()

Script explanation:
A1: Define intervals by which SALES is segmented.
A2: Connect to mongodb database.
A3: Read in data from sales table.
A4: Group records by sales intervals and count records in each group. pseg()function returns the sequence number of a group where a records falls in. int() function converts data into an integer.
A5: Close Mongodb connection.
Both scripts get the desired result, but SPL script is more concise by using pseg() function.

4. Multilevel group and aggregate

Task: Group the following data by addr and then by book and count books for each addr and those for each book type.

addr book
address1 book1
address2 book1
address1 book5
address3 book9
address2 book5
address2 book1
address1 book1
address15 book1
address4 book3
address5 book1
address7 book11
address1 book1

Expected result:

_id Total books Count
address1 4 book1 3
book5 1
address15 1 book1 1
address2 3 book1 2
book5 1
address3 1 book9 1
address4 1 book3 1
address5 1 book1 1
address7 1 book11 1

MongoDB script :

db.books.aggregate([
    {   "$group": {
          "_id": {
              "addr": "$addr",
              "book": "$book"
        },
          "bookCount": {"\$sum": 1}
    }},
    {   "$group": {
          "_id": "$_id.addr",
          "books": {
              "$push": {
                  "book": "$_id.book",
                  "count": "$bookCount"
            },
        },
          "count": {"\$sum": "$bookCount"}
    }},
    {"$sort":   { "count": -1} },
    {   "$project": {
          "books": {"\$slice": [ "$books", 2] },
          "count": 1
    }}
]).pretty()

The above script groups data by addr,book and count books, and then re-group data by addr to count books, and display data in desired format.
SPL script (books.dfx):

A B
1 =mongo_open(“mongodb://127.0.0.1:27017/raqdb”)
2 =mongo_shell(A1,“books.find()”)
3 =A2.groups(addr,book;count(book):   Count)
4 =A3.groups(addr;sum(Count):Total)
5 =A3.join(addr,A4:addr,Total) return A5
6 >A1.close()

Result:

Address book Count Total
address1 book1 3 4
address1 book5 1 4
address15 book1 1 1
address2 book1 2 3
address2 book5 1 3
address3 book9 1 1
address4 book3 1 1
address5 book1 1 1
address7 book11 1 1

Script explanation:
A1: Connect to mongodb database.
A2: Read in data from books collection.
A3: Group data by addr&book and count books in each group.
A4: Re-group data by addr to count books in each group.
A5: Add A4’s Total field to A3’s table sequence by addr values.
B5: Return A5’s table sequence.
A6: Close database connection.
The above SPL script is both concise and integration-friendly.
To handle the grouping and aggregation in Java, you need to implement the requirement from the low level. The code is complicated and can’t be used to handle other scenarios. esProc offers JDBC interface to enable a Java application to call the SPL script in the manner of calling a stored procedure. See esProc JDBC Basic Uses to learn more.

Here’s how Java calls an SPL script:

 public void testStudent (){
        Connection con = null;
       com.esproc.jdbc.InternalCStatement st;
    try{
       // Establish a connection
       Class.forName("com.esproc.jdbc.InternalDriver");
       con= DriverManager.getConnection("jdbc:esproc:local://");
       // Call the stored procedure, where books is the SPL script (with dfx extension) name
       st =(com. esproc.jdbc.InternalCStatement)con.prepareCall("call books ()");
       // Execute the stored procedure
       st.execute();
       // Get the result set
       ResultSet rs = st.getResultSet();
        
     catch(Exception e){
       System.out.println(e);
    }

It’s easy to use the result of executing an SPL script by a Java application. You can also load the SPL script directly and use its name as a parameter to execute it via a function. esProc supports ODBC, too. The integration via ODBC is similar.
MongoDB’s schemaless document-based structure enables storage convenient and flexible data presentation and query. On the other hand, it increases the complexity of a shell script because there are details to consider and it needs many steps to get to the goal. By contrast, SPL streamlines the process, generates concise and simple code, and increases efficiency.