MongoDB Data Grouping & Aggregation

MongoDB Data Grouping & Aggregation

 

The document-based NoSQL database MongoDB is popular worldwide thanks to its powerful processing ability. Besides the basic insert, delete, update and query, it also supports the relatively complicated grouping and aggregate operation. Yet in computing field there is always a better option. Here we’ll compare MongoDB method and SPL (Structured Process Language) way of grouping and summarizing data stored in MongoDB in several scenarios.

 

1. Nested array query

Task: Calculate average score of all subjects and each student’s total score.

Sample data:

_id

name

Gender

Score

1

Tom

F

[{"subject":"  Physics","mark":60  },
   {"subject":"Chemical","mark":72  }]

2

Jerry

M

[{"subject":"  Physics","mark":92  },
   {"subject":"  Math","mark":81}]

Expected result:

Physics

76


Tom

132

Chemical

72


Jerry

173

Math

81




Mongodb script:

db.student.aggregate( [ 
   {\$unwind : "$score"},
  {$group: {
   "_id":   {"name"  :"$name"}   , 
   "qty":{"\$sum"   :  "$score.mark"}
   }
  }
  ] )

        

db.student.aggregate( [
   {$unwind : "score"},
  {$group: {
   "_id":  {"subject":"$score.subject"}   , 
   "qty":{"\$avg":   "$score.mark"}
   }
  }
  ] )

The score field contains arrays of [{subject, mark},…] structure. They need to be split to map on to the student to be further processed usingscore"},
  {$group: {
   "_id":  {"subject":"$score.subject"}   , 
   "qty":{"\$avg":   "$score.mark"}
   }
  }
  ] )

   

db.student.aggregate( [ 
   {\$unwind : "$score"},
  {$group: {
   "_id":   {"name"  :"$name"}   , 
   "qty":{"\$sum"   :  "$score.mark"}
   }
  }
  ] )

   

The score field contains arrays of [{subject, mark},…] structure. They need to be split to map on to the student to be further processed usingscore"},
  {$group: {
   "_id":  {"subject":"$score.subject"}   ,
   "qty":{"\$avg":   "$score.mark"}
   }
  }
  ] )

db.student.aggregate( [
   {\$unwind : "$score"},
  {$group: {
   "_id":   {"name"  :"$name"}   ,
   "qty":{"\$sum"   :  "$score.mark"}
   }
  }
  ] )

The score field contains arrays of [{subject, mark},…] structure. They need to be split to map on to the student to be further processed using unwind operator and group() method.

SPL script (student.dfx):


A

B

1

=mongo_open("mongodb://127.0.0.1:27017/raqdb")


2

=mongo_shell(A1,"student.find()").fetch()


3

=A2.conj(score).groups(subject:SUBJECT;avg(mark):AVG)


4

=A2.new(name:NAME,score.sum(mark):TOTAL)


5

>A1.close()


Average score for each subject:

SUBJECT

AVG

Chemical

72.0

Math

81.0

Physics

76.0

Total score of each student:

NAME

TOTAL

Tom

132

Jerry

173

 

Script explanation:
A1: Connect to mongodb database.
A2: Read in data from student collection.
A3: Split score field and union it into a table sequence, and group the table by subject to calculate the average score.
A4: Calculate each student’s total score and return a new table sequence consisting of NAME and TOTAL fields. new() function creates a new table sequence.
A5: Close mongodb connection.

 

2. Nested document query

Task: Calculate sum of quantities in every income value and output value in each document.
Sample data:

_id

income

output

1

"cpu":1000, "mem":500,  "mouse":"100"

"cpu":1000, "mem":600   ,"mouse":"120"

2

"cpu":2000, "mem":1000,
   "mouse":"50","mainboard":500  

{"cpu":1500,   "mem":300}

Expected result:

_id

income

output

1

1600

1720

2

3550

1800

Mongodb script:

    

var fields = [  "income", "output"];
  db.computer.aggregate([
   {
     project:{ 
      "values":{   
       $filter:{ 
          input:{ 
             "\$objectToArray":"$$ROOT"
          },
          cond:{ 
            $in:[ 
             "$$this.k",
             fields
            ]
           }
          }
         }
     }
    },
    {
     \$unwind:"$values"
    },
    { 
      $project:{
       key:"$values.k",
       values:{ 
         "$sum":{ 
          "$let":{ 
            "vars":{ 
              "item":{
                 "\$objectToArray":"$values.v"
              }
            },
             "in":"$$item.v"
           }
         }
        }
      }
    },
   {$sort: {"_id":-1}},
    { "$group": {
    "_id": "$_id",
    'income':{"\$first":  "$values"},
    "output":{"\$last":    "$values"}
    }},
  ]);

project:{ 
      "values":{   
       $filter:{ 
          input:{ 
             "\$objectToArray":"$$ROOT"
          },
          cond:{ 
            $in:[ 
             "$$this.k",
             fields
            ]
           }
          }
         }
     }
    },
    {
     \$unwind:"$values"
    },
    { 
      $project:{
       key:"$values.k",
       values:{ 
         "$sum":{ 
          "$let":{ 
            "vars":{ 
              "item":{
                 "\$objectToArray":"$values.v"
              }
            },
             "in":"$$item.v"
           }
         }
        }
      }
    },
   {$sort: {"_id":-1}},
    { "$group": {
    "_id": "$_id",
    'income':{"\$first":  "$values"},
    "output":{"\$last":    "$values"}
    }},
  ]);

   

project:{
      "values":{   
       $filter:{
          input:{
             "\$objectToArray":"$$ROOT"
          },
          cond:{
            $in:[
             "$$this.k",
             fields
            ]
           }
          }
         }
     }
    },
    {
     \$unwind:"$values"
    },
    {
      $project:{
       key:"$values.k",
       values:{
         "$sum":{
          "$let":{
            "vars":{
              "item":{
                 "\$objectToArray":"$values.v"
              }
            },
             "in":"$$item.v"
           }
         }
        }
      }
    },
   {$sort: {"_id":-1}},
    { "$group": {
    "_id": "$_id",
    'income':{"\$first":  "$values"},
    "output":{"\$last":    "$values"}
    }},
  ]);

filter operator stores income field and output field as arrays. unwindoperatorspliteacharrayintoarecord.unwindoperatorspliteacharrayintoarecord.unwind operator split each array into a record. sum operator calculates sum of members. group operator groups data by _id and joins up values.

SPL script:

      
   

A

   

B

       

1

   

=mongo_open("mongodb://127.0.0.1:27017/raqdb")

   
       

2

   

=mongo_shell(A1,"computer.find()").fetch()

   
       

3

   

=A2.new(_id:ID,income.array().sum():INCOME,output.array().sum():OUTPUT)

   
       

4

   

>A1.close()

   
   

Result:

      

ID

   

INCOME

   

OUTPUT

       

1

   

1600.0

   

1720.0

       

2

   

3550.0

   

1800.0

   

Script explanation:
A1: Connect to mongodb database.
A2: Read in data from computer collection.
A3: Split each income value and output value into a sequence and perform sum and join results up with corresponding ID to form a new table sequence.
A4: Close database connection.

SPL gets field values from each subdocument and calculates their sum. This is much simpler. Beside, a nested document looks like a nested array at first sight, so be careful when you write the script.

 

3. Group and aggregate by segments

Task: Divide the following data by specified intervals of sales and count records in each segment.

Sample data

      

_id

   

NAME

   

STATE

   

SALES

       

1

   

Ashley

   

New York

   

11000

       

2

   

Rachel

   

Montana

   

9000

       

3

   

Emily

   

New York

   

8800

       

4

   

Matthew

   

Texas

   

8000

       

5

   

Alexis

   

Illinois

   

14000

   

Grouping conditions: [0,3000);[3000,5000);[5000,7500);[7500,10000);[10000, ∞) 

Expected result

      

Segment

   

number

       

3

   

3

       

4

   

2

   

MongoDB script:

      

var a_count=0;
  var b_count=0;
  var c_count=0;
  var d_count=0;
  var e_count=0;
  db.sales.find({
  }).forEach( 
    function(myDoc) { 
      if (myDoc.SALES <3000)  { 
         a_count += 1;
       }
       else if (myDoc.SALES <5000)  { 
        b_count += 1;
       }
       else if (myDoc.SALES  <7500) { 
        c_count += 1;
       }
      else if (myDoc.SALES  <10000) { 
        d_count += 1;
      }
      else { 
        e_count += 1;
      } 
     } 
    );
  
  print("a_count="+a_count)
  print("b_count="+b_count)
  print("c_count="+c_count)
  print("d_count="+d_count)
  print("e_count="+e_count)

   

MongoDB lacks related API to handle grouping according to specified intervals and the hardcoding process is roundabout. The above script is just one of the Mongo solutions. 

SPL script:

      
   

A

   

B

       

1

   

[3000,5000,7500,10000,15000]

   
       

2

   

=mongo_open("mongodb://127.0.0.1:27017/raqdb")

   
       

3

   

=mongo_shell(A2,"sales.find()").fetch()

   
       

4

   

=A3.groups(A1.pseg(int(~.SALES)):Segment;count(1):    number)

   
       

5

   

>A2.close()

   
   

Script explanation:
A1: Define intervals by which SALES is segmented.
A2: Connect to mongodb database.
A3: Read in data from sales table.
A4: Group records by sales intervals and count records in each group. pseg() function returns the sequence number of a group where a records falls in. int() function converts data into an integer.
A5: Close Mongodb connection.

Both scripts get the desired result, but SPL script is more concise by using pseg() function.

 

4. Multilevel group and aggregate 

Task: Group the following data by addr and then by book and count books for each addr and those for each book type.

      

addr

   

book

       

address1

   

book1

       

address2

   

book1

       

address1

   

book5

       

address3

   

book9

       

address2

   

book5

       

address2

   

book1

       

address1

   

book1

       

address15

   

book1

       

address4

   

book3

       

address5

   

book1

       

address7

   

book11

       

address1

   

book1

   

Expected result:

      

_id

   

Total

   

books

   

Count

       

address1

   

4

   

book1

   

3

       
   
   

book5

   

1

       

address15

   

1

   

book1

   

1

       

address2

   

3

   

book1

   

2

       
   
   

book5

   

1

       

address3

   

1

   

book9

   

1

       

address4

   

1

   

book3

   

1

       

address5

   

1

   

book1

   

1

       

address7

   

1

   

book11

   

1

   

MongoDB script

      

db.books.aggregate([
   {    "group operator groups data by _id and joins up values.

SPL script:

      
   

A

   

B

       

1

   

=mongo_open("mongodb://127.0.0.1:27017/raqdb")

   
       

2

   

=mongo_shell(A1,"computer.find()").fetch()

   
       

3

   

=A2.new(_id:ID,income.array().sum():INCOME,output.array().sum():OUTPUT)

   
       

4

   

>A1.close()

   
   

Result:

      

ID

   

INCOME

   

OUTPUT

       

1

   

1600.0

   

1720.0

       

2

   

3550.0

   

1800.0

   

Script explanation:
A1: Connect to mongodb database.
A2: Read in data from computer collection.
A3: Split each income value and output value into a sequence and perform sum and join results up with corresponding ID to form a new table sequence.
A4: Close database connection.

SPL gets field values from each subdocument and calculates their sum. This is much simpler. Beside, a nested document looks like a nested array at first sight, so be careful when you write the script.

 

3. Group and aggregate by segments

Task: Divide the following data by specified intervals of sales and count records in each segment.

Sample data

      

_id

   

NAME

   

STATE

   

SALES

       

1

   

Ashley

   

New York

   

11000

       

2

   

Rachel

   

Montana

   

9000

       

3

   

Emily

   

New York

   

8800

       

4

   

Matthew

   

Texas

   

8000

       

5

   

Alexis

   

Illinois

   

14000

   

Grouping conditions: [0,3000);[3000,5000);[5000,7500);[7500,10000);[10000, ∞) 

Expected result

      

Segment

   

number

       

3

   

3

       

4

   

2

   

MongoDB script:

      

var a_count=0;
  var b_count=0;
  var c_count=0;
  var d_count=0;
  var e_count=0;
  db.sales.find({
  }).forEach( 
    function(myDoc) { 
      if (myDoc.SALES <3000)  { 
         a_count += 1;
       }
       else if (myDoc.SALES <5000)  { 
        b_count += 1;
       }
       else if (myDoc.SALES  <7500) { 
        c_count += 1;
       }
      else if (myDoc.SALES  <10000) { 
        d_count += 1;
      }
      else { 
        e_count += 1;
      } 
     } 
    );
  
  print("a_count="+a_count)
  print("b_count="+b_count)
  print("c_count="+c_count)
  print("d_count="+d_count)
  print("e_count="+e_count)

   

MongoDB lacks related API to handle grouping according to specified intervals and the hardcoding process is roundabout. The above script is just one of the Mongo solutions. 

SPL script:

      
   

A

   

B

       

1

   

[3000,5000,7500,10000,15000]

   
       

2

   

=mongo_open("mongodb://127.0.0.1:27017/raqdb")

   
       

3

   

=mongo_shell(A2,"sales.find()").fetch()

   
       

4

   

=A3.groups(A1.pseg(int(~.SALES)):Segment;count(1):    number)

   
       

5

   

>A2.close()

   
   

Script explanation:
A1: Define intervals by which SALES is segmented.
A2: Connect to mongodb database.
A3: Read in data from sales table.
A4: Group records by sales intervals and count records in each group. pseg() function returns the sequence number of a group where a records falls in. int() function converts data into an integer.
A5: Close Mongodb connection.

Both scripts get the desired result, but SPL script is more concise by using pseg() function.

 

4. Multilevel group and aggregate 

Task: Group the following data by addr and then by book and count books for each addr and those for each book type.

      

addr

   

book

       

address1

   

book1

       

address2

   

book1

       

address1

   

book5

       

address3

   

book9

       

address2

   

book5

       

address2

   

book1

       

address1

   

book1

       

address15

   

book1

       

address4

   

book3

       

address5

   

book1

       

address7

   

book11

       

address1

   

book1

   

Expected result:

      

_id

   

Total

   

books

   

Count

       

address1

   

4

   

book1

   

3

       
   
   

book5

   

1

       

address15

   

1

   

book1

   

1

       

address2

   

3

   

book1

   

2

       
   
   

book5

   

1

       

address3

   

1

   

book9

   

1

       

address4

   

1

   

book3

   

1

       

address5

   

1

   

book1

   

1

       

address7

   

1

   

book11

   

1

   

MongoDB script

      

db.books.aggregate([
   {    "group operator groups data by _id and joins up values.

SPL script:


A

B

1

=mongo_open("mongodb://127.0.0.1:27017/raqdb")


2

=mongo_shell(A1,"computer.find()").fetch()


3

=A2.new(_id:ID,income.array().sum():INCOME,output.array().sum():OUTPUT)


4

>A1.close()


Result:

ID

INCOME

OUTPUT

1

1600.0

1720.0

2

3550.0

1800.0

Script explanation:
A1: Connect to mongodb database.
A2: Read in data from computer collection.
A3: Split each income value and output value into a sequence and perform sum and join results up with corresponding ID to form a new table sequence.
A4: Close database connection.

SPL gets field values from each subdocument and calculates their sum. This is much simpler. Beside, a nested document looks like a nested array at first sight, so be careful when you write the script.

 

3. Group and aggregate by segments

Task: Divide the following data by specified intervals of sales and count records in each segment.

Sample data:

_id

NAME

STATE

SALES

1

Ashley

New York

11000

2

Rachel

Montana

9000

3

Emily

New York

8800

4

Matthew

Texas

8000

5

Alexis

Illinois

14000

Grouping conditions: [0,3000);[3000,5000);[5000,7500);[7500,10000);[10000, ∞)

Expected result:

Segment

number

3

3

4

2

MongoDB script:

var a_count=0;
  var b_count=0;
  var c_count=0;
  var d_count=0;
  var e_count=0;
  db.sales.find({
  }).forEach(
    function(myDoc) {
      if (myDoc.SALES <3000)  {
         a_count += 1;
       }
       else if (myDoc.SALES <5000)  {
        b_count += 1;
       }
       else if (myDoc.SALES  <7500) {
        c_count += 1;
       }
      else if (myDoc.SALES  <10000) {
        d_count += 1;
      }
      else {
        e_count += 1;
      }
     }
    );
 
  print("a_count="+a_count)
  print("b_count="+b_count)
  print("c_count="+c_count)
  print("d_count="+d_count)
  print("e_count="+e_count)

MongoDB lacks related API to handle grouping according to specified intervals and the hardcoding process is roundabout. The above script is just one of the Mongo solutions.

SPL script:


A

B

1

[3000,5000,7500,10000,15000]


2

=mongo_open("mongodb://127.0.0.1:27017/raqdb")


3

=mongo_shell(A2,"sales.find()").fetch()


4

=A3.groups(A1.pseg(int(~.SALES)):Segment;count(1):    number)


5

>A2.close()


Script explanation:
A1: Define intervals by which SALES is segmented.
A2: Connect to mongodb database.
A3: Read in data from sales table.
A4: Group records by sales intervals and count records in each group. pseg() function returns the sequence number of a group where a records falls in. int() function converts data into an integer.
A5: Close Mongodb connection.

Both scripts get the desired result, but SPL script is more concise by using pseg() function.

 

4. Multilevel group and aggregate

Task: Group the following data by addr and then by book and count books for each addr and those for each book type.

addr

book

address1

book1

address2

book1

address1

book5

address3

book9

address2

book5

address2

book1

address1

book1

address15

book1

address4

book3

address5

book1

address7

book11

address1

book1

Expected result:

_id

Total

books

Count

address1

4

book1

3



book5

1

address15

1

book1

1

address2

3

book1

2



book5

1

address3

1

book9

1

address4

1

book3

1

address5

1

book1

1

address7

1

book11

1

MongoDB script:

db.books.aggregate([
   {    "group": {
    "_id": {
    "addr": "addr",
    "book": "addr",
    "book": "addr",
    "book": "book"
   },
    "bookCount":   {"$sum": 1}
   }},
   {    "group": {
    "_id":   "group": {
    "_id":   "group": {
    "_id":   "_id.addr",
    "books": {
    "push": { 
    "book": "push": { 
    "book": "push": {
    "book": "_id.book",
    "count": "bookCount"
   },
       },
        "count": {"\$sum":   "bookCount"
   },
       },
        "count": {"\$sum":   "bookCount"
   },
       },
        "count": {"\$sum":   "bookCount"}
     }},
     {"sort":  { "count": -1} },
     {  "sort":  { "count": -1} },
     {  "sort":  { "count": -1} },
     {  "project": {
        "books": {"$slice": [   "$books", 2] },
        "count": 1
     }}
  ]).pretty()

The above script groups data by addr,book and count books, and then re-group data by addr to count books, and display data in desired format.

SPL script (books.dfx):


A

B

1

=mongo_open("mongodb://127.0.0.1:27017/raqdb")


2

=mongo_shell(A1,"books.find()")


3

=A2.groups(addr,book;count(book):     Count)


4

=A3.groups(addr;sum(Count):Total)


5

=A3.join(addr,A4:addr,Total)

return A5

6

>A1.close()


Result:

Address

book

Count

Total

address1

book1

3

4

address1

book5

1

4

address15

book1

1

1

address2

book1

2

3

address2

book5

1

3

address3

book9

1

1

address4

book3

1

1

address5

book1

1

1

address7

book11

1

1

Script explanation:
A1: Connect to mongodb database.
A2: Read in data from books collection.
A3: Group data by addr&book and count books in each group.
A4: Re-group data by addr to count books in each group.
A5: Add A4’s Total field to A3’s table sequence by addr values.
B5: Return A5’s table sequence.
A6: Close database connection.

The above SPL script is both concise and integration-friendly.

To handle the grouping and aggregation in Java, you need to implement the requirement from the low level. The code is complicated and can’t be used to handle other scenarios. esProc offers JDBC interface to enable a Java application to call the SPL script in the manner of calling a stored procedure. See esProc JDBC Basic Uses to learn more.

Here’s how Java calls an SPL script:
   public void testStudent (){
        Connection con = null;
       com.esproc.jdbc.InternalCStatement st;
    try{
       // Establish a connection
       Class.forName("com.esproc.jdbc.InternalDriver");
       con= DriverManager.getConnection("jdbc:esproc:local://");
       // Call the stored procedure, where books is the SPL script (with dfx extension) name
       st =(com. esproc.jdbc.InternalCStatement)con.prepareCall("call books ()");
       // Execute the stored procedure
       st.execute();
       // Get the result set
       ResultSet rs = st.getResultSet();
        
     catch(Exception e){
       System.out.println(e);
    }

It’s easy to use the result of executing an SPL script by a Java application. You can also load the SPL script directly and use its name as a parameter to execute it via a function. esProc supports ODBC, too. The integration via ODBC is similar.

MongoDB’s schemaless document-based structure enables storage convenient and flexible data presentation and query. On the other hand, it increases the complexity of a shell script because there are details to consider and it needs many steps to get to the goal. By contrast, SPL streamlines the process, generates concise and simple code, and increases efficiency.