pjf RaqForum 22 No.
552 View • 4 Years ago

MongoDB Data Grouping & Aggregation

The document-based NoSQL database MongoDB is popular worldwide thanks to its powerful processing ability. Besides basic insert, delete, update and query operations, it also supports relatively complicated grouping and aggregate operations. Yet in the computing field there is always a better option. Here we’ll compare the MongoDB method and the SPL (Structured Process Language) way of grouping and summarizing data stored in MongoDB in several scenarios.

1. Nested array query

Task: Calculate the average score for each subject and each student’s total score.

Sample data:

_id	name	Gender	score
1	Tom	F	[{"subject":"Physics","mark":60}, {"subject":"Chemical","mark":72}]
2	Jerry	M	[{"subject":"Physics","mark":92}, {"subject":"Math","mark":81}]

Expected result:

SUBJECT	AVG	NAME	TOTAL
Physics	76	Tom	132
Chemical	72	Jerry	173
Math	81

MongoDB script:

db.student.aggregate( [
   // Split the score array so that each {subject, mark} entry becomes its own record
   {$unwind : "$score"},
   // Group by subject and average the marks
   {$group: {
      "_id": {"subject": "$score.subject"},
      "qty": {"$avg": "$score.mark"}
   }}
] )

db.student.aggregate( [
   {$unwind : "$score"},
   // Group by student name and total the marks
   {$group: {
      "_id": {"name": "$name"},
      "qty": {"$sum": "$score.mark"}
   }}
] )

The score field holds an array of {subject, mark} subdocuments. The $unwind operator splits the array so that each entry is mapped to its student, and the $group stage then aggregates the resulting records.
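As a rough illustration of what the unwind and group steps do, here is a minimal sketch in plain JavaScript that works on an in-memory copy of the sample data rather than on a real collection. The helper groupMarks and the variable names are made up for this example, and the subject names are assumed to carry no leading spaces.

```javascript
// In-memory copy of the sample documents (subject names trimmed).
const students = [
  { _id: 1, name: "Tom", score: [{ subject: "Physics", mark: 60 }, { subject: "Chemical", mark: 72 }] },
  { _id: 2, name: "Jerry", score: [{ subject: "Physics", mark: 92 }, { subject: "Math", mark: 81 }] },
];

// Rough equivalent of $unwind: one flat record per (student, score entry) pair.
const unwound = students.flatMap((s) =>
  s.score.map((entry) => ({ name: s.name, subject: entry.subject, mark: entry.mark }))
);

// Rough equivalent of $group: collect the marks that share a key.
// groupMarks is a hypothetical helper, not a MongoDB API.
function groupMarks(records, keyOf) {
  const groups = new Map();
  for (const r of records) {
    const key = keyOf(r);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(r.mark);
  }
  return groups;
}

const sum = (xs) => xs.reduce((a, b) => a + b, 0);

// Average mark per subject (like $avg) and total mark per student (like $sum).
const avgBySubject = Object.fromEntries(
  [...groupMarks(unwound, (r) => r.subject)].map(([k, marks]) => [k, sum(marks) / marks.length])
);
const totalByName = Object.fromEntries(
  [...groupMarks(unwound, (r) => r.name)].map(([k, marks]) => [k, sum(marks)])
);

console.log(avgBySubject);
console.log(totalByName);
```

Both aggregations reuse the same unwound records; only the grouping key and the final reduction differ, which mirrors the two pipelines.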

SPL script (student.dfx):

	A	B
1	=mongo_open("mongodb://127.0.0.1:27017/raqdb")
2	=mongo_shell(A1,"student.find()").fetch()
3	=A2.conj(score).groups(subject:SUBJECT;avg(mark):AVG)
4	=A2.new(name:NAME,score.sum(mark):TOTAL)
5	>A1.close()

Average score for each subject:

SUBJECT	AVG
Chemical	72.0
Math	81.0
Physics	76.0

Total score of each student:

NAME	TOTAL
Tom	132
Jerry	173

Script explanation:
A1: Connect to the MongoDB database.
A2: Read in data from the student collection.
A3: Concatenate the score arrays of all documents into one table sequence, then group it by subject to calculate the average score.
A4: Calculate each student’s total score and return a new table sequence consisting of NAME and TOTAL fields. The new() function creates a new table sequence.
A5: Close the MongoDB connection.

2. Nested document query

Task: Calculate the sum of the quantities in the income value and in the output value of each document.
Sample data:

_id	income	output
1	{"cpu":1000, "mem":500, "mouse":"100"}	{"cpu":1000, "mem":600, "mouse":"120"}
2	{"cpu":2000, "mem":1000, "mouse":"50", "mainboard":500}	{"cpu":1500, "mem":300}

Expected result:

_id	income	output
1	1600	1720
2	3550	1800

MongoDB script:

var fields = [ "income", "output"];
db.computer.aggregate([
   {
      // Turn the whole document into key/value pairs and keep only income and output
      $project:{
         "values":{
            $filter:{
               input:{
                  "$objectToArray":"$$ROOT"
               },
               cond:{
                  $in:[
                     "$$this.k",
                     fields
                  ]
               }
            }
         }
      }
   },
   {
      // One record per kept field
      $unwind:"$values"
   },
   {
      // Sum the values inside each subdocument
      $project:{
         key:"$values.k",
         values:{
            "$sum":{
               "$let":{
                  "vars":{
                     "item":{
                        "$objectToArray":"$values.v"
                     }
                  },
                  "in":"$$item.v"
               }
            }
         }
      }
   },
   // Sort by key as well, so that income always precedes output within each _id
   {$sort: {"_id":-1, "key":1}},
   { "$group": {
      "_id": "$_id",
      'income':{"$first": "$values"},
      "output":{"$last": "$values"}
   }},
]);

The $filter operator keeps the income and output fields as an array of key/value pairs. The $unwind operator splits each array into separate records. The $sum operator calculates the sum of the members of each subdocument. The $group operator groups data by _id and joins the values back together.
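The same steps can be sketched in plain JavaScript on an in-memory copy of the sample data. This is only an illustration under stated assumptions: the names sumValues and totals are made up, and the string-typed mouse values are converted with Number(), which is a choice made for this sketch (MongoDB's $sum skips non-numeric values, so string amounts would need converting before they count).

```javascript
// In-memory copy of the sample documents; "mouse" is a string in the source table.
const computers = [
  { _id: 1, income: { cpu: 1000, mem: 500, mouse: "100" }, output: { cpu: 1000, mem: 600, mouse: "120" } },
  { _id: 2, income: { cpu: 2000, mem: 1000, mouse: "50", mainboard: 500 }, output: { cpu: 1500, mem: 300 } },
];

const fields = ["income", "output"];

// Sum the values of one subdocument, converting numeric strings (assumption of this sketch).
const sumValues = (subdoc) => Object.values(subdoc).reduce((acc, v) => acc + Number(v), 0);

const totals = computers.map((doc) => {
  const row = { _id: doc._id };
  // Object.entries plays the role of $objectToArray on the whole document;
  // the includes() check plays the role of $filter with $in.
  for (const [key, value] of Object.entries(doc)) {
    if (fields.includes(key)) row[key] = sumValues(value);
  }
  return row;
});

console.log(totals);
```

Because each document is handled as a whole, no unwind or regroup step is needed here; the pipeline needs them only because it processes one key/value record at a time.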

SPL script:

	A	B
1	=mongo_open("mongodb://127.0.0.1:27017/raqdb")
2	=mongo_shell(A1,"computer.find()").fetch()
3	=A2.new(_id:ID,income.array().sum():INCOME,output.array().sum():OUTPUT)
4	>A1.close()

Result:

ID	INCOME	OUTPUT
1	1600.0	1720.0
2	3550.0	1800.0

Script explanation:
A1: Connect to the MongoDB database.
A2: Read in data from the computer collection.
A3: Convert each income value and output value into a sequence, sum its members, and combine the results with the corresponding ID to form a new table sequence.
A4: Close the database connection.

SPL gets the field values from each subdocument and calculates their sum, which is much simpler. Besides, a nested document looks like a nested array at first sight, so be careful when you write the script.

3. Group and aggregate by segments

Task: Divide the following data by specified intervals of sales and count records in each segment.

Sample data:

_id	NAME	STATE	SALES
1	Ashley	New York	11000
2	Rachel	Montana	9000
3	Emily	New York	8800
4	Matthew	Texas	8000
5	Alexis	Illinois	14000

Grouping conditions: [0,3000);[3000,5000);[5000,7500);[7500,10000);[10000, ∞)

Expected result:

Segment	number
3	3
4	2

MongoDB script:

Apart from the $bucket stage (available since MongoDB 3.4), MongoDB has no dedicated API for grouping by specified intervals, and hardcoding the segments is roundabout. The above script is just one of the MongoDB solutions.

SPL script:

	A	B
1	[3000,5000,7500,10000,15000]
2	=mongo_open("mongodb://127.0.0.1:27017/raqdb")
3	=mongo_shell(A2,"sales.find()").fetch()
4	=A3.groups(A1.pseg(int(~.SALES)):Segment;count(1): number)
5	>A2.close()

Script explanation:
A1: Define the intervals by which SALES is segmented.
A2: Connect to the MongoDB database.
A3: Read in data from the sales collection.
A4: Group records by sales interval and count the records in each group. The pseg() function returns the sequence number of the interval in which a record falls. The int() function converts data into an integer.
A5: Close the MongoDB connection.
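To make the interval lookup concrete, here is a minimal JavaScript sketch of the idea behind A4. The function segmentOf is a hypothetical stand-in written for this example, not SPL's actual implementation; it assumes the segment number is the count of boundaries that are less than or equal to the value, which reproduces the expected result above.

```javascript
// Interval boundaries, as in cell A1 of the SPL script.
const boundaries = [3000, 5000, 7500, 10000, 15000];

// Hypothetical helper: number of boundaries that are <= value,
// i.e. 0 for [0,3000), 1 for [3000,5000), and so on.
function segmentOf(bounds, value) {
  let seg = 0;
  while (seg < bounds.length && value >= bounds[seg]) seg++;
  return seg;
}

// In-memory copy of the sample data.
const sales = [
  { NAME: "Ashley", STATE: "New York", SALES: 11000 },
  { NAME: "Rachel", STATE: "Montana", SALES: 9000 },
  { NAME: "Emily", STATE: "New York", SALES: 8800 },
  { NAME: "Matthew", STATE: "Texas", SALES: 8000 },
  { NAME: "Alexis", STATE: "Illinois", SALES: 14000 },
];

// Count the records that fall into each segment.
const counts = new Map();
for (const rec of sales) {
  const seg = segmentOf(boundaries, rec.SALES);
  counts.set(seg, (counts.get(seg) || 0) + 1);
}

// Present the counts ordered by segment number.
const result = [...counts]
  .sort((a, b) => a[0] - b[0])
  .map(([Segment, number]) => ({ Segment, number }));

console.log(result);
```

Keeping the boundaries in a single array means the intervals can be changed without touching the counting logic, which is the main advantage over a chain of hardcoded comparisons.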

Both scripts get the desired result, but the SPL script is more concise thanks to the pseg() function.

4. Multilevel group and aggregate

Task: Group the following data by addr and then by book, and count the books for each addr and for each book type within it.

addr	book
address1	book1
address2	book1
address1	book5
address3	book9
address2	book5
address2	book1
address1	book1
address15	book1
address4	book3
address5	book1
address7	book11
address1	book1

Expected result:

_id	Total	books	Count
address1	4	book1	3
		book5	1
address15	1	book1	1
address2	3	book1	2
		book5	1
address3	1	book9	1
address4	1	book3	1
address5	1	book1	1
address7	1	book11	1

MongoDB script:

db.books.aggregate([
   // First level: count records per (addr, book) pair
   { "$group": {
      "_id": {
         "addr": "$addr",
         "book": "$book"
      },
      "bookCount": {"$sum": 1}
   }},
   // Second level: regroup by addr, collecting the per-book counts and the total
   { "$group": {
      "_id": "$_id.addr",
      "books": {
         "$push": {
            "book": "$_id.book",
            "count": "$bookCount"
         },
      },
      "count": {"$sum": "$bookCount"}
   }},
   { "$sort": { "count": -1 } },
   { "$project": {
      "books": {"$slice": [ "$books", 2 ] },
      "count": 1
   }}
]).pretty()

The above script groups data by addr and book to count the books, then regroups the data by addr to total the counts, and finally displays the data in the desired format.
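The two grouping levels can be traced with a short plain-JavaScript sketch over an in-memory copy of the sample data. The variable names are chosen for this example only, and the output shape (an _id, a books array and a count per address) is modeled loosely on the pipeline's result.

```javascript
// In-memory copy of the sample data, in the order listed above.
const books = [
  ["address1", "book1"], ["address2", "book1"], ["address1", "book5"],
  ["address3", "book9"], ["address2", "book5"], ["address2", "book1"],
  ["address1", "book1"], ["address15", "book1"], ["address4", "book3"],
  ["address5", "book1"], ["address7", "book11"], ["address1", "book1"],
].map(([addr, book]) => ({ addr, book }));

// First level: count records per (addr, book) pair.
const pairCounts = new Map();
for (const { addr, book } of books) {
  const key = JSON.stringify([addr, book]);
  pairCounts.set(key, (pairCounts.get(key) || 0) + 1);
}

// Second level: regroup by addr, pushing each book count and summing a total.
const byAddr = new Map();
for (const [key, count] of pairCounts) {
  const [addr, book] = JSON.parse(key);
  if (!byAddr.has(addr)) byAddr.set(addr, { _id: addr, books: [], count: 0 });
  const group = byAddr.get(addr);
  group.books.push({ book, count });
  group.count += count;
}

// Sort by total in descending order, like the $sort stage.
const report = [...byAddr.values()].sort((a, b) => b.count - a.count);

console.log(JSON.stringify(report, null, 2));
```

The second pass reads only the pair counts, not the raw records, which is why the total per address is a sum of counts rather than a fresh count.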

SPL script (books.dfx):

	A	B
1	=mongo_open("mongodb://127.0.0.1:27017/raqdb")
2	=mongo_shell(A1,"books.find()")
3	=A2.groups(addr,book;count(book): Count)
4	=A3.groups(addr;sum(Count):Total)
5	=A3.join(addr,A4:addr,Total)	return A5
6	>A1.close()

Result:

Address	book	Count	Total
address1	book1	3	4
address1	book5	1	4
address15	book1	1	1
address2	book1	2	3
address2	book5	1	3
address3	book9	1	1
address4	book3	1	1
address5	book1	1	1
address7	book11	1	1

Script explanation:
A1: Connect to the MongoDB database.
A2: Read in data from the books collection.
A3: Group data by addr and book, and count the books in each group.
A4: Regroup A3’s data by addr to total the books for each addr.
A5: Add A4’s Total field to A3’s table sequence by matching addr values.
B5: Return A5’s table sequence.
A6: Close the database connection.
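The SPL result is a flat table in which every (addr, book) row carries its address total. A minimal JavaScript sketch of that join step follows; it starts from a few hand-copied rows of the first grouping, and the names pairRows, totals and joined are made up for illustration.

```javascript
// A few rows as produced by the first grouping step (A3 in the SPL script).
const pairRows = [
  { addr: "address1", book: "book1", Count: 3 },
  { addr: "address1", book: "book5", Count: 1 },
  { addr: "address2", book: "book1", Count: 2 },
  { addr: "address2", book: "book5", Count: 1 },
];

// Like A4: total the counts per addr.
const totals = new Map();
for (const row of pairRows) {
  totals.set(row.addr, (totals.get(row.addr) || 0) + row.Count);
}

// Like A5: attach the matching Total to every row, keeping the table flat.
const joined = pairRows.map((row) => ({ ...row, Total: totals.get(row.addr) }));

console.log(joined);
```

A flat table of this kind maps directly onto a JDBC result set, whereas the nested books array of the MongoDB output would need extra unpacking on the Java side.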

The above SPL script is both concise and integration-friendly.

To handle the grouping and aggregation in Java, you have to implement the logic from scratch. The code is complicated and can’t be reused for other scenarios. esProc offers a JDBC interface that enables a Java application to call an SPL script in the same way it calls a stored procedure. See esProc JDBC Basic Uses to learn more.

Here’s how Java calls an SPL script:
   // Requires java.sql.Connection, java.sql.DriverManager and java.sql.ResultSet
   public void testStudent() {
       Connection con = null;
       com.esproc.jdbc.InternalCStatement st;
       try {
           // Load the esProc JDBC driver and establish a connection
           Class.forName("com.esproc.jdbc.InternalDriver");
           con = DriverManager.getConnection("jdbc:esproc:local://");
           // Call the stored procedure, where books is the SPL script (with dfx extension) name
           st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call books()");
           // Execute the stored procedure
           st.execute();
           // Get the result set
           ResultSet rs = st.getResultSet();
       } catch (Exception e) {
           System.out.println(e);
       } finally {
           // Close the connection whether or not the call succeeded
           if (con != null) {
               try { con.close(); } catch (Exception e) { System.out.println(e); }
           }
       }
   }

A Java application can easily use the result of executing an SPL script. You can also load the SPL script directly and pass its name as a parameter to a function that executes it. esProc supports ODBC, too, and integration via ODBC is similar.

MongoDB’s schemaless, document-based structure makes storage convenient and data presentation and querying flexible. On the other hand, it increases the complexity of a shell script, because there are many details to consider and many steps are needed to reach the goal. By contrast, SPL streamlines the process, produces concise and simple code, and increases efficiency.

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/ydhVnFH9

Youtube 👉 https://www.youtube.com/@esProc_SPL
