Cross Aggregation for MongoDB

 

Also known as contingency table analysis, a cross aggregation is used in statistics to summarize the simultaneous distribution of two or more variables to get their relationship. The statistical technique maps a row with a set of n attributes to a categorical variable or a combination of categorical variables to form a database table for analysis. Heres one example:

 



Score

School

Subject

1

2

3

4

5

A

Sub1

Number of students

Number of students

Sub2

Number of students




B

Sub1

Number of students





Sub2

Number of students





 

If we combine the subject and score, then the table looks like this:

 


Subject-Score

School

sub1-1

sub1-2

sub1-5

sub2-1

A

Number of students





B

Number of students






 

MongoDB is able to store data of similar structures in a clear and natural way. But it has difficulty performing the cross aggregation. Its also complicated to retrieve and summarize data in advanced languages, like Java. In this case, we can perform cross aggregation in esProc SPL (Structured Process Language) for MongoDB. Heres an example:

The Student collection stores school, student name, subject and score, as shown below:

 

db.student.insert  ({school:'school1', sname : 'Sean' , sub1: 4, sub2 :5})

db.student.insert  ({school:'school1', sname : 'chris' , sub1: 4, sub2 :3})

db.student.insert  ({school:'school1', sname : 'becky' , sub1: 5, sub2 :4})

db.student.insert  ({school:'school1', sname : 'sam' , sub1: 5, sub2 :4})

db.student.insert  ({school:'school2', sname : 'dustin' , sub1: 2, sub2 :2})

db.student.insert  ({school:'school2', sname : 'greg' , sub1: 3, sub2 :4})

db.student.insert  ({school:'school2', sname : 'peter' , sub1: 5, sub2 :1})

db.student.insert  ({school:'school2', sname : 'brad' , sub1: 2, sub2 :2})

db.student.insert  ({school:'school2', sname : 'liz' , sub1: 3, sub2 :null})

 

Heres the crosstab I expect: one school for each row; the second column records the number of students scored 5 for sub1 for each school, the third column records the number of students scored 4 for sub 1 for each school, and so on. The scores are in descending order from 5 to 1. After sub1, it will be sub 2, sub3

undefined 

 

Below is SPL script for the cross aggregation:


A

1

=mongo_open("mongodb://localhost:27017/local?user=test&password=test")

2

=mongo_shell(A1,"student.find()").fetch()

3

=A2.group(school)

4

=A3.new(school:school,~.align@a(5,sub1).(~.len()):sub1,~.align@a(5,sub2).(~.len()):sub2)

5

=A4.new(school,sub1(5):sub1-5,sub1(4):sub1-4,sub1(3):sub1-3,sub1(2):sub1-2,sub1(1):sub1-1,sub2(5):sub2-5,sub2(4):sub2-4,sub2(3):sub2-3,sub2(2):sub2-2,sub2(1):sub2-1)

6

=mongo_close(A1)

A1: Connect to MongoDB. The IP and port are localhost:27017, the database is test, and both user and password are test.

A2: Retrieve Student collection from MongoDB database using find() function and return the data as a cursor. An SPL cursor can be retrieved and processed in batches. That avoids memory overflow due to a data volume exceeding memory capacity. Here the data volume is relatively small, so we can fetch all records from the cursor at a time.

A3: First we group the records by school.
 undefined

A4: align() function groups each group by aligning members to the sequence [1,2,3,4,5] and len() function gets the length of every subgroup.


undefined

A5: Populate A4s each length under the corresponding subject-score column header to generate a record sequence.

Heres the final result:

undefined 


The above algorithm is like this: SPL groups records, then group each group using align() function by scores and calculates the length of each subgroup, and then display data in the required format.

Note: To access MongoDB in esProc, you need to put the necessary Java driver (like mongo-java-driver-3.9.1.jar) into the databases external library directory (extLib\MongoCli) in esProc.

 

esProc doesnt limit the computations only within its IDE, it enables integration with a Java application. To do this, you just need to write another line of code, which is return A5, to output the result in the form of a data set to the Java application. Similarly, to call an SPL script from Java to access MongoDB, you should first put MongoDB Java driver into the Java applications classpath.