Cross Aggregation for MongoDB
Also known as contingency table analysis, a cross aggregation is used in statistics to summarize the simultaneous distribution of two or more variables to get their relationship. The statistical technique maps a row with a set of n attributes to a categorical variable or a combination of categorical variables to form a database table for analysis. Here’s one example:
Score | ||||||
School | Subject | 1 | 2 | 3 | 4 | 5 |
A | Sub1 | Number of students | Number of students | … | … | … |
Sub2 | Number of students | … | ||||
B | Sub1 | Number of students | ||||
Sub2 | Number of students |
If we combine the subject and score, then the table looks like this:
Subject-Score | ||||||
School | sub1-1 | sub1-2 | … | sub1-5 | sub2-1 | … |
A | Number of students | … | ||||
B | Number of students |
MongoDB is able to store data of similar structures in a clear and natural way. But it has difficulty performing the cross aggregation. It’s also complicated to retrieve and summarize data in advanced languages, like Java. In this case, we can perform cross aggregation in esProc SPL (Structured Process Language) for MongoDB. Here’s an example:
The Student collection stores school, student name, subject and score, as shown below:
db.student.insert ({school:'school1', sname : 'Sean' , sub1: 4, sub2 :5})
db.student.insert ({school:'school1', sname : 'chris' , sub1: 4, sub2 :3})
db.student.insert ({school:'school1', sname : 'becky' , sub1: 5, sub2 :4})
db.student.insert ({school:'school1', sname : 'sam' , sub1: 5, sub2 :4})
db.student.insert ({school:'school2', sname : 'dustin' , sub1: 2, sub2 :2})
db.student.insert ({school:'school2', sname : 'greg' , sub1: 3, sub2 :4})
db.student.insert ({school:'school2', sname : 'peter' , sub1: 5, sub2 :1})
db.student.insert ({school:'school2', sname : 'brad' , sub1: 2, sub2 :2})
db.student.insert ({school:'school2', sname : 'liz' , sub1: 3, sub2 :null})
Here’s the crosstab I expect: one school for each row; the second column records the number of students scored 5 for sub1 for each school, the third column records the number of students scored 4 for sub 1 for each school, and so on. The scores are in descending order from 5 to 1. After sub1, it will be sub 2, sub3…
Below is SPL script for the cross aggregation:
A | |
1 | =mongo_open("mongodb://localhost:27017/local?user=test&password=test") |
2 | =mongo_shell(A1,"student.find()").fetch() |
3 | =A2.group(school) |
4 | =A3.new(school:school,~.align@a(5,sub1).(~.len()):sub1,~.align@a(5,sub2).(~.len()):sub2) |
5 | =A4.new(school,sub1(5):sub1-5,sub1(4):sub1-4,sub1(3):sub1-3,sub1(2):sub1-2,sub1(1):sub1-1,sub2(5):sub2-5,sub2(4):sub2-4,sub2(3):sub2-3,sub2(2):sub2-2,sub2(1):sub2-1) |
6 | =mongo_close(A1) |
A1: Connect to MongoDB. The IP and port are localhost:27017, the database is test, and both user and password are test.
A2: Retrieve Student collection from MongoDB database using find() function and return the data as a cursor. An SPL cursor can be retrieved and processed in batches. That avoids memory overflow due to a data volume exceeding memory capacity. Here the data volume is relatively small, so we can fetch all records from the cursor at a time.
A3: First we group the records by school.
A4: align() function groups each group by aligning members to the sequence [1,2,3,4,5] and len() function gets the length of every subgroup.
A5: Populate A4’s each length under the corresponding subject-score column header to generate a record sequence.
Here’s the final result:
The above algorithm is like this: SPL groups records, then group each group using align() function by scores and calculates the length of each subgroup, and then display data in the required format.
Note: To access MongoDB in esProc, you need to put the necessary Java driver (like mongo-java-driver-3.9.1.jar) into the database’s external library directory (extLib\MongoCli) in esProc.
esProc doesn’t limit the computations only within its IDE, it enables integration with a Java application. To do this, you just need to write another line of code, which is return A5, to output the result in the form of a data set to the Java application. Similarly, to call an SPL script from Java to access MongoDB, you should first put MongoDB Java driver into the Java application’s classpath.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese Version