"Also known as contingency table analysis, a cross aggregation is used in statistics to summarize .."

Hoo RaqForum 19 No.
1 Reply • 1,281 View • 5 Years ago

Cross Aggregation for MongoDB

Also known as contingency table analysis, a cross aggregation is used in statistics to summarize the simultaneous distribution of two or more variables to get their relationship. The statistical technique maps a row with a set of n attributes to a categorical variable or a combination of categorical variables to form a database table for analysis. Here’s one example:

		Score
School	Subject	1	2	3	4	5
A	Sub1	Number of students	Number of students	…	…	…
A	Sub2	Number of students	…
B	Sub1	Number of students
B	Sub2	Number of students

If we combine the subject and score, then the table looks like this:

	Subject-Score
School	sub1-1	sub1-2	…	sub1-5	sub2-1	…
A	Number of students	…
B	Number of students

MongoDB is able to store data of similar structures in a clear and natural way. But it has difficulty performing the cross aggregation. It’s also complicated to retrieve and summarize data in advanced languages, like Java. In this case, we can perform cross aggregation in esProc SPL (Structured Process Language) for MongoDB. Here’s an example:

The Student collection stores school, student name, subject and score, as shown below:

db.student.insert ({school:'school1', sname : 'Sean' , sub1: 4, sub2 :5})

db.student.insert ({school:'school1', sname : 'chris' , sub1: 4, sub2 :3})

db.student.insert ({school:'school1', sname : 'becky' , sub1: 5, sub2 :4})

db.student.insert ({school:'school1', sname : 'sam' , sub1: 5, sub2 :4})

db.student.insert ({school:'school2', sname : 'dustin' , sub1: 2, sub2 :2})

db.student.insert ({school:'school2', sname : 'greg' , sub1: 3, sub2 :4})

db.student.insert ({school:'school2', sname : 'peter' , sub1: 5, sub2 :1})

db.student.insert ({school:'school2', sname : 'brad' , sub1: 2, sub2 :2})

db.student.insert ({school:'school2', sname : 'liz' , sub1: 3, sub2 :null})

Here’s the crosstab I expect: one school for each row; the second column records the number of students scored 5 for sub1 for each school, the third column records the number of students scored 4 for sub 1 for each school, and so on. The scores are in descending order from 5 to 1. After sub1, it will be sub 2, sub3…

undefined

Below is SPL script for the cross aggregation:

	A
1	=mongo_open("mongodb://localhost:27017/local?user=test&password=test")
2	=mongo_shell(A1,"student.find()").fetch()
3	=A2.group(school)
4	=A3.new(school:school,~.align@a(5,sub1).(~.len()):sub1,~.align@a(5,sub2).(~.len()):sub2)
5	=A4.new(school,sub1(5):sub1-5,sub1(4):sub1-4,sub1(3):sub1-3,sub1(2):sub1-2,sub1(1):sub1-1,sub2(5):sub2-5,sub2(4):sub2-4,sub2(3):sub2-3,sub2(2):sub2-2,sub2(1):sub2-1)
6	=mongo_close(A1)

A1: Connect to MongoDB. The IP and port are localhost:27017, the database is test, and both user and password are test.

A2: Retrieve Student collection from MongoDB database using find() function and return the data as a cursor. An SPL cursor can be retrieved and processed in batches. That avoids memory overflow due to a data volume exceeding memory capacity. Here the data volume is relatively small, so we can fetch all records from the cursor at a time.

A3: First we group the records by school.
undefined

A4: align() function groups each group by aligning members to the sequence [1,2,3,4,5] and len() function gets the length of every subgroup.

undefined

A5: Populate A4’s each length under the corresponding subject-score column header to generate a record sequence.

Here’s the final result:

undefined

The above algorithm is like this: SPL groups records, then group each group using align() function by scores and calculates the length of each subgroup, and then display data in the required format.

Note: To access MongoDB in esProc, you need to put the necessary Java driver (like mongo-java-driver-3.9.1.jar) into the database’s external library directory (extLib\MongoCli) in esProc.

esProc doesn’t limit the computations only within its IDE, it enables integration with a Java application. To do this, you just need to write another line of code, which is return A5, to output the result in the form of a data set to the Java application. Similarly, to call an SPL script from Java to access MongoDB, you should first put MongoDB Java driver into the Java application’s classpath.

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc