Getting Top N from MongoDB Data


MongoDB is a popular NoSQL database that can store a huge volume of data. There is a common need in statistical projects: sort a large number of objects and only get the top N, such as getting the latest values of a target field. In this article we’ll illustrate how to handle this in SPL.
Collection last3 has two fields: variable and timestamp. Task: group documents by variable, get documents with the latest 3 timestamp, and then find the document with the earliest timestamp.
Here’s part of the source data:

{"_id" :  ObjectId("54f69645e4b077ed8d997857"),"variable" :  "A", "timestamp" : ISODate("1995-01-01T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997856"),"variable" :  "A", "timestamp" : ISODate("1995-01-02T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997855"),"variable" :  "A", "timestamp" : ISODate("1995-01-03T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997854"),"variable" :  "B", "timestamp" : ISODate("1995-01-02T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997853"),"variable" :  "B", "timestamp" : ISODate("1995-01-01T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997852"),"variable" :  "B", "timestamp" : ISODate("1994-01-03T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997851"),"variable" :  "C", "timestamp" : ISODate("1994-01-03T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997850"),"variable" :  "C", "timestamp" : ISODate("1994-01-02T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997858"),"variable" :  "C", "timestamp" : ISODate("1994-01-01T00:00:00Z")}
{"_id" :  ObjectId("54f69645e4b077ed8d997859"),"variable" :  "C", "timestamp" : ISODate("1993-01-01T00:00:00Z")}

SPL script:

1 =mongo_open(“mongodb://localhost:27017/local?user=test&password=test”)
2 =mongo_shell(A1,“last3.find(,{_id:0};{variable:1})”)
3 for A2;variable;-timestamp)
4 =@|B3
5 =B4.minp(~.timestamp)
6 =mongo_close(A1)

A1: Connect to MongoDB database. The connection string’s format is mongo://ip:port/db?arg=value&….
A2: find() function retrieves documents from collection last3, sorts them and return result as a cursor. As no filtering condition is specified, it gets all fields except _id and sorts them by variable.
A3: Fetch a set of documents with same variable field value from A2’s cursor each time and do this iteratively. A3’s working scope covers an indented area from B3 to B4, within which A3 references a loop variable. A3 stores data in the memory and we can view a fetch result in debugging mode:

variable timestamp
C 1994-01-03 08:00:00
C 1994-01-02 08:00:00
C 1994-01-01 08:00:00
C 1993-01-01 08:00:00

B3: Select documents with the latest (largest) timestamp.
B4: Append B3’s result to B4. Here’s B4’’s result:

variable timestamp
A 1995-01-03 08:00:00
A 1995-01-02 08:00:00
A 1995-01-01 08:00:00
B 1995-01-02 08:00:00
B 1995-01-01 08:00:00
B 1994-01-03 08:00:00
C 1994-01-03 08:00:00
C 1994-01-02 08:00:00
C 1994-01-01 08:00:00

A5: Find the document with the earliest (smallest) timestamp from B4:

variable timestamp
C 1994-01-01 08:00:00

A6: Close MongoDB connection.
By simplifying the implementation of mongodb shell, it’s much easier to get topN from MongoDB data in SPL.