“Top N” Queries on Structured Data

 

A “top N” query gets the first or bottom N values, the records, or their positions, from a data set. Sometimes we use such a query after a grouping operation to get values, specific information, or positions from a subset. This article will tell you how to handle different types of “top N” scenarios through examples and offers sample scripts in esProc SPL. Looking “Top N” Queries on Structured Data for details.

 

1. The maximum/minimum

Getting the maximum or minimum value can be regarded as a special top N scenario where N is 1. It’s so common that I single it out for special illustration. There are many such computing tasks, such as getting the highest math score in class one, finding the age of the youngest employee, and so on. On certain occasions, instead of the specific values we are more concerned with their positions. Often this happens in inter-row calculations. One example is to find how much the sales amount in the company’s best month goes up compared with the previous month. To get this done we need to first get the sequence number of the record holding the month with the highest sales amount, then the sales amount in the previous month, and compare them. Other times we want to get certain information in the record containing the maximum or minimum value. Examples include getting the name of the student who has the highest math score in class one, finding which department the youngest employee belongs to in a company, etc.

Now we’ll look at how to deal with the three scenarios of “getting maximum/minimum”. The following table records information of NASDAQ Composite:

Date

Open

Close

Volume

2019/01/02

6506.910156

6665.939941

2261800000

2019/01/03

6584.77002

6463.5

2607290000

2019/01/04

6567.140137

6738.859863

2579550000

2019/01/07

6757.529785

6823.470215

2507550000

2019/01/08

6893.439941

6897.0

2380290000

 

1.1 Get the maximum/minimum value

Example 1 Get the highest closing price in the NASDAQ in the year of 2019.

SPL script


A

B

1

=file("IXIC.txt").import@t()

/ Import IXIC data

2

=A1.select(year(Date)==2019)

/ Get data of 2019

3

=A2.max(Close)

/ A.max() function gets the highest   closing price

To get the lowest closing price in the NASDAQ in the year of 2019:


A

B

3

=A2.min(Close)

/ A.min()function gets the lowest   closing price

 

1.2 Get the sequence number(s) of record(s) holding the maximum/minimum value

Example 2 Calculate the growth rate of the trading date with the highest NASDAQ closing price in 2019 compared with the closing price in the previous day.

SPL script


A

B

1

=file("IXIC.txt").import@t()

/ Import IXIC data

2

=A1.select(year(Date)==2019).sort(Date)

/ Get data of 2019 and sort it by Date

3

=A2.pmax(Close)

/ A.pmax() gets the sequence number of   the record holding the highest closing price

4

=A2.calc(A3,Close/Close[-1]-1)

/ Calculate the growth rate by   dividing the highest closing price by the closing price of the previous day

There could be more than one record having the maximum value. To return the sequence numbers of all eligible records, you can use @a option in A.pmax() function:


A

B

3

=A2.pmax@a(Close)

/ Get sequence numbers of all records   holding the highest closing price

You can use @z option in A.pmax()function to locate the record(s) from backwards to forwards:


A

B

3

=A2.pmax@z(Close)

/ Get the sequence number of the   record holding the highest closing price

1.3 Get the record(s) holding the maximum/minimum value

Example 3 Get the date in the record holding the highest NASDAQ closing price in the year of 2019.

SPL script


A

B

1

=file("IXIC.txt").import@t()

/ Import IXIC data

2

=A1.select(year(Date)==2019)

/ Get data of 2019

3

=A2.maxp(Close)

/ A.maxp() gets the record having the   highest closing price

4

=A3.Date

/ Get the date from A3’s record

A.minp()function is used to get the record(s) containing the minimum value:


A

B

3

=A2.minp(Close)

/ A.minp() gets the record having the   lowest closing price

Both A.maxp()and A.minp() functions can work with @a option and @z option to achieve specific computing goals. Here we won’t give examples.

 

2. The top/bottom N

There are same three scenarios of getting the top/bottom N. Use the same NASDAQ composite data to illustrate them:

Date

Open

Close

Volume

2019/01/02

6506.910156

6665.939941

2261800000

2019/01/03

6584.77002

6463.5

2607290000

2019/01/04

6567.140137

6738.859863

2579550000

2019/01/07

6757.529785

6823.470215

2507550000

2019/01/08

6893.439941

6897.0

2380290000

 

2.1 Get the top/bottom N values

Example 4 Get the top 3 NASDAQ volumes in the year of 2019.

SPL script


A

B

1

=file("IXIC.txt").import@t()

/ Import IXIC data

2

=A1.select(year(Date)==2019)

/ Get data of 2019

3

=A2.top(-3, Volume)

/ A.top(n,x) function gets the top 3   volumes

To get the bottom 4 NASDAQ volumes in the year of 2019:


A

B

3

=A2.top(4, Volume)

/ A.top(n,x) function gets the bottom   4 volumes

 

2.2 Get the sequence numbers of the top/bottom N values

Example 5 Calculate the growth rate of each of the trading dates having the 3 highest NASDAQ closing prices in 2019 compared with their closing price in the prior day.

SPL script


A

B

1

=file("IXIC.txt").import@t()

/ Import IXIC data

2

=A1.select(year(Date)==2019).sort(Date)

/ Get data of 2019 and sort it by Date

3

=A2.ptop(-3, Close)

/A.ptop(n,x) function gets the   sequence numbers of the highest 3 closing prices

4

=A3.run(~=A2(~).Volume/A2(~-1).Volume-1)

/ Calculate the growth rate circularly   by dividing the current volume by volume of the prior day

 

2.3 Get the records holding the top/bottom N values

Example 6 Get the records of trading dates with 5 lowest volumes in NASDAQ 2019.

SPL script


A

B

1

=file("IXIC.txt").import@t()

/ Import IXIC data

2

=A1.select(year(Date)==2019).sort(Date)

/ Get data of 2019

3

=A2.top(5; Close)

/ A.top(n; x) function gets the   records of 5 trading dates with the lowest volumes

 

3. Top N queries on post-grouping subsets

Getting the maximum/minimum value and the top/bottom N values from each subset after a data set is grouped are also common computing goals, like finding the 5 top-selling items for each month and getting the customers whose total order amounts rank in top 3 for each year. Now let’s look at how to handle top N queries on subsets.

3.1 Get the maximum value in each group

Example 7 Based on the score table, get the highest math score in each class. Below is part of the source table:

CLASS

STUDENTID

SUBJECT

SCORE

1

1

English

95

1

1

Math

90

1

1

PE

80

1

2

English

75

1

2

Math

84

SPL script


A

B

1

=file("Score.txt").import@t()

/ Import Score table

2

=A1.select(Subject:"Math")

/ Get records of math subject

3

=A2.groups(Class; max(Score):BestScore)

/ Group the selected records by   classes and use max() function to get the highest math score in each class

 

3.2 Get top N values/records in each group

A top N query can be regarded as a kind of aggregate operation on each subset after a data set is grouped. Here we look at how to get top N values and records respectively.

Example 8 Query the highest two math scores in each class. Below is part of the score table:

CLASS

STUDENTID

SUBJECT

SCORE

1

1

English

95

1

1

Math

90

1

1

PE

80

1

2

English

75

1

2

Math

84

SPL script


A

B

1

=file("Score.txt").import@t()

/ Import Score table

2

=A1.select(Subject:"Math")

/ Get records of math subject

3

=A2.group(Class; ~.top(-2,   Score):top2)

/ Group the selected records by   classes and use A.top() function to get the highest two math scores in each   class

4

=A3.new(Class, top2(1):First,  top2(2):Second)

/ Create the result set where column 1   contains classes, column 2 holds the highest score and column 3 has the   second highest score

 

Example 9 For each subject in each class, query the information of students whose scores rank in top 3. Below is part of the score table:

CLASS

STUDENTID

SUBJECT

SCORE

1

1

English

95

1

1

Math

90

1

1

PE

80

1

2

English

75

1

2

Math

84

SPL script


A

B

1

=file("Score.txt").import@t()

/ Import Score table

2

=A1.group(Class,Subject;~.top(-3;Score):top3)

/ Group records by classes and   subjects and get the highest two scores in each group

3

=A2.conj(top3)

/ Concatenate records with the highest   2 scores from all groups

 

3.3 Get top N values/records cumulatively

etting top N values or records in a cumulative way won’t generate the subsets after the data set is grouped. This method is used to handle scenarios when data volume is huge. There are same two scenarios here – top N values and top N records.

Example 10 Get the hire dates of the two employees in each department who have the longest hire durations. Below is part of the employee table:

EID

NAME

DEPT

EntryDate

1

Rebecca

R&D

2005/03/11

2

Ashley

Finance

2008/03/16

3

Rachel

Sales

2010/12/01

4

Emily

HR

2006/08/15

5

Ryan

R&D

2004/07/30

SPL script


A

B

1

=file("Employee.txt").cursor@t()

/ Create cursor of the Employee table

2

=A1.groups(Department;  top(2,EntryDate):Top2)

/ Group the cursor by departments and   get the earliest two hire dates in each group

3

=A2.news(Top2;Department, ~:EntryDate)

/ Create a new table where column 1 is   Department and column 2 is EntryDate

 

Example 11 Get the information of employees whose salaries rank in top 3 in each department. Below is part of the employee table:

EID

NAME

DEPT

SALARY

1

Rebecca

R&D

7000

2

Ashley

Finance

11000

3

Rachel

Sales

9000

4

Emily

HR

7000

5

Ryan

R&D

13000

SPL script


A

B

1

=file("Employee.txt").cursor@t()

/ Create cursor of the Employee table

2

=A1.groups(Department;   top(-3;Salary):Top3)

/ Group the cursor by departments and   get the records having a Salary value that ranks in top 3 in each group

3

=A2.conj(Top3)

/ Concatenate eligible records from   all groups

 

Find more examples in SPL CookBook.