SPL: order-related grouping

 

Sometimes the order of the data makes sense when grouping. We at times group the adjacent records that have the same field values or that meet certain conditions. For example, find out the nation that ranks in the first of consecutive Olympic gold medals, find out how many days at most that the closing price of a stock has been increased, and so on. This is where order-related grouping comes in.

 

1.  Grouping by consecutive the same values

When grouping an ordered set, a new group will be created when the values of the fields for grouping change.

 

[e.g. 1] According to the table of the Olympic medal tally, find out the nation with the most consecutive first places and its medal information. Some of the data are as follows:

Game

Nation

Gold

Silver

Copper

30

USA

46

29

29

30

China

38

27

23

30

UK

29

17

19

30

Russia

24

26

32

30

Korea

13

8

7

 

The option @o of A.group() function in SPL enables to create a new group when field values change.

The SPL script looks like this:


A

1

=T("Olympic.txt")

2

=A1.sort@z(GAME,GOLD,SILVER,COPPER)

3

=A2.group@o1(GAME)

4

=A3.group@o(NATION)

5

=A4.maxp(~.len())

A1: import the Olympic medal table.

A2: sort the Olympic Games and the number of medals (gold, silver, bronze) in descending order.

A3: select one of every Olympic Game, because order is the first one of each game.

A4: create new groups when nations change.

A5: select the group with the largest number of members, which is the group with the most consecutive gold medals.

 

2. Grouping by adjacent conditions

When an ordered set is grouped, a new group will be created when evaluation result of the grouping condition is true.

 

[e.g. 2] How many days at most are the closing prices of the Shanghai Composite Index in 2020 consecutively rise? (the rising of the first trading day index). Some of the data are as follows:

DATE

CLOSE

OPEN

VOLUME

AMOUNT

2020/01/02

3085.1976

3066.3357

292470208

3.27197122606E11

2020/01/03

3083.7858

3089.022

261496667

2.89991708382E11

2020/01/06

3083.4083

3070.9088

312575842

3.31182549906E11

2020/01/07

3104.8015

3085.4882

276583111

2.88159227657E11

2020/01/08

3066.8925

3094.2389

297872553

3.06517394459E11

 

The option @i of the A.group() function in SPL enables to create a new group when conditions change.

The SPL script looks like this:


A

1

=T("SSEC.csv")

2

=A1.select(year(DATE)==2020).sort(DATE)

3

=A2.group@i(CLOSE<CLOSE[-1])

4

=A3.max(~.len())

A1: import the Shanghai Composite Index table.

A2: select the records of 2020 and sort them in ascending order of date.

A3: create a new group when the closing price is less than the closing price of the previous day.

A4: calculate the maximum number of days with consecutive rising.

 

3. Grouping by sequence numbers

Sometimes, we can directly or indirectly get the group number (members should be assigned to which group). In this case, we can directly group by the group number.

 

[e.g. 3] Divide the employee into three groups based on their working years (numbers with a remainder are assigned to certain group), and calculate the average salary of each group. Some of the data are as follows:

ID

NAME

BIRTHDAY

ENTRYDATE

DEPT

SALARY

1

Rebecca

1974/11/20

2005/03/11

R&D

7000

2

Ashley

1980/07/19

2008/03/16

Finance

11000

3

Rachel

1970/12/17

2010/12/01

Sales

9000

4

Emily

1985/03/07

2006/08/15

HR

7000

5

Ashley

1975/05/13

2004/07/30

R&D

16000

 

The option @n of the A.group()function in SPL is used to group by sequence number, and records with the same number are assigned to the same group (number N is assigned to Group N, N starts at 1) .

The SPL script looks like this:


A

1

=T("Employee.csv").sort(ENTRYDATE)

2

=A1.group@n((#-1)*3\A1.len()+ 1)

3

=A2.new(#:GROUP_NO,   ~.avg(SALARY):AVG_SALARY)

A1: import the employee table, and sort them by date of entry.

A2: calculate the number of the group to which they belong by the sorted row number, and group them by the number.

A3: calculate the average salary of each group.