pjf RaqForum 22 No.
1 Reply • 528 View • 3 Years ago

SPL: order-related grouping

Sometimes the order of the data matters when grouping. We may need to group adjacent records that share the same field values or that meet certain conditions. For example: find the nation that ranked first in the Olympic gold medal table the most times in a row, or find the longest run of days on which a stock's closing price rose. This is where order-related grouping comes in.

1. Grouping by consecutive identical values

When an ordered set is grouped this way, a new group is created each time the value of the grouping field changes.

[e.g. 1] Using the Olympic medal tally table, find the nation that ranked first the most times in a row, together with its medal information. Some of the data are as follows:

Game	Nation	Gold	Silver	Copper
30	USA	46	29	29
30	China	38	27	23
30	UK	29	17	19
30	Russia	24	26	32
30	Korea	13	8	7
…	…	…	…	…

The @o option of the A.group() function in SPL creates a new group whenever the value of the grouping field changes.
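For readers who do not use SPL, here is a minimal Python sketch of the same idea. It is not SPL and not how esProc implements @o; it only mirrors the behaviour described above using the standard library's itertools.groupby. The function name group_adjacent and the sample values are made up for illustration.

```python
from itertools import groupby


def group_adjacent(items, key=lambda x: x):
    """Split items into runs of neighbours that share the same key.

    Unlike ordinary grouping, equal keys that are not adjacent
    end up in different groups.
    """
    return [list(run) for _, run in groupby(items, key=key)]


# Hypothetical sample values, in their given order.
values = ["USA", "USA", "China", "USA"]
runs = group_adjacent(values)

# "USA" appears in two separate runs because the rows are not adjacent.
print(runs)
```

The point to notice is that the input order decides the result: sorting the values first would merge the two "USA" runs into one.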

The SPL script looks like this:

	A
1	=T("Olympic.txt")
2	=A1.sort@z(GAME,GOLD,SILVER,COPPER)
3	=A2.group@o1(GAME)
4	=A3.group@o(NATION)
5	=A4.maxp(~.len())

A1: import the Olympic medal table.

A2: sort by Games and by the number of medals (gold, silver, bronze), all in descending order.

A3: take the first record of each Games. Because the data is sorted, this is the top-ranked nation of each Games.

A4: create new groups when nations change.

A5: select the group with the most members, which is the longest run of consecutive first places.
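The steps A2 to A5 can be sketched in Python as follows. This is an illustration under assumptions, not a translation of the SPL engine: the rows are invented toy data (nations "A" and "B", not real medal results), and longest_winning_streak is a hypothetical helper name.

```python
from itertools import groupby

# Hypothetical toy rows: (game, nation, gold, silver, bronze). Not real results.
rows = [
    (1, "A", 10, 5, 3), (1, "B", 8, 9, 2),
    (2, "B", 12, 4, 4), (2, "A", 11, 6, 1),
    (3, "B", 9, 9, 9), (3, "A", 9, 8, 9),
    (4, "A", 7, 2, 2), (4, "B", 6, 9, 9),
]


def longest_winning_streak(rows):
    # A2: sort by game, gold, silver, bronze, all descending.
    ordered = sorted(rows, key=lambda r: (r[0], r[2], r[3], r[4]), reverse=True)
    # A3: the first row of each game is that game's top-ranked nation.
    winners = [next(group) for _, group in groupby(ordered, key=lambda r: r[0])]
    # A4: start a new group whenever the winning nation changes.
    streaks = [list(group) for _, group in groupby(winners, key=lambda r: r[1])]
    # A5: keep the group with the most members.
    return max(streaks, key=len)


streak = longest_winning_streak(rows)
for row in streak:
    print(row)
```

With these toy rows the winners, from the latest game to the earliest, are A, B, B, A, so the longest run is the two consecutive wins by "B".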

2. Grouping by adjacent conditions

When an ordered set is grouped this way, a new group is created each time the grouping condition evaluates to true.

[e.g. 2] What is the longest run of consecutive days on which the closing price of the Shanghai Composite Index rose in 2020? (The first trading day is counted as a rise.) Some of the data are as follows:

DATE	CLOSE	OPEN	VOLUME	AMOUNT
2020/01/02	3085.1976	3066.3357	292470208	3.27197122606E11
2020/01/03	3083.7858	3089.022	261496667	2.89991708382E11
2020/01/06	3083.4083	3070.9088	312575842	3.31182549906E11
2020/01/07	3104.8015	3085.4882	276583111	2.88159227657E11
2020/01/08	3066.8925	3094.2389	297872553	3.06517394459E11
…	…	…	…	…

The @i option of the A.group() function in SPL creates a new group whenever the condition evaluates to true.

The SPL script looks like this:

	A
1	=T("SSEC.csv")
2	=A1.select(year(DATE)==2020).sort(DATE)
3	=A2.group@i(CLOSE<CLOSE[-1])
4	=A3.max(~.len())

A1: import the Shanghai Composite Index table.

A2: select the records of 2020 and sort them in ascending order of date.

A3: create a new group when the closing price is less than the closing price of the previous day.

A4: calculate the maximum number of consecutive rising days.
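As a rough cross-check of A3 and A4, the following Python sketch applies the same rule to the five closing prices shown in the sample table above. It is only an illustration: group_when is a hypothetical helper written for this post, not an SPL or library function, and the result covers just these five rows, not the whole of 2020.

```python
def group_when(items, starts_new_group):
    """Start a new group whenever starts_new_group(previous, current) is true."""
    groups = []
    for i, current in enumerate(items):
        if i == 0 or starts_new_group(items[i - 1], current):
            groups.append([current])      # open a new group
        else:
            groups[-1].append(current)    # extend the current group
    return groups


# CLOSE values of the five sample rows, already in ascending date order.
closes = [3085.1976, 3083.7858, 3083.4083, 3104.8015, 3066.8925]

# A3: a new group starts when the close is lower than the previous close.
rising_runs = group_when(closes, lambda prev, cur: cur < prev)

# A4: the size of the largest group.
longest_rise = max(len(run) for run in rising_runs)
print(rising_runs, longest_rise)
```

As in A4, the length of a run includes the day that opens it.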

3. Grouping by sequence numbers

Sometimes we can obtain the group number directly or indirectly (that is, which group each member should be assigned to). In this case, we can group by the group number directly.

[e.g. 3] Divide the employees into three groups by length of service (employees left over when the count does not divide evenly are assigned to one of the groups), and calculate the average salary of each group. Some of the data are as follows:

ID	NAME	BIRTHDAY	ENTRYDATE	DEPT	SALARY
1	Rebecca	1974/11/20	2005/03/11	R&D	7000
2	Ashley	1980/07/19	2008/03/16	Finance	11000
3	Rachel	1970/12/17	2010/12/01	Sales	9000
4	Emily	1985/03/07	2006/08/15	HR	7000
5	Ashley	1975/05/13	2004/07/30	R&D	16000
…	…	…	…	…	…

The @n option of the A.group() function in SPL groups by sequence number: records with the same number are assigned to the same group (number N goes to group N, and N starts at 1).
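To make the "number N goes to group N" rule concrete, here is a small Python sketch. It is an assumption-laden illustration, not SPL: group_by_number is a hypothetical helper, the salaries are taken from the five sample rows above, and the group numbers attached to them are invented for the demo rather than derived from ENTRYDATE. Rejecting out-of-range numbers is a choice made in this sketch only.

```python
def group_by_number(items, number_of, group_count):
    """Place each item into group number_of(item); group numbers start at 1."""
    groups = [[] for _ in range(group_count)]
    for item in items:
        n = number_of(item)
        if not 1 <= n <= group_count:
            raise ValueError(f"group number {n} is out of range")
        groups[n - 1].append(item)   # group N is stored at index N - 1
    return groups


# (salary, group number); the group numbers here are hypothetical.
employees = [(7000, 1), (11000, 2), (9000, 3), (7000, 1), (16000, 1)]

groups = group_by_number(employees, number_of=lambda e: e[1], group_count=3)

# Average salary per group (every group is non-empty in this demo).
averages = [sum(s for s, _ in g) / len(g) for g in groups]
print(averages)
```

Because the group position comes straight from the number, no comparison between records is needed, which is what makes this form of grouping cheap when the number is already known.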

The SPL script looks like this: