"When grouping, we often need to assign records that meet the same criterion to the same group. F .."

pjf RaqForum 22 No.
1 Reply • 474 View • 3 Years ago

SPL: grouping by enumeration criteria

When grouping, we often need to assign records that meet the same criterion to the same group. For example, divide cities into large, medium and small ones according to the urban population, sort families into rich, middle class, well-off, and poor ones according to annual income, and so forth. According to the pre-defined enumeration criteria, the qualified members are assigned to the corresponding group, which is referred to as the enumeration grouping.

1. Saving each member in the first matching group

The members of the set to be grouped are taken as parameters to calculate the criteria according to the order of the enumeration criterion set defined in advance.

[e.g. 1] According to the city population of Chinese cities, classify and calculate the cities. Cities with more than 2 million people are megacity behemoths, cities with 1-2 million people are megacities, cities with 0.5-1 million people are big cities, and others are medium-sized and small cities. Some of the data are as follows:

ID	CITY	POPULATION	PROVINCE
1	Shanghai	12286274	Shanghai
2	Beijing	9931140	Beijing
3	Chongqing	7421420	Chongqing
4	Guangzhou	7240465	Guangdong
5	Hong Kong	7010000	Hong Kong Special Administrative Region
…	…	…	…

The A.enum() function in SPL is used for enumeration grouping, and by default each member is saved only in the first matching group.

The SPL script looks like this:

	A
1	=T("UrbanPopulation.txt")
2	[?>2000000,?>1000000,?>500000,?<=500000]
3	=A1.enum(A2,POPULATION)
4	[megacity behemoths, megacities, big cities, medium-sized and small cities]
5	=A3.new(A4(#):CITY_TYPE,~.count():COUNT)

A1: import the city population table.

A2: define grouping criteria.

A3: use the A.enum() function to enumerate and group city population tables by criteria, and by default save each member in the first matching group.

A4: define the name of each group.

A5: calculate the number of cities in each group based on the results of enumerated grouping.

2. Saving mismatching members in a new group

When grouping by enumeration criteria, there may be members that do not match any of the criteria, and we can save them in a new group. This applies to scenarios where we care not only about the information of matching members, but also about other mismatching members.

[e.g. 2] To calculate the average salaries by group, the employees are divided into three groups by age: those under 35, those under 45, and others. The data of the employee table are as follows:

ID	NAME	BIRTHDAY	STATE	DEPT	SALARY
1	Rebecca	1974/11/20	California	R&D	7000
2	Ashley	1980/07/19	New York	Finance	11000
3	Rachel	1970/12/17	New Mexico	Sales	9000
4	Emily	1985/03/07	Texas	HR	7000
5	Ashley	1975/05/13	Texas	R&D	16000
…	…	…	…	…	…

option @n of the A.enum() function in SPL is used to save mismatching members to a new group when grouping by enumeration.

The SPL script is as follows:

	A
1	=T("Employee.csv")
2	[?<35,?<45]
3	=A1.enum@n(A2, age(BIRTHDAY))
4	[under 35, under 45, other]
5	=A3.new(A4(#):AGE_AREA,~.avg(SALARY):AVG_SALARY)

A1: import the employee table.

A2: define grouping criteria.

A3: use the A.enum() function to enumerate groups by age, use the option @n to save the mismatching members to a new group.

A4: define the name of each group.

A5: calculate the average salary per employee based on the results of enumerated grouping.

3. Overlapped grouping according to enumeration criteria

Sometimes we need to redistribute members that meet multiple grouping criteria to all corresponding groups.

[e.g.3] According to the city GDP table, calculate the per capita GDP of the municipalities, the first-tier cities and the second-tier cities respectively. It should be noted that the subgroups may have overlapped members, for example, Beijing is both a first-tier city and a municipality. Some of the data are as follows:

ID	CITY	GDP	POPULATION
1	Shanghai	32679	2418
2	Beijing	30320	2171
3	Shenzhen	24691	1253
4	Guangzhou	23000	1450
5	Chongqing	20363	3372
…	…	…	…

The option @r of the A.enum() function in SPL is used to query whether all the members match or not for each group when enumerating the groups.

The SPL script looks like this:

	A
1	=T("CityGDP.txt")
2	[["Beijing","Shanghai","Tianjin","Chongqing"],["Beijing","Shanghai","Guangzhou","Shenzhen"],["Chengdu","Hangzhou","Chongqing","Wuhan","Xian","Suzhou","Tianjin","Nanjing","Changsha","Zhengzhou","Dongguan","Qingdao","Shenyang","Ningbo","Kunming"]]
3	[A2(1).pos(?)>0,A2(2).pos(?)>0,A2(3).pos(?)>0]
4	=A1.enum@r(A3,CITY)
5	[municipalities, first-tier cities, second-tier cities]
6	=A4.new(A5(#):AREA,~.sum(GDP)/~.sum(POPULATION)*10000:CAPITA_GDP)