SPL: grouping by enumeration criteria
When grouping, we often need to assign records that meet the same criterion to the same group. For example, divide cities into large, medium and small ones according to the urban population, sort families into rich, middle class, well-off, and poor ones according to annual income, and so forth. According to the pre-defined enumeration criteria, the qualified members are assigned to the corresponding group, which is referred to as the enumeration grouping.
1. Saving each member in the first matching group
The members of the set to be grouped are taken as parameters to calculate the criteria according to the order of the enumeration criterion set defined in advance.
[e.g. 1] According to the city population of Chinese cities, classify and calculate the cities. Cities with more than 2 million people are megacity behemoths, cities with 1-2 million people are megacities, cities with 0.5-1 million people are big cities, and others are medium-sized and small cities. Some of the data are as follows:
ID |
CITY |
POPULATION |
PROVINCE |
1 |
Shanghai |
12286274 |
Shanghai |
2 |
Beijing |
9931140 |
Beijing |
3 |
Chongqing |
7421420 |
Chongqing |
4 |
Guangzhou |
7240465 |
Guangdong |
5 |
Hong Kong |
7010000 |
Hong Kong Special Administrative Region |
… |
… |
… |
… |
The A.enum() function in SPL is used for enumeration grouping, and by default each member is saved only in the first matching group.
The SPL script looks like this:
A |
|
1 |
=T("UrbanPopulation.txt") |
2 |
[?>2000000,?>1000000,?>500000,?<=500000] |
3 |
=A1.enum(A2,POPULATION) |
4 |
[megacity behemoths, megacities, big cities, medium-sized and small cities] |
5 |
=A3.new(A4(#):CITY_TYPE,~.count():COUNT) |
A1: import the city population table.
A2: define grouping criteria.
A3: use the A.enum() function to enumerate and group city population tables by criteria, and by default save each member in the first matching group.
A4: define the name of each group.
A5: calculate the number of cities in each group based on the results of enumerated grouping.
2. Saving mismatching members in a new group
When grouping by enumeration criteria, there may be members that do not match any of the criteria, and we can save them in a new group. This applies to scenarios where we care not only about the information of matching members, but also about other mismatching members.
[e.g. 2] To calculate the average salaries by group, the employees are divided into three groups by age: those under 35, those under 45, and others. The data of the employee table are as follows:
ID |
NAME |
BIRTHDAY |
STATE |
DEPT |
SALARY |
1 |
Rebecca |
1974/11/20 |
California |
R&D |
7000 |
2 |
Ashley |
1980/07/19 |
New York |
Finance |
11000 |
3 |
Rachel |
1970/12/17 |
New Mexico |
Sales |
9000 |
4 |
Emily |
1985/03/07 |
Texas |
HR |
7000 |
5 |
Ashley |
1975/05/13 |
Texas |
R&D |
16000 |
… |
… |
… |
… |
… |
… |
option @n of the A.enum() function in SPL is used to save mismatching members to a new group when grouping by enumeration.
The SPL script is as follows:
A |
|
1 |
=T("Employee.csv") |
2 |
[?<35,?<45] |
3 |
=A1.enum@n(A2, age(BIRTHDAY)) |
4 |
[under 35, under 45, other] |
5 |
=A3.new(A4(#):AGE_AREA,~.avg(SALARY):AVG_SALARY) |
A1: import the employee table.
A2: define grouping criteria.
A3: use the A.enum() function to enumerate groups by age, use the option @n to save the mismatching members to a new group.
A4: define the name of each group.
A5: calculate the average salary per employee based on the results of enumerated grouping.
3. Overlapped grouping according to enumeration criteria
Sometimes we need to redistribute members that meet multiple grouping criteria to all corresponding groups.
[e.g.3] According to the city GDP table, calculate the per capita GDP of the municipalities, the first-tier cities and the second-tier cities respectively. It should be noted that the subgroups may have overlapped members, for example, Beijing is both a first-tier city and a municipality. Some of the data are as follows:
ID |
CITY |
GDP |
POPULATION |
1 |
Shanghai |
32679 |
2418 |
2 |
Beijing |
30320 |
2171 |
3 |
Shenzhen |
24691 |
1253 |
4 |
Guangzhou |
23000 |
1450 |
5 |
Chongqing |
20363 |
3372 |
… |
… |
… |
… |
The option @r of the A.enum() function in SPL is used to query whether all the members match or not for each group when enumerating the groups.
The SPL script looks like this:
A |
|
1 |
=T("CityGDP.txt") |
2 |
[["Beijing","Shanghai","Tianjin","Chongqing"],["Beijing","Shanghai","Guangzhou","Shenzhen"],["Chengdu","Hangzhou","Chongqing","Wuhan","Xian","Suzhou","Tianjin","Nanjing","Changsha","Zhengzhou","Dongguan","Qingdao","Shenyang","Ningbo","Kunming"]] |
3 |
[A2(1).pos(?)>0,A2(2).pos(?)>0,A2(3).pos(?)>0] |
4 |
=A1.enum@r(A3,CITY) |
5 |
[municipalities, first-tier cities, second-tier cities] |
6 |
=A4.new(A5(#):AREA,~.sum(GDP)/~.sum(POPULATION)*10000:CAPITA_GDP) |
A1: import the city GDP table.
A2: define municipalities, first- and second-tier city constant variables.
A3: enumerate the criteria of municipalities, first-tier cities and second-tier cities.
A4: use the A.enum() function to group by enumeration criteria, and use the option @r to query whether all members of each group match or not.
A5: define the name of each group.
A6: calculate the per capita GDP for each group based on the results of enumerated grouping.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL