SPL: grouping by enumeration criteria

 

When grouping, we often need to assign records that meet the same criterion to the same group. For example, divide cities into large, medium and small ones according to the urban population, sort families into rich, middle class, well-off, and poor ones according to annual income, and so forth. According to the pre-defined enumeration criteria, the qualified members are assigned to the corresponding group, which is referred to as the enumeration grouping.

 

1. Saving each member in the first matching group

The members of the set to be grouped are taken as parameters to calculate the criteria according to the order of the enumeration criterion set defined in advance.

 

[e.g. 1] According to the city population of Chinese cities, classify and calculate the cities. Cities with more than 2 million people are megacity behemoths, cities with 1-2 million people are megacities, cities with 0.5-1 million people are big cities, and others are medium-sized and small cities. Some of the data are as follows:

ID

CITY

POPULATION

PROVINCE

1

Shanghai

12286274

Shanghai

2

Beijing

9931140

Beijing

3

Chongqing

7421420

Chongqing

4

Guangzhou

7240465

Guangdong

5

Hong   Kong

7010000

Hong Kong   Special Administrative Region

The A.enum() function in SPL is used for enumeration grouping, and by default each member is saved only in the first matching group.

The SPL script looks like this:


A

1

=T("UrbanPopulation.txt")

2

[?>2000000,?>1000000,?>500000,?<=500000]

3

=A1.enum(A2,POPULATION)

4

[megacity behemoths, megacities, big cities, medium-sized and small cities]

5

=A3.new(A4(#):CITY_TYPE,~.count():COUNT)

A1: import the city population table.

A2: define grouping criteria.

A3: use the A.enum() function to enumerate and group city population tables by criteria, and by default save each member in the first matching group.

A4: define the name of each group.

A5: calculate the number of cities in each group based on the results of enumerated grouping.

 

2. Saving mismatching members in a new group

When grouping by enumeration criteria, there may be members that do not match any of the criteria, and we can save them in a new group. This applies to scenarios where we care not only about the information of matching members, but also about other mismatching members.

 

[e.g. 2] To calculate the average salaries by group, the employees are divided into three groups by age: those under 35, those under 45, and others. The data of the employee table are as follows:

ID

NAME

BIRTHDAY

STATE

DEPT

SALARY

1

Rebecca

1974/11/20

California

R&D

7000

2

Ashley

1980/07/19

New York

Finance

11000

3

Rachel

1970/12/17

New Mexico

Sales

9000

4

Emily

1985/03/07

Texas

HR

7000

5

Ashley

1975/05/13

Texas

R&D

16000

option @n of the A.enum() function in SPL is used to save mismatching members to a new group when grouping by enumeration.

The SPL script is as follows:


A

1

=T("Employee.csv")

2

[?<35,?<45]

3

=A1.enum@n(A2, age(BIRTHDAY))

4

[under 35, under 45, other]

5

=A3.new(A4(#):AGE_AREA,~.avg(SALARY):AVG_SALARY)

A1: import the employee table.

A2: define grouping criteria.

A3: use the A.enum() function to enumerate groups by age, use the option @n to save the mismatching members to a new group.

A4: define the name of each group.

A5: calculate the average salary per employee based on the results of enumerated grouping.

 

3. Overlapped grouping according to enumeration criteria

Sometimes we need to redistribute members that meet multiple grouping criteria to all corresponding groups.

 

[e.g.3] According to the city GDP table, calculate the per capita GDP of the municipalities, the first-tier cities and the second-tier cities respectively. It should be noted that the subgroups may have overlapped members, for example, Beijing is both a first-tier city and a municipality. Some of the data are as follows:

ID

CITY

GDP

POPULATION

1

Shanghai

32679

2418

2

Beijing

30320

2171

3

Shenzhen

24691

1253

4

Guangzhou

23000

1450

5

Chongqing

20363

3372

The option @r of the A.enum() function in SPL is used to query whether all the members match or not for each group when enumerating the groups.

The SPL script looks like this:


A

1

=T("CityGDP.txt")

2

[["Beijing","Shanghai","Tianjin","Chongqing"],["Beijing","Shanghai","Guangzhou","Shenzhen"],["Chengdu","Hangzhou","Chongqing","Wuhan","Xian","Suzhou","Tianjin","Nanjing","Changsha","Zhengzhou","Dongguan","Qingdao","Shenyang","Ningbo","Kunming"]]

3

[A2(1).pos(?)>0,A2(2).pos(?)>0,A2(3).pos(?)>0]

4

=A1.enum@r(A3,CITY)

5

[municipalities, first-tier cities, second-tier cities]

6

=A4.new(A5(#):AREA,~.sum(GDP)/~.sum(POPULATION)*10000:CAPITA_GDP)

A1: import the city GDP table.

A2: define municipalities, first- and second-tier city constant variables.

A3: enumerate the criteria of municipalities, first-tier cities and second-tier cities.

A4: use the A.enum() function to group by enumeration criteria, and use the option @r to query whether all members of each group match or not.

A5: define the name of each group.

A6: calculate the per capita GDP for each group based on the results of enumerated grouping.