SPL: recounting after obtaining a grouped subset

 

Grouping operation is to assign the members in a set with the same attributes to the same group. For example, employee tables are grouped by department, and each group of employees has the same department; sales tables can be grouped by year, and each group has sales records of the same year, and so on.

The essence of grouping operation is to divide a set into several subsets according to certain regulations. That is to say, the return values should be a set consisting of sets. For each member set, we call it a grouped subset.

In general, people are more interested in aggregation values of grouped subsets, so grouping operations are often accompanied by further aggregation of subsets. However, there are times when we are more concerned with these grouped subsets than in aggregated values.

 

Let’s take the employee table as an example. Some of the figures are as follows:

ID

NAME

BIRTHDAY

STATE

DEPT

SALARY

1

Rebecca

1974/11/20

California

R&D

7000

2

Ashley

1980/07/19

New York

Finance

11000

3

Rachel

1970/12/17

New Mexico

Sales

9000

4

Emily

1985/03/07

Texas

HR

7000

5

Ashley

1975/05/13

Texas

R&D

16000

 

[e.g. 1] Identify employees with the same date of birth.

The A.group() function is provided in SPL. We can use it to accomplish real grouping, and the result set after grouping is composed of multiple grouped subsets.

The SPL script is as follows:

说明: IMG_256

A

1

=T("Employee.csv")

2

=A1.group(BIRTHDAY)

3

=A2.select(~.len()>1).conj()

A1: import the employee table.

A2: use A.group() function to group them by date of birth.

A3: select the subgroup with more than one member, which is exactly the subset with the same date of birth, and then merge those subsets.

 

Strictly speaking, grouping and aggregating are two separate actions, but they always occur together in SQL, leaving the illusion that both of them must be used at the same time. In fact, this combination is a limitation for grouping operations, or much more operations can be done other than a few aggregation functions in SQL after grouping.

 

[e.g. 2] Search the employees who are younger than the average age of the department.

As described earlier, the A.group()function is used for grouping. We can define the operations that are performed on each subset after the grouping within the A.group() function. It is not limited to SUM, COUNT and other aggregation operations, and we can define some complex operations.

The SPL script looks like this:

说明: IMG_256

A

1

=T("Employee.csv")

2

=A1.group(DEPT; (a=~.avg(age(BIRTHDAY)), ~.select(age(BIRTHDAY)<a)):YOUNG)

3

=A2.conj(YOUNG)

A1: import the employee table.

A2: group by department and select records of below-average age in each subgroup. In the aggregation operation of A.group() function, we can use temporary variables to make the operation easier to understand.

A3: concatenate the selected records.

 

[e.g. 3] Select the states with more than 50 employees and summarize their average salaries by department.

The grouping itself is a complex operation, and the grouped subsets can be used repeatedly to improve the operational efficiency. In SPL, the A.group() function can keep the grouped subsets after grouping, and can perform operations like grouping aggregation and other operations on them again.

 

The SPL script is as follows:

说明: IMG_256

A

1

=T("Employee.csv")

2

=A1.group(STATE).select(~.len()>50)

3

=A2.conj(~.groups(DEPT; avg(SALARY):AVG_SALARY).derive(A2.~.STATE:STATE))

A1: import the employee table.

A2: group by state and select a grouped subsets with more than 50 employees.

A3: again, group the grouped subsets of each state by department and aggregate their average salaries, and then merge the result sets.