SPL: grouping & aggregation

 

Grouping operation is to assign the members of a set with the same attributes to the same group. For example, employee tables are grouped by department, and each group of employees has the same department; sales tables can be grouped by sales year, and each group has sales records of the same year, and so on.

Sometimes we need to calculate the data by different groups, and it is when grouping & aggregation is needed. Grouping & aggregation means grouping the data according to certain regulations, and then performing aggregation operation for each group.

 

[e.g. 1] Query the total sales of each month in 2014 according to the sales table. Some of the data are as follows:

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

10400

EASTC

2014/01/01

1

27

3063.0

10401

HANAR

2014/01/01

1

17

3868.6

10402

ERNSH

2014/01/02

8

70

2713.5

10403

ERNSH

2014/01/03

4

42

1005.9

10404

MAGAA

2014/01/03

2

74

1675.0

The A.groups() function is provided in SPL for grouping & aggregation.

The SPL script looks like this:

A

1

=T("Sales.csv").select(year(ORDERDATE)==2014)

2

=A1.groups(month(ORDERDATE):MONTH; sum(AMOUNT):AMOUNT)

A1: import sales table and select the records of 2014.

A2: use the A.groups() function to group data by month and aggregate the total sales of each month.

 

[e.g. 2] Query the departments with an average age of less than 45 in New York State according to the employee table. Some of the data are as follows:

ID

NAME

BIRTHDAY

STATE

DEPT

SALARY

1

Rebecca

1974/11/20

California

R&D

7000

2

Ashley

1980/07/19

New York

Finance

11000

3

Rachel

1970/12/17

New Mexico

Sales

9000

4

Emily

1985/03/07

Texas

HR

7000

5

Ashley

1975/05/13

Texas

R&D

16000

 The SPL script looks like this:

A

1

=T("Employee.csv").select(STATE=="New York")

2

=A1.groups(DEPT; avg(age(BIRTHDAY)):AVGAGE)

3

=A2.select(AVGAGE<45)

A1: import the employee table and select the records of New York State.

A2: use the A.groups() function to calculate the average age by department.

A3: select the records with an average age of less than 45 in the result set after grouping & aggregation.

In SPL, the A.select() function can be used to select data both before and after grouping.

 

In SPL, the A.groups()function not only supports aggregation functions supported by SQL, such as sum, count, avg, max, min,etc., but also supports some of the more common ones that are not supported by the Group BY statement of SQL: top (first Ns/last Ns) , iterate (iteration function) , icount (count distinct value) , median (median, some database support) , and so on.

 

[e.g. 3] According to the sales table, summarize the times that customers are ranked as the top three sales of each month in 2014. Some of the data are as follows:

ID

CUSTOMERID

ORDERDATE

SELLERID

PRODUCTID

AMOUNT

10400

EASTC

2014/01/01

1

27

3063.0

10401

HANAR

2014/01/01

1

17

3868.6

10402

ERNSH

2014/01/02

8

70

2713.5

10403

ERNSH

2014/01/03

4

42

1005.9

10404

MAGAA

2014/01/03

2

74

1675.0

The SPL script looks like this:

A

1

=T("Sales.csv").select(year(ORDERDATE)==2014)

2

=A1.groups(month(ORDERDATE):MONTH; top(-3;AMOUNT):TOP3)

3

=A2.conj(TOP3).groups(CUSTOMERID; count(~):COUNT)

A1: import the sales table and select the records of 2014.

A2: use the A.groups() function to aggregate the top three by month.

A3: after concatenating the top three records of each month, use the A.groups() function to count the times for each customer.

 

[e.g. 4] Summarize the number of students who get the median scores or fail the exam (less than 60 marks) of each class in every subject. Some of the data are as follows:

CLASS

STUDENTID

SUBJECT

SCORE

1

1

English

84

1

1

Math

77

1

1

PE

69

1

2

English

81

1

2

Math

80

The solution of is:

A

1

=T("Scores.csv")

2

=A1.groups(CLASS,SUBJECT; median(,SCORE):MEDIAN_SCORE, count(SCORE<60):FAIL_COUNT)

A1: import the score table.

A2: use the A.groups() function to aggregate the number of median and failed students by class and subject.

 

In SPL, the count() function can count the number of failed students directly by using the parameter “SCORE<60” without selecting the failed records in advance. This makes it easy to calculate both the median and the number.