Handling Alignment Grouping

 

In a lot of grouping tasks in daily analytic work, the result set is required to be aligned to a specified base set. The more generalized alignment grouping is enumeration grouping. This article illustrates how to handle alignment grouping conveniently and efficiently and provides sample program in esProc. Looking Handling Alignment Grouping for details.

 

An alignment grouping operation compares values of a field or an expression of members of the to-be-grouped set with members of a specified base set, and puts members matching a same member in the base set to same group. The result set will have same number of groups as the number of members in the base set. The alignment grouping operation may produce empty group(s) or leave one or more unmatching members.

1. Basic alignment grouping

1.1 Keep one matching members at most per group

Group records in a table by the order of a certain field and perform summarization by each group.

Example 1 Based on the associated SelectCourse table and Course table, find the unselected courses according to the order in Course table.

undefined

SPL script


A

B

1

=connect("db")

/Connect to database

2

=A1.query("select * from  SelectCourse")

/ Query SelectCourse table

3

=A1.query("select * from   Course")

/Query Course table

4

=A2.align(A3:ID,CourseID)

/ A.align() function groups records of   SelectCourse table by aligning them   to ID field of Course table, and   get one matching member for each group

5

=A3(A4.pos@a(null))

/ Get records of unselected courses   from Course table (whose   corresponding values in the grouping result set are null)

A5’s result:

ID

NAME

TeacherID

1

Environmental  protection and sustainable development

5

10

Music  appreciation

18

 

1.2 Keep all matching members per group

Group records in a table by the order of a certain field and perform aggregate on each group.

Example 2 Based on the associated EMPLOYEE table and DEPARTMENT table, calculate the number of employees in each department according to the order in DEPARTMENT table.

undefined

SPL script


A

B

1

=connect("db")

/ Connect to database

2

=A1.query("select * from  EMPLOYEE")

/ Query EMPLOYEE table

3

=A1.query("select * from  DEPARTMENT")

/ Query DEPARTMENT table

4

=A2.align@a(A3:ID, DEPARTMENT)

/ A.align@a()   function groups records of EMPLOYEE   table by aligning them to ID field of DEPARTMENT   table, and get all matching members for each group; @a option enables   returning all matching members for each group

5

=A4.new(DEPT, ~.count():COUNT)

/ Count employees in each department

A5’s result:

DEPT

COUNT

Admin

4

R&D

29

Sales

187

 

1.3 Create a new group for unmatching members

Group records in a table by the order of a certain field and put unmatching records into a new group.

Example 3 Based on the SALARY table (Below is a part of it), calculate the average salary for the states of [California, Texas, New York, Florida] and for other states as a whole, which are classified as “Other”.

ID

NAME

STATE

SALARY

1

Rebecca

California

7000

2

Ashley

New  York

11000

3

Rachel

New  Mexico

9000

4

Emily

Texas

7000

5

Ashley

Texas

16000

SPL script


A

B

1

=connect("db")

/ Connect to database

2

=A1.query("select * from  SALARY")

/ Query SALARY table

3

[California,Texas,New York,Florida]

/ Define a sequence of states

4

=A2.align@an(A3,STATE)

/ align@an function groups records of SALARY table by the specified states;   @a option enables returning all matching records for each group, and @n   option creates a new group to hold the unmatching records

5

=A4.new(if   (#>A3.len(),"Other",STATE):STATE,~.avg(SALARY):AvgSalary)

/ Calculate the average salary in each   group and generate a new table sequence; change the last group name to “Other”,   otherwise it is the state in the first record of the current group

A5’s result:

STATE

SALARY

California

7700.0

Texas

7592.59

New  York

7677.77

Florida

7145.16

Other

7308.1

 

2. Alignment grouping by sequence numbers

The sequence-number-based alignment grouping operation groups members in a set according to specified sequence numbers. It puts members of same sequence numbers in same group.

2.1 Keep one matching member at most per group

Find the records that are not referenced based on two associated table.

Example 4 Based on the associated Sales table and Customer table, list customers that have no orders in 2014.

undefined

SPL script


A

B

1

=connect("db")

/ Connect to database

2

=A1.query("select * from   Sales")

/ Query Sales table

3

=A1.query("select * from  Customer")

/ Query Customer table

4

=A3.(ID)

/ Get customer IDs from Customer table

5

=A2.align(A4.len(),   A4.pos(CustomerID))

/ A.align(n,y) function groups Sales table by aligning its records to   customer IDs

6

=A3(A5.pos@a(null))

/ Get customer records having no   orders from Customer table (where   the order values are null)

A6’s result:

ID

Name

State

ALFKI

CMA-CGM

Texas

CENTC

Nedlloyd

Florida

 

2.2 Keep all matching records per group

Group records in a table by sequence numbers and perform aggregate on each group.

Example 5 Based on the following orders table (only a part of the data is shown), list the number of orders in each month of the year 2013.

ID

CustomerID

OrderDate

Amount

10248

VINET

2012/07/04

428.0

10249

TOMSP

2012/07/05

1842.0

10250

HANAR

2012/07/08

1523.5

10251

VICTE

2012/07/08

624.95

10252

SUPRD

2012/07/09

3559.5

SPL script


A

B

1

=connect("db")

/ Connect to database

2

=A1.query("select * from Orders   where  year(OrderDate)=2013")

/ Get orders records of 2013

3

=A2.align@a(12,month(OrderDate))

/ A.align@a()   function divides the orders records of 2013 into 12 groups according to the   12 months; @a option enables returning all matching records for each group

4

=A3.new(#:Month,~.count():OrderCount)

/ Count orders in each month

A4’s result:

Month

OrderCount

1

33

2

29

3

30

4

31

5

32

6

30

7

33

8

33

9

37

10

38

11

34

12

48

 

2.3 Put a record in multiple groups

Get a sequence of sequence numbers and divide records by aligning records to it. A record could be put into more than one group during the process.

Example 6 Based on the post records table, group posts by tags and calculate the frequency of each tag. Below is part of the source table:

ID

Title

Author

Label

1

Easy  analysis of Excel

2

Excel,ETL,Import,Export

2

Early  commute: Easy to pivot excel

3

Excel,Pivot,Python

3

Initial  experience of SPL

1

Basics,Introduction

4

Talking  about set and reference

4

Set,Reference,Dispersed,SQL

5

Early  commute: Better weapon than Python

4

Python,Contrast,Install

 

SPL script


A

B

1

=connect("db")

/ Connect to database

2

=A1.query("select * from  PostRecord")

/ Query PostRecord table

3

=A2.conj(Label.split(",")).id()

/ Split each Label value by comma and   concatenate all labels into one sequence and get all unique labels

4

=A2.align@ar(A3.len(),A3.pos(Label.split(",")))

/ align function works with @r option   to put each post record into a corresponding group according to the sequence   number of each of its label in the label list

5

=A4.new(A3(#):Label,~.count():Count).sort@z(Count)

/ Count the posts under each label and   sort the result set in descending order

A5’s result:

Label

Count

SPL

7

SQL

6

Basics

5

 

2.4 Group records by ranges

Divide records in a table into multiple segments according to ranges of values in a specified field, and perform aggregate on each group.

Example 7 Based on the salary table, group records according to salary ranges <8000, 8000 &12000, >12000, and calculate the number of employees in each group. Below is part of the source table:

ID

NAME

BIRTHDAY

SALARY

1

Rebecca

1974-11-20

7000

2

Ashley

1980-07-19

11000

3

Rachel

1970-12-17

9000

4

Emily

1985-03-07

7000

5

Ashley

1975-05-13

16000

SPL script


A

B

1

=connect("db")

/ Connect to database

2

=A1.query("select * from  EMPLOYEE")

/ Query EMPLOYEE table

3

[0,8000,12000]

/ Define salary ranges

4

=A2.align@a(A3.len(),A3.pseg(SALARY))

/ A.pseg(x) function gets the range   for salary of each record

5

=A4.new(A3 (#):SALARY,~.count():COUNT)

/ Count the employees in each group

A5’s result:

SALARY

COUNT

0

308

8000

153

12000

39

 

The following task requires grouping records by specified ranges according to the expression result, and calculates average.

Example 8Based on the employee table, group records by ranges of hire durations, which are <10 years, 10 years & 20 years, and 20 years, and calculate average salary in each group. Below is part of the source table:

ID

NAME

BIRTHDAY

SALARY

1

Rebecca

1974-11-20

7000

2

Ashley

1980-07-19

11000

3

Rachel

1970-12-17

9000

4

Emily

1985-03-07

7000

5

Ashley

1975-05-13

16000

SPL script


A

B

1

=connect("db")

/ Connect to database

2

=A1.query("select * from  EMPLOYEE")

/ Query EMPLOYEE table

3

[0,10,20]

/ Define hire duration ranges

4

=now()

/ Get the current date and time

5

=A2.align@a(A3.len(),A3.pseg(elapse@y(A4,-~),    HIREDATE))

/ A.pseg(x,y) function get the range   where the hire date in each record falls

6

=A5.new(A3(#):EntryYears,~.avg(SALARY):AvgSalary)

/ Calculate the average salary

A6’s result:

EntryYears

AvgSalary

0

6777.78

10

7445.53

20

6928.57

 

3. Enumeration grouping

Enumeration grouping defines a set of enumerated conditions, calculates the conditions using members of the to-be-grouped set as parameters, and puts members making same condition true into one subset. There is a one-to-one relationship between the subsets in the result set and the enumerated conditions. 

3.1 Put each member to the first matching group

Group records in a table according to the enumerated conditional expressions and put each record only in the first matching group.

Example 9 Based on the table recording population information in China’s major cities, group the cities by population. Below is a part of the source table:

ID

City

Population

Province

1

Shanghai

12286274

Shanghai

2

Beijing

9931140

Beijing

3

Chongqing

7421420

Chongqing

4

Guangzhou

7240465

Guangdong

5

Hong  Kong

7010000

Hong  Kong Special Administrative Region

SPL script


A

B

1

=connect("db")

/ Connect to database

2

=A1.query("select * from  UrbanPopulation")

/ Query UrbanPopulation table

3

[?>2000000,?>1000000,?>500000,?<=500000]

/ Define populations ranges: Megacities:   >2 million, Super cities: >1 million & <2 million, Large cities:   >0.5 million & <1 million, and Other cities

4

=A2.enum(A3,Population)

/ A.enum()function groups records in UrbanPopulation table according to the   enumerated conditions defined in A3

A4’s result:

undefined

3.2 Create a new group for unmatching members

Group records in a table according to the specified enumerated conditional expressions, and put unmatching records in a new group.

Example 10 Based on the employee table, group records by age groups: < 35 years and < 45 years (put unmatching ones to a new group), and calculate average salary in each group. Below is part of the source table:

ID

NAME

BIRTHDAY

SALARY

1

Rebecca

1974-11-20

7000

2

Ashley

1980-07-19

11000

3

Rachel

1970-12-17

9000

4

Emily

1985-03-07

7000

5

Ashley

1975-05-13

16000

SPL script


A

B

1

=connect("db")

/ Connect to database

2

=A1.query("select * from  EMPLOYEE")

/ Query EMPLOYEE table

3

[?<35,?<45]

/ Define two age groups: < 35 years   and < 45 years

4

=A2.enum@n(A3, age(BIRTHDAY))

/ A.enum@n()   function groups employee records by the enumerated age groups; @n option   enables creating a new group for unmatching records

5

=A4.new(if (#>A3.len(),  "Other",A3(#)):AGE,~.avg(SALARY):AvgSalary)

/ Set name of the last group as Other   and calculate average salary in each group

A5’s result:

AGE

AvgSalary

?<35

7118.18

?<45

7448.16

Other

7395.06

 

3.3 Put a record in multiple groups

Group records in a table according to enumerated sequences and perform calculations on each group. A record could be put into more than one group during the process.

Example 11 Based on the GDP table, calculate the GDP per capita for direct-controlled municipalities, first-tier cities and second-tier cities respectively. Below is part of the source table:

ID

City

GDP

Population

1

Shanghai

32679

2418

2

Beijing

30320

2171

3

Shenzhen

24691

1253

4

Guangzhou

23000

1450

5

Chongqing

20363

3372

SPL script


A

B

1

=connect("db")

/ Connect to database

2

=A1.query("select * from   GDP")

/ Query GDP table

3

[["Beijing","Shanghai","Tianjin","Chongqing"].pos(?)>0,
["Beijing","Shanghai","Guangzhou","Shenzhen"].pos(?)>0,
["Chengdu","Hangzhou","Chongqing","Wuhan","Xi’an","Suzhou",
"Tianjin","Nanjing","Changsha","Zhengzhou","Dongguan","Qingdao",
"Shenyang","Ningbo","Kunming"].pos(?)>0]

/ Enumerate direct-controlled cities,   first-tier cities and second-tier cities respectively

4

=A2.enum@r(A3,City)

/ A.enum@r()   function records in GDP table   according to the enumerated sequences of cities; @r option allows putting a   record to more than one groups

5

=A4.new(A3(#):Area,~.sum(GDP)/~.sum(Population)*10000:CapitaGDP)

/ Calcualte GDP per capita in each   group

A5’s result:

Area

CapitaGDP

["Beijing","Shanghai","Tianjin","Chongqing"].pos(?)>0

107345.03

["Beijing","Shanghai","Guangzhou","Shenzhen"].pos(?)>0

151796.49

["Chengdu","Hangzhou","Chongqing","Wuhan","Xi’an","Suzhou","Tianjin","Nanjing",
"Changsha","Zhengzhou","Dongguan","Qingdao","Shenyang","Ningbo","Kunming"].pos(?)>0

106040.57

 

Find more examples in SPL CookBook.