Comparison of SQL & SPL: Non-equi-grouping

The equi-grouping operations divides a data table according to same field values or expression results, while there are certain common grouping operations use special grouping conditions. Grouping customers according to the order of Beijing, Shanghai, Guangzhou and Shenzhen and calculate total sales amount in each group, and divide families into wealthy, well-off, comfortable, and poor according to the yearly income. This essay will explain the solutions and basic principles of SQL and SPL, the two commonly used programming languages, in dealing with this type of grouping scenarios, and find the simpler and more efficient way through sample programs. Looking Comparison of SQL & SPL: Non-equi-grouping for details.

A grouping operation is called equi-grouping if the grouping condition is same values or expression results. Equi-grouping operations are the most seen and they divide a data set completely. Their features are as follows:

(1)    No empty subset in the grouping result set;

(2)    Each member in the original set belongs to one and only one subset.

Does there exist the non-equi grouping and incomplete division? The answer is yes. Their features are as follows:

(1)    The grouping result set contains at least one empty subset;

(2)    The grouping result set does not contain all members of the original set;

(3)    One member may belong to more than one subset.

Ⅰ. Alignment grouping

An alignment grouping compares a specific field or expression of each member in the to-be-grouped set with every member in a predefined base set and puts members matching same member in the base set in the same group. The number of groups in the result set is the same as that of members in the base set. The alignment grouping may generate empty group or leaves one or more members in outside of any group.

Example 1According to the Course table and Select_Course table below, find the unselected courses according to the order in Course table.

COURSE

 ID NAME TEACHERID 1 Environmental protection and sustainable development 5 2 Mental health of College Students 1 3 Matlab 8 4 Electromechanical basic practice 7 5 Introduction to modern life science 3 … … …

SELECT_COURSE

 ID COURSEID STUDENTID 1 6 59 2 6 43 3 5 52 4 5 44 5 5 37 … … …

SQL solution:

SQL does not support the alignment grouping. We get the distinct selected courses from the Select_Course table and then select records that do not contain these selected courses. Below are the SQL queries:

select *

from COURSE

where

ID not in

(select DISTINCT COURSEID

from SELECT_COURSE)

order by rownum

SPL solution:

SPL offers A.align() function to perform alignment grouping.

 A 1 =T("Course.csv") 2 =T("SelectCourse.csv") 3 =A2.align(A1:ID, COURSEID) 4 =A1(A3.pos@a(null))

A1: Import Course table.

A2: Import SelectCourse table.

A3: Use A.align() function to group SelectCourse table according to Course table’s ID field.

A4: Get the courses that no students select from Course table.

Example 2According to DEPARTMENT table and EMPLOYEE table below, calculate the number of employees in each department according to the order of DEPARTMENT table. Below is part of the source tables:

DEPARTMENT

 DEPT MANAGER Administration 1 Finance 4 HR 5 Marketing 6 Production 7 … …

EMPLOYEE

 ID NAME BIRTHDAY STATE DEPT SALARY 1 Rebecca 1974/11/20 California R&D 7000 2 Ashley 1980/07/19 New   York Finance 11000 3 Rachel 1970/12/17 New   Mexico Sales 9000 4 Emily 1985/03/07 Texas HR 7000 5 Ashley 1975/05/13 Texas R&D 16000 … … … … … …

SQL solution:

For the alignment grouping operation, each group will contain all matching members in the to-be-grouped set. Since SQL does not directly support this type of grouping, we group EMPLOYEE table by department and count employees in each department, perform a left join between DEPARTMENT table and the result set of grouping and summarizing EMPLOYEE table, and query each department to get the number of employees. Note that the SQL JOIN operation between two result sets does not ensure that records are arranged in a specific order, so we need to sort them by row numbers in DEPARTMENT table; otherwise, the query result may not be ordered in the same way as DEPARTMENT table. Below are the SQL queries:

select

t1.DEPT, t2.COUNT

from

(select DEPT,rownum ordernum from DEPARTMENT) t1

left join

(select

DEPT, count(*) COUNT

from EMPLOYEE

group by DEPT) t2

on t1.DEPT=t2.DEPT

order by ordernum

SPL solution:

SPL offers A.align() function to handle the alignment grouping. It works with @a option to return all matching members for each group.

 A 1 =T("Department.csv") 2 =T("Employee.csv") 3 =A2.align@a(A1:DEPT, DEPT) 4 =A3.new(DEPT, ~.len():COUNT)

A1: Import Department table.

A2: Import Employee table.

A3: Group Employee table according to the order in Department table and return all matching members for each group.

A4: Count employees in each department.

Example 3According to Employee table (Below is a part of it), calculate the average salary for each state of [California, Texas, New York, Florida] and for all the other states as a whole, which are classified as “Other”.

 ID NAME BIRTHDAY STATE DEPT SALARY 1 Rebecca 1974/11/20 California R&D 7000 2 Ashley 1980/07/19 New   York Finance 11000 3 Rachel 1970/12/17 New   Mexico Sales 9000 4 Emily 1985/03/07 Texas HR 7000 5 Ashley 1975/05/13 Texas R&D 16000 … … … … … …

SQL solution:

In this case, the non-matching members are required to put into a new group. SQL does not have a method to do this. First, we define a set of the four specified state groups and the “Other” group, and then check which group each employee record belongs to using CASE WHEN statement. Below are SQL queries:

with cte1(ID,STATE) as

(select 1,'California' from DUAL

UNION ALL select 2,'Texas' from DUAL

UNION ALL select 3,'New York' from DUAL

UNION ALL select 4,'Florida' from DUAL

UNION ALL select 5,'OTHER' from DUAL)

select

t1.STATE, t2.AVG_SALARY

from cte1 t1

left join

(select

STATE,avg(SALARY) AVG_SALARY

from

( select

CASE WHEN

STATE IN ('California','Texas','New York','Florida')

THEN STATE

ELSE 'OTHER' END STATE,

SALARY

from EMPLOYEE)

group by STATE) t2

on t1.STATE=t2.STATE

order by t1.ID

SPL solution:

SPL has A.align() function to do this. It uses @n option to put the non-matching members into a new group.

 A 1 =T("Employee.csv") 2 [California,Texas,New York,Florida] 3 =A1.align@an(A2,STATE) 4 =A3.new(if   (#>A2.len(),"OTHER",STATE):STATE,~.avg(SALARY):AVG_SALARY)

A1: Import Employee table.

A2: Define a constant set of regions.

A3: A.align() groups Employee table according to A2’s set of states. @a option is used to return all matching members for each group; @n option enables putting the non-matching members into a new group.

A4: Name the new group “OTHER” and calculate the average salary of each group.

SQL uses JOION to achieve an alignment grouping because it does not have a method to do it directly. As a SQL result set is unordered, we record the order of the records in the original table through their original row numbers. This makes SQL solution to an alignment grouping problem complicated. SPL offers A.align() function for specifically handling alignment grouping operations, which features concise syntax and efficient execution.

Ⅱ. Enumeration grouping

An enumeration grouping prespecifies a set of enumerated conditions, computes each conditional expression using every member of the to-be-grouped set as the parameter, and puts members that make the same condition true into the corresponding subset. The subsets in the result set correspond to the prespecified conditions one by one.

Example 4According to the following China urban population table, divide cities into groups and count the number in each group. Megacity: above 2 million population; Large city: between 1 to 2 million population; Major city: between 0.5 to 1 million population; Other cities. Below is part of the source table:

 ID CITY POPULATION PROVINCE 1 Shanghai 12286274 Shanghai 2 Beijing 9931140 Beijing 3 Chongqing 7421420 Chongqing 4 Guangzhou 7240465 Guangdong 5 HongKong 7010000 HongKong Special Administrative Region … … … …

SQL solution:

SQL lacks a direct method of handling enumeration grouping operations. We segment the source table using CASE WHEN statement, then group data by comparing it with the segmentation values and perform aggregation. CASE WHEN statement put each record into the first matching grouping only. Below are SQL queries:

with cte1(ID,CITY_TYPE) as

(select 1,'Megacity' from DUAL

UNION ALL select 2,'Large city' from DUAL

UNION ALL select 3,'Major city' from DUAL

UNION ALL select 4,'Other cities' from DUAL)

select

t1.CITY_TYPE, nvl(t2.COUNT,0) COUNT

from cte1 t1

left join

(select CITY_TYPE, count(*) COUNT from

(select

case when POPULATION>2000000 then 'Megacity'

when POPULATION>1000000 then 'Large city'

when POPULATION>500000 then 'Major city'

when POPULATION<=500000 then 'Other cities'

else 'OTHER' end CITY_TYPE

from URBAN_POPULATION)

group by CITY_TYPE) t2

on t1.CITY_TYPE=t2.CITY_TYPE

order by t1.ID

SPL solution:

SPL provides A.enum() function to perform grouping according to the enumerated conditions.

 A 1 =T("UrbanPopulation.txt") 2 [?>2000000,?>1000000,?>500000,?<=500000] 3 =A1.enum(A2,POPULATION) 4 [Megacity, Large city, Major city, Other cities] 5 =A3.new(A4(#):CITY_TYPE,~.count():COUNT)

A1: Import UrbanPopulation table.

A2: Define grouping conditions.

A3: A.enum() function performs enumeration grouping on A1’s table according to A1’s conditions and, by default, puts each record in the first matching grouping only.

A4: Define a name for each group.

A5: Count cities in each group according to the enumeration grouping result.

Example 5Divide the following employee table into three groups according to age groups <35, <45 and other ages. Below is part of the source table:

 ID NAME BIRTHDAY STATE DEPT SALARY 1 Rebecca 1974/11/20 California R&D 7000 2 Ashley 1980/07/19 New   York Finance 11000 3 Rachel 1970/12/17 New   Mexico Sales 9000 4 Emily 1985/03/07 Texas HR 7000 5 Ashley 1975/05/13 Texas R&D 16000 … … … … … …

SQL solution:

It is required, in this case, that the non-matching records be put into a new group. We use CASE WHEN … ELSE statement to define an “Others” group to achieve this. Below are SQL queries:

with cte1(ID,AGE_AREA) as

(select 1,'Below 35' from DUAL

UNION ALL select 2,'Below 45' from DUAL

UNION ALL select 3,'Others' from DUAL)

select

t1.AGE_AREA, nvl(t2.AVG_SALARY,0) AVG_SALARY

from cte1 t1

left join

(select

AGE_AREA, avg(SALARY) AVG_SALARY

from

(select

case when TRUNC(months_between(sysdate, BIRTHDAY)/12)<35

then 'Below 35'

when TRUNC(months_between(sysdate, BIRTHDAY)/12)<45

then 'Below 45'

else 'Others' end AGE_AREA,

SALARY

from EMPLOYEE)

group by AGE_AREA) t2

on t1.AGE_AREA=t2.AGE_AREA

order by t1.ID

SPL solution:

A.enum() function uses @n option to put the non-matching records into a new group.

 A 1 =T("Employee.csv") 2 [?<35,?<45] 3 =A1.enum@n(A2, age(BIRTHDAY)) 4 [Below 35,Below 45, Others] 5 =A3.new(A4(#):AGE_AREA,~.avg(SALARY):AVG_SALARY)

A1: Import Employee table.

A2: Define grouping conditions.

A3: A.enum() function performs enumeration grouping by age groups and works with @n option to put non-matching records into a new group.

A4: Define a name for each group.

A5: Calculate the average salary in each group according to A3’s result.

Example 6According to the following cities’ GDB table, calculate per capita GDP for the direct-controlled municipalities, first-tier cities and second-tier cities. Groups may contain common members as Beijing is both a direct-controlled city and a first-tier city. Below is part of the source table:

 ID CITY GDP POPULATION 1 Shanghai 32679 2418 2 Beijing 30320 2171 3 Shenzhen 24691 1253 4 Guangzhou 23000 1450 5 Chongqing 20363 3372 … … … …

SQL solution:

In this case, one record could be put into more than one group. CASE WHEN statement can only put a record into the first matching group and does not support re-putting it to another group, so we need to calculate per capita GDP for each group and then concatenate the results. Below are SQL queries:

select

'Direct-controlled municipality' AREA,

SUM(GDP*1000)/SUM(POPULATION) CAPITA_GDP

from CITY_GDP

where

CITY in ('Beijing','Shanghai','Tianjin','Chongqing')

UNION ALL

select

'First-tier city' AREA,

SUM(GDP*1000)/SUM(POPULATION) CAPITA_GDP

from CITY_GDP

where

CITY in ('Beijing','Shanghai','Guangzhou','Shenzhen')

UNION ALL

select

'Second-tier city' AREA,

SUM(GDP*1000)/SUM(POPULATION) CAPITA_GDP

from CITY_GDP

where

CITY in ('Chengdu','Hangzhou','Chongqing','Wuhan','Xian','Suzhou','Tianjin','Nanjing','Changsha','Zhengzhou','Dongguan','Qingdao','Shenyang','Ningbo','Kunming')

In the SQL query, groups are respectively queried and summarized, and their results are concatenated one by one. Each group requires a piece of SQL code. When there are many groups, it’s complicated to write and maintain the SQL query.

SPL solution:

A.enum() function uses @r function to check whether all members in a group match a condition.

 A 1 =T("CityGDP.txt") 2 [["Beijing","Shanghai","Tianjin","Chongqing"].pos(?)>0,["Beijing","Shanghai","Guangzhou","Shenzhen"].pos(?)>0,["Chengdu","Hangzhou","Chongqing","Wuhan","Xian","Suzhou","Tianjin","Nanjing","Changsha","Zhengzhou","Dongguan","Qingdao","Shenyang","Ningbo","Kunming"].pos(?)>0] 3 =A1.enum@r(A2,CITY) 4 [direct-controlled municipality, first-tier city, second-tier city] 5 =A3.new(A4(#):AREA,~.sum(GDP)/~.sum(POPULATION)*10000:CAPITA_GDP)

A1: Import CityGDP table.

A2: Enumerate conditions for direct-controlled city, first-tier city and second-tier ciy.

A3: A.enum() groups A1’s table according to the enumerated conditions and works with @r option to check whether all members in a group match a condition.

A4: Define a name for each group.

A5: Calculate per capita GDP in each group according to A3’s result.

SPL needs just one option to handle the re-grouping of a member in an enumeration grouping operation. No matter how many groups there are, users only need to maintain the enumerated conditions, instead of modifying the code for grouping & aggregation.

Summary

SQL is not fit to handle both alignment grouping and enumeration grouping. There are two causes for this. One is SQL does not provide a special statement or function to handle them, the other is that SQL’s theoretical foundation, the relational algebra, is based on unordered sets, which makes it hard to handle computations that involve the fixed order. SPL, however, is based on ordered sets, and is thus naturally good at handling order-based computations. Besides, it offers functions to specifically deal with the two types of grouping operations.

When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.

The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of grouping functions, which makes the marriage of both Java and SQL advantages. A grouping operation in SPL will thus become simple and easy.