# Comparison of SQL & SPL: Order-based Grouping

Sometimes the order of members in an ordered data set is important, such as the scenario where each record is compared with its next neighbor and put them into the same group if they have same value for a specific field or if they satisfy the same condition(s). Examples include finding the maximum frequency of consecutive winning of the champion title and the largest number of days when temperature rises continuously in Beijing. They involve the order-based grouping. This essay explains the solutions and basic principles of SQL and SPL, two commonly used programming languages, for handling the order-based grouping scenarios, and offers the simpler and more efficient way through SQL and SPL sample programs. Looking Comparison of SQL & SPL: Order-based Grouping for details.

Example 1Count the largest number of days when SSE Composite Index closing prices rise consecutively in the year 2020 (when the index in the first transaction date rises). Below is part of the source table:

 DATE CLOSE OPEN VOLUME AMOUNT 2020/01/02 3085.1976 3066.3357 292470208 3.27197122606E11 2020/01/03 3083.7858 3089.022 261496667 2.89991708382E11 2020/01/06 3083.4083 3070.9088 312575842 3.31182549906E11 2020/01/07 3104.8015 3085.4882 276583111 2.88159227657E11 2020/01/08 3066.8925 3094.2389 297872553 3.06517394459E11 … … … … …

SQL solution:

It is not a breeze to accomplish. The basic logic is this – divide records ordered by date into multiple groups, during which those having continuously rising indexes are put into one group. That is, if the closing price in the current date rises, put the corresponding record to the group that contains the previous record; and if it decreases, put the record into a new group. Then get the largest number of members among the groups, which is our target.

The grouping task is special because order of records plays a role here. SQL, which supports the equi-grouping only, needs to convert the order-based grouping into an ordinary equi-grouping for further handling. The process is this:

(1) Sort records by date and get the previous closing price for each record using a window function;

(2) Compare the current closing price with the previous one, make the result as 0 if the price rises and as 1 if the price falls;

(3) Cumulatively sum the marks before the current record, the cumulative sum is like 0,0,1,1,1,1,2,2,3,3,3…, which are the expected group numbers;

(4) Now we can perform an ordinary equi-grouping operation in SQL.

SQL queries are as follows:

SELECT

MAX(ContinuousDays) MAX_DAYS

FROM (

SELECT COUNT(*) ContinuousDays

FROM (

SELECT SUM(RisingFlag) OVER (ORDER BY "DATE") NoRisingDays

FROM (

SELECT

"DATE", CASE WHEN CLOSE>

LAG(CLOSE) OVER (ORDER BY "DATE")

THEN 0 ELSE 1 END RisingFlag

FROM SSEC

WHERE EXTRACT (YEAR FROM "DATE")=2020

)

)

GROUP BY NoRisingDays

)

SPL solution:

SPL A.group() function uses @i option to create a new group whenever the condition changes.

 A 1 =T("SSEC.csv") 2 =A1.select(year(DATE)==2020).sort(DATE) 3 =A2.group@i(CLOSE

A1: Import SSEC table.

A2: Get records of the year 2020 and sort them by date in ascending order.

A3: Create a new group when the current closing price is lower than the previous one.

A4: Get the largest number of consecutive rising dates.

SPL is outstandingly concise while SQL is rather roundabout. There are two reasons for this. SQL does not support the order-based grouping while SPL does; SQL cannot retain the post-grouping subsets and thus perform further operations on them while SPL can.

Example 2According to the following Olympic medal table, get the country that holds the champion title for the longest time in terms of the total number of medals and the related medal information. Below is part of the source table:

 Game Nation Gold Silver Copper 30 USA 46 29 29 30 China 38 27 23 30 UK 29 17 19 30 Russia 24 26 32 30 Korea 13 8 7 … … … … …

SQL solution:

It is the post-groping subsets and the order-based grouping that need to be retained and achieved in this case. But SQL does not support both, and it is complicated to accomplish them. Marks are needed to index the group before equi-grouping can be used. We get the champion for each Olympic game according to the medal counts, group records by nation in their original order, get the nation having the most number of records, select the first game the champion title was obtained and the how many games the title was held, and finally, get the corresponding records according to the selected information from the source table. SQL queries are as follows:

with cte1 as

(select GAME, max(NATION) keep (dense_rank first order by 1000000*GOLD+1000*SILVER+COPPER desc) NATION

from OLYMPIC

group by GAME),

cte2 as

(select min(NATION) NATION, min(GAME) GAMESTART, count(*) GAMECOUNT

from

(select GAME, NATION, row_number() over(ORDER by GAME) rowno, ROW_NUMBER()OVER(PARTITION BY NATION ORDER BY GAME) groupno

from cte1)

group by NATION, groupno-rowno),

cte3 as

(select min(NATION) keep (dense_rank first order by GAMECOUNT desc) NATION, min(GAMESTART) keep (dense_rank first order by GAMECOUNT desc) GAMESTART, max(GAMECOUNT) GAMECOUNT

from cte2

select t1.GAME,t1.NATION,t1.GOLD,t1.SILVER,t1.COPPER

from OLYMPIC t1

right join cte3

on t1.nation=cte3.nation and t1.game>=cte3.GAMESTART and t1.game<(cte3.GAMESTART+cte3.GAMECOUNT)

The SQL code is hard to understand. As the code is written in Oracle, the analytic function KEEP is used to get the maximum or minimum value from a subset. If any of the other databases is used, an additional query is needed to get the record containing the maximum or minimum value.

SPL solution:

SPL A.group() function works with @o function to create a new group whenever the field value is changed.

 A 1 =T("Olympic.txt") 2 =A1.sort@z(GAME,1000000*GOLD+1000*SILVER+COPPER) 3 =A2.group@o1(GAME) 4 =A3.group@o(NATION) 5 =A4.maxp(~.len())

A1: Import the Olympic table.

A2: Sort A1’s records by game and total number of medals in descending order.

A3: Get the champion in each game; since the records are already ordered, the first record is what we need.

A4: Create a new group when a new nation appears.

A5: Get the group containing the most members, which is the nation that holds the champion for the longest time.

The SPL script is much simpler than the SQL script because SPL’s grouping operation can keep the post-grouping subsets and perform the order-based grouping operation on the subsets. The logic for solving the problem is clear and SPL just expresses the logic naturally.

In the cases where the sequence numbers (a member should be given to which group) of subsets can be obtained directly or indirectly, we can group the records according to the sequence numbers.

Example 3Divide the following employee table evenly into three groups (put those with a remainder into a new group) and calculate the average salary in each group. Below is part of the source table:

 ID NAME BIRTHDAY ENTRYDATE DEPT SALARY 1 Rebecca 1974/11/20 2005/03/11 R&D 7000 2 Ashley 1980/07/19 2008/03/16 Finance 11000 3 Rachel 1970/12/17 2010/12/01 Sales 9000 4 Emily 1985/03/07 2006/08/15 HR 7000 5 Ashley 1975/05/13 2004/07/30 R&D 16000 … … … … … …

SQL solution:

SQL’s solution is this. Group employee records by the order of hire dates. The first one-third of the records is put into a group, the second one-third is put into the second group, and the last one-third is put into the third group. This is one example of data grouping by sequence numbers, which puts members having same sequence numbers into same group. As SQL does not support the grouping by sequence number, we can first obtain the sequence numbers, perform equi-grouping by them, and then sort groups by them. Below are SQL queries:

with cte1 as(

select count(*) COUNT

from EMPLOYEE

)

select

GROUP_NO, avg(SALARY) AVG_SALARY

from (

select

TRUNC((rn-1)*3/(select COUNT from cte1))+1 GROUP_NO, SALARY

from (

select

SALARY, ROW_NUMBER()OVER(ORDER BY ENTRYDATE) rn

from EMPLOYEE

)

)

group by GROUP_NO

order by GROUP_NO

SPL solution:

SPL A.group()function uses @n option to group records by sequence numbers, putting those having same sequence numbers into same group (records whose sequence numbers are N are put into the Nth group; N starts from 1).

 A 1 =T("Employee.csv").sort(ENTRYDATE) 2 =A1.group@n((#-1)*3\A1.len()+ 1) 3 =A2.new(#:GROUP_NO,~.avg(SALARY):AVG_SALARY)

A1: Import Employee table and sort it by entry date.

A2: Get the group number according to the row numbers of sorted records and then perform grouping by group numbers.

A3: Calculate the average salary in each group.

## Summary

SQL is not fit to handle order-based grouping operations. The cause is that SQL’s theoretical foundation, the relational algebra, is based on unordered sets. Window functions, however, are patches rather than the solution to address the SQL problem. SPL, however, is based on ordered sets, and is thus naturally good at handling order-based computations. Besides, it offers functions to specifically deal with the order-based grouping operations.

When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.

The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of grouping functions, which makes the marriage of both Java and SQL advantages. A grouping operation in SPL will thus become simple and easy.