Comparison of SQL & SPL: Order-based Grouping

 

Sometimes the order of members in an ordered data set is important, such as the scenario where each record is compared with its next neighbor and put them into the same group if they have same value for a specific field or if they satisfy the same condition(s). Examples include finding the maximum frequency of consecutive winning of the champion title and the largest number of days when temperature rises continuously in Beijing. They involve the order-based grouping. This essay explains the solutions and basic principles of SQL and SPL, two commonly used programming languages, for handling the order-based grouping scenarios, and offers the simpler and more efficient way through SQL and SPL sample programs. Looking Comparison of SQL & SPL: Order-based Grouping for details.

 

Example 1Count the largest number of days when SSE Composite Index closing prices rise consecutively in the year 2020 (when the index in the first transaction date rises). Below is part of the source table:

DATE

CLOSE

OPEN

VOLUME

AMOUNT

2020/01/02

3085.1976

3066.3357

292470208

3.27197122606E11

2020/01/03

3083.7858

3089.022

261496667

2.89991708382E11

2020/01/06

3083.4083

3070.9088

312575842

3.31182549906E11

2020/01/07

3104.8015

3085.4882

276583111

2.88159227657E11

2020/01/08

3066.8925

3094.2389

297872553

3.06517394459E11

 

SQL solution:

It is not a breeze to accomplish. The basic logic is this – divide records ordered by date into multiple groups, during which those having continuously rising indexes are put into one group. That is, if the closing price in the current date rises, put the corresponding record to the group that contains the previous record; and if it decreases, put the record into a new group. Then get the largest number of members among the groups, which is our target.

The grouping task is special because order of records plays a role here. SQL, which supports the equi-grouping only, needs to convert the order-based grouping into an ordinary equi-grouping for further handling. The process is this:

   (1) Sort records by date and get the previous closing price for each record using a window function;

   (2) Compare the current closing price with the previous one, make the result as 0 if the price rises and as 1 if the price falls;

   (3) Cumulatively sum the marks before the current record, the cumulative sum is like 0,0,1,1,1,1,2,2,3,3,3…, which are the expected group numbers;

   (4) Now we can perform an ordinary equi-grouping operation in SQL.

   SQL queries are as follows:

 

   SELECT

      MAX(ContinuousDays) MAX_DAYS

   FROM (

      SELECT COUNT(*) ContinuousDays

      FROM (

         SELECT SUM(RisingFlag) OVER (ORDER BY "DATE") NoRisingDays

         FROM (

            SELECT

               "DATE", CASE WHEN CLOSE>

               LAG(CLOSE) OVER (ORDER BY "DATE")

               THEN 0 ELSE 1 END RisingFlag

            FROM SSEC

            WHERE EXTRACT (YEAR FROM "DATE")=2020

         )

      )

       GROUP BY NoRisingDays

   )

 

 

SPL solution:

SPL A.group() function uses @i option to create a new group whenever the condition changes.


A

1

=T("SSEC.csv")

2

=A1.select(year(DATE)==2020).sort(DATE)

3

=A2.group@i(CLOSE<CLOSE[-1])

4

=A3.max(~.len())

A1: Import SSEC table.

A2: Get records of the year 2020 and sort them by date in ascending order.

A3: Create a new group when the current closing price is lower than the previous one.

A4: Get the largest number of consecutive rising dates.

 

SPL is outstandingly concise while SQL is rather roundabout. There are two reasons for this. SQL does not support the order-based grouping while SPL does; SQL cannot retain the post-grouping subsets and thus perform further operations on them while SPL can.

 

Example 2According to the following Olympic medal table, get the country that holds the champion title for the longest time in terms of the total number of medals and the related medal information. Below is part of the source table:

Game

Nation

Gold

Silver

Copper

30

USA

46

29

29

30

China

38

27

23

30

UK

29

17

19

30

Russia

24

26

32

30

Korea

13

8

7

 

SQL solution:

It is the post-groping subsets and the order-based grouping that need to be retained and achieved in this case. But SQL does not support both, and it is complicated to accomplish them. Marks are needed to index the group before equi-grouping can be used. We get the champion for each Olympic game according to the medal counts, group records by nation in their original order, get the nation having the most number of records, select the first game the champion title was obtained and the how many games the title was held, and finally, get the corresponding records according to the selected information from the source table. SQL queries are as follows:

 

   with cte1 as

      (select GAME, max(NATION) keep (dense_rank first order by 1000000*GOLD+1000*SILVER+COPPER desc) NATION

      from OLYMPIC

      group by GAME),

   cte2 as

      (select min(NATION) NATION, min(GAME) GAMESTART, count(*) GAMECOUNT

from

(select GAME, NATION, row_number() over(ORDER by GAME) rowno, ROW_NUMBER()OVER(PARTITION BY NATION ORDER BY GAME) groupno

      from cte1)

      group by NATION, groupno-rowno),

   cte3 as

      (select min(NATION) keep (dense_rank first order by GAMECOUNT desc) NATION, min(GAMESTART) keep (dense_rank first order by GAMECOUNT desc) GAMESTART, max(GAMECOUNT) GAMECOUNT

   from cte2   

   select t1.GAME,t1.NATION,t1.GOLD,t1.SILVER,t1.COPPER

   from OLYMPIC t1

   right join cte3

   on t1.nation=cte3.nation and t1.game>=cte3.GAMESTART and t1.game<(cte3.GAMESTART+cte3.GAMECOUNT)

 

The SQL code is hard to understand. As the code is written in Oracle, the analytic function KEEP is used to get the maximum or minimum value from a subset. If any of the other databases is used, an additional query is needed to get the record containing the maximum or minimum value.

 

SPL solution:

SPL A.group() function works with @o function to create a new group whenever the field value is changed.


A

1

=T("Olympic.txt")

2

=A1.sort@z(GAME,1000000*GOLD+1000*SILVER+COPPER)

3

=A2.group@o1(GAME)

4

=A3.group@o(NATION)

5

=A4.maxp(~.len())

A1: Import the Olympic table.

A2: Sort A1’s records by game and total number of medals in descending order.

A3: Get the champion in each game; since the records are already ordered, the first record is what we need.

A4: Create a new group when a new nation appears.

A5: Get the group containing the most members, which is the nation that holds the champion for the longest time.

 

The SPL script is much simpler than the SQL script because SPL’s grouping operation can keep the post-grouping subsets and perform the order-based grouping operation on the subsets. The logic for solving the problem is clear and SPL just expresses the logic naturally.

 

In the cases where the sequence numbers (a member should be given to which group) of subsets can be obtained directly or indirectly, we can group the records according to the sequence numbers.

   

Example 3Divide the following employee table evenly into three groups (put those with a remainder into a new group) and calculate the average salary in each group. Below is part of the source table:

ID

NAME

BIRTHDAY

ENTRYDATE

DEPT

SALARY

1

Rebecca

1974/11/20

2005/03/11

R&D

7000

2

Ashley

1980/07/19

2008/03/16

Finance

11000

3

Rachel

1970/12/17

2010/12/01

Sales

9000

4

Emily

1985/03/07

2006/08/15

HR

7000

5

Ashley

1975/05/13

2004/07/30

R&D

16000

 

SQL solution:

SQL’s solution is this. Group employee records by the order of hire dates. The first one-third of the records is put into a group, the second one-third is put into the second group, and the last one-third is put into the third group. This is one example of data grouping by sequence numbers, which puts members having same sequence numbers into same group. As SQL does not support the grouping by sequence number, we can first obtain the sequence numbers, perform equi-grouping by them, and then sort groups by them. Below are SQL queries:

 

   with cte1 as(

      select count(*) COUNT

      from EMPLOYEE

   )

   select

      GROUP_NO, avg(SALARY) AVG_SALARY

   from (

      select

         TRUNC((rn-1)*3/(select COUNT from cte1))+1 GROUP_NO, SALARY

      from (

         select

            SALARY, ROW_NUMBER()OVER(ORDER BY ENTRYDATE) rn

         from EMPLOYEE

      )

   )

   group by GROUP_NO

   order by GROUP_NO

 

 SPL solution:

SPL A.group()function uses @n option to group records by sequence numbers, putting those having same sequence numbers into same group (records whose sequence numbers are N are put into the Nth group; N starts from 1).


A

1

=T("Employee.csv").sort(ENTRYDATE)

2

=A1.group@n((#-1)*3\A1.len()+ 1)

3

=A2.new(#:GROUP_NO,~.avg(SALARY):AVG_SALARY)

A1: Import Employee table and sort it by entry date.

A2: Get the group number according to the row numbers of sorted records and then perform grouping by group numbers.

A3: Calculate the average salary in each group.

 

Summary

SQL is not fit to handle order-based grouping operations. The cause is that SQL’s theoretical foundation, the relational algebra, is based on unordered sets. Window functions, however, are patches rather than the solution to address the SQL problem. SPL, however, is based on ordered sets, and is thus naturally good at handling order-based computations. Besides, it offers functions to specifically deal with the order-based grouping operations.

When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.

The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of grouping functions, which makes the marriage of both Java and SQL advantages. A grouping operation in SPL will thus become simple and easy.