Comparison of SQL & SPL: Select Operation

 

Abstract

The select operation gets one or members from a set according to a specific condition. This essay explains the handling of select operations using SQL and SPL. By exploring solutions and basic principles of the two commonly used programming languages and providing sample programs written in them, we try to find the faster and more efficient way for you.

 

The select operation is so familiar for users who have had any experience with SQL or SPL. It seems similar to the locate operation. The latter cares about the position of one or more members in a specific set whole the select operation focuses on specific information of one or more members. Cases include getting names of customers whose sales amounts are above 10,000 and finding the name of student who has the highest total score.

1. Select members satisfying the specific condition

A common select operation is to get one or more members that meet a specified condition. Both SQL and SQL can handle such a select operation well. SQL uses WHERE sub statement to achieve the conditional filtering while SQL offers A.select() function to do that. Let’s look at how the two languages implement selection operations through some examples.

Example 1Get full names and salaries of employees in R&D department in the New York state. Below is part of the source data:

ID

NAME

SURNAME

STATE

DEPT

SALARY

1

Rebecca

Moore

California

R&D

7000

2

Ashley

Wilson

New   York

Finance

11000

3

Rachel

Johnson

New   Mexico

Sales

9000

4

Emily

Smith

Texas

HR

7000

5

Ashley

Smith

Texas

R&D

16000

 

SQL solution:

   select

      Concat(Concat(NAME,' '),SURNAME) FULLNAME, SALARY

   from

      EMPLOYEE

   where

      STATE='New York' and DEPT='R&D'

 

   SQL WHERE sub statement performs a conditional filtering according to the condition that state is New York and department is R&D. The keyword "and" is used when both or all conditions need to be met. With the Oracle database, we use Concat function to join up strings together when returning the full name.

 

SPL solution:


A

1

=T("Employee.csv")

2

=A1.select(STATE=="New   York"&&DEPT=="R&D")

3

=A2.new(NAME+" "+SURNAME:FULLNAME, SALARY)

A1: Import Employee table from the source file.

A2: A.select() selects records of employees of R&D department in the New York state.

A3: Use the plus sign “+” to join strings together to get the full name.

   SPL uses similar relational operators and logical operators to those high-level languages, such as Java and C, use. It uses two equal signs "==" to compare whether two operands are equal, and two and sings "&&" if two or more conditions need to be satisfied.

SPL supports retrieving a data table from the database. Suppose the above source data is stored in table "tbl" in database "db", A1 in the above script can be modified as follows:


A

1

=connect("db").query("select * from   tbl")

   

 

Example 2Count employees who are below 30 years old and above 50 years old in each department. Below is part of the source data:

ID

NAME

BIRTHDAY

STATE

DEPT

SALARY

1

Rebecca

1974/11/20

California

R&D

7000

2

Ashley

1980/07/19

New   York

Finance

11000

3

Rachel

1970/12/17

New   Mexico

Sales

9000

4

Emily

1985/03/07

Texas

HR

7000

5

Ashley

1975/05/13

Texas

R&D

16000

 

SQL solution:

    select

      DEPT, COUNT(*) COUNT

   from

      EMPLOYEE

   where

      TRUNC(months_between(sysdate, BIRTHDAY)/12)<30 or TRUNC(months_between(sysdate, BIRTHDAY)/12)>50

   group by DEPT

 

   SQL WHERE sub statement performs the conditional filtering according to the condition that the age is above 50 or below 30. The key word "or" is used when any of multiple conditions needs to be met. The months_between function is then used to calculate the difference of months between an employee’s birthday and the system time, divide it by 12 to get the number of years, and round it up using TRUNC function to get the age. As SQL cannot define a temporary variable in WHERE sub statement, it writes the same expression two times. The language needs a subquery in order to reuse a computing result.

 

SPL solution:


A

1

=T("Employee.csv")

2

=A1.select((age=age(BIRTHDAY), age<30 ||   age>50))

3

=A2.groups(DEPT; count(~):COUNT)

A1: Import Employee table.

A2: A.select() gets records of employees who are below 30 or above 50.

A3: Count the number of records meeting the specified condition in each department.

   SPL offers age() function to calculate the age according to the birthday. It allows using the double vertical lines "||" if any of the multiple conditions needs to be met. SPL supports using a temporary variable in the search condition to do the filtering in one step.

 

Example 3Find the differences of salaries and ages between Emma Taylor and Alexis Jones. Below is part of the employee table:

ID

NAME

SURNAME

STATE

DEPT

SALARY

1

Rebecca

Moore

California

R&D

7000

2

Ashley

Wilson

New   York

Finance

11000

3

Rachel

Johnson

New   Mexico

Sales

9000

4

Emily

Smith

Texas

HR

7000

5

Ashley

Smith

Texas

R&D

16000

 

SQL solution:

    select

      (select SALARY

   from EMPLOYEE

   where NAME='Emma' and SURNAME='Taylor')

      - (select SALARY

         from EMPLOYEE

         where NAME='Alexis' and SURNAME='Jones') as SALARY_GAP,

      (select TRUNC(months_between(sysdate, BIRTHDAY)/12) as AGE

         from EMPLOYEE

         where NAME='Emma' and SURNAME='Taylor')

      -(select TRUNC(months_between(sysdate, BIRTHDAY)/12) as AGE

         from EMPLOYEE

         where NAME='Alexis' and SURNAME='Jones') as AGE_GAP

      from EMPLOYEE

      where rownum=1

 

   The SQL code is rather complicated and requires multiple queries. We use a common table to simplify it:

    with cte1 as

       (select SALARY,TRUNC(months_between(sysdate, BIRTHDAY)/12) as AGE

      from EMPLOYEE

      where NAME='Emma' and SURNAME='Taylor'),

   cte2 as

      select SALARY,TRUNC(months_between(sysdate, BIRTHDAY)/12) as AGE

      from EMPLOYEE

      where NAME='Alexis' and SURNAME='Jones')

   select

      cte1.SALARY-cte2.SALARY as SALARY_GAP,

      cte1.AGE-cte2.AGE as AGE_GAP

   from cte1,cte2

 

   The use common table increases efficiency by reducing the frequency of queries. Yet there is another problem. The expected two records of employees are treated as two tables thanks to SQL’s lack of explicit record type data. A single record in SQL is treated as a temporary table having one record, which is a set consisting of a single member.

 

SPL solution:


A

1

=T("Employee.csv")

2

=A1.select@1(NAME=="Emma"&&SURNAME=="Taylor")

3

=A1.select@1(NAME=="Alexis"&&SURNAME=="Jones")

4

=A2.SALARY-A3.SALARY

5

=age(A2.BIRTHDAY)-age(A3.BIRTHDAY)

A1: Import Employee table.

A2: A.select() works with @1 option to get the record of Emma Taylor.

A3: A.select() works with @1 option to get the record of Alexis Jones.

A4: Calculate difference of their salaries.

A5: Calculate difference of their ages.

We can also use interval() function in A5 to calculate the interval between two dates:


A

5

=interval@y(A2.BIRTHDAY,A3.BIRTHDAY)

   The interval() function uses @y option to get the number of years between two dates.

   SPL offers true record data type. A2 and A3 respectively return records of the two employees. In the subsequent calculations, we can get a value directly from a record without treating it as a table.

 

SQL and SPL almost end in a draw in terms of their abilities to select desired members according to a simple condition. When the select condition becomes not that simple, SPL adds one or two lines according to the logic while SQL script gets extremely roundabout. Moreover, SPL can define a temporary variable in the condition and provides more functions to generate concise statements. 

2. Get the record containing the maximum/minimum value 

Example 4Get ID of the student who has the lowest math score in class one according to the following score table. Below is part of the source data:

CLASS

STUDENTID

SUBJECT

SCORE

1

1

English

84

1

1

Math

77

1

1

PE

69

1

2

English

81

1

2

Math

80

 

SQL solution:

   It is hard to get record containing the max/min value in SQL. Without the window function, we need to first calculate the max/min value and the perform a nested query using the max/min value as the condition.

 

Below is SQL script:

   select

      STUDENTID

   from

      SCORES

   where

      SUBJECT='Math' and and

      SCORE=(

         select MIN(SCORE)

         from SCORES

         where SUBJECT='Math' and

      )

   

   The record containing the min value could have been gotten through one loop, yet we use an inefficient method to get it due to SQL’s inability though this is not the right choice.

Some databases provides their compensations. Oracle, for instance, has analytic function KEEP for the job. The function makes the SQL statements simple and fast:

   select

      MIN(STUDENTID) keep (dense_rank first order by SCORE asc) STUDENTID

   from

      SCORES

   where

      SUBJECT='Math' and

 

   But KEEP function can only work with an aggregate function. It becomes awkward when there are multiple records containing the max/min value.

 

SPL solution:

SPL offers A.minp() function to get the record containing the minimum value.


A

1

=T("Scores.csv").select(SUBJECT=="Math"&&CLASS==1)

2

=A1.minp(SCORE).STUDENTID

A1: Import Scores table and get records Math scores in class one.

A2: A.minp() function gets record containing the lowest score and then gets the student ID from it.

Sometimes there are multiple records containing the maximum value. To return them all, we can use @a option with A.minp() function:


A

2

=A1.minp@a(SCORE).(STUDENTID)

A2: A.minp() function works with @a function to get all records containing the lowest score and then gets the student IDs from it.

 

Below is the result returned by A1.minp@a(SCORE):

CLASS

STUDENTID

SUBJECT

SCORE

1

5

Math

60

1

14

Math

60

   

   The SPL script is simple and clear because it selects the record(s) containing the max/min value directly. SQL, however, is not able to do that. Oracle KEEP function can only calculate the aggregate according to the sorting result.

 

Example 5Find the country that holds the champion in terms total number of medals for the most games and its detailed medal information according to the following Olympic medal table. Below is part of the source data:

Game

Nation

Gold

Silver

Copper

30

USA

46

29

29

30

China

38

27

23

30

UK

29

17

19

30

Russia

24

26

32

30

Korea

13

8

7

 

SQL solution:

   with cte1 as

      (select GAME, max(NATION) keep (dense_rank first order by 1000000*GOLD+1000*SILVER+COPPER desc) NATION

      from OLYMPIC

      group by GAME),

   cte2 as

      (select min(NATION) NATION, min(GAME) GAMESTART, count(*) GAMECOUNT

      from

         (select GAME, NATION, row_number() over(ORDER by GAME) rowno, ROW_NUMBER()OVER(PARTITION BY NATION ORDER BY GAME) groupno

      from cte1)

      group by NATION, groupno-rowno),

   cte3 as

      (select min(NATION) keep (dense_rank first order by GAMECOUNT desc) NATION, min(GAMESTART) keep (dense_rank first order by GAMECOUNT desc) GAMESTART, max(GAMECOUNT) GAMECOUNT

      from cte2)

   select t1.GAME,t1.NATION,t1.GOLD,t1.SILVER,t1.COPPER

   from OLYMPIC t1

   right join cte3

   on t1.nation=cte3.nation and t1.game>=cte3.GAMESTART and t1.game<(cte3.GAMESTART+cte3.GAMECOUNT)

 

   It is a great hassle to perform the task with SQL. The language selects the record of champion in each game by the total number of medals, groups all selected records by continuous country names, gets the group containing the most members and other information including the beginning game and the number of games it holds the champion continuously, and finally, selects records from the original table according to the related information.

 

SPL solution:

SPL offers A.maxp() function to get record containing the maximum value.


A

1

=T("Olympic.txt")

2

=A1.sort@z(GAME,   1000000*GOLD+1000*SILVER+COPPER)

3

=A2.group@o1(GAME)

4

=A3.group@o(NATION)

5

=A4.maxp(~.len())

A1: Import Olympic medal table.

A2: Sort A1’s table by game and the total number of medals.

A3: Get the first record from each group, which is the champion because records are already ordered by result in descending order.

A4: Create a new group when a different country appears.

A5: A.maxp() function gets the group containing the most members, which is the country that holds the champion for the longest time.

 

   The SPL script remains its conciseness while SQL code becomes more and more complicated. Besides the reason that it is hard to get record containing the max/min value in SQL, there is another cause. SQL cannot retain the post-grouping subsets for further grouping and other operations.

 

3. Get top/bottom N

Example 7Get scores of students who rank in top 2 for each subject in all classes based on the following score table. Below is part of the source data:

CLASS

STUDENTID

SUBJECT

SCORE

1

1

English

84

1

1

Math

77

1

1

PE

69

1

2

English

81

1

2

Math

80

 

SQL solution:

   SQL does not have a standard method of getting top/bottom N. Some databases support SELECT TOP statement while some, including Oracle, do not. As we use Oracle here, we handle the task using ORDER BY together with row numbers:

    select CLASS, STUDENTID, SUBJECT, SCORE 

   from (

       select CLASS, STUDENTID, SUBJECT, SCORE,

         ROW_NUMBER() OVER(PARTITION BY CLASS,SUBJECT

          ORDER BY SCORE DESC) grouprank

      from SCORES

   )

   where grouprank <= 2

 

SPL solution:

   SPL has A.top() function to get top/bottom N.


A

1

=T("Scores.csv")

2

=A1.group(CLASS,SUBJECT;~.top(-2;SCORE):TOP2)

3

=A2.conj(TOP2)

A1: Import Scores table.

A2: A.top() function gets students whose scores rank in top 2 in each class for each subject. -2 means getting two records in descending order.

A3: Concatenate all selected records.

 

4. Search for data by segment

In certain occasions we need to calculate the sequence number of the interval where a value falls and get corresponding members from a set. Such cases include returning grades (excellent, very good, good, pass) according to exam scores and returning the income level (poverty, comfortable, well-off, rich) according to the family’s annual income.

 

Example 6Based on the score table, find the number of students whose English scores fall in the following intervals respectively: Excellent (score >=90), Good (80=< score <90), OK (70<= score <80), Pass (60<= score <70) , and Failed (score <60). Below is part of the source data:

CLASS

STUDENTID

SUBJECT

SCORE

1

1

English

84

1

1

Math

77

1

1

PE

69

1

2

English

81

1

2

Math

80

Below is expected result:

ASSESSMENT

COUNT

EXCELLENT

7

GOOD

3

OK

8

PASS

6

FAIL

4

 

SQL solution:

    select

      ASSESSMENT, count(*) COUNT

   from

      (select

         case when score<60 then 'FAILED'

         when score>=60 and score<70 then 'PASS'

         when score>=70 and score<80 then 'OK'

         when score>=80 and score<90 then 'GOOD'

         when score>=90 then 'EXCELLENT'

         else 'OTHER' end ASSESSMENT

      from SCORES

      where subject='English')

   group by ASSESSMENT

 

   SQL uses CASE WHEN to divide the source table into segments, gets grades according to students’ scores, group records by grades, and count students in each group.

 

SPL solution:

   SPL supplies segmentation function A.segp() to return corresponding members in a sequence for each interval according to their sequence number.


A

1

=T("Scores.csv").select(SUBJECT:"English")

2

=create(ASSESSMENT,SCORE).record(["FAILed",0,"PASS",60,"OK",70,"GOOD",80,"EXCELLENT",90])

3

=A1.groups(A2.segp(SCORE,SCORE).ASSESSMENT;count(1):COUNT)

A1: Import Scores table and select English scores.

A2: Create the list of assessment & score.

A3: A.segp() function gets assessment according to sequence number of the interval in A2’s list where a score falls, and count students by assessment.

   

   Both SQL and SPL are able to search for data according to the specific intervals. For each interval, SQL adds a piece of CASE WHEN … THEN … statement. When there are a lot of intervals, the code will be lengthy because it needs to define a segmentation condition for each segment. SPL is different. It creates a comparison list to define the name and segmentation value for each segment, which is convenient to maintain.

Summary

It is inconvenient to get the record having the maximum/minimum value in SQL. Oracle is only able to perform an aggregate operation on the sorting result with the KEEP function but cannot trykt get the record containing the maximum/minimum. By contrast, SPL offers A.maxp() function and A.minp() function to get all things done with a one-stop calculation.

When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.

The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of grouping functions, which combines advantages of both Java and SQL. With SPL, the select operation will be convenient and easy.