Comparison of SQL & SPL: Set-oriented Operations
【Abstract】
Set-oriented operations include a set of basic operations, including concatenation, intersection, union, difference, XOR, and inclusion. As structured data often exists in the form of sets, both SQL and SPL, the two commonly used programming languages, provide a wealth of set-oriented methods and functions to handle the data. This essay explores solutions and basic principles of the two languages for handling set-oriented operations through examples and corresponding sample programs and tries to introduce the more convenient and more efficient way for you.
1. Concatenation
Concatenation of set A and set B is a set including all members of set A and all members of set B. It contains duplicate members.
【Example 1】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to calculate the total sales amount of each product in the year 2014. Below are parts of the two source tables:
SALES_ONLINE:
ID |
CUSTOMERID |
ORDERDATE |
PRODUCTID |
AMOUNT |
1 |
HANAR |
2014/01/01 |
17 |
3868.6 |
2 |
ERNSH |
2014/01/03 |
42 |
1005.9 |
3 |
LINOD |
2014/01/06 |
64 |
400.0 |
4 |
OTTIK |
2014/01/07 |
10 |
1194.0 |
5 |
OCEAN |
2014/01/09 |
41 |
319.2 |
… |
… |
… |
… |
… |
SALES_STORE:
ID |
CUSTOMERID |
ORDERDATE |
SELLERID |
PRODUCTID |
AMOUNT |
1 |
EASTC |
2014/01/01 |
1 |
27 |
3063.0 |
2 |
ERNSH |
2014/01/02 |
8 |
70 |
2713.5 |
3 |
MAGAA |
2014/01/03 |
2 |
74 |
1675.0 |
4 |
SAVEA |
2014/01/07 |
7 |
45 |
2018.2 |
5 |
FOLIG |
2014/01/08 |
8 |
30 |
1622.4 |
… |
… |
… |
… |
… |
… |
The two tables have different structures. SALES_STORE has SELLERID field for storing IDs of salespeople. But both have ID, CUSTOMERID, ORDERDATE, PRODUCTID and AMOUNT fields.
SQL solution:
select
PRODUCTID, sum(AMOUNT) AMOUNT
from
(select PRODUCTID, AMOUNT
from SALES_ONLINE
where extract (year from ORDERDATE)=2014
union all
select PRODUCTID, AMOUNT
from SALES_STORE
where extract (year from ORDERDATE)=2014)
group by PRODUCTID
order by PRODUCTID
SQL UNION operation is used to combine two or more sub result sets of same structure. They need to have same number of columns and each column should have the same data type. UNION ALL combines all records, including the duplicates.
Here we use ORCALE to do the SQL calculations. Since ORACLE does not have YEAR function, we use extract (year from date) to get the year.
SPL solution:
SPL uses the vertical line “|” to calculate concatenation of sets. A|B represents the concatenation of set A and set B.
A |
|
1 |
=T("SalesOnline.csv").select(year(ORDERDATE)==2014) |
2 |
=T("SalesStore.csv").select(year(ORDERDATE)==2014) |
3 |
=A1|A2 |
4 |
=A3.groups(PRODUCTID; sum(AMOUNT):AMOUNT) |
A1: Import SalesOnline table from the source file and select sales records of 2014.
A2: Import SalesStore table from the source file and select sales records of 2014.
A3: Use the sign “|” to calculate concatenation of the two sets.
A4: Group A3’s result set by product and calculate the total sales amount.
The SPL sequence supports members of different structures, so it does not require that the involved sets have same structure when concatenate them. It can directly access their common fields, like PRODUCTID and AMOUNT in this case, as it accesses an ordinary data table. It’s superbly convenient.
SPL supports retrieving a data table from the database, we can change A1 in the above script as follows:
A |
|
1 |
=connect("db").query("select * from SALES_STORE where extract (year from ORDERDATE)=2014") |
【Example 2】Based on the following scores table, find the top 4 math scores, top 3 English scores and top 2 PE scores. Below is part of the source table:
CLASS |
STUDENTID |
SUBJECT |
SCORE |
1 |
1 |
English |
84 |
1 |
1 |
Math |
77 |
1 |
1 |
PE |
69 |
1 |
2 |
English |
81 |
1 |
2 |
Math |
80 |
… |
… |
… |
… |
SQL solution:
SQL lacks a special method of calculating concatenation of two or more sets. It uses UNION ALL operator to do this. Oracle, for instance, defines an open table, uses ROW_NUMBER() OVER to calculate the rankings of all subjects ordered by scores in descending order, and then get top N for each subject:
with cte1 as
(select
CLASS, STUDENTID, SUBJECT, SCORE,
ROW_NUMBER() OVER(PARTITION BY SUBJECT
ORDER BY SCORE DESC) grouprank
from SCORES)
select CLASS, STUDENTID, SUBJECT, SCORE
from cte1
where grouprank <= 4 and SUBJECT='Math'
UNION ALL
select CLASS, STUDENTID, SUBJECT, SCORE
from cte1
where grouprank <= 3 and SUBJECT='English'
UNION ALL
select CLASS, STUDENTID, SUBJECT, SCORE
from cte1
where grouprank <= 2 and SUBJECT='PE'
SPL solution:
SPL uses A.conj() function to calculate the concatenation of member sets when A is a set of sets.
A |
|
1 |
=T("Scores.csv") |
2 |
[Math,English,PE] |
3 |
[4,3,2] |
4 |
=A3.conj(A1.select(SUBJECT==A2(A3.#)).top(-~;SCORE)) |
A1: Import student scores table.
A2: Define a set of subjects.
A3: Define a set of ranks corresponding to the set of subjects.
A4: Get top N scores for each specified subject, and use A.conj() function to concatenate multiple sets.
As there isn’t a special method of calculating concatenation of sets, we use UNION ALL to combine two sets each time. The code becomes more and more complicated when the number of sets involved increases. SPL has the special function A.conj() to do the job. It can calculate concatenation of any number of sets.
2. Intersection
The intersection of set A and set B is a set including all members belonging to both A and B.
【Example 3】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find the customers who purchase through both sales channels in the year 2014. Below is part of the source table:
SALES_ONLINE:
ID |
CUSTOMERID |
ORDERDATE |
PRODUCTID |
AMOUNT |
1 |
HANAR |
2014/01/01 |
17 |
3868.6 |
2 |
ERNSH |
2014/01/03 |
42 |
1005.9 |
3 |
LINOD |
2014/01/06 |
64 |
400.0 |
4 |
OTTIK |
2014/01/07 |
10 |
1194.0 |
5 |
OCEAN |
2014/01/09 |
41 |
319.2 |
… |
… |
… |
… |
… |
SALES_STORE:
ID |
CUSTOMERID |
ORDERDATE |
SELLERID |
PRODUCTID |
AMOUNT |
1 |
EASTC |
2014/01/01 |
1 |
27 |
3063.0 |
2 |
ERNSH |
2014/01/02 |
8 |
70 |
2713.5 |
3 |
MAGAA |
2014/01/03 |
2 |
74 |
1675.0 |
4 |
SAVEA |
2014/01/07 |
7 |
45 |
2018.2 |
5 |
FOLIG |
2014/01/08 |
8 |
30 |
1622.4 |
… |
… |
… |
… |
… |
… |
SQL solution:
SQL INTERSECT operator is used to calculate the intersection of two or more sub result sets. The sub result sets must have same data structure and same number of columns, and their columns should contain data of same types. Below is SQL query:
select
DISTINCT CUSTOMERID
from
SALES_ONLINE
where
extract (year from ORDERDATE)=2014
INTERSECT
select
DISTINCT CUSTOMERID
from
SALES_STORE
where
extract (year from ORDERDATE)=2014
Early databases probably do not support INTERSECT operator. So we can first calculate concatenation (UNION ALL) and then perform a filtering to get the target through grouping operation and count operation:
select
CUSTOMERID, COUNT(*)
from
(select DISTINCT CUSTOMERID
from SALES_ONLINE
where extract (year from ORDERDATE)=2014
UNION ALL
select DISTINCT CUSTOMERID
from SALES_STORE
where extract (year from ORDERDATE)=2014)
group by
CUSTOMERID
HAVING
COUNT(*)>1
order by
CUSTOMERID
The additional layer of grouping and aggregation makes the SQL query extremely complicated.
SPL solution:
SPL uses the sign “^” to calculate intersection of sets. A^B represents the intersection of set A and set B.
A |
|
1 |
=T("SalesOnline.csv").select(year(ORDERDATE)==2014) |
2 |
=T("SalesStore.csv").select(year(ORDERDATE)==2014) |
3 |
=A1.id(CUSTOMERID)^A2.id(CUSTOMERID) |
A1: Import SalesOnline table from the source file and select sales records of 2014.
A2: Import SalesStore table from the source file and select sales records of 2014.
A3: A.id() function gets the set of unique customer IDs. The sign “^” is used to calculate intersection of the two sets, which are customers who purchase products in both in online and at stores.
【Example 4】Based on the following sales data table, find top 10 customers in terms of monthly sales amount in the year 2014. Below is part of the source table:
ID |
CUSTOMERID |
ORDERDATE |
SELLERID |
PRODUCTID |
AMOUNT |
10400 |
EASTC |
2014/01/01 |
1 |
27 |
3063.0 |
10401 |
HANAR |
2014/01/01 |
1 |
17 |
3868.6 |
10402 |
ERNSH |
2014/01/02 |
8 |
70 |
2713.5 |
10403 |
ERNSH |
2014/01/03 |
4 |
42 |
1005.9 |
10404 |
MAGAA |
2014/01/03 |
2 |
74 |
1675.0 |
… |
… |
… |
… |
… |
… |
SQL solution:
with cte1 as
(select
extract (month from ORDERDATE) ORDERMONTH,
CUSTOMERID, SUM(AMOUNT) AMOUNT
from SALES
where extract (year from ORDERDATE)=2014
group by extract (month from ORDERDATE),CUSTOMERID
order by ORDERMONTH ASC, AMOUNT DESC),
cte2 as
(select
ORDERMONTH,CUSTOMERID,AMOUNT,
ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank
from cte1)
select
CUSTOMERID,count(*)
from cte2
where grouprank<=10
group by CUSTOMERID
having count(*)=12
SQL has not a method of calculating intersection of sets. The language uses INTERSECT operator to do the job. It is unsuitable to get top 10 customers in each month and then use INTERSECT operator to get the combination. Here we group records and perform a count operation. If a customer ranking in monthly top 10 for 12 times, it enters in the top 10 in each month.
SPL solution:
SPL uses A.isect() function to calculate intersection of all member sets if A is a set of sets.
A |
|
1 |
=T("Sales.csv").select(year(ORDERDATE)==2014) |
2 |
=A1.group(month(ORDERDATE)) |
3 |
=A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT)) |
4 |
=A3.(~.top(-10;AMOUNT).(CUSTOMERID)) |
5 |
=A4.isect() |
A1: Import Sales table from the source file and select records of the year 2014.
A2: Group A1’s table by month.
A3: Group records in each month by customer.
A4: Get customers whose monthly sales amounts rank in top 10 in each month.
A5: A.isect() function calculates intersection of all selected customers in all months.
Though we employ a trick to calculate the intersection, the SQL query is still complex. One reason is that SQL does not have a special method to calculate intersection of sets, the other is that SQL cannot retain the post-grouping subsets for a further grouping operation or other computations. The grouping operation will be explained separately later.
3. Union
The union of set A and set B is a set including all members of A and members of B, except for the duplicates.
【Example 5】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find the products whose total online sales amounts in the year 2014 is above 10,000 or that are purchased over 5 times at stores. Below are parts of the source tables:
SALES_ONLINE:
ID |
CUSTOMERID |
ORDERDATE |
PRODUCTID |
AMOUNT |
1 |
HANAR |
2014/01/01 |
17 |
3868.6 |
2 |
ERNSH |
2014/01/03 |
42 |
1005.9 |
3 |
LINOD |
2014/01/06 |
64 |
400.0 |
4 |
OTTIK |
2014/01/07 |
10 |
1194.0 |
5 |
OCEAN |
2014/01/09 |
41 |
319.2 |
… |
… |
… |
… |
… |
SALES_STORE:
ID |
CUSTOMERID |
ORDERDATE |
SELLERID |
PRODUCTID |
AMOUNT |
1 |
EASTC |
2014/01/01 |
1 |
27 |
3063.0 |
2 |
ERNSH |
2014/01/02 |
8 |
70 |
2713.5 |
3 |
MAGAA |
2014/01/03 |
2 |
74 |
1675.0 |
4 |
SAVEA |
2014/01/07 |
7 |
45 |
2018.2 |
5 |
FOLIG |
2014/01/08 |
8 |
30 |
1622.4 |
… |
… |
… |
… |
… |
… |
SQL solution:
select
PRODUCTID
from
(select PRODUCTID, sum(AMOUNT)
from SALES_ONLINE
where extract (year from ORDERDATE)=2014
group by PRODUCTID
having sum(AMOUNT)>10000)
union
select
PRODUCTID
from
(select PRODUCTID, count(*)
from SALES_STORE
where extract (year from ORDERDATE)=2014
group by PRODUCTID
having count(*)>5)
As we have mentioned, SQL UNION operator is used to combine results sets of two or more SELECT statements. When UNION is used without ALL, the operator removes duplicate records while combining records.
SPL solution:
SPL offers the and sign “&” to calculate the union of sets. A&B represents the union of set A and set B.
A |
|
1 |
=T("SalesOnline.csv").select(year(ORDERDATE)==2014) |
2 |
=T("SalesStore.csv").select(year(ORDERDATE)==2014) |
3 |
=A1.groups(PRODUCTID; sum(AMOUNT):AMOUNT).select(AMOUNT>10000) |
4 |
=A2.groups(PRODUCTID; count(~):COUNT).select(COUNT>5) |
5 |
=A3.(PRODUCTID)&A4.(PRODUCTID) |
A1: Import SalesOnline table from the source file and select records of the year 2014.
A2: Import SalesStore table from the source file and select records of the year 2014.
A3: Group A1’s records by product ID, calculate the total sales amount for each product, and select records where the totals are above 10,000.
A4: Group A2’s records by product ID, count purchase frequencies for each product, and select records where the frequency count is greater than 5.
A5: Use “&” to calculate the union of records purchased both through online and at stores.
【Example 6】Based on the following sales table, find the products whose yearly purchase frequency ranks in top 10. Below is part of the source table:
ID |
CUSTOMERID |
ORDERDATE |
SELLERID |
PRODUCTID |
AMOUNT |
10400 |
EASTC |
2014/01/01 |
1 |
27 |
3063.0 |
10401 |
HANAR |
2014/01/01 |
1 |
17 |
3868.6 |
10402 |
ERNSH |
2014/01/02 |
8 |
70 |
2713.5 |
10403 |
ERNSH |
2014/01/03 |
4 |
42 |
1005.9 |
10404 |
MAGAA |
2014/01/03 |
2 |
74 |
1675.0 |
… |
… |
… |
… |
… |
… |
SQL solution:
with cte1 as
(select
extract (year from ORDERDATE) ORDERYEAR,
PRODUCTID, COUNT(*) ORDERCOUNT
from SALES
group by extract (year from ORDERDATE),PRODUCTID
order by ORDERYEAR ASC, ORDERCOUNT DESC),
cte2 as
(select
ORDERYEAR,PRODUCTID,ORDERCOUNT,
ROW_NUMBER()OVER(PARTITION BY ORDERYEAR ORDER BY ORDERCOUNT DESC) grouprank
from cte1)
select
DISTINCT PRODUCTID
from cte2
where grouprank<=10
SQL does not offer a special method of calculating union of sets, so we group data by year, calculate the total purchase frequency of each product per year, sort the result records, get rankings through row numbers after sorting, then select IDs of products that rank in top 10 in each year in terms of purchase frequency, and finally, use DISTINCT operator to remove the duplicates to get the union.
SPL solution:
SPL uses A.union() function to calculate union of all member sets when A is a set of sets.
A |
|
1 |
=T("Sales.csv") |
2 |
=A1.group(year(ORDERDATE)) |
3 |
=A2.(~.groups(PRODUCTID;count(~):COUNT)) |
4 |
=A3.(~.top(-5;COUNT).(PRODUCTID)) |
5 |
=A4.union() |
A1: Import Sales table from the source file.
A2: Group A1’s table by year.
A3: Group records of each year by product and calculate their purchase frequency.
A4: Get IDs of products whose yearly purchase frequencies rank in top 10.
A5: A.union() function calculates union of desired records of products in all years.
4. Difference
The difference of set A and set B is a set including all members of A that do not belong to set B.
【Example 7】A vendor has two sales channels, online and store. The sales data for them is stored in two separate tables, SALES_ONLINE and SALES_STORE. The task is to find customers whose total purchase amount at stores is above 1000 but who have not any online purchase records. Below is part of the source table:
SALES_ONLINE:
ID |
CUSTOMERID |
ORDERDATE |
PRODUCTID |
AMOUNT |
1 |
HANAR |
2014/01/01 |
17 |
3868.6 |
2 |
ERNSH |
2014/01/03 |
42 |
1005.9 |
3 |
LINOD |
2014/01/06 |
64 |
400.0 |
4 |
OTTIK |
2014/01/07 |
10 |
1194.0 |
5 |
OCEAN |
2014/01/09 |
41 |
319.2 |
… |
… |
… |
… |
… |
SALES_STORE:
ID |
CUSTOMERID |
ORDERDATE |
SELLERID |
PRODUCTID |
AMOUNT |
1 |
EASTC |
2014/01/01 |
1 |
27 |
3063.0 |
2 |
ERNSH |
2014/01/02 |
8 |
70 |
2713.5 |
3 |
MAGAA |
2014/01/03 |
2 |
74 |
1675.0 |
4 |
SAVEA |
2014/01/07 |
7 |
45 |
2018.2 |
5 |
FOLIG |
2014/01/08 |
8 |
30 |
1622.4 |
… |
… |
… |
… |
… |
… |
SQL solution:
select
CUSTOMERID
from
(select
CUSTOMERID,count(*)
from
SALES_STORE
where
extract (year from ORDERDATE)=2014
group by
CUSTOMERID
having
count(*)>3)
MINUS
select
DISTINCT CUSTOMERID
from
SALES_ONLINE
where
extract (year from ORDERDATE)=2014
Take Oracle SQL as an example. The MINUS operator (some databases use EXCEPT) is used to calculate difference. MINUS requires that the involved sub result sets must have same data structures and same number of columns, and that corresponding columns should have same or similar data types.
SPL solution:
SPL uses the backslash “\” to calculate difference of sets. A\B represents the difference of set A and set B.
A |
|
1 |
=T("SalesOnline.csv").select(year(ORDERDATE)==2014) |
2 |
=T("SalesStore.csv").select(year(ORDERDATE)==2014) |
3 |
=A2.groups(CUSTOMERID; count(~):COUNT).select(COUNT>3) |
4 |
=A3.id(CUSTOMERID)\A1.id(CUSTOMERID) |
A1: Import SalesOnline table from the source file and select records of the year 2014.
A2: Import SalesStore table from the source file and select records of the year 2014.
A3: Group A2’s records by customer IDs, calculate the total purchase frequency for each customer, and select records where the total frequency is greater than 3.
A4: Use “\” to calculate difference of customers who purchase products online and those who buy things at stores.
【Example 8】Based on the following sales table, find customers whose total sales amounts rank in top 10 only in January in the year 2014. Below is part of the source table:
ID |
CUSTOMERID |
ORDERDATE |
SELLERID |
PRODUCTID |
AMOUNT |
10400 |
EASTC |
2014/01/01 |
1 |
27 |
3063.0 |
10401 |
HANAR |
2014/01/01 |
1 |
17 |
3868.6 |
10402 |
ERNSH |
2014/01/02 |
8 |
70 |
2713.5 |
10403 |
ERNSH |
2014/01/03 |
4 |
42 |
1005.9 |
10404 |
MAGAA |
2014/01/03 |
2 |
74 |
1675.0 |
… |
… |
… |
… |
… |
… |
SQL solution:
with cte1 as
(select
extract (month from ORDERDATE) ORDERMONTH,
CUSTOMERID, SUM(AMOUNT) AMOUNT
from SALES
where extract (year from ORDERDATE)=2014
group by extract (month from ORDERDATE),CUSTOMERID
order by ORDERMONTH ASC, AMOUNT DESC),
cte2 as
(select
ORDERMONTH,CUSTOMERID,AMOUNT,
ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank
from cte1)
select CUSTOMERID
from cte2
where grouprank<=10 and ORDERMONTH=1
MINUS
select CUSTOMERID
from cte2
where grouprank<=10 and ORDERMONTH<>1
SQL does not have a particular method for calculating difference of sets, so we group data by month, calculate the total sales amount of each customer per month, sort the result records, get rankings through row numbers after sorting, select customers of January, and use MINUS operator to calculate difference January’s customers and the set of customers in other months.
SPL solution:
SPL offers A.diff() function to calculate difference of the first member set and all the other member sets when A is a set of sets.
A |
|
1 |
=T("Sales.csv").select(year(ORDERDATE)==2014) |
2 |
=A1.group(month(ORDERDATE)) |
3 |
=A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT)) |
4 |
=A3.(~.top(-10;AMOUNT).(CUSTOMERID)) |
5 |
=A4.diff() |
A1: Import Sales table from the source file and select records of the year 2014.
A2: Group A1’s records by month.
A3: Group records of each month by customer IDs and calculate total sales of each customer.
A4: Get customers in each month whose total sales amounts rank in top 10.
A5: Use A.diff() function to calculate difference between customers of January and those in the other months.
5. XOR
XOR of set A and set B is a set made up members that belong to A or B but not belong to both.
【Example 9】Students’ scores in two semesters are stored in two tables respectively. The task is to find students in class 1 whose total scores rank in top 10 only once in both semesters. Below is part of the source table:
SCORES1_SEMESTER1:
CLASS |
STUDENTID |
SUBJECT |
SCORE |
1 |
1 |
English |
84 |
1 |
1 |
Math |
77 |
1 |
1 |
PE |
69 |
1 |
2 |
English |
81 |
1 |
2 |
Math |
80 |
… |
… |
… |
… |
SCORES2_SEMESTER2:
CLASS |
STUDENTID |
SUBJECT |
SCORE |
1 |
1 |
English |
97 |
1 |
1 |
Math |
64 |
1 |
1 |
PE |
97 |
1 |
2 |
English |
56 |
1 |
2 |
Math |
82 |
… |
… |
… |
… |
SQL solution:
SQL does not define an operator for performing XOR. There are two methods if we choose to use set operators to do this:
1. (A UNION B) EXCEPT (A INTERSECT B);
2. (A EXCEPT B) UNION (B EXCEPT A);
Both methods are not convenient because they use multiple views and thus drag performance down. In the following query, we use FULL OUTER JOIN to counter the disadvantage:
with cte1 as
(select STUDENTID,TOTALSCORE
from
(select STUDENTID, sum(SCORE) TOTALSCORE
from SCORES1
group by STUDENTID
order by TOTALSCORE DESC)
where rownum <= 10),
cte2 as
(select STUDENTID,TOTALSCORE
from
(select STUDENTID, sum(SCORE) TOTALSCORE
from SCORES2
group by STUDENTID
order by TOTALSCORE DESC)
where rownum <= 10)
select
COALESCE(cte1.STUDENTID, cte2.STUDENTID) STUDENTID,
cte1.TOTALSCORE, cte2.TOTALSCORE
from cte1
FULL OUTER JOIN cte2
ON cte1.STUDENTID=cte2.STUDENTID
where cte1.TOTALSCORE IS NULL
OR cte2.TOTALSCORE IS NULL
With Oracle SQL, we can use LEFT JOIN UNION RIGHT JOIND to implement FULL JOIN if the involved database is one that does not support FULL JOIN, such as MySQL (Detail query is skipped here).
SPL solution:
SPL uses the percent sign “%” to calculate XOR. A%B represents XOR of set A and set B.
A |
|
1 |
=T("Scores1.csv") |
2 |
=T("Scores2.csv") |
3 |
=A1.groups(STUDENTID; sum(SCORE):SCORE) |
4 |
=A2.groups(STUDENTID; sum(SCORE):SCORE) |
5 |
=A3.top(-10;SCORE).(STUDENTID) |
6 |
=A4.top(-10;SCORE).(STUDENTID) |
7 |
=A5%A6 |
A1: Import Scores1 table of from source file.
A2: Import Scores2 table of from source file.
A3: Group A1’s table by student ID and sum scores of each student.
A4: Group A2’s table by student ID and sum scores of each student.
A5: Get IDs of students whose total scores rank in top 10 in semester 1.
A6: Get IDs of students whose total scores rank in top 10 in semester 2.
A7: Use “%” to calculate XOR of student total scores in semester 1 and those in semester 2.
The SQL query is enormously complicated because there isn’t a particular operator in SQL to calculate XOR. SPL, however, is convenient by offering the percent sign “%” to do it.
【Example 10】Based on the following table, find whether customer RATTC ranked in top 3 in the year 2014 in terms of sales amount in a single month. Below is part of the source data:
ID |
CUSTOMERID |
ORDERDATE |
SELLERID |
PRODUCTID |
AMOUNT |
10400 |
EASTC |
2014/01/01 |
1 |
27 |
3063.0 |
10401 |
HANAR |
2014/01/01 |
1 |
17 |
3868.6 |
10402 |
ERNSH |
2014/01/02 |
8 |
70 |
2713.5 |
10403 |
ERNSH |
2014/01/03 |
4 |
42 |
1005.9 |
10404 |
MAGAA |
2014/01/03 |
2 |
74 |
1675.0 |
… |
… |
… |
… |
… |
… |
SQL solution:
with cte1 as
(select
extract (month from ORDERDATE) ORDERMONTH,
CUSTOMERID, SUM(AMOUNT) AMOUNT
from SALES
where extract (year from ORDERDATE)=2014
group by extract (month from ORDERDATE),CUSTOMERID
order by ORDERMONTH ASC, AMOUNT DESC),
cte2 as
(select
ORDERMONTH,CUSTOMERID,AMOUNT,
ROW_NUMBER()OVER(PARTITION BY ORDERMONTH ORDER BY AMOUNT DESC) grouprank
from cte1)
select count(*) CUSTOMER_EXISTS
from cte2
where grouprank<=3 and CUSTOMERID='RATTC'
SQL does not have a specific method for getting XOR, so we group data by month, calculate the total sales amount of each customer per month, sort the result records, get rankings through row numbers after sorting, select customers entering top 3 in each month, and count records of customer "RATTC". If the count result is 1, it means true; and if it is 0, it means false.
SPL solution:
SPL offers A.cor() function to calculate XOR of member sets when A is a set of sets.
A |
|
1 |
=T("Sales.csv").select(year(ORDERDATE)==2014) |
2 |
=A1.group(month(ORDERDATE)) |
3 |
=A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT)) |
4 |
=A3.new(~.top(-3; AMOUNT):TOP3) |
5 |
=A4.(TOP3.(CUSTOMERID).pos("RATTC")>0) |
6 |
=A5.cor() |
A1: Import Sales table from the source file and select records of the year 2014.
A2: Group Sales table by month.
A3: Group records of each month by customer and calculate total sales amount of each customer in each group.
A4: Get customers whose sales amounts rank in top 3 in each month.
A5: Find if there is customer "RATTC" in each group.
A6: Use A.cor() function to find XOR, that is, whether customer "RATTC" is included in each month. The customer exists if the result is true, and it does not exist if the result is false.
6. Belong to & include
Suppose there are two sets A and B. If all members of set A are members of set B, we call that B includes A. The “belong to” relationship is between a member and a set. When there is a member x in set A, we call that x belongs to A.
【Example 11】Based on the following employee table, calculate the average salary in each department in the states of California, New York, Texas and Washington. Below is part of the source table:
ID |
NAME |
SURNAME |
STATE |
DEPT |
SALARY |
1 |
Rebecca |
Moore |
California |
R&D |
7000 |
2 |
Ashley |
Wilson |
New York |
Finance |
11000 |
3 |
Rachel |
Johnson |
New Mexico |
Sales |
9000 |
4 |
Emily |
Smith |
Texas |
HR |
7000 |
5 |
Ashley |
Smith |
Texas |
R&D |
16000 |
… |
… |
… |
… |
… |
… |
SQL solution:
SQL IN operator is used to define a “belong to” relationship in WHERE sub statement. The SQL query is as follows:
select
DEPT, avg(SALARY) AVGSALARY
from
EMPLOYEE
where
STATE in ('California','New York','Texas','Washington')
group by
DEPT
SPL solution:
SPL supplies A.contain(x) function to check whether member x belongs to set A.
A |
|
1 |
=T("Employee.csv") |
2 |
[California,New York,Texas,Washington] |
3 |
=A1.select(A2.contain(STATE)) |
4 |
=A3.groups(DEPT; avg(SALARY):SALARY) |
A1: Import Employee table from the source file.
A2: Define a constant set of states.
A3: Select records from A1’s table where the states belong to A2’s set.
A4: Group the selected records in A3 by department and calculate the average salary in each department.
【Example 12】Based on COURSE table and SELECT_COURSE table, find students who select both Matlab and Modern wireless communication system. Below is part of the s source table:
COURSE:
ID |
NAME |
TEACHERID |
1 |
Environmental protection and sustainable development |
5 |
2 |
Mental health of College Students |
1 |
3 |
Matlab |
8 |
4 |
Electromechanical basic practice |
7 |
5 |
Introduction to modern life science |
3 |
… |
… |
… |
SELECT_COURSE:
ID |
STUDENT_NAME |
COURSE |
1 |
Rebecca Moore |
2,7 |
2 |
Ashley Wilson |
1,8 |
3 |
Rachel Johnson |
2,7,10 |
4 |
Emily Smith |
1,10 |
5 |
Ashley Smith |
5,6 |
… |
… |
… |
The task can be described in another way. It checks whether the COURSE field value in SELECT_COURSE table contains the set [3,6], which are IDs of Matlab and Modern wireless communication system.
SQL solution:
The SQL field does not support set data type, so we cannot use the set include relationship to get this task done. With Oracle database here, we use REGEXP_SUBSTR function to split each COURSE value string according to a specific regular expression, left join SELECT_COURSE table and COURSE table to get records selecting both courses, group these records by IDs, get groups containing at least two records, that is, those selecting both courses, and then locate corresponding records from SELECT_COURSE table according to the selected IDs. Below is SQL query:
with cte1 as
(SELECT ID,REGEXP_SUBSTR(t1.COURSE ,'[^,]+',1,l) COURSE
FROM SELECT_COURSE t1,
(SELECT LEVEL l
FROM DUAL
CONNECT BY LEVEL<=10) t2
WHERE l<=LENGTH(t1.COURSE) - LENGTH(REPLACE(COURSE,','))+1)
select *
from SELECT_COURSE t5
inner join (
select ID, count(*)
from (
select t3.ID, COURSE
from cte1 t3
inner join (
select ID
from COURSE
where NAME='Matlab' or
NAME='Modern wireless communication system'
) t4
on t3.COURSE=t4.ID
)
group by ID
having count(*)>=2
) t6
on t5.ID=t6.ID
SPL solution:
SPL uses A.pos(B) function locate the position of a member of set B in set A and returns null if the member does not exist in set A.
A |
|
1 |
=T("Course.txt") |
2 |
=T("SelectCourse.txt") |
3 |
=A1.select(NAME=="Matlab" || NAME=="Modern wireless communication system").(ID) |
4 |
=A2.run(COURSE=COURSE.split@cp()) |
5 |
=A4.select(COURSE.pos(A3)!=null) |
A1: Import Course table from the source file.
A2: Import SelectCourse table from the source file.
A3: Get the set of IDs of the target two courses.
A4: Split each Course value by comma and parse the numbers into a set.
A5: Use A.pos() function to locate IDs of the target courses in COURSE value of SELECT_COURSE table, and a record that does not make it return null is a desired one.
As SQL lacks support of set type field values, it is a little complicated to handle this case. The language does snot provide a method of checking a set include relationship, so it handles the job using the filtering join. SPL, however, supports set type field values and offers a rich library of functions to locate members of a set, which facilitates the handling of set include relationship judgment cases.
【Example 13】Based on the following weather data in a certain area, find the dates when west wind occurs and when north wind visits in the previous dates. Below is part of the source table:
WEATHER_DATE |
RAIN_FALL |
WIND_GUST_DIR |
WIND_GUST_SPEED |
RAIN_TODAY |
RAIN_TOMORROW |
2008/12/01 |
0.6 |
W |
44 |
No |
No |
2008/12/02 |
0.0 |
WNW |
44 |
No |
No |
2008/12/03 |
0.0 |
WSW |
46 |
No |
No |
2008/12/04 |
0.0 |
NE |
24 |
No |
No |
2008/12/05 |
1.0 |
W |
41 |
No |
No |
… |
… |
… |
… |
… |
… |
SQL solution:
The task is simple. It aims to find an ordered subset [N,N,W] in WIND_GUST_DIR set. SQL has a weak support for order-based calculations due to its unordered-set-based theoretic foundation (which is explained in Comparison of SQL & SPL: Order-based calculation). When the SQL you are using does not support window functions, you can only do this through table joins. Below is SQL query:
select
curr.WEATHER_DATE, RAIN_FALL,curr.WIND_GUST_DIR,
WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW
from
weather curr
inner join
(select
pre1.WEATHER_DATE,pre1.WIND_GUST_DIR
from weather pre1
left join
(select
WEATHER_DATE,WIND_GUST_DIR
from weather) pre2
on
pre1.WEATHER_DATE=pre2.WEATHER_DATE+1
where
pre1.WIND_GUST_DIR='N' and pre2.WIND_GUST_DIR='N') yest
on
curr.WEATHER_DATE=yest.WEATHER_DATE+1
where
curr.WIND_GUST_DIR='W'
order by WEATHER_DATE
The SQL query is roundabout. Each inter-row access requires a self-join. It is inefficient. SQL introduced window functions in the year 2003 and brought in the concept of order. That has made the order-based calculations slightly easier:
select
WEATHER_DATE,RAIN_FALL,WIND_GUST_DIR,
WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW
from
(select
WEATHER_DATE,RAIN_FALL,WIND_GUST_DIR,
WIND_GUST_SPEED,RAIN_TODAY,RAIN_TOMORROW,
LAG(WIND_GUST_DIR,1)
OVER (
ORDER BY WEATHER_DATE ASC
) YESTERDAY_DIR,
LAG(WIND_GUST_DIR,2)
OVER (
ORDER BY WEATHER_DATE ASC
) BEFORE_YESTERDAY_DIR
from WEATHER)
where
WIND_GUST_DIR='W' and YESTERDAY_DIR='N'
and BEFORE_YESTERDAY_DIR='N'
order by WEATHER_DATE
SPL solution:
SPL supports accessing a record previous to the current one or after it for order-based calculations.
A |
|
1 |
=T("weather.csv") |
2 |
=A1.select(WIND_GUST_DIR[-2]=="N" &&WIND_GUST_DIR[-1]=="N" && WIND_GUST_DIR=="W") |
A1: Import weather data from the source file.
A2: Select records where WIND_GUST_DIR values are west and the previous two values are north.
Summary
SQL gives a good support for concatenation, intersection, union, and difference of two sets but a not good one for XOR. The language does not support theses operations on more sets, and it resorts to roundabout ways to get do jobs. SPL provides special functions for each type of set-oriented operations. This makes SPL code concise, efficient, and easy to understand.
SQL does not support set type field values. When a field value is separated by a certain identifier, SQL cannot perform set include operation on it. The language is awkward and produces complicated code in handling the “belong to” relationship on an ordered subset even with the window function. SPL designs a lot of overall location functions to deal with the set “belong to” relationship. It also supplies a complete set of supports for order-based calculations, which makes it easy to handle inter-row operations.
When the query is complicated, the complexity of SQL query increases by multiple times. It involves the use of temporary table and nested query, etc, which makes it harder to write and maintain the SQL query. SPL, however, can compose succinct code step by step according to the natural way of thinking.
The SPL-driven esProc is the professional data computation engine. It is ordered-set-based and offers a complete set of set-oriented functions, which combines the advantages of both Java and SQL. With SPL, a set-oriented operation becomes simple and easy.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version