SPL: Set Operations

Set operations refer to the basic operations among various sets, including conjunction, intersection, union, difference, exclusive OR, containment and so forth. Structured data is often in the form of sets, and SPL provides rich methods and functions to facilitate the computation of such data. Let's take a look at how SPL solves set operation problems with some examples.

1. Conjunction

The conjunction of set A and set B is the set consisting of all members of set A or set B, duplicate members included. For example, the conjunction of set [1,2] and set [1,3] is the set [1,2,1,3], and the duplicate member 1 appears twice in the conjunction.

[e.g.1] A certain merchant has two sales channels, online sales and physical stores, and the sales data are stored in the SALES_ONLINE and SALES_STORE tables respectively. Now we want to query the total sales of each product in 2014. Some of the data are as follows:

SALES_ONLINE

 ID CUSTOMERID ORDERDATE PRODUCTID AMOUNT 1 HANAR 2014/01/01 17 3868.6 2 ERNSH 2014/01/03 42 1005.9 3 LINOD 2014/01/06 64 400.0 4 OTTIK 2014/01/07 10 1194.0 5 OCEAN 2014/01/09 41 319.2 … … … … …

SALES_STORE

 ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT 1 EASTC 2014/01/01 1 27 3063.0 2 ERNSH 2014/01/02 8 70 2713.5 3 MAGAA 2014/01/03 2 74 1675.0 4 SAVEA 2014/01/07 7 45 2018.2 5 FOLIG 2014/01/08 8 30 1622.4 … … … … … …

The data structure of above two tables are not exactly the same, for instance, SALES_STORE has the field SELLERID for storing sales people. But both have the same fields of order date, product ID, sales amount, etc.

The symbol "|" can be used in SPL to compute the conjunction of sets. For example, A|B is the conjunction of set A and set B.

The SPL script is as follows:

 A 1 =T("SalesOnline.csv").select(year(ORDERDATE)==2014) 2 =T("SalesStore.csv").select(year(ORDERDATE)==2014) 3 =A1|A2 4 =A3.groups(PRODUCTID; sum(AMOUNT):AMOUNT)

A1: Import the online sales table from the file and select the sales records of 2014.

A2: Import the physical-store sales table from the file and select the sales records of 2014.

A3: Calculate the conjunction of the two sets using the symbol "|".

A4: Group and aggregate the total sales of each product.

It is worth noting that SPL does not require sets of the same data structure during calculation, because the sequences of SPL support heterogeneous members. When we want to access the common fields Product ID and Sales Amount of the sequence members, we can access them directly like a normal data table, which is very convenient for users.

[e.g.2] Query the scores of the top 4 in math, top 3 in English, and top 2 in P.E. according to the score table. Some of the data are as follows:

 CLASS STUDENTID SUBJECT SCORE 1 1 English 84 1 1 Math 77 1 1 PE 69 1 2 English 81 1 2 Math 80 … … … …

SPL provides the A.conj() function to calculate the conjunction of all member sets when A is a set composed of sets.

The SPL script is as follows:

 A 1 =T("Scores.csv") 2 [Math,English,PE] 3 [4,3,2] 4 =A3.conj(A1.select(SUBJECT==A2(A3.#)).top(-~;SCORE))

A1: Import the student score table.

A2: Define the set of subjects.

A3: Define the set of rankings corresponding to the subjects.

A4: Select the top N rankings of each subject separately and use the A.conj() function to calculate the conjunction of the member sets.

2. Intersection

The intersection of set A and set B is the set consisting of all members which belong to set A and set B at the same time. For example, the intersection of set [1,2,3] and set [1,3,5] is the set [1,3].

[e.g.3] A certain merchant has two sales channels, online sales and physical stores, and the sales data are stored in the SALES_ONLINE and SALES_STORE tables respectively. Now we want to query the customers who have placed orders in both online and physical stores in 2014. Some of the data are as follows:

SALES_ONLINE

 ID CUSTOMERID ORDERDATE PRODUCTID AMOUNT 1 HANAR 2014/01/01 17 3868.6 2 ERNSH 2014/01/03 42 1005.9 3 LINOD 2014/01/06 64 400.0 4 OTTIK 2014/01/07 10 1194.0 5 OCEAN 2014/01/09 41 319.2 … … … … …

SALES_STORE

 ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT 1 EASTC 2014/01/01 1 27 3063.0 2 ERNSH 2014/01/02 8 70 2713.5 3 MAGAA 2014/01/03 2 74 1675.0 4 SAVEA 2014/01/07 7 45 2018.2 5 FOLIG 2014/01/08 8 30 1622.4 … … … … … …

The symbol "^" can be used in SPL to calculate the intersection of sets. For example, A^B is the intersection of set A and set B.

The SPL script is as follows:

 A 1 =T("SalesOnline.csv").select(year(ORDERDATE)==2014) 2 =T("SalesStore.csv").select(year(ORDERDATE)==2014) 3 =A1.id(CUSTOMERID)^A2.id(CUSTOMERID)

A1: Import the online sales table from the file and select the sales records of 2014.

A2: Import the physical-store sales table from the file and select the sales records of 2014.

A3: Use the A.id() function to calculate the set of distinct values of Customer IDs. Then use the symbol "^" to calculate the intersection of the two sets, i.e. customers who have placed orders in both online and physical stores.

[e.g.4] Count the names of the customers who rank the top 10 every month in 2014 based on the sales data table. Some of the data are as follows:

 ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT 10400 EASTC 2014/01/01 1 27 3063.0 10401 HANAR 2014/01/01 1 17 3868.6 10402 ERNSH 2014/01/02 8 70 2713.5 10403 ERNSH 2014/01/03 4 42 1005.9 10404 MAGAA 2014/01/03 2 74 1675.0 … … … … … …

SPL provides the A.isect() function to calculate the intersection of all member sets when A is a set composed of sets.

The SPL script is as follows:

 A 1 =T("Sales.csv").select(year(ORDERDATE)==2014) 2 =A1.group(month(ORDERDATE)) 3 =A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT)) 4 =A3.(~.top(-10;AMOUNT).(CUSTOMERID)) 5 =A4.isect()

A1: Import the sales table from the file and select the sales records of 2014.

A2: Group the sales table by month.

A3: Group the records of each month by customer.

A4: Count the top ten customers in total sales of each month.

A5: Use A.isect() function to calculate the intersection of all the customers selected from every month.

3. Union

The union of set A and set B is the set consisting of all members of set A or set B, duplicate members not included. For example, the union of the set [1,2] and set [1,3] is the set [1,2,3]. It is important to note that the duplicate member 1 only appears once in the union set.

[e.g.5] A certain merchant has two sales channels, online sales and physical stores, and the sales data are stored in the SALES_ONLINE and SALES_STORE tables respectively. Now we want to query which products have accumulated more than 10,000 in sales amount online or more than 5 in sales number offline in 2014. Some of the data are as follows:

SALES_ONLINE

 ID CUSTOMERID ORDERDATE PRODUCTID AMOUNT 1 HANAR 2014/01/01 17 3868.6 2 ERNSH 2014/01/03 42 1005.9 3 LINOD 2014/01/06 64 400.0 4 OTTIK 2014/01/07 10 1194.0 5 OCEAN 2014/01/09 41 319.2 … … … … …

SALES_STORE

 ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT 1 EASTC 2014/01/01 1 27 3063.0 2 ERNSH 2014/01/02 8 70 2713.5 3 MAGAA 2014/01/03 2 74 1675.0 4 SAVEA 2014/01/07 7 45 2018.2 5 FOLIG 2014/01/08 8 30 1622.4 … … … … … …

The symbol "&" can be used in SPL to calculate the union of sets. For example, A&B is the union of set A and set B.

The SPL script is as follows:

 A 1 =T("SalesOnline.csv").select(year(ORDERDATE)==2014) 2 =T("SalesStore.csv").select(year(ORDERDATE)==2014) 3 =A1.groups(PRODUCTID; sum(AMOUNT):AMOUNT).select(AMOUNT>10000) 4 =A2.groups(PRODUCTID; count(~):COUNT).select(COUNT>5) 5 =A3.(PRODUCTID)&A4.(PRODUCTID)

A1: Import the online sales table from the file and select the sales records of 2014.

A2: Import the physical-store sales table from the file and select the sales records of 2014.

A3: Group and aggregate the total sales amount of each online product and select the records that are greater than 10,000.

A4: Group and aggregate the total sales number of each product in the physical store, and select the records that are greater than 5.

A5: Use the symbol "&" to calculate the union of the product sets of online and physical stores.

[e.g.6] Query which products rank in the top 10 of the year in total sales number based on the sales data table. Some of the data are as follows:

 ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT 10400 EASTC 2014/01/01 1 27 3063.0 10401 HANAR 2014/01/01 1 17 3868.6 10402 ERNSH 2014/01/02 8 70 2713.5 10403 ERNSH 2014/01/03 4 42 1005.9 10404 MAGAA 2014/01/03 2 74 1675.0 … … … … … …

SPL provides the A.union() function to calculate the union of all member sets when A is a set composed of member sets.

The SPL script is as follows:

 A 1 =T("Sales.csv") 2 =A1.group(year(ORDERDATE)) 3 =A2.(~.groups(PRODUCTID;count(~):COUNT)) 4 =A3.(~.top(-5;COUNT).(PRODUCTID)) 5 =A4.union()

A1: Import the sales table from the file.

A2: Group the sales table by year.

A3: Group the records of each year by product to count the sales number.

A4: Select the top 10 product IDs in sales number of each year.

A5: Use A.union() function to calculate the union of the selected products of all years.

4. Difference

The difference set of set A and set B is the set consisting of all members belonging to set A and not belonging to set B. For example, the difference of the set [1,2,3] and set [1,3,5] is the set [2].

[e.g.7] A certain merchant has two sales channels, online sales and physical stores, and the sales data are stored in the SALES_ONLINE and SALES_STORE tables respectively. Now we want to query the customers who have spend more than 1,000 in physical store but never placed any order online. Some of the data are as follows:

SALES_ONLINE

 ID CUSTOMERID ORDERDATE PRODUCTID AMOUNT 1 HANAR 2014/01/01 17 3868.6 2 ERNSH 2014/01/03 42 1005.9 3 LINOD 2014/01/06 64 400.0 4 OTTIK 2014/01/07 10 1194.0 5 OCEAN 2014/01/09 41 319.2 … … … … …

SALES_STORE

 ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT 1 EASTC 2014/01/01 1 27 3063.0 2 ERNSH 2014/01/02 8 70 2713.5 3 MAGAA 2014/01/03 2 74 1675.0 4 SAVEA 2014/01/07 7 45 2018.2 5 FOLIG 2014/01/08 8 30 1622.4 … … … … … …

The symbol \ can be used in SPL to calculate the difference set of sets. For example, A\B is the difference set of set A and set B.

The SPL script is as follows:

 A 1 =T("SalesOnline.csv").select(year(ORDERDATE)==2014) 2 =T("SalesStore.csv").select(year(ORDERDATE)==2014) 3 =A2.groups(CUSTOMERID; count(~):COUNT).select(COUNT>3) 4 =A3.id(CUSTOMERID)\A1.id(CUSTOMERID)

A1: Import the online sales table from the file and select the sales records of 2014.

A2: Import the physical-store sales table from the file and select the sales records of 2014.

A3: Group and aggregate the total sales number of each customer of the physical store and select the records with more than 3.

A4: Use the symbol \ to calculate the difference set of physical-store and online customers.

[e.g.8] Query the customers who rank in the top 10 in total sales of January only in 2014 based on the sales data table. Some of the data are as follows:

 ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT 10400 EASTC 2014/01/01 1 27 3063.0 10401 HANAR 2014/01/01 1 17 3868.6 10402 ERNSH 2014/01/02 8 70 2713.5 10403 ERNSH 2014/01/03 4 42 1005.9 10404 MAGAA 2014/01/03 2 74 1675.0 … … … … … …

SPL provides the A.diff() function to calculate the difference set of the first member set and other member sets when A is a set composed of sets.

The SPL script is as follows:

 A 1 =T("Sales.csv").select(year(ORDERDATE)==2014) 2 =A1.group(month(ORDERDATE)) 3 =A2.(~.groups(CUSTOMERID;sum(AMOUNT):AMOUNT)) 4 =A3.(~.top(-10;AMOUNT).(CUSTOMERID)) 5 =A4.diff()

A1: Import the sales table from the file and select the sales records of 2014.

A2: Group the sales table by month.

A3: Group and aggregate the records of each month by customer.

A4Count the top ten customers in total sales of every month.

A5: Use the A.diff() function to calculate the difference set of customers in January and other months.

5. Exclusive OR

There are set A and set B. The members that belong to either A or B but do not belong to both A and B constitute the exclusive OR set of A and B. For example, the exclusive OR set of set [1,2,3] and set [1,3,5] is set [2,5].

[e.g.9] The scores of two semesters are stored in separate tables. Query the students whose overall scores rank in the top 10 only once in the two semesters of Class One. Some of the data are as follows:

First semester SCORES1

 CLASS STUDENTID SUBJECT SCORE 1 1 English 84 1 1 Math 77 1 1 PE 69 1 2 English 81 1 2 Math 80 … … … …

Next semester SCORES2

 CLASS STUDENTID SUBJECT SCORE 1 1 English 97 1 1 Math 64 1 1 PE 97 1 2 English 56 1 2 Math 82 … … … …

The symbol "%" can be used in SPL to calculate the exclusive OR of sets. For example, A%B is the exclusive OR set of set A and set B.

The SPL script is as follows:

 A 1 =T("Scores1.csv") 2 =T("Scores2.csv") 3 =A1.groups(STUDENTID; sum(SCORE):SCORE) 4 =A2.groups(STUDENTID; sum(SCORE):SCORE) 5 =A3.top(-10;SCORE).(STUDENTID) 6 =A4.top(-10;SCORE).(STUDENTID) 7 =A5%A6

A1: Import the score table of first semester from the file.

A2: Import the score table of next semester from the file.

A3: Group and aggregate the total scores of each student in the first semester by student ID.

A4: Group and aggregate the total scores of each student in the next semester by student ID.

A5: Select the top 10 student IDs in total score of the first semester.

A6: Select the top 10 student IDs in total score of the next semester.

A7: Use the symbol "%"to calculate the exclusive OR set of the first and next semesters.

[e.g.10] Query if customer RATTC has been ranked in the top three of monthly sales in 2014. Some of the data are as follows:

 ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT 10400 EASTC 2014/01/01 1 27 3063.0 10401 HANAR 2014/01/01 1 17 3868.6 10402 ERNSH 2014/01/02 8 70 2713.5 10403 ERNSH 2014/01/03 4 42 1005.9 10404 MAGAA 2014/01/03 2 74 1675.0 … … … … … …

SPL provides the A.cor() function to calculate the exclusive OR set of multiple member sets when A is a set composed of sets.

The SPL script is as follows:

 A 1 =T("Sales.csv").select(year(ORDERDATE)==2014) 2 =A1.group(month(ORDERDATE)) 3 =A2.(~.groups(CUSTOMERID; sum(AMOUNT):AMOUNT)) 4 =A3.new(~.top(-3; AMOUNT):TOP3) 5 =A4.(TOP3.(CUSTOMERID).pos("RATTC")>0) 6 =A5.cor()

A1: Import the sales table from the file and select the sales records of 2014.

A2: Group the sales table by month.

A3: Group the monthly sales records by customer.

A4: Count the top three customers in total sales per month.

A5: Calculate whether "RATTC" exists among the customers of each group.

A6: Use the A.cor() function to calculate the exclusive OR set whether each month contains the customer "RATTC". "RATTC" exists if the result is true; "RATTC" does not exist if the result is false.

6. Belonging and containment

For two sets A and B, B is said to contain A if any member of set A is a member of set B. Belonging is the relationship between a member and a set, and we say that x belongs to A when there is a member x in set A. For example, set [1,2,3] contains set [1,2], and member 1 belongs to set [1,2,3].

[e.g.11] Calculate the average salary of each department in California, New York, Texas, and Washington based on the employee table. Some of the data are as follows:

 ID NAME SURNAME STATE DEPT SALARY 1 Rebecca Moore California R&D 7000 2 Ashley Wilson New   York Finance 11000 3 Rachel Johnson New   Mexico Sales 9000 4 Emily Smith Texas HR 7000 5 Ashley Smith Texas R&D 16000 … … … … … …

SPL provides the A.contain(x) function to determine whether x belongs to set A.

The SPL script is as follows:

 A 1 =T("Employee.csv") 2 [California,New York,Texas,Washington] 3 =A1.select(A2.contain(STATE)) 4 =A3.groups(DEPT; avg(SALARY):SALARY)

A1: Import the employee table from the file.

A2: Define a constant set of these states.

A3: Select the records from the employee table whose state names belong to the A2 set.

A4: Group the records selected by A3 to calculate the average salary of each department.

[e.g.12] Query the students who took both Matlab and Modern wireless communication system courses based on the course table and course selection table. Some of the data are as follows:

COURSE

 ID NAME TEACHERID 1 Environmental   protection and sustainable development 5 2 Mental   health of College Students 1 3 Matlab 8 4 Electromechanical   basic practice 7 5 Introduction to modern life science 3 6 Modern wireless communication system 14 … … …

SELECT_COURSE

 ID STUDENT_NAME COURSE 1 Rebecca   Moore 2,7 2 Ashley   Wilson 1,8 3 Rachel   Johnson 2,7,10 4 Emily   Smith 1,10 5 Ashley   Smith 5,6 … … …

This problem is to determine whether the course fields of course selection table contain the course set [3,6] (the course IDs of Matlab and Modern wireless communication system), which is not complicated.

SPL provides the A.pos(B) function to obtain the position of a set B member in set A, and return null if the member does not exist in set A.

The SPL script is as follows:

 A 1 =T("Course.txt") 2 =T("SelectCourse.txt") 3 =A1.select(NAME=="Matlab" || NAME=="Modern wireless   communication system").(ID) 4 =A2.run(COURSE=COURSE.split@cp()) 5 =A4.select(COURSE.pos(A3)!=null)

A1: Import the course table from the file.

A2: Import the course selection table from the file.

A3: Select the IDs of the two courses to create a set .

A4: Split the selected courses by comma and then parse the numbers to create the set.

A5: Use the A.pos() function to locate the IDs of the two courses in the selection table, and the returned result that is not null is just the record of selecting two courses at the same time.