3.4 Get difference of two sets

 

The difference of set A and set B is made up of members that exist in set A but not in set B. For instance, the difference of [1,2,3] and [1,3,5] is [2].

Here is an example. A business has two sales channels – online and offline stores. Their sales data is stored in SALES_ONLINE table and SALES_STORE table respectively. The task is to find customers whose total order amount in the offline store exceeds 1000 but who do not have orders in the online store in the year 2014. Below is part of data in the two tables:

SALES_ONLINE:

ID CUSTOMERID ORDERDATE PRODUCTID AMOUNT
1 HANAR 2014/01/01 17 3868.6
2 ERNSH 2014/01/03 42 1005.9
3 LINOD 2014/01/06 64 400.0
4 OTTIK 2014/01/07 10 1194.0
5 OCEAN 2014/01/09 41 319.2

SALES_STORE:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
1 EASTC 2014/01/01 1 27 3063.0
2 ERNSH 2014/01/02 8 70 2713.5
3 MAGAA 2014/01/03 2 74 1675.0
4 SAVEA 2014/01/07 7 45 2018.2
5 FOLIG 2014/01/08 8 30 1622.4

SPL can use character “\” to calculate difference of sets. A\B, for instance, calculates difference of set A and set B.

SPL script:

A
1 =T(“SalesOnline.xlsx”).select(year(ORDERDATE)==2014)
2 =T(“SalesStore.xlsx”).select(year(ORDERDATE)==2014)
3 =A2.groups(CUSTOMERID; count(~):COUNT).select(COUNT>3)
4 =A3.id(CUSTOMERID)\A1.id(CUSTOMERID)

A1: Import SALES_ONLINE table from the original file and select sales records of the year 2014.
A2: Import SALES_STORE table from the original file and select sales records of the year 2014.
A3: Group A2 by CUSTOMERID and count purchase frequencies of each customer, and select records where the number of purchases are greater than 3.
A4: Use operator “\” to calculate difference of A3 and A1.