SPL: Many to One Join

 

JOIN is used to combine records from two or more tables. This article will explore the solutions and basic principles of SPL for join problems.

The data interdependence between tables is called the inter table association relationship. The association relationships between tables can be divided into the following types: one-to-one, many-to-one, one-to-many, and many-to-many. We can join two or more tables through inter table association relationships to achieve the goal of multi table association queries.

Many to one association refers to any number of records of one table to correspond to one record of another table. A many-to-one association relationship, commonly found in two tables with foreign key relationships. Some fields of Table A are associated with the primary key of Table B. The fields associated with the primary key of Table B in Table A are called foreign keys pointing to Table B, and B is also known as the foreign key table of Table A.

For example, there are employee table and department table, where the department ID field of the employee table points to the ID field of the department table. Each employee has a department, while each department can have multiple employees, so the employee table and department table have a many-to-one relationship.

We often mix one to many and many to one together, actually we are misled by SQL. There is no distinction between these two types of association relationships in SQL, nor are separate solutions provided.

The idea of solving many to one associations is completely different from that of one to many. We can convert foreign key values to corresponding records in the join, or attach the required field values to the “many” table, and then treat it as one table for query operations. SPL provides functions such as A.switch()and A.join() to optimize and speed up foreign key table joins (many to one).

1. Foreign key Objectification

[Example 1] Based on the employee table and department table, query the names of all employees and their department names. Part of the data is as follows:

EMPLOYEE:

ID NAME BIRTHDAY DEPARTMENTID SALARY
1 Rebecca 1974/11/20 6 7000
2 Ashley 1980/07/19 2 11000
3 Rachel 1970/12/17 7 9000
4 Emily 1985/03/07 3 7000
5 Ashley 1975/05/13 6 16000

DEPARTMENT:

ID NAME MANAGER
1 Administration 18
2 Finance 2
3 HR 4
4 Marketing 6
5 Production 7

SPL provides the function A.switch() to convert the key values of encoded fields to corresponding records.

The SPL script is as follows:

A
1 =T(“Employee.txt”)
2 =T(“Department.txt”)
3 =A1.switch(DEPARTMENTID, A2:ID)
4 =A3.new(NAME, DEPARTMENTID.NAME:DEPT_NAME)

A1: Import employee table.
A2: Import department table.
A3: Use the function A.switch() to objectify the foreign key of the department ID, that is, convert the field value of the department ID to the corresponding department record.
A4: Returns the employee name and department name, where the department name can be obtained through the NAME field of the department record.

After foreign key objectification, the value of the department field in the employee table is the corresponding record of the employee in the department table. You can directly use “department.name” in the employee table to obtain the department name, or “department.manager” to obtain the name of the department manager, and so on.

Since this example is relatively simple, some people may have the doubts: “we only need to import the foreign key field during the join, why do we objectify the foreign keys?” Let’s use a slightly more complex example to illustrate.

[Example 2] Based on the employee table and department table, query which American employees have a Chinese manager. Part of the data is as follows:

EMPLOYEE:

ID NAME BIRTHDAY DEPARTMENTID SALARY
1 Rebecca 1974/11/20 6 7000
2 Ashley 1980/07/19 2 11000
3 Rachel 1970/12/17 7 9000
4 Emily 1985/03/07 3 7000
5 Ashley 1975/05/13 6 16000

DEPARTMENT:

ID NAME MANAGER
1 Administration 18
2 Finance 2
3 HR 4
4 Marketing 6
5 Production 7

We can still solve the problem by objectification the foreign key through the function A.switch().

The SPL script is as follows:

A
1 =T(“Employee.txt”).keys(ID)
2 =T(“Department.txt”).keys(ID)
3 =A2.switch(MANAGER, A1)
4 =A1.switch(DEPARTMENTID, A2)
5 =A4.select(NATION==“American” && DEPARTMENTID.MANAGER.NATION==“Chinese”)

A1: Import employee table with ID as the primary key.
A2: Import department table with ID as the primary key.
A3: Use the function A.switch() to objectify the foreign key and convert the manager field of the department table to the corresponding employee record.
A4: Use the function A.switch() to objectify the foreign key and convert the department field of the employee table to the corresponding department record.
A5: Select the American employees who have a Chinese manager.

In this example, there is an additional layer of association, but the SPL statement does not become complicated, only a foreign key objectification is added according to the logic. We don’t need to figure out which fields are needed in advance for each foreign key join. We can directly obtain them from the record object, and when selecting them, it also follows natural logic very well. The nationality of the department manager (DEPARTMENTID.MANAGER.NATION) is Chinese.

【 Example 3 】 Based on the course table and course selection table, query how many students have taken the “Matlab” course. Part of the data is as follows:

COURSE:

ID NAME TEACHERID
1 Environmental protection and sustainable development 5
2 Mental health of College Students 1
3 Matlab 8
4 Electromechanical basic practice 7
5 Introduction to modern life science 3

SELECT_COURSE:

ID COURSEID STUDENTID
1 6 59
2 6 43
3 5 52
4 5 44
5 5 37

We want to delete unmatching records during join (selecting courses that are not Matlab), and we can use inner join. In function A.switch(), option @i is used to delete unmatching records during join.

The SPL script is as follows:

A
1 =T(“Course.csv”)
2 =T(“SelectCourse.csv”)
3 =A1.select(NAME:“Matlab”)
4 =A2.switch@i(COURSEID, A3:ID).count()

A1: Import the course table.
A2: Import the course selection table.
A3: Select the record named Matlab in the course table.
A4: Use option @i of function A.switch() to delete unmatching records during join, that is, all remaining records are those that have selected Matlab course, and then count the quantity.

[Example 4] Based on the sales table and customer table, query the sales status of newly added customers in 2014 (i.e. records where the customer ID in the sales table is not in the customer table). Part of the data is as follows:

SALES:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
10248 VINET 2013/7/4 5 59 2440
10249 TOMSP 2013/7/5 6 38 1863.4
10250 HANAR 2013/7/8 4 65 1813
10251 VICTE 2013/7/8 3 66 670.8
10252 SUPRD 2013/7/9 4 46 3730

CUSTOMER:

ID NAME CITY POSTCODE TEL
ALFKI Sanchuan Industrial Co., Ltd Tianjin 343567 (030) 30074321
ANATR Southeast industries Tianjin 234575 (030) 35554729
ANTON Tanson trade Shijiazhuang 985060 (0321) 5553932
AROUT Guoding Co., Ltd Shenzhen 890879 (0571) 45557788
BERGS Tongheng machinery Nanjing 798089 (0921) 9123465

We want to keep only records that do not match during join (records where the customer ID in the sales table is not in the customer table). In function A.switch(), option @d only retain unmatching records.

The SPL script is as follows:

A
1 =T(“Sales.csv”)
2 =T(“Customer.txt”)
3 =A1.select(year(ORDERDATE)==2014)
4 =A3.switch@d(CUSTOMERID, A2:ID)

A1: Import sales table.
A2: Import customer table.
A3: Select the records of 2014 in the sales table.
A4: Use option @d of function A.switch() to only retain unmatching records during join, that is, sales records for new customers added in 2014.

[Example 5] Based on the organizational structure table, query the name of each institution’s superior institution. Part of the data is as follows:

ID ORG_NAME PARENT_ID
1 Head Office 0
2 Beijing Branch Office 1
3 Shanghai Branch Office 1
4 Chengdu Branch Office 1
5 Beijing R&D Center 2

In the organizational structure table, the parent organization ID (PARENT_ID) still points to this table, which is an example of self join. We can still use the function A.switch() to objectify the foreign key, and the foreign key table is itself.

The SPL script is as follows:

A
1 =T(“Organization.txt”)
2 =A1.switch(PARENT_ID, A1:ID)
3 =A2.new(ID, ORG_NAME, PARENT_ID.ORG_NAME:PARENT_NAME)

A1: Import organizational structure table.
A2: Use the function A.switch() to objectify the foreign key and convert the value of the parent organization ID field to the corresponding parent organization record.
A3: Returns the names of all institutions and the names of the parent institution.

2. Scenarios where Foreign key Objectification is not supported

When the foreign keys have multi fields, foreign key objectification is not supported. This can be solved by adding fields or records of the foreign key table.

【 Example 6 】 Based on the order table, select the product names with a received payment exceeding 500 in 2014. The relationship between tables is as follows:

11111png

In the above figure, the order details table and product table are associated through a single field foreign key product ID, while the order payment table and order details table are associated through two foreign key fields: order ID and order number. When using multiple field foreign keys, SPL provides the function A.join() to join the foreign key table and add the field values or records of the foreign key table to the newly added fields. Solution: Step 1: Objectify the foreign key of the ProductID field in the order details table and replace it with the corresponding record in the product table. Step 2: Join the order payment table with the order details table, and store the corresponding records in the newly added field DETAIL.

The SPL script is as follows:

A
1 =T(“OrderPayment.txt”)
2 =T(“OrderDetail.txt”)
3 =T(“Product.txt”)
4 =A2.switch(PRODUCTID, A3:ID)
5 =A1.join(ORDER:ORDER_NUMBER,A2:ID:ORDER_NUMBER,~:DETAIL)
6 =A5.select(year(PAY_DATE)==2014 && DETAIL.PRICE>500)
7 =A6.new(ORDER,PAY_DATE,DETAIL.PRODUCTID.NAME:PRODUCT_NAME,DETAIL.PRICE:PRICE)

A1: Import order payment table.
A2: Import order details table.
A3: Import product table.
A4: Use the function A.switch() to objectify the foreign key and convert the product ID field value to the corresponding product record.
A5: Use the function A.join() to join the order payment table and order details table through two fields, and add the corresponding records of the order details table to the DETAIL field.
A6: Select the records of received payment exceeding 500 in 2014.
A7: Generate a table with fields order ID, payment date, product name, and product price.

When the referenced foreign key table field may be null, and we want to keep the original value of the foreign key field, we cannot use foreign key objectification. We can use the function A.join() to join the foreign key table and add the field values of the foreign key table to the newly added fields.

[Example 7] Based on the sales table and customer table, query the total sales revenue of each customer in 2014 (some customers may not have been registered in the customer table). Part of the data is as follows:

SALES:

ID CUSTOMERID ORDERDATE SELLERID PRODUCTID AMOUNT
10248 VINET 2013/7/4 5 59 2440
10249 TOMSP 2013/7/5 6 38 1863.4
10250 HANAR 2013/7/8 4 65 1813
10251 VICTE 2013/7/8 3 66 670.8
10252 SUPRD 2013/7/9 4 46 3730

CUSTOMER:

ID NAME CITY POSTCODE TEL
ALFKI Sanchuan Industrial Co., Ltd Tianjin 343567 (030) 30074321
ANATR Southeast industries Tianjin 234575 (030) 35554729
ANTON Tanson trade Shijiazhuang 985060 (0321) 5553932
AROUT Guoding Co., Ltd Shenzhen 890879 (0571) 45557788
BERGS Tongheng machinery Nanjing 798089 (0921) 9123465

The SPL script is as follows:

A
1 =T(“Sales.csv”)
2 =T(“Customer.txt”)
3 =A1.select(year(ORDERDATE)==2014)
4 =A3.join(CUSTOMERID, A2:ID, NAME:CUSTOMERNAME)
5 =A4.groups(if (CUSTOMERNAME,CUSTOMERNAME,CUSTOMERID):CUSTOMER; sum(AMOUNT):AMOUNT)

A1: Import sales table.
A2: Import customer table.
A3: Select the sales records of 2014.
A4: Use the function A.join() to join the sales table and the customer table, and add the customer name field of the customer table to the CUSTOMERNAME field of the sales table.
A5: Summarize the total sales revenue of each customer by grouping them by customer name, and use the customer ID when the customer name does not exist.

From the above examples, we can see that there is a significant difference between many to one and one to many relationships. In SPL, the differences between the two are clearly distinguished, and different solutions are provided to support the association relationships of many to one and one to many.