SPL: Many to Many Join

 

JOIN is used to combine records from two or more tables. This article will explore the solutions and basic principles of SPL for join problems.

The data interdependence between tables is called the inter table association relationship. The association relationships between tables can be divided into the following types: one-to-one, many-to-one, one-to-many, and many-to-many. We can join two or more tables through inter table association relationships to achieve the goal of multi table association queries.

Many to many association: A record of Table A can correspond to any number of records of Table B; At the same time, a record of Table B can also correspond to any number of records of Table A.

To handle many to many association relationship, cross join can be used. Each record in Table A will form a new record with each record in Table B, so the number of rows in the returned result set is equal to the product of the number of rows in the two tables. The function xjoin() is provided in SPL for cross join. Many to many association relationship is rarely encountered in practical applications. Let’s briefly introduce the example of matrix multiplication:

Example 1: Calculate the product of two matrices. Two matrix table data are stored in MATRIXA and MATRIXB respectively:

MATRIXA:

ROW COL VALUE
1 1 1
1 2 2
1 3 3
2 1 4
2 2 5
2 3 6

MATRIXB:

ROW COL VALUE
1 1 1
1 2 4
2 1 2
2 2 5
3 1 3
3 2 6

The Formula of this example is as follows:

imagepng

Matrix multiplication is a many to many relationship. SPL provides the function xjoin () for cross join.

The SPL script is as follows:

A
1 =T(“MatrixA.csv”)
2 =T(“MatrixB.csv”)
3 =xjoin(A1:A; A2:B, A.COL==A2.ROW)
4 =A3.groups(A.ROW, B.COL; sum(A.VALUE * B.VALUE):VALUE)

A1: Import matrix table A.
A2: Import matrix table B.
A3: Use the function xjoin() to cross join and filter according to conditions during join.
A4: Group and summarize the sum of the products of values with the same row and column numbers.

When the join condition is not an equivalence comparison, it is called a non-equivalence join. For example, whether the age is within a certain age range, whether the income is within a certain income range, and so on.

【 Example 2 】 Based on the community residents table and age range table, query the age range of community residents. Part of the data is as follows:

COMMUNITY:

ID NAME AGE
1 David 28
2 Daniel 15
3 Andrew 65
4 Rudy

AGE_GROUP:

GROUP_NAME START END
Children 0 15
Youth 16 40
Middle 41 60
Old 61 100

In the community residents table, some residents have not registered age, so this example can be implemented through left join during cross join. The function xjoin() is used for cross join, and option @1 is used for the left join.

The SPL script is as follows:

A
1 =T(“Community.txt”)
2 =T(“AgeGroup.txt”)
3 =xjoin@1(A1:C; A2:A, A2.START<=C.AGE && A2.END>=C.AGE)
4 =A3.new(C.ID, C.NAME, C.AGE,A.GROUP_NAME)

A1: Import the community residents table.
A2: Import the age range table.
A3: Use function xjoin@1() to left join according to the first table, and select records with age within the corresponding range during join. When the age range cannot be matched, keep community resident records and set the age range to null.
A4: Return to the age group of community residents.

AgeGroup.txt

Community.txt

MatrixA.csv

MatrixB.csv