Python vs. SPL 11 -- Many-to-One Association

 

In Python vs. SPL 10 -- One-to-N Association, we introduce one-to-one and one-to-N association. And this article will compare the computational abilities of Python and SPL in many-to-one association.

 

Foreign key association

When some fields of table A are associated with the primary key of table B, the associative fields of table A can be many, and the associative field of table B is distinct. Such scenario is a many-to-one association, also known as foreign key association, that is, table A is a fact table, and table B is a dimension table. The fields of table A associated with the primary key of table B are called the foreign keys of A to B, and table B is also called the foreign key table of A. For example:

 

There is a sale record table and a product information table. The calculation task is to aggregate the sale amount of each kind of product.

 

Some of the data in sale record table, and product information table are as follows:

 

sale record table (fact table):

recordid

product

sale_city

amount

sr100001

p1003

c104

380

sr100002

p1005

c103

400

sr100003

p1003

c104

626

 

Product information table (dimension table):

productid

pclass

p1001

A

p1002

A

p1003

B

 

Python

import pandas as pd

sr_file1="D:\data\SaleRecord.csv"

pt_file1="D:\data\Product.csv"

record1=pd.read_csv(sr_file1)

product1=pd.read_csv(pt_file1)

r_pt=pd.merge(record1,product1,left_on="product",right_on="productid")

pclass_sale=r_pt.groupby('pclass',as_index=False).amount.sum()

print(pclass_sale)

 

 

 

Fact table

 

Dimension table

Associate fact table with the foreign keys of dimension table

 

 

The merge function in Python associates two tables; sale record table “record1” is the fact table, and product information table “product1” is the dimension table. Many records in “record1” correspond to one record in “product1”, and the names of associative fields in two tables are different, so left_on and right_on mark the associative field of two tables respectively so that the two tables are associated as a wide table, then group and aggregate the records to get the final result.

 

SPL


A

B

1

D:\data\SaleRecord.csv


2

D:\data\Product.csv


3

=file(A1).import@tc()


4

=file(A2).import@tc()


5

=A3.switch(product,A4:productid)

/convert foreign keys to records of dimension table

6

=A5.groups(product.pclass;sum(amount):amount)


 

The switch function in SPL converts the foreign keys to corresponding records of the dimension table, and since they are records now, they can certainly reference to fields which can be used to perform grouping and aggregation operations during grouping.

 

One fact table & multiple dimension tables

One fact can be associated with multiple dimension tables, for example:

 

We continue to use the sale record table (fact table) and product information table (dimension table 1), and a new city information table (dimension table 2) is added. The calculation task is to count the sale amount of each kind of product in each province.

 

City information table (dimension table 2):

cityid

name

province

c101

Beijing

Beijing

c102

Tianjin

Tianjin

c103

Harbin

Heilongjiang

 

Python

#continue to use sr_file1 and pt_file1

 

ct_file1="D:\data\City.csv"

ct1=pd.read_csv(ct_file1)

r_ct=pd.merge(record1,ct1,left_on="sale_city",right_on="cityid")

r_ct_pdt=pd.merge(r_ct,product1,left_on="product",right_on="productid")

ct_pdt_sale=r_ct_pdt.groupby(['province','pclass'],as_index=False).amount.sum()

print(ct_pdt_sale)

 

 

 

Associate fact table with dimension table 2

Associate fact table with dimension table 1

Group and aggregate

 

When associating multiple dimension tables, Python usually associates one table first and then the other table. After executing the merge function twice, a big wide table is generated which is used to perform the grouping and aggregate operations.

 

SPL


A

B

/A3 is sale record table, and A4 is product information table


8

D:\data\City.csv


9

=file(A8).import@tc()


10

=A3.switch(product,A4:productid;sale_city,A9:cityid)

/set primary key

11

=A10.groups(sale_city.province,product.pclass;sum(amount):amount)

/group and aggregate

 

The switch function in SPL can create many foreign key associations simultaneously such as the ID number of product “product” and the “productid” in product information table, and ID number of city “sale_city” and “cityid” in city information table. More associations can be created if needed, and the fields of records can be used to perform grouping and aggregate operations after being associated. Different from Python, SPL can parse multiple associative relations at a time, which makes the association explicit and more efficient.

 

Reuse dimension table

One fact table may use the same dimension table multiple times, for example:

 

Based on sale record table and city information table, select the sale record whose sale city and producing city are the same one.

 

sale record table 2 (fact table):

recordid

product

product_city

sale_city

amount

sr100001

p1006

c105

c103

603

sr100002

p1005

c105

c102

1230

sr100003

p1003

c102

c102

885

City information table 2 (dimension table):

cityid

name

province

c101

Beijing

Beijing

c102

Tianjin

Tianjin

c103

Harbin

Heilongjiang

 

Python

sr_file2="D:\data\SaleRecord2.csv"

ct_file2="D:\data\City2.csv"

record2=pd.read_csv(sr_file2)

ct2=pd.read_csv(ct_file2)

r_ct2=pd.merge(record2,ct2,left_on="sale_city",right_on="cityid")

r_ct_ct=pd.merge(r_ct2,ct2,left_on="product_city",right_on="cityid",suffixes=('_s',   '_p'))

r_ct_p_ct= r_ct_ct[r_ct_ct['province_s']==r_ct_ct['province_p']].recordid

print(r_ct_p_ct)

 

 

 

 

Associate fact table with dimension table for the first time

Associate fact table with dimension table for the second time

 

 

 

 

The sale records in the example include the ID numbers of sale city and producing city, both of which can be associated with the “cityid” of city information table. Python still uses the same method, executing the merge function twice, and merge function will generate the same field names in the second time, but Python can handle such a problem successfully by adding a different suffix.

 

SPL


A

B


13

D:\data\SaleRecord2.csv


14

D:\data\City2.csv


15

=file(A13).import@tc()


16

=file(A14).import@tc()


17

=A15.switch(sale_city,A16:cityid;product_city,A16:cityid)

/associate fact table with dimension table

18

=A17.select(sale_city.province==product_city.province).(recordid)


 

SPL uses the switch function to associate the same dimension table, but with different foreign keys (sale_city and product_city), and then uses the associated record fields to select the target result.

 

Multi-layer dimension table

Foreign key association may involve more than one layer of dimension table. In other words, there are scenarios of multiple layers of the dimension table. For example:

 

Based on the sale record table, product information table, and city information table, select the sale record whose sale city and producing city are in the same province.

 

Sale record table (fact table):

recordid

product

sale_city

amount

sr100001

p1003

c104

380

sr100002

p1005

c103

400

sr100003

p1003

c104

626

 

City information table (dimension table 1):

cityid

name

province

c101

Beijing

Beijing

c102

Tianjin

Tianjin

c103

Harbin

Heilongjiang

 

Product information table (dimension table 2):

productid

product_city

p1001

c104

p1002

c103

p1003

c102

 

Python

sr_file3="D:\data\SaleRecord3.csv"

ct_file3="D:\data\City3.csv"

pt_file3="D:\data\Product3.csv"

record3=pd.read_csv(sr_file3)

product3=pd.read_csv(pt_file3)

ct3=pd.read_csv(ct_file3)

pdt_ct=pd.merge(product3,ct3,left_on="product_city",right_on="cityid")

r_pdt_ct=pd.merge(record3,pdt_ct,left_on="product",right_on="productid")

r_pdt_ct_ct=pd.merge(r_pdt_ct,ct3,left_on="sale_city",right_on="cityid",suffixes=('_s',   '_p'))

r_ct_p_ct2=r_pdt_ct_ct[r_pdt_ct_ct['province_s']==r_pdt_ct_ct['province_p']].recordid

print(r_ct_p_ct2)

 

 

 

 

 

 

Associate producing city with city

Associate sale record with product

Associate sale city with city

 

 

 

The city information is the dimension table of both product and sale record; product information is also the dimension table of sale record, which constitutes multiple layers of dimension tables together, and there is dimension table that is associated multiple times. Python uses the merge function three times for three associations.

 

SPL


A

B


20

D:\data\SaleRecord3.csv


21

D:\data\City3.csv


22

D:\data\Product3.csv


23

=file(A20).import@tc()


24

=file(A21).import@tc()


25

=file(A22).import@tc()


26

=A25.switch(product_city,A24:cityid)

/associate producing city with city

27

=A23.switch(sale_city,A24:cityid;product,A26:productid)

/associate sale record with city and product

28

=A27.select(sale_city.province==product.product_city.province).(recordid)


 

Once an association is created, SPL can use it all the time, even when the association is created again. For example, we create associations on producing city and city in A26, and on sale record and product in A27; besides, the association between producing city and city still exists. Therefore, we can have reference of product.product_city.province in A28, which is quite convenient for multiple table association.

 

Self-association

Sometimes we may also encounter a scenario where a table is both a fact table and a dimension table, i.e., the table associates with itself. For example:

 

There is an employee information table, and the calculation task is to list names of all employees and their superiors.

 

Some of the employee information table are as follows:

empid

name

superior

7902

FORD

7566

7788

SCOTT

7566

7900

JAMES

7698

 

Python

emp_file="D:\data\Employee_.csv"

emp=pd.read_csv(emp_file)

emp_s=pd.merge(emp,emp,left_on="superior",right_on="empid",suffixes=('',  '_m'),how="left")

emp_s_name=emp_s[['name','name_m']]

print(emp_s_name)

 

 

Self associate

 

 

 

 

The operation of Python is still two-table association essentially.

 

SPL


A

B


30

D:\data\Employee_.csv


31

=file(A30).import@tc()


32

=A31.switch(superior,A31:empid)

/self associate

33

=A32.new(name,superior.name:s_name)


 

SPL also follows the same operation logic, using switch function to associate “superior” and “empid”.

 

Circle association

When associative relation is complex, circle association may occur. For example:

 

There is an employee information table and a department information table, and the calculation task is to select Beijing employees of Beijing manager.

 

Employee information tale:

empid

name

dept

province

1

Rebecca

6

Beijing

2

Ashley

2

Tianjin

3

Rachel

7

Heilongjiang

 

Department information table:

deptid

name

manager

1

Administration

20

2

Finance

2

3

HR

162

 

Python

emp_file2="D:\data\Employee_2.csv"

dept_file2="D:\data\Department2.csv"

emp2=pd.read_csv(emp_file2)

dept2=pd.read_csv(dept_file2)

d_emp=pd.merge(dept2,emp2,left_on="manager",right_on="empid")

emp_d_emp=pd.merge(emp2,d_emp,left_on="dept",right_on="deptid",suffixes=('',  '_m'))

beijing_emp_m=emp_d_emp[(emp_d_emp['province']=="Beijing") & (emp_d_emp['province_m']=="Beijing")].name

print(beijing_emp_m)

 

 

 

 

Associate department table with employee table

Associate employee table with department table

 

Select

 

 

The above two associations are relatively independent from each other in Python. These two associations constitute a circle association to generate a wide table, and then the target result is selected.

 

SPL


A

B


35

D:\data\Employee_2.csv


36

D:\data\Department2.csv


37

=file(A35).import@tc()


38

=file(A36).import@tc()


39

=A38.switch(manager,A37:empid)

/associate department table with employee table

40

=A37.switch(dept,A38:deptid)

/associate employee table with department table

41

=A40.select(province=="Beijing"&&dept.manager.province=="Beijing").(name)

/select

 

SPL handles such association in three steps: first, it creates association on department and employee; second, it creates association on employee and department; third, it directly selects the target result using the created associations. The association operations in the previous examples are all done with the switch function which possesses a feature: the original field values will be replaced with the associated records once the association is done, and the original record values will not exist any longer. If we want to keep the original record values, the join function can be used to perform the association. For example, A40 in the example can be written as:

A40=A37.join(dept,A38:deptid,~:dpt). At this time, “dept” is the associated records which can be referenced to perform the subsequent operations. And the switch function in the previous examples can all be used in this way.

 

Mixed association

During data analysis, we may encounter mixed associations where homo-dimension, primary-sub, and foreign key associations occur at the same time, and it is when the associative relations are very complex and need to be clearly sorted out. For example:

 

Based on the order table, order detail table, product information table, employee information table, travel information table, client information table, and city information table, the task is to calculate the sale amount of Heilongjiang products sold in each province by post-90s salesman who travel for more than 10 days.

 

The associative relations are shown below:

..

 

Python

emp4 = pd.read_csv("D:\data\Employee4.csv")

trv4 = pd.read_csv("D:\data\Travel4.csv")

emp_inf = pd.merge(emp4,trv4,on=["empid","name"])

years = pd.to_datetime(emp_inf.birthday).dt.year

emp_inf_c = emp_inf[(years>=1990) & (years<2000)&(emp_inf.time>=10)]

clt4 = pd.read_csv("D:\data\Client4.csv")

city4 = pd.read_csv("D:\data\City4.csv")

sale_location = pd.merge(clt4,city4,left_on='city',right_on='cityid')

pdt4 =   pd.read_csv("D:\data\Product4.csv")

pdt_location = pd.merge(pdt4,city4,left_on='city',right_on='cityid')

detail4 = pd.read_csv("D:\data\Detail4.csv")

order4 = pd.read_csv("D:\data\Order4.csv")

detail_pdt = pd.merge(detail4,pdt_location,on='productid',how="left")

order_sale_location = pd.merge(order4,sale_location,on='clientid',how="left")

order_sale_location_emp = pd.merge(order_sale_location,emp_inf_c,left_on='saleid',right_on='empid',how="left",suffixes=('_c', '_e'))

order_inf = order_sale_location_emp[order_sale_location_emp.empid.notnull()]

order_detail = pd.merge(order_inf,detail_pdt,on='orderid',how="left",suffixes=('_s', '_p'))

order_detail_Hljp = order_detail[order_detail.province_p=="Heilongjiang"]

res = order_detail_Hljp.groupby(['empid','name_e','province_s'],as_index=False).price.sum()

print(res)

 

 

Employee table and travel table

 

Select the post-90s employees

 

 

Client table and city table

 

Product table and city table

 

 

Order detail and producing city

Order and sale city

 

Order and employee

 

 

Order and order detail

Select

 

Group and aggregate

 

There are many tables in this example with complex associative relations which are homo-dimension association (one-to-one), primary-sub association (one-to-many), and foreign key association (many-to-one), respectively. If there exists an association of many-to-many, it is most likely wrong, and the association needs to be rechecked, otherwise, many-to-many association probably leads to memory explosion. As for such complex associations, the best method provided in Python is to use the merge function to associate every two tables and parse each association step by step, which may be a bit troublesome but less prone to errors.

 

SPL


A

B


43

=file("D:/data/Employee4.csv").import@tc()


44

=file("D:/data/Travel4.csv").import@tc()


45

=A44.join(empid,A43:empid,birthday)


46

=A45.select((y=year(birthday),y>=1990&&y<2000&&time>=10))


47

=file("D:/data/Client4.csv").import@tc()


48

=file("D:/data/City4.csv").import@tc()


49

=A47.join(city,A48:cityid,province)


50

=file("D:/data/Product4.csv").import@tc()


51

=A50.join(city,A48:cityid,province)


52

=file("D:/data/Detail4.csv").import@tc()


53

=file("D:/data/Order4.csv").import@tc()


54

=A52.join(productid, A51:productid,province:product_province)


55

=A54.group(orderid)


56

=A53.switch(orderid, A55:orderid;saleid, A46:empid;clientid, A49:clientid)


57

=A56.select(saleid).new(saleid.empid:empid,saleid.name:sale_name,clientid.province:sale_location,orderid.select(product_province=="Heilongjiang").sum(price):price)


58

=A57.groups(empid,sale_name,sale_location;sum(price):price).select(price)


 

SPL is quite capable to handle such complex associations, in which the homo-dimension, primary-sub, and foreign key associations are all very clear. SPL can also associate two associations simultaneously, which is very fast and less error-prone.

 

Summary

When performing foreign key association, Python still copies data, and only parses one association at a time. In addition, every association is independent, so the association created previously can not be reused later, instead, it has to be re-associated, which results in low efficiency in association.

 

On the contrary, SPL can reuse the associations that created previously, making the operation much more effective.