Python vs. SPL 10 -- One-to-N Association

 

In data analysis, we usually encounter the scenarios of associating two or more tables, and the association of tables can be divided into the following categories: one-to-one, one-to-many, many-to-one, and many-to-many. The association of one-to-one means that one record of a table corresponds to only one record of another table; the association of one-to-many indicates that one record of a table corresponds to multiple records of another table. This article will compare the computational abilities of Python and SPL in one-to-N association.

 

Homo-dimension association

One-to-one association is also known as homo dimension association, and the associated two tables are homo dimension tables to each other. For example:

 

Based on the final score table and the usual score table, calculate students’ scores of the semester (the percentages of final scores and usual scores are 0.7 and 0.3 respectively).

 

Final score table

studentid

fscore

s1087

68.36422

s1049

78.17515

s1018

82.61762

 

Usual score table

studentid

uscore

s1050

75.11157

s1039

88.02989

s1091

87.95997

 

Python

import pandas as pd

fexam_file="D:/data/FExamRes.csv"

uperf_file="D:/data/UPerformance.csv"

fexam=pd.read_csv(fexam_file)

uperf=pd.read_csv(uperf_file)

scores=pd.merge(fexam,uperf,on="studentid",how="outer").fillna(0)

scores["total"]=0.7*scores["fscore"]+0.3*scores["uscore"]

print(scores)

 

 

 

The scores of final exam

The usual scores

Add 0 after associating

Calculate the scores

 

 

The merge function in Python can associate two Dataframes. Some students may lack a certain score, so we use outer join, then the score is recorded as 0 when there is no score; in this way, the final score can be obtained after weighed sum. After associating, a new Dataframe will be generated in Python, in which data are copied. In short, this whole processing takes up both a lot of time and plenty of memory.

 

SPL


A

B

1

D:/data/FExamRes.csv


2

D:/data/UPerformance.csv


3

=file(A1).import@tc()


4

=file(A2).import@tc()


5

=join@f(A3:fexam,studentid;A4:uperf,studentid)

Homo dimension association

6

=A5.new([fexam.studentid,uperf.studentid].conj().id().~:studentid,fexam.fscore:fscore,uperf.uscore:uscore,0.7*fexam.fscore+0.3*uperf.uscore:total)


 

The join function in SPL performs the association, @f option is outer join, and in A6 the final scores are calculated. The association in SPL only uses the primary key (studentid in this example) to create association on two tables and does not copy the data, so it occupies a little memory and performs very fast.

 

Multiple homo-dimension tables

During associative calculation, we may often encounter scenarios where three or more homo-dimension tables need to be associated. For example:

 

Apart from the final scores and usual scores, students also have the scores of elective courses. And the percentages of the three are 0.6, 0.3, and 0.1 respectively.

 

The final score table and usual score table are the same as the previous example, and the elective score table is shown below:

 

Elective score table

studentid

escore

s1051

74.66936

s1035

57.33134

s1047

61.29318

 

Python

#continue to use fexam and uperf

elec_file="D:/data/ElectiveScore.csv"

fexam=fexam.set_index(keys="studentid")

uperf=uperf.set_index(keys="studentid")

elec = pd.read_csv(elec_file,index_col='studentid')

student_score = pd.concat([fexam,uperf,elec],axis=1,join='outer',sort=True).fillna(0).reset_index()

student_score['total'] = 0.6*student_score.fscore+0.3*student_score.uscore+0.1*student_score.escore

print(student_score)

 

 

Set the index

 

Set the index while reading data

Associate multiple homo-dimension tables

 

 

Aggregate the scores

 

 

The merge function in Python can’t be used any longer for associating multiple homo-dimension tables because it only supports association of two tables. It will be a bit troublesome to execute the merge function twice when associating three tables. However, Python offers the concat function which can perform association based on the index, making association of multiple homo-dimension tables easier. However, in this way, we are required to master the usage of another function.

 

SPL


A

B

/A3 is final score table, and A4 is usual score table


8

D:/data/ElectiveScore.csv


9

=file(A8).import@tc()


10

=A3.keys(studentid)

/set the primary key

11

=A4.keys(studentid)


12

=A9.keys(studentid)


13

=join@f(A3:fexam;A4:uperf;A9:elec)

/associate multiple homo-dimension tables

14

=A13.new([fexam.studentid,uperf.studentid,elec.studentid].conj().id().~:studentid,fexam.fscore:fscore,uperf.uscore:uscore,elec.escore:escore,0.6*fexam.fscore+0.3*uperf.uscore+0.1*escore:total)

/calculate the total scores

 

The join function in SPL can associate two tables as well as three or more tables. After the primary key is set, the function will perform association according to the primary key automatically, and the subsequent operation is the same as that of two-table association. In SPL, it is very easy for us to adopt one method to other operations, which sparing the trouble of memorizing the usage of other functions.

 

Primary-sub tables

One-to-many association indicates that one record of a table is able to correspond to any record of another table. In such association, we call the table at the “one” end as primary table, and the table at the “many” end as sub table. For example, in the order table and order detail table, one order corresponds to one distinct orderid, but every orderid may correspond to many pieces of order detail information. And the order table is primary table and order detail table is sub table.

 

Based on the above two tables, calculate the consumption of customers of each province in s150 stores.

 

Order table (the primary table of order detail table)

storeid

orderid

client

location

s101

o10001

c104231

Beijing

s101

o10002

c107032

Beijing

s101

o10003

c108875

Beijing

 

Order detail table (the sub table of order table)

storeid

orderid

productid

price

s101

o10001

p1078

1273

s101

o10002

p1091

131

s101

o10002

p1008

780

 

Python

detail_file="D:/data/Detail.csv"

order_file="D:/data/Order.csv"

detail=pd.read_csv(detail_file)

order=pd.read_csv(order_file)

detail150=detail.query("storeid=='s150'")

order150=order.query("storeid=='s150'")

order_detail=pd.merge(order150,detail150,on="orderid",how="inner")

loc_consume=order_detail.groupby("location").price.sum()

print(loc_consume)

 

 

 

 

The order details of s150 stores

The orders of s150 stores

Associate two tables

Group and aggregate

 

 

The association in this example is performed on two tables, and the associative relation is also very clear. We can just use orderid to associate two tables and then group and aggregate the records according to location. Here the operation is one-to-many association, which can be converted to one-to-one association as well:

1. Aggregate the consumption amount of each order in the order detail table;

2. Associate order table with the aggregation result of Step 1, then the association is about one-to-one;

3. Group and aggregate according to location. The code is written as:

1.      order_sum=detail150.groupby(“orderid”).price.sum()

2.      order_detail=pd.merge(order150, order_sum,on="orderid",how="inner")

3.      loc_consume=order_detail.groupby("location").price.sum()

 

SPL


A

B


16

D:/data/Detail.csv


17

D:/data/Order.csv


18

=file(A16).import@tc()


19

=file(A17).import@tc()


20

=A18.select(storeid=="s150")

/select

21

=A19.select(storeid=="s150")


22

=A20.groups(orderid;sum(price):sprice)

/group and aggregate

23

=join(A21:order,orderid;A22:consume,orderid)

/associate

24

=A23.groups(order.location;sum(consume.sprice):consume)

/group and aggregate

 

As for primary-sub tables association, the elaborately designed SPL advocates the method of grouping and aggregation -- homo-dimension association. Because in this way the associative relation is even clearer and the operation is less prone to errors when associated tables are too many and too complex. Also, SPL can associate first and aggregate later in the same way that Python does, and the code is like:

A25=join(A21:order,orderid;A20:detail,orderid)

A26=A25.groups(order.location;sum(detail.price):consume)

 

Multi-layer primary-sub tables

There are also primary-sub tables of multiple layers such as:

 

Apart form the same order table and order detail table of last example, a new store information table is added.

 

Store information table (primary table)

storeid

province

s101

Heilongjiang

s102

Beijing

s103

Tianjin

 

Based on the above three tables, aggregate the consumption of stores in each province.

 

Python

#continue to use detail and order tables

store_file="D:/data/Store.csv"

store = pd.read_csv(store_file)

detail_sum = detail.groupby(['storeid','orderid']).price.sum()

order_detai = pd.merge(order,detail_sum,on=['storeid','orderid'])

store_amount = order_detai.groupby('storeid',as_index=False).price.sum()

store_inf = pd.merge(store,store_amount,on=["storeid"])

province_sale = store_inf.groupby('province',as_index=False).price.sum()

print(province_sale)

 

 

 

Aggregate the sales amount of each order

 

Associate orders with the aggregated sales amount

Aggregate the sales amount of each store

 

Associate store information with sales amount of stores

Group and aggregate according to the provinces where stores locate

 

 

 

The order detail table is the sub table of order table, and order table is the sub table of store table, which form two layers of primary-sub tables together. The operation follows the method of aggregation on sub tables -- association on primary table, which make the associative relation clearer, and does not lead to too big associated tables or wrong association of many-to-many. This example shows a multi-field association, which can be easily done in Python as well.

 

SPL


A

B


28

D:/data/Store.csv


29

=file(A28).import@tc()


30

=A18.groups(storeid,orderid;sum(price):sprice)

/aggregate the sales amount of each order

31

=join(A19:order,[storeid,orderid];A30:detail_sum)

/associate orders with aggregated sales amount of orders

32

=A31.groups(order.storeid;sum(detail_sum.sprice):sales)

/aggregate sales amount of each store

33

=join(A29:store,storeid;A32:store_amount)

/associate store information with sales amount of stores

34

=A33.groups(store.province;sum(store_amount.sales):sales)

/group and aggregate according to the provinces where stores locate

 

It is also easy for SPL to complete the association with the same method introduced previously.

 

One primary table & multiple sub tables

It is also very common that there is one primary table and multiple sub tables, such as order table, order detail table and order return table. For example:

 

Part of the order table, order detail table, and payment collection table are shown below:

 

Order table (primary table)

orderid

clientid

date

10012

100658

2019/2/13

10023

103478

2019/1/12

10040

108013

2019/1/4

 

Order detail table (sub table)

orderid

productid

price

10012

3018

428.5

10012

3019

349.2

10023

3019

349.2

 

Payment collection table (sub table)

orderid

term

payment

10012

1

259.2

10012

2

259.2

10012

3

259.3

 

Based on the above three tables, calculate the orders whose payments are not fully collected.

 

Python

detail_file="D:/data/Detail_2.csv"

order_file="D:/data/Order_2.csv"

pay_file="D:/data/Payment_2.csv"

detail_2 = pd.read_csv(detail_file)

order_2 = pd.read_csv(order_file,index_col='orderid')

pay = pd.read_csv(pay_file)

detail_order = detail_2.groupby('orderid').price.sum()

pay_order = pay.groupby('orderid').payment.sum()

order_d_p = order_2.join([pay_order,detail_order])

nopay_order = order_d_p.query('price>payment')

print(nopay_order)

 

 

 

 

 

 

Aggregate sales amount of orders

Aggregate amount of collected payments

Associate primary table with two sub tables

Select

 

Order detail table is the sub table of order table, and order return table is also the sub table of order table; so there is one primary table and two sub tables. This operation still follows the method of aggregation on sub tables -- association on primary table. But Python does differently to complete the operation because the merge function cannot associate three tables simultaneously, instead, the join function is provided to associate three tables using index. It creates a new wide table, and uses the wide table to select the result, whose associative logic is slightly different from the previous one.

 

SPL


A

B


36

D:/data/Detail_2.csv


37

D:/data/Order_2.csv


38

D:/data/Payment_2.csv


39

=file(A36).import@tc()


40

=file(A37).import@tc()


41

=file(A38).import@tc()


42

=A39.groups(orderid;sum(price):amount)

/aggregate sales amount of orders

43

=A41.groups(orderid;sum(payment):payment)

/aggregate amount of collected payments

44

=join(A40:order,orderid;A42:amount;A43:payment)

Associate the primary table with two sub tables

45

=A44.select(amount.amount>payment.payment).(order)

/select

 

SPL still follows the logic described before, using the join function to associate multiple tables step by step, then just displaying the order table after selecting.

 

Summary

When performing association, Python copies two tables to create a new wide table, and uses the wide table to complete the subsequent operations, which consumes more memory and degrades the efficiency. Besides, when associating multiple tables, Python needs to use other functions and some particular conditions, otherwise, it has to use the merge function many times, which is a bit cumbersome.

 

While SPL only establishes associative relation on two tables rather than copying data, and uses the relation to complete the operation quickly, which occupies less memory and achieves high efficiency. In addition, when associating multiple tables, SPL can still use the same method of two-table association to deal with N tables without complex detour.

 

When associating primary table with sub tables (one-to-many), we first aggregate the sub tables and then associate them with primary table, which forms the homo-dimension association, so they are introduced together.