Association calculation in SPL - external storage join

 

The previous article “Association calculation in SPL - In-memory join” (In-memory join for short) presents the classification of association calculations in SPL and the programming methods for in-memory join.

When one or more association tables have a large amount of data and need to be stored in external storage, the in-memory join algorithms cannot be used. For this reason, SPL specifically provides external storage join algorithms.

When solving external storage join problems, there are similarities with in-memory join:

1. Clearly distinguish the type of join, and find the (logical) primary key participating in association;

2. Choose different SPL functions to calculate based on different join types (foreign key join and primary key join).

Foreign key join

We has introduced in In-memory join that the foreign key join refers to the equivalence join between a certain field of table A and the primary key of table B; table A is called the fact table and table B is called the dimension table; the field in table A that associates with the primary key of table B is called the foreign key that A points to B, and B is called the foreign key table of A.

For the foreign key join of external storage, SPL offers different algorithms according to three different situations of fact table and dimension table sizes.

Situation 1: the fact table is large and the dimension table is small

The ‘large’ and ‘small’ referred to here is relative to memory. The dimension table is small, which means that it can be entirely loaded into memory; the fact table is large, which means its size exceeds the capacity of memory.

In practice, situation 1 is quite common because the fact table is used to store increasingly growing events and easily become very large, while the dimension table is relatively stable in size as it is used to store the code information that changes little.

We know that the address is a concept in memory. Since only the dimension table is in memory, the foreign key of fact table cannot be addressized, which makes it impossible to perform pre-association described in In-memory join, and hence we have to establish an association temporarily - we call it instant addressization.

SPL uses the cs.switch (cs stands for cursor) function to implement instant addressization.

Assume the order table ‘orders’ (fact table) is associated with the product table ‘product’ (dimension table) on the product number ‘pid’, and now we want to group the orders by the provider ‘vendor’ in product table and aggregate the order amount of each group.

The code for instant addressization is roughly as follows:

A
1 =file("product.btx").import@b().keys@i(pid)
2 =file("orders.ctx").open().cursor(pid,amount)
3 =A2.switch(pid,A1)
4 =A3.groups(pid.vendor;sum(amount))

A1: load the entire dimension table into memory and create the primary key with index;

A2: create a cursor for the fact table and read only the required fields;

A3: apply the cs.switch function on the cursor of fact table to implement foreign key addressization. Since a delayed cursor is returned, the actual association will be performed only when the data is fetched;

A4: the grouping and aggregating calculation in this step can only be carried out based on the result returned from A3, which is different from the all-in-memory operation where the calculation can be performed based on the original data table. The reason is that the switch function used in all-in-memory operation will change the foreign key field of original data table, yet the current association is done during the process of fetching data with cursor.

Similar to the foreign key addressization of in-memory join, using cs.switch to implement foreign key join will also cause the loss of foreign key value, and is the same unable to implement the association of multi-field foreign keys. However, such problems can be solved with cs.join function.

Assume the order table and the product table are associated on two primary key fields ‘ptype’ and ‘pid’ of product table, the code for grouping and aggregating is roughly as follows:

A
1 =file("product.btx").import@b().keys@i(ptype,pid)
2 =file("orders_new.ctx").open().cursor(ptype,pid,amount)
3 =A2.join(ptype:pid,A1,~:pid_fk)
4 =A3.groups(pid_fk.vendor;sum(amount))

A1: load the entire dimension table into memory and create primary keys of the two fields;

A3: use the cs.join function to associate the product table on primary keys. Besides the existing fields, add a new field ‘pid_fk’ to store the reference address of the corresponding records. If no corresponding record is found, fill in the pid_fk field with null;

A4: reference the vendor field of product table using pid_fk to perform grouping.

Similar to in-memory join, the cs.switch and cs.join also provide @i and @d options. When the foreign key of fact table does not find a corresponding record in dimension table, @i means this fact table record is deleted from the result, while @d means that only the fact table record with no corresponding foreign key record is retained.

cs.switch@i is to associate the cursor of fact table with the dimension table and to filter out the cursor record that cannot be associated. However, it needs to fetch a record from the cursor before judging whether this record can be associated. Even if a record cannot be associated, this record is already generated.

SPL provides a specialized method for the cursor of composite table to improve the performance of filtering calculation. For details, visit: pre-cursor filtering. We can apply this method to the foreign key association of composite table cursor, which is SPL’s composite table cursor association and filtering mechanism.

Take the association between an order table and an employee table as an example, the code of the said mechanism is roughly as follows:

A
1 =file("employee.btx").import@b().keys@i(eid)
2 =file("orders.ctx").open().cursor(o_eid,amount;o_eid:A1
3 =A3.groups(o_eid.vendor;sum(amount))

In order to more easily distinguish the employee number ‘eid’ in the employee table, we write the employee number in the order table as ‘o_eid’.

In A2, write o_eid:A1 at the position of filter condition parameter of the composite table cursor, which allows SPL to determine whether a record can be associated before generating the record. If it can be associated, the address of the corresponding dimension table record will be assigned to o_eid at the same time, hereby implementing foreign key addressization.

If we only need to judge whether an order is associated with an employee, and don’t perform foreign key addressization, we can employ conventional conditional syntax:

A
1 =file("employee.btx").import@b().keys@i(eid)
2 =file("orders.ctx").open().cursor(amount;A1.find(o_eid))
3 =A3.total(sum(amount))

The pre-cursor filter condition in A2 is that A1.find(o_eid) is not null.

Since SPL does not provide the syntax corresponding to the cs.join function, we have to read records before processing in case of multi-field foreign key.

The cs.switch and cs.join functions can implement parallel computing based on multi-cursor. In this example, the cursor@m option can be used in A2 to enable multi-cursor, and the following code does not need to be changed.

For multiple small dimension tables or multi-layer small dimension tables, we can read the dimension tables into memory in advance, create an index and perform pre-association between dimension tables before traversing the fact table for calculation. The index on dimension table and the pre-association between dimension tables can be reused multiple times.

In addition to instant addressization, SPL also provides a series of high-performance algorithms for situation 1, such as the foreign key sequence-numberization mechanism. This mechanism is to convert the value of foreign key of fact table to the sequence number of record in dimension table in advance, allowing us to perform association with the faster sequence number positioning method.

Still, take the association between an order table and an employee table as example. We can convert the values of foreign key field ‘o_eid’ of order table to the sequence numbers of employees in employee table (dimension table) in advance to prepare for the association on sequence number. The code is roughly as follows:

A
1 =file("employee.btx").import@b().keys@i(eid)
2 =file("orders.ctx").open().cursor(oid,o_eid,amount,…)
3 =A2.run(o_eid=A1.pfind(o_eid))
4 =file("orders_new.ctx").create(#oid,o_eid,amount,…)
5 =A4.append@i(A3)

A3: use the pfind to convert the values of ‘o_eid’ of order table to the sequence numbers in dimension table;

A4, A5: create new order tables, and write the result of A3 to the new composite table to implement foreign key sequence-numberization.

At this point, the new order table can be associated with the employee table on sequence number. The code is:

A
1 =file("employee.btx").import@b()
2 =file("orders_new.ctx").open().cursor(o_eid,amount)
3 =A2.switch(o_eid,A1:#)
4 =A3.groups(o_eid.vendor;sum(amount))

A1: the employee table is used for sequence number positioning; there is no need to establish a primary key and index, which saves the memory space;

A3: the switch function uses ‘#’, which means the association is performed on sequence number. The directly use of ‘o_eid’ as sequence number to fetch the corresponding record in dimension table can obtain better performance than instant addressization.

If @i is appended to the switch in A3, the order record that cannot be associated on sequence number will be deleted. In this case, we can use the composite table cursor to associate and filter. The code is roughly as follows:

A
1 =file("employee.btx").import@b().keys@i(eid)
2 =file("orders.ctx").open().cursor(o_eid,amount;o_eid:A1:#)
3 =A3.groups(o_eid.vendor;sum(amount))

A2: the pre-cursor filtering parameter of composite table is written as o_eid:A1:#, which means the records of A1 are fetched by taking o_eid as sequence number. If the result is null, this fact table record will not be generated.

Of course, the sequence-numberization method can also be used for in-memory join, yet in-memory join can pre-associate the value of foreign key as address, and there is little difference in performance between using address and using sequence number to access the dimension table. Nevertheless, if the sequence-numberization is already made to dimension table, using the sequence-numberization technology is still more advantageous when performing the pre-association in memory.

Sometimes the dimension table will also be filtered. In view of this, SPL provides high-performance algorithms, which will not be described in detail here. For further information, please refer to: Index reuse, aligned sequence.

Situation 2: the dimension table is large and the fact table is small

In this case, the fact table can be entirely loaded into memory and the dimension table is stored in external storage. Associating the fact table with dimension table is essentially a batch search operation on external storage.

SPL provides the T.joinx@q function to accomplish the foreign key association of situation 2. This function requires the large dimension table to be ordered by primary key.

Assume the order table (fact table) is small and the customer table (dimension table) is large, we need to store the customer table in order by primary key in advance. After preprocessing in this way, the code for associating the two tables on customer number is roughly as follows:

A
1 =file("customer.ctx").open()
2 =file("orders.btx").import@b(cid,amount)
3 =A2.joinx@q(cid,A1:cid,area,discount)
4 =A3.groups(area;amount*discount)

It should be noted that T.joinx@q requires that the large dimension table cannot participate in calculation as cursor. For the composite table, it should participate in the calculation as the form in A1 (real table), which is determined by its computing principle. For detailed introduction on the principal as well as more scenarios, visit: Search large dimension table.

Once the table in external storage is stored in order, it also needs to consider minor modifications on data, regular data appending, and bulk deletion. For details, visit: Ordered storage of SPL. For any orderly storage of data to external memory that appears in the subsequent introduction, you can refer to this article to solve the problem related to data change.

Situation 3: both the fact table and dimension table are large

When both the fact table and dimension table are large (the fact table is usually larger), it’s difficult to calculate at high speed anyway. SPL provides the one-side partitioning mechanism to obtain the best possible performance.

SPL’s cs.joinx@u function encapsulates the one-side partitioning mechanism, and also requires the dimension table to be ordered by primary key. Assume both the order table and dimension table are large, we need to store the customer table in order by primary key in advance.

After preprocessing in this way, the code for associating the two tables on customer number is:

A
1 =file("customer.btx")
2 =file("orders.btx").cursor@b(c_id,amount)
3 =A2.joinx@u(c_id,A1:cid,area,discount)
4 =A3.groups(area;amount*discount)

Note that the T.joinx@u also requires that the large dimension table cannot appear as cursor. For bin file, it should participate in calculation as the form in A1 (file).

For detailed introduction on the principle as well as more scenarios, visit: one-side partitioning.

Primary key join

In-memory join has presented that the primary key join refers to the equivalence join between the primary key of table A and the primary key or part of primary keys of table B. The primary key join can be subdivided into homo-dimension join and primary-sub join.

Homo-dimension join

The association between the primary key of table A and the primary key of table B is called the homo-dimension join, and A and B are mutually referred to as homo-dimension table. The homo-dimension tables are in one-to-one relationship, and can be treated as a single table logically.

Since multiple tables in homo-dimension relationship often grow simultaneously, it is unlikely to have a significant difference in size.

SPL offers the ordered merge algorithm (joinx function) to implement homo-dimension join in external storage. This algorithm requires both the homo-dimension tables to be stored in order by primary key in advance.

Assume the primary keys of both the customer table ‘customer’ and the customer information table ‘customer_info’ are customer number, and store different attributes of customer respectively, now we want to calculate the sum of “acctbal” in the customer and “fund” in the customer_info.

To solve this problem, we need to store both tables in order by the customer number ‘cid’ in advance, and then use the joinx function to implement homo-dimension join. The code is:

A
1 =file("customer.ctx").open().cursor(cid,acctbal)
2 =file("customer_info.ctx").open().cursor(cid,fund)
3 =joinx(A1:c,cid;A2:ci,cid)
4 =A3.new(c.acctbal+ci.fund:newValue)

A1, A2: define the cursor for the customer and customer information composite tables;

A3: ordered merge and join of the two cursors on primary key. When using joinx, the data of each cursor participating in association must be ordered by the association field;

A4: take out the fields ‘c’ and ‘ci’ from A3 and add them together to form a new cursor.

The joinx function without option, with @1 option and with @f option represent the inner join, left join and full join respectively.

The joinx function supports multi-thread parallel computing. When performing parallel segmentation, we must specify one cursor to act as the base cursor in order to ensure the primary key records of two files with the same primary key are assigned to their corresponding segments.

Therefore, the above code needs to be modified as follows. Only in this way can parallel computing proceed correctly.

A
1 =file("customer.ctx").open().cursor(cid,acctbal;;4)
2 =file("customer_info.ctx").open().cursor(cid,fund;;A1)
3 =joinx(A4:c,cid;A5:ci,cid)
4 =A6.new(c.acctbal+ci.fund:newValue)

A1: use the basic method to define multi-cursor, with the parallel number being 4;

A2: generate multi-cursor based on A1. Since the composite table can define the primary key (the field name with #) at creation, there’s no need to write field names explicitly and the search is performed by automatically matching the primary keys (names of primary key fields in the two tables can be different).

For the introduction on the principal of ordered merge and join, visit: SPL Order-based Merge Join.

Primary-sub join

Primary-sub join refers to the association between the primary key of table A and part of the primary keys of table B; A is called the primary table and B is called the sub table. The primary-sub tables are in one-to-many relationship.

Similar to homo-dimension tables, primary-sub tables generally grow simultaneously and need also to be stored to external storage.

SPL also adopts the ordered merge mechanism to calculate primary-sub join, and the primary-sub tables need also be stored in order by primary key in advance.

Assume the order table ‘orders’ (primary key ‘oid’) and the order detail table ‘order_detail’ (primary keys ‘oid’ and ‘pid’) are associated on the primary key ‘oid’ of orders and part of primary keys ‘oid’ of order_detail; the orders is the primary table and the order_detail is the sub table.

Now we want to filter out the orders with a discount greater than 0.05, and then group the orders filtered out by pid, and finally aggregate the amount of each group and average the discount.

To solve this problem, we need to store the two tables in order by primary key as composite tables in advance, the code to associate the two tables is roughly as follows:

A
1 =file("orders.ctx").open().cursor(oid,amount)
2 =file("orders_detail.ctx").open().cursor(oid,pid;discount>0.05)
3 =joinx(A1:o,oid;A2:od,oid)
4 =A3.groups(od.pid;sum(amount),avg(discount))

A2: the order_detail adopts the pre-cursor filtering algorithm, which has better performance;

Similar to homo-dimension join, it also needs to determine the base table when calculating primary-sub join with multi-cursor, except that only the primary table can be used as base table for primary-sub join. The code is roughly as follows:

A
1 =file("orders.ctx").open().cursor@m(oid,amount)
2 =file("orders_detail.ctx").open().cursor(oid,pid;discount>0.05;A1)
3 =joinx(A1:o,oid;A2:od,oid)
4 =A3.groups(od.pid;sum(amount),avg(discount))

A1: it does not specify the parallel number and uses SPL’s default one;

A2: muilt-curor of sub-table is based on primary table of A1.

For the primary key join in external storage, SPL also provides other high-performance algorithms, visit: association positioning, attached table for details.