Performance Optimization - 7.4 [Merge and join] Attached table

 

The foreign key association may be established between a certain fact table and multiple dimension tables arbitrarily, even multiple foreign key associations can be established between a same pair of fact table and dimension table. However, the homo-dimension tables association and the primary-sub table association established based on the primary keys of two tables can’t be so arbitrary.

The homo-dimension tables association is an equivalent relationship (If A is homo-dimension with B, then B is homo-dimension with A; If A is homo-dimension with B, and B is homo-dimension with C, then A is homo-dimension with C), and hence we can take advantage of the homo-dimension relationship to divide all the data tables into several groups. In this way, the data tables in the same group are associated with each other in the homo-dimension relationship, rather than with the tables outside the group. In other words, if the homo-dimension association is to be performed, it definitely occurs between the tables within the same group. Among the homo-dimension tables in the same group, there is usually a largest table whose primary key values are complete, and the primary key values of other tables belong to a subset of this complete values. For example, the customer table has the primary key values of all customers, while the VIP customer table has only the primary key values of part customers but has more attributes of the VIP customers.

The primary-sub table association is slightly more complicated, but there will be some fixed characteristic. In a reasonable data structure design, the sub-table will only establish the association with the unique primary table. For example, the primary table of the order detail table will only be the order table, and will not be another, hence the primary table is unique from the perspective of the sub-table.

Based on this understanding, we can bind together the homo-dimension table and primary-sub table of the same group for storage.

For the homo-dimension tables in the same group, first find out the table with complete primary key values, which is called the base table, and the remaining homo-dimension tables are called the attached tables. After the base table is determined, the fields of attached tables will be stored as the additional fields of base table records; or as the fields of base table, but these fields have no value for many records.

For the primary-sub table relationship, take the primary table as the base table, and sub-table as the attached table. The fields of the sub-table are also considered as the additional field of primary table records. The difference is that the value of these additional fields is a set, and the length of the value set of the additional fields from the same sub-table is the same (as the additional field of the same primary table record). Similarly, these additional fields may also have no value.

Due to the relatively fixed characteristic of the homo-dimension tables association and the primary-sub table association, binding storage will not affect the association relationship, nor will it affect the foreign key association with other tables.

In doing so, we can obtain the following benefits in terms of performance:

1)The base table and the attached table have common primary keys. When the fields of attached table are stored as the additional fields of base table records, only one set of primary keys need to be stored, and there is no need to store primary keys (associated with the base table) of attached table again. As a result, the storage amount will be smaller. When the columnar storage method is adopted, the amount of data to be read during association will also become less.

2)The fields of attached table, as the additional field of base table records, can be directly referenced (the fields from the sub-table is a set, using different reference method), and there is no need to do association and comparison, thereby reducing the amount of calculation. In particular, if the base table is filtered, the attached table will be filtered automatically, therefore, there is no need to use the association location method introduced in the previous section, and vice versa.

3) As the field of base table records, the attached table fields are bound together with the base table records, hence it is naturally synchronized during segmentation, and no special segmentation following is required.

However, this storage method also has disadvantages. Because the storage scheme becomes more complicated, there will be a lot of unnecessary judgments when referencing additional fields.

Generally, when the primary key or the association is relatively complex, for example, there are multiple primary key fields, or the N is bigger in the 1:N ratio of primary-sub table association (it means that there will be more comparisons in conventional association), using the attached table scheme will have greater advantages. For the homo-dimension tables association, if the primary key is single and simple, the advantages of the attached table scheme are not obvious, and may even have disadvantages.

Theoretically, in the primary-sub table association relationship, the attached table can have its own attached table, but it is not very common.

Let’s take the primary-sub table as an example.

SPL implements the attached table function on the composite table, and it needs to specify the additional field when the composite table is created.

A
1 =file(“orders.ctx”).open().cursor()
2 =file(“details.ctx”).open().cursor()
3 =file(“order_detail.ctx”).create(#ID,…)
4 =A3.attach(detail,#seq,…)
5 =A3.append@i(A1)
6 =A4.append@i(A2)

In A3, a conventional composite table is created, and an attached table is added based on A3 in A4 where there will be a name and fields of the attached table. For the sub-table, you need to design its own primary key (if it has a common primary key with the primary table, there is no need to specify), and then append the data like a normal data table. In this way, SPL will attach the records to correct primary table records based on the primary key of sub-table. It should be noted that except for the primary key, the base table and attached table cannot have fields with the same name, otherwise confusion will occur.

The reason why it is called the composite table is because it is a combination of the base table and the attached table. The base table and the attached table in composite table is called the real table.

After creating the composite table with attached table, the field of the attached table can be referenced in calculations.

A
1 =file(“order_detail.ctx”).open()
2 =A1.cursor(dt,detail.sum(quantity):quantity)
3 =A2.groups(dt;sum(quantity))

The quantity is the field of attached table detail. Because it is the sub-table, the fetched data is a set. When referencing from the primary table, the aggregation operation is required.

The records of the sub-table can also be restored:

A
1 =file(“order_detail.ctx”).open()
2 =A1.cursor(dt,detail{price,quantity}:amount)
3 =A2.run(amount=amount.sum(price*quantity))
4 =A3.groups(dt;sum(amount))

Since there are multiple sub-table records, which will be used as one field of the cursor of primary table after restoring, so the fetched data is a table sequence. Actions such as generating a table sequence are more complicated, and will lose performance and may offset the advantages of reducing associations.

The base table field can also be referenced from the attached table:

A
1 =file(“order_detail.ctx”).open().attch(detail)
2 =A1.cursor(dt,price,quantity)
3 =A2.groups(dt;sum(price*quantity))

Just write directly when referencing the base table fields, and the performance will be better than the above method of generating table sequence field.

All these operations can support multi-cursor.

It should be noted that when we introduce these association algorithms, we often say that they may not always achieve better performance. Association is a complex operation, and the implementation code of its optimization algorithm is also very complex. Although most of these algorithms can reduce the computational complexity from the theoretical analysis point of view, when the actual code is very complex, the impact on engineering practice cannot be ignored.

All these algorithms have been verified by actual testing and can indeed optimize the performance in some scenarios, but not in all scenarios. As for which algorithm to use, it should be selected according to the actual situation after being familiar with these algorithms.