SPL Programming - 10.3 [Association] Merge

 

If there are two Excel files, such as the order table mentioned before, they may be made by two persons respectively, and there may be some duplicate records. We need to pick out these duplicate records and merge them into a table without duplicate records.

Originally, this is a very simple task of set union. We have learned set operations and can directly use operators such as intersection and difference. However, we also know that different records in the table sequence will never be equal, and the records in different table sequences will never be equal. Of course, the table sequences read from Excel files are different table sequences. In this case, they are not suitable for set operations.

SPL provides merge operation function to complete this task:

A B
1 =T(“data1.xlsx”) =T(“data2.xlsx”)
2 =[A1,B1].merge@uo() >T(“data.xlsx”,A2)

Quite simply, the merge()function compares the contents of the records. The @u option indicates to get the union. Without this option, it will directly concatenate the two table sequences, that is, to get the concatenation (i.e. |). We’ll talk about @o later.

Similarly, the merge() function can also get the intersection, that is, to get the duplicate records, just replace @u with @i.

A B
1 =T(“data1.xlsx”) =T(“data2.xlsx”)
2 =[A1,B1].merge@io() >T(“data.xlsx”,A2)

The merge() function will compare all fields of the whole record. If any field is different, it will be considered that the two records are different. Such comparisons are time-consuming and sometimes unnecessary. For example, for this order problem, we believe that the producer is conscientious and responsible. As long as the order number is correct, the order content will not be wrong. However, there may be duplication and omission in the order collection process, resulting in duplicate order numbers in the two files. And we only need to pick out the records of duplicate order numbers.

A B
1 =T(“data1.xlsx”) =T(“data2.xlsx”)

By adding a parameter to the merge() function, we can make it compare only this field.

Or for records with primary key set, the merge() function will only compare the primary key.

A B
1 =T(“data1.xlsx”).keys(ID) =T(“data2.xlsx”).keys(ID)
2 =[A1,B1].merge@uo() >T(“data.xlsx”,A2)

The operation is same for intersection, just replace @u with @i.

If the situation is more complicated, both producers may make mistakes. We need to see if the records with the same order number in the two files have different contents. Therefore, we need to select the records with the same order number and compare the whole contents to see if they are the same.

A B
1 =T(“data1.xlsx”) =T(“data2.xlsx”)
2 =[A1,B1].merge@io(ID) =[B1,A1].merge@io(ID)
3 =[A2,B2].merge@do() =A3.(ID)

A2 will calculate those records in A1 with duplicate order numbers as B1, but retain the record sequence composed of A1 records. Correspondingly, B2 will retain the records in B1 that duplicate the order number of A1. A2 and B2 are both the intersection of A1 and B1, but they are only the intersection in the sense of comparing the order number, and the whole contents are not necessarily the same. Then use merge@d() in A3 to calculate the full content difference set between A2 and B2, so that we can get the records in A2 that have different content with that of the corresponding orders in B2, and get the order numbers of these orders in B3, that is, the orders to be reviewed.

This complexity occurs when set operations involve structured data. The same primary key (or a relatively critical field) does not mean that the whole record is the same. When we need to judge that the records are the same when doing intersection, union or difference, we may use only the primary key or the whole record. If only the primary key is used, the exchange law will not be satisfied. This is not only that the order of members may be different, but also the content may be totally different. Pay special attention when writing code.

Now let’s talk about this @o option. By default, the merge()function assumes that the data is ordered (for example, from small to large), thus the comparison speed is very fast, and no sorting is needed. The @o option tells the merge function that the data is out of order and needs to be sorted first. In our current example, the orders are arranged in order according to the order number. It doesn’t matter if @o is not written.

Another note is that the merge()function is preceded by a sequence ([A1,B1]), which can calculate the union or intersection results of multiple record sequences at the same time. However, it is relatively uncommon, and we won’t give examples here.

The merge()function is also valid for sequences, but single valued sequences generally use the set operator directly, and rarely use the merge() function.


SPL Programming - Preface
SPL Programming - 10.2 [Association] Foreign key
SPL Programming - 10.4 [Association] Join