SPL Practice: Data Comparison between Different Types of Databases

 

Data comparison between databases of different types means comparing data between two tables having same logical structures in different types of databases to find the differences.

Problem statement

The difficulty of data comparison needs to overcome is the differences of data types and the data processing approaches between different database products. The problem is that different results will be obtained after the same data is imported to and exported from different databases.

Common data types mainly include numeric, string and date. Now we use Oracle and MySQL to respectively create data tables that both contain all the three commonly seen data types and have same simple logical structure, compare differences of the two databases, and explain how to compare data between different types of databases using SPL.

Create a data table in Oracle:

CREATE TABLE "TEST001" (
"F1" NUMBER(2, 0),
"F2" VARCHAR2(100),
"F3" DATE,
"F4" BINARY_DOUBLE
)

Create a data table in MySQL:

CREATE TABLE `test001` (
`f1` int,
`f2` varchar(100),
`f3` datetime,
`f4` decimal(10,2)
)

Import data to the database

SPL code of generating the test data and importing it to the database according to the above structures:

A
1 =rand@s(1)
2 =10.new(~:f1,rands("qwertyuiopasdfghjklzxcvbnm",5):f2,datetime("2011-11-11 11:11:11"):f3,rand(99999)/100:f4)
3 >A2(3).run(#2=null),A2(5).run(#2="")
4 =connect@l("mysql8")
5 =A4.update(A2,test001,f1,f2,f3,f4;f1)
6 >A4.close()
7 =connect@l(“oracle12c”)
8 =A7.update(A2,test001,f1,f2,f3,f4;f1)
9 =A7.close()

Differences between different types of databases

Below is data imported to the database:

Oracle:

f1 f2 f3 f4
1 kqipd 2011-11-11 11:11:11 156.82
2 xvvak 2011-11-11 11:11:11 17.53
3 null 2011-11-11 11:11:11 708.24
4 twtgq 2011-11-11 11:11:11 825.2
5 null 2011-11-11 11:11:11 628.82
6 hdjeb 2011-11-11 11:11:11 3.12
7 zurnb 2011-11-11 11:11:11 463.28
8 khcdv 2011-11-11 11:11:11 903.9
9 dhdhn 2011-11-11 11:11:11 956.73
10 vkacd 2011-11-11 11:11:11 647.69

Myqsl:

f1 f2 f3 f4
1 kqipd 2011-11-11 11:11:11 156.82
2 xvvak 2011-11-11 11:11:11 17.53
3 null 2011-11-11 11:11:11 708.24
4 twtgq 2011-11-11 11:11:11 825.2
5 2011-11-11 11:11:11 628.82
6 hdjeb 2011-11-11 11:11:11 3.12
7 zurnb 2011-11-11 11:11:11 463.28
8 khcdv 2011-11-11 11:11:11 903.9
9 dhdhn 2011-11-11 11:11:11 956.73
10 vkacd 2011-11-11 11:11:11 647.69

In Oracle, the record where f1 field is 5 has null value instead of an empty string under f2. This is because Oracle does not distinguish the empty string (‘’) from null.

Approach 1: Processing in SQL

First convert the empty string in MySQL to null and then make the comparison; otherwise, data in both sides will never be equal. If the field name is already known, we can perform the conversion directly in SQL. For example, select f1,if(f2='',null,f2) as f2,f3,f4 from test001 order by 1.

Approach 2: Processing in SPL

When only the table name is known but data structure is unknown, we can first write SQL to return the table’s structure (column names and data types) and then compose the SQL of converting the empty string to null in SPL. Below is the code:

A
1 =connect@l("mysql8")
2 =A1.query("SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME ='test001'")
3 =A2.(if(#2=="varchar","if("/#1/"='',null,"/#1/") as"/#1,#1)).concat@c()
4 =A1.query@x("select"/A3/"from test001")

Small-scale data comparison

This refers to the comparison scenarios where the size of data to be compared is relatively small and can fit into the memory.

When primary key is specified

Still use the above data structure and suppose f1 is the primary key, the data comparison code is as follows:

A
1 =connect@l("mysql8")
2 =A1.query@x("select f1,if(f2='',null,f2) as f2,f3,f4 from test001 order by 1")
3 =connect@l("oracle12c")
4 =A3.query@x("select * from test001 order by 1")
5 =join@f(A2:t1,f1;A4:t2,f1)

A5 performs full join on A2’s table sequence and A4’s table sequence according to f1 (the primary key).

The result shows that no records are intersected. Yet as f1 values in both sides are from 1 to 10, the two tables should be equal. What is the reason behind this?

Because, by default, SPL’s join()uses hash values to make the comparison. Oracle only has the number type. Though we can specify the range of numbers through the parameter defined using DDL number (for determining the corresponding data type in Java), Oracle JDBC will not handle the data type but always return the decimal type. As a result, the f1 values on both sides are equal but they have different data types as well as different hash values.

Approach 1: Modify data type

A
1 =connect@l("mysql8")
2 =A1.query@x("select f1,if(f2='',null,f2) as f2,f3,f4 from test001 order by 1")
3 =connect@l("oracle12c")
4 =A3.query@x("select * from test001 order by 1").run(#1=int(#1))
5 =join@f(A2:t1,f1;A4:t2,f1)

A4 converts f1 values of decimal type to the int type.

Approach 2: Merge

A
1 =connect@l("mysql8")
2 =A1.query@x("select f1,if(f2='',null,f2) as f2,f3,f4 from test001 order by 1")
3 =connect@l("oracle12c")
4 =A3.query@x("select * from test001 order by 1")
5 =join@fm(A2:t1,f1;A4:t2,f1)

As both tables are ordered by f1, A5 uses the merge join to compare numeric values instead of hash values.

Then we full-join them to view the newly-added records, the deleted ones and the modified ones. Below is the code:

A
6 =A5.select(!t1).(t2)
7 =A5.select(!t2).(t1)
8 =A5.select(t1 && t2 && (${A2.fname().("t1."/~/"!=t2."/~).concat("||")}))

A6 finds records that exist in t1 but that do not exist in t2.

A7 finds records that do not exist in t2 but that exist in t1.

A8 finds records of t1 and t2 that have same primary key values but that have different values in other fields.

When no primary key is specified

When no primary keys are specified for data tables to be compared, we need to compare the whole columns of data:

A
1 =connect@l("mysql8")
2 =A1.query("SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME ='test001'")
3 =A2.(if(#2=="varchar","if("/#1/"='',null,"/#1/") as"/#1,#1)).concat@c()
4 =A1.query@x("select"/A3/"from test001 order by"/A2.(#).concat@c())
5 =connect@l("oracle12c")
6 =A5.query("SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS WHERE TABLE_NAME ='TEST001'")
7 =A5.query@x("select"/A6.(#1).concat@c()/"from test001 order by"/A6.(#).concat@c())
8 =join@mf(A4:t1,${A4.fname().concat@c()};A7:t2,${A7.fname().concat@c()})
9 =A8.select(!t1).(t2)
10 =A8.select(!t2).(t1)

As no primary key is specified, we just find out the records on both sides where there are any fields whose values are not equal.

A6 finds records that do not exist in t1 but that exist in t2.

A7 finds records that do not exist in t2 but that exist in t1.

Large-scale data comparison

This refers to the comparison scenarios where the size of data to be compared is too large to fit into the memory.

When primary key is specified

As the cursor is traversed once, we need to use SPL’s channel functionality in order to find the added, deleted and modified records at a time. The code is as follows:

A
1 =connect@l("mysql8")
2 =A1.query("SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME ='test001'")
3 =A1.cursor@x("select f1,if(f2='',null,f2) as f2,f3,f4 from test001 order by 1")
4 =connect@l("oracle12c")
5 =A4.cursor@x("select * from test001 order by 1")
6 =joinx@f(A3:t1,f1;A5:t2,f1)
7 =channel(A6)
8 =channel(A6)
9 =channel(A6)
10 >A7.select(!t1).(t2)
11 >A7.fetch()
12 >A8.select(!t2).(t1)
13 >A8.fetch()
14 >A9.select(t1 && t2 && (${A2.(#1).("t1."/~/"!=t2."/~).concat("||")}))
15 >A9.fetch()
16 =A6.skip()
17 =A7.result()
18 =A8.result()
19 =A9.result()

Unlike data tables (table sequences) having small volume of data, joinx(), be default, requires that the cursors be ordered by the joining fields and will not use hash value comparison method. This means result error due to different data types can be avoided. The cs.group() function is similar. It also requires that the cursors be ordered by the grouping fields by default and will not perform hash value comparisons.

A17 finds records that do not exist in t1 but that exist in t2.

A18 finds records that do not exist in t2 but that exist in t1.

A19 finds records of t1 and t2 that have same primary key values but that have different values in other fields.

When no primary key is specified

We still use channel to make the comparison. The code is as follows:

A
1 =connect@l("mysql8")
2 =A1.query("SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME ='test001'")
3 =A2.(if(#2=="varchar","if("/#1/"='',null,"/#1/") as"/#1,#1)).concat@c()
4 =A1.cursor@x("select"/A3/"from test001 order by"/A2.(#).concat@c())
5 =connect@l("oracle12c")
6 =A5.query("SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS WHERE TABLE_NAME ='TEST001'")
7 =A5.cursor@x("select"/A6.(#1).concat@c()/"from test001 order by"/A6.(#).concat@c())
8 =joinx@f(A4:t1,${A2.(#1).concat@c()};A7:t2,${A2.(#1).concat@c()})
9 =channel(A8)
10 =channel(A8)
11 >A9.select(!t1).(t2)
12 >A9.fetch()
13 >A10.select(!t2).(t1)
14 >A10.fetch()
15 =A8.skip()
16 =A9.result()
17 =A10.result()

A16 finds records that do not exist in t1 but that exist in t2.

A17 finds records that do not exist in t2 but that exist in t1.

Export comparison result

ch.result returns a table sequence. When the comparison result is also large, we can export the result to a file to avoid that there is not enough memory, as shown below:

A
1 =connect@l("mysql8")
2 =A1.query("SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME ='test001'")
3 =A2.(if(#2=="varchar","if("/#1/"='',null,"/#1/") as"/#1,#1)).concat@c()
4 =A1.cursor@x("select"/A3/"from test001 order by"/A2.(#).concat@c())
5 =connect@l("oracle12c")
6 =A5.query("SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS WHERE TABLE_NAME ='TEST001'")
7 =A5.cursor@x("select"/A6.(#1).concat@c()/"from test001 order by"/A6.(#).concat@c())
8 =joinx@f(A4:t1,${A2.(#1).concat@c()};A7:t2,${A2.(#1).concat@c()})
9 =channel(A8)
10 =channel(A8)
11 >A9.select(!t1).(t2)
12 =A9.fetch(file("t2.b"))
13 >A10.select(!t2).(t1)
14 >A10.fetch(file("t1.b"))
15 =A8.skip()

In ch.fetch(f) in both A12 and A14, f is a bin file.