Methods of dumping data among databases

 

Dumping data between databases usually means exporting the data of the source database to a file and then importing that file into the target database. In more complicated situations, the exported data must be processed first and imported into the target database afterward. This article explores several types of files used for data dumping and compares them in detail.

 

Database tools

Databases come with their own tools for exporting and importing data, which can quickly handle simple offline dumping tasks: backing up and restoring data, dumping data between tables of the same structure in the same type of database, and other relatively simple data migrations.

 

Take MySQL as an example. To dump data between tables of the same structure in the same kind of database, we can first export the table data to a text file with the following SQL:

select *
into outfile "/home/raqsoft/lineitem.txt"
from LINEITEM;

 

Then we can import the text file into a table of the same structure with this SQL:

load data infile "/home/raqsoft/lineitem.txt"
into table LINEITEM1;

MySQL can also export and import data with the mysqldump and mysqlimport commands. Similarly, Oracle provides spool and sqlldr.

 

Code to dump data

 

Nonetheless, in some situations the database tools cannot be used directly: the structure of the exported data differs from that of the target table, the data needs to be modified before importing, the dump involves dynamic query conditions, and so on. Such dumps may even need to be provided inside an application so that they can be invoked programmatically. In these cases we need to write code that exports the data of the source database to an intermediate file and then imports the file (possibly after some processing) into the target database.

Text file

Text files are the most common dumping medium: they are highly universal and place no limit on content size. However, data exported to text files lose their data types. Data read from a text file come back as strings and must be converted to the desired types before use. In practice, it is also difficult to write a universal class for reading text files.

 

Excel file

Excel can also serve as an intermediate file for data dumping, and unlike text files its cells carry data types. However, Excel files have size limits (xls files: at most 65,536 rows and 256 columns; xlsx files: at most 1,048,576 rows and 16,384 columns), which makes them awkward to handle when the exported data exceed those limits. Excel files are also more complex than text files, so exporting and importing the same amount of data is slower.

 

Bin file of esProc

If the bin file (btx) of esProc SPL is adopted, all the problems above can be avoided. btx is a special binary format that records data types automatically, so the type loss of text files does not occur, and there is no need to parse or determine data types when reading, which makes bin files faster to read than text files. Bin files also have no limit on content size, so the exported data do not have to be split into segments as they would be for Excel. Moreover, data can be appended to an existing bin file as needed, and when reading relatively big data, the file can be read in segments in parallel to improve efficiency.
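As a rough sketch (the data source name, file name, and table here are hypothetical, and the @a append option of export and the @b option of cursor are assumed to behave as described in the esProc documentation), writing, appending to, and reading back a bin file might look like this:

	A	B
1	=connect("mysql")	//connect to the source database (hypothetical data source name)
2	=A1.query("select * from LINEITEM")	//read the source table into memory
3	=file("lineitem.btx").export@b(A2)	//write the data to a bin file; the data types are recorded automatically
4	=file("lineitem.btx").export@ab(A2)	//@a appends more data to the existing bin file
5	=file("lineitem.btx").cursor@b()	//read the bin file back as a cursor for further processing
6	>A1.close()	//close the database connection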

 

No matter which type of file is used as the intermediate file, programming is eventually needed to process the data.

Data dump in SPL

SPL is a programming language specially designed for structured (and semi-structured) data, with full capabilities for processing text and Excel files. The code to read data from files is as follows:

 

txt files: =file("Department.txt").import@t(DEPT,MANAGER:int,...)

csv files: =file("Department.csv").import@tc(DEPT,MANAGER:int,...)

Excel files: =file("Department.xlsx").xlsimport@t(DEPT,MANAGER,...; "sheet1")

 

The code to export data to files is as follows:

txt files: =file("Department.txt").export@t(data)

csv files: =file("Department.csv").export@tc(data)

Excel files: =file("Department.xlsx").xlsexport@t(data; "sheet1")

 

When the data are large, SPL can also create a cursor on the file and compute against the cursor, or read data from a cursor and write them to a file.
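A minimal sketch of such cursor-based processing (the file and field names are hypothetical, and export is assumed to accept a cursor, in line with the read and write functions above):

	A	B
1	=file("Orders.txt").cursor@t(ORDERID:int,AMOUNT:float,ODATE:date)	//create a cursor on a large text file instead of loading it all into memory
2	=A1.select(AMOUNT>1000)	//define the calculation on the cursor; it is executed as the data are fetched
3	=file("Orders_filtered.btx").export@b(A2)	//write the cursor result to a bin file without holding all the data in memory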

 

With SPL code, data dumping can be done easily regardless of which type of intermediate file is chosen.

 

For example, if the target table has a different structure, SPL can be written like this: =db.update(file,tablename,field1,field2,CUSTID:CUSTKEY,field4,field5,REMARK:null). Here the CUSTKEY field of the file is mapped to CUSTID, and REMARK is a newly added field assigned null.

 

Another example is modifying the data before importing them. Suppose the file has a column named data_date whose type is string, and the strings in yyyyMMdd format need to be converted to dates and imported into a new table. The SPL can be written like this: =db.update(file,tablename,field1,field2...,data_date:date(data_date,"yyyyMMdd"),...,fieldn).
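Putting the two examples above together, a complete script might look like the following sketch (the file, data source, table, and field names are hypothetical):

	A	B
1	=file("customer.txt").import@t(CUSTKEY:int,NAME,data_date)	//read the intermediate file; data_date is still a string at this point
2	=connect("mysql")	//connect to the target database
3	=A2.update(A1,CUSTOMER,CUSTID:CUSTKEY,NAME,data_date:date(data_date,"yyyyMMdd"),REMARK:null)	//map CUSTKEY to CUSTID, convert data_date to a date, and add a REMARK field assigned null
4	>A2.close()	//close the database connection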

 

Dumping offline data may also be complex, for example filtering the data of a file according to dynamic conditions and then importing the results into the target database. Such a situation requires a program that can parse dynamic expressions, which is difficult to write and complicated to debug. We can combine SPL with the tools that come with the database to achieve the dump, for example:


	A	B
1	=file("info.txt").cursor@t(id:int,...,type:int,sdate:date)	//info is the text file exported with the database tools; create a cursor to read the data from the text file and specify the field types
2	=A1.select(type==3 && sdate>=date("2021-01-01"))	//filter the data based on the conditions, which can be regarded as dynamically given parameters

If the filtered result set is small, it can be imported into the database directly through JDBC:

3	=connect("mysql")	//connect to the database
4	=A3.update@i(A2,INFO)	//insert the data of A2 into table INFO; the field names of A2 are the same as those of the table
5	>A3.close()	//close the database connection

 

Alternatively, we can export the processed results to a text file and then import it into the target table with the database tools.
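For example, instead of steps 3 to 5 above, the filtered cursor can be written to a text file (the output file name is hypothetical, and export is again assumed to accept a cursor) and then loaded with load data infile, mysqlimport, or a similar tool:

=file("info_filtered.txt").export@t(A2)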

 

SPL can be easily integrated with Java to enable data dumping inside applications. For further information, please refer to How to Call an SPL Script in Java.