SQL-style Operations on Structured Text Files

 

Structured text files, such as TXT and CSV files, are stored in common row-wise format. We can hardcode any operations on such a file in Java in rather complicated ways. We can use open-source packages such as Commons CSV, OpenCSV, and SuperCSV to parse them but need to load data to the data base to in order to perform the subsequent computations. This is inconvenient. Yet with SPL (Structured Process Language), it is convenient for us to compute text files and integrate the relevant code or script into a Java program.

 

1.   Retrieval

In a text file, each row is a record and all rows have same columns. The file is similar to a database table.

SPL has file function to open a text file and import function to read it as a table sequence for later computations. Below are some examples of text file retrieval.

1. Tab(\t)-separated, with header rows, and of txt format

undefined 

=file("scores.txt").import@t()

The parameter in file function can be the target file’s absolute path or relative path (which is relative to the mainPath configured in SPL configuration file). import function works with @t option to read headers in and uses the default separator tab (\t).

 

2. Comma-separated, without header rows, and of CSV format

undefined 

=file("d:/order/orders.txt").import@c()

@c option means that the targe file is of CSV format. The absence of @t option means no header row. A file without the header row does not have column names and columns will be referenced using their ordinals, such as #1 and #2.

 

3. Vertical-line-separated, with header row, of txt format, and using utf-8 charset

undefined 

=file("d:/txt/employee.txt":"utf-8").import@t(;,"|")

We can add a charset parameter to file function since garbled code may be shown if wrong charset is used. The OS default will be used if no charset is specified. In the import function, @t option enables reading headers in and parameter “|” represents the vertical line separator.

 

Read Structured Text File Parsing in SPL to learn more about text file retrieval.

 

2.   Operations

The retrieved data is stored as a table sequence in the memory, on which any operations, such as filtering, aggregate, grouping & summarization, sorting, joins, distinct, and getting computed column, can be performed.

Below is a student scores table (Students_scores.txt):

undefined 

 


A

Description

1

=file("Students_scores.txt").import@t()

Import data as a table sequence; @t option reads the first row as column headers

2

=A1.select(CLASS==10)

Get records of class 10

3

=A1.groups(CLASS;min(English),max(Chinese),sum(Math))

Calculate the lowest English score, the highest Chinese score, and the total math score in each class

4

=A1.sort(CLASS,-Math)

Sort recrods by class number in ascending order and then by math score in descending order

 

Read General SQL-style Operations in SPL to learn how to perform more operations.

 

3.   Returning result set to Java

The SPL script can be called by a Java program through JDBC using the following code:

Class.forName("com.esproc.jdbc.InternalDriver");
Connection con = DriverManager.getConnection("jdbc:esproc:local://");
Statement st = con.createStatement ();
ResultSet rs = st.executeQuery("=file(\"Students_scores.txt\").import@t().select(CLASS==10)");

 

Read How to Call an SPL Script in Java to learn more.

 

4.   Using SQL

We can also apply SQL directly on a text file, for instance:


A

1

$select * from E:/txt/Students_scores.txt where CLASS=10

2

$select min(English),max(Chinese),sum(Math) from E:/txt/Students_scores.txt

3

$select * from E:/txt/students_scores.txt order by CLASS, Math desc

SPL will automatically choose the corresponding separator according to file extension. By the way, the file must have column headers for the use of SQL.

 

We can also execute the avove SQL statement directly in JDBC and return the result.

ResultSet rs = st.executeQuery("select * from E:/txt/Students_scores.txt where");

 

Read Examples of SQL File Queries to learn more related information.

 

5.   Big file handling

SPL handles big text files that cannot fit into the memory with the cursor. The way is similar to that of handling smaller files. We just need to replace import function with cursor function to return a cursor. SPL supports attaching operations on a cursor.

Here we use the example table in Section 2:


A

Description

1

=file("E:/txt/Students_scores.txt").cursor@t()

Import data as a cursor; @t option reads the first row as column headers

2

=A1.select(CLASS==10)

Get records of class 10

3

=A1.groups(CLASS;min(English),max(Chinese),sum(Math))

Group records by class and perform aggregates and use groups function to return a table sequence if the result set is not so large

4

=A1.groupx(CLASS;min(English),max(Chinese),sum(Math))

Use groupx function to return a cursor if the result set is large

5

=A1.sortx(CLASS,-Math)

Sort records and return a cursor

 

When the result set of handling a big text file is really large, it can be returned as a cursor for further processing. To get the first records from A4’s result set, for instance, SPL has the following statement:

=A1.groupx(CLASS;min(English),max(Chinese),sum(Math)).fetch(10)

To save the sorting result set in A5, for instance, SPL uses statement as follows:

=file("E:/txt/sorted_scores.txt").export@t(A1.sortx(CLASS,-Math))

Read Structured Text File Parsing in SPL to learn more about data retrieval from large text files.

Read General SQL-style Operations on Cursors in SPL to learn more about performing operations on large text files.

 

The final cursor result of a SPL script can be returned to a JAVA program for invocation. The method of calling it is the same as that calling one via JDBC interface.

Like handling small file, we can use SQL on large files.