How to Use SQL in esProc

 

esProc allows users to directly and separately use SQL within it, along with simple SPL syntax, to query source files of unusual or complicated formats. Once you understand that data of various formats that is input into esProc will all be in the form of tables, and that these data table objects are essentially the same as the database tables, you can use SQL to query one or multiple tables in esProc dynamically and flexibly. Here are some examples.

 

Basic scenarios

esProc SQLs basic data sources are text files and Excel files. There are two commonly-used types of separator used by text files, the tab and the comma. Tab-separated files have the txt extension, and comma-separated files use the csv extension. esProc can automatically identify them through the extension. There exist older versions of Excel saved in xls format and newer versions of Excel using the xlsx format. esProc is also able to distinguish the two versions by their extensions.

The basic data format used by esProc SQL is the two-dimensional table with column headers, as the following sales.txt shows:

OrderID    Client    SellerId    Amount    OrderDate

26         TAS      1         2142.4      2009-08-05

33         DSGC    1         613.2       2009-08-14

84         GC       1         88.5        2009-10-16

In esProc, SQL statements are written in cells to be executed. A click on a cell will show its computing result on the right value viewing section. Below is an example:

undefined 

Execute SQL at command line

esProc supports executing SQL from the command line when graphical environment is inconvenient. Use esprocx.exe file to start the command prompt under Windows, for example:

esprocx.exe -r select * from Order.txt

The leading sign $ is a must before a SQL statement written on IDE, but it is not when writing SQL at command line.

Use esprocx.sh file under Linux to start the command prompt. For example:

./esprocx.sh -r select * from Order.txt

When column headers are absent

Some files dont have column names (headers). In that case we can embed esProc SPL statements into SQL to handle the column names in a more flexible way. Below is part of the sOrderNT.txt file:

26      TAS     1       2142.4  2009-08-05

33      DSGC   1        613.2   2009-08-14

84      GC      1        88.5    2009-10-16

133     HU      1        1419.8  2010-12-12

To query data by a value in column 2, the esProc SQL is as follows:

$select *  from {file("sOrderNT.txt").import()} where _2='TAS'

A column name is represented by _+column number, such as _2 for column 2. The braces contains SPL code, where import() function can read in a file of more complicated format. Read http://doc.raqsoft.com/esproc/func/fimport.html to learn more.

esProc supports changing name before doing the calculation. That makes the use of a column name intuitively. Standard SQL syntax, however, uses a subquery to change the column name, as shown below:

$select * from (

select _1 OrderID ,_2  Client, _3 SellerId, _4 Amount, _5 OrderDate from  {file("sOrderNT.txt").import()})

where Client='TAS'

The subquery makes the code inefficient and difficult to read. Using a SPL statement to modify column name can avoid the troubles:

$select * from

{file("sOrderNT.txt").import().rename(_1:OrderID,  _2:Client, _3:SellerId, _4:Amount, _5:OrderDate)}

where Client='TAS'

Special separators

Some files use unusual separators, like colon, semicolon or even multiple characters. In this case we should use SPL import() function to specify the separator.

For example, sep.txt uses double vertical lines (||) as the separator. So we can read it using the following SQL:

$select * from  {file("sep.txt").import@t(;,"||")} 

 

When line break is absent

Some files use other signs rather than the carriage return to separate records. They look like a large string, as the following SOrderNL.txt file shows:

OrderID Client SellerId Amount OrderDate|26 TAS 1 2142.4 2009-08-05|33 DSGC 1  613.2  2009-08-14…

To read such a file, we first read data as strings, replace the vertical line by the carriage return and then use import function to parse it data as a two-dimensional table.

$select * from  {replace(file("sOrderNL.txt").read(),"|","\r\n").import@t()}

See http://doc.raqsoft.com/esproc/func/read.html to learn more about read() function.

Skip empty lines

Files may have empty lines. Those lines need to be skipped when retrieving data. Generally we embed an SPL statement in SQL to read the file and get the valid data rows.

There is the simplest scenario where the first row is valid (which can be parsed as a two-dimensional table by import function) and a location an indefinite number of rows after it has an empty row. We use the following SQL to handle the retrieval:

$select * from {file("sOrderb.txt").import@t().select(#1)}

The second select function is a SPL query function. The expression select(#1) means that the first field can’t be empty. The function can be used to get valid data rows of more complicated format. See http://doc.raqsoft.com/esproc/func/select.html  to learn more about select() function.

When the first row is invalid, import function cant be used. Instead, we use read() function to read the file as rows of data, find the valid data and convert it to a two-dimensional table. Suppose the first 3 rows are invalid and the valid data begins from row 4, esProc SQL will be like this:

$select * from {file("sOrderb.txt").read@n().to(4:).concat("\r\n").import@t()}

The expression to(4:) reads data through to the end from row 4. Thats the valid data. concat("\r\n") joins rows of data into a large string using carriage return. import() function converts the string to a two-dimensional table.

In more complicated cases, a file has empty rows in unfixed locations. SPL select() function is used to query such a file. The script is as follows:

$select * from {file("sOrderb.txt").read@n().select(~!="").concat("\r\n").import@t()}  where Client='TAS'

When data source is an SPL intermediate result

As we see in those examples, a longer SPL query is needed to parse a file when the files format is nonstandard. If the query is too long, the advantage of the cellset layout is diminished. That makes the code not as convenient to read and debug as expected. In order to make the most use of the cellset layout, esProc SQL allows using an SPL intermediate result as the data source.

So we can handle a file having empty rows in unfixed places with the script below:


A

B

1

=file("sOrderb.txt").read@n()

/ Read in the file row by row

2

=A1.select(~!="")

/ Get the non-null rows

3

=A2.concat("\r\n").import@t()

/ Concatenate the rows into a large string and parse it as a two-dimensional table

4

$select * from {A3}  where Client='TAS'

/ Use A3s result as the data source to execute SQL

The ability of using a SPL intermediate result as a data source gives strong flexibility to SQL and makes it more capable of parsing files of unusual formats, handling unusual data sources, computing big files, making judgement by loop, dealing with multi data source computations and other complicated cases.

 

When data source comes from Excel

One of the basic uses of SQL is to use Excel as the data source. So with esProc we can make data copied from an Excel file the data source through SPL.

Below is an Excel worksheet that store student scores:


A

B

C

D

1

name

math

english

physics

2

lily

97

100

99

3

Joshua

100

99

100

4

Sarah

98

99

96

5

Bertram

94

95

85

6

Paula

91

88

91

7

Sophia

92

81

76

8

Ben

87

80

76

9

Ruth

92

91

87

10

Pag

95

87

87

First select the area of A1:D10 in the Excel worksheet, copy and paste the column headers and the detailed data onto cell A1 in the esProc script, write the SPL code and then the SQL code, as shown below:


A

B

1

name math english  physics lily 97 100 99 Joshua 100 99 100…

/ Data copied from Excel

2

=A1.import@t()

/ Parse data as a two-dimensional table

3

$select * from {A2} where math>=80

/ Execute SQL

Note: You should copy data with A1 being the editing status, otherwise the clipboard content will occupy a large stretch of cells.  That will affect the code layout though computation is workable.

Multi-line records

A file where each record consists of multiple lines has nonstandard format. In the following file, every three lines is a record. The first record is 26\TAS\1\2142.4\2009-08-05.

26TAS    1    2142.42009-08-0533DSGC    1    613.22009-08-1484GC    1    88.52009-10-16

 

The following SPL code parses the file as a standard two-dimensional file and then takes the role of SQL data source:


A

B

1

=file("D:\\data.txt").import@si()

Read in the file as a sequence without distinguishing fields

2

=A1.group((#-1)\3)

Group data every 3 lines

3

=A2.new(~(1):OrderID,  (line=~(2).array("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate  )

Create a two-dimensional table; ~(1) represents the first member of the current group

4

$select * from {A3} where year(OrderData)=2009


Mixed data sources

You can use SPL to achieve mixed computations in SQL.

Oracle sales table stores orders records and emp.xlsx stores employee information in each department. To calculate the sales amount in each department, we use the following script:


A

B

1

=file("emp.xlsx").xlsimport@t()

/Excel data source

2

=connect("orcl").query@x("select  * from sales")

/ Database data source

3

$select  e.dept,sum(s.amount) from {A2} s left join {A1} e on s.sellerid=e.eid group  by e.dept

/Mixed computation

 

Use of high-performance file format

When the target file is large and SQL performance on it is low, we can convert the file format to esProc high-performance format btx. The following script converts orders.txt to orders.btx:


A

B

1

=file("orders.txt").cursor@t()

/ Open a large textfile with the cursor

2

=A1.sortx(OrderID)

/Sort

2

=file("sales.btx").export@z(A2)

/ Convert to btx format; @z option means that data is ordered

A btx file is similar to a text file or an Excel file in use but has higher performance. To make a conditional query, for example:

$select  orderid,client,amount from sales.btx where Client='TAS'

 

Read http://doc.raqsoft.com/esproc/tutorial/jiwenjian.html to learn the difference of txt files and btx files.

esProc also offers ctx file and the extended esProc SQL+ feature to further enhance the SQL performance through advanced capabilities, such as parallel processing. Read http://doc.raqsoft.com/esproc/func/sqljia.html to know more about SQL+.