"SQL syntax is close to natural language, with low learning threshold and the bonus of first move .."

Hoo RaqForum 19 No.
1 Reply • 4,265 View • 3 Years ago

Open-source SPL that can execute SQL without RDB

SQL syntax is close to natural language, with low learning threshold and the bonus of first mover advantage, it soon became popular between database manufacturers and users. After years of development, SQL has become the most widely used and most mature structured data computing language.

However, SQL must work based on RDB, and there is no RDB in many scenarios, such as encountering CSV \ restful JSON \ MongoDB and other data sources, or performing mixed calculations between these data sources, such as CSV and XLS. In these scenarios, many people will choose to hard code algorithms in high-level languages such as Java or C# etc., which requires writing lengthy underlying functions from scratch, and the execution efficiency is difficult to guarantee. It is easy to accumulate the "code shit mountain" that everyone hates. Some people load data into the database and then use SQL for calculation, but the loading process is very cumbersome and the real-time performance is also very poor. Sometimes you have to turn to ETL tools. The framework is aggravated, the risk is increased, and it is doubly troublesome to do mixed calculations.

Now, here comes esProc SPL, and these problems can be easily solved.

SPL is an open-source computing technology, which fully covers the computing power of SQL and supports a wide variety of data sources. SQL can now be used for structured data computing without RDB.

Perfect SQL computing power

SPL provides a syntax equivalent to the SQL92 standard and can perform rich and diverse data calculations, including filtering, calculating fields, selecting some columns, renaming, etc. You can directly execute SQL by using files such as text and XLS as data tables. Let's take the CSV file as the data source as an example:

1. Filtering

Basic comparison operation：

$select * from d:/Orders.csv where Amount>=100

like：

$select * from d:/Orders.csv where Client like '%bro%'

Null value judgment：

$select * from d:/Orders.csv where Client is null

Logical operators such as and, or and not can combine comparison operations to realize combined filtering:

$select * from d:/Orders.csv where not Amount>=100 and Client like 'bro' or OrderDate is null

in：

$select * from d:/Orders.csv where Client in ('TAS','KBRO','PNS')

Multi-layer parentheses:

$select * from d:/Orders.csv where (OrderDate<date('2020-01-01') and Amount<=100)or (OrderDate>=date('2020-12-31') and Amount>100)

2. Calculating columns

SPL has rich mathematical functions, string functions and date functions:

$select round(Amount,2), price*quantity from d:/Orders.csv
$select left(Client,4) from d:/Orders.csv
$select year(OrderDate) from d:/Orders.csv

case when：

$select case year(OrderDate) when 2021 then 'this year' when 2020 then 'last year' else 'previous years' end 
from d:/Orders.csv

coalesce：

$select coalesce(Client,'unknown') from d:/Orders.csv

3. SELECT

$select OrderId, Amount, OrderDate from d:/Orders.csv

4. ORDER BY

$select * from d:/Orders.csv order by Client, Amount desc

5. DISTINCT

$select distinct Client ,Sellerid from d:/Orders.csv

6. GROUP BY … HAVING

$select year(OrderDate),Client ,sum(Amount),count(1) from d:/Orders.csv 
group by year(OrderDate),Client 
having sum(Amount)<=100

Aggregation functions include sum, count, avg, max and min. Aggregation can be directly done without grouping:

$select avg(Amount) from d:/Orders.csv

7. JOIN

Left join:

$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Orders.txt o 
left join d:/Employees.txt e on o.SellerId=e.Eid

Right join:

$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Employees.txt e 
right join d:/Orders.txt o on o.SellerId=e.Eid

Full join:

$select o.OrderId,o.Client,e.Name e.Dept,e.EId from d:/Employees.txt e 
full join d:/Orders.txt o on o.SellerId=e.EId

Inner join:

$select o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o 
inner join d:/Employees.csv e on o.SellerId=e.Eid

Inner join can also be written in the form of where:

$select o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o ,d:/Employees.csv e 
where  o.SellerId=e.Eid

8. Subquery

$select t.Client, t.s, ct.Name, ct.address from 
(select Client ,sum(amount) s from d:/Orders.csv group by Client) t 
left join ClientTable ct on t.Client=ct.Client

with：

$with t as (select Client ,sum(amount) s from d:/Orders.csv group by Client)
select t.Client, t.s, ct.Name, ct.address from t 
left join ClientTable ct on t.Client=ct.Client

Subquery within in:

$select * from d:/Orders.txt o  where o.sellerid in (select eid from d:/Employees.txt)

9. AS

Use the as keyword to rename fields, calculated columns, physical tables, and subqueries:

$select price*quantity as subtotal from d:/detail.csv

10. Set operations

Including union, union all, intersect, minus. Here is an example:

$select * from Orders1.csv union all select * from Orders2.csv

11. into

The query results can be written to the file with the into keyword:

$select dept,count(1) c,sum(salary) s into deptResult.xlsx
from employee.txt group by dept having s>100000

Rich data sources support

SPL supports various non database data sources, including text in various non-standard formats. CSV has been shown in the previous examples. Tab separated TXT can also be supported, and SPL will process it automatically according to the extension:

$select * from d:/Orders.txt where Amount>=100 and Client like 'bro' or OrderDate is null

If the separator is not a comma or tab, you need to use the SPL extension function. For example, the separator is a colon:

$select * from {file("d:/Orders.txt").import@t (;":")} 
where Amount>=100 and Client like 'bro' or OrderDate is null

For files without title lines, column names can be represented by serial numbers:

$select * from {file("d:/Orders.txt").import()} where _4>=100 and _2 like 'bro' or _5 is null

Some strings in special formats should also be parsed with extension functions. For example, the date format is not standard yyyy-MM-dd:

$select year(OrderDate),sum(Amount) from 
{file("d:/Orders.txt").import@t(orderid,client,sellerid,amount,orderdate📅"dd-MM-yyyy")}
group by year(OrderDate)

SQL can also be executed on Excel files. For Excel with standard format, you only need to directly reference the file name:

$select * from d:/Orders.xlsx where Amount>=100 and Client like 'bro' or OrderDate is null

You can also read the specified sheet:

$select * from {file("D:/Orders.xlsx").xlsimport@t (;"sheet3")} 
where Amount>=100 and Client like 'bro' or OrderDate is null

CSV / XLS file downloaded from remote website:

$select * from {httpfile("http://127.0.0.1:6868/Orders.csv).import@tc() } 
where Amount>=100 and Client like 'bro' or OrderDate is null

HTTP protocol has many features, such as character set, port number, post parameter, header parameter, login authentication, etc. SPL extension functions can support all of them. The extension function can also grab the table data in the web page and support downloading files from the FTP server, which will not be elaborated here.

The JSON file will be read as a string before parsing:

$select * from {json(file("d:\\data.json").read())} 
where Amount>=100 and Client like 'bro' or OrderDate is null

There are few two-dimensional JSON, and multi-layer is the norm. The SPL extension function can convert multi-layer data into two-dimensional records and then calculate them with SQL. The details will not be explained here.

Restful json

$select * from {json(httpfile("http://127.0.0.1:6868/api/getData").read())}
where Amount>=100 and Client like 'bro' or OrderDate is null

If there are many and long extension functions, they can be written in step-by-step form:

	A
1	=httpfile("http://127.0.0.1:6868/api/getData")
2	=A1.read()
3	=json(A2)
4	$select * from {A3} where Amount>=100 and Client like 'bro' or OrderDate is null

Similar to CSV / XLS, SPL can also read JSON / XML files on HTTP websites.

XML

$select * from {xml(file("d:/data.xml").read(),"xml/row")}
where Amount>=100 and Client like 'bro' or OrderDate is null

Web Service

$select * from {ws_call(ws_client("http://.../entityWS.asmx?wsdl"),"entityWS ":" entityWSSoap":"getData")}
where Amount>=100 and Client like'bro' or OrderDate is null

SPL can also support NoSQL.

MongoDB

$select * from
{mongo_shell@x (mongo_open("mongodb://127.0.0.1:27017/mongo"),"main.find()")}
where Amount>=100 and Client like 'bro' or OrderDate is null

There are often multi-layer data in MongoDB, including restful and web service, and they all can be converted into two-dimensional data with SPL extension functions.

Salesforce

$select * from {sf_query(sf_open(),"/services/data/v51.0/query","Select Id,CaseNumber,Subject From Case where Status='New'")} where Amount>=100 and Client like 'bro' or OrderDate is null

Hadoop HDFS csv/xls/json/xml：

	A
1	=hdfs_open(;"hdfs://192.168.0.8:9000")
2	=hdfs_file(A1,"/user/Orders.csv":"GBK")
3	=A2.import@t()
4	=hdfs_close(A1)
5	$select Client,sum(Amount) from {A3} group by Client

HBase：

	A
1	=hbase_open("hdfs://192.168.0.8", "192.168.0.8")
2	=hbase_scan(A1,"Orders")
3	=hbase_close(A1)
4	$select Client,sum(Amount) from {A2} group by Client

HBase also has access methods such as filter and CMP, which can be supported by SPL.

Hive has a public JDBC interface, but its performance is poor. SPL provides a high-performance interface:

	A
1	=hive_client("hdfs://192.168.0.8:9000","thrift://192.168.0.8:9083","hive","asus")
2	=hive_query(A1, "select * from table")
3	=hive_close()
4	$select Client,sum(Amount) from {A2} group by Client

Spark

	A
1	=spark_client("hdfs://192.168.0.8:9000","thrift://192.168.0.8:9083","aa")
2	=spark_query(A1,"select * from tablename")
3	=spark_close(A1)
4	$select Client,sum(Amount) from {A2} group by Client

Alibaba cloud

	A
1	=ali_open("http://test.ots.aliyuncs.com","LTAIXZNG5zzSPHTQ","sa","test")
2	=ali_query@x(A1,"test",["id1","id2"],[1,"10001"]:[10,"70001"], ["id1","id2","f1","f2"],f1>=2000.0)
3	$select Client,sum(Amount) from {A2} group by Client

Cassandra

	A
1	=stax_open("127.0.0.1":9042,"mycasdb","cassandra":"cassandra")
2	=stax_query(A1,"select * from user where id=?",1)
3	=stax_close(A1)
4	$select Client,sum(Amount) from {A2} group by Client

ElasticSearch

	A
1	=es_open("localhost:9200","user":"un1234")
2	=es_get(A1,"/person/_mget","{\"ids\":[\"1\",\"2\",\"5\"]}")
3	=es_close(A1)
4	$select Client,sum(Amount) from {A2} group by Client

Redis

	A
1	=redis_open()
2	=redis_hscan(A1, "runoobkey", "v*", 3)
3	=redis_close (A1)
4	$select key,value from {A2} where value>=2000 and value<3000

SAP BW

	A
1	=sap_open("userName","passWord","192.168.0.188","00","000",”E")
2	=sap_cursor(A1, "Z_TEST1","IT_ROOM").fetch()
3	=sap_close(A1)
4	$select * from {A2} where Vendor like '%software%'

InfluxDB

	A
1	=influx_open("http://127.0.0.1:8086", "mydb", "autogen", "admin", "admin")
2	=influx_query(A1, "SELECT * FROM Orders")
3	=influx_close(A1)
4	$select Client,sum(Amount) from {A2} group by Client

Kafka

	A
1	=kafka_open("D://kafka.properties";"topic-test")
2	=kafka_poll(A1)
3	=kafka_close (A1)
4	$select Client,sum(Amount) from {A2} group by Client

MDX multidimensional database

	A
1	=olap_open("http://192.168.0.178:8088/msmdpump.dll","CubeTest","Administrator","admin")
2	=olap_query(A1,"with member [Measures].[AnnualInterestRate] as'[Measures].[SalesAmount]/[Measures].[StandardCost]-1'select {[Measures].[SalesAmount],[Measures].[StandardCost],[Measures].[AnnualInterestRate]} on columns, {[Order Date].[Calendar Year].[Calendar Year]} on rows from [DataSourceMulti]")
3	=olap_close(A1)
4	$select * from {A2} where SalesAmount>10000

In addition to supporting a wide variety of data sources, SPL can also perform mixed calculations between data sources. For example, between CSV and RDB:

$select o.OrderId,o.Client,e.Name e.Dept from d:/Orders.csv o
inner join d:/Employees.xls e on o.SellerId=e.Eid

Between MongoDB and database:

	A	B
1	=mongo_open("mongodb://127.0.0.1:27017/mongo")
2	=mongo_shell@x(A1,"detail.find()").fetch()	=connect("orcl").query@x("select * from main")
3	$select d.title, m.path,sum(d.amount) from {A2} as d left join {B2} as m on d.cat=m.cat group by d.title, m.path

Mixed calculations can be performed between any data sources, and the SQL syntax is not affected by the data sources.

Deeper computing power

In fact, the original meaning of SPL is Structure Process Language, which is a language specially used for structured data processing. In the previous examples, some of the syntax of SPL itself (those extension functions) have been shown. SQL is just a function provided by SPL, and SPL itself has more powerful and convenient computing power than SQL. Some calculation logic is complex, and it is difficult to code in SQL or even stored procedure, while SPL can complete the calculation with simpler code.

For example, here is a task: calculate the longest consecutive rising days of a stock. SQL uses multi-layer nested subqueries and window functions, and the code is lengthy and difficult to understand:

select max(continuousDays)-1
from (select count(*) continuousDays
    from (select sum(changeSign) over(order by tradeDate) unRiseDays
         from (select tradeDate,
                         case when price>lag(price) over(order by tradeDate) then 0 else 1 end changeSign
                   from AAPL) )
    group by unRiseDays)

While SPL only needs two lines:

	A	B
1	=T("d:/AAPL.xlsx")	Read Excel file, the first line is title
2	=a=0,A1.max(a=if(price>price[-1],a+1,0))	Get the max continuous rising days

For simple calculations, using basic SQL is very convenient, but when the calculation requirements become complex, SQL is not applicable. Even if more functions (such as window functions) are provided, the calculation cannot be simplified. In this case, we recommend that users directly use SPL with concise code instead of writing multi-layer nested complex SQL. For this reason, SQL in SPL only supports the SQL92 standard and does not provide more syntax including window functions.

SQL does not advocate multi-step calculation. It is used to writing a calculation task in a large statement, which will increase the difficulty of the task. SPL naturally supports multi-step calculation, and can easily split complex large calculation task into simple small tasks, which will greatly reduce the difficulty of coding. For example, find out the top n major customers whose cumulative sales account for half of the total sales, and rank them according to the sales from large to small:

	A	B
1	= T("D:/data/sales.csv").sort(amount:-1)	Fetch data, sort in descending order
2	=A1.cumulate(amount)	Calculate cumulative sequence
3	=A2.m(-1)/2	The last cumulative value is the sum
4	=A2.pselect(~>=A3)	The required position (more than half)
5	=A1(to(A4))	Get values by position

Flexible application structure

How to use SPL?

For interactive calculation and analysis, SPL has a professional IDE, which not only has complete debugging functions, but also can visually observe the intermediate calculation results of each step:

SPL also supports command line execution and supports any mainstream operating system:

D:\raqsoft64\esProc\bin>esprocx.exe -R select Client,sum(Amount) from d:/Orders.csv group by Client
Log level:INFO
ARO     899.0
BDR     4278.8
BON     2564.4
BSF     14394.0
CHO     1174.0
CHOP    1420.0
DYD       1242.0
…

For the calculation in the application, SPL provides a standard JDBC driver and can be easily integrated into Java:

...
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
PrepareStatement st = conn.prepareStatement("$select * from employee.txt where SALARY >=? and SALARY<?");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result=st.execute();
...

There will be frequent modifications or complex calculations in the application. SPL allows the code to be placed outside the Java program, which can significantly reduce the code coupling. For example, the above SPL code can be saved as a script file and then called in the form of stored procedure in Java:

...
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement st = connection.();
CallableStatement st = conn.prepareCall("{call getQuery(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000); 
ResultSet result=st.execute();
...

With open-source SPL, you can easily use SQL without RDB.