Looking for the Best Tool for Using Non-RDB Data Sources in Reporting Tools

 

A reporting tool can use the general JDBC/ODBC interface to connect to RDBs, and almost all reporting tools provide the necessary support. Their abilities to use non-RDB data sources, however, vary widely. This essay looks at the common methods reporting tools use to retrieve data from non-RDB sources and assesses them on the following aspects:

1. Whether the tool can connect to the data source and retrieve data;
2. Whether it can make good use of the data source’s computing ability;
3. Whether it has independent computing power to handle data when the data source’s own computing ability is insufficient, so that the computing pressure is not pushed onto the reporting tool (which has limited computing capability);
4. Whether it supports big data handling, since reporting tools usually lack such an ability.

Common non-RDB data sources include, but are not limited to:

  • File & API: TXT/CSV, Excel, JSON, XML, HDFS, and FTP

  • Big data & NoSQL: MongoDB, Cassandra, HBase, Redis, Hive, Spark, Elasticsearch, and Kafka

  • Multi-dimensional databases: Essbase, SSAS

  • Time series database: InfluxDB

  • ERP & CRM: Salesforce and SAP

  • Webservice

Data source support in reporting tools

Generally, a reporting tool supports a certain set of data sources, including non-RDB ones. CSV and Excel are the most widely supported, followed by JSON. MongoDB and Webservice are supported by some reporting tools. The open-source BIRT and JasperReports, for instance, support a set of file sources including CSV, XML and Excel, enable users to access MongoDB, and can connect to Webservice, which is enough for commonly seen computing scenarios.

The data source connection feature offered by reporting tools is convenient, as few extra configurations are needed. Yet different reporting tools support the unconventional non-RDB data sources in their own ways. Some provide a visual interface for data set creation, which is simple yet inflexible. Others require writing a script, which is flexible yet not easy. There is no simple, uniform way of handling non-RDB sources.

The way a reporting tool supports a data source decides whether it can make use of the data source’s computing ability. Some reporting tools only support data retrieval through field selection and conditional filtering; a second operation (such as a JOIN) on the data sets is then needed to further process the retrieved data, because the data source’s own computing ability (such as MongoDB $lookup) cannot be employed. Data-source-end computations, however, are more efficient, usually produce smaller result sets, and are easier to optimize for higher transmission speed. Reporting tools, unfortunately, cannot achieve an equal result.
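To make the difference concrete, below is a minimal Java sketch of pushing such a join down to MongoDB with $lookup through the official MongoDB Java driver; the Orders/Employees collections and the SellerId/EId fields are assumed here for illustration only:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;
import java.util.List;

public class LookupDemo {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance (address assumed for illustration)
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("mongo").getCollection("Orders");
            // $lookup runs inside MongoDB, so only the joined result set
            // travels to the reporting tool instead of two raw collections
            for (Document d : orders.aggregate(List.of(
                    Aggregates.lookup("Employees", "SellerId", "EId", "Seller")))) {
                System.out.println(d.toJson());
            }
        }
    }
}

When the reporting tool can only fetch the two collections separately and join them itself, both full collections must be transferred and joined in the report’s own memory.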

Speaking of independent computing power, some reporting tools supply rather simple join operations and scripting computations for the data set creation (data preparation) phase. More complicated computations require a user-defined interface, which falls outside the reporting tool’s own computing capacity.

As for big data handling, most reporting tools load the full data set into memory to produce and display the result set, leading to heavy resource consumption and poor performance.

Unofficial JDBCs

Reporting tools naturally support JDBC data sources. For data sources that do not provide an official JDBC driver, a third party sometimes develops and publishes its own JDBC driver for connecting the reporting tool with the data source. These third-party JDBC drivers are usually developed by individuals, and their quality is inconsistent and uneven. You need to try your luck to find a good one.

Here are three third-party JDBC drivers:

elasticsearch-jdbc: https://github.com/jprante/elasticsearch-jdbc

salesforce-jdbc: https://github.com/ascendix/salesforce-jdbc

influxdb-java: https://github.com/influxdata/influxdb-java

But generally, these JDBC drivers only offer data retrieval functionality; the reporting tool has to take on any further computations. They have almost none of the last three abilities we listed at the beginning.

Data source connector

A data source connector is a special tool for connecting to a data source from an application. A great connector not only connects to a non-RDB data source but also provides a certain degree of SQL support, making data retrieval convenient and fast. CData Connector is one of them.

The connector supports connecting to 225 types of data sources, covering most of the commonly used ones except InfluxDB. It is applicable to various computing cases by supporting JDBC, ODBC, ADO.NET and Python.

CData Connector claims to provide SQL-92 support for most data sources, so reporting tools can access data in sources ranging from Excel to MongoDB using SQL. Yet in actual use, the connector shows weak computing ability and is not friendly to complicated computations.

The CData JDBC driver requires the jars for the data sources to be connected. The reporting tool imports the jars and configures the target data source to make the connection. Below is the connection information for Excel:

JDBC Driver: cdata.jdbc.excel.ExcelDriver

JDBC URL: jdbc:excel:Excel File='D:/documents/orders.xlsx'

Use the Excel sheet name as the table name when using SQL to retrieve data, for instance:

SELECT * FROM Sheet1
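For reference, here is a minimal Java sketch of this connection through plain JDBC, assuming the CData Excel driver jar is on the classpath and self-registers, as modern JDBC drivers normally do:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CDataExcelDemo {
    public static void main(String[] args) throws Exception {
        // Driver class and URL as shown above; the file path is the example one
        String url = "jdbc:excel:Excel File='D:/documents/orders.xlsx'";
        try (Connection conn = DriverManager.getConnection(url);
             Statement st = conn.createStatement();
             // The sheet name serves as the table name
             ResultSet rs = st.executeQuery("SELECT * FROM Sheet1")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}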

Similar methods can be used to connect to the other data sources. Users can choose the connector CData specifically offers for a data source, or use the all-in-one connector.

It is convenient to use a data source connector to connect to all sorts of complicated non-RDB data sources and to achieve most simple data preparation work using its built-in SQL ability. Note that a targeted test is needed to check whether performance is affected before placing a data source connector between an application and a data source; the official CData documentation offers no detailed information on this.

A data source connector often boasts certain computing abilities thanks to its SQL support, but it lacks the capability to deal with complicated computations. Such computations are hard even for full SQL (particularly when window functions are absent), and a data source connector only supports SQL partially, so it is no wonder that it has only partial computing power.

Basically, data source connectors are almost unable to handle big data. Though most have SQL support, it is difficult for them to employ a database-style pagination mechanism (though that mechanism has its own weaknesses). Apart from this, big data handling often depends on special techniques, such as parallel processing and external memory computing, and data source connectors offer none of them.

User-defined extension interface

Many reporting tools keep a data source extension interface (a script-formed or user-defined data set), making it convenient for users to write a program that connects to the data source and handles data on certain occasions.

A user-defined data source often uses a special computing library to connect to the data source and perform data preparation. Such computing libraries include POI for Excel, dom4j for XML, JsonPath for JSON, CSVJDBC for CSV, the MongoDB connector, and DataFrame class libraries (including Tablesaw, Joinery, Morpheus, Datavec, Paleo, and Guava). These libraries speed up data source connection and data handling, as the sketch below shows.
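As a small illustration, here is a sketch of the kind of code a user-defined data set typically wraps, reading an Excel file with Apache POI (the file name orders.xlsx is assumed):

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import java.io.File;

public class PoiReadDemo {
    public static void main(String[] args) throws Exception {
        DataFormatter fmt = new DataFormatter();
        // WorkbookFactory opens both .xls and .xlsx transparently
        try (Workbook wb = WorkbookFactory.create(new File("orders.xlsx"))) {
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                StringBuilder line = new StringBuilder();
                for (Cell cell : row) {
                    line.append(fmt.formatCellValue(cell)).append('\t');
                }
                System.out.println(line);
            }
        }
    }
}

Turning such raw rows into a report data set, however, is still up to the user’s own code.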

Unlike data source connectors, computing libraries are equipped with strong and flexible computing power and can handle complicated data preparation work for report development. Even if a computing library lacks certain computing abilities, users can still extend it with code in the user-defined data source.

A user-defined data source can make full use of its computing capacity by writing code, achieve complex data handling jobs through a computing library plus coding, and deal with big data processing scenarios. In theory, hardcoding can do anything. Yet its strength is also its weakness: hardcoding is difficult to use precisely because it involves so much code. In real-world situations, the technique is only used in certain individualized scenarios.

esProc SPL

esProc is an open-source, professional data computing engine that offers connections to a variety of data sources and has powerful computing capacity; it combines the roles of data source connector and computing package. It is based on the independent scripting language SPL (Structured Process Language), which makes it convenient to connect to non-RDB data sources and perform data computations. It also provides standard JDBC and ODBC interfaces through which the SPL script result set can be called by reporting tools.

Below is the SPL script (mongo.dfx) that connects to a MongoDB database and performs an associative operation:


    A
1   =mongo_open("mongodb://127.0.0.1:27017/mongo")
2   =mongo_shell(A1,"test1.find()")
3   =A2.new(Orders.OrderID,Orders.Client,Name,Gender,Dept).fetch()
4   =mongo_close(A1)
5   return A3

Create the JDBC data source within the reporting tool. Below is the connection information for esProc JDBC:

JDBC Driver: com.esproc.jdbc.InternalDriver
JDBC URL: jdbc:esproc:local://

The SPL script can be called from a report data set in a way similar to calling a stored procedure, achieving MongoDB connection and data processing, for example:

call mongo()

mongo is the name of the above SPL script (mongo.dfx).
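The same call works from any JDBC client. Below is a minimal Java sketch, assuming the esProc JDBC jars are on the classpath and mongo.dfx sits in the configured script path:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EsprocCallDemo {
    public static void main(String[] args) throws Exception {
        // Driver class and URL as shown above
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = conn.createStatement();
             // Call the SPL script like a stored procedure;
             // Name is a field returned by the script above
             ResultSet rs = st.executeQuery("call mongo()")) {
            while (rs.next()) {
                System.out.println(rs.getString("Name"));
            }
        }
    }
}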

Besides the native SPL syntax, esProc also supports SQL syntax. To join two single-level MongoDB collections, for example:


    A
1   =mongo_open("mongodb://127.0.0.1:27017/mongo")
2   =mongo_shell(A1,"Orders.find()").fetch()
3   =mongo_shell(A1,"Employees.find()").fetch()
4   =mongo_close(A1)
5   $select o.OrderId,o.Client,e.Name,e.Gender,e.Dept from {A2} o join {A3} e on o.SellerId=e.EId

Most of the time, SPL connects to non-RDB data sources through their CLIs, which enables it to make the most of the data source’s native computing capability. As SPL supports MongoDB’s JSON query expressions (find, count, distinct and aggregate), an interval query, for example, can be written as follows:

=mongo_shell(A2,"test1.find({Orders.Amount:{$gt:1000,$lt:3000}})")

SPL boasts big data processing abilities. It can handle most big data scenarios using order-based computations, external memory computing and parallel processing, and output the result set to a reporting tool fast.

esProc SPL supports dozens of data sources, meeting the requirements of most computing scenarios. Below are some of the data sources it supports:

  • AliCloud

  • Cassandra

  • Multidimensional databases

  • ElasticSearch

  • FTP

  • HBase

  • HDFS

  • Hive

  • InfluxDB

  • Informix

  • Kafka

  • MongoDB

  • Redis

  • Salesforce

  • SAP

  • Spark

  • Webcrawl

  • Webservice

  • YModel

  • Zip

 

In summary, reporting tools give different ways and degrees of support for non-RDB data sources. For some data sources the support is too simple, and on those occasions users need to turn to another route to achieve data access. Unofficial JDBC drivers are not consistent enough to be the first choice for diverse data source connections. Data source connectors are convenient but weak in data processing ability, and some are not sufficiently stable. User-defined data sources can, in theory, do anything, but they are the last choice because the implementation process is highly complicated. esProc SPL shows the best overall ability: it handles commonly seen computing scenarios involving diverse data sources professionally (though the data sources it supports are not the richest, it can deal with most scenarios), is strong in processing data, supports big data processing well, and satisfies special needs in certain scenarios. The tool is a synthesis of the ideas behind the previous approaches.