What should we use for data preparation in reporting, querying and analysis?


Whether you are reporting, querying or analyzing, data preparation is always involved. Data preparation means processing raw data into a form that can be presented or analyzed (such as the familiar report dataset). However, data preparation is often not easy and can run into all kinds of complex situations.

Challenges in data preparation

Diverse data sources

Business data now comes from diverse sources. In theory, any data source can serve as a source of report data: RDB, NoSQL, CSV, Excel, JSON, Hadoop, and so on. This creates the problem of diverse data sources in report development. Moreover, it is not enough merely to connect to these sources; mixed calculations across them must also be performed. Report analysis is a comprehensive job, and mixed analysis over multiple data sources of the same or different types is very common, so a data preparation technology needs to be able to connect to multiple data sources, perform cross-source mixed calculations, and combine closely with reporting and BI tools.

Complex calculation

Another problem is the complexity of report data preparation. Some data preparation is so complex that a great deal of intricate code is needed to turn the raw data into the form the report requires. The complexity is of course also related to the business nature of the report: a complex data preparation process typically involves a wide range of data sources, a large span, and convoluted processing logic. That is why we often see SQL statements with hundreds of lines, or stored procedures of hundreds of KBs.

Integration

Some reports are developed with reporting tools and finally published in an application for users. Data preparation, as part of the report, naturally needs to be integrated with the report application. In addition, although one-off reports for temporary use do not need to be integrated, if they recur they may also be migrated into the application system, in which case integration is required after all. If a data preparation technology integrates poorly, the value of the report is greatly reduced.

On the other hand, whether a data preparation technology supports hot deployment / hot swapping directly affects how convenient report development, operation and maintenance are. If it does, a modified report can simply be uploaded to the application and take effect in real time; if the service has to be restarted for changes to take effect, development, operation and maintenance become very inconvenient.

Common data preparation technologies

SQL

SQL is currently the most common method for report data preparation, and it is a dedicated set-oriented language. Processing structured data with SQL is very simple; for example, a common grouping and aggregation can be written with a single GROUP BY statement.

To use SQL, you only need to import the corresponding database driver (such as JDBC). This provides a certain degree of integration, although it is tightly coupled with the database. In addition, SQL sent to the database returns query results in real time, so SQL can be considered to support hot swapping.
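
For instance, here is a minimal sketch of this pattern in Java (the connection URL and the orders table are hypothetical, and a matching JDBC driver is assumed to be on the classpath); the entire grouping and aggregation is one GROUP BY statement handed to the database:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class GroupBySample {
    public static void main(String[] args) throws Exception {
        // hypothetical connection; any RDB with a JDBC driver works the same way
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/demo", "user", "password");
             // one GROUP BY statement performs the whole grouping and aggregation
             PreparedStatement st = conn.prepareStatement(
                 "select customer, sum(amount) total from orders group by customer");
             ResultSet rs = st.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("customer") + "\t" + rs.getDouble("total"));
            }
        }
    }
}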

However, the shortcomings of SQL are also obvious, and it struggles with the data preparation challenges mentioned earlier.

SQL-based databases have the concept of a “base”: data can only be queried after it has been loaded into the database, which makes it difficult to handle scenarios with diverse data sources. Many sources, such as NoSQL databases, local files and network file systems, are situations SQL cannot deal with directly, and implementing data preparation on them is far more troublesome than writing SQL.

Moreover, SQL's support for complex calculations is poor. To implement a complex calculation, SQL often has to be nested several layers deep, and code running to hundreds of lines is very common; such code is not only hard to write, but also inconvenient to modify. Take a not-so-complicated example:

According to a stock record table, find the stocks whose price has risen for 5 consecutive days or more, together with the number of rising days (if the price remains unchanged, it is counted as rising).

Using SQL to implement:

select code, max(risenum) - 1 maxRiseDays
  from (select code, count(1) risenum
          from (select code,
                       changeSign,
                       sum(changeSign) over(partition by code order by ddate) unRiseDays
                  from (select code,
                               ddate,
                               case
                                 when price >= lag(price)
                                  over(partition by code order by ddate) then
                                  0
                                 else
                                  1
                               end changeSign
                          from stock_record))
         group by code, unRiseDays)
 group by code
having max(risenum) > 5

Do you feel it too? Although this code is not very long, it still takes a while to understand, let alone to come up with the method and write it out. The difficulty SQL has with complex calculations is rooted in relational algebra theory, and a theoretical problem cannot be solved by the database itself through engineering.

Many databases also provide stored procedures, which add a process control mechanism on top of SQL so that SQL can handle multi-step data processing tasks. However, stored procedures do not change the computing capability of SQL itself, and calculations that SQL does not support well remain difficult to code in a stored procedure.

Such low development efficiency makes it hard to keep up with endless reporting requirements: an SQL statement of hundreds of lines or a stored procedure of hundreds of KBs can never be coded easily. Consequently, SQL cannot cope with high-frequency report additions and modifications at low cost.

From this point of view, SQL, the most common data preparation technology, is not an ideal option, and in many cases it is downright unsatisfactory. Then why is it still so widely used? Because Java and other technologies perform even worse.

Java

As a Turing-complete language, Java can in theory complete any calculation, including report data preparation. Moreover, Java easily solves the problem of diverse data sources, since most data sources provide a Java interface. Java can therefore read and use such sources smoothly; there is nothing hard coding cannot solve.

Another advantage of Java is that it is easy to integrate with applications. Both reporting tools and analysis applications provide Java interfaces, so data preparation done in Java can naturally be used by reports.

The problem with Java is that it cannot perform structured data calculations conveniently, while report data preparation mostly works on structured data. Java lacks the necessary class libraries for structured data computation, and the data types it provides are too basic, so even a simple grouping and aggregation takes dozens of lines of code (a general-purpose version is more complicated still). Its development efficiency is thus far lower than SQL's.
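
To make the contrast concrete, below is a minimal hand-written sketch of the customer/amount grouping in plain Java (the file name and record layout are hypothetical). Even this stripped-down version needs a record class and manual CSV parsing; a general version with type handling, null checks and multiple aggregations quickly grows to dozens of lines:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByJava {
    // minimal record type for one order row
    static class Order {
        String customer;
        double amount;
        Order(String customer, double amount) {
            this.customer = customer;
            this.amount = amount;
        }
    }

    public static void main(String[] args) throws Exception {
        // read a hypothetical headerless CSV laid out as: customer,amount
        Map<String, Double> totals = Files.lines(Paths.get("orders.csv"))
                .map(line -> line.split(","))
                .map(f -> new Order(f[0], Double.parseDouble(f[1])))
                .collect(Collectors.groupingBy(o -> o.customer,
                        Collectors.summingDouble(o -> o.amount)));
        totals.forEach((c, t) -> System.out.println(c + "\t" + t));
    }
}

In SQL the same job is the one-statement GROUP BY shown earlier; in Java every step of reading, parsing and aggregating has to be spelled out by hand.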

Like SQL, Java's low development efficiency cannot cope well with endless reporting requirements, and Java is even worse in this regard. Moreover, as a compiled language, Java does not support hot swapping, which also makes it difficult to handle frequently changing report requirements.

Therefore, on the whole, apart from its strong flexibility, Java has no advantage with respect to the challenges of the data preparation phase, and in many cases it is even inferior to SQL.

Python

As a popular data processing language, Python can be used on its own for reporting and analysis, and its biggest advantage is its rich set of libraries, such as Pandas, which is well known in data analysis. Like Java, Python can also connect to multiple data sources and perform mixed computations over them.

In terms of complex calculations, Python supports procedural computation. With features such as the DataFrame, implementing a calculation in Python is much simpler than in Java, and comparable to SQL: some calculations are easier to express in SQL, while Python has the edge for complex calculations because of its procedural style.

As a result, Python is reasonably convenient for developing relatively temporary (one-off) reports. However, if such reports are to be used persistently, they may have to be redeveloped with other techniques when integrated into the application system, because Python is hard to integrate and deploy with mainstream applications (usually Java applications). Inter-process communication between multiple processes may also be required, so performance and stability suffer, which greatly limits the scenarios where Python can prepare report data.
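
As an illustration of that integration cost, here is a minimal sketch (the script name and the CSV-over-stdout handoff are assumptions, not a standard mechanism) of a Java report application invoking a Python data preparation script as a separate OS process:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class PythonBridge {
    // run a hypothetical prepare.py and read the prepared rows back as CSV text
    public static List<String[]> prepareData(String script) throws Exception {
        Process p = new ProcessBuilder("python", script).start();
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                rows.add(line.split(","));  // naive CSV parsing, for illustration only
            }
        }
        if (p.waitFor() != 0) {
            throw new RuntimeException("Python data preparation failed");
        }
        return rows;
    }
}

Every report query pays for process startup, serialization and parsing, and a failure on the Python side must be detected and handled by the Java side, which is exactly the performance and stability problem described above.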

These technologies all have problems of one kind or another when it comes to report data preparation. Is there any other option, then?

The ideal choice: esProc SPL

The open-source esProc SPL can complete report data preparation and deal with all the above challenges very well.

esProc SPL is an open-source data processing engine whose computing power does not depend on databases. It is good at structured data computation and has a rich computing class library that covers all kinds of data preparation work. SPL naturally supports multiple data sources (RDB, NoSQL, JSON, CSV, WebService, etc.) and can perform cross-source mixed calculations. Its agile syntax and procedural computation make it possible to implement complex data preparation tasks quickly and cope with endless requirements easily.

Open multi-source support

Unlike a database, which must store data before it can calculate, SPL computes directly on diverse data sources, which allows it to exploit the respective strengths of each source (for example, files have high IO efficiency, NoSQL can store document data, and RDBs have strong computing power).

Currently, SPL directly supports dozens of data sources (with more being added), covering almost every source you may encounter, well known or not. SPL can not only connect to them and fetch data, but also perform cross-source mixed computations, making full use of the advantages of each source while preparing data for the report.


Agile syntax and procedural calculation

SPL's agile syntax is designed specifically for structured data computation. It naturally supports procedural computation, so data processing can be implemented in multiple steps, following natural thinking, and it provides a rich computing class library as well; complex computations are therefore easier to implement in SPL.

Let's look at a piece of code to see the effect:

[Calculation objective] Find the orders of the top n customers (key customers) that account for half of total sales.


 	A
1	=file("/opt/ods/orders.csv").import@tc()
2	=A1.groups(customer;sum(amount):amount).sort(amount:-1)
3	=A2.sum(amount)/2
4	=0
5	=A2.pselect((A4=A4+amount,A4>=A3))
6	=A2.(customer).to(,A5)
7	=A1.select(A6.pos(customer))

The script first finds the qualified key customers step by step, and then queries the detailed order data of those customers. All of these calculations are performed outside the database, so other data sources, such as files, can be used as well. In terms of the implementation process, SPL's procedural computation works better and its syntax is more concise.

For the earlier calculation of the maximum number of consecutive rising days of each stock, the SPL code is as follows:


 	A	 
1	=connect@l("orcl").query@x("select * from stock_record order by ddate")	 
2	=A1.group(code)	 
3	=A2.new(code,~.group@i(price<price[-1]).max(~.len())-1:maxrisedays)	/calculate the consecutive rising days of each stock
4	=A3.select(maxrisedays>=5)	/select the records that meet the condition

First sort by trading day, then group by stock; SPL grouping does not force aggregation, so the group members can be retained. Finally, calculate the consecutive rising days within each group: consecutively rising records are gathered into one group, the maximum group size is found, and that maximum minus 1 is the maximum number of consecutive rising days. The implementation follows natural thinking throughout; there is no need to make it as convoluted as the SQL version.

SPL syntax is also more concise than Python's, making the same calculation simpler to implement. For example, for stock 000062, we want to calculate the daily rise on the three days when its closing price was highest.

Using Python to implement:

import pandas as pd

stock_file = "D:/data/STOCKS.csv"
stock_data = pd.read_csv(stock_file, dtype={'STOCKID': 'object'})
stock_62 = stock_data.query('STOCKID=="000062"').copy()
# positions of the records sorted by closing price in descending order
sort_pos = (-stock_62["CLOSING"]).argsort()
# positions of the three days with the highest closing price
max3pos = sort_pos.iloc[:3]
# shift one row so each day lines up with the previous day's record
stock_62s = stock_62.shift(1)
max3CL = stock_62["CLOSING"].iloc[max3pos]
max3CLs = stock_62s["CLOSING"].iloc[max3pos]
# rise = today's closing price / previous day's closing price - 1
max3_rate = max3CL / max3CLs - 1
print(max3_rate)

In this code, argsort(…) returns the position information after sorting. Since Python has no loop functions, the position information cannot be used within a loop calculation directly. Python therefore has to take a roundabout route: first find the closing prices of the three days with the highest price, then find the closing prices of the days preceding those three days, and finally compute the rise from the two results, which is a bit troublesome.

Using SPL to implement:


 	A	 
1	=file("D:/data/STOCKS.csv").import@tc(#1:string,#2,#3)	 
2	=A1.select(STOCKID=="000062")	 
3	=A2.psort@z(CLOSING)	/share price sort positions
4	=A3.m(:3)	/take the first 3 positions
5	=A2.calc(A4,if(#==1,null,CLOSING/CLOSING[-1]-1))	/calculate the rise by position

In this code, the psort(…) function returns the positions of members sorted from small to large, and the @z option means descending order; the calc(…) function performs positioned calculation, computing with member positions and relative positions, where CLOSING[-1] refers to the member before the current one. Overall, both the process and the way of thinking are very concise.

SPL scripts are edited in a dedicated IDE that offers a simple, easy-to-use development environment with single-step execution, breakpoints, a WYSIWYG result preview window, and so on. Editing and debugging are straightforward, and development efficiency is higher.


Seamless integration with applications

When SPL prepares data for a report, it sits between the report presentation (or BI) tool and the data sources, and it can be seamlessly integrated with the application system. This gives reporting and BI tools complex, multi-step data preparation capabilities without coupling them to the data sources, and keeps the preparation logic independent of the main application. SPL provides standard JDBC and ODBC drivers for reporting tools to call, which can completely replace the original data preparation methods.


Executing / calling an SPL script via JDBC is very simple:

...
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
// call the SPL script splscript.splx, passing two parameters
CallableStatement st = conn.prepareCall("{call splscript(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result = st.executeQuery();
...
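
The returned result set is then consumed like any other JDBC result. A minimal sketch (the columns depend entirely on what the SPL script returns, so only positional access is shown):

while (result.next()) {
    // print the first column of each row produced by the script
    System.out.println(result.getObject(1));
}
result.close();
conn.close();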

SPL is interpreted and executed, so it naturally supports hot swapping. It therefore adapts very well to changeable report modification needs: a modification takes effect immediately, without a restart.


The data processing logic lives in SPL script files (.splx) and takes effect in real time after modification. Compared with compiled languages like Java, which require restarting the server, SPL has a great advantage here.

Both SPL and SQL are dedicated structured data computing languages, but SPL is more open and its syntax is more concise. Compared with Java and Python, SPL also has the advantage in development efficiency, integration, and hot swapping, so SPL is better suited to data preparation for reporting and analysis.