How So Many Intermediate Tables in the Database Come into Being?

 

Intermediate tables are data tables created specifically in a database to store intermediate computational results obtained by processing the original data. Usually, they are summary tables feeding aggregate values to future front-end queries and analyses so that the latter can run faster or more conveniently.  

It seems that there are mainly three reasons for the existence of intermediate tables:

More than one step is needed to achieve the computing goal

An original data table in the database must undergo a series of computations before it is displayed in the desired form with a report. One SQL query alone cannot achieve this. We need multiple continuous SQL queries. Each query produces an intermediate table for use by the next one.

Extremely long wait time for real-time computing

Users often need to wait a ridiculously long time when the original data table is large or computations are too complicated. They choose to run batch jobs and store result sets in intermediate tables at night. And it’s much faster to perform queries based on the intermediate tables in the daytime.

Diverse data sources

Conventionally, data coming from files, NoSQL databases, WebService, etc., needs to be first loaded to the database and stored as intermediate tables before it combines with the database data for mixed computing.

Intermediate data is inevitable and necessary. But why is it stored in the database as intermediate tables:

Because we need the database’s computing abilities, intermediate data will be further processed for subsequent use, and sometimes the processing is complex. Only databases (SQL actually) have convenient-to-use capabilities.

As times goes by, intermediate tables accumulate in databases. The number of them even reaches tens of thousands in some large systems’ application databases. Why on earth are there so many of them? What influences will they bring?

Generally, intermediate tables are used by front-end queries or reporting services and one query corresponds to one intermediate table. The reporting service needs many queries and thus produces a lot of intermediate tables. After an intermediate table is created, it will probably be used by multiple queries or applications, and deleting it could disable queries due to database’s flat structure. Sometimes we even cannot know which applications are using the intermediate table. It is kept permanently for fear of serious consequences. No wonder that a huge number of intermediate tables accumulate over a long period.

A mass of intermediate tables has consequences. They occupy a lot of database storage space, leading to shortages in database storage and the need to ramp up capacity. As database space is expensive, scaling up storage is costly and also has limitations. The cost-ineffective solution is not a good choice for getting rid of the intermediate-table problem.

A crowded database becomes slower. Intermediate tables do not come from anywhere but are generated from the original table through a series of computations, which consumes an enormous amount of database computing resources. Generating a lot of intermediate tables will thus result in slow queries and low performance.

Databases are not the only place intermediate data can be stored. Files are a better choice in terms of obtaining higher IO performance. By using the flexible compression algorithm, we can make better use of parallel processing. The disadvantage is that files do not have computing abilities. Users need to write hard code in Java to implement computations, and this is far more inconvenient than SQL-based databases.

esProc SPL is the ideal solution to the problem caused by the accumulation of intermediate tables. As a special open-source data computing engine, esProc has computational capabilities as good as databases but independent of them. It can perform computations directly on files and other various data sources.

undefined 

With its distinct features, esProc provides solution to the database storage and performance issues caused by the build-up of intermediate tables.

File processing

SPL supports computing files directly (Excel, CSV, JSON and SPL binary files), and stores data of as many intermediate tables as possible in the original database in external files to reduce database load.

Diverse data source support

Besides files, SPL also supports various other data sources, including RDB, NoSQL, Hadoop, RESTApi to name a few. This makes it unnecessary to load data from different data sources to the database and store it as intermediate tables. Instead, it will perform mixed computing directly based on them to feed data to queries.

SPL agile syntax

Yet that is not enough for scenarios requiring complex computations (like Java). SPL’s agile syntax shows more powerful computational abilities than Java in those cases, and is simpler than SQL on many occasions. The more complex the computation, the better play its advantages are given.

Here’s an example:

We are trying to find detailed order information for the first n big customers whose sales amount amounts to half of the total.


A

1

=file(“/opt/ods/orders.csv”).import@tc()

2

=A1.groups(customer;sum(amount):amount).sort(amount:-1)

3

=A2.sum(amount)/2

4

=0

5

=A2.pselect((A4=A4+amount,A4>=A3))

6

=A2.(customer).to(,A5)

7

=A1.select(A6.pos(A1.customer))

SPL’s stepwise code finds the eligible big customers and then gets their detailed order information. The whole process is much simpler than that of implementing the same structured data computation in native Java and far more concise than the SQL one involving nested queries.

SPL is interpreted execution and supports hot deployment.

High performance

SPL encapsulates a variety of high-performance algorithms, including order-based calculations, multilevel indexing and parallel processing, and offers proprietary high-performance binary storage formats BTX and CTX to further improve computing performance.

esProc can act as an embedded JDBC to be integrated into an application, which, as it accesses a database, will access an SPL result set through JDBC.

undefined 

By storing intermediate data in a tree-structure file system outside databases, esProc, after it integrates with an application, enables the “external intermediate tables” to be distributed along with the latter, reducing coupling between the application and the database, avoiding sharing conflicts when one intermediate table is used by multiple applications, and greatly facilitating system operation.