RaqForum
Invest More Beyond Tools to Improve Reporting Performance
As a type of front-end data service intended for end users, reports in applications draw close attention to their performance. Every user wishes to enter parameters and get query and aggregation results back in no time. A wait of under twenty seconds is within a tolerable range, but three to five minutes will exhaust a user's patience and make for the worst product experience.
But why is a report so slow? And how shall we optimize its performance?
Usually a report in an application is developed with a reporting tool. Whenever the reporting process is slow, users tend to point their fingers at the report developers or the reporting tool vendor. In fact, in most cases the slowness is only a symptom; the real disease lies in the data preparation stage. Data handling is already slow before the reporting stage even begins, so simply optimizing the report template, or pushing the reporting tool harder, won't help.
Reports are for viewing. Since the naked eye can take in only so much, there's no need for a report to present a large amount of data. The descriptive computing model that reporting tools adopt, though intuitive, is not suited to complex computations. It's unreasonable to force a reporting tool to handle big data computing or complex computations. Reporting tools are only good at arranging small data sets and performing simple calculations, which doesn't take long.
About 80% of slow reports are caused by poor data preparation. Though a report presents only a very small amount of data, the original data for generating it can be huge. Filtering and aggregating a massive volume of data down to a small one takes a long time. The same goes for a complex computation, whose most time-consuming part is also the data preparation. So data preparation is the bottleneck in speeding up report building. Ways to optimize the data preparation process are:
1. Optimizing the data preparation code. Generally this means optimizing SQL or stored-procedure code; sometimes it is application code (when non-database data sources or multiple databases are involved).
2. Scaling up/out the database, such as clustering, when the data amount is huge and code optimization has nothing more to offer.
3. Employing a high-performance computing engine. We can replace a traditional database that performs poorly, or is too costly, for certain operations with a better computing mechanism.
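The first of these points – optimizing data preparation code – often comes down to pushing filtering and aggregation into SQL instead of dragging raw detail rows into the application. A minimal sketch of that contrast, using an in-memory SQLite database with hypothetical table and column names:

```python
import sqlite3

# In-memory database standing in for the report's data source;
# the table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("East", 100.0), ("East", 50.0), ("West", 70.0)])

# Slow pattern: pull every detail row into the application, then aggregate.
rows = conn.execute("SELECT region, amount FROM orders").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# Faster pattern: push the aggregation into SQL so only the small
# result set crosses the wire to the reporting layer.
pushed = dict(conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"))

assert totals == pushed  # same answer, far less data transferred
```

The payoff grows with data volume: the aggregated result stays tiny no matter how many detail rows the table holds.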
The computational performance problem can also arise during the report presentation stage.
A typical scenario is building a multi-source report, which requires joining and aligning multiple two-dimensional data sets by primary key, and sometimes grouping and aggregating them. As the reporting tool requires that all calculations be written in cells, the relationship between a cell and another data set can only be expressed through dataset filtering, as in the formula ds2.select(ID==ds1.ID). This calculation has quadratic complexity. That has little impact when the data amount is very small, but a serious one when there are more rows (say, thousands) and many data sets: the calculation can take anywhere from seconds to dozens of minutes.
But the time can be significantly reduced if these calculations are handled during the data preparation stage, outside of the reporting tool. When the data sets come from the same database, we just write a SQL JOIN; when they come from different databases, or we want to spare the database the computational pressure, we can use a hash join algorithm outside the database. The hash join treats the data from multiple databases as a whole and is far more efficient than the filtering join the reporting tool uses, achieving an almost instant response on a data set of several thousand rows.
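The difference between the two join styles can be sketched in a few lines. The data set and field names below are made up for illustration; the point is that the filtering join scans the whole second data set per row, while the hash join indexes it by primary key once:

```python
# Two data sets, as if retrieved from different databases;
# the names and fields are hypothetical.
ds1 = [{"ID": i, "name": f"cust{i}"} for i in range(5)]
ds2 = [{"ID": i, "total": i * 10} for i in range(5)]

# Filtering join, as a reporting-tool cell formula effectively does it:
# for each ds1 row, scan all of ds2 -- quadratic in the row count.
slow = [(r1["name"], r2["total"])
        for r1 in ds1 for r2 in ds2 if r2["ID"] == r1["ID"]]

# Hash join: index ds2 by primary key once, then probe in O(1) per row.
index = {r["ID"]: r for r in ds2}
fast = [(r1["name"], index[r1["ID"]]["total"])
        for r1 in ds1 if r1["ID"] in index]

assert slow == fast  # same result, linear instead of quadratic work
```

At a few thousand rows per data set, the quadratic version does millions of comparisons while the hash version does a few thousand lookups.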
On the surface, the poor performance is a report-presentation-stage issue, but its solution is often found beyond the reporting tool.
Another scenario is a grouped and aggregated report with detail rows. Its problem is slow computation caused by handling a large amount of data during the report presentation stage. The solution is the same – moving the data computing out of the reporting tool.
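A minimal sketch of that data-preparation step, with hypothetical field names: the grouping and subtotals are computed before the rows ever reach the reporting tool, which is then left with pure layout work.

```python
from itertools import groupby
from operator import itemgetter

# Detail rows for the report, already sorted by the grouping key
# (groupby requires sorted input); field names are hypothetical.
details = [
    {"dept": "A", "emp": "Ann", "salary": 100},
    {"dept": "A", "emp": "Bob", "salary": 120},
    {"dept": "B", "emp": "Cy",  "salary": 90},
]

# Build (group header + subtotal, detail rows) records during data
# preparation, so the reporting tool only lays the result out.
prepared = []
for dept, group in groupby(details, key=itemgetter("dept")):
    rows = list(group)
    subtotal = sum(r["salary"] for r in rows)
    prepared.append({"dept": dept, "subtotal": subtotal, "rows": rows})
```

In practice this would run over a database cursor rather than an in-memory list, but the shape of the output handed to the reporting tool is the same.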
Data transmission is another bottleneck for reporting performance.
Currently many applications are J2EE-based, reporting tools are developed in Java, and JDBC is used for database access. The problem is that the JDBC drivers of some popular databases perform badly (we'll omit the brands here), so it takes fairly long to retrieve even a moderately large amount of data (tens of thousands of rows). Here's the paradox: the database is under light pressure and computes quickly, and the reporting tool handles a relatively simple computation, yet the overall reporting process is still slow.
Neither application vendors nor reporting tool vendors can improve a database's JDBC performance; what they can do is work around it. According to tests, data retrieval performance can be increased several times over by multithreaded retrieval (provided the database is lightly loaded). No reporting tool, however, has ever provided parallel retrieval directly – it's difficult to make it a built-in function because the data segmentation plan varies with the database product, as does the retrieval syntax, so it needs to be controlled by code. Once again, the solution is to implement the parallel retrieval during the data preparation stage, beyond the reporting tool.
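The structure of such parallel retrieval can be sketched as follows. The `fetch_segment` function is a stand-in: in a real application each call would run something like `SELECT ... WHERE id BETWEEN lo AND hi` over its own database connection, and the segmentation plan would be tailored to the database product.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical segment fetch; in reality each call would open its own
# JDBC/DB connection and run a range query for its key segment.
def fetch_segment(lo, hi):
    return [{"id": i} for i in range(lo, hi)]

# Split the key range into segments and retrieve them in parallel.
# The segmentation plan (here: fixed-size id ranges) is exactly the
# part that varies per database and so has to be controlled by code.
segments = [(0, 250), (250, 500), (500, 750), (750, 1000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = pool.map(lambda seg: fetch_segment(*seg), segments)

# map() preserves segment order, so reassembly keeps rows in key order.
rows = [row for part in parts for row in part]
```

With real network and driver latency per segment, the four fetches overlap instead of running back to back, which is where the severalfold speedup comes from.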
If we cache the results of recent computations for generating a report, then when other users query with the same parameters there is no need to re-compute; the cached results can be returned directly, which obviously improves user experience. Many reporting tools offer this kind of caching, but not for intermediate results: only the final result of a report's computations can be cached, and each report's cache is separate and unconnected. Two improvements can be made:
1. Partial caching. For certain types of reports, particularly the common multi-source reports, most of the original data involved is static (historical data) and only a small part is dynamic (current data). The lifetime of a whole-report cache is dictated by its most short-lived data, which leads to repeated computations. Caching parts of the data separately extends their cache validity and thus reduces the amount of computation.
2. Reusable caches. Different reports may reference the same data, but a fragmented, per-report cache mechanism results in repeated computations over that same data. If a cached computational result can be shared across all reports, the amount of computation may be considerably reduced.
Controlling caches is complicated and requires coding. It's difficult to provide such functionality within a reporting tool, but relatively easy to implement during the programmable data preparation process.
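Both improvements can be sketched with a small cache class (entirely hypothetical, not from any particular product): results are keyed by query text and parameters rather than by report, so any report issuing the same query reuses them, and each entry carries its own time-to-live so stable historical data can be cached longer than current data.

```python
import time

# Minimal cross-report cache sketch: keys are (query, parameters), so
# the cache is shared by every report that issues the same query, and
# per-call TTLs allow partial caching (long TTL for historical data,
# short TTL for current data).
class SharedCache:
    def __init__(self):
        self._store = {}

    def get(self, sql, params, compute, ttl):
        key = (sql, tuple(sorted(params.items())))
        hit = self._store.get(key)
        if hit is not None and time.time() - hit[1] < ttl:
            return hit[0]                 # still valid: reuse
        result = compute()                # recompute and re-cache
        self._store[key] = (result, time.time())
        return result

cache = SharedCache()
calls = []
def run_query():
    calls.append(1)                       # count real computations
    return [("East", 150.0)]

# Two different reports issuing the same query share one cached result.
r1 = cache.get("SELECT ...", {"year": 2023}, run_query, ttl=3600)
r2 = cache.get("SELECT ...", {"year": 2023}, run_query, ttl=3600)
assert r1 == r2 and len(calls) == 1       # computed only once
```

A production version would add eviction and thread safety, but the keying and TTL decisions above are the part the article argues belongs in the data preparation layer.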
As mentioned earlier, the relationship between reporting and big data isn't as strong as the many reporting tool vendors advertising "big data reports" claim it to be.
There is indeed one type of report that can be called a big data report: the list report. A list report, which lists transaction details over a certain period of time, is frequently seen in the financial sector. It features a huge amount of data spanning thousands, even tens of thousands, of pages. The computations, however, are relatively simple: mostly just listing data, with group or page summaries at most.
Reporting tools generally adopt a full in-memory storage strategy in order to manage inter-cell computations flexibly. Loading a big list into the reporting tool is thus very likely to cause memory overflow, and retrieving and loading the whole list takes intolerably long besides.
An obvious idea is to present data while it's being retrieved, page by page, which removes the danger of memory overflow and shortens the wait. All databases provide cursors to retrieve data part by part. But users may flip pages on the front end, which requires retrieving data by an arbitrary page number (i.e., by rows), and databases don't offer an API for that. Retrieving data with conditional filtering is not only slow but can't ensure data consistency.
There's no better way than to get it done during the data preparation stage. We can devise two asynchronous threads: one retrieves data from the database and buffers it to external storage (supposing there's too much data for memory); the other receives front-end requests and returns data loaded from the buffer by arbitrary page number (i.e., by rows).
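The two-thread scheme can be sketched as below. For simplicity the buffer is an in-memory list rather than external storage, and a plain loop stands in for the database cursor; the division of labor is the point.

```python
import threading

PAGE_SIZE = 20
buffer, lock = [], threading.Lock()
done = threading.Event()

# Retrieval thread: in a real application this loop would fetch rows
# from a database cursor and spill them to external storage; here a
# list and a counter stand in for both.
def retrieve():
    for i in range(1000):                 # hypothetical result set
        with lock:
            buffer.append({"id": i})
    done.set()                            # signal end of the cursor

# Serving side: return any page by number once its rows are buffered,
# without waiting for the whole retrieval to finish.
def get_page(page_no):
    lo = (page_no - 1) * PAGE_SIZE
    while True:
        with lock:
            if len(buffer) >= lo + PAGE_SIZE or done.is_set():
                return buffer[lo:lo + PAGE_SIZE]

t = threading.Thread(target=retrieve)
t.start()
page3 = get_page(3)                       # random access: rows 40..59
t.join()
```

Because retrieval runs through the cursor exactly once, every page comes from the same consistent snapshot – the consistency guarantee that repeated conditional queries cannot give.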