How to Group Huge Volume of Data Fast

How to Group Huge Volume of Data Fast

 

Key words: Slow database grouping queries  Big data  Storage format

 

Grouping queries are all too familiar in database querying. You have to do a full traversal no matter how you arrange or index data. As an index is useless, storage format becomes the key factor that affects the traversal speed. Though databases use the binary storage format, their IO performance is not even satisfactory. Traversal of database data is fast but retrieving external data is very slow.

Take Oracle for an example. The source data looks like this:

ORDERID  CLIENT SELLERID  AMOUNT  ORDERDATE NOTE

1  287  47  5825  2013-05-31  gafcaghafdgie f ci…

2  89  22  8681  2013-05-04  gafcaghafdgie f ci…

3  47  67  7702  2009-11-22  gafcaghafdgie f ci…

4  76  85  8717  2011-12-13  gafcaghafdgie f ci…

5  307  81  8003  2008-06-01  gafcaghafdgie f ci…

6  366  39  6948  2009-09-25  gafcaghafdgie f ci…

7  295  8  1419  2013-11-11  gafcaghafdgie f ci…

8  496  35  6018  2011-02-18  gafcaghafdgie f ci…

9  273  37  9255  2011-05-04  gafcaghafdgie f ci…

10  212  0  2155  2009-03-22  gafcaghafdgie f ci…

The actual data volume (25G) exceeds Oracle’s available maximum memory capacity (12G). To perform group & aggregation with Oracle’s Parallel Execution, the SQL query is:

select /*+ Parallel(8) */ client,sellerid,count(orderid),sum(amount) from orders group by client,sellerid

It takes 210 seconds to execute the SQL query.

Then we try handling the grouping over same data under same environment with esProc using the following script. And the execution time is about 65 seconds!


A

1

=$(esProcOdbc) select /*+ parallel(8) */   client,sellerid,count(orderid),sum(amount) from orders.btx group by   client,sellerid

How does esProc get it done so fast? It stores data in the bin file (esProc’s own binary file format). Data storage format determines the traversal performance. Below shows features and performance ranking of different storage formats:

Storage format

Feature

Performance ranking

Binary

Smallest space usage & fastest parsing speed

1

Text

Universal in representation but bad in performance

2

Database

Binary; bad IO performance / Fast database data   traversal & slow external data retrieval

3

So, storage format should be the first and foremost concern in performance optimization for big data processing. To save the precious database space and increase computing performance, we can read data that used only for OLAP analysis out of the database to be stored in more efficient format. Read Performance Optimization - Traversal to learn more.

An esProc SPL script can be easily embedded into a Java program. Read How to Call an SPL Script in Java to learn details.

Read Getting Started with esProc to download and install esProc, get a license for free and find related documentation.