Are Wide Tables Fast or Slow?

 

Wide tables are usually a standard component of the BI system. Many BI projects will first prepare wide tables at the beginning of construction. A wide table is formed by joining up multiple tables that have a certain association relationship. The result set does not conform to the normal forms; and there is a large amount of redundant data. Moreover, as wide tables need to be pre-created, they are not so flexible to use.

But why do people very much prefer wide tables even if they have many shortcomings?

Because wide tables are FAST!

Querying data in the wide table is usually faster than performing the real-time multi-table join. So, building wide tables is to avoid joins. Join operations are a long-standing problem in SQL. They are difficult to write and has poor performance. Find detailed analysis about SQL joins HERE.

However, though wide tables help avoid joins, too much extra data may be read during the computation because there is data redundancy and this increases I/O time. For example, there is an Orders table where each order corresponds to 5 records in OrderDetails table. By stretching the two to a wide table, data in Order table will repeat 5 times. What’s more, the Orders table has dimension tables such as Customer and Employee, and Customer table has dimension tables including Region, and so on. When all these tables are extended to form a wide table, the entire data volume will be enlarged many times. To perform a query on that wide table, such as summing up order amounts by customer’s region, a large volume of data will be retrieved and I/O overhead is huge.

According to the above analysis, wide tables should have been slower. Why they are faster in the real-world practice? This is because relational database joins are too slow. Even if the wide table IO cost increases several times, the query is still faster than the real-time joins.

If we can do some optimization to make the join run faster, can we get satisfactory performance while avoiding a series of wide table problems including redundant data, error from result set that does not conform to normal forms and stiffness?

The answer is yes. But it is a pity that SQL cannot do that.

In the above document link, there are already detailed analysis about the SQL join. In a nutshell, the JOIN defined by Cartesian product is indeed very simple and the simple connotation gets broader denotation to cover various JOIN scenarios. But a too general definition makes it impossible to perform targeted optimization on different join operations. People can only think of some temporary solutions in engineering, but cannot fundamentally solve the problem.

Here’s another fact. Since the debut of the database, optimization methods for simple SQL used for BI analysis have been stretched to the limit by various vendors, but even so wide tables are needed to solve the performance problem. It can be seen that it is difficult to deal with – we can even say – impossible to solve the performance problem of join operations.

Is there anything we can do?

We can use SPL to tackle the problem.

SPL (Structured Process Language) is an open-source computing engine intended for structured data computations. It offers powerful computing ability independent of databases. The performance of handling join operations in SPL is much higher than those of both the SQL join and the wide table-based join. The language addresses the root of join operation performance problem as well as avoiding problems brought by wide tables.

The commonly seen equi-joins in BI analyses are categorized into two types in SPL – foreign key join and primary key join. Each is provided their own performance optimization methods, which is explained in the second half of the above-mentioned post about Join Simplification and Acceleration. SPL specifically offers dimension table preload method and numberization method for the common foreign key joins and order-based merge method for primary key joins that help to significantly reduce the join operation complexity.

Once we speed up the join operation, wide tables become useless and the volume of data to be read is reduced. The result is that SPL greatly increases BI performance.

That’s theoretical explanations. How best does SPL’s field performance?

A comparison test was performed. The test includes a common aggregation by dimension in multidimensional analysis after a join between a fact table and multiple, multilayer dimension tables, and an aggregation based on the wide table.

The test data is based on a data set of TPCH 100G and a computation involving a join between one large fact table and multiple dimension tables is designed:

  1. Two table join between one fact table and one dimension table;
  2. Seven-table join involving joins between a primary-sub fact table and four dimension tables, during which a dimension table is used twice;
  3. Convert the seven-table join result to a wide table and perform wide-table-based aggregation.

The products used for performing the test are two specialized OLAP databases – StarRocks and Clickhouse, which are famous for high-performance BI analysis. Below is the test result:

Time unit: Second

8C32G 4C16G
Two-table join Seven-table join Wide table Two-table join Seven-table join Wide table
SPL 11.5 30.6 57.7 21.5 55.6 114.2
Starrocks 35.1 73.3 62.1 78.8 152.5 129.9
Clickhouse 89.3 Memory overflow 33.2 204.1 Memory overflow 74.3

Find detailed test report HERE.

According to the test result, the SQL’s wide table is faster than the join, which verifies the previous analysis. The performance of SPL’s wide table is actually not as fast as ClickHouse, but its real-time join performance is very high. It is higher than joins performed in the two SQL databases (3-9 times faster), and even surpasses the database’s wide table method by huge margins. If we take wide table’s shortcomings into account (redundant data, data error and stiffness), the advantage of SPL’s real-time join becomes more obvious. It not only avoids wide table defects but increases performance by N times.

The wide-table-based join is not necessarily faster than the real-time join! Because of SPL, costly wide tables created to obtain high performance in the BI system become useless.