About Using SQL for Big Data Computing

SQL is still the mainstream technology for processing structured data on today's big data platforms. Becoming SQL-compatible has some obvious advantages. SQL is widely applicable and commonly used, so adopting it avoids learning costs. Many front-end applications support SQL, so a SQL-based big data platform fits easily into the existing computing ecosystem. And since the conventional databases that a big data platform intends to replace are also SQL-based, compatibility is good and migration cost is relatively low.

But there are disadvantages in sticking with SQL. The main difficulty lies in obtaining the high performance that big data computing badly needs.

SQL lacks some essential data types and operation definitions, making it unable to express certain high-performance algorithms. We have to rely instead on practical optimization of the computing engine. After decades of development, conventional commercial databases have accumulated a rich variety of optimization methods, yet many computing scenarios remain hard to optimize: a theoretical problem can't be solved at the practical level. Compared with these mature databases, the emerging big data platforms are far less experienced in practical optimization and hold no algorithmic advantage; they boost performance mainly by scaling out the cluster. Moreover, as a descriptive language, SQL isn't good at defining an execution path, yet a specially optimized execution path is exactly what high performance requires. Specifying one demands external intervention through many special qualifiers (procedural syntax is more direct in this respect), which hinders producing high-performance code in SQL, as the sketch below illustrates.
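To make the point about execution paths concrete, here is a minimal, hypothetical sketch; the Product/Order records, sizes and filter condition are invented for illustration and are not from any particular product. In procedural code the programmer directly chooses the path: filter the smaller table first, build an in-memory lookup on it, then probe with the larger table. In SQL the same query is one declarative statement whose join order, filter placement and index use are left to the optimizer, steerable only indirectly through vendor-specific hints.

```java
import java.util.*;

// A hedged, self-contained sketch of explicitly chosen execution steps.
// The data and the "price < 10" condition are made up for illustration.
public class ExplicitPath {
    record Product(int id, String category, double price) {}
    record Order(int productId, int quantity) {}

    public static void main(String[] args) {
        List<Product> products = List.of(
                new Product(1, "book", 12.0), new Product(2, "toy", 8.5));
        List<Order> orders = List.of(
                new Order(1, 3), new Order(2, 2), new Order(1, 1));

        // Step 1 (chosen explicitly by the programmer): filter the small table first.
        Map<Integer, Product> cheap = new HashMap<>();
        for (Product p : products)
            if (p.price() < 10) cheap.put(p.id(), p);

        // Step 2 (chosen explicitly): probe with the large table and accumulate.
        double total = 0;
        for (Order o : orders) {
            Product p = cheap.get(o.productId());
            if (p != null) total += p.price() * o.quantity();
        }
        System.out.println(total);
    }
}
```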

In its early days, SQL was designed to match the then-limited hardware capabilities in order to remain viable, and this has become the cause of its awkwardness in adapting to contemporary hardware: large memory, many parallel threads, and clusters. A SQL JOIN is performed by key values; with a large memory, however, it could be done by addresses, without calculating and comparing hash values, which considerably increases performance. A SQL data table is unordered; while segment-based parallel processing is easy for computations over a single table, synchronous, dynamic segmentation for a multi-table JOIN is hard to achieve. Generally each of the tables has to be divided into a number of segments in advance, making it impossible to set the number of parallel threads according to the machine load on an ad hoc basis. And SQL does not theoretically differentiate between dimension tables and fact tables; it simply defines a JOIN as a filter over the Cartesian product. So in cluster computing, a JOIN of multiple big tables inevitably involves a large amount of hash shuffling that consumes network resources heavily. With a node-intensive cluster, the high network latency can outweigh any benefit gained.
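The address-based join mentioned above can be sketched as follows, assuming both tables fit in memory; the Customer/Order classes and the pre-association step are hypothetical and not any product's actual implementation. The idea is that foreign keys are resolved into object references once, after the data is loaded, so that later join-style computations traverse pointers instead of recomputing and comparing hash values on every query.

```java
import java.util.*;

// A hedged sketch of an address-based (pointer) join with in-memory data.
// The classes and data are invented for illustration.
public class PointerJoin {
    static class Customer {
        int id; String city;
        Customer(int id, String city) { this.id = id; this.city = city; }
    }
    static class Order {
        int customerId; double amount;
        Customer customer;  // resolved reference, filled in once after loading
        Order(int customerId, double amount) { this.customerId = customerId; this.amount = amount; }
    }

    public static void main(String[] args) {
        List<Customer> customers = List.of(new Customer(1, "Paris"), new Customer(2, "Rome"));
        List<Order> orders = List.of(new Order(1, 10.0), new Order(2, 20.0), new Order(1, 5.0));

        // One-time pre-association: hash the dimension table once and
        // replace each foreign key with a direct object reference.
        Map<Integer, Customer> byId = new HashMap<>();
        for (Customer c : customers) byId.put(c.id, c);
        for (Order o : orders) o.customer = byId.get(o.customerId);

        // Every later "join" just follows the pointer; no hash values are
        // computed or compared at query time.
        double parisTotal = 0;
        for (Order o : orders)
            if ("Paris".equals(o.customer.city)) parisTotal += o.amount;
        System.out.println(parisTotal);
    }
}
```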

A brand-new algebra system and computational model could overcome those weaknesses of SQL and obtain high performance, both by expressing high-performance algorithms conveniently and by making good use of the available hardware capabilities.

But with a non-SQL external interface, the new computing engine won't be compatible with existing SQL code.

We are not talking about the NoSQL databases here. They are not intended for high-performance computations. Rather, they sacrifice computational ability for scale-out ability, which actually goes against the trend of big data computing, where computations are becoming more and more complex.

Is there any way to have both high performance and good compatibility? For example, adopting a new core model on top of which SQL syntax is interpreted, or migrating SQL code automatically onto the new system.

Theoretically such interpretation or migration is feasible: despite the heavy workload, it isn't difficult to interpret SQL code or move it onto another system. But either way only achieves syntax compatibility, not high performance. High-efficiency code can only be produced by exploiting features of the computational model, and since SQL syntax doesn't carry that information, the practical optimization problem described earlier remains even after migration; migration alone can't attain the best effect. In reality, even between two SQL database products that implement the syntax differently, it is barely possible for one to interpret the other's syntax, or to move code between them, without compromising performance. Though every emerging big data platform vendor volunteers migration technology to reduce the cost of moving code off conventional databases, few succeed.

Well, what’s the best choice for big data platform product vendors?

For products with a short- or medium-term perspective, it's reasonable to stick with SQL in order to roll out fast, with performance increased mainly by adding hardware capacity or scaling out the cluster. For products with a long-term perspective, my opinion is that achieving high performance requires replacing the computational model based on relational algebra with one based on a better mathematical theory, and thus abandoning SQL, since the new model won't be able to support SQL either. It takes a long time to establish and mature a new computing ecosystem, but those who have the patience will be handsomely rewarded.