New Algebra, New Database
[How are we developing databases?]
In recent years, many enterprises have joined the ranks of database developers, especially Internet companies, for whom traditional commercial databases are either unable to meet their needs or unaffordable. Driven by their own requirements, these enterprises began developing databases, and some have reached world-leading levels in certain respects. However, most of the emerging database vendors are merely optimizing in engineering practice, making their products better suited than traditional databases to a particular kind of scenario, without gaining any overwhelming advantage. The technical problems they face are only relieved to some extent, not fundamentally solved; some of these efforts can only be called remodeling, not optimization. For example, NoSQL products improve throughput, but they seriously sacrifice computing power and give up the consistency guarantees that matter on some occasions; some NewSQL products compromise consistency for scalability; and Hadoop, the popular big data platform of recent years, originally tried to overturn the traditional relational data warehouse, but has come back around to SQL.
So, how are we developing our database?
We are inventing new math!
What kind of math are databases using now?
At present, the mainstream databases are relational databases, so called because their mathematical foundation is relational algebra, one of the few branches of mathematics devised specifically for computer science.
Relational algebra was invented nearly 50 years ago, and the application requirements and hardware environment of that time are quite different from today's. Isn't it too old to keep using a 50-year-old theory to solve today's problems? Yet the reality is that, because there are too many existing users and no mature alternative, SQL (based on relational algebra) is still the most important programming language for databases today. Although there have been some improvements over the past few decades, the root has not changed. Facing modern complex requirements and hardware environments, relational databases are no longer so handy.
It’s hard for old bottles to hold new wine.
To clarify this problem, we need to look at what a database actually does.
Database has a "base" in its name, which makes people think it is mainly for storage. In fact, it is not: simple storage is easy to solve. There are two things a database really does: computation and transactions! Storage also serves these two things.
Let's start with computation, which is what we usually call OLAP capability. Here I prefer the word computation to analysis (the A in OLAP): the concept of computation is broader and more pragmatic, covering not only addition and subtraction but also search and association.
What kind of computing system is good?
Two goals: easy coding and running fast.
Easy coding means letting programmers write code quickly, so that more work gets done in the same time. Running fast is even easier to understand: we certainly hope to get results in a shorter time.
So, how is relational database, or SQL, doing in these two aspects?
SQL statements look very similar to English; some queries can almost be read and written as English sentences (there are many examples on the Internet, so none are given here). Should this be regarded as meeting the requirement of easy coding?
Wait a moment! The SQL statements we see in textbooks are often only two or three lines long. Such statements really are easy to write, but what happens with a slightly more complicated task?
I often give a simple example: for how many consecutive days, at the longest, has a stock been rising? Written in SQL, it looks like this:
SELECT MAX(ContinuousDays) FROM
  (SELECT COUNT(*) ContinuousDays FROM
    (SELECT SUM(UpDownTag) OVER (ORDER BY date) NoRisingDays FROM
      (SELECT date,
         CASE WHEN price > LAG(price) OVER (ORDER BY date) THEN 0 ELSE 1 END UpDownTag
       FROM stock))
   GROUP BY NoRisingDays)
The working principle of this statement will not be explained here; programmers can try to work it out on their own.
This puzzle used to be a recruitment test at my company, and the pass rate was less than 20%. Because it was too difficult, we later turned it around: candidates were given the SQL statement and asked to explain what it computes, and the pass rate was still not high. What does this mean? It means that SQL is hard both to understand and to write!
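For contrast, the computation itself is simple when stated procedurally. Here is a minimal single-pass sketch in Python (illustrative only, not part of the original puzzle; it counts consecutive rises, and that streak-counting convention is an assumption):

```python
def longest_rising_days(prices):
    """Length of the longest run of consecutive price rises.

    One pass over the list: extend the current streak when the price
    goes up, reset it when it does not, and remember the maximum.
    """
    longest = current = 0
    for prev, cur in zip(prices, prices[1:]):
        current = current + 1 if cur > prev else 0
        longest = max(longest, current)
    return longest
```

The contrast with the nested SQL above is the point: the procedural description follows the natural order of thinking, while SQL forces the streak logic into window functions and derived tables.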
Let's take another look at the problem of running fast, with another simple example I often use: get the top 10 out of 100 million pieces of data. This task is not complicated to write in SQL:
SELECT TOP 10 x FROM T ORDER BY x DESC
However, the execution logic corresponding to this statement is to sort all the data first, take the top 10, and discard the rest. As we all know, sorting is a very slow operation that traverses the data many times; if the data is too large to fit in memory, external storage is needed as a buffer, and performance declines sharply further. Executed strictly according to the logic this SQL expresses, the operation can never run fast. Yet many programmers know that this task requires neither a full sort nor an external buffer: a single traversal with a little memory is enough; that is, a higher-performance algorithm exists. Unfortunately, we cannot write such an algorithm in SQL. We can only hope the database optimizer is smart enough to turn this SQL into a high-performance algorithm, and in complex situations the optimizer is not always reliable.
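The single-traversal algorithm alluded to above can be sketched in Python with a small min-heap that holds only the current 10 candidates (an illustrative sketch of the general technique, not what any particular database optimizer does):

```python
import heapq

def top_n(stream, n=10):
    """Return the n largest values from a stream, in descending order.

    One pass over the data, O(n) memory: a min-heap keeps only the n
    best values seen so far, so no full sort and no external buffer.
    """
    heap = []  # min-heap of at most n items
    for x in stream:
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # evict the current minimum
    return sorted(heap, reverse=True)
```

Sorting only the final n-element heap costs almost nothing; the 100 million rows are read exactly once.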
It seems that SQL, and thus the relational database, does not do well in either respect. And these two problems are not even very complicated; in real systems with thousands of lines of SQL, cases that are hard to code and cannot run fast are everywhere.
Why isn’t SQL good enough?
To answer this question, we need to understand what programming really is.
In essence, programming is the process of translating an idea for solving a problem into a precise formal language that a computer can execute. Just as primary-school pupils, after analyzing an application problem and finding a solution, still have to write down the arithmetic expression, computing with programs is the same: we must not only come up with a solution but also translate it into actions the computer can understand and execute.
The core of the formal language used to describe a calculation is its algebraic system. Simply put, an algebraic system is a set of data types together with the operation rules defined on them. For example, the arithmetic we learn in primary school is the system of integers with addition, subtraction, multiplication and division. Given an algebraic system, we can write the operations we want in the symbols the system agrees on, that is, the code, and the computer can then execute it.
If this algebraic system is poorly designed, and the data types and operations it provides are inconvenient, describing an algorithm becomes difficult. Then a strange phenomenon appears: translating the solution into code is far harder than solving the problem itself.
For example, we grew up using Arabic numerals for daily calculation, where addition, subtraction, multiplication and division are all convenient to carry out, and everyone naturally believes numerical operations should be like this. It is not necessarily so! Most people know there is another notation called Roman numerals. I do not know whether the Roman numeral system even has the familiar four operations (it certainly cannot carry them out as easily as Arabic numerals can, and its definitions of the operations were probably different). For years I have wondered how the ancient Romans managed to go shopping.
In this way, we know that the difficulty of programming is largely a problem of algebra.
Let’s take another look at the reasons why codes can’t run fast.
Software cannot change the performance of hardware: the CPU and the hard disk are only as fast as they are. What we can do is design an algorithm of lower complexity, that is, one with less computation, so that the computer performs fewer actions and naturally finishes sooner. But merely thinking of such an algorithm is not enough: we must be able to write it down in a formal language, otherwise the computer cannot execute it; and writing it must be reasonably easy, because if the code is long and troublesome, no one will bother. So for a program, running fast and easy coding are really the same thing, and both come down to the algebra the formal language adopts. If that algebra is poor, high-performance algorithms become difficult or even impossible to express, and there is no way to run fast. As mentioned above, we cannot write the small-memory, single-traversal algorithm in SQL; we can only hope the optimizer makes it run fast.
Let’s make another analogy:
Many friends probably know the story of Gauss computing 1+2+3+…+100. Ordinary people add the numbers one by one, a hundred steps. Gauss was very clever: he noticed that 1+100=101, 2+99=101, …, 50+51=101, so the result is 50 times 101, that is, 5050, and he got it in no time.
After hearing this story, we all feel that Gauss was very smart to think of such a clever method, both simple and fast. That is true, but it is easy to overlook one point: in Gauss's time, the human arithmetic system (also an algebra) already had multiplication! As mentioned earlier, having learned the four basic operations since childhood, we take multiplication for granted; in fact, multiplication was invented after addition. Had there been no multiplication in Gauss's time, even a genius like Gauss would have had no way to solve this problem quickly.
Now we can answer the earlier question: why are relational databases not good enough in the two aspects we care about?
The problem lies in relational algebra. Relational algebra is like an arithmetic system with only addition but not multiplication. It is inevitable that many things cannot be done well.
Moreover, unfortunately, this problem is theoretical: no amount of practical optimization can do more than improve things to a limited extent; it cannot eradicate the problem. Most database developers, however, do not think at this level, or, to preserve compatibility for existing users, do not intend to. As a result, the mainstream database industry keeps circling inside this cage.
So what should we do? In other words, how can we make computation easier to code and faster to run?
Invent new algebra! Algebra with “multiplication”.
Well, here comes the math!
Having covered computation, let's talk about transactions, which is what we commonly call OLTP capability.
Now is the era of cloud computing, and almost all database vendors are busy moving their databases to the cloud. But are current databases really suited to the cloud?
In fact, so-called cloud databases, including those of some global giants, mostly just physically move the on-premises database onto cloud servers. Everything else remains engineering-level remodeling: a trade-off between strong consistency and scalability, with an application development process not much different from that of a traditional database. Of course, we cannot say such applications are not cloud applications, but this mechanism does not reflect the basic characteristics of the cloud.
The basic feature of cloud application is the diversity of data structure.
Cloud applications break the boundaries of the enterprise. Unlike traditional applications, where each deployment serves one user, a cloud service provider runs one system for all its users. Nor is the application built phase by phase and then finished, as before; it stays online forever and can only be hot-upgraded. Different users' data structures are not exactly the same, and even the same user's data structure changes over time, so a large amount of differently structured data accumulates and must be stored and computed together. This problem was never considered in the design of relational databases, because relational algebra has almost no capability designed for data of diverse structures.
If we continue to use relational databases (or their variants) for cloud applications, we face a contradiction between personalization and scale: to serve a massive user base, we must sacrifice personalization (all users face the same data structures); to personalize (each user with its own data structures), we must sacrifice the number of users served (only a relatively small number can be supported). But cloud applications require exactly that personalization and massive scale coexist.
Another essential capability of a transactional database is ensuring consistency (there are many explanations online, so it is not covered here). Relational databases can achieve consistency, but at a heavy cost in resources, which drags down concurrency, and high concurrency is a common feature of cloud applications.
The cost of consistency in relational databases is so high because of their data organization mechanism, which is determined by the data models involved in the operations, and those data models are defined by relational algebra. As with the diversity problem above, to achieve low-cost consistency we must break out of relational algebra and organize and store data in a different way.
Well, it’s still math!
[On the road]
Using so many words to argue that something is wrong with the algebraic system of current databases is one thing; inventing a new algebraic system is quite another. Improving a single operation is not very difficult, but integrating all kinds of data types and operations into one system while ensuring closure and self-consistency is a great challenge.
A word on closure and self-consistency. Closure means that the result of any operation must still belong to one of the defined data types; an algebra that does not satisfy closure cannot chain operations. For example, the integers are closed under addition, subtraction and multiplication, but not under division, so we cannot freely carry out all four operations within the integers: we must either restrict division or extend the integers to the rational numbers. Self-consistency means the operations cannot produce contradictory results. For example, we must stipulate that division by 0 is not allowed; otherwise dividing some number by 0 would produce a logical contradiction, and the system could no longer compute correctly. Closure and self-consistency are necessary conditions for a reasonable algebraic system.
On the application side, the system needs enough descriptive power: common requirements should be easy to assemble from the basic data types and operations, which argues for providing as many of them as possible. Yet to keep overall complexity and the corresponding learning cost down, we should also provide as few as possible. This trade-off takes a great deal of deliberation. Mathematics has the concept of a minimal complete set; roughly speaking, we want to design just enough data types and operations, with neither excess nor deficiency.
As a result, it took us ten years, and four rounds of tearing down and rebuilding, before we finally released the OLAP functionality, while the cloud database for OLTP is still in the experimental stage.
My nickname on the Tsinghua BBS used to be "ten years grinding a sword"; the phrase turned out to be a prophecy.
What is the result of ten years’ grinding?
We ground out a new algebra! Its mathematical name is the discrete dataset model. Based on this algebra, we designed a new formal language named SPL (Structured Process Language).
This article is not the place to explain discrete datasets in detail; I will just rewrite the earlier examples in SPL to give a brief feel for it.
How many days has a stock been rising for the longest time?
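A one-line SPL sketch of this calculation (the exact syntax should be treated as illustrative, reproduced from published SPL write-ups of this example; the field names `date` and `price` are assumptions):

```
stock.sort(date).group@i(price<price[-1]).max(~.len())
```

Sort by date, start a new group (the @i option) whenever the price falls, and take the length of the longest group.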
The calculation idea is the same as in the SQL above, but it is much easier to express thanks to the introduction of order-related computation.
Get top 10 from 100 million pieces of data:
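An SPL sketch (again illustrative; `T` is the table, `x` the field, and the sign convention for descending order is an assumption on my part):

```
T.groups(;top(-10,x))
```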
SPL has richer set data types. This statement expresses an efficient algorithm that implements the aggregation in a single traversal, without sorting all the data.
High performance is the eternal pursuit of big data technology. With the support of new algebra, how much can we improve the performance of database computing?
We have run the TPC-H test. After rewriting the SQL into SPL, we ran it on an ARM chip, and it was several times faster than the same test written in SQL running on an Intel chip. A high-performance algorithm implemented in SPL on a low-performance chip can far surpass a low-performance algorithm written in SQL on a high-performance chip. This is the power of mathematics!
We also dare to challenge traditional databases, including the international giants. Faced with slow code written in SQL, switching to SPL speeds it up by an order of magnitude on average. This is the foundation of mathematics!
[Thanks to Maths]
"Thanks to Math" is an article I wrote 15 years ago in memory of Professor Shiing-shen Chern. It explains what mathematics has given me and why I can do this. Or, to be more blunt: how I dare to do it.
Shaking the relational database, with its deep and huge user base, and pulling users out of habits formed over many years, is extremely difficult. I know countless practitioners have given up on innovation for the sake of compatibility, and I have been well-meaningly advised countless times that this road is too hard.
“If you have math, you have confidence!”
Mathematics gives me strict and abstract thinking. With a little more strictness, you can find more blind spots that others can’t see; With a little more abstractness, you can see deeper and farther than others.
A carriage is always a carriage: however much you optimize it, it is still a carriage pulled by a horse. A newborn car, of course, will have all sorts of unfamiliar controls, plenty of unsatisfying features, and may even fall apart. But it is a car, driven by an engine, and as it keeps improving over time, its huge advantages will surely crush the carriage in every respect.
With the help of computer, human beings have the ability to process large-scale data mechanically for the first time. At the same time, we will inevitably encounter many new problems that we have never faced before. These problems may be related to the basic theory of computer, and cannot be dealt with simply by using engineering methods. We need to develop a new mathematical system and adopt new models to solve them.
However, compared with its great practical success, computer science is weak in theory; we can count only a few fields of its own, such as computability theory and relational algebra. Most of the mathematics used in computing was invented by mathematicians decades or even centuries ago.
When calculus was first invented 300 years ago, conjectures and theorems emerged one after another, and mathematicians could write their names into our textbooks from time to time. Seen from today, those famous achievements do not look very complicated (inventing them at the time was far from easy), and an ordinary science or engineering student can understand them. This is totally different from the mainstream fields of contemporary mathematics, where without 20 years of study you cannot even hold a conversation.
Today's computer mathematics is at a similar stage, the era in which fruit is most easily picked. Perhaps one day discrete dataset theory will be written into university textbooks. As the old saying goes: people should always have some ideals, for what if they come true?