How Slow Is Database I/O?
Most veteran developers have experienced the rather poor I/O performance of databases. But many may not have a clear sense of just how slow database I/O is, and in particular, of the gap between database I/O speed and the speeds of other data read/write methods.

Java is a commonly used language for application development. Here we run a field test to measure the performance of retrieving data from Oracle and MySQL, two typical databases, in Java, and compare it with the performance of retrieving data from a text file.

We use the international TPC-H benchmark to generate the data tables and select the customer table, which has 30,000,000 rows and 8 fields. The original text file is 4.9GB in size. The file data is imported into Oracle and MySQL respectively.

The hardware environment is a single server with 2 CPUs and 16 cores, and both the text file and the databases are stored on an SSD. The whole test runs on the local machine, so there is no substantial network transmission time.
As it is cumbersome to implement the test directly in Java, we write the code in SPL. SPL simply encapsulates Java's data retrieval operations: data is still fetched through the database's JDBC driver, so performance is not affected. Here the complete retrieval operation means fetching data via the SPL cursor and converting all of it into in-memory objects.
|   | A |
|---|---|
| 1 | =now() |
| 2 | =connect("oracle") |
| 3 | =A2.cursor@x("select * from customer") |
| 4 | for A3,10000 |
| 5 | =interval@s(A1,now()) |
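For reference, a rough plain-Java equivalent of the full-retrieval loop above might look like the following sketch. The JDBC URL, credentials, and the Customer record are placeholder assumptions; the code simply fetches every row in batches of 10,000 and materializes each one as an in-memory object, just as the SPL script does.

```java
import java.sql.*;
import java.util.*;

public class FullScanTest {
    // Placeholder record for one customer row; the field list follows the TPC-H customer table
    record Customer(long custKey, String name, String address, long nationKey,
                    String phone, double acctBal, String mktSegment, String comment) {}

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // The JDBC URL and credentials below are placeholders
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/orcl", "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(10000);                         // fetch rows in batches of 10,000
            try (ResultSet rs = stmt.executeQuery("select * from customer")) {
                List<Customer> batch = new ArrayList<>(10000);
                while (rs.next()) {
                    // Convert the current row into an in-memory object
                    batch.add(new Customer(rs.getLong(1), rs.getString(2), rs.getString(3),
                            rs.getLong(4), rs.getString(5), rs.getDouble(6),
                            rs.getString(7), rs.getString(8)));
                    if (batch.size() == 10000) batch.clear(); // discard batches; only retrieval is measured
                }
            }
        }
        System.out.println("elapsed: " + (System.currentTimeMillis() - start) / 1000 + "s");
    }
}
```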
It takes Oracle 280 seconds (an average of about 100,000 rows per second) and MySQL 380 seconds (about 80,000 rows per second) to retrieve all 30,000,000 rows from the table.

The retrieval speed depends on the number of fields and the data types used, as well as on the hardware environment, so this test result is only a reference and may differ considerably in other environments.

But the comparison is reliable within the same environment. We now use SPL to retrieve the TPC-H-generated text file directly:
|   | A |
|---|---|
| 1 | =now() |
| 2 | =file("/home/tpch/customer.tbl") |
| 3 | =A2.cursor(;,"\|") |
| 4 | for A3,10000 |
| 5 | =interval@s(A1,now()) |
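Likewise, a rough Java sketch of the text retrieval, reading each line, splitting it on the separator, and converting the fields into an in-memory object, might look like this (the Customer record and field types are again assumptions):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class TextScanTest {
    // Placeholder record mirroring the TPC-H customer table
    record Customer(long custKey, String name, String address, long nationKey,
                    String phone, double acctBal, String mktSegment, String comment) {}

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        try (BufferedReader in = new BufferedReader(
                new FileReader("/home/tpch/customer.tbl", StandardCharsets.UTF_8), 1 << 20)) {
            List<Customer> batch = new ArrayList<>(10000);
            String line;
            while ((line = in.readLine()) != null) {
                // Text parsing: split the line and convert each field to its target type
                String[] f = line.split("\\|");
                batch.add(new Customer(Long.parseLong(f[0]), f[1], f[2], Long.parseLong(f[3]),
                        f[4], Double.parseDouble(f[5]), f[6], f[7]));
                if (batch.size() == 10000) batch.clear();     // only retrieval speed is measured
            }
        }
        System.out.println("elapsed: " + (System.currentTimeMillis() - start) / 1000 + "s");
    }
}
```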
As in the database test, here we use the SPL cursor to retrieve the data and convert it into in-memory objects. It takes 42 seconds to retrieve all 30,000,000 rows. That is over 6 times faster than Oracle and 9 times faster than MySQL!

We know that text parsing is rather complex and consumes a lot of CPU. Even so, retrieving data from a text file is still far faster than retrieving it from the database.

Now let's test data retrieval from a binary file to find out how much the text parsing drags performance down.

To create an obvious contrast and make the later parallel processing test more convenient, we use the larger orders table, which has 300 million rows and 9 fields. The code for retrieving it from the text file is similar to the previous one, and it takes 483 seconds to execute.

Now we convert the text file into an SPL composite table file and test the retrieval speed:
|   | A |
|---|---|
| 1 | =now() |
| 2 | =file("/home/tpch/orders.ctx").open() |
| 3 | =A2.cursor() |
| 4 | for A3,10000 |
| 5 | =interval@s(A1,now()) |
It takes 164 seconds to retrieve all data from the composite table file, which is 3 times faster than retrieving data from the text file.
This is just what we expect: binary data does not need to be parsed and can be used directly to generate objects, so there is less computation and the speed is higher.
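As a minimal, self-contained illustration of this point (it uses plain Java binary and text files with hypothetical names, not SPL's composite table format), the sketch below writes the same numbers in both forms and times reading them back:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class BinaryVsText {
    static final int N = 10_000_000;   // number of long values to test with

    public static void main(String[] args) throws Exception {
        // Write the same values to a hypothetical binary file and a text file
        try (DataOutputStream bin = new DataOutputStream(new BufferedOutputStream(
                 new FileOutputStream("nums.bin")));
             Writer txt = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream("nums.txt"), StandardCharsets.UTF_8))) {
            for (long i = 0; i < N; i++) { bin.writeLong(i); txt.write(i + "\n"); }
        }

        long t = System.currentTimeMillis();
        long sum = 0;
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(
                new FileInputStream("nums.bin")))) {
            for (int i = 0; i < N; i++) sum += in.readLong();      // fixed-width copy, no parsing
        }
        System.out.println("binary: " + (System.currentTimeMillis() - t) + " ms, sum=" + sum);

        t = System.currentTimeMillis();
        sum = 0;
        try (BufferedReader in = new BufferedReader(
                new FileReader("nums.txt", StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) sum += Long.parseLong(line); // scan chars and convert
        }
        System.out.println("text:   " + (System.currentTimeMillis() - t) + " ms, sum=" + sum);
    }
}
```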
But databases also store data in a binary format and do not carry the burden of text parsing. It is natural that they are not as fast as the more compact SPL composite tables, since databases have to support data writing and cannot use compression techniques. But it is a bit perplexing that they are even slower than text files.

In fact, we find that databases are fast when SQL performs a traversal-based aggregate operation on the above table and returns a very small result set, much faster than when we perform the same operation on the text file. This means that traversing a table inside the database is not slow at all; in other words, the database's storage format performs fairly well.

The obstacle lies in the driver: the JDBC driver is extremely slow. We even suspect that this is an intentional design, one that expects, or even forces, data to be computed inside the database rather than retrieved out of it.

In view of this, we come to a conclusion: when we want high performance for big data computations, ad hoc retrieval of data from the database is infeasible, and it is best to perform the computation inside the database. If data must be retrieved from the database before the computation can be handled (because some computing logic is difficult or even impossible to implement in SQL), it is better not to store the data in the database in the first place. If the data stays in the database, performance will be unsatisfactory in most cases no matter what high-performance algorithms we apply outside the database, unless the amount of data involved is very small.

This is why SPL must implement its own storage formats if it intends to improve on SQL's computing performance; it is impossible to achieve high performance if SPL still uses the database for storage.
Is there any way to increase the computing speed when the database is the only data source?

If the problem lies only in the slow driver, that is, if the slowness is not caused by heavy database load, we can improve the situation with parallel retrieval.

It is somewhat complicated to implement multithreaded retrieval in Java, so we again use SPL to write the parallel retrieval code:
|   | A | B |
|---|---|---|
| 1 | =now() | |
| 2 | =connect("oracle").query@ix("SELECT COUNT(*) FROM CUSTOMER")(1) | |
| 3 | >n=6 | |
| 4 | =n.([int(A2/n)*(~-1),int(A2/n)*~]) | |
| 5 | fork A4 | =connect("oracle") |
| 6 |  | =B5.cursor@x("SELECT * FROM CUSTOMER WHERE C_CUSTKEY>? AND C_CUSTKEY<=?",A5(1),A5(2)) |
| 7 |  | for B6,10000 |
| 8 | =interval@s(A1,now()) | |
Note that the parallel threads should not share the same database connection; instead, each of them should connect to the database separately.
The field test shows that when the number of parallel threads is small (generally fewer than 10), retrieval efficiency increases almost linearly; with n retrieval threads the speed becomes nearly n times faster. In this test, 6-thread parallel processing yields a 5x speedup.

Here we first compute the total number of rows and then write a WHERE condition for each thread so that it retrieves its own part of the data. This places extra computing work on the database, yet it still produces a noticeable increase in retrieval performance, which further confirms that the slowness mainly comes from the driver rather than from in-database retrieval and computation.
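In plain Java, the same idea, giving each thread its own connection and its own key range, might be sketched roughly as follows. The JDBC URL and credentials are placeholders, and like the SPL script it assumes C_CUSTKEY runs contiguously from 1 up to the row count:

```java
import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

public class ParallelScanTest {
    static final String URL = "jdbc:oracle:thin:@//localhost:1521/orcl";   // placeholder
    static final int THREADS = 6;

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();

        // Get the total row count so the key space can be split into THREADS ranges
        long total;
        try (Connection c = DriverManager.getConnection(URL, "user", "password");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM CUSTOMER")) {
            rs.next();
            total = rs.getLong(1);
        }

        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        List<Future<Long>> parts = new ArrayList<>();
        long step = total / THREADS;
        for (int i = 0; i < THREADS; i++) {
            long lo = step * i;
            long hi = (i == THREADS - 1) ? total : step * (i + 1);
            parts.add(pool.submit(() -> scanRange(lo, hi)));   // each task opens its own connection
        }
        long rows = 0;
        for (Future<Long> f : parts) rows += f.get();
        pool.shutdown();

        System.out.println(rows + " rows in " + (System.currentTimeMillis() - start) / 1000 + "s");
    }

    // Each thread retrieves the rows whose C_CUSTKEY falls in (lo, hi]
    static long scanRange(long lo, long hi) throws Exception {
        long count = 0;
        try (Connection c = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement ps = c.prepareStatement(
                 "SELECT * FROM CUSTOMER WHERE C_CUSTKEY > ? AND C_CUSTKEY <= ?")) {
            ps.setFetchSize(10000);
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) count++;   // the real test also reads every field into an object
            }
        }
        return count;
    }
}
```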
Of course, multithreaded processing is even more convenient for increasing speed when the data is stored in files. It is easy to implement the parallel retrieval in SPL:
Parallel retrieval from the text file:
|   | A | B |
|---|---|---|
| 1 | >n=4 | =now() |
| 2 | =file("/home/tpch/orders.tbl") | |
| 3 | fork to(n) | =A2.cursor(;A3:n,"\|") |
| 4 |  | for B3, 10000 |
| 5 | =interval@s(B1,now()) | |
Parallel retrieval from the composite table file:
|   | A | B |
|---|---|---|
| 1 | >n=4 | =now() |
| 2 | =file("/home/tpch/orders.ctx").open() | |
| 3 | fork to(n) | =A2.cursor(;;A3:n) |
| 4 |  | for B3, 10000 |
| 5 | =interval@s(B1,now()) | |
The field test result is similar to the database case: a nearly linear speedup is obtained when the number of parallel threads is small. In this test, 4 threads make text file retrieval 3.6 times faster and composite table retrieval 3.8 times faster.
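For contrast, here is a rough sketch of what the segmented text-file retrieval takes in plain Java. SPL's segment parameter (A3:n) handles the block boundaries internally, whereas a hand-written version has to seek each thread to its byte range and realign to the next line break itself; the single-byte, '\n'-terminated line assumption is a simplification:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.concurrent.*;

public class ParallelTextScan {
    static final String PATH = "/home/tpch/orders.tbl";
    static final int THREADS = 4;

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        long size = new File(PATH).length();
        long step = size / THREADS;

        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < THREADS; i++) {
            long from = step * i;
            long to = (i == THREADS - 1) ? size : step * (i + 1);
            parts.add(pool.submit(() -> scanSegment(from, to)));
        }
        long rows = 0;
        for (Future<Long> f : parts) rows += f.get();
        pool.shutdown();
        System.out.println(rows + " rows in " + (System.currentTimeMillis() - start) / 1000 + "s");
    }

    // A segment handles the lines that start at byte offsets in (from, to]; the line that
    // starts at or straddles `from` is handled by the previous segment.
    static long scanSegment(long from, long to) throws Exception {
        long count = 0;
        try (FileInputStream fis = new FileInputStream(PATH);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(fis, StandardCharsets.US_ASCII), 1 << 20)) {
            long skipped = 0;
            while (skipped < from) skipped += fis.skip(from - skipped);  // jump to the segment start
            long pos = from;
            if (from > 0) {                        // realign to the next full line
                String partial = in.readLine();
                pos += partial == null ? 0 : partial.length() + 1;
            }
            String line;
            while (pos <= to && (line = in.readLine()) != null) {
                pos += line.length() + 1;          // valid for single-byte text with '\n' line endings
                String[] fields = line.split("\\|"); // parse; a full test would also convert field types
                count++;
            }
        }
        return count;
    }
}
```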
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL