Hard Disk Performance Characteristics
Compared with expensive, fast internal memory, hard disks are much cheaper and about one or two orders of magnitude of slower. But the problems of hard disks are more than the slow speed.
Disk drives are bad at handling high-frequency accesses in small amount
There are computing scenarios where the frequency of data access is high and the amount of data retrieved each time is very small. With an internal memory, it takes nearly same time to retrieve a hundred million bytes of data in ten millions accesses, 10 bytes at a time, and to retrieve the same amount of data in 10 accesses with ten millions bytes each time. This is because byte is the smallest unit in the internal memory for data access, and either way the number of bytes retrieved is same.
Hard disk access uses a different mechanism. Hard disks store data by sectors. A sector is the smallest unit for accessing data on the hard disk (the size of one sector is 512 bytes by default). It takes almost same time to read 1 byte and to read 512 bytes (time spent to copy data to the memory can be ignored compared with the time taken to retrieve data from the hard disk). So it could take extremely longer to access the hard disk ten million times with 10 bytes retrieved each time than to access it ten times with ten million bytes retrieved at a time, though the total amount of data retrieved is the same. A lot of useless data could be retrieved in the first way.
The real-life situations are even worse. Usually data is stored directly in files under the control of the operating system. Most of the file systems use cluster as the data access unit. A cluster generally holds a default of 4K data, which is 8 times bigger than the 512 bytes. Besides, hard disk access is more complex than the internal memory access for which one CPU instruction is enough. Access to hard disks requires starting the disk controller with a series of actions and the location of to-be-retrieved data. Ten million hard disk accesses need the same number of start and locate actions. Though the retrieval of ten million bytes at one time involves not a few disk actions (because ten million bytes cover many sectors), the total number of access actions is much smaller. The performance of low-frequency retrieval in batch is far better than that of high-frequency retrieval in small amount.
Random access to discontinuous data is a major factor
The use of “could” in assessing the high-frequency data access in small amount is because things may be different when data is stored continuously. If the to-be-read-data is continuously stored, it won’t take very long to retrieve data even in ten million accesses with ten bytes each time. Because for every access action each 10 bytes are already within the retrieved unit sector thanks to the buffer functionality both the disk controller and the operating system have, the actual number of hard disk accesses won’t be so huge and the performance difference between the two retrieval ways isn’t so big. The problem is the random high-frequency data access in small amount wherein the to-be-read data isn’t continuously stored and a retrieved unit sector doesn’t contain the next to-be-retrieved piece of data, causing a waste of time. The same unit sector will have to be re-read later if needed. The waste of data retrieval effort and the possible re-reads are certain to cause an extremely poor performance.
For an HDD, random accesses involve head jumps, which correspond to the average seek time stated in a hard disk’s product standard. Locating the disk space where the to-be-read data is stored is an extremely slow mechanical movement, much slower than reading a sector. Even if each retrieved sector (or cluster) can be fully used, the seek time could be longer than the retrieval time when random hard disk accesses are performed. It’s wise to avoid the random high-frequency data access in small amount with an HDD.
Because an SSD doesn’t store data based on sectors, it doesn’t need the mechanical locating actions. This means there isn’t the seek time issue for random accesses. Yet the operating system simulates the HDD’s way for an SSD to read and write data according to the unit of cluster (or sector), which is too big to perform random high-frequency data accesses in small amount (less than a cluster) efficiently.
Parallel and concurrent processing is the root of the problem
For computations that need continuous batch access only, like traversal-based aggregation or query, is the hard disk performance determined only by its access speed?
Generally the answer is yes, if it is a single-thread computation or a single-task one. But today the parallel processing has become an indispensable to the high-performance computation, and a lot of computing services need to support concurrency. Both types of computation will to some degree transform the continuous hard disk accesses to random accesses. The reason is simple. Multiple threads share the same hard disk, but their requests of access are not continuous. Head jumps happen when the hard disk is trying to respond to these requests, leading to random accesses.
Usually this is disastrous to an HDD. Frequent switches between threads produce frequent seek actions, making the multithreaded processing slower than a single-threaded processing. Sometimes the performance of single-task processing falls dramatically from acceptable to poor after concurrency is used. A countermeasure is increasing the size of the buffer for storing the retrieved data. If each thread retrieves an enough large amount of continuous data, the proportion of the seek time decreases and becomes not that noticeable. But this also increases the memory consumption. The more the number of threads there are, the more the need of the memory space. An SSD needs smaller memory space, which is the size of a cluster (or a sector).
An example is the columnar storage, which stores data continuously by columns. For a multi-column computation, even a single thread will lead to random hard disk accesses. If there are a lot of columns involved, the columnar storage won’t necessarily be able to increase performance. Multithreaded processing will make the situation even worse. So, be careful when trying to use columnar storage with an HDD, the technique isn’t always a guarantee of a better performance.
Two algorithms for external memory computations
Because of those hard disk headaches, a different algorithm will be adopted and a steep decrease in performance happens when data to be processed is transferred from internal memory to an external memory device (the decrease rate isn’t a simple multiplication according to the proportion of the amount of the data transferred).
A typical example is the JOIN operations. For a foreign key-based JOIN with a 1:N relationship, if all data is stored in the internal memory, we can create a hash index over the primary key of the dimension table pointed by the foreign key. Then according to the index, we can quickly locate desired records during the traversal of the fact table. Even if there are multiple foreign keys pointing to different dimension tables respectively, one traversal is enough to achieve all JOINs. If the dimension table is too big to be accommodated by the internal memory, a different algorithm is needed. The hard disk can’t handle the random high-frequency accesses in small amount, which is the type of operation in accessing a dimension table. The use of the above algorithm will result in a terrible performance, worse than the performance of a sorting and merge algorithm. An alternative is dividing a large table stored in an external device into multiple small enough segments that can be loaded into the memory for a JOIN according to the key values (their hash values, actually). But only one JOIN can be handled each time. Multiple same operations will thus be needed for a computation where the fact table corresponds to multiple big dimension tables via multiple foreign keys, causing a much bigger workload than the internal memory JOINs.
We can use a cluster to get away from the complicated external memory computations if the one computer’s memory space isn’t big enough. Though the performance of accessing to a unit amount of data across a cluster via the network when getting data stored in the internal memory of a node is little better than that with a hard disk due to the network latency, which is almost as slow as the hard disk, the internal memory’s capability in handling random access and concurrency becomes available. Similar to the hard disk accesses, a connection to the network is resource-consuming, and thus inefficient in dealing with high-frequency data accesses in small amount. To counter this issue, the results of multiple data access requests are collected and returned in batch. To solve JOINs involving big dimension tables, the cluster strategy is more complicated than the internal memory strategy, but still much simpler than an external memory algorithm by achieving multiple JOINs at a time.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL