"[url]Preface The essence of big data from a technical point of view is high performance. Only wi .."

mars RaqForum 25 No.
1 Reply • 1,469 View • 2 Years ago

Performance Optimization - Preface

Preface

The essence of big data from a technical point of view is high performance. Only with sufficient performance can the big data analysis be truly and effectively implemented.

Performance optimization should be implemented under limited hardware conditions. Since software cannot change the speed of hardware, what we can do is to design algorithms with lower complexity to reduce the actual amount of computation. And naturally, we can obtain higher computing performance.

Some big data algorithms have good adaptability and can work in all cases, but it is difficult to obtain high performance due to their conservative nature. To reduce the amount of calculation, we should carefully study and make use of the characteristics of data and tasks, and design appropriate storage schemes and algorithms according to actual conditions.

This book will present storage schemes and optimization algorithms applicable to different scenarios and objectives. Having become familiar with the principles and application prerequisites of these basic algorithms, and obtained the ability to flexibly use them in a combined way, programmers will be able to solve actual high-performance problems. Once programmers understand these algorithms and their features, they will also make significant progress in the technology selection and understanding of big data products.

The algorithms presented in this book are mainly oriented to the computation of structured data, involving operations such as search, filtering, grouping, sorting and join. These operations are the basic content of big data calculation and the most common tasks in data analysis and calculation.

This book is not just a simply list and summary of historical algorithms, many algorithms and optimization technologies in it appear for the first time in the industry. Moreover, this book not only discusses high-performance algorithms in theory, but also covers technical methods that may not have a particular advantage in complexity but can improve performance in practice.

This book is not intended for beginners and requires a certain level of expertise. Readers should:

1) master various operations of relational database and SQL, which will not be explained in this book.
2) understand the knowledge equivalent to the data structure course of university computer major; relevant concepts will be directly cited.
3) understand the basic knowledge of algorithm complexity analysis.
4) be familiar with programming languages such as C/C++ or Java, memory management mechanism of operating system and basic LAN.

Since the principles and processes of some algorithms are relatively cumbersome and difficult, it is not a must for application programmers to master. However, programmers can still use these algorithms as long as they understand the conditions that such algorithms apply and then become familiar with the application code examples.

This book will use SPL to write application code examples, and directly use SPL’s data types and syntax to describe calculation objectives, which requires readers to understand in advance. Readers with SPL knowledge can easily convert these terms into the corresponding vocabulary of other programming languages.

SQL is currently the most commonly used structured data operation language, but it is too rough to apply most of the optimization algorithms in this book. Other programming languages like Java, C/C++ lack the necessary concepts for structured data operation, and defining these concepts would take up too much bulk of the book and, although they can implement and apply these algorithms, the code will be quite long and too much energy would be spent on details.

SPL may be currently the only programming language in the industry that can apply these algorithms without being too cumbersome. After understanding the mechanism of these algorithms, programmers can also implement them themselves in Java, C/C++ etc. to get better performance.

This book will introduce dozens of basic high-performance algorithms or storage schemes for structured big data. The flexible use of such methods has been made in many actual scenarios. Compared with SQL on conventional relational database, the algorithms implemented in SPL can often improve the performance several times or even a hundred times.

Theoretically, SPL, as a programming language, would not be faster than SQL. The reason why SPL actually runs faster is that it can implement the high-performance algorithms that SQL cannot implement. If a computing task has already adopted the best algorithm when it is implemented in SQL, rewriting it in SPL will not run faster. However, there are too many high-performance algorithms that are hard to be implemented in SQL, as long as the code is somewhat complex, it is almost certain that we will be able to find the points that can change the storage scheme and algorithm so as to optimize the performance. For more than ten scenarios we have performed, we could always find certain points to improve every time, in this case, rewriting in SPL can effectively improve the overall performance. Of course, coding in C/C++ or Java may also get a higher performance, but the development efficiency is too low.

It needs to be noted once again that each algorithm in this book has its own scenario to which it adapts; and even some algorithms cannot be used at the same time due to certain contradictions between them, therefore, the algorithm needs to be selected according to practical situation. We have been emphasizing that we should first fully understand task’s objectives and data characteristics, and then design an optimization scheme according to actual conditions. Moreover, the big data computing task in reality does not simply correspond to the algorithm presented in this book one by one but a comprehensive task, and hence we should use these basic algorithms in an appropriately combined manner to cope with real task. Consequently, it is more important to learn the analysis methods.

Because this book focuses on the algorithms and working principles of optimization schemes, complete examples and tests are not given. If you are interested in learning more test examples and codes, go to SPL forum to find them.

Performance Optimization - 1.1 [In-memory search] Binary search
Performance Optimization - 1.2 [In-memory search] Sequence number positioning
Performance Optimization - 1.3 [In-memory search] Position index
Performance Optimization - 1.4 [In-memory search] Hash index
Performance Optimization - 1.5 [In-memory search] Multi-layer sequence number positioning
Performance Optimization - 1.6 [In-memory search] Index on in-memory table

Performance Optimization - 2.1 [Dataset in external storage] Text file segmentation
Performance Optimization - 2.2 [Dataset in external storage] Bin file and double increment segmentation
Performance Optimization - 2.3 [Dataset in external storage] Data types
Performance Optimization - 2.4 [Dataset in external storage] Composite table and columnar storage
Performance Optimization - 2.5 [Dataset in external storage] Order and data appending
Performance Optimization - 2.6 [Dataset in external storage] Multi-zone composite table
Performance Optimization - 2.7 [Dataset in external storage] Data update