Performance Optimization - Preface

 

Preface

The essence of big data from a technical point of view is high performance. Only with sufficient performance can the big data analysis be truly and effectively implemented.

Performance optimization should be implemented under limited hardware conditions. Since software cannot change the speed of hardware, what we can do is to design algorithms with lower complexity to reduce the actual amount of computation. And naturally, we can obtain higher computing performance.

Some big data algorithms have good adaptability and can work in all cases, but it is difficult to obtain high performance due to their conservative nature. To reduce the amount of calculation, we should carefully study and make use of the characteristics of data and tasks, and design appropriate storage schemes and algorithms according to actual conditions.

This book will present storage schemes and optimization algorithms applicable to different scenarios and objectives. Having become familiar with the principles and application prerequisites of these basic algorithms, and obtained the ability to flexibly use them in a combined way, programmers will be able to solve actual high-performance problems. Once programmers understand these algorithms and their features, they will also make significant progress in the technology selection and understanding of big data products.


The algorithms presented in this book are mainly oriented to the computation of structured data, involving operations such as search, filtering, grouping, sorting and join. These operations are the basic content of big data calculation and the most common tasks in data analysis and calculation.

This book is not just a simply list and summary of historical algorithms, many algorithms and optimization technologies in it appear for the first time in the industry. Moreover, this book not only discusses high-performance algorithms in theory, but also covers technical methods that may not have a particular advantage in complexity but can improve performance in practice.


This book is not intended for beginners and requires a certain level of expertise. Readers should:

1) master various operations of relational database and SQL, which will not be explained in this book.
2) understand the knowledge equivalent to the data structure course of university computer major; relevant concepts will be directly cited.
3) understand the basic knowledge of algorithm complexity analysis.
4) be familiar with programming languages such as C/C++ or Java, memory management mechanism of operating system and basic LAN.

Since the principles and processes of some algorithms are relatively cumbersome and difficult, it is not a must for application programmers to master. However, programmers can still use these algorithms as long as they understand the conditions that such algorithms apply and then become familiar with the application code examples.


This book will use SPL to write application code examples, and directly use SPL’s data types and syntax to describe calculation objectives, which requires readers to understand in advance. Readers with SPL knowledge can easily convert these terms into the corresponding vocabulary of other programming languages.

SQL is currently the most commonly used structured data operation language, but it is too rough to apply most of the optimization algorithms in this book. Other programming languages like Java, C/C++ lack the necessary concepts for structured data operation, and defining these concepts would take up too much bulk of the book and, although they can implement and apply these algorithms, the code will be quite long and too much energy would be spent on details.

SPL may be currently the only programming language in the industry that can apply these algorithms without being too cumbersome. After understanding the mechanism of these algorithms, programmers can also implement them themselves in Java, C/C++ etc. to get better performance.


This book will introduce dozens of basic high-performance algorithms or storage schemes for structured big data. The flexible use of such methods has been made in many actual scenarios. Compared with SQL on conventional relational database, the algorithms implemented in SPL can often improve the performance several times or even a hundred times.

Theoretically, SPL, as a programming language, would not be faster than SQL. The reason why SPL actually runs faster is that it can implement the high-performance algorithms that SQL cannot implement. If a computing task has already adopted the best algorithm when it is implemented in SQL, rewriting it in SPL will not run faster. However, there are too many high-performance algorithms that are hard to be implemented in SQL, as long as the code is somewhat complex, it is almost certain that we will be able to find the points that can change the storage scheme and algorithm so as to optimize the performance. For more than ten scenarios we have performed, we could always find certain points to improve every time, in this case, rewriting in SPL can effectively improve the overall performance. Of course, coding in C/C++ or Java may also get a higher performance, but the development efficiency is too low.


It needs to be noted once again that each algorithm in this book has its own scenario to which it adapts; and even some algorithms cannot be used at the same time due to certain contradictions between them, therefore, the algorithm needs to be selected according to practical situation. We have been emphasizing that we should first fully understand task’s objectives and data characteristics, and then design an optimization scheme according to actual conditions. Moreover, the big data computing task in reality does not simply correspond to the algorithm presented in this book one by one but a comprehensive task, and hence we should use these basic algorithms in an appropriately combined manner to cope with real task. Consequently, it is more important to learn the analysis methods.

Because this book focuses on the algorithms and working principles of optimization schemes, complete examples and tests are not given. If you are interested in learning more test examples and codes, go to SPL forum to find them.

Table of contents

Performance Optimization - 1.1 [In-memory search] Binary search
Performance Optimization - 1.2 [In-memory search] Sequence number positioning
Performance Optimization - 1.3 [In-memory search] Position index
Performance Optimization - 1.4 [In-memory search] Hash index
Performance Optimization - 1.5 [In-memory search] Multi-layer sequence number positioning
Performance Optimization - 1.6 [In-memory search] Index on in-memory table

Performance Optimization - 2.1 [Dataset in external storage] Text file segmentation
Performance Optimization - 2.2 [Dataset in external storage] Bin file and double increment segmentation
Performance Optimization - 2.3 [Dataset in external storage] Data types
Performance Optimization - 2.4 [Dataset in external storage] Composite table and columnar storage
Performance Optimization - 2.5 [Dataset in external storage] Order and data appending
Performance Optimization - 2.6 [Dataset in external storage] Multi-zone composite table
Performance Optimization - 2.7 [Dataset in external storage] Data update

Performance Optimization - 3.1 [Search in external storage] Binary search
Performance Optimization - 3.2 [Search in external storage] Hash index
Performance Optimization - 3.3 [Search in external storage] Sorting index
Performance Optimization - 3.4 [Search in external storage] Row-based storage and index with values
Performance Optimization - 3.5 [Search in external storage] Index preloading
Performance Optimization - 3.6 [Search in external storage] Batch search
Performance Optimization - 3.7 [Search in external storage] Search that returns a set
Performance Optimization - 3.8 [Search in external storage] Merging multi indexes
Performance Optimization - 3.9 [Search in external storage] Full-text searching

Performance Optimization - 4.1 [Traversal technology] Cursor filtering
Performance Optimization - 4.2 [Traversal technology] Multipurpose traversal
Performance Optimization - 4.3 [Traversal technology] Parallel traversal
Performance Optimization - 4.4 [Traversal technology] Load from database in parallel
Performance Optimization - 4.5 [Traversal technology] Multi-cursor
Performance Optimization - 4.6 [Traversal technology] Grouping and aggregating
Performance Optimization - 4.7 [Traversal technology] Understandings about aggregation
Performance Optimization - 4.8 [Traversal technology] Redundant grouping key
Performance Optimization - 4.9 [Traversal technology] Column-wise computing

Performance Optimization - 5.1 [Ordered traversal] Ordered grouping and aggregating
Performance Optimization - 5.2 [Ordered traversal] DISTINCT and COUNT(DISTINCT)
Performance Optimization - 5.3 [Ordered traversal] Ordered grouped subsets
Performance Optimization - 5.4 [Ordered traversal] Program cursor
Performance Optimization - 5.5 [Ordered traversal] First-half ordered grouping
Performance Optimization - 5.6 [Ordered traversal] Second-half ordered grouping
Performance Optimization - 5.7 [Ordered traversal] Serial number grouping and controllable segmenting
Performance Optimization - 5.8 [Ordered traversal] Index sorting

Performance Optimization - 6.1 [Foreign key association] Foreign key addressization
Performance Optimization - 6.2 [Foreign key association] Instant addressization
Performance Optimization - 6.3 [Foreign key association] Foreign key sequence-numberization
Performance Optimization - 6.4 [Foreign key association] Time key
Performance Optimization - 6.5 [Foreign key association] Inner join syntax
Performance Optimization - 6.6 [Foreign key association] Index reuse
Performance Optimization - 6.7 [Foreign key association] Aligned sequence
Performance Optimization - 6.8 [Foreign key association] Big dimension table search
Performance Optimization - 6.9 [Foreign key association] One side partitioning

Performance Optimization - 7.1 [Merge and join] Ordered merge
Performance Optimization - 7.2 [Merge and join] Merge in segments
Performance Optimization - 7.3 [Merge and join] Association location
Performance Optimization - 7.4 [Merge and join] Attached table

Performance Optimization - 8.1 [Multi-dimensional analysis] Partial pre-aggregation
Performance Optimization - 8.2 [Multi-dimensional analysis] Time period pre-aggregation
Performance Optimization - 8.3 [Multi-dimensional analysis] Redundant sorting
Performance Optimization - 8.4 [Multi-dimensional analysis] Boolean dimension sequence
Performance Optimization - 8.5 [Multi-dimensional analysis] Flag bit dimension
Performance Optimization - 8.6 [Multi-dimensional analysis] In-memory flag change

Performance Optimization - 9.1 [Cluster] Computation and data distribution
Performance Optimization - 9.2 [Cluster] Multi-zone composite table of cluster
Performance Optimization - 9.3 [Cluster] Duplicate dimension table
Performance Optimization - 9.4 [Cluster] Segmented dimension table
Performance Optimization - 9.5 [Cluster] Redundancy-pattern fault tolerance
Performance Optimization - 9.6 [Cluster] Spare-wheel-pattern fault tolerance
Performance Optimization - 9.7 [Cluster] Multi-job load balancing