Performance Optimization - Preface

 

Preface

The technical essence of big data is high performance. Only with sufficient performance can big data analysis really be put into practice.

Performance optimization has to work within the limits of the available hardware. Software cannot make hardware run faster. What we can do is design algorithms of lower complexity that reduce the actual amount of computation, which naturally yields higher computing performance.

Some big data algorithms are widely applicable and work in all cases, but they tend to be conservative and rarely achieve high performance. To reduce the amount of computation, we should carefully study and exploit the characteristics of the data and the task, and design storage schemes and computation methods that suit the actual conditions.
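As a small illustration of this point, the sketch below contrasts a general method that works on any data (a linear scan) with one that exploits a known characteristic of the data (the keys are already sorted, so binary search applies). It is written in Python purely for illustration, not in the SPL used in the rest of the book, and the function names and sample data are hypothetical.

```python
def linear_search(keys, target):
    """General method: works whether or not keys are ordered.
    Returns (index, number of comparisons); index is -1 if not found."""
    comparisons = 0
    for i, k in enumerate(keys):
        comparisons += 1
        if k == target:
            return i, comparisons
    return -1, comparisons


def binary_search(sorted_keys, target):
    """Specialized method: only correct when sorted_keys is in ascending order.
    Returns (index, number of comparisons); index is -1 if not found."""
    comparisons = 0
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_keys[mid] < target:
            lo = mid + 1   # target can only be to the right of mid
        else:
            hi = mid       # target is at mid or to its left
    if lo < len(sorted_keys) and sorted_keys[lo] == target:
        return lo, comparisons
    return -1, comparisons


# Assumed sample data: 1,000,000 even keys, already in ascending order.
keys = list(range(0, 2_000_000, 2))
target = 1_999_998  # the last key, the worst case for the linear scan

print(linear_search(keys, target))
print(binary_search(keys, target))
```

Both functions find the same position, but the linear scan compares against every key, while the binary search needs at most 20 halving steps for one million keys. The price is a prerequisite: the data must be stored in order.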

This book describes storage schemes and optimization algorithms suited to different scenarios and objectives. Once programmers are familiar with the principles and prerequisites of these basic algorithms, they can combine them flexibly to solve performance problems in business applications. Understanding these algorithms and their characteristics also helps greatly when evaluating and selecting big data products.

The algorithms in this book mainly target structured data computation, covering operations such as search, filtering, grouping, sorting and join. These are the basics of big data computation and the most common tasks in data analysis.

This book does not simply list and summarize existing algorithms; many of the algorithms and optimization techniques appear here in writing for the first time in the industry. It discusses not only algorithms that are high-performance in theory, but also engineering techniques that offer no particular advantage in complexity yet improve performance in practice.

This book is not for beginners. Readers are expected to meet the following requirements:

1) Master the operations of relational databases and SQL. The meaning of these operations is not explained in this book.

2) Have knowledge equivalent to a university computer science course on data structures. The relevant concepts are used without introduction.

3) Understand the basics of algorithm complexity analysis.

4) Preferably be familiar with C/C++ or Java, the memory management mechanisms of operating systems, and the basics of LANs.

The principles and procedures of some algorithms are intricate and difficult. Application programmers do not have to master them: it is enough to understand the conditions under which each algorithm applies and to be familiar with the application code examples.

This book uses SPL for its application code examples and uses SPL data types and syntax directly to describe computation objectives, so readers need to know SPL in advance. Readers with SPL knowledge can easily map these terms to the corresponding vocabulary of other programming languages.

SQL is the most commonly used language for structured data operations, but it is too coarse-grained to express most of the optimization algorithms in this book. Java, C/C++ and other programming languages lack the necessary concepts for structured data operations, and defining them from scratch would take up too much of the book. Moreover, although these languages can implement and apply the algorithms, the code would be quite long and too much effort would go into details.

SPL may be the only programming language in the industry that can apply these algorithms without becoming too cumbersome. Once you understand how these algorithms work, you can also implement them yourself in Java, C/C++ or other programming languages and obtain better performance.

Table of contents

Performance Optimization - Preface
Performance Optimization - 1.1 [In-memory search] Binary search
Performance Optimization - 1.2 [In-memory search] Sequence number positioning
Performance Optimization - 1.3 [In-memory search] Position index
Performance Optimization - 1.4 [In-memory search] Hash index
Performance Optimization - 1.5 [In-memory search] Multi-layer sequence number positioning
Performance Optimization - 2.1 [Dataset in external storage] Text file segmentation
Performance Optimization - 2.2 [Dataset in external storage] Bin file and double increment segmentation
Performance Optimization - 2.3 [Dataset in external storage] Data types
Performance Optimization - 2.4 [Dataset in external storage] Composite table and columnar storage
Performance Optimization - 2.5 [Dataset in external storage] Order and data appending
Performance Optimization - 2.6 [Dataset in external storage] Data update and multi-zone composite table
Performance Optimization - 3.1 [Search in external storage] Binary search
Performance Optimization - 3.2 [Search in external storage] Hash index
Performance Optimization - 3.3 [Search in external storage] Sorting index
Performance Optimization - 3.4 [Search in external storage] Row-based storage and index with values
Performance Optimization - 3.5 [Search in external storage] Index preloading
Performance Optimization - 3.6 [Search in external storage] Batch search
Performance Optimization - 3.7 [Search in external storage] Search that returns a set
Performance Optimization - 3.8 [Search in external storage] Merging multi indexes
Performance Optimization - 3.9 [Search in external storage] Full-text searching
Performance Optimization - 4.1 [Traversal technology] Cursor filtering
Performance Optimization - 4.2 [Traversal technology] Multipurpose traversal
Performance Optimization - 4.3 [Traversal technology] Parallel traversal
Performance Optimization - 4.4 [Traversal technology] Load from database in parallel
Performance Optimization - 4.5 [Traversal technology] Multi-cursor
Performance Optimization - 4.6 [Traversal technology] Grouping and aggregating
Performance Optimization - 4.7 [Traversal technology] Understandings about aggregation
Performance Optimization - 4.8 [Traversal technology] Redundant grouping key
Performance Optimization - 5.1 [Traversal technology] Ordered grouping and aggregating
Performance Optimization - 5.2 [Traversal technology] Ordered grouped subsets
Performance Optimization - 5.3 [Traversal technology] Program cursor
Performance Optimization - 5.4 [Traversal technology] First-half ordered grouping
Performance Optimization - 5.5 [Traversal technology] Second-half ordered grouping
Performance Optimization - 5.6 [Traversal technology] Serial number grouping and controllable segmenting
Performance Optimization - 5.7 [Traversal technology] Index sorting
Performance Optimization - 6.1 [Foreign key association] Foreign key addressization
Performance Optimization - 6.2 [Foreign key association] Instant addressization
Performance Optimization - 6.3 [Foreign key association] Foreign key sequence-numberization
Performance Optimization - 6.4 [Foreign key association] Inner join syntax
Performance Optimization - 6.5 [Foreign key association] Index reuse
Performance Optimization - 6.6 [Foreign key association] Aligned sequence
Performance Optimization - 6.7 [Foreign key association] Big dimension table search
Performance Optimization - 6.8 [Foreign key association] One side partitioning
Performance Optimization - 7.1 [Merge and join] Ordered merge
Performance Optimization - 7.2 [Merge and join] Merge in segments
Performance Optimization - 7.3 [Merge and join] Association location
Performance Optimization - 7.4 [Merge and join] Attached table
Performance Optimization - 8.1 [Multi-dimensional analysis] Partial pre-aggregation
Performance Optimization - 8.2 [Multi-dimensional analysis] Time period pre-aggregation
Performance Optimization - 8.3 [Multi-dimensional analysis] Redundant sorting
Performance Optimization - 8.4 [Multi-dimensional analysis] Dimension of boolean sequence
Performance Optimization - 8.5 [Multi-dimensional analysis] Flag bit dimension
Performance Optimization - 8.6 [Multi-dimensional analysis] In-memory flag change
Performance Optimization - 9.1 [Cluster] Computation and data distribution
Performance Optimization - 9.2 [Cluster] Multi-zone composite table of cluster
Performance Optimization - 9.3 [Cluster] Duplicate dimension table
Performance Optimization - 9.4 [Cluster] Segmented dimension table
Performance Optimization - 9.5 [Cluster] Redundancy-pattern fault tolerance
Performance Optimization - 9.6 [Cluster] Spare-wheel-pattern fault tolerance
Performance Optimization - 9.7 [Cluster] Multi-job load balancing
Performance Optimization - Postscript
