"Background Relational database is the most common data storage scheme, and hence SQL naturally b .."

Jerry RaqForum 14 No.
200 View • 10 Months ago

Two major shortcomings of Python in enterprise applications

Background

Relational databases are the most common data storage scheme, so SQL is naturally the first choice for data processing. However, as enterprise applications grow more complex, data processing implemented in SQL runs into serious problems at the framework level. Specifically, complex SQL (stored procedures) is difficult to migrate; executing all data processing and computation inside the database imposes a heavy burden on it and makes it the bottleneck of the whole application; and sharing one database among multiple applications easily leads to strong coupling between them. Therefore, more and more modern applications resort to other technologies to process data.

Among these technologies, Python is a good choice because: i) it offers powerful libraries and supports various data sources; ii) its syntax is flexible and can express complex calculation processes; iii) its scripts can be stored independently or together with front-end applications, and are easy to maintain; iv) it has complete IDE support, which makes debugging convenient. As a result, more and more applications use Python to process data.

However, two major shortcomings exist when Python is used for enterprise applications.

Inefficient big data operation

Python relies heavily on Pandas when processing structured data. Conventional in-memory computations such as sorting, grouping, aggregation and joins are easy to develop in Python and perform well, because Pandas offers the basic library functions. But Pandas provides little support for big data that cannot be loaded into memory, and big data operations are a common task in enterprise applications.

When the data cannot be fully loaded into memory, it is difficult to process with Pandas. In this case, we have to read the data in segments and write the calculation logic ourselves. Simple counting or aggregation is not hard to handle this way, but for operations like grouping and joins, hand-coding becomes very cumbersome.
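As a rough illustration of the "read in segments" approach for a simple aggregation, here is a minimal sketch. It assumes Pandas is installed and uses a small in-memory CSV (the hypothetical SAMPLE_CSV) in place of a real oversized file; the column names and the chunk size are made up for the example.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a file too large to load at once.
SAMPLE_CSV = """state,amount
CA,10.0
NY,5.5
CA,2.5
TX,7.0
NY,4.5
"""


def chunked_total(source, column, chunksize=2):
    """Sum one column while holding only `chunksize` rows in memory."""
    total = 0.0
    rows = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += chunk[column].sum()
        rows += len(chunk)
    return total, rows


total, rows = chunked_total(io.StringIO(SAMPLE_CSV), "amount")
print(total, rows)
```

A plain sum or count only needs a running total across chunks, which is why this case stays simple; grouping and joins need state that spans chunks, and that is where the hand-coding effort grows.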

Let's take big data grouping and aggregation as an example. Databases usually use a hash grouping algorithm, which performs well. Python, however, does not provide such an algorithm for data that exceeds memory, and hand-coding one is very troublesome. Since this task is beyond the capabilities of most programmers, they usually settle for second best and choose an easy-to-implement algorithm: first sort the file by the grouping column in external storage, then traverse the ordered file, putting a record into the same group as the previous one if the grouping column values match, and starting a new group if they differ. This algorithm is relatively simple but very slow.
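The traversal half of that sort-then-scan algorithm can be sketched with the standard library alone. This is only an illustration under simplifying assumptions: the rows are a small hypothetical list, and the built-in sorted() stands in for the external sort that a real oversized file would need.

```python
from operator import itemgetter


def group_sum_sorted(records):
    """Aggregate (key, amount) pairs that are already ordered by key.

    Only the current group is held in memory, so the input can be a
    stream read from a sorted file.
    """
    current_key = None
    current_sum = 0.0
    started = False
    for key, amount in records:
        if started and key != current_key:
            # Key changed: the previous group is complete.
            yield current_key, current_sum
            current_sum = 0.0
        current_key = key
        current_sum += amount
        started = True
    if started:
        # Emit the last group once the input is exhausted.
        yield current_key, current_sum


# Hypothetical sample rows; sorted() stands in for the external sort step.
rows = [("NY", 5.5), ("CA", 10.0), ("TX", 7.0), ("CA", 2.5), ("NY", 4.5)]
ordered = sorted(rows, key=itemgetter(0))
result = list(group_sum_sorted(ordered))
print(result)
```

The scan itself is cheap; the cost of this approach lies in the sort, which for data larger than memory means writing and merging sorted runs on disk before the traversal can start.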

Even if a whiz manages to implement the hash algorithm, it is still hard to achieve the performance that big data operations require in Python. The reason is that multi-threaded execution in Python is not truly parallel: for CPU-bound code the threads effectively run serially, sometimes even slower than a single thread, so it is difficult to exploit modern multi-core CPUs.

To be specific, the CPython interpreter (the mainstream Python interpreter) has a Global Interpreter Lock (GIL). A thread must acquire this lock before executing Python code, which means that even if multiple CPU cores are available in the same time period, only one thread can execute Python code at a time and the threads can only take turns. Moreover, since multi-threaded execution involves overhead such as context switching and lock handling, performance often decreases rather than improves.
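The effect can be observed with a small experiment. The sketch below runs the same CPU-bound function serially and then on a thread pool and prints both timings; the workload size N and the task count are arbitrary assumptions, and the actual numbers depend on the machine and the interpreter build.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def busy_sum(n):
    """Pure-Python CPU-bound loop; it needs the GIL while it runs."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(fn):
    """Call fn() and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    value = fn()
    return value, time.perf_counter() - start


N = 200_000  # hypothetical workload size
TASKS = 4    # hypothetical number of tasks / threads


def run_serial():
    return [busy_sum(N) for _ in range(TASKS)]


def run_threads():
    with ThreadPoolExecutor(max_workers=TASKS) as pool:
        return list(pool.map(busy_sum, [N] * TASKS))


serial, serial_secs = timed(run_serial)
threaded, thread_secs = timed(run_threads)

print(f"serial:   {serial_secs:.3f}s")
print(f"threaded: {thread_secs:.3f}s")
```

On a standard CPython build with the GIL, the threaded run is typically no faster than the serial one for this kind of workload, even though both produce the same results.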

Because Python cannot use a simple multi-threaded parallel mechanism within one process, many programmers have to adopt the more complicated multi-process approach. Processes are more expensive to create and more complex to manage than threads, and the resulting parallelism is not comparable to that of true multi-threading. In addition, inter-process communication is also complex; programmers sometimes have to give up direct communication and pass aggregation results through the file system instead, which leads to a significant performance decrease.

The performance of big data processing depends largely on data IO. If the IO cost is too high, fast computation does not help. Efficient IO often relies on a specially optimized storage scheme. Unfortunately, Python does not offer a widely used efficient storage scheme, and data is usually stored in text files or databases, both of which have poor IO performance. If the data source itself is a text file or a database, we can do nothing but bear the low IO speed. However, many complex operations (such as big data sorting) need to store intermediate results during computation; in theory their read and write performance is under our control, yet since Python lacks an efficient storage scheme, we have to use inefficient text files or databases, resulting in low overall performance. In addition, some operations need a large amount of historical data; if all of it is read from text files or a database, the IO time often far exceeds the computing time.

Confusing versions

Python's chaotic versions are a headache for many developers, especially in enterprise applications. Python was originally designed as a personal programming language that focused on convenience for individual use and gave little consideration to the requirements of collaborative work in enterprise applications. Every developer has a preferred version, which causes serious problems for enterprise applications. For example, when two applications developed by two developers are deployed on the same server, they may fail to run because they require incompatible Python versions.

Python's versions are indeed complex. For example, the major-version upgrade from Python 2 to Python 3 was not backward compatible, so programs that run normally on Python 2 will probably fail on Python 3, and many businesses install both versions to work around this.

Minor versions also cause small problems. For example, Python 3.7 added the dataclasses module and its dataclass decorator to the standard library; if a program written for an older version already ships its own module or decorator with that name, running it unchanged on Python 3.7 may report an error or behave unexpectedly because of the name clash. Although such problems are not as serious as the move from Python 2 to Python 3, they accumulate with every update and eventually become a big problem.

This problem does not seem to be difficult to solve, as long as enterprises reach a consensus and use a “stable version”. In fact, similar problems exist in many software utilities, and this solution works. However, for Python, it is not that easy.

Python's power relies heavily on its rich third-party library packages. However, many of these packages are developed independently, with little consideration for mutual compatibility. While the rich packages bring Python exceptional flexibility and rapid growth, this unchecked growth often results in incompatibility between packages, and between package versions and Python versions. For other software, there is usually an official authority that checks compatibility, which avoids most incompatibilities and announces in advance those that do occur, thereby reducing unnecessary trouble. Python has no such authority, so even the developers of a library package can hardly tell which other packages or which Python versions their package is incompatible with, let alone ordinary programmers. For personal development, the compatibility problem is easy to solve: we just need to choose compatible library packages and Python versions. But in enterprise applications compatibility issues are magnified: the package that application A depends on is not compatible with the one that application B depends on, application C conflicts with application D, and so on. Although applications can be isolated from each other with technologies such as Docker and virtual machines, we still need to maintain several or even dozens of Python versions for the different applications. Not to mention the additional cost of these technologies themselves, maintaining these Python versions alone is a disaster.

Solve Python's major shortcomings with SPL

esProc SPL (hereinafter referred to as SPL) is an open-source programming language designed specifically for structured data computation. It was originally created to address the difficulties of SQL (hard to code and slow to run for complex calculation tasks, awkward for cross-source computation, dependent on stored procedures, etc.). Like Python, SPL can overcome various shortcomings of SQL and stored procedures while retaining the advantages of SQL.

SPL has many advantages. Specifically, it offers a computing ability independent of the database, so we can switch to another database without modifying the SPL script, which solves the migration issue of stored procedures; it has a concise and easy-to-use IDE with a complete set of editing and debugging functions, making it easier to implement various algorithms; the SPL system is more open and can access and process a variety of data sources directly; the "outside-database stored procedure" does not rely on the database and can be stored together with the corresponding application, which removes the coupling issue; scripts are managed through the tree structure of the file system; and since SPL runs independently of the database, the associated security risk is avoided.

In addition, SPL has the ability to solve two major shortcomings of Python.

Efficient big data operation

First of all, we should acknowledge that Python performs well in in-memory computing: its basic functions (such as groupby and merge) are reasonably fast, and it achieves very good performance on purely numeric matrix operations in particular.

SPL is superior to Python in most in-memory computing scenarios. Detailed performance comparison can be found in the following two articles.

Comparison test of Python and SPL in data reading and computing performance

Performance comparison of Python and SPL in data processing

In addition, SPL offers the big data operation and parallel computing abilities that Python lacks, as well as binary data file formats (btx and ctx) that can greatly improve data read and write efficiency. Having a ready-made efficient storage format also reduces the amount of computation required, which is a unique advantage of SPL.

SPL not only can process big data, but also strives to keep the code the same as for in-memory operations, for example:

	A	B
1	D:\data\orders.txt
2	=file(A1).cursor@t()	/Create a cursor
3	=A2.groups(state;sum(amount):amount)	/Group and aggregate

This code uses the cursor mechanism to group and aggregate big data. Looking at A3 alone, we cannot tell whether it is an in-memory calculation or a cursor calculation, because the code is the same in both cases.

It is also very convenient for SPL to perform parallel processing; we just need to add the @m option in A2 and A3:

A2=file(A1).cursor@mt()

A3= A2.groups@m(state;sum(amount):amount)

SPL will automatically use multiple cursors to read the data in parallel and complete the calculation.

In addition to the grouping and aggregating operation, SPL’s cursor supports various common big data operations such as aggregation, association, and set operations, enabling us to improve performance through parallel processing, and reduce the data read amount by making use of the multi-purpose traversal technology of cursor/channel, thereby further improving big data operation performance.

As discussed above, the performance of big data processing depends largely on data IO. For this reason, SPL provides two high-performance file storage types: the bin file (*.btx) and the composite table (*.ctx). The bin file is compressed (less space occupied means faster reading) and stores data types (no type parsing is needed when reading). Since the bin file supports the double increment segmentation mechanism, which allows data to be appended, it is easy to implement parallel computing with a segmentation strategy, and computing performance is ensured. The composite table supports columnar storage, which has a great advantage when only a few columns (fields) are involved in the calculation. The composite table also supports indexes and the double increment segmentation mechanism, so we can combine the advantage of columnar storage with parallel computing to improve performance further.

Consistent version

SPL is developed and maintained by a single commercial company, which keeps its versions compatible with each other. Although minor modifications are inevitable when a version is updated, they are not as troublesome as in Python, and code does not become incompatible after developers deploy it to the server.

More advantages

Compared with Python, SPL has some other advantages. For example, it is simpler to develop in SPL, and it is convenient to integrate SPL into applications.

For example, to calculate the daily rising/falling rate of a stock, the Python code is:

stock_pre = stock["CLOSING"].shift(1)

stock["RATE"] = stock["CLOSING"] / stock_pre - 1

This code uses shift() (which shifts the values down by one row) to accomplish the calculation.

As another example, to calculate the moving average of a stock over 5 consecutive trading days, the Python code is:

stock["MAVG"] = stock["CLOSING"].rolling(5, min_periods=1).mean()

This code uses rolling() (which creates a sliding-window object) to accomplish the calculation.

In fact, both problems come down to accessing adjacent data, yet Python uses two different methods. Using completely different methods to solve similar problems increases the cost of learning.
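For readers who want to try the two Pandas calls side by side, here is a minimal runnable sketch. It assumes Pandas is installed; the closing prices are hypothetical values chosen only so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical closing prices for six consecutive trading days.
stock = pd.DataFrame({"CLOSING": [10.0, 11.0, 12.1, 12.1, 11.0, 13.2]})

# Daily rate of change: compare each row with the previous row's value.
stock["RATE"] = stock["CLOSING"] / stock["CLOSING"].shift(1) - 1

# 5-day moving average; min_periods=1 lets the first rows use fewer values.
stock["MAVG"] = stock["CLOSING"].rolling(5, min_periods=1).mean()

print(stock)
```

The first RATE value is missing because the first row has no previous row, while min_periods=1 makes the first MAVG values averages over however many rows are available so far.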

In contrast, SPL handles both problems with the same method:

RATE=stock.(if(#==1,null,CLOSING/CLOSING[-1]-1))

MAVG=stock.(CLOSING[-4:0].avg())

As we can see, both snippets use [] to access adjacent data. The only difference is the range fetched: CLOSING[-1] fetches the value from the previous record, while CLOSING[-4:0] fetches the values from four records back through the current record, five in total. Developers can easily solve such problems once they master the usage of [].

This inconsistency in syntax is also a result of Python's unchecked growth without an authority overseeing it. Although Python is highly adaptable, the lack of rules on version compatibility makes it difficult for developers to control. In contrast, SPL is carefully designed and powerful in computation, and version compatibility is taken into account in every update, so it is easier to control.

In addition, Python is lacking in structured data operations. Take ordered grouping as an example: since Python does not support ordered grouping directly, it has to take an indirect route, that is, create an order-related derived column first and then carry out a conventional grouping operation, which is both harder to develop and less efficient to run. SPL, by contrast, provides a dedicated algorithm and option (group@o()), which makes full use of the ordered nature of the data, improving computing performance while reducing development difficulty. For more information, visit Comparison between SPL and Python in processing structured data.
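The indirect route described above can be sketched as follows. This is one common way to write it in Pandas, not the only one; it assumes Pandas is installed, and the state and amount columns and their values are hypothetical.

```python
import pandas as pd

# Hypothetical ordered data: consecutive rows with the same state form one group.
df = pd.DataFrame({
    "state": ["CA", "CA", "NY", "CA", "CA", "CA"],
    "amount": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Derived column: increases by one each time the state differs from the
# previous row, so every run of equal adjacent values gets its own id.
run_id = (df["state"] != df["state"].shift()).cumsum().rename("run_id")

# Conventional grouping on the derived column.
result = (
    df.groupby(run_id)
      .agg(state=("state", "first"), amount=("amount", "sum"))
      .reset_index(drop=True)
)
print(result)
```

Note that "CA" appears in two separate groups here, because ordered grouping only merges adjacent rows; a plain groupby on state would have combined them into one.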

For enterprise applications, integration must also be considered. Since many modern applications are based on the J2EE system, Python usually has to run in a separate process when working with a Java application, which hurts both invocation performance and stability. In contrast, SPL is developed purely in Java and can be seamlessly integrated into Java applications, so it can be managed, operated and maintained by the many mature Java frameworks and gains abilities such as load balancing, elastic scaling and security control, with satisfactory invocation performance and stability.

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/ydhVnFH9

Youtube 👉 https://www.youtube.com/@esProc_SPL
