Desktop programming languages suitable for data analyst

 

Many scenarios need to use programming language to analyze and process the desktop data such as Excel and csv. In these scenarios, simple tasks include querying multiple files of same format through parameter, batch splitting or merging files, processing larger files, etc.; complex tasks include organizing files of irregular format, multi-step calculation, judging the flow of multi branches, calculating by sequence number or location, processing each group of data after grouping, and multi-layer association calculation, etc.

There are many programming languages that meet the above requirements; however, the vast majority of these languages are aimed at professional programmer, such as Java and C#. To use them, we need to understand heavy framework, design the data table object (Data Table or Data Frame), write cumbersome code, and go through painful learning process. In fact, there are some other programming languages, which save us from building heavy framework, come with data table objects, and are simply in coding and easy to learn, and suitable for data analyst to process desktop data. This article will present them one by one.

Python

..

Python is a flexible and simple interpreted language, and often uses Pandas, a third-party function library, to access and calculate the data. Python code is short and easy to learn, and the IDE that Python often adopts is Jupyter-Notebook (or Jupyter-Lab, an upgraded version).

Python supports a variety of data sources and its ability to read and write the data source is relatively strong. For example, it usually uses Pandas to read and write pure data, and the code is short; it generally uses the third-party function libraries like XLWings to access the style attributes like Excel border, and the code is longer. In addition, Python has quite powerful computing ability, it provides the data table object specially for computing, as well as a lot of basic calculation functions, yet the code is long when implementing complex calculations.

Python’s disadvantages are summarized as follows: first, it is fairly troublesome to install IDE separately. Specifically, not only do we need to install the Python interpreter manually, but we also need to install the third-party libraries additionally such as Numpy and Pandas, and install Jupyter through command line. Although using Anaconda development platform is more convenient, allowing us to install the components needed for Jupyter with one click, it is less stable than separate installation because we need to integrate the components of several different teams. Second, Jupyter is average in terms of the ease of use. To be specific, it needs to type a command to display the function description; it needs to install the extension component to display the function tips; it also needs to type a command to perform debugging. As we can see, it is troublesome to operate, the only advantage of Jupyter is that the operation window is concise and clear, and suitable for data analyst. Third, although Python has the mechanism to process large file, it lacks directly usable function, we have to swap the data of memory and external storage manually.

https://www.anaconda.com/products/distribution

esProc SPL

..

SPL is a flexible, simple interpreted data-calculation language, and has built-in native data source read and write library and computing library, and its overall performance is good. SPL code is short and easy to learn. In addition to native syntax, SPL allows us to execute SQL directly on files like Excel and csv.

SPL has many advantages, i)it provides its own IDE, and supports one-click installation without resorting to external environment or interpreter, and eliminating the need to configure third-party function library. SPL IDE is concise and easy to use and suitable for data analyst. In SPL IDE, we can use the shortcut key to display the function description, there is function input prompt, and it also provides complete debugging functionality, making it possible to display the calculation result of each step; ii)it supports many types of data sources and has powerful ability to read and write data source, and allow us to read and write common data source with short code, yet it doesn’t support accessing the style attributes like Excel border; iii) it has strong computing ability, and has built in the data table objects that facilitate calculation, as well as a lot of basic calculation functions. SPL code is not long when performing complex calculations. Moreover, it has strong computing ability on large file, and provides many directly usable functions, making it possible to automatically swap the data of memory and external storage.

http://www.raqsoft.com/esproc-desktop

KDB+ (Q)

..

Q language is a simple and flexible interpreted language targeting in-memory database, and has built in multiple native function libraries, including the function library for data source read and write, the function library for calculation, and the function library for database maintenance. The IDE that Q often utilizes is Studio for KDB+. Q’s code styles are not the same. Specifically, when calculating the table data object, it adopts SQL style, which is highly readable and not difficult to learn, while for other statements, such as flow control and data management, it uses APL style, which is less readable and difficult to learn.

Q has strong computing ability, and has built-in data table objects, and provides relatively rich basic calculation functions, making it possible to perform more complex calculations. Although Q code is short, it is difficult to code.

The disadvantages of Q language are reflected in the following aspects. First, the installation is troublesome. Specifically, each installation step needs to be done manually, the command line and configuration file need to be used when configuring, and it needs to write script when starting up, all the steps require higher technical skills. Second, Q’s IDE needs to be downloaded separately, the Java environment needs to be installed in advance, and it needs to write the script to start IDE. In addition, IDE window is too simple and lacks basic functionalities, such as function description, input prompt, debugging functionality and so on. Third, the ability of Q to read and write data source is weak, it supports only csv file with private form or regular form, and supports few other data sources including Excel. Fourth, the ability of Q to calculate large file is relatively weak, it lacks the function to directly calculate large file, and hence we need to swap the data of memory and external storage manually.

https://kx.com/

MATLAB

..

MATLIB is a full-featured and highly integrated interpreted computing language, and has a large number of built-in model prediction libraries, as well as data source access and data computing libraries, allowing us to analyze and process desktop data. MATLIB comes with IDE, and the code is short and easy to learn.

MATLIB IDE is installed through a graphical interface without resorting to an external environment or third-party components, but it is large in size, time-consuming and involves many steps. MATLIB IDE is convenient and easy to use, and has the function description and input prompt functionalities, complete debugging functionality and concise operation window, and is suitable for data analysts. MATLIB supports rich data sources, allowing us to read and write the data in Excel/csv file with short code, and access style attributes such as Excel border. However, when accessing style attribute, we need to use cominterface, the code is longer. In addition, MATLIB has strong computing ability, and has built-in data table objects that facilitate calculation, and provides rich basic calculation functions, yet the code is long when implementing complex calculation.

Although MATLIB has the mechanism to process large file, the processing ability is weak as a result of lacking the calculation function targeted directly on large file, so we need to swap the data of memory and external storage manually.

https://www.mathworks.com/products/matlab-online.html

R

..

R, as a simple and flexible interpreted computing language, is comprehensive in function but not high in integration. This language has only built-in the csv file read and write library and basic computing library, and thus we have to turn to the third-party function libraries if we want other functionalities. R code is simple and easy to learn, and the IDE that R often uses is RStudio.

To use R, we need to follow these steps: first install and configure the runtime environment of R, then install RSutdio, and finally install the third-party function libraries such as xlsx, xlsReadWrite, data.table, All of the steps are performed in graphical interface and the installation process is fairly convenient. RSutdio provides the function description, input prompt and complete debugging functionalities, the operation window is concise, and suitable for data analyst. R has relatively powerful ability to read and write data source, and supports a variety of data sources. Specifically, it can read and write the data in Excel/csv file through short code; it can also read and write the style attributes, except that there is a need to switch between libraries, the code is long, and the styles are not uniform.

R is average in computing ability. Although it has built-in data table objects and calculation functions, both the functionality and the number are insufficient, and hence we often need to use the third-party function libraries to make up with the insufficiencies, and its stability is a little bit poor. R can implement more complex calculations, but the code is complicated and long, and the performance is poor. While R has the mechanism to process large file, the ability is weak as a result of lacking a calculation function directly on large file, so we need to swap the data of memory and external storage manually.

https://posit.co/products/open-source/rstudio/