"Usually, a large amount of data preparation work is needed before performing data analytics and .."

mars RaqForum 25 No.
1 Reply • 316 View • 8 Months ago

Data Preparation Script: Python Pandas OR esProc SPL?

Usually, a large amount of data preparation work is needed before performing data analytics and AI computing. It involves arranging different data sources and different types of data into a consistent format. As the real-world scenarios are extremely complicated and diverse, almost no visualization tools can do the job and generally programming is needed.

Many free scripting languages are available for data preparation work in the related industries. Among them Python Pandas is preferred by users because of its supply of interfaces for various data sources and a rich library of computing functions. esProc SPL is an emerging data handling language, which is distinguished by agile syntax and powerful computing capability. Following will compare the two from many aspects for daily processing tasks, such as data parsing, data cleansing, data computing and data export, but not for further uses, such as AI computing, and special computing scenarios such as high-performance computing.

Language features

Programming paradigm

Python is a general-purpose programming language, which supports multi-paradigm programming, including completely object-oriented and functional. However, there is a great number of Python users who are not professional application programmers and who seldom use the two complicated modern programming paradigms. Actually, the ancient and simple procedural programming is the most commonly used.

SPL specializes in structured data processing and also supports the three commonly seen programming paradigms. The language greatly simplifies the concept of object orientation. It retains the object concept and supports accessing attribute using the dot and step-by-step computations, but it does not have such concepts as inheritance and overloading. SPL also streamlines the functional programming to make it more suitable for non-professional application programmers – where the Lambda expressions are even simpler and easier to use than SQL’s.

Language unity

Pandas isn’t Python’s native class library but a third-party one developed based on Numpy (Numpy itself is a third-party class library). It is outside the Python schema and cannot get low-level support from Python, resulting in low language unity and low specialization of basic data types, particularly the structured data type (DataFrame). This affects both coding performance and computing efficiency.

SPL is a native class library, and can design its syntax, functions, interfaces and basic data types, particularly the structured data type (table sequence), bottom-up, creating a high level of language unity.

Working mode

Python is an interpreted language written in C language. SPL is a Java-based interpreted language. Both can automatically infer data types, and provide agile and convenient syntax to do that. Interpreted languages generally have lower performance than compiled languages. But SPL is often able to obtain better performance than compiled languages by building a great number of low-time-complexity basic operations in it. Pandas cannot have performance as good as Python’s native class libraries owing to its low-level language unity.

IDE

Both Python and SPL have graphics-based IDE, which includes a complete set of debugging functionalities, convenient structured object monitoring and intuitive code block/working sphere indentation. Python uses space/tab indentation while SPL adopts Excel’s grid-style indents.

Learning threshold

There is plenty of documentation about Pandas and the learning threshold is low. But the learning curve suddenly becomes steep when users try to learn the complete object-oriented programming and functional programming to be able to handle more difficult situations.

SPL deliberately simplifies the concept of object and functional programming interface, making it not hard for both beginners and advanced learners. Yet it is not so easy to learn the many special algorithms designed to achieve high-performance computing.

Code amount

Pandas has a wealth of library functions, which alone are sufficient for achieving simple data preparation tasks and involve a relatively small amount of code. Yet, it intensively needs Python native class libraries and the third-party class libraries in order to implement complex data preparation tasks. With a low-level of language unity, the amount of code swells as coding difficulty spikes.

SPL boasts a large number of library functions and a high-level of language unity, so it can use equal amount of code to handle both simple and complex data preparation tasks.

Data sources

Supported data source types

Pandas supports various data sources, including:

Text files: Tab-separated txt file, CSV (comma-separated values) files, and files delimited by custom separator.
Fixed Width Format (FWF) files,
Various relational databases,
Excel,
JSON,
XML,
Restful and WebService，
HTML crawling,
SAS,
SPSS,
Stata,
Column-oriented Parquet,
Column-oriented ORC,
Google BigQuery,
HDF for storing scientific data,
Feather for storing data frames,
Structured data in the clipboard,
And proprietary format pickle.

SPL also supports diverse data sources, including:

Text files: Tab-separated txt file, CSV (comma-separated values) files, and files delimited by custom separator.
Fixed Width Format (FWF) files,
Various relational databases,
Excel,
JSON,
XML,
Restful and WebService，
HTML crawling,
HBase,
HDFS,
Hive,
Spark,
Elasticsearch,
MongoDB,
Kafka,
R2dbc,
FTP,
Cassandra,
DynamoDB,
influxDB,
Redis,
SAP,
Structured data in the clipboard,
And proprietary format btx and ctx.

Database read/write

To query database in SQL and update the database using a CSV file:

Pandas:

conn = create_engine('mysql+pymysql://root:password@localhost:3306/testdb')
df_read = pd.read_sql_query('select * from product', conn)
data = pd.read_csv("d:/Orders.csv")
data.to_sql('testdf', conn, index=False)
conn.dispose()

The code is elegant enough when Pandas simply reads/writes data from/to the database.

SPL:

	A
1	=connect(“com.mysql.jdbc.Driver”,“jdbc:mysql://localhost:3306/testdb?user=root&password=password”)
2	=A1.query("select * from product")
3	=T("d:/Orders.csv")
4	=A1.update(A3, testdf; ORDERID)
5	=A1.close()

SPL code is simple, too. The overall logic is similar to Pandas code. Difference is that SPL allows writing data source information in the configuration file and it only needs to reference the data source name in the code. A1, for instance, can be written as connect(“myDB”).

Reading and writing text files

Regular-format text: to read a CSV file, perform a simple computation and write the result to a new CSV file:

Pandas:

data = pd.read_csv("d:/Orders.csv")
data['OrderDate']=pd.to_datetime(data['OrderDate'])
result=data.groupby(data['OrderDate'].dt.year).agg({'Amount':[len,np.sum]})
result.to_csv("d:/resultP.csv")

The Pandas code is concise, but there are still some problems. The datetime type cannot be automatically parsed. There are curly braces, square brackets and parentheses, and expressions and strings in the code, making low language unity and needing considerable optimization.

SPL code implementing the same computations:

	A
1	=T("d:/Orders.csv")
2	=A1.groups(year(OrderDate);count(1),sum(Amount))
3	=file("d:/resulS.csv").export@t(A2)

SPL code is succinct. SPL automatically parses the datetime type, uses one type of brackets and only expressions, creating a very high-level of language unity.

Irregular-format text: Every three rows correspond to one record, where the second row has three fields (members of a set are also sets). To arrange such a file into a regular-format structured object:

Pandas:

data = pd.read_csv("d:/threeLines.txt",header=None)
pos_seq=[i//3 for i in range(len(data))]
def runSplit(x):
    f123=x.iloc[1,0].split("\t")
    f=[x.iloc[0,0],f123[0],f123[1],f123[2],x.iloc[2,0]]
    return pd.DataFrame([f], columns=['OrderID','Client','SellerId','Amount','OrderDate'])
df=data.groupby(pos_seq).apply(runSplit)
df.reset_index(drop=True, inplace=True)			#drop the Second Index

The above parsing process generally has three parts: read the text as a single-field DataFrame; perform order-based grouping – putting every three rows in one group; loop each group to piece data together into a single-record DataFrame, and concatenate all records when loop is over to form a new DataFrame.

Obviously, Pandas code becomes complex when handling irregular-format texts. These are proofs. When creating a grouping criterion like [0,0,0,1,1,1,2,2,2…], Pandas uses the complicated for loop statement to first define the loop number i and then perform division by i and get the quotient. When using apply to loop through each group of data, the language needs to define a function to handle the intra-group data. As the function definition occupies more than one statement, it cannot use Lambda expression to simplify it (Even the compiled languages, such as Java, do not have such a restriction). When getting members of DataFrame data, Pandas can only use iloc function (or loc function) but it can directly use the subscript to get members of list f123. Both are sets but their data retrieval methods are completely different only because DataFrame is not a Python native class library – which indicates a low-level of language unity – and does not enjoy succinct syntax rules as a native class library does. DataFrame itself has index. When using apply function to concatenate multiple DataFrames, it adds another level of index and you need to peel one level off manually.

SPL:

	A
1	=file("D:\\split.csv").import@si()
2	=A1.group((#-1)\3)
3	=A2.new(~(1):OrderID, (line=~(2).split("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate )

SPL’s parsing logic is same, but the code is much simpler. Instead of the complex for loop statement, it uses the simpler group(…) loop function to create the grouping criterion. And it does not need to define the loop number because symbol # by default represents it (~ by default represents loop variable). Though SPL still needs to define a function to handle the intra-group data as the new()function loops through each group of data, it can write multiple lines of code in new() thanks to its support of concise Lambda expressions. Python, however, defines the whole function structure manually to do the job. SPL directly uses the subscript to get members from its any type of sets (including table sequence) – the syntax is simple and consistent. In the last step, the SPL new() function concatenates multiple records without generating a new, useless index. In short, the underlying reason for the more concise SPL code is that the native class library has a higher level of language unity.

Multilevel data handling

In a JSON file, the higher level contains salespeople data and the lower level has orders data. To query all orders meeting the specified condition:

Pandas:

JsonStr=open('D:/data.json','r').read()
JsonObj=json.loads(JsonStr)
df=pd.json_normalize(JsonObj,['Orders'])
df['OrderDate']=pd.to_datetime(df['OrderDate']) 
result=df.query('Amount>1000 and Amount<2000 and contains("business")')

Pandas code is simple enough. But there is one thing you need to note. Python’s basic data types, such as dict and list, support generic type, naturally correspond to JSON’s object and array types, and are suitable for representing multilevel JSON data (but not for two-dimensional data). On the contrary, DataFrame represents two-dimensional data well, but it does not support different data types in a same column. It isn’t the true generic type and cannot represent the general multilevel JSON data. Being not good at describing multilevel JSON, it needs json_normalize function to convert the multilevel JSON into a two-dimensional DataFrame for subsequent computations. This is another evidence of Pandas low language unity.

SPL:

	A
1	=file("d:/EO.json").read()
2	=json(A1)
3	=A2.conj(Orders)
4	=A3.select(Amount>1000 && Amount<=2000 && like@c(Client,\"business\"))

Table sequences support both two-dimensional data and multilevel data. They support generic type thoroughly, naturally correspond to JSON’s object and array, and are good at representing multilevel data, which is the general form of two-dimensional data. They also fit for representing two-dimensional data and can be computed directly, without additional normalization.

Accessing nodes on different levels: To group JSON data, where the grouping field consists of a higher level of field and a lower level of field:

Pandas:

JsonStr=open('D:/data.json','r').read()
JsonObj=json.loads(JsonStr)
df=json_normalize(JsonObj,record_path=['Orders'],meta=['Name','Gender','Dept'])
result=df.groupby(['Dept','Client']).agg({'Amount':['count','sum']}).reset_index()
result.columns = ['Dept','Clt','cnt','sum']

As Pandas DataFrame cannot represent multilevel JSON, it does not support intuitively accessing data according to the tree-structure levels. It can only use normalize function to convert the multilevel data into two-dimensional data and access the flat-structure data.

SPL:

	A
1	=json(file("d:/data.json").read())
2	=A1.groups(Dept,Orders.Client:Clt; count(Orders.OrderID):cnt, sum(Orders.Amount):sum)

A SPL table sequence can represent and compute multilevel JSON data in a more concise and elegant way. One of the characteristics of the multilevel data computing is the convenient to use syntax that represents the tree-structure hierarchical relationships. The dot in the above code “Orders.Client”, for instance, can be used to reference data of any node. This type of reference can notably streamline the description.

Same logic for XML handling. Both Pandas and SPL can compute XML. DataFrame does not support multilevel XML data, it needs to convert the data into a two-dimensional structure and has less expressive power. SPL table sequences can both represent and compute multilevel XML data with elegant code.

Unlike JSON’s normalize function, Pandas does not offer convenient normalization function for XML. The official recommendation is that users can convert multilevel XML into two-dimensional XML using an XML query language, such as the commonly used XSLT and XPath. To compute XML data, users need to learn a third-party language. The learning cost is too high.

The SPL language has a very high-level of unity. It can parse XML data using similar code for parsing JSON, and compute XML using the same code for computing JSON. The learning cost is low. To group and summarize multilevel XML data, for instance:

	A
1	=file("d:\\xml\\emp_orders.xml").read()
2	=xml(A1,"xml/row")
3	=A2.groups(Dept,Orders.Client:Clt; count(Orders.OrderID):cnt, sum(Orders.Amount):sum)

Besides files, both Pandas and SPL can parse multilevel data coming from RESTful/WebService. The former, however, does not offer built-in RESTful/WebService interface due to low level of language unity and needs to import third-party class libraries. Below is one way of coding:

import requests
resp=requests.get(url="http://127.0.0.1:6868/api/emp_orders")
JsonOBJ=resp.json()

With high language unity, SPL naturally supports multilevel data and RESTful/WebService:

=json(httpfile("http://127.0.0.1:6868/api/emp_orders").read())

Structured data object

Generation

The structured data object in Pandas is DataFrame, which can be generated from a data source or created directly. Below is one way of generating or creating a DataFrame:

# Generate from List; two fields, four records; row numbers (indexes) are 0-3 by default; column names are 0-1
df=pd.DataFrame([[1,'apple'],[2,'orange'],[3,'banana'],[4,'watermelon']])
# Generate from an Array
pd.DataFrame(numpy.array([[1,'apple'],[2,'orange'],[3,'banana'],[4,'watermelon']]))
# Generate from Dict; column names are the specified one and two
pd.DataFrame({'one':[1,2,3,4],'two':['apple','orange','banana','watermelon']})

A DataFrame is made up of multiple Series (columns or field objects). The lower level of Series consists of atomic type data or objects (pointers). Pandas does not have true record object. This is convenient in certain scenarios but also makes the logic hard to understand and nonintuitive. The Python package often uses the native Python class libraries and data objects of the third-party class library Numpy, including Set (mathematical sets), List (a set whose members can be repeatable), Tuple (a set whose members are immutable and can be repeatable), Dict (a set of key pairs), Array (a collection of elements). All these data objects are sets. It’s easy to confuse them with Series and DataFrame but it’s also hard to convert them to any of the two, creating not a few troubles to Pandas beginners. In addition to the external class library’s sets, it’s easy to confuse Series with the other Pandas sets, such as the post-grouping set DataFrameGroupBy. These are all evidences of Pandas low language unity and lack of support from the low level.

SPL’s structured data object is table sequence, which can be generated as follows:

// First create the structure and then insert data using a sequence; row numbers are 0-3 and column names are the specified one and two
T=create(one,two).record([1,"apple",2,"orange",3,"banana",4,"watermelon"])
// First prepare data in the form of sequence (including column names) and then create the table sequence
["one","two",1,"apple",2,"orange",3,"banana",4,"watermelon"].record(2)
// Use table sequence T0’s structure as the structure of the new table sequence and then enter data to it
T0.create(one,two).record([1,"apple",2,"orange",3,"banana",4,"watermelon"])

A table sequence consists of multiple Records (record objects). Their lower level contains the atomic type data or objects (pointers). A table sequence has true record objects, which are easier to understand and enabling intuitive coding in most scenarios. Though Record and a single-record table sequence are essentially different, they have similar business meaning and are easy to confuse. SPL is specifically designed to make the two appear to have consistent uses and deliberate differentiation is not needed so that the possibility of confusion becomes less. SPL only has two types of sets: sequence (similar to List) and table sequence. The former constitutes the basis of the latter, and the latter is the structuralized former. The result of grouping a table sequence is a sequence. They have clear relationship and differences, and can be easily converted between each other, making both language learning and coding rather cheap. SPL provides syntax support from the low level and thus obtains high language unity.

Data access

Pandas DataFrame has built-in row numbers (starting from 0), field numbers (column numbers) and field names (column names). Users can access records through the subscripts or their field names:

# Get the list of row numbers; index is the row number field name
list(df.index)
# Get the first record
df.iloc[1]
# Get records from the first to the third represented by an interval (left-closed and right-open)
df.iloc[1:4]
# Move forward to get records (at even-numbered positions) 
df.iloc[1::2]
# Get the second-to-last record (starting from 1)
df.iloc[-2]
# Get values according to record numbers and field numbers
df1.iloc[1,0]
# Get values according to record numbers and field names
df.loc[1,'two']

Like Pandas, SPL table sequences also have built-in row numbers (starting from 1), field numbers (column numbers) and field names (column names). It’s convenient for users can access records through the subscripts or their field names:

// Get the list of row numbers; # is the row number filed name
T.(#)
// Get the second record (can be simply written as T(2))
T.m(2)
// Get records from the second to the fourth represented by an interval (left-closed and right-closed)
T.m(2:4)
// Move forward to get records (at even-numbered positions)
T.step(2,2)
// Get the second-to-last record (starting from 1)
T.m(-2)
// Get values according to record numbers and field numbers
T.m(2).#1
// Get values according to record numbers and field names
T.m(2).two

The nature of row numbers (subscripts) is high-performance address index. Besides row numbers, both Pandas and SPL supply other types of indexes and the corresponding query functions, including HASH index for unique values and binary search index. As performance isn’t our concern and the two types of indexes are similar in functionality, we won’t discuss them any more here.

Data maintenance

To modify a record in a specified position:

Pandas:

df.loc[4,['NAME','SALARY']]=['aaa',1000]

Pandas does not supply an off-the-shelf modification function. It uses Series object to retrieve certain record fields and then List to modify them. Here Series represents records but usually it represents columns. Generally, List represents records, but it can also represent columns. This is confusing for beginners.

SPL:

T.modify(5,"aaa":NAME,1000:SALARY)

SPL offers modification functions, which are beginner-friendly. SPL can retrieve records for modification, too. The two methods are suitable for different scenarios respectively.

To insert a record in a specified position:

Pandas:

record=pd.DataFrame([[100,"wang","lao","Femal","CA", pd.to_datetime("1999-01-01"), pd.to_datetime("2009-03-04"),"HR",3000]],columns=df.columns)
df = pd.concat([df.loc[:2], record,df.loc[3:]],ignore_index=True)

Pandas does not have true record object and any ready methods of inserting records. Taking a roundabout route is inconvenient: create a single-record DataFrame first, split the original DataFrame into two DataFrames (first half and second half) according to the specified position, and concatenate the three DataFrames. You need to handle many small details well, such as using same field names as the original when creating records and abandoning original row numbers for the new concatenated DataFrame, in order to obtain a satisfactory result.

SPL:

T.insert(3,100,"wang","lao","Femal","CA",date("1999-1-1"),date("2009-3-4"),"HR",3000)

SPL places particular emphasis on records, offers ready methods to insert records and produces concise and easy to understand code.

To add a computed column:

Pandas:

today = datetime.datetime.today().year
df["Age"] = today-pd.to_datetime(df["BIRTHDAY"]).dt.year
df["Fullname"]=df["NAME"]+ " " +df["SURNAME"]

Pandas does not provide functions for adding computed columns. Though this isn’t a big shortcoming in practice, users need to handle multiple times to add multiple computed columns. That’s a hassle. The package’s time-related function store is not rich enough, making age computations inconvenient.

SPL:

T.derive(age(BIRTHDAY):Age, NAME+""+SURNAME:Fullname)

SPL offers functions for adding computing columns, letting you add multiple at one time, and has a rich store of time-related functions.

Structured data processing

Computing functions

Pandas has a rich store of built-in library functions and supports multiple types of structured data processing, including loop traversal (apply\map\transform\itertuples\iterrows\iteritems), filtering (Filter\query\where\mask), sorting (sort_values), getting unique values (unique), grouping (groupby), aggregate (agg(max\min\mean\count\median\std\var\cor)), join (join\merge), concatenation (append\concat), transposition (transpose), rolling window (rolling) and DataFrame shifting (shift).

in Pandas, there are not special functions for performing intersection, union and difference between sets of records. You can only choose a roundabout way and the code is complicated. Pandas offers multiple functions for handling similar operations, such as filtering operations. Main roles of these functions overlap, only with different parameter specification, output type or historical versions. You need to differentiate them clearly.

SPL boasts a rich wealth of computing functions. They include .()(loop traversal), select (filtering), sort (sorting), id (getting unique values), group (grouping), max\min\avg\count\median\top\icount\iterate (aggregate), join (association), conj (concatenate) and pivot (transposition).

SPL gives great support for set operations performed over sets of records. For subsets belonging to the same set, SPL offers high-performance set-related functions, such as isect (intersection), union (union) and diff (difference operation), which are represented by infix operators ^, &, \ respectively. For sets belonging to different source sets, SPL uses different options in merge function to do the job, such as @i for intersection, @u for union and @d for difference.

Apart from set-related functions, there are computing functions that are unique to SPL – groups (grouping & aggregation), switch (foreign-key-based switch), joinx (order-based association), merge (order-based merge), iterate (loop iteration), enum (enumerated grouping), align (alignment grouping), and pselect\psort\ptop\pmax\pmin (getting ordinal numbers). Pandas, however, does not have similar functions and needs hardcoding to implement the corresponding operations.

Pandas uses different names or parameters to distinguish a large group of functions having similar functionalities from each other. This is rather inconvenient. SPL uniquely adopts function options to differentiate those functions while using the same name for them. For instance, the basic role of SPL select function is filtering and @l option is used to get the first eligible record:

T.select@1(Amount>1000)

@b option is used to filter ordered data fast using binary search:

T.select@b(Amount-1000)

Multiple function options can work together, for instance:

Orders.select@1b(Amount-1000)

Sometimes, the parameters in a structured data computing function are complicated. Pandas uses options or names to differentiate parameters from each other in order to make them easy to remember and understand. Problem is that the code is too long and the syntactic structure is inconsistent. Take left join as an example:

pd.merge(Orders, Employees, left_on='SellerId', right_on='EId', how='left', suffixes=['_o','_e'])

SPL uses cascaded structure to simplify the syntax of complicated parameters – semicolon, comma and colon are used to divide parameters into three layers from high to low – though this is harder to remember. To implement left join in SPL:

join@1(Orders:o,SellerId ; Employees:e,EId)

cascaded parameters have strong expressive ability. The semicolon in join function, for instance, works to represent the top layer table sequence parameter. To perform a multi-table join, we just need to write more semicolons. Pandas parameters’ expressive ability is much weaker. The merge function only has left and right options to represent DataFrame. So, only two table can be merged.

Both Pandas and SPL have a rich supply of computing functions. They are almost neck and neck in basic computations that involves only single functions. However, in real-world analyses data preparation work is not so simple and requires the cooperation of more than one function and the native syntax. In view of this, gap between the two languages becomes very clear.

Calculating Link relative ratio

To group data by year and month, calculate sales amount per month and then the growth rate of each month compared with the same month in the last year:

Pandas:

sales['y']=sales['ORDERDATE'].dt.year
sales['m']=sales['ORDERDATE'].dt.month
sales_g = sales[['y','m','AMOUNT']].groupby(by=['y','m'],as_index=False)
amount_df = sales_g.sum().sort_values(['m','y'])
yoy = np.zeros(amount_df.values.shape[0])
yoy=(amount_df['AMOUNT']-amount_df['AMOUNT'].shift(1))/amount_df['AMOUNT'].shift(1)
yoy[amount_df['m'].shift(1)!=amount_df['m']]=np.nan
amount_df['yoy']=yoy

During grouping and aggregation, it’s hard for Pandas to group data while computing it as SQL does. Usually, Pandas needs to add a computed column for grouping, making code complicated. To calculate link relative ratio, Pandas uses shift function to shift a whole row to indirectly access the “previous record”. Besides, it needs to handle zeros and null values, making the code even longer.

SPL:

	A
2	=sales.groups(year(ORDERDATE):y,month(ORDERDATE):m;sum(AMOUNT):x)
3	=A2.sort(m)
4	=A3.derive(if(m==m[-1],x/x[-1] -1,null):yoy)

Like SQL, SPL performs computations and grouping at the same time for grouping & aggregation. The agile syntax brings concise code. During link relative ratio calculation, SPL directly uses [-1] to reference the “previous record” and automatically handles array’s beyond-the-range and division by zero problems. The code is short.

Except for using [x] to represent a relative position, SPL can use [x:y] to represent a relative interval. To calculate a stock’s moving average in three days, for instance:

T.derive(Amount[-2:0].avg():ma)

Pandas can express a relative interval, too. But it cannot give direct syntactic support due to low language unity. For this it offers rolling function. To calculate a stock’s moving average in three days:

df['ma']=df['Close'].rolling(3, min_periods=1).mean()

Calculating installment loans

To calculate detailed payments for each installment per loan (payment, interest, principal and remaining principal balance) according to information of multiple loans (amount, number of installments and interest).

Pandas:

loan_data = ......			# Skip data retrieval by loan_data
loan_data['mrate'] = loan_data['Rate']/(100*12)
loan_data['mpayment'] = loan_data['LoanAmt']*loan_data['mrate']*np.power(1+loan_data['mrate'],loan_data['Term']) \ /(np.power(1+loan_data['mrate'],loan_data['Term'])-1)
loan_term_list = []
for i in range(len(loan_data)):
    loanid = np.tile(loan_data.loc[i]['LoanID'],loan_data.loc[i]['Term'])
    loanamt = np.tile(loan_data.loc[i]['LoanAmt'],loan_data.loc[i]['Term'])
    term = np.tile(loan_data.loc[i]['Term'],loan_data.loc[i]['Term'])
    rate = np.tile(loan_data.loc[i]['Rate'],loan_data.loc[i]['Term'])
    payment = np.tile(np.array(loan_data.loc[i]['mpayment']),loan_data.loc[i]['Term'])
    interest = np.zeros(len(loanamt))
    principal = np.zeros(len(loanamt))
    principalbalance  = np.zeros(len(loanamt))
    loan_amt = loanamt[0]
    for j in range(len(loanamt)):
        interest[j] = loan_amt*loan_data.loc[i]['mrate']
        principal[j] = payment[j] - interest[j]
        principalbalance[j] = loan_amt - principal[j]
        loan_amt = principalbalance[j]
    loan_data_df = pd.DataFrame(np.transpose(np.array([loanid,loanamt,term,rate,payment,interest,principal,principalbalance])),columns = ['loanid','loanamt','term','rate','payment','interest','principal','principalbalance'])
loan_term_list.append(loan_data_df)
loan_term_pay = pd.concat(loan_term_list,ignore_index=True)

The main structure of the above code is a two-layer loop: first loop through each loan, generate its every installment by loop, transpose detailed payments of each installment into DataFrame and append the data to a ready list; then move on to the next loan. When the loop is over, concatenate the multiple small DataFrames held in lists into a large DataFrame. The business logic is clear, which calculates every item using formulas. As the two-layer loop structure is complex and data type conversion is complicated, the code is lengthy.

SPL:

	A
1	// Skip data retrieval by loan_data
2	=loan_data.derive(Rate/100/12:mRate,LoanAmtmRatepower((1+mRate),Term)/(power((1+mRate),Term)-1):mPayment)
3	=A2.news((t=LoanAmt,Term);LoanID, LoanAmt, mPayment:payment, Term, Rate, t* mRate:interest, payment-interest:principal, t=t-principal:principlebalance)

SPL implements the same business logic. But it can achieve the two-layer loop using one news function thanks to its high language unity, and does not involve complicated data type conversion. These greatly shorten the code.

Grouping by job tenure

To group employees by job tenure and count employees in each group (Overlaps exist between certain groups):

Pandas:

# Skip data retrieval from employee info table emp
def eval_g(dd:dict,ss:str):
return eval(ss,dd)
employed_list=['Within five years','Five to ten years','More than ten years','Over fifteen years']
employed_str_list=["(s<5)","(s>=5) & (s<10)","(s>=10)","(s>=15)"]
today=datetime.datetime.today()
emp['HIREDATE']=pd.to_datetime(emp['HIREDATE'])
employed=((today-emp['HIREDATE'])/np.timedelta64(1,'Y')).apply(math.floor)
emp['EMPLOYED']=employed
dd={'s':emp['EMPLOYED']}
group_cond = []
for n in range(len(employed_str_list)):
    emp_g = emp.groupby(eval_g(dd,employed_str_list[n]))
emp_g_index=[index for index in emp_g.size().index]
if True not in emp_g_index:
    sum_emp=0
else:
    group=emp_g.get_group(True)
    sum_emp=len(group)
group_cond.append([employed_list[n],sum_emp])
group_df=pd.DataFrame(group_cond,columns=['EMPLOYED','NUM'])

Pandas is good at equi-grouping, and can implement simple interval-based enumerated grouping. With overlapping enumerated groups (as in this example), the only choice is hardcoding. The process is like this: loop through each grouping condition to convert the task to an equi-grouping one, handle the grouped subsets and concatenate results. Since there isn’t a job tenure calculation function in Pandas, users need to achieve it manually.

SPL:

	A	B
1	/ Skip data retrieval from employee info table emp
2	[?<5,?>=5 && ?<10,?>=10,?>=15]	/ Conditions
3	[Within five years, Five to ten years, More than ten years, Over fifteen years]	/ Group names
4	=emp.derive(age(HIREDATE):EMPLOYED)	/ Calculate job tenue
5	=A4.enum@r(A2, EMPLOYED).new(A3(#):EMPLOYED,~.len():NUM)	/ Enumerated grouping

The enum function deals with enumerated grouping. @r option works specifically for handling overlapping groups. Plus, SPL is efficient in algorithm description. All this enables it to produce much short code than Pandas.

Through these examples, we can see that Pandas is better at handling simple data preparation scenarios but that it is hard to code relatively complex structured data computing tasks. SPL excels in generating concise code for handling both simple and complex data preparation scenarios because of its high language unity.

Computing data that cannot fit into the memory

If the file or database table contains a huge amount of data (which cannot fit into the memory but also cannot be called big data), we need to handle it using loop and segment method anyway. Retrieve a small amount of data each time and compute it, retain the intermediate result, and concatenate all the intermediate result sets (for filtering, for instance) or perform a second computation (such as grouping & aggregation) when loop is over. Even a basic structured data computation becomes difficult to handle when a large volume of data is involved. The code is more complicated if there is join, merge, union or comprehensive computation.

Aggregation

Pandas:

chunk_data = pd.read_csv("orders.txt",sep="\t",chunksize=100000)
total=0
for chunk in chunk_data:
    total+=chunk['amount'].sum()

Pandas code is fairly simple in handling a simple large file computation such as aggregation. When opening a large text file, Pandas offers option chunksize to specify the number of records to be retrieved each time. Later you can handle the large text file using loop and segment method, which retrieves a segment and summarizes it each time, and then accumulates the results together.

SPL:

=file("orders.txt").cursor@tc().total(sum('amount'))

SPL also uses the loop and segment method to process large text files. It encapsulates code details and offers convenient cursor mechanism, letting users deal with a large amount of data intuitively using syntax similar to that used to process a relatively small amount of data. So, the process of loop accumulation is invisible in the computational code.

Filtering

Pandas:

chunk_data = pd.read_csv("d:/orders.txt",sep="\t",chunksize=100000)
chunk_list = []
for chunk in chunk_data:
    chunk_list.append(chunk[chunk.state=="New York"])
res = pd.concat(chunk_list)

Pandas does not have cursors. It uses hardcoding to achieve loop and segment, during which each time a part of the data is read into the memory for filtering and the filtering result is also stored in the memory.
This is suitable for handling scenarios where the result set can fit into the memory. If the filtering result set cannot be wholly held in the memory, it needs to be written into a file each time. This results in inconsistent code:

chunk_data = pd.read_csv("d:/orders.txt",sep="\t",chunksize=100000)
isNew=True
for chunk in chunk_data:
    need_data = chunk[chunk.state=='New York']
    if isNew == True:
        need_data.to_csv("orders_filter.txt",index=None)
        isNew =False
    else:
        need_data.to_csv("orders_filter.txt",index=None,mode='a',header=None)

The code for first creating a file is different from that for appending records. You need to handle each detail carefully and the code becomes difficult.

SPL:

	A
1	=file(d:/orders.txt).cursor@tc()
2	=A1.select(state=="New York")
3	=A2.fetch()

SPL’s cursor mechanism hides the low-level details and significantly reduces coding difficulty and code amount. Obviously, SPL has great language unity and thus is able to provide the mechanism from the low level.

When the result set cannot fit into the memory, just modify A3 as:

=file("orders_filter.txt").export@tc(A2)

Thanks to the cursor mechanism, SPL does not need to write code for new file creation and data appending separately and manually, producing much shorter code.

Sorting

Pandas:

def parse_type(s):
    if s.isdigit():
        return int(s)
    try:
       res = float(s)
       return res
    except:
       return s
def pos_by(by,head,sep):
    by_num = 0
    for col in head.split(sep):
        if col.strip()==by:
            break
        else:
            by_num+=1
    return by_num
def merge_sort(directory,ofile,by,ascending=True,sep=","):
    with open(ofile,'w') as outfile:
        file_list = os.listdir(directory)
        file_chunk = [open(directory+"/"+file,'r') for file in file_list]
        k_row = [file_chunk[i].readline()for i in range(len(file_chunk))]
        by = pos_by(by,k_row[0],sep)
        outfile.write(k_row[0])
    k_row = [file_chunk[i].readline()for i in range(len(file_chunk))]
    k_by = [parse_type(k_row[i].split(sep)[by].strip())for i in range(len(file_chunk))]
    with open(ofile,'a') as outfile:
        while True:
            for i in range(len(k_by)):
                if i >= len(k_by):
                    break
                sorted_k_by = sorted(k_by) if ascending else sorted(k_by,reverse=True)
                if k_by[i] == sorted_k_by[0]:
		    outfile.write(k_row[i])
                    k_row[i] = file_chunk[i].readline()
		if not k_row[i]:
                    file_chunk[i].close()
		    del(file_chunk[i])
                    del(k_row[i])
                    del(k_by[i])
                else:
                    k_by[i] = parse_type(k_row[i].split(sep)[by].strip())
		    if len(k_by)==0:
			break
def external_sort(file_path,by,ofile,tmp_dir,ascending=True,chunksize=50000,sep=',',usecols=None,index_col=None):
    os.makedirs(tmp_dir,exist_ok=True)
    try:
        data_chunk = pd.read_csv(file_path,sep=sep,usecols=usecols,index_col=index_col,chunksize=chunksize)
        for chunk in data_chunk:
            chunk = chunk.sort_values(by,ascending=ascending)
            chunk.to_csv(tmp_dir+"/"+"chunk"+str(int(time.time()*10**7))+str(uuid.uuid4())+".csv",index=None,sep=sep)
        merge_sort(tmp_dir,ofile=ofile,by=by,ascending=ascending,sep=sep)
    except Exception:
        print(traceback.format_exc())
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
infile = "D:/orders.txt"
ofile = "D:/extra_sort_res_py.txt"
tmp = "D:/tmp"
external_sort(infile,'amount',ofile,tmp,ascending=True,chunksize=1000000,sep='\t')

Divide a large file into multiple segments, sort each segment separately and write sorting results to N temporary files respectively; open the N temporary files, retain an array having N members that respectively points to the current retrieval locations in the temporary files – during which the initial location is the first record; compare the N records pointed by the members of the array, write the smallest record i into the result file, and move the current retrieve location in the temporary file that i belongs to; continue to compare the N records until the sorting is completed. This is the merge algorithm commonly used for sorting a large file. The process is rather complex. As Pandas lacks convenient cursor mechanism, it turns to hardcoding to do the job, generating lengthy and unintelligible code.

SPL:

	A
1	=file("D:/orders.txt").cursor@tc()
2	=A1.sortx(amount)
3	=file("D:/extra_sort_res_py.txt").export@tc(A2)

The above SPL code also uses merge algorithm to sort a large file. With cursor mechanism, the complicated implementation details are hidden and we get a short piece of code.

There are other types of computations on large amounts of data, such as grouping & aggregation, join and getting intersection. Many are more complicated than sorting. The first step of grouping & aggregation is usually full sorting. The HASH heaping method is fast though it is more complicated. But Pandas does not support cursor due to low language unity and has to hard code the computation. This is extremely difficult. To perform comprehensive computations on large amount of data, you can basically count Pandas out. SPL’s high language unity and convenient cursor mechanism help generate code easily and concisely. To perform grouping & aggregation that has a large result set:

	A
1	=file(file_path).cursor@tc()
2	=A1.groupx(key;sum(coli):total)
3	=file(out_file).export@tc(A2)

A comprehensive computation, such as getting the three orders with the largest amounts for each product:

	A
1	=file(file_path).cursor@tc()
2	=A1.groups(product;top(3; -amt):three)
3	=A2.conj(three)

Though having a large number of library function, Pandas cannot obtain Python’s support from low level because it is not an integral part of the Python system. This results from its low language unity and limits it to simple data preparation task handling, making the more general scenarios beyond its ability. esProc SPL has a high level of language unity and specialized structured data types. It can implement general data preparation work, including irregular data source parsing, multilevel data representation, complicated structured data computations, and computations on large amounts of data, using concise and intuitive code.

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/ydhVnFH9

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

mars • 316 View • 8 Months ago

Data Preparation Script: Python Pandas OR esProc SPL?

Language features

Programming paradigm

Language unity

Working mode

IDE

Learning threshold

Code amount

Data sources

Supported data source types

Database read/write

Reading and writing text files

Multilevel data handling

Structured data object

Generation

Data access

Data maintenance

Structured data processing

Computing functions

Calculating Link relative ratio

Calculating installment loans

Grouping by job tenure

Computing data that cannot fit into the memory

Aggregation

Filtering

Sorting

ToC