Python vs. SPL 4-- Selection and Positioning

 

Python vs. SPL 4-- Selection and Positioning

Selecting subset from a set is a very common operation, for example, selecting members who are more than 40 years old from all the members in the company. In this article, we’ll compare the selection operations between Python and SPL.

 

Select the member at a specified position

 

Based on the employee information table of the company, select the employees in even positions.

..

 

Python

 

import pandas as   pd
file="D:\data\EMPLOYEE.csv"
emp=pd.read_csv(file)
even_emp=emp.query('index%2==1')
print(even_emp)

Import pandas
The path of employee information table
Load the employee information
Use the row index to select the even positions
  
  

 

The index of Python starts from 0, so the row indexes (i.e., sequence number) in even positions are odd numbers. We use the query(...) function to select the members whose row indexes are odd numbers (i.e., the even positions). But we’ll have to create a list of even positions and then select the required members using the iloc function if the row indexes are not numbers or ordered indexes.

 

SPL


A

B

1

D:\data\EMPLOYEE.csv


2

=file(A1).import@tc()

/load the employee information table

3

=A2.select(#%2==0)

/the even positions

 

In A.select(...) function, we can use symbols “~” and “#” to indicate the current member and its position respectively. Here we use #%2==0 to indicate the elements in even positions.

 

Select the eligible members

 

Selecting the members that satisfy the conditions is the most common selection operation, and here we still illustrate with the employee information table.

 

To query the full name and salary of the employees of the R&D department in New York State.

 

Python

#continue to use the emp table of the previous example

RD_NY_emp=emp.query('DEPT=="R&D"&STATE=="New York"')
name=RD_NY_emp["NAME"]+""+RD_NY_emp["SURNAME"]
name.rename("FULLNAME",inplace=True)
salary=RD_NY_emp["SALARY"]
name_salary=pd.concat([name,salary],axis=1)
print(name_salary)

 


Select according to conditions

 

The full name of eligible members
Name the Seires with full name

The salary of eligible members

Generate Dataframe
   

 

In simple logic, Python uses the query(...) function to select the members that satisfy the conditions. But it is a bit troublesome to generate a target table based on the selection results.

 

SPL


A

B

/A2 is the employee information table of the previous example


6

=A2.select(DEPT=="R&D"&&STATE=="New  York")

/select the eligible members

7

=A6.new(NAME+" "+SURNAME:FULLNAME,SALARY)

/create a new table sequence consisting of FULLNAME and SALARY

 

In SPL, we use the same A.select() function to select by position or by condition, and the operation of generating the target table is also very easy.

 

Select the positions/indexes of eligible members

 

When performing the selection, we want to get the positions or indexes of the eligible members and then use the information to execute other operations.

 

For example, to select the positions or indexes of the employees who are over 50 years old in the employee information table.

 

Python

#continue to use the emp table of the previous example

import numpy as np
import datetime
import math
emp["BIRTHDAY"]=pd.to_datetime(emp["BIRTHDAY"])
age=(datetime.datetime.now()-emp["BIRTHDAY"])/np.timedelta64(1,'Y')
age=age.apply(math.floor)
np_age=np.array(age)
age_50=np.argwhere(np_age>=50)[:,0]
print(age_50)


Import numpy

Import datetime
Import math
Convert the data type of date
Calculate the age
Round down the integers
Convert Series to array
Calculate the position (index)
  
  

 

In Python, there is no ready-made functions for calculating age, which can only be completed with the help of numpy and datetime libraries, besides, math library should also be used to calculate integers. These requirements are already very messy, what’s worse, Pandas library is based on numpy and some functions are not fully inherited. For example, Pandas have the pd.Timedelta()function which calculates the time difference, but only the options of month, day, hour, minute, and second are available; the year option is not included. In addition, the anywhere() function can only be used on the converted data structure of “array”. Therefore, the simple conditional filtering requires multiple transition steps, making it very burdensome.

 

SPL


A

B

/A2 is the employee information table of the previous example


9

=A2.pselect@a(age(BIRTHDAY)>=50)

/select the positions of eligible members

 

The age()function in SPL is used to calculate the age, which is much more convenient. Also, in accordance to the select function, SPL provides the pselect@a() function to return the positions of the eligible members, which can be achieved with just one line of code. Many selection functions in SPL have their corresponding functions of returning positions, for example, pmax(), pmin(), ptop(), etc.

 

Select the member with maximum/minimum value

 

Sometimes we also want to query the full information of the members with extreme values. For example:

 

To query the information of the employee with the highest salary.

 

Python

#continue to use the emp table of the previous example

#select the member with maximum value (one member)
max_salary_idx=emp.SALARY.argmax()
max_salary_emp=emp.loc[max_salary_idx]
print(max_salary_emp)

 

#select the maximum members (multiple members)
max_salary=emp["SALARY"].max()
max_salary_emp=emp[emp.SALARY==max_salary]
print(max_salary_emp)


  
The index of the member with highest salary
The member with highest salary
  
  
The highest salary
Select according to the highest salary
  
  

 

When selecting a member with maximum/minimum value, Python uses the argmax()or argmin() function to get the index of the target value and then select the member according to the index, which is more efficient. However, the argmax()and argmin() functions do not return all the indexes of the same value, so we need to traverse the data twice (the first traversal calculates the maximum and minimum values; the second one fetches the values) if more than one members with maximum/minimum values need to be returned. In this way, the efficiency is degraded by half, and the logic of coding is different from the previous one, making it hard to memorize.

 

SPL


A

B

/A2 is the employee information table of the previous example


11

=A2.maxp(SALARY)

/select one member with the maximum value

12

=A2.maxp@a(SALARY)

/select all the members with the maximum value

 

SPL provides maxp()and minp() functions to return the member with maximum/minimum value, and @a option is used to return all the members with target value. Both functions traverse the data once to return the results and are very easy to memorize in terms of the syntax.

 

Many functions in SPL have three forms: selecting the value itself such as max/min; selecting the members corresponding to target values such as maxp/minp; selecting the positions corresponding to target values such as pmax/pmin. All of them are in the same style and very easy to memorize.

 

For instance, to query the position of the employee with highest salary in the employee information table.

 

=A2.pmax@a(SALARY). This simple code is actually competent to get the result.

 

Select topN members

 

Selecting topN members is also a common operation, for example:

 

To query the information of the first 10 employees joining the company.

 

Python

#continue to use the emp table of the previous table

work_year_sort=emp.sort_values(by="HIREDATE")
work_year_top10=work_year_sort.iloc[:10]
print(work_year_top10)

 

 

Sort by time of joining the company
Select the top 10  

 

The topN operation in Python first sorts the data and selects the top N members, which is acceptable for small data amount. But this method will be less efficient with large data amount.

 

SPL


A

B

/A2 is the employee information table of the previous example


14

=A2.top(10;HIREDATE)

/select the top 10 members joining the company

 

The top()function in SPL is an aggregation function, which is similar to the max() function, but max()function keeps one maximum value while top() function keeps a sequence of topN. The later function dose not sort the whole data, thus decreasing the extra consumption of sorting.

 

Similarly, we can use the ptop() function to query the positions of topN members.

 

Summary

 

Some basic selections in Python are relatively easy, such as position selection and conditional selection; some selection operations are a bit convoluted, such as selecting the eligible positions and maximum/minimum values; and some are too ineffective such as topN selection. It is also quite complicated to switch between Pandas and Numpy libraries frequently.

 

However, SPL provides abundant selection and positioning functions of an uniform style, making it very easy to complete the selection and positioning operations whether they are simple or complex.