Python vs. SPL 4-- Selection and Positioning
Python vs. SPL 4-- Selection and Positioning
Selecting subset from a set is a very common operation, for example, selecting members who are more than 40 years old from all the members in the company. In this article, we’ll compare the selection operations between Python and SPL.
Select the member at a specified position
Based on the employee information table of the company, select the employees in even positions.
Python
import pandas as pd |
Import pandas |
The index of Python starts from 0, so the row indexes (i.e., sequence number) in even positions are odd numbers. We use the query(...) function to select the members whose row indexes are odd numbers (i.e., the even positions). But we’ll have to create a list of even positions and then select the required members using the iloc function if the row indexes are not numbers or ordered indexes.
SPL
A |
B |
|
1 |
D:\data\EMPLOYEE.csv |
|
2 |
=file(A1).import@tc() |
/load the employee information table |
3 |
=A2.select(#%2==0) |
/the even positions |
In A.select(...) function, we can use symbols “~” and “#” to indicate the current member and its position respectively. Here we use #%2==0 to indicate the elements in even positions.
Select the eligible members
Selecting the members that satisfy the conditions is the most common selection operation, and here we still illustrate with the employee information table.
To query the full name and salary of the employees of the R&D department in New York State.
Python
#continue to use the emp table of the previous example RD_NY_emp=emp.query('DEPT=="R&D"&STATE=="New York"') |
Select according to conditions
The full name of eligible members The salary of eligible members Generate Dataframe |
In simple logic, Python uses the query(...) function to select the members that satisfy the conditions. But it is a bit troublesome to generate a target table based on the selection results.
SPL
A |
B |
|
… |
/A2 is the employee information table of the previous example |
|
6 |
=A2.select(DEPT=="R&D"&&STATE=="New York") |
/select the eligible members |
7 |
=A6.new(NAME+" "+SURNAME:FULLNAME,SALARY) |
/create a new table sequence consisting of FULLNAME and SALARY |
In SPL, we use the same A.select() function to select by position or by condition, and the operation of generating the target table is also very easy.
Select the positions/indexes of eligible members
When performing the selection, we want to get the positions or indexes of the eligible members and then use the information to execute other operations.
For example, to select the positions or indexes of the employees who are over 50 years old in the employee information table.
Python
#continue to use the emp table of the previous example import numpy as np |
Import numpy Import datetime |
In Python, there is no ready-made functions for calculating age, which can only be completed with the help of numpy and datetime libraries, besides, math library should also be used to calculate integers. These requirements are already very messy, what’s worse, Pandas library is based on numpy and some functions are not fully inherited. For example, Pandas have the pd.Timedelta()function which calculates the time difference, but only the options of month, day, hour, minute, and second are available; the year option is not included. In addition, the anywhere() function can only be used on the converted data structure of “array”. Therefore, the simple conditional filtering requires multiple transition steps, making it very burdensome.
SPL
A |
B |
|
… |
/A2 is the employee information table of the previous example |
|
9 |
=A2.pselect@a(age(BIRTHDAY)>=50) |
/select the positions of eligible members |
The age()function in SPL is used to calculate the age, which is much more convenient. Also, in accordance to the select function, SPL provides the pselect@a() function to return the positions of the eligible members, which can be achieved with just one line of code. Many selection functions in SPL have their corresponding functions of returning positions, for example, pmax(), pmin(), ptop(), etc.
Select the member with maximum/minimum value
Sometimes we also want to query the full information of the members with extreme values. For example:
To query the information of the employee with the highest salary.
Python
#continue to use the emp table of the previous example #select the member with maximum value (one member)
#select the maximum members (multiple members) |
|
When selecting a member with maximum/minimum value, Python uses the argmax()or argmin() function to get the index of the target value and then select the member according to the index, which is more efficient. However, the argmax()and argmin() functions do not return all the indexes of the same value, so we need to traverse the data twice (the first traversal calculates the maximum and minimum values; the second one fetches the values) if more than one members with maximum/minimum values need to be returned. In this way, the efficiency is degraded by half, and the logic of coding is different from the previous one, making it hard to memorize.
SPL
A |
B |
|
… |
/A2 is the employee information table of the previous example |
|
11 |
=A2.maxp(SALARY) |
/select one member with the maximum value |
12 |
=A2.maxp@a(SALARY) |
/select all the members with the maximum value |
SPL provides maxp()and minp() functions to return the member with maximum/minimum value, and @a option is used to return all the members with target value. Both functions traverse the data once to return the results and are very easy to memorize in terms of the syntax.
Many functions in SPL have three forms: selecting the value itself such as max/min; selecting the members corresponding to target values such as maxp/minp; selecting the positions corresponding to target values such as pmax/pmin. All of them are in the same style and very easy to memorize.
For instance, to query the position of the employee with highest salary in the employee information table.
=A2.pmax@a(SALARY). This simple code is actually competent to get the result.
Select topN members
Selecting topN members is also a common operation, for example:
To query the information of the first 10 employees joining the company.
Python
#continue to use the emp table of the previous table work_year_sort=emp.sort_values(by="HIREDATE") |
Sort by time of joining the company |
The topN operation in Python first sorts the data and selects the top N members, which is acceptable for small data amount. But this method will be less efficient with large data amount.
SPL
A |
B |
|
… |
/A2 is the employee information table of the previous example |
|
14 |
=A2.top(10;HIREDATE) |
/select the top 10 members joining the company |
The top()function in SPL is an aggregation function, which is similar to the max() function, but max()function keeps one maximum value while top() function keeps a sequence of topN. The later function dose not sort the whole data, thus decreasing the extra consumption of sorting.
Similarly, we can use the ptop() function to query the positions of topN members.
Summary
Some basic selections in Python are relatively easy, such as position selection and conditional selection; some selection operations are a bit convoluted, such as selecting the eligible positions and maximum/minimum values; and some are too ineffective such as topN selection. It is also quite complicated to switch between Pandas and Numpy libraries frequently.
However, SPL provides abundant selection and positioning functions of an uniform style, making it very easy to complete the selection and positioning operations whether they are simple or complex.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version