Python vs. SPL 6 -- Equivalence Grouping

 

Python vs. SPL 6 -- Equivalence Grouping

When there are too many things, we usually classify those things into various groups and then perform aggregation operation on them. For example, calculations such as querying the highest score of each class or the average age of employees in each department of the company are called grouping operation which is often followed by the subsequent aggregation operation. The most common grouping is to split members with the same attribute value into one group, which is also known as equivalence grouping. In this article, we’ll compare the calculating abilities of Python and SPL in equivalence grouping and aggregation.

 

Grouping & aggregation on single column

The most common grouping operation is to group the data by a certain column and then aggregate them, for example:

Calculate the number of employees, the average salary and the maximum age of each department

The employee information table of the company is as follows:

undefined 

Python

import pandas as   pd

import datetime

import numpy as np

import math

def max_age(s):

    earliest_birth=s.min()

    today = datetime. datetime.today()

    age=math.floor((today-   earliest_birth)/np.timedelta64(1,'Y'))

    return age

file="D:\data\EMPLOYEE.csv"

emp=pd.read_csv(file)

emp['BIRTHDAY']=pd.to_datetime(emp['BIRTHDAY'])

dept_agg = emp.groupby('DEPT',as_index=False).agg({'SALARY':['count','mean'],'BIRTHDAY':max_age})

print(dept_agg)

Import various library functions

 

 

 

 

The function for calculating the maximum age

 

 

 

 

Load the employee information

Convert the data type of date

Group and aggregate

 

 

 

The groupby()function in Pandas is to group Dataframe to generate a grouping object. The “object” is difficult to understand and we cannot figure out what is in it. The agg() function is an aggregation function, in which we can define which aggregation operation(s) is(are) performed on which column in the form of dictionary. We can also use the user-defined functions when there is no ready-made function, for instance, the count and mean functions in the example are ready-made functions and the max_age is a user-defined function. This merit is quite programmer-friendly, but it is also worth noting that the input of user-define function is a Series i.e., the BIRTHDAY of each group, which requires us to think deeply about the principle of grouping and aggregation, otherwise we may be very confused. While the most common grouping and aggregation operation is probably just one aggregation, which is very easy to write in Python. For example, to calculate the average salary of each department:

dept_salary= emp.groupby('DEPT',as_index=False).SALARY.mean().

 

SPL


A

B

1

D:\data\EMPLOYEE.csv

/the path of the employee information table

2

=file(A1).import@tc()

/load the data

3

=A2.groups(DEPT;count(SALARY):NUM,avg(SALARY):AVG_SALARY,

max(age(BIRTHDAY)):MAX_AVG)

/group and aggregate

 

The group()function in SPL performs the grouping and aggregation operations, in which the fields before and after “;” are used to group and aggregate respectively. In this function, not only the field for aggregation operation is very distinct, but also we can rename the field for aggregated result, for example, count(SALARY):NUM is to name the result of SALARY count as NUM to generate a new table sequence. The form of the function stays the same for only one aggregation, for instance, to calculate the average salary of each department: A3=A2.groups(DEPT;avg(SALARY): AVG_SALARY), which does not vary as much as Python does.

 

Grouping & aggregation on multiple columns

 

Sometimes we want to group the data by multiple columns, for example:

Calculate the average salary of male and female employees in each department

Python

#continue to use the employee information emp

avg_salary= emp.groupby(['DEPT','GENDER'],as_index=False).SALARY.mean()

print(avg_salary)

 

 

Group and aggregate by multiple columns

 

The grouping key in groupby()function supports the form of sequence, like ['DEPT','GENDER'], and the subsequent aggregation operation is the same.

 

SPL


A

B

/A2 is the employee information 


5

=A2.groups(DEPT,GENDER;avg(SALARY):AVG_SALARY)

/group and aggregate by multiple columns

 

The groups() function in SPL does not require the multi-column grouping keys to be input in the form of sequence. The form can be the same as long as we place “,” before “;” to separate both.

 

Grouping & no aggregation

The grouping result is not always forced to be aggregated because sometimes we are interested in the grouped subsets. For example:

Sort the employees of each department in order of entry time from the earliest to the latest

Python

#continue to use emp

emp_sort=emp.groupby('DEPT',as_index=False).apply(lambda   x:x.sort_values('HIREDATE')).reset_index(drop=True)

print(emp_sort)

 

Group and sort

 

 

After grouping, Python can use the apply+lambda expression to perform sorting on each group, and the reset_index() function at the end is to reset the index. The apply+lambda expression is previously introduced in Python vs. SPL 3 -- Loop Function, and here it is used in the same way.

 

SPL


A

B

/A2 is the employee information 


7

=A2.group(DEPT).conj(~.sort(HIREDATE))

/group and sort

 

The group()function in SPL only groups but does not aggregate, and the return result is the set of a set, which is in accordance with the essence of grouping operation. Unlike Python, SPL will never return an intricate object. Also, since the return results are sets, we can process them using the loop function: sort the subsets separately and then union them to get the result of grouping and sorting. What’s more, the group() function returns subsets, which is able to achieve all the functionality of groups()function theoretically, then why do we design a groups() function? The reason is that groups() calculates the aggregation value as an iterative function, during which it does not keep each grouped subset, in this way, the operation is more efficient and takes up less storage space.

 

Grouping & complex aggregation

Sometimes we want to get a certain aggregation result of the grouped subsets, but the operation is difficult to perform and can not be achieved with a simple aggregation function. Therefore, we need to keep the grouped subsets for further calculations. For example:

Calculate the salary difference between the oldest and youngest employees in each department

Python

#continue to use emp

def salary_diff(g):

    max_age = g['BIRTHDAY'].idxmin()

    min_age = g['BIRTHDAY'].idxmax()

    diff = g.loc[max_age]['SALARY']-g.loc[min_age]['SALARY']

    return diff

emp['BIRTHDAY']=pd.to_datetime(emp['BIRTHDAY'])

salary_diff=emp.groupby('DEPT').apply(salary_diff)

print(salary_diff)

 

The function for calculating the salary difference

The index of the oldest employee

The index of the youngest employee

The salary difference

 

Convert the data type of time

Group and complex aggregate

 

Python uses a user-defined function to perform the complex aggregation operation, and loops through each group using apply expression to get the final result. Here the input of the user-defined function is the members of grouped subsets, which is not easy to think of.

 

SPL


A

B

/A2 is the employee information


9

=A2.group(DEPT;(ma=~.minp(BIRTHDAY),mi=~.maxp(BIRTHDAY),

ma.SALARY-mi.SALARY):SALARY_DIF)

/group and complex aggregate

 

SPL uses the group()function to get the grouped subsets, and minp() and maxp() functions return the members with minimum and maximum values by computing the to-be-calculated fields directly. The operation is done without any letup and fits perfectly with the logic of thinking. The code is similar to the groups() function, which only looks simple in this form, but essentially it is:

A9=A2.group(DEPT).new(DEPT,(ma=~.minp(BIRTHDAY),mi=~.maxp(BIRTHDAY),ma.SALARY-mi.SALARY):SALARY_DIF)

 

And the principle behind it is the same as group()+conj() in the previous example, which processes the grouped subsets.

 

Group & UNIQUE count

During data analysis, we may also encounter the situation that counts the unique members after grouping. For example:

Calculate how many states the employees in each department are from

Python

#continue to use emp

dept_state=emp.groupby('DEPT',as_index=False).agg({"STATE":pd.Series.nunique})

print(dept_state)

 

Group and count the number

 

 

There is a function for counting unique members in Python, that is, pd.Series.nunique, which executes in the same way as the grouping and aggregation operation introduced at the beginning of the article.

 

SPL


A

B

/A2 is the employee information


11

=A2.groups(DEPT;icount(STATE))

/group and count the number

 

The icount() function in SPL returns the number of unique values in each group, which is equivalent to performing another grouping on the groups. But the grouping operation in the groups only counts the number without any other aggregation.

 

Summary

The grouping and aggregation operations in Python are relatively abundant and have its own system. The groupby+agg expression can achieve various aggregation operations; groupby+apply expression can process the grouped subsets, but the generated grouping object after executing the groupby function is quite difficult to understand. It is neither Series nor Dataframe and we have no idea about what kind of data structure it is, which has a great impact on the following agg or apple operation because we don’t know what is the input. Then we have to practice more to get to understand the “object” generated by the groupby function.

 

While SPL performs the operations following the normal logic: the grouped result is a subset and the whole result is a set of the set. The subsequent processing can be done according to the normal syntax without the need to figure out anything else, which makes it very easy to adapt to other situations. In addition, the groups() function can further improve the efficiency of grouping and aggregation operations.