3.1 Equivalence grouping

 

2.5 Loop to generate and execute


1. Grouping & single aggregation on single column - Calculate the number of employees in each department

2. Group & COUNT DISTINCT - Calculate how many states the employees in each department are from

3. Grouping & multiple aggregations on single column - Calculate the number of employees, the average salary and the maximum age of each department

4. Grouping & single aggregation on multiple columns - Calculate the average salary of employees in each department and each state

5. Grouping & multiple aggregations on multiple columns - Calculate the average salary, maximum salary and average age of employees in each department and each state

6. Grouping & no aggregation - Find employees with the same birthday (excluding the year)

7. Grouping & complex aggregation - Calculate the salary difference between the oldest and youngest employees in each department

SPL

A B
1 =file(“EMPLOYEE.csv”).import@tc()
2 =A1.groups(DEPT;count(1):emp_no) /Grouping & single aggregation on single column
3 =A1.groups(DEPT;icount(STATE)) /Group & COUNT DISTINCT
4 =A1.groups(DEPT;count(1):emp_no,avg(SALARY):avg_salary,max(age(BIRTHDAY)):max_age) /Grouping & multiple aggregations on single column
5 =A1.groups(DEPT,STATE;avg(SALARY):avg_salary) /Grouping & single aggregation on multiple columns
6 =A1.groups(DEPT,STATE;avg(SALARY):avg_salary,max(SALARY):max_salary,avg(age(BIRTHDAY)):avg_age) /Grouping & multiple aggregations on multiple columns
7 =A1.group(month(BIRTHDAY),day(BIRTHDAY)).select(~.len()>1).conj() /Grouping & no aggregation
8 =A1.group(DEPT;(ma=~.minp(BIRTHDAY),mi=~.maxp(BIRTHDAY), ma.SALARY-mi.SALARY):SALARY_DIF) /Grouping & complex aggregation

SQL

1. Grouping & single aggregation on single column

SELECT DEPT, COUNT(*) AS NUM_EMPLOYEES
FROM EMPLOYEE
GROUP BY DEPT;

2. Group & COUNT DISTINCT

SELECT DEPT, COUNT(DISTINCT STATE) AS NUM_STATES
FROM EMPLOYEE
GROUP BY DEPT;

3. Grouping & multiple aggregations on single column

SELECT DEPT, COUNT(*) AS NUM_EMPLOYEES, AVG(SALARY) AS AVG_SALARY, 
    MAX(TRUNC(MONTHS_BETWEEN(SYSDATE, BIRTHDAY)/12)) AS MAX_AGE
FROM EMPLOYEE
GROUP BY DEPT;

4. Grouping & single aggregation on multiple columns

SELECT DEPT, STATE, AVG(SALARY) AS AVG_SALARY
FROM EMPLOYEE
GROUP BY DEPT, STATE;

5. Grouping & multiple aggregations on multiple columns

SELECT DEPT, STATE, AVG(SALARY) AS AVG_SALARY, MAX(SALARY) AS MAX_SALARY, 
    AVG(TRUNC(MONTHS_BETWEEN(SYSDATE, BIRTHDAY)/12)) AS AVG_AGE
FROM EMPLOYEE
GROUP BY DEPT, STATE;

6. Grouping & no aggregation

SQL doesnt have ‘Grouping & no aggregation’, so it has to adopt an indirect method.

7. Grouping & complex aggregation

WITH MAX_AGE_EMPLOYEE AS (
SELECT DEPT, BIRTHDAY, SALARY
FROM (
    SELECT DEPT,BIRTHDAY,SALARY, 
        ROW_NUMBER() OVER (PARTITION BY DEPT ORDER BY BIRTHDAY ASC) AS RN
    FROM EMPLOYEE)
WHERE RN = 1), 
MIN_AGE_EMPLOYEE AS (
SELECT DEPT, BIRTHDAY, SALARY
FROM (
    SELECT DEPT,BIRTHDAY,SALARY, 
        ROW_NUMBER() OVER (PARTITION BY DEPT ORDER BY BIRTHDAY DESC) AS RN
    FROM EMPLOYEE)
WHERE RN = 1)
SELECT MAX_AGE_EMPLOYEE.DEPT, 
    MAX_AGE_EMPLOYEE. SALARY - MIN_AGE_EMPLOYEE.SALARY AS SALARY_DIFF
FROM MAX_AGE_EMPLOYEE
JOIN MIN_AGE_EMPLOYEE ON MAX_AGE_EMPLOYEE.DEPT = MIN_AGE_EMPLOYEE.DEPT;

Python

emp = pd.read_csv('../EMPLOYEE.csv')
# Grouping & single aggregation on single column
dept_counts = emp['DEPT'].value_counts()
#Group & COUNT DISTINCT
dept_state_counts = emp.groupby('DEPT')['STATE'].nunique()
#Grouping & multiple aggregations on single column
dept_stats = emp.groupby('DEPT').agg(
    count=('DEPT', 'count'),
    avg_salary=('SALARY', 'mean'),
    max_age=('BIRTHDAY', lambda x: 
        (pd.Timestamp('now') - pd.to_datetime(x)).astype('<m8[Y]').max()))
#Grouping & single aggregation on multiple columns
dept_state_salary = emp.groupby(['DEPT', 'STATE'])['SALARY'].mean()
#Grouping & multiple aggregations on multiple columns
dept_state_stats = emp.groupby(['DEPT', 'STATE']).agg(
    avg_salary=('SALARY', 'mean'),
    max_salary=('SALARY', 'max'),
    avg_age=('BIRTHDAY', lambda x: 
        (pd.Timestamp('now') - pd.to_datetime(x)).astype('<m8[Y]').mean()))
birthday_md=pd.to_datetime(emp['BIRTHDAY']).dt.strftime('%m-%d').sort_values()
#Grouping&noaggregation
emp_same_birthday=emp.loc[birthday_md.index]
    [emp.groupby(birthday_md).BIRTHDAY.transform('count')>1]
#Grouping&complexaggregation

def salary_diff(g):
    max_age = g['BIRTHDAY'].idxmin()
    min_age = g['BIRTHDAY'].idxmax()
    diff = g.loc[max_age]['SALARY']-g.loc[min_age]['SALARY']
    return diff
emp['BIRTHDAY']=pd.to_datetime(emp['BIRTHDAY'])
salary_diff=emp.groupby('DEPT').apply(salary_diff) 

3.2 Non-equivalence grouping
Example codes for comparing SPL, SQL, and Python