3.3 Ordered grouping

 

3.2 Non-equivalence grouping


3.3.1 Grouping by position

Put every 10 employees into one group and take the mode of their surnames as the surname of the group.

SPL

A
1 =file(“EMPLOYEE.csv”).import@tc()
2 =A1.group((#-1)\10).new(#:GID,~.mode(SURNAME):SURNAME)

SPL uses ‘#’ to get the grouping key, eliminating the need to calculate a separate column of grouping keys.

SQL

SELECT GRP, SURNAME
FROM (
    SELECT CEIL(rownum/10) AS grp, SURNAME, COUNT(*) AS count,
        ROW_NUMBER() OVER (PARTITION BY CEIL(rownum/10) ORDER BY COUNT(*) DESC) AS rn
    FROM EMPLOYEE
    GROUP BY CEIL(rownum/10), SURNAME)
WHERE rn = 1;

Python

emp=pd.read_csv('../EMPLOYEE.csv')
pos_seq=[i//10foriinrange(len(emp))]
group_surname=emp.groupby(pos_seq)['SURNAME'].apply(lambdax:x.value_counts().idxmax())

Python is to get a calculated column, and then group by the calculated column.

3.3.2 Grouping by sequence number

Divide the employees into 10 groups and take the mode of the surnames in each group as the surname of group.

SPL

A
1 =file(“EMPLOYEE.csv”).import@tc()
2 =A1.group@n(#%10+1).new(#:GID,~.mode(SURNAME):SURNAME)

The @n option means grouping by sequence number. In this case, it is necessary to ensure that the grouping key values are natural number and start from 1, which can improve the grouping efficiency.

SQL

SQL does not provide a similar method to improve grouping efficiency.

Python

Python does not provide a similar method to improve grouping efficiency.

3.3.3 Grouping by changed value

Find the sequence with consecutive numbers in a sequence.

SPL

A B
1 [7,5,5,7,13,15,16,17,13,5,6,7,8,17,6,10,11,11,18,15]
2 =A1.group@o(~-#).select(~.len()>1) /[[15,16,17],[5,6,7,8],[10,11]]

The groups@o() function scans the entire sequence in turn. When the grouping key value of current member is the same as that of previous member, the current member is added to the current grouping subset. If the grouping key value changes, create a new grouping subset.

SQL

SQL doesn’t provide a ready-made method to get this task done.

Python

lst=[7,5,5,7,13,15,16,17,13,5,6,7,8,17,6,10,11,11,18,15]
diffs=[j-ifori,jinenumerate(lst)]
groups=pd.Series(diffs).diff().ne(0).cumsum()
g_list=pd.Series(lst).groupby(groups).agg(list)
result=g_list[g_list.apply(lambdax:len(x)>1)].tolist()

When using Python, it still requires much brainpower to construct a derived column to group.

3.3.4 Grouping by changed condition

Calculate the maximum number of days that a stock keeps rising.

SPL

A
1 =file(“STOCK.csv”).import@tc()
2 =A1.group@i(CLOSING<=CLOSING[-1]).max(~.len())-1

The grouping key of the group@i() function is an expression, which will create a new grouping subset whenever the calculation result is true, i.e., when a certain condition is met, a new group is created. Since the first record of each group does not rise, we need to subtract 1 at the end.

SQL

SQL does not provide a ready-made method to get this task done.

Python

stock = pd.read_csv('../STOCK.csv')
stock['UP'] = stock['CLOSING'].diff() > 0
up_streak = stock.groupby((~stock['UP']).cumsum())['UP'].cumsum().where(stock['UP'], 0)
max_streak = up_streak.max()

Python still uses the old method of constructing a derived column to get the task done.


3.4 Inverse grouping
Example codes for comparing SPL, SQL, and Python