3.3 Ordered grouping
3.3.1 Grouping by position
Put every 10 employees into one group and take the mode of their surnames as the surname of the group.
SPL
A | |
---|---|
1 | =file(“EMPLOYEE.csv”).import@tc() |
2 | =A1.group((#-1)\10).new(#:GID,~.mode(SURNAME):SURNAME) |
SPL uses ‘#’ to get the grouping key, eliminating the need to calculate a separate column of grouping keys.
SQL
SELECT GRP, SURNAME
FROM (
SELECT CEIL(rownum/10) AS grp, SURNAME, COUNT(*) AS count,
ROW_NUMBER() OVER (PARTITION BY CEIL(rownum/10) ORDER BY COUNT(*) DESC) AS rn
FROM EMPLOYEE
GROUP BY CEIL(rownum/10), SURNAME)
WHERE rn = 1;
Python
emp=pd.read_csv('../EMPLOYEE.csv')
pos_seq=[i//10foriinrange(len(emp))]
group_surname=emp.groupby(pos_seq)['SURNAME'].apply(lambdax:x.value_counts().idxmax())
Python is to get a calculated column, and then group by the calculated column.
3.3.2 Grouping by sequence number
Divide the employees into 10 groups and take the mode of the surnames in each group as the surname of group.
SPL
A | |
---|---|
1 | =file(“EMPLOYEE.csv”).import@tc() |
2 | =A1.group@n(#%10+1).new(#:GID,~.mode(SURNAME):SURNAME) |
The @n option means grouping by sequence number. In this case, it is necessary to ensure that the grouping key values are natural number and start from 1, which can improve the grouping efficiency.
SQL
SQL does not provide a similar method to improve grouping efficiency.
Python
Python does not provide a similar method to improve grouping efficiency.
3.3.3 Grouping by changed value
Find the sequence with consecutive numbers in a sequence.
SPL
A | B | |
---|---|---|
1 | [7,5,5,7,13,15,16,17,13,5,6,7,8,17,6,10,11,11,18,15] | |
2 | =A1.group@o(~-#).select(~.len()>1) | /[[15,16,17],[5,6,7,8],[10,11]] |
The groups@o() function scans the entire sequence in turn. When the grouping key value of current member is the same as that of previous member, the current member is added to the current grouping subset. If the grouping key value changes, create a new grouping subset.
SQL
SQL doesn’t provide a ready-made method to get this task done.
Python
lst=[7,5,5,7,13,15,16,17,13,5,6,7,8,17,6,10,11,11,18,15]
diffs=[j-ifori,jinenumerate(lst)]
groups=pd.Series(diffs).diff().ne(0).cumsum()
g_list=pd.Series(lst).groupby(groups).agg(list)
result=g_list[g_list.apply(lambdax:len(x)>1)].tolist()
When using Python, it still requires much brainpower to construct a derived column to group.
3.3.4 Grouping by changed condition
Calculate the maximum number of days that a stock keeps rising.
SPL
A | |
---|---|
1 | =file(“STOCK.csv”).import@tc() |
2 | =A1.group@i(CLOSING<=CLOSING[-1]).max(~.len())-1 |
The grouping key of the group@i() function is an expression, which will create a new grouping subset whenever the calculation result is true, i.e., when a certain condition is met, a new group is created. Since the first record of each group does not rise, we need to subtract 1 at the end.
SQL
SQL does not provide a ready-made method to get this task done.
Python
stock = pd.read_csv('../STOCK.csv')
stock['UP'] = stock['CLOSING'].diff() > 0
up_streak = stock.groupby((~stock['UP']).cumsum())['UP'].cumsum().where(stock['UP'], 0)
max_streak = up_streak.max()
Python still uses the old method of constructing a derived column to get the task done.
3.4 Inverse grouping
Example codes for comparing SPL, SQL, and Python
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL