Python vs. SPL 8 -- Ordered Grouping

 

Order-related operations are naturally of interest, and grouping may involve order as well. This article compares the abilities of Python and SPL in ordered grouping.

Grouping by positions

Sometimes the positions of members take part in the grouping, that is, the grouping keys are derived from the positions (sequence numbers) of the members rather than from their values. For example:

 

Here are the data:

time    a      b
0       0.5    -2.0
1       0.5    -2.0
2       0.1    -1.0
3       0.1    -1.0
4       0.1    -1.0
5       0.5    -1.0
6       0.5    -1.0
7       0.5    -3.0
8       0.5    -1.0

 

It is expected that every three rows form one group and the mode of each column is taken as that group's result (when several values tie for the mode, the largest is kept, which for the time column yields the last time of the group). The expected result is as follows:

time    a      b
2       0.5    -2.0
5       0.1    -1.0
8       0.5    -1.0

 

Python

import pandas as pd
file1="D:/data/position.txt"
data1=pd.read_csv(file1,sep="\t")
pos_seq=[i//3 for i in range(len(data1))]
res1=data1.groupby(pos_seq).agg(lambda x:x.mode().iloc[-1])
print(res1)

 

 

Derive a key column from positions: the integer quotient of the sequence number divided by 3

Group and aggregate by the derived column

 

 

To group every three rows, we divide the sequence number by 3 and take the integer quotient as the grouping key, then group by it. To get the mode of each group, we use agg with a lambda expression, which is essentially an aggregation operation.
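For the nine sample rows, the derived key works out as follows (a minimal sketch that uses only the row count of the sample data):

pos_seq = [i // 3 for i in range(9)]
print(pos_seq)    # [0, 0, 0, 1, 1, 1, 2, 2, 2]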

 

SPL

     A                                                                   B
1    D:\data\position.txt
2    =file(A1).import@t()
3    =A2.group((#-1)\3).new(~.max(time):time,~.mode(a):a,~.mode(b):b)    /group and aggregate

 

In SPL, we can express the grouping key directly with "#" instead of computing an extra column of keys separately. Since the key here is an integer, the grouping can also be done with SPL's group@n() function. This function is an efficient grouping method for natural numbers: it locates the grouped subset directly by sequence number instead of comparing key values, so it runs much faster. The code can be changed slightly as:

A3=A2.group@n((#+2)\3)...
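A quick check of the key sequence, using Python's integer division // as a stand-in for SPL's \ operator and the 1-based row number #:

print([(i + 2) // 3 for i in range(1, 10)])    # [1, 1, 1, 2, 2, 2, 3, 3, 3]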

 

 

Grouping when values change

Sometimes the data are already ordered, and we only want to group adjacent records with the same value together. For example:

 

Here are the data:

    duration  location  user
0         10     house     A
1          5     house     A
2          5       gym     A
3          4       gym     B
4         10      shop     B
5          4       gym     B
6          6       gym     B

 

It is expected to sum "duration" over consecutive rows where "user" and "location" stay the same, starting a new sum whenever "location" changes. The expected result is:

   duration  location  user
         15     house     A
          5       gym     A
          4       gym     B
         10      shop     B
         10       gym     B

 

Python

file2="D:/data/value.txt"
data2=pd.read_csv(file2,sep="\t")
value_seq=(data2.location!=data2.location.shift()).cumsum()
res2=data2.groupby(['user','location',value_seq], as_index=False, sort=False)['duration'].sum()
print(res2)

 

Derive a key column: its value stays the same while adjacent location values are equal and increases by 1 whenever the location changes

Group and aggregate according to [user, location, derived column]

 

 

 

Python cannot group adjacent members with the same value directly; it has to derive a column and use it as part of the grouping key to complete the operation. Moreover, it takes some effort to come up with even a relatively simple derived column.
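As a standalone illustration of the derived key, here is a minimal sketch on just the location column of the sample data:

import pandas as pd

loc = pd.Series(["house", "house", "gym", "gym", "shop", "gym", "gym"])
# True wherever the location differs from the previous row;
# the cumulative sum turns every change point into a new run number
run_id = (loc != loc.shift()).cumsum()
print(run_id.tolist())    # [1, 1, 2, 2, 3, 4, 4]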

 

SPL

     A                                              B
4    D:\data\value.txt
5    =file(A4).import@t()
6    =A5.groups@o(user,location;sum(duration))      /group with changed values

 

The groups@o() function in SPL scans the sequence in order: if a member's grouping key equals that of the previous member, the member is added to the current grouped subset; otherwise a new grouped subset is created and the member is added to it. After the scan, a batch of grouped subsets is obtained and the grouping is complete. This method does not require deriving a new column, and the syntax is easy: just add the @o option to the groups function. Since such grouping only compares adjacent values, it also executes faster, which is exactly what Python's groupby function cannot achieve.
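To make that scan concrete, here is a minimal pure-Python sketch of the behavior described above; the rows are copied from the sample data, and it is an illustration rather than the SPL implementation:

rows = [
    {"duration": 10, "location": "house", "user": "A"},
    {"duration": 5,  "location": "house", "user": "A"},
    {"duration": 5,  "location": "gym",   "user": "A"},
    {"duration": 4,  "location": "gym",   "user": "B"},
    {"duration": 10, "location": "shop",  "user": "B"},
    {"duration": 4,  "location": "gym",   "user": "B"},
    {"duration": 6,  "location": "gym",   "user": "B"},
]

groups = []
prev_key = None
for r in rows:
    key = (r["user"], r["location"])
    if not groups or key != prev_key:    # key changed: open a new group
        groups.append({"duration": 0, "location": r["location"], "user": r["user"]})
        prev_key = key
    groups[-1]["duration"] += r["duration"]    # aggregate within the current group

for g in groups:
    print(g)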

 

Grouping when conditions change

There is another situation for originally ordered data, that is, starting a new group whenever a certain condition is met. For example:

 

Here are the data:

ID          code
333_c_132   x
333_c_132   n06
333_c_132   n36
333_c_132   n60
333_c_132   n72
333_c_132   n84
333_c_132   n96
333_c_132   n108
333_c_132   n120
999_c_133   x
999_c_133   n06
999_c_133   n12
999_c_133   n24
998_c_134   x
998_c_134   n06
998_c_134   n12
998_c_134   n18
998_c_134   n36
997_c_135   x
997_c_135   n06
997_c_135   n12
997_c_135   n24
997_c_135   n36
996_c_136   x
996_c_136   n06
996_c_136   n12
996_c_136   n18
996_c_136   n24
996_c_136   n36
995_c_137   x

 

It is expected to randomly retrieve one row from between every two "x" marks in the code column (one row per group delimited by "x"). One possible result is:

333_c_132   n06
999_c_133   n12
998_c_134   n18
997_c_135   n36
996_c_136   n18

 

Python

file3="D:/data/condition.txt"
data3=pd.read_csv(file3,sep="\t")
cond_seq=data3.code.eq('x').cumsum()
res3=data3[data3.code.ne('x')].groupby(cond_seq).apply(lambda x:x.sample(1)).reset_index(level=0,drop=True)
print(res3)

 

 

Derive a key column: its value increases by 1 at every "x" row

Group by the derived key, then sample one row from each group

 

 

 

The logic of this operation in Python is the same as in the previous two examples: laboriously derive a column of grouping keys and group by it. On the plus side, Python provides the sample() function for sampling the data, which makes the sampling step convenient.
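As a standalone illustration of the condition-based key, here is a minimal sketch on a shortened code column:

import pandas as pd

code = pd.Series(["x", "n06", "n36", "x", "n06", "n12"])
# every "x" increases the cumulative sum, so all rows up to the next "x"
# share the same group number
grp = code.eq("x").cumsum()
print(grp.tolist())    # [1, 1, 1, 2, 2, 2]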

 

SPL

      A                                                                           B
9     D:\data\condition.txt
10    =file(A9).import@t()
11    =A10.group@i(code=="x").conj((l=~.len(),if(l<2,,~.m(2:)(rand(l-1)+1))))     /group with changed conditions

 

The parameter of the group@i() function in SPL is an expression: whenever it evaluates to true, a new grouped subset is started, that is, a new group is created each time the condition is satisfied. This method likewise saves us from working out a derived column, and its efficiency is excellent. SPL does not provide a sampling function, however, so we write the sampling ourselves, which is easy with the rand() function.
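For comparison, a minimal pure-Python sketch of the same scan on a shortened code list (the sampling is random, so the output varies):

import random

codes = ["x", "n06", "n36", "x", "n06", "n12", "x"]
groups = []
for c in codes:
    if c == "x" or not groups:    # the condition holds: start a new group
        groups.append([])
    groups[-1].append(c)

# drop the leading "x" of each group, then pick one remaining code at random;
# a group holding only the "x" (like the trailing one) is skipped
picked = [random.choice(g[1:]) for g in groups if len(g) > 1]
print(picked)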

 

In addition, the @o and @i options can be used with both the groups and the group functions, with the same effect.

 

These three order-related grouping operations introduced above all work very well for processing logs.

 

1. A log with a fixed number of lines per record

 

The form of the log is:

..

 

IP, TIME, GET, URL, and BROWSER are in the first line;

MODULE is in the second line;

USERID, UNAME, and LOCATION are in the third line.

 

In this case, use group((#-1)\3) and then process the log of each group.
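In Python, a rough equivalent is to slice the lines three at a time; the following is a hypothetical sketch, and the file name is made up for illustration:

# each record occupies exactly three lines
with open("access.log", encoding="utf-8") as f:
    lines = [ln.rstrip("\n") for ln in f]

records = [lines[i:i + 3] for i in range(0, len(lines), 3)]
for rec in records:
    print(rec)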

 

2. A log with a variable number of lines, where each line carries a mark

 

The form of such a log is as follows:

..

 

The number of lines differs from user to user, but each line starts with the user's ID.

 

In this case, use group@o(~.split("\t")(1)) and then process the log of each group.
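A rough Python counterpart, as a sketch under the assumption of tab-separated lines whose first field is the user ID (the file name is made up):

groups = []
prev_id = None
with open("user.log", encoding="utf-8") as f:
    for line in f:
        uid = line.split("\t")[0]
        if uid != prev_id:    # a new user ID starts a new group
            groups.append([])
            prev_id = uid
        groups[-1].append(line.rstrip("\n"))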

 

3. A log with a variable number of lines, where each record has a starting mark

 

The form of the log is:

..

 

The number of lines differs from user to user, but each user's record has a starting mark "userid".

 

In this case, use group@i(~.split(":")(1)=="userid"), and then process the log of each group.
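A rough Python counterpart, as a sketch assuming each record begins with a line whose first colon-separated field is "userid" (the file name is made up):

groups = []
with open("marked.log", encoding="utf-8") as f:
    for line in f:
        if line.split(":")[0] == "userid" or not groups:
            groups.append([])    # the starting mark opens a new record
        groups[-1].append(line.rstrip("\n"))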

 

Summary

Python cannot exploit the order of the data to improve efficiency because it handles ordered grouping and unordered grouping in the same way: a derived column that encodes the grouping condition is always needed, which wastes a lot of useful information.

 

SPL, by contrast, makes full use of the order of the data to speed up grouping. As long as the grouping condition is written properly, there is no need to derive a column ourselves; adding the @n, @o or @i option adapts the function to each case, which is easy and practical.