SPL: grouping and sorting aligned by sequence number

 

Sometimes we group and sort the data in order of sequence number, grouping members with the same sequence number into the same group. For example, calculate the total sales of each month last year in order from January to December, calculate the numbers of accessors to a website according to the order from Monday to Sunday, and so on.

This kind of grouping operation, which is aligned to a specified criterion, is collectively referred to as alignment grouping. Grouping by sequence alignment is a special case of alignment grouping where the base set is a sequence of integers starting from 1. Alignment groups may have empty groups or members that are not assigned to any of the groups.

 

1. Sorting by sequence number

Sort the data in the specified sequence order, and retain up to one matching member per group. This is suitable for situations where we want to query or use data in a specified order.

 

[e.g. 1] Query the sales records in order from Monday to Sunday according to the daily sales table. Some of the data are as follows:

Week

Day

Amount

5

Sunday

1101.2

5

Saturday

538.6

5

Friday

2142.4

5

Thursday

1456.0

5

Wednesday

48.0

5

Tuesday

1376.0

5

Monday

676.0

4

Sunday

448.0

4

Saturday

4031.0

4

Friday

364.8

The A.align(n,y) function in SPL is used to align the groups which are directly divided into n groups (from 1 to n), and calculate straight the group number of each member by grouping expression y. By default, each group retains up to one matching member.

The SPL script looks like this:


A

1

=T("DailySales.csv")

2

["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]

3

=A1.group(Week;  ~.align(7,A2.pos(Day)):WeekSales)

4

=A3.conj(WeekSales)

5

=A4.select(~)

A1: query the daily sales table.

A2: define the sequence from Monday to Sunday.

A3: when grouping the sales data by week, use the A.align(n,y) function to align sales records of each week in the order defined by A2. It is important to note here that aligned groups may have empty groups. For example, when no sales were recorded on Friday and Saturday of the second week, there are still seven groups per week:

Week

[2,Monday,1194.0]

[2,Tuesday,1622.4]

[2,Wednesday,319.2]

[2,Thursday,802.0]

(null)

(null)

[2,Sunday,2123.2]

A4: concatenate the sorted sales records of each week.

A5: select non-empty records from the result set. 

The execution results of A5 are as follows:

Week

Day

Amount

1

Monday

3063.0

1

Tuesday

3868.6

1

Wednesday

2713.5

1

Thursday

1005.9

1

Friday

1675.0

1

Saturday

400.0

1

Sunday

2018.2

2

Monday

1194.0

2

Tuesday

1622.4

2

Wednesday

319.2

 

2. Retaining all matching members of each group

Group the data in a specified sequence order, each retain all matching members. This applies to situations where we care about the information of members in each group, or where we need to continue calculating with these member records.

 

[e.g. 2] According to the sales table, list the total sales of each month in 2014 in order. Some data of the sales table are as follows:

ID

CustomerID

OrderDate

Amount

10248

VINET

2013/07/04

2440

10249

TOMSP

2013/07/05

1863.4

10250

HANAR

2013/07/08

1813.0

10251

VICTE

2013/07/08

670.8

10252

SUPRD

2013/07/09

3730.0

option @a of the A.align(n,y) function in SPL is used to keep all matching members of each group while aligning the groups.

The SPL script looks like this:


A

1

=T("Sales.csv")

2

=A1.select(year(ORDERDATE)==2014)

3

=A2.align@a(12,month(ORDERDATE))

4

=A3.new(#:Month,~.sum(AMOUNT):AMOUNT)

A1: query the sales table.

A2: select the records of 2014 from the sales table.

A3: use the A.align@a(n,y) function to sort the months of the order table into 12 groups in order from 1 to 12, with the option @a retaining all matching members of each group.

A4: calculate the total sales of each month. 

The execution results of A4 are as follows:

Month

Amount

1

66692.8

2

52207.2

3

39979.9

4

60699.39

 

3. Overlapped grouping by sequence number

Sometimes the sequence number of groups calculated by each record is more than one, and we want the record to be repeatedly assigned to multiple groups according to the sequence numbers.

 

[e.g. 3] According to the posting record table, group the posts by label, and count the frequency of each label. Some of the data in the posting record table are as follows:

ID

Title

Author

Label

1

Easy analysis of Excel

2

Excel,ETL,Import,Export

2

Early commute: Easy to pivot excel

3

Excel,Pivot,Python

3

Initial experience of SPL

1

Basics,Introduction

4

Talking about set and reference

4

Set,Reference,Dispersed,SQL

5

Early commute: Better weapon than Python

4

Python,Contrast,Install

option @r of the A.align(n,y) function in SPL is used to create overlapped groups by the sequence numbers when aligning groups.

The SPL script looks like this:


A

1

=T("PostRecord.txt")

2

=A1.conj(Label.split(",")).id()

3

=A1.align@ar(A2.len(),A2.pos(Label.split(",")))

4

=A3.new(A2(#):Label,~.count():Count).sort@z(Count)

A1: query the posting record table.

A2: separate the labels by comma and concatenate them into a sequence to get all the labels with no overlap values.

A3: use the A.align@r(n,y) function to group the posts by their labels’ position in all labels with the option @r.

A4: calculate the number of posts per label,and sort them in descending order. 

The execution results of A4 are as follows:

Label

Count

SPL

7

Excel

6

Basics

5