From SQL to SPL: Calculate based on the records within the group and fill the result into the first record

The id and nbr of the MS SQL database table are grouping fields. After sorting the records within the group by time field dt, the order of 0 and 1 in the status field 'status' is different.

id

nbr

dt

status

1

2

2023-10-01

1

1

2

2023-11-02

0

1

2

2023-12-13

0

1

3

2023-10-01

0

1

3

2023-11-02

0

1

3

2023-12-13

1

1

9

2023-11-01

0

1

9

2023-12-01

1

1

9

2024-01-01

0

2

1

2023-11-01

0

2

1

2023-12-01

0

2

1

2024-01-01

0

2

2

2023-11-01

1

2

2

2023-12-01

1

2

2

2024-01-01

1

Now we need to generate three calculated columns according to certain rules based on the records within the group, and fill them into the first record of each group, while filling in 0 for all other records.

Calculated column Cont1Sta1: When there is at least one record in this group with status=1, assign a value of 1, otherwise fill in 0.

Calculated column DateLagInDays: When there is at least one record in this group with status=1, and there is at least one record with status=0, and the first record of the latter is earlier than the first record of the former, assign a value of the day interval between the two, otherwise fill in 0.

Calculated column Recurrence: When there is at least one record with status=1 in this group, and there is at least one record with status=0, and the first record of the former is earlier than the last record of the latter, assign a value of 1, otherwise fill in 0.

id

Nbr

dt

status

Cont1Sta1

DateLagInDays

Recurrence

1

2

2023-10-01

1

1

0

1

1

2

2023-11-02

0

0

0

0

1

2

2023-12-13

0

0

0

0

1

3

2023-10-01

0

1

73

0

1

3

2023-11-02

0

0

0

0

1

3

2023-12-13

1

0

0

0

1

9

2023-11-01

0

1

30

1

1

9

2023-12-01

1

0

0

0

1

9

2024-01-01

0

0

0

0

2

1

2023-11-01

0

0

0

0

2

1

2023-12-01

0

0

0

0

2

1

2024-01-01

0

0

0

0

2

2

2023-11-01

1

1

0

0

2

2

2023-12-01

1

0

0

0

2

2

2024-01-01

1

0

0

0

SQL solution


with cte as (
    select id, nbr, dt, status
        , row_number() over (partition by id, nbr order by id, nbr, dt asc, status desc) rn
        , max(status) over (partition by id, nbr) partition_status
        , min(case when Status = 1 then dt else null end) over (partition by id, nbr) partition_dt1
        , max(case when Status = 1 then dt else null end) over (partition by id, nbr) partition_dt2
    from tb
)
select id, nbr, dt, status
    , case when rn = 1 then partition_status else 0 end Cont1Sta1
    , case when rn = 1 then datediff(day, dt, coalesce(partition_dt1, dt)) else 0 end DateLagInDays
    , case when rn = 1 and exists (select 1 from cte c2 where c2.id = c1.id and c2.nbr = c1.nbr and c2.dt > c1.partition_dt2) then 1 else 0 end Recurrence
from cte c1
order by id, nbr, dt asc, status desc;
with cte as (
    select id, nbr, dt, status
        , row_number() over (partition by id, nbr order by id, nbr, dt asc, status desc) rn
        , max(status) over (partition by id, nbr) partition_status
        , min(case when Status = 1 then dt else null end) over (partition by id, nbr) partition_dt1
        , max(case when Status = 1 then dt else null end) over (partition by id, nbr) partition_dt2
    from tb
)
select id, nbr, dt, status
    , case when rn = 1 then partition_status else 0 end Cont1Sta1
    , case when rn = 1 then datediff(day, dt, coalesce(partition_dt1, dt)) else 0 end DateLagInDays
    , case when rn = 1 and exists (select 1 from cte c2 where c2.id = c1.id and c2.nbr = c1.nbr and c2.dt > c1.partition_dt2) then 1 else 0 end Recurrence
from cte c1
order by id, nbr, dt asc, status desc;

After SQL grouping, it must aggregate immediately, and subsets cannot be kept for more complex multi-step calculations. In this case, multiple window functions can only be used for repeated calculations, which is cumbersome in code. SQL does not have natural sequence numbers, so we need to use window functions to generate sequence numbers first. The expression of the last item is not convenient either, and here we need to compare by associating dates.

SPL can retain and reuse grouped subsets, with natural sequence numbers, and can directly reference the first or last one.


 A

1

=mssql.query("select *, 0 as Cont1Sta1, 0 as DateLagInDays, 0 as Recurrence from tb order by dt”)

2

=A1.group(id,nbr)

3

=A2.(p1F=~.pselect(status==1), p0=~.pselect@a(status==0), p0F=p0.m(1), p0L=p0.m(-1),

if(p1F, Cont1Sta1=1),

if(p1F && p0F && p0F<p1F, DateLagInDays=interval(~(p0F).dt,~(p1F).dt)),

if(p1F && p0L && p1F<p0L, Recurrence=1))

4

=A2.conj()

A1 Load data through JDBC, with 3 calculated columns default to 0 and sorted by date.

A2 Group but not aggregate.

A3 Calculate the data of each group: calculate the record position p1F for the first status=1, the record position p0F for the first status=0, and the record position p0L for the last status=0; Calculate three calculated columns in turn according to business logic.

The pselect function returns the position of the record that meets the criteria, and defaults to returning the first one, @a represents returning all positions. The function m can retrieve values by position, where -1 represents the last one.

Question sourcehttps://stackoverflow.com/questions/78271470/perform-determinations-within-a-data-partition