Data Analysis Programming from SQL to SPL – Consecutively Rising Stocks

 

Here we are talking about analysis of consecutively rising stocks. To avoid ambiguity, we specify that the number of consecutively rising dates includes the initial date. For example, rising consecutively for 5 days means that a stock rises in a row in the 5 dates. Yet actually there are only 4 price increases that begin on the 2nd date.

Data structure and sample data:

..

CODE is stock code, DT is transaction date, and CL is closing price.

1. Get the largest number of consecutively rising dates for a stock

Suppose the code of target stock is 100046, we have the following SQL statement:

select max(ContinuousDays)
from (
    select count(*) ContinuousDays
    from (
        select sum(UpDownTag) over (order by DT) NoRisingDays
        from (
            select DT,
                case when CL>lag(CL) over ( order by DT) then 0 
                else 1 end UpDownTag
            from Stock
            where CODE=100046))
    group by NoRisingDays )

Though window functions are used, there are still nested queries. The statement is difficult to understand, let alone to write or debug.

In fact, the task isn’t that difficult. The natural logical steps that are easy to think of are: sort the stock records by transaction date – each previous record is data of the previous transaction date; set a variable to store the number of consecutively rising dates, with the initial value being 1; compare each record’s closing price with that of the previous record, and add 1 to the variable if the closing price rises or reset the variable value as 1 if the closing price does not rise; store the variable value corresponding to each record in a new UP column, and we have the following table:

..

After all records are traversed, the largest UP value is the largest number of consecutively rising dates.

But SQL’s support for procedural computation is so weak that it is difficult for the language to express the above logic without using the stored procedure. For this reason, SQL takes a roundabout route. It counts the number of non-rising dates from the initial date to the current date. For transaction dates when closing prices rise continuously, their counts of non-rising dates are same. Then group records according to the counts of non-rising dates to get consecutively rising intervals, and find the largest member count among the groups, which is the expected final result.

Grouping operation is the key for the route. Each group is a consecutively rising interval of dates. But as SQL does not give a direct support of order-based grouping, it can only convert the order-based operation to an ordinary equi-grouping based on count of non-rising dates. The whole process is circuitous. Both price increase judgment and getting cumulative count involve the window function and the subquery, producing hard to write and understand code containing two window functions and 4-layer nested queries.

As general programming languages, SPL has procedural computation ability and can implement the above natural logic effortlessly. It also offers special loop functions to let programmers avoid writing multiple lines of looping code, as shown below:


A

B

1

=T("StockRecords.xlsx")

=A1.select(CODE==100046)

2

=B1.sort(DT)

=1

3

=A2.derive(B2=if(CL>CL[-1],B2+1,1):UP)


4

=A3.max(UP)


A1 reads stock data from the Excel file. SPL IDE is highly interactive, allowing step-by-step execution and intuitively checking result of each step on the result viewing panel on the right:

..

B1 selects records of stock whose code is 100046. A2 sorts them by transaction date. B2 defines a variable whose initial value is 1.

A3 implements the above-mentioned SPL natural logic, which computes an UP column. derive()is a loop function, which directly performs loop operations on members of a set, getting rid of the multi-line code block such as for…next. SPL allows using the cross-row reference in a loop function. So it judges whether the stock rises or not simply using CL>CL[-1], where CL[-1] is the previous row’s closing price.

A4 gets the largest UP value, which is the largest number of rising dates.

After you become familiar with the SPL programming, you can omit the step of generating the UP column and write key logical steps in the same cell:


A

1

=1

2

=T("StockRecords.xlsx").select(CODE==100046)

3

=A2.sort(DT).max(A1=if(CL>CL[-1],A1+1,1))

SPL has powerful order-based computation ability. This enables it to implement the above SQL logic in a very simple way:


A

B

1

=T("StockRecords.xlsx")

=A1.select(CODE==100046)

2

=B1.sort(DT)


3

=A2.group@i(CL<=CL[-1])


4

=A3.max(~.len())


In A3, group@i is an order-based grouping function unique to SPL. The function generates a new group whenever the result of parameter expression CL<=CL[-1] is true, that is, stock price does not rise. It directly puts stock records where prices rise continuously based on the chronological dates, getting rid of the complex computation that SQL does – first compute the cumulative count and then according to it perform equi-grouping to put together the order-based grouping.

Similarly, Click A3 and result is displayed on the panel to the right:

..

Each member is a grouped subset consisting of consecutively rising stock records. Double-click the 3rd member, for example, and we can check its details:

..

A4 computes the length of each group. The maximum length is the largest number of consecutively rising dates. The temporary variable is not needed any more.

The key logics in the above code can also be combined into one line, as shown below:


A

1

=T("StockRecords.xlsx").select(CODE==100046)

2

=A1.sort(DT).group@i(CL<=CL[-1]).max(~.len())

2. Find start date and end date of a stock’s longest consecutively rising interval

By changing from counting dates in a consecutively rising interval to getting start and end dates of the interval, SQL involves more twists in the code than one more step – it first finds the largest count of consecutively rising dates and then the corresponding starting date and ending date:

with t1 as (
    select DT, CL, 
	    case when CL>lag(CL) over(order by DT) then 0 else 1 end UpDownTag
    from stock
    where sid='100046'),
t2 as (
    select DT, CL, sum(UpDownTag) over(order by DT) NoRisingDays
    from t1),
t3 as (
    select NoRisingDays, count(*) ContinuousDays 
    from t2 group by NoRisingDays),
t4 as (
    select * 
    from t2
    where NoRisingDays in (
	    select NoRisingDays
	    from t3
	    where ContinuousDays =(select max(ContinuousDays) from t3)
    ))
select min(DT) start, max(DT) end
from t4
group by NoRisingDays;

Since the subqueries in the code will be re-used, it is not suitable to write them in the nested structure. Instead, the CTE syntax is used to express them. And the code becomes longer.

In SQLaggregation is forced on a grouping operation and grouped subsets cannot be retained. After the largest count of consecutively rising dates is obtained, it needs to re-query the corresponding detail record to get the stating and ending dates and thus produces even more complicated and hard to understand code.

SPL, however, can retain grouped subsets, and just adds a simple line of code based on the code of handling the previous task:


A

1

=T("StockRecords.xlsx").select(CODE==100046).sort(DT)

2

=A1.group@i(CL<CL[-1]).maxp@a(~.len())

3

=A2.new(~(1).DT:start,~.m(-1).DT:end)

A2 uses maxp@a instead of max to get all consecutively rising intervals. The maxp() function returns records corresponding to the largest value instead of the maximum value itself. This is a SPL-unique aggregate function; SQL does not have a counterpart. @a option enables returning all records corresponding to the maximum value, which is another manifestation of SPL’s thorough set orientation.

A3 uses new()function to generate the target result set; ~.m(1) and ~.m(-1) respectively retrieves the first and the last member of the current set. SPL sets are ordered, and members can be accessed according to positions.

3. Find stocks whose largest count of consecutively rising dates is over 5

The task involves one more step than task 1 – find the largest count of consecutively rising dates for each stock and then perform filtering. SQL needs to add a partitioning field CODE in the window function:

select CODE, max(ContinuousDays) as longestUpDays
from (
    select CODE, count(*) as ContinuousDays
    from (
        select CODE, DT, 
            sum(UpDownTag) over (partition by CODE order by CODE, DT) as NoRisingDays
        from (
            select CODE, DT, 
                case when CL > lag(CL) over (partition by CODE order by CODE, DT)  then 0
                else 1 end as UpDownTag
            from stock
        )
    )
    group by CODE, NoRisingDays
)
group by CODE
having max(ContinuousDays)>5

The added step is not difficult as long as the previous steps are well understood. But the whole code becomes even more cumbersome.

SPL can retain the grouped subsets without forcing any aggregations on the grouping operation. You just group records by stock code and perform the previously specified actions:


A

1

=T("StockRecords.xlsx")

2

=A1.sort(DT)

3

=A2.group(CODE;~.group@i(CL<=CL[-1]).max(~.len()):max_increase_days)

4

=A3.select(max_increase_days>5)

A3 groups stock records by CODE and perform the above-mentioned logics on each grouped subset to get the largest count of consecutively rising dates. The first ~ represents the subset obtained from the grouping operation by CODE; the second ~ represents the subset generated from order-based grouping group@i.

A4 finds records where the largest count of consecutively rising dates is over 5.

4. Find all intervals when the largest count of consecutively rising dates is over 5

SQL forces an aggregation on each grouping operation and cannot retain the grouped subset, so it has to make query to re-obtain the corresponding stock records – as task 2 does, after records of stocks rising consecutively for over 5 days are found. The code is too cumbersome and too hard to understand, and we just skip it.

SPL can retain the grouped subsets. It is easy for it to handle the task:


A

1

=T("StockRecords.xlsx")

2

=A1.sort(CODE,DT)

3

=A2.group@i(CODE!=CODE[-1] || CL<=CL[-1])

4

=A3.select(~.len()>5)

5

=A4.conj()

Here A3 uses a different grouping condition, which creates a new group whenever the stock code changes or stock price does not rise. Similar to CL[-1], CODE[-1] represents the stock code of the previous record. As the stock records are all ordered by CODE, different stocks won’t be put into the same group.

A3 only groups records without performing any aggregation, and thus retains grouped subsets:

..

Double-click a member, the fourth for example, and all records within this consecutively rising interval are displayed:

..

A4 selects groups that have more than 5 members.

A5 concatenates the grouped sets obtained in A4 and gets the expected final result.