SQL Order-based Calculations

 

What is order-based calculation?

  The union of field values in SQL is common, such as firstname+lastname and year(birthday). No matter how many fields an expression contains, they come from the same row. We call this intra-row calculation. Correspondingly, there are inter-row calculations. Examples include getting difference between the result of the champion and the runner-up and calculation of the accumulative sales amount starting from January to the current month. To identify the champion and the runner-up requires data be ordered by results. To do an accumulative sum from a certain point to another point also requires that data be ordered. So we call them ordered-based calculations. An intra-row calculation takes care of values within a single record, while an inter-row calculation handles the difference between ordered records.

 

Reference a value in the previous/next record

  The simplest and the most common order-based calculation is the reference of a value in the previous or next record when records are already sorted in a certain order. Below are three scenarios:

1. Calculate the growth rate of a stock per day (link relative ratio)

  Sort records by date and reference the closing price of the previous day.

2. Calculate the average price of a stock within three days, which are the previous day, the current day and the next day (moving average)

  Sort records by date, and reference the closing price of the previous day and that of the next day.

3. There are multiple stocks. Calculate the growth rate of each stock in each trading day (intra-group link relative ratio)

  Group records by stocks, sort each group by date and reference the closing price of the previous day.

  Now let’s look at how SQL handles this type of order-based calculation.

Early SQL solutions

  Early SQL doesn’t have window functions. To reference a value in a neighboring record, the language JOINS the two records into a single one.

  Below is the program for handling task 1:

  SELECT day, curr.price/pre.price rate

  FROM (

    SELECT day, price, rownum row1

    FROM tbl ORDER BY day ASC) curr

  LEFT JOIN (

    SELECT day, price, rownum row2

    FROM tbl ORDER BY day ASC) pre

  ON curr.row1=pre.row2+1

  Perform a self-join over the table through the current day and the previous day to put the closing price of the previous day and that of the current day into a single record, and then perform an intra-row calculation to get the growth rate. You can see that subqueries are used for a simple task.

 

  SQL calculates the moving average in task 2 using JOIN, too:

  SELECT day, (curr.price+pre.price+after.price)/3 movingAvg

  FROM (

    SELECT day, price, rownum row1

    FROM tbl ORDER BY day ASC) curr

  LEFT JOIN (

    SELECT day, price, rownum row2

    FROM tbl ORDER BY day ASC) pre

  ON curr.row1=pre.row2+1

  LEFT JOIN (

    SELECT day, price, rownum row3

    FROM tbl ORDER BY day ASC) after

  ON curr.row1=after.row3-1

  One more subquery to be JOINed for one more day. Imagine the program for getting the moving average of the past 10 days and the next 10 days. You will surely be bothered to death by writing 20 JOINs.

 

  Task 3 is more complicated. Since there is more than one stock, SQL adds a code column to differentiate different stocks. The growth rate is thus calculated within one group of stock records:

  SELECT code, day ,currPrice/prePrice rate

  FROM(

    SELECT code, day, curr.price currPrice, pre.price prePrice

    FROM (

      SELECT code, day, price, rownum row1

      FROM tbl ORDER BY code, day ASC) curr

    LEFT JOIN (

      SELECT code, day, price, rownum row2

      FROM tbl ORDER BY code, day ASC) pre

    ON curr.row1=pre.row2+1 AND curr.code=pre.code

  )

  There are two points I want to talk about. You have to do a composite sorting over a table using “code,day”. The code walks ahead because you need to first put records of same stock together and then sort them. You need to put code matching in the joining condition too because if you don’t the growth rate will be calculated between neighboring records of different stocks. That will produce useless dirty data.

Solutions with window functions

  SQL 2003 introduced window functions to express the concept of order. It’s easier to implement an order-based calculation in SQL. The above three tasks can be achieved in simpler ways:

  The following program calculates link relative ratio in task 1. I write the window function in several indentations for easier understanding:

  SELECT day, price /

    LAG(price,1)

      OVER (

        ORDER BY day ASC

      ) rate

  FROM tbl

  LAG function implements the reference of the previous record. It two parameters find the price of the record directly previous to it. OVER is the substatement within LAG function. Each window function has an OVER substatement. Its role is to define a to-be-analyzed ordered set. In this example, the to-be-analyzed data set is already ordered by date.

  The program below calculates moving average in task 2. One way is LAG function to get the previous value followed by LEAD function to get the next value. The other way is the AVG function used by this program, which is more desirable. The AVG function can get an average in the specified range at once, such as one that covers the previous 10 records and the next 10 records, while the LAG/LEAD functions can only get one value at a time

  SELECT price,

    AVG(price) OVER (

      ORDER BY day ASC

      RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING

    ) movingAvg

  FROM tbl;

  In this way, it’s easier to get a moving average from covering the previous n records and the next n records. You just need to change the range parameter defined by RANGE BETWEENT.

 

  This program performs the intra-group order-based calculation required in task 3. Records of all closing prices of same stock are put into a group, which can be implemented by the window function

  SELECT code, day, price /

    LAG(price,1)

      OVER (

        PARTITION BY code

        ORDER BY day ASC

      ) rate

    FROM tbl

  The PARTITION BY substatement within OVER function defines the way of grouping records and limits each LAG operation in a group. It’s more intuitive than the earlier JOIN method. The JOIN method sorts records by multiple fields, which is equivalent to PARTITION BY but difficult to understand.

 

Location by sequence numbers

  It’s a relative location to get a value in a neighboring record in an ordered set. At times we need to find the absolute position of a record, as calculation of the difference between the debut price and the closing price in each day requires:

  SELECT day, price-FIRST_VALUE(price) OVER (ORDER BY day ASC) FROM tbl

  Or as the calculation of the difference between the highest closing price, which is on the 10th trading day, and the closing price in each day requires:

  SELECT day, price-NTH_VALUE(price,10)OVER (ORDER BY day ASC) FROM tbl

  There are more complex scenarios, where the sequence number for locating a record is unknown and needs to be generated from the existing values:

4. There are stock records ordered by closing prices, and we want to find the price in the middle place (the median)

  Let’s start from the simple one stock scenario. After records are sorted by prices, we still don’t know where the middle place is. So we need to calculate the sequence number in the middle place according to the number of the records:

  SELECT *

  FROM

    SELECT day, price, ROW_NUMBER()OVER (ORDER BY day ASC) seq FROM tbl

  WHERE seq=(

    SELECT TRUNC((COUNT(*)+1)/2) middleSeq FROM tbl)

  The subquery in FROM statement uses ROW_NUMBER() to generate a sequence number for each row. Another subquery in WHERE statement gets the sequence number in the middle place. There are two points worthy of your attention in this SQL query. One is that you cannot perform filtering directly over the first subquery because WHERE clause cannot use a calculated field from the same-level SELECT clause. This is determined by the SQL execution order. The other is that the result of the subquery in WHERE clause is a one-column, one-row table with a one value. It can be treated as a single value to be compared with Seq.

  Below is the SQL program for getting the median of records of multiple stocks:

  SELECT *

  FROM

    (SELECT code, day, price,

      ROW_NUMBER() OVER (PARTITION BY code ORDER BY day ASC)seq  FROM tbl) t1

    WHERE seq=(

      SELECT TRUNC((COUNT(*)+1)/2) middleSeq

      FROM tbl t2

      WHERE t1.code=t2.code)

  Except for embedding PARTITION BY in the window function, you should make sure that the query condition is set for one stock when calculating the sequence number in the middle place.

5. Calculate the growth rate of the day with the highest closing price compared with the previous day

  Two sorting operations are needed to locate the record with the highest closing price. Let’s start from a single stock:

  SELECT day, price, seq, rate

  FROM (

    SELECT day, price, seq,

      price/LAG(price,1) OVER (ORDER BY day ASC) rate

    FROM (

      SELECT day, price,

        ROW_NUMBER ()OVER (ORDER BY price DESC) seq

      FROM tbl)

    )

  WHERE seq=1

  The continuous two levels of subqueries add useful new data to the original table. ROW_NUMBER indexes closing prices in ascending order. LAG function calculates the growth rate for each day. Finally we get the day with the highest closing price (seq=1) through the filtering operation.

  The filtering operation should be after the calculation of growth rate because you cannot calculate the growth rate if you filter away the day before the day with the highest closing price.

Order-based grouping

  The order is also useful in performing grouping. Here is an example:

6. Find the maximum number of trading days when a stock rises consecutively.

  It’s slightly complicated. The logic is this: divide stock records that are ordered by date into a number of groups by putting those that rise consecutively into same group. That is to say, if a record where the closing price is higher than that in the previous record, then put them into the same group; if a record has a lower closing price compared with the previous one, then put it into a new group. After all records are grouped, count the records in every group and get the maximum count, which is the result we want.

  This type of grouping is performed based on the order of the records. Since SQL supports equi-grouping only, it needs to transform the order-based grouping into an equi-grouping. It does it in this way:

  1) Sort records by date, and for each date get the closing price of the previous date;

  2Compare each closing price with the previous one, record 0 for a higher one and 1 for a lower one;

  3Sum all mark numbers before the current record and get a sequence of numbers in the format of 0,0,1,1,1,1,2,2,3,3,3…. These are the sequence numbers we need;

  4Now we can perform a SQL equi-grouping operation.

  Here’s the complete SQL program for getting this done:

  SELECT MAX(ContinuousDays)

  FROM (

    SELECT COUNT(*) ContinuousDays

    FROM (

      SELECT SUM(RisingFlag) OVER (ORDER BY day) NoRisingDays

      FROM (

        SELECT day, CASE WHEN price>

           LAG(price) OVER (ORDER BY day) THEN 0 ELSE 1 END RisingFlag

      FROM tbl

      )

    ) GROUP BY NoRisingDays

  )

  This SQL solution includes 4 levels of subqueries. Different from Java and C language, SQL is set-based. It offers methods that work directly over sets without explicitly controllable circularly operations and the temporarily intermediate variables. Far from the intuitive way of thinking, the SQL implementation takes a roundabout way to use standard set operations to achieve the result. Java or C language, however, is nearer to our natural way of thinking by processing each record circularly. It’s intuitive to generate a new group or append data to an existing one. But they don’t support set-operations. From this perspective, both SQL and Java/C have their own advantages and disadvantages.

  A real-world computing scenario can be more complicated than you can imagine:

7. Find the stocks that rise consecutively for 3 days.

  This scenario requires order-based grouping, operations over post-grouping subsets, a standard grouping and a HAVING clause. First we get all rising groups for each stock using the implementation method in the previous query task, enclose it by a grouping operation to calculate the maximum number of consecutively rising days, and then find the stocks that rising consecutively for 3 days through the HAVING clause:

  SELECT code, MAX(ContinuousDays)

  FROM (

    SELECT code, NoRisingDays, COUNT(*) ContinuousDays

    FROM (

      SELECT code,

      SUM(RisingFlag) OVER (PARTITION BY code ORDER BY day) NoRisingDays

      FROM (

        SELECT code, day,

          CASE WHEN price>

            LAG(price) OVER (PARTITION BY code ORDER BY day)

          THEN 0 ELSE 1 END RisingFlag

        FROM tbl

      )

    ) GROUP BY NoRisingDays

  )

  GROUP BY code

  HAVING MAX(ContinuousDays)>=3

  The SQL program is nearly unintelligible.

Summary

  SQL had been extremely awkward in handling order-based calculations before window functions were introduced (Even now some databases still don’t support the window functions). In theory it can manage all scenarios but actually all is nothing because the implementations are too complicated. Window functions greatly improve SQL’s dilemma, though it’s still roundabout when it handles complicated scenarios.

  SQL’s problems are rooted in its theoretical basis - the unordered-set-based relational algebra. Window functions are useful but cannot deal with the fundamental issue.

  Actually an array (set) in a computing language is naturally ordered (they have natural sequence numbers). It’s easy to understand and implement the feature with a high-level language like Java and C++. The problem is they have weak ability in handling set-operations. This means that they also produce lengthy program for handling order-based calculations (though the logic is not complicated).

 

  An excellent player in doing this is esProc SPL. esProc is a professional data computing engine. It is based on ordered sets and offers all-round functions for performing set operations. It inherits the merits of both Java and SQL. It’s rather easy to do an order-based calculation in SPL, to implement the above scenarios, for example, SPL has its own easy solutions:

  1 T.sort(day).derive(price/price[-1]:rate)

  2 T.sort(day).derive(avg(price[-1:1]):movingAvg)

  3 T.sort(day).group(code).(~.derive(price/price[-1]:rate)).conj()

  4 T.sort(price).group(code).(~((~.len()+1)\2))

  5 T.sort(day).group(code).((p=~.pmax(price),~.calc(p,price/price[-1])))

  6 T.sort(day).group@o(price >price[-1]).max(~.len()))

  7 T.sort(day).group(code).select(~.group@o(price>price[-1]).max(~.len())>3).(code)

  SPL provides syntax for achieving cross-row reference and gives solid support for all those calculations above. It enables programmers to phrase the logic in an intuitive, simple, and graceful way.