Looking for the Best Scripting Tool for Computing Time Series Data

 

【Abstract】

A scripting tool for handling time series data must have good support for order-based computations. This article compares three scripting tools in terms of development efficiency, syntax expressiveness and the richness of their structured-computation function libraries, in order to find out how well each performs in handling sequence-number-based computations, relative-position-based computations and order-based set operations. esProc SPL turns out to be the most efficient of the three.

Time series data refers to business data ordered by time. Handling it involves quarters, months, workdays and weekends, and sometimes complicated order-based calculations, so you need to choose a scripting language with enough computing power. The commonly used scripting languages include SQL, Python Pandas and esProc SPL. Now let's examine their capabilities one by one.

SQL

As the longest-standing and most widely used language, SQL has already been pushed to its peak within its model framework. Almost every simple operation has an SQL solution, including order-based calculations of course.

The calculation of the link relative ratio is one example. The table stock1001 stores the transaction information of a stock; transaction date (transDate) and closing price (price) are its two important fields. The task is to calculate the growth rate of each transaction date compared with the previous one.

This is a relative-position-based calculation. The SQL query is quite simple when using a window function:

select transDate, price,
       price/lag(price) over(order by transDate)-1 comp
from stock1001

But the code becomes roundabout if the SQL product doesn't support window functions:

with A as (
    select t1.transDate, t1.price, count(*) as rk
    from stock1001 as t1, stock1001 as t2
    where t1.transDate > t2.transDate
       or (t1.transDate = t2.transDate and t1.price <= t2.price)
    group by t1.transDate, t1.price
)
select t1.transDate, (t1.price/t2.price - 1) comp
from A as t1 left join A as t2
on t1.rk = t2.rk + 1

There are two reasons why the code is so roundabout. First, SQL is based on unordered sets and has no sequence numbers, which makes it inconvenient to handle order-based calculations directly; a sequence-number field has to be invented through self-join, grouping and aggregation. Second, SQL has no relative sequence numbers, so it must join the previous row to the current row to achieve the relative-position calculation. Both make the code complicated.

Window functions improve the situation to some degree, but when the computing goal isn't simple, the code is still roundabout.

Here's an example of calculating the median (if the number of records L in a table is even, the median is the average of the two middle values, whose sequence numbers are L/2 and L/2+1; if L is odd, the median is the middle value, whose sequence number is (L+1)/2). The scores table stores student scores and has two main fields: student ID (studentid) and math score (math). The task is to calculate the median of the math scores.
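The definition can be sketched in plain Python before looking at the SQL (hypothetical score values; the 1-based positions in the definition are converted to 0-based indexing):

```python
# Median per the definition: sort, then pick values by sequence number.
def median(values):
    s = sorted(values)
    L = len(s)
    if L % 2 == 1:
        return s[(L + 1) // 2 - 1]              # odd L: 1-based position (L+1)/2
    return (s[L // 2 - 1] + s[L // 2]) / 2      # even L: positions L/2 and L/2+1

print(median([70, 90, 80, 60]))  # -> 75.0
print(median([70, 90, 80]))      # -> 80
```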

The SQL query:

with A as (
    select studentid, math, row_number() over (order by math) rk
    from scores
),
B as (
    select count(1) L from scores
)
select avg(math) from A
where rk in (
    select case when mod(L,2)=1 then (L+1) div 2 else L div 2 end from B
    union
    select case when mod(L,2)=1 then (L+1) div 2 else (L div 2)+1 end from B
)

The window function doesn't really simplify the SQL query. Generating sequence numbers is redundant for an order-based set operation, yet it is necessary in SQL; and when explicit sequence numbers are required, as in this example, the code is even more complex. It's also awkward to perform branch judgment in SQL, so the query uses a small trick to stay shorter, which only makes it harder to understand: when L is odd, it calculates the average of the two identical middle values instead of simply returning the single middle value.

Using the trunc function can skillfully avoid the judgment and simplify the median calculation further. But as the trick strays far from the definition of the median, it can make the code much harder to understand.
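To see why such a trick works, here is a minimal Python sketch with hypothetical data: for any L, the 1-based positions trunc((L+1)/2) and trunc(L/2)+1 bracket the middle, and they coincide when L is odd, so the branch disappears.

```python
import math

def median_no_branch(values):
    s = sorted(values)
    L = len(s)
    i = math.trunc((L + 1) / 2)       # 1-based position of the lower middle value
    j = math.trunc(L / 2) + 1         # 1-based position of the upper middle value
    return (s[i - 1] + s[j - 1]) / 2  # for odd L, i == j: the value averages with itself

print(median_no_branch([70, 90, 80, 60]))  # -> 75.0
print(median_no_branch([70, 90, 80]))      # -> 80.0
```

The trick trades a branch for an index computation, which is exactly why the resulting code no longer resembles the definition.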

Getting the number of consecutively rising days is a more complicated case. The database table AAPL stores price information of a stock; the main fields are transaction date (transDate) and closing price (price). The task is to count the longest run of consecutively rising transaction days.

The SQL query:

select max(continue_inc_days)
from (
    select count(*) continue_inc_days
    from (
        select sum(inc_de_flag) over(order by transDate) continue_de_days
        from (
            select transDate,
                   case when price > lag(price) over(order by transDate)
                        then 0 else 1 end inc_de_flag
            from AAPL
        )
    )
    group by continue_de_days
)

According to the intuitive way of thinking, we loop through the stock records ordered by date. If the current closing price is higher than the previous one, we add 1 to the number of consecutively rising days (initial value 0); if it is lower, we compare the current run length with the largest so far (initial value 0), keep the larger one, and reset the current run to 0. We repeat this until all records have been traversed, and the largest number is the answer.
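That intuitive loop is easy to write in plain Python (hypothetical closing prices), which shows what SQL is being asked to express:

```python
# Longest run of consecutively rising closing prices, scanned in date order.
prices = [10, 11, 12, 11, 12, 13, 14, 13]   # hypothetical data

longest = 0
current = 0
for prev, cur in zip(prices, prices[1:]):
    if cur > prev:
        current += 1                    # price rose: extend the current run
        longest = max(longest, current)
    else:
        current = 0                     # price fell: the run is broken
print(longest)  # -> 3 (the run 11 -> 12 -> 13 -> 14)
```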

SQL can't implement this natural algorithm because of its lack of support for order-based calculations, so it resorts to an unusual trick. The trick divides the stock records, ordered by date, into groups: a record goes into the same group as the previous one if its closing price rises, and starts a new group otherwise. Then it counts the records in each group and finds the largest count, which is the desired result.
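The grouping trick can be demonstrated outside SQL as well. In this Python sketch with hypothetical prices, the running sum of the "did not rise" flags plays the role of sum(inc_de_flag) over(order by transDate): it stays constant across a rising run, so each maximal rising run falls into one group.

```python
from itertools import groupby, accumulate

prices = [10, 11, 12, 11, 12, 13, 14, 13]   # hypothetical closing prices

# flag is 1 when the price did not rise (a new group starts), 0 when it rose
flags = [1] + [0 if b > a else 1 for a, b in zip(prices, prices[1:])]

# the running sum of the flags is the group key
group_ids = list(accumulate(flags))          # [1, 1, 1, 2, 2, 2, 2, 3]

# the biggest group holds the longest rising run
# (the group size counts the run's starting row as well)
longest = max(len(list(g)) for _, g in groupby(group_ids))
print(longest)  # -> 4
```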

If SQL already struggles with these slightly complicated scenarios, you can imagine its inability to deal with more complicated cases. SQL's awkwardness is rooted in unordered sets, its theoretical foundation, and this intrinsic weakness can't be compensated for by any updates or tricks.

Python Pandas

Pandas is Python's function library for processing structured data and one of the common scripting tools for manipulating time series data.

As a library intended specifically for structured data computations, Pandas handles simple order-based calculations effortlessly. To calculate the link relative ratio, for instance, the Pandas script is as follows:

import pandas as pd
stock1001 = pd.read_csv('d:/stock1001.csv')       # returns a DataFrame
stock1001['comp'] = stock1001.price / stock1001.shift(1).price - 1

The first two lines import data from the file and the key code occupies only one line. Pandas, however, can't express the concept of "the previous row" to implement a relative-position-based calculation directly. It uses the shift(1) function to shift the column values one row down, implementing the calculation in a roundabout way. A Pandas beginner can easily get confused between row and column values, or between the previous row and the next row.
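The effect of shift(1) is easiest to see on a tiny hypothetical Series: every value moves one row down, so dividing the original column by the shifted one aligns each price with its predecessor (the first row gets NaN, since it has none):

```python
import pandas as pd

price = pd.Series([10.0, 11.0, 12.0])
prev = price.shift(1)            # NaN, 10.0, 11.0 -- each row now holds the previous value
comp = price / prev - 1          # link relative ratio
print(comp.round(4).tolist())    # -> [nan, 0.1, 0.0909]
```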

As a younger tool, Pandas handles order-based calculations better than SQL because it is built on ordered sets. Its DataFrame type has intrinsic sequence numbers that suit order-based calculations, which makes the above tasks easy in Pandas.

To calculate the median, Pandas has the following core code:

…
df = pd.read_csv('d:/scores.csv')       # returns a DataFrame
math = df['math'].sort_values().reset_index(drop=True)   # the median requires sorted values
L = len(df)
if L % 2 == 1:
    result = math[int(L / 2)]
else:
    result = (math[int(L / 2 - 1)] + math[int(L / 2)]) / 2
print(result)

In this code Pandas uses [N] to reference a row by sequence number rather than inventing one, which makes the code simpler. Python is a procedural language, so the branch judgment is written in a straightforward way without any weird trick to shorten the code.

It's also simple to calculate the largest number of consecutively rising days. Below is the key Pandas code:

…
aapl = pd.read_sql_query("select price from AAPL order by transDate", conn)
continue_inc_days = 0
max_continue_inc_days = 0
for rising in aapl['price'] > aapl['price'].shift(1):
    continue_inc_days = continue_inc_days + 1 if rising else 0
    max_continue_inc_days = max(max_continue_inc_days, continue_inc_days)
print(max_continue_inc_days)
conn.close()

Pandas has much stronger syntax expression ability than SQL; it can implement the natural algorithm without resorting to tricks.

Pandas, however, still can't express a relative position directly. It shifts the column values one row down to represent the previous row, which is hard to understand.

Though more convenient to use than SQL, Pandas is still difficult to use in certain complex scenarios.

Finding accumulated sales according to a specified condition is an example. The sales table stores the sales amount of each client; the main fields are client (client) and sales amount (amount). The task is to find the top clients whose accumulated sales amount reaches half of the total, and list them by amount in descending order.

The Pandas query:

import pandas as pd

sale_info = pd.read_csv("d:/sales.csv")
sale_info.sort_values(by='Amount', inplace=True, ascending=False)
half_amount = sale_info['Amount'].sum() / 2
vip_list = []
amount = 0
for client_info in sale_info.itertuples():
    amount += getattr(client_info, 'Amount')
    vip_list.append(getattr(client_info, 'Client'))
    if amount >= half_amount:
        break
print(vip_list)

 

Another scenario is to calculate the growth rate on each of the three days with the highest prices. stock1001 stores daily prices of a stock; its main fields are transaction date (transDate) and closing price (price). The task is to list the three days with the highest prices in descending order and calculate the growth rate on each of them.

The Pandas query:

import pandas as pd

stock1001 = pd.read_csv("d:/stock1001_price.txt", sep='\t')
CL = stock1001['CL']
CL_psort = CL.argsort()[::-1].iloc[:3].values   # positions of the three highest prices
CL_psort_shift1 = CL_psort - 1                  # positions of the previous days
CL_rise = CL[CL_psort].values / CL[CL_psort_shift1].values - 1
max_3 = stock1001.loc[CL_psort].reset_index(drop=True)
max_3['RISE'] = CL_rise
print(max_3)

In these more complicated scenarios, Pandas also resorts to some difficult tricks, which make the code hard to write and understand.

 

esProc SPL

esProc SPL boasts a wealth of functions for processing structured data. Besides its ordered-set foundation and support for sequence numbers, the language offers a convenient neighbor-reference mechanism and abundant position-based functions to implement order-based calculations quickly and efficiently.

SPL handles simple order-based calculations effortlessly. Here's the SPL script for calculating the link relative ratio:


     A                                          B
1    =file("d:/stock1001.csv").import@tc()      / Import the CSV file
2    =A1.derive(price/price[-1]-1:comp)         / Calculate the link relative ratio using the relative-position reference mechanism

A1 retrieves data from the CSV file and A2 does the key work. SPL uses the intuitive symbol [-1] to reference the previous row relative to the current one. This is a feature that neither SQL nor Pandas has, and a reflection of SPL's computing ability.

To calculate the median, SPL has the following key code:


     A
1    …
2    =L=A1.len()
3    =if(A2%2==0,A1([L/2,L/2+1]).avg(math),A1((L+1)/2).math)

SPL uses [N] to reference a row by sequence number instead of inventing one. It is a procedural language, so it can use the if/else statement for a complicated branch, or the if function, as in this case, for a concise judgment.

It's easier to calculate the longest run of consecutively rising days in SPL than in SQL or Pandas. Below is the key SPL code:


     A
1    …
2    =a=0,A1.max(a=if(price>price[-1],a+1,0))

SPL implements the intuitive algorithm in an efficient way. You can write a block of loop code with the loop statement, or, as in this case, use the loop function max for a more concise loop aggregation.

SPL specializes in structured data processing. It can cope with even more complicated order-based calculations effortlessly.

To get certain running totals, for example, SPL has the following simple and short script:


     A                                                                B
1    =demo.query("select client,amount from sales").sort(amount:-1)   / Retrieve data and sort by amount in descending order
2    =A1.cumulate(amount)                                             / Get the sequence of running totals
3    =A2.m(-1)/2                                                      / Half the total; the last running total is the grand total
4    =A2.pselect(~>=A3)                                               / Get the position where the running total reaches half the total
5    =A1(to(A4))


The SPL script works in an intuitive way: A2 calculates the running totals over the clients, A3 gets half of the grand total, A4 finds the first position where the running total reaches A3, and A5 gets the clients up to that position. Two SPL features appear in the script. One is the m() function used in A3, which can access members in reverse order; -1 represents the last one. The other is the pselect() function used in A4, which returns a sequence number according to the specified condition. Both effectively simplify the code.
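For comparison, the same cumulate/pselect logic can be mirrored in plain Python with itertools.accumulate (hypothetical client data, already sorted by amount in descending order):

```python
from itertools import accumulate

# hypothetical (client, amount) pairs, sorted by amount descending
sales = [("A", 400), ("B", 350), ("C", 150), ("D", 100)]

totals = list(accumulate(a for _, a in sales))  # running totals: [400, 750, 900, 1000]
half = totals[-1] / 2           # like A2.m(-1)/2: the last running total is the grand total
# like pselect(~>=A3): index of the first running total reaching half
pos = next(i for i, t in enumerate(totals) if t >= half)
vips = [c for c, _ in sales[:pos + 1]]
print(vips)  # -> ['A', 'B']
```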

To calculate the growth rate on each of the three days with the highest prices, SPL uses the following script:


     A                                          B
1    =file("d:/stock1001.csv").import@tc()      / Retrieve data
2    =A1.ptop(-3,price)                         / Get the positions of the records for the three days with the highest prices
3    =A1.calc(A2,price/price[-1]-1)             / Calculate the growth rates
4    =A1(A2).new(transDate,price,A3(#):comp)    / Create a two-dimensional table consisting of 3 fields

The ptop() function in A2 finds the positions of the top N records. Like pselect(), it returns a sequence of sequence numbers instead of a sequence of records; there are many similar SPL functions that simplify order-based calculations. The number sign # is another SPL feature: it represents the sequence number of the current record. It's convenient because you don't need to invent sequence numbers as SQL requires or handle an index as Pandas does.
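The ptop idea, working with positions rather than values, can be imitated in Python (hypothetical prices; note the sketch assumes none of the top positions is the first day, since that day has no predecessor):

```python
# hypothetical closing prices in date order
price = [10.0, 12.0, 11.0, 15.0, 14.0, 16.0]

# positions of the three highest prices, like SPL's ptop(-3, price)
top3 = sorted(range(len(price)), key=lambda i: price[i], reverse=True)[:3]

# growth rate at each of those positions, relative to the previous day
rates = [price[i] / price[i - 1] - 1 for i in top3]
print(top3)                          # -> [5, 3, 4]
print([round(r, 4) for r in rates])  # -> [0.1429, 0.3636, -0.0667]
```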

SPL is a specialized language for processing structured data and has a great wealth of relevant functions. These enable it to handle order-based calculations, even very complicated scenarios, efficiently and effortlessly. In a nutshell, esProc SPL is the best choice for handling time series data.