This Is Probably the Perfect Programming Language for Solving SQL Data Analysis Pain Points

 

Filtering, grouping, sorting and summarizing make up a data analyst's everyday work. SQL is really handy for these basic needs, such as calculating the sales amount of users who meet given criteria or finding the VIP customers who satisfy a given condition. As the following shows, the SQL statement for such a task reads almost like an English sentence – simple and efficient:

SELECT area, SUM(amount) 
FROM sales 
WHERE amount > 1000 
GROUP BY area;

SQL seems to be a synonym for "simple and efficient". It queries the data and returns the result directly – one of the reasons it has become the standard data analysis tool. But when the scenario changes, that apparent simplicity breaks down.

For example, what if the data to be processed comes from a local file rather than a database table? In that case SQL probably cannot help, because it is a tool firmly bound to the database. By contrast, SPL (Structured Process Language) does not need to load data into a database at all; it skips the database and computes on the file directly:


   A
1  =T("sales.txt")
2  =A1.select(amount>1000)
3  =A2.groups(area;sum(amount):total_amount)

This piece of SPL code implements the same logic as the SQL above, but, more importantly, it needs no database environment at all: a few lines of code are enough to analyze a text file. This flexibility not only removes the data-loading time but also greatly lowers the skill threshold for using the language.

Of course, there are now technologies that let SQL query files directly, and using them for file processing is not troublesome. But they only work well for simple or preliminary computations. Once the requirements become more complicated, SPL's advantages become prominent – particularly where SQL's two pain points, difficult to write and difficult to debug, loom ahead, SPL's design eliminates them.
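For reference, a SQL-on-file query for the simple task above typically looks something like the rough sketch below. The syntax is illustrative only; how the file is referenced and how its schema is inferred vary from engine to engine:

-- Illustrative only: the file-reference syntax depends on the SQL-on-file engine used
SELECT area, SUM(amount) AS total_amount
FROM 'sales.txt'
WHERE amount > 1000
GROUP BY area;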

Easy programming keeps you happy and healthy

SPL: Coding complex logic is like building blocks

The fact is that real-world analysis requirements go far beyond "COUNT or GROUP BY". One example is finding, for each stock, the largest number of consecutive trading days on which its price rises. The logic becomes complex, but the SPL code stays clear and concise:


   A
1  =T("StockRecords.txt")
2  =A1.sort(CODE,DT)
3  =A2.group(CODE;~.group@i(CL<CL[-1]).max(~.len()):max_increase_days)

Step 1: Read data from the text file;

Step 2: Sort records by stock code (CODE) and transaction date (DT);

Step 3: Group records by stock code; within each group, group the records again by the rise/fall condition and take the length of the longest consecutively rising segment.

Every step follows the analyst's natural way of thinking – like laying one layer of blocks on top of another – with no complex nested queries. It is SPL's procedural syntax that makes such a step-by-step computation possible.
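To see the mechanism A3 relies on, here is a minimal sketch on a made-up number sequence (the values are hypothetical). group@i starts a new group each time its condition holds, and ~[-1] refers to the previous member, so each group is a run that never falls:

   A
1  =[5,6,7,6,8,9,10,3,4]
2  =A1.group@i(~<~[-1])
3  =A2.max(~.len())

On this made-up data, A2 should yield the groups [5,6,7], [6,8,9,10] and [3,4], and A3 then returns 4, the length of the longest run.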

SQL: Coding complex logic is like solving math contest problems

Below is how SQL achieves the same computing goal. The logic is a genuine "brain cell killer":

SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,
               SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT,
                   CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT) THEN 0
                        ELSE 1 END AS updown_flag
            FROM stock
        ) t1
    ) t2
    GROUP BY CODE, no_up_days
) t3
GROUP BY CODE;

Nested queries, window functions and conditional judgments: the CASE expression flags every non-rising day, the running SUM turns those flags into segment IDs, and only then can segment lengths be counted and the maximum taken. It takes real perseverance to work through the code, and if the goal changes, the whole query probably has to be rewritten. Performing complex data analysis in SQL feels more like competing in a "Nested Query Olympiad" than writing code.

Quick debugging lets you sleep more

SPL: A rich debugging toolkit enables easy and efficient interactive analysis

You have probably had this SQL debugging experience: you write a large query, errors occur, and to find their source you have to split off each clause and execute it separately, modifying the code with every split. It is practically a disaster! SPL's answer to this pain point is a complete set of debugging features:

  • Set breakpoint: Mark a key point where execution pauses so that the intermediate result can be checked in real time.

  • Step over: Execute the code step by step, making it easy to inspect intermediate values and debug clearly and conveniently.

  • Real-time check: The result of each line of code is displayed directly on the result viewing panel on the right; users no longer need to guess whether it is right or wrong.


The debugging experience is like playing a game with unlimited gold coins – so cool! The analyst no longer has to split the code apart just to find the source of an error; they can see the full picture and stay in control the whole time.

Equipped with a rich set of debugging features, particularly the WYSIWYG result viewing panel on the right of the interface, SPL greatly enhances the interactivity of data analysis. As soon as a step of code is written, the analyst can check its result, modify parameters or adjust the logic, and the changes take effect instantly. Analysts no longer need to write a large block of code in one go and run it hoping no errors occur. Instead, like building with blocks, they can construct the analysis workflow step by step, verifying their hunches in real time along the way.

To find the consecutively rising intervals in a stock analysis, for example, SPL lets the analyst observe and code step by step, adjusting and viewing results in real time – a highly interactive experience.


SQL: Nearly zero debugging functionalities

SQL's debugging experience is frustrating. The language supports neither breakpoints nor step-by-step execution. To debug a query, you have to bite the bullet, take it apart layer by layer, and execute each piece separately – and every modification requires a re-execution. An enormous amount of an analyst's time is wasted on such a process.

Take the stock analysis SQL above as an example. If its result is wrong, debugging it requires you to:

  1. Execute the innermost subquery separately (as sketched below) to check whether the error occurs there;

  2. Execute the innermost subquery together with the level above it to make sure that logic is right;

  3. Check the top-level query, possibly rewriting the code repeatedly along the way.

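Step 1 alone already means copying the innermost query out and running it on its own, just to check whether the rise/fall flags look right:

SELECT CODE, DT,
       CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT) THEN 0
            ELSE 1 END AS updown_flag
FROM stock;

Then the same exercise is repeated for each outer layer.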

It is hard not to despair when working in such a debugging workflow.

SPL: The perfect data analysis tool handles complex tasks effortlessly

Since the task of finding the consecutively rising intervals of each stock is difficult to code in SQL, let us handle it in SPL and feel the simplicity:


   A
1  =T("StockRecords.xlsx")
2  =A1.sort(CODE,DT)
3  =A2.group@i(CODE!=CODE[-1] || CL<CL[-1])
4  =A3.select(~.len()>5)
5  =A4.conj()

After the data is imported and sorted, A3 groups the records using SPL's set-oriented and order-based abilities, putting the consecutively rising records of the same stock into one group; A4 then selects the groups with more than 5 members, and A5 concatenates them into a single record set. The whole logic is simple and proceeds naturally.

Here is an even more complicated requirement – an e-commerce funnel analysis used to find the churn rate. Still, SPL expresses the computation in an intuitive, natural way:


   A
1  =["etype1","etype2","etype3"]
2  =file("event.ctx").open()
3  =A2.cursor(id,etime,etype;etime>=end_date-14 && etime<end_date && A1.contain(etype))
4  =A3.group(uid)
5  =A4.(~.sort(etime)).new(~.select@1(etype==A1(1)):first,~:all).select(first)
6  =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime, null))))
7  =A6.groups(;count(~(1)):step1,count(~(2)):step2,count(~(3)):step3)

The analysis proceeds step by step, with each line of code corresponding to one step, which keeps both the logical progression and any later modification convenient and effortless. The same code handles a funnel with any number of steps – it is simple and general. Moreover, with its strong support for order-based operations, SPL can process one user's records at a time without any JOIN, and therefore executes faster.
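To illustrate that generality, suppose a hypothetical fourth event type "etype4" were added to the funnel. Only the step list in A1 and the final aggregation in A7 would change; cells A2 through A6 stay exactly as they are:

   A
1  =["etype1","etype2","etype3","etype4"]
7  =A6.groups(;count(~(1)):step1,count(~(2)):step2,count(~(3)):step3,count(~(4)):step4)

Here "etype4" and "step4" are only assumed names for the extra funnel step, used to show where the changes would go.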

Below is the SQL counterpart, which involves a pile of nested CTEs and countless condition combinations. It ends up so hard to read that even its author can hardly understand it.

WITH e1 AS (
    SELECT uid, 1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime>=end_date-14 AND etime<end_date AND etype='etype1'
    GROUP BY uid),
e2 AS (
    SELECT e2.uid, 1 AS step2, MIN(e1.t1) AS t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime>=end_date-14 AND e2.etime<end_date AND e2.etime>e1.t1 AND e2.etime<e1.t1+7 AND e2.etype='etype2'
    GROUP BY e2.uid),
e3 AS (
    SELECT e3.uid, 1 AS step3, MIN(e2.t1) AS t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime>=end_date-14 AND e3.etime<end_date AND e3.etime>e2.t2 AND e3.etime<e2.t1+7 AND e3.etype='etype3'
    GROUP BY e3.uid)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1 LEFT JOIN e2 ON e1.uid = e2.uid LEFT JOIN e3 ON e2.uid = e3.uid;

Not only is the SQL code complicated, it also performs poorly. Because SQL lacks discreteness and cannot use procedural statements to express complex cross-row logic, it has to rely on JOINs to squeeze the whole logic into one statement, which makes the code hard to understand and slow to execute.

Yet this cumbersome and inefficient SQL is widely used, and its code is even considered "elegant". The language is convenient for simple computations, but once a requirement becomes even slightly complicated, its long-standing problems – difficult to write and difficult to debug – immediately show up. IT forums are flooded with posts asking for help along the lines of "I can't write xxx in SQL". These problems have become long-lasting data analysis pain points.

SPL rises to the occasion to solve these SQL pain points:

  • Simple and intuitive syntax: no more complex nested queries – SPL's procedural syntax keeps the analysis straightforward;

  • Powerful debugging features: breakpoints, step-over execution and real-time result viewing let analysts walk through each step clearly, steadily, effortlessly and efficiently.

If you are already tired of SQL's "dig a hole, then fill it in" approach, just give SPL a try. It will free you from the heavy coding workload and let you focus on the analysis itself rather than struggling with code.