What should a Java programmer do when it is too difficult to write complex SQL?

To other questions:
How to easily handle text file calculations in Java?
How to easily consolidate data from different sources in Java?
What is the lightweight Java library to read and write Excel dynamically?
What would be a dynamic and flexible way to transform Json to Java object?
How to perform SQL-like queries on MongoDB in Java?
What should I do when it is difficult to implement code with Java Stream?

Solution: esProc – the professional computational package for Java

imagepng

esProc is a class library dedicated to Java-based calculations and aims to simplify Java code. SPL is a scripting language based on the esProc computing package. It can be deployed together with Java programs and understood as a stored procedure outside the database. Its usage is the same as calling a stored procedure in a Java program. It is passed to the Java program for execution through the JDBC interface, realizing step-by-step structured computing, return the ResultSet object.

In most cases, we can use SQL to complete database calculations. However, if we encounter some complex operations that SQL is not good at, we can only read the data from the database and then use other language programs to process. Give an example of calculation: find the stocks that have risen for more than three days.

The specific calculation using SQL is as follows:

SELECT code, MAX(ContinuousDays)
    FROM (
        SELECT code, NoRisingDays, COUNT(*) ContinuousDays
        FROM (
            SELECT code,
            SUM(RisingFlag) OVER (PARTITION BY code ORDER BY day) NoRisingDays
            FROM (
                SELECT code, day,
                CASE WHEN price>
                    LAG(price) OVER (PARTITION BY code ORDER BY day)
                THEN 0 ELSE 1 END RisingFlag
                FROM tbl
            )
        ) GROUP BY NoRisingDays
    )
    GROUP BY code
    HAVING MAX(ContinuousDays)>=3

This piece of SQL is not very long, but it is nested four levels, and the techniques used are strange and difficult to understand. It is difficult for ordinary people to come up with such a code. If we implement it in Java after reading the data from the database, it involves calculating structured data such as grouping and sorting. The code is very lengthy.

With SPL, only two lines:


A

1

=mysqlDB.query@x("select * from tbl ")

2

=A1.sort(day).group(code).select(~.group@o(price>price\[-1\]).max(~.len())>3).(code)

Import the stock market data table and sort by date, with the option @o of the function group to group according to whether the stock price rises. When grouping, it is only compared with neighboring ones and creates a new group when the stock price rises and changes. Next, calculate the maximum number of consecutive for each stock, and finally select the ones that have risen more than three straight days.

This block of code can be debugged or executed in esProc IDE, or stored as a script file (like condition.dfx) for invocation from a Java program through the JDBC interface. Below is the code for invocation:

  package Test;
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;
  public class test1 {
      public static void main(String[]   args)throws Exception {
          Class.forName("com.esproc.jdbc.InternalDriver");
          Connection connection   =DriverManager.getConnection("jdbc:esproc:local://");
          Statement statement =   connection.createStatement();
          ResultSet result =   statement.executeQuery("call condition.dfx");
          printResult(result);
          if(connection != null)   connection.close();
      }

…

}

This is similar to calling a stored procedure. SPL also supports the SQL-like way of embedding the code directly into a Java program without the need of storing it as a script file. Below is the code for embedding:

…
ResultSet result = statement.executeQuery(”
mysqlDB.query@x(\"select * from tbl\").sort(day).group(code).select(~.group@o(price>price\[-1\])
.max(~.len())>3).(code)“);
…

For details on integration with Java programs, please refer to How to Call an SPL Script in Java


SQL is not good at mainly including complex set calculations, ordered calculations, associative calculations, and multi-step calculations. SQL collection is not thorough enough, and there is no explicit collection data type, which makes it difficult to reuse the collection generated during the calculation process. For example, after grouping, it must be mandatory to summarize. It cannot calculate the subset based on the group itself and does not support multi-step code, and often forces programmers to write long statements with several levels of nesting. Although stored procedures can solve this problem to a certain extent, sometimes the actual environment does not allow us to use stored procedures. For example, DBA strictly controls the database procedure permissions. Old databases and small databases do not support stored procedures, etc., and the debugging of stored procedures is also very inconvenient. Therefore, it is not very suitable for writing calculations with stored procedures.

SPL is a professional structured computing language. It is designed based on ordered sets and provides complete set operations. It is equivalent to the combination of the advantages of Java and SQL. It is good at simplifying SQL complex operations. Problems like ordered calculations are very easy. Let me give you a few more examples of different types.

Simplify SQL Grouping

According to Employee table (Below is a part of it), calculate the average salary for each state of [California, Texas, New York, Florida] and for all the other states as a whole, which are classified as “Other”.

ID

NAME

BIRTHDAY

STATE

DEPT

SALARY

1

Rebecca

1974/11/20

California

R&D

7000

2

Ashley

1980/07/19

New   York

Finance

11000

3

Rachel

1970/12/17

New   Mexico

Sales

9000

4

Emily

1985/03/07

Texas

HR

7000

5

Ashley

1975/05/13

Texas

R&D

16000

SQL:

with cte1(ID,STATE) as
      (select 1,'California' from DUAL
         UNION ALL select 2,'Texas' from DUAL
         UNION ALL select 3,'New York' from DUAL
         UNION ALL select 4,'Florida' from DUAL
         UNION ALL select 5,'OTHER' from DUAL)
   select
      t1.STATE, t2.AVG_SALARY
   from cte1 t1
   left join
      (select
         STATE,avg(SALARY) AVG_SALARY
      from
         ( select
            CASE WHEN
            STATE IN ('California','Texas','New York','Florida')
            THEN STATE
            ELSE 'OTHER' END STATE,
            SALARY
         from EMPLOYEE)
      group by STATE) t2
   on t1.STATE=t2.STATE
   order by t1.ID

SPL:

SPL has A.align() function to do this. It uses @n option to put the non-matching members into a new group.


A

1

=T("Employee.csv")

2

[California,Texas,New York,Florida]

3

=A1.align@an(A2,STATE)

4

=A3.new(if   (#>A2.len(),"OTHER",STATE):STATE,~.avg(SALARY):AVG_SALARY)

A1: Import Employee table.

A2: Define a constant set of regions.

A3: A.align() groups Employee table according to A2’s set of states. @a option is used to return all matching members for each group; @n option enables putting the non-matching members into a new group.

A4: Name the new group “OTHER” and calculate the average salary of each group.

SQL uses JOIN to achieve an alignment grouping because it does not have a method to do it directly. As a SQL result set is unordered, we record the order of the records in the original table through their original row numbers. This makes SQL solution to an alignment grouping problem complicated. SPL offers A.align() function for specifically handling alignment grouping operations, which features concise syntax and efficient execution.

 

Simplify SQL Join

Get order information of the year 2014 (order IDs, product names and total amounts) where the product name contains “water” and order amount is greater than 200, and that do not pay in installment and get 5-star evaluation. Below is part of the source data and the relationships between tables:

SQL:

This task involves one-to-many relationship, many-to-one relationship and one-to-one relationship. It is wrong to join them all with the JOIN operation because that will result in many-to-many relationship. The right way is to handle the many-to-one relationship (foreign key table) first by attaching foreign key values or desired field values to the table at the “many” end, and we have one-to-one relationship and one-to-many relationship only. Then we group the sub table by the primary table’s primary key (order ID), which makes the key the sub table’s actual primary key. Finally, we join the four tables through order ID. Below are SQL statements:

SELECT
      Orders.ID,Detail1.NAME, Detail1.AMOUNT
   FROM (
      SELECT ID
      FROM ORDERS
      WHERE
         EXTRACT (YEAR FROM Orders.ORDER_DATE)=2014
   ) Orders
   INNER JOIN (
      SELECT ID,NAME, SUM(AMOUNT) AMOUNT
      FROM (
         SELECT
            Detail.ID,Product.NAME,Detail.PRICE*Detail.COUNT AMOUNT
         FROM ORDER_DETAIL Detail
         INNER JOIN
         PRODUCT Product
         ON Detail.PRODUCTID=Product.ID
         WHERE NAME LIKE '%water%'
      )
      GROUP BY ID,NAME
   ) Detail1
   ON Orders.ID=Detail1.ID
   INNER JOIN(
      SELECT
         DISTINCT ID
      FROM ORDER_PAYMENT
      WHERE INSTALMENTS=0
   ) Payment
   ON Orders.ID = Payment.ID
   INNER JOIN(
      SELECT ID
      FROM EVALUATION
      WHERE SCORE=5
   ) Evaluation
   ON Orders.ID = Evaluation.ID

The SQL statements are difficult to write, hard to understand and maintain. More importantly, it is inconvenient to check whether the statements are correctly written since there are too many joins and nested queries.

SPL:


A

1

=T("Orders.txt").select(year(ORDER_DATE)==2014)

2

=T("Product.txt").select(like(NAME, "*water*"))

3

=T("OrderDetail.txt").switch@i(PRODUCTID, A2:ID)

4

=A3.group(ID).select(sum(PRICE*COUNT)>200)

5

=T("OrderPayment.txt").select(INSTALMENTS==0).group(ID)

6

=T("Evaluation.txt").select(SCORE==5)

7

=join(A1:Orders,ID;A4:Detail,ID;A5:Payment,ID;A6:Evaluation,ID)

8

=A7.new(Orders.ID,Detail.PRODUCTID.NAME,Detail.sum(PRICE*COUNT):AMOUNT)

A1: Import Orders table and select records of the year 2014.

A2: Import Product table and select records that contain water.

A3: Import OrderDetail table and objectify the foreign key PRODUCTID by replacing its values with corresponding records in Product table.

A4: Group OrderDetail table by ID field and select records where the amount is above 200.

A5: Import OrderPayment table and select records that do not have installment information.

A6: Import Evaluation table and select records containing 5-star evaluations.

A7: The join() function joins Orders table, OrderDetail table, OrderPayment table, and Evaluation table according to ID fields.

A8: Return the eligible order IDs, product names and order amounts.

The SPL script has two more lines of code. The import, select, and group operations on each table are separately performed, and only one line (A7) is for the join. The logic is natural and clear.

 

Simplify SQL Transposition

Based on the following channel-based sales table, generate a new table storing information by date. Below is part of the source table:

YEAR

MONTH

ONLINE

STORE

2020

1

2440

3746.2

2020

2

1863.4

448.0

2020

3

1813.0

624.8

2020

4

670.8

2464.8

2020

5

3730.0

724.5

Below is the result of expected layout:

CATEGORY

1

2

3

ONLINE

2440

1863.4

1813.0

STORE

3746.2

448.0

624.8

SQL:

We need to perform both row-to-column transposition and column-to-row transposition to get this done. First, we perform column-to-row transposition to transform channel types into values under CATEGORY field:

YEAR

MONTH

CATEGORY

AMOUNT

2020

1

ONLINE

2440

2020

1

STORE

3746.2

2020

2

ONLINE

1863.4

2020

2

STORE

448.0

Then we perform row-to-column transposition to convert MONTH values into column names. The complete SQL query is as follows:

SELECT *
   FROM (
      SELECT *
      FROM MONTH_SALES
      UNPIVOT (
         AMOUNT FOR CATEGORY IN (
            "ONLINE",STORE
         )
      )
      WHERE YEAR=2020
   )
   PIVOT (
      MAX(AMOUNT) FOR MONTH
      IN (
         1 AS "1",2 AS "2",2 AS "3",
         4 AS "4",5 AS "5",6 AS "6",
         7 AS "7",8 AS "8",9 AS "9",
         10 AS "10",11 AS "11",12 AS "12"
      )
   )

SQL does not support using a non-constant expression as PIVOT/UNPIVOT value, so all months need to be enumerated for the row-to-column transposition.

SPL:

According to the natural logic, SPL handles the task using A.pivot@r() and A.pivot() respectively for column-to-row transposition and row-to-column transposition:


A

1

=T("MonthSales.csv").select(YEAR:2020)

2

=A1.pivot@r(YEAR,MONTH; CATEGORY, AMOUNT)

3

=A2.pivot(CATEGORY; MONTH, AMOUNT)

A1: Import MonthSales table and select records of the year 2020.

A2: A1.pivot@r() performs column-to-row transposition to convert channel types into values of CATEGORY field.

A3: A2.pivot () performs row-to-column transposition to transform MONTH field values into column names.

Both SQL and SPL handle simple transposition scenarios well. The issue is that the real-world situations cannot always dealt with using the mode of grouping & aggregation plus PIVOT. In the subsequent essays in our transposition series, we will introduce how the two languages handle complicated static transpositions and dynamic transpositions.

Simplify SQL Recursion

According to a company’s organizational structure below, get Beijing Branch Office’s subordinate organizations and its superiors’ names (separate multilevel organizations by comma).

ID

ORG_NAME

PARENT_ID

1

Head Office

0

2

Beijing Branch Office

1

3

Shanghai Branch Office

1

4

Chengdu Branch Office

1

5

Beijing R&D Center

2

SQL:

The first idea for doing this is that, during the recursive process for searching for the superiors for each organization, stop the recursion if the specified value (the Beijing Branch Office) is found, keep its records and filter away records of the organizations that cannot be found. It is difficult for SQL to hit the target with one round of recursion. So we divide the task into two steps. First, find all subordinate organizations of Beijing Branch Office; second, as the solution to the above problem does, find all superior organizations for each of those organizations until Beijing Branch Office appears. Below is the SQL queries:

WITH CTE1 (ID,ORG_NAME,PARENT_ID) AS(
   SELECT
      ID,ORG_NAME,PARENT_ID
   FROM ORGANIZATION
   WHERE ORG_NAME='Beijing Branch Office'
   UNION ALL
   SELECT
      ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID
   FROM ORGANIZATION ORG
   INNER JOIN CTE1
   ON ORG.PARENT_ID=CTE1.ID
)
,CTE2 (ID,ORG_NAME,PARENT_ID,O_ID,GROUP_NUM) AS(
   SELECT
      ID,ORG_NAME,PARENT_ID,ID O_ID,1 GROUP_NUM
   FROM CTE1
   UNION ALL
   SELECT ORG.ID,ORG.ORG_NAME,ORG.PARENT_ID,
      CTE2.O_ID,CTE2.GROUP_NUM+1 GROUP_NUM
   FROM ORGANIZATION ORG
   INNER JOIN CTE2
   ON ORG.ID=CTE2.PARENT_ID AND
   CTE2.ORG_NAME<>'Beijing Branch Office'
)
SELECT
   MAX(O_ID) ID, MAX(O_ORG_NAME) ORG_NAME,
   MAX(PARENT_NAME) PARENT_NAME
FROM(
   SELECT
      O_ID, O_ORG_NAME,
      WM_CONCAT(ORG_NAME) OVER
      (PARTITION BY O_ID ORDER BY O_ID,GROUP_NUM) PARENT_NAME
   FROM(
      SELECT
         ID,PARENT_ID,O_ID,GROUP_NUM,
         CASE WHEN GROUP_NUM=1 THEN NULL ELSE ORG_NAME END ORG_NAME,
         CASE WHEN GROUP_NUM=1 THEN ORG_NAME ELSE NULL END O_ORG_NAME
      FROM (
         SELECT
            ID,ORG_NAME,PARENT_ID,O_ID,GROUP_NUM,ROWNUM RN
         FROM CTE2
         ORDER BY O_ID,GROUP_NUM
      )
   )
)
GROUP BY O_ID
ORDER BY O_ID

With Oracle, you can also use START WITH … CONNECT BY … PRIOR … to do the recursive query, as shown below:

WITH CTE1 AS (
   SELECT ID,ORG_NAME,PARENT_ID,ROWNUM RN
   FROM ORGANIZATION ORG
   START WITH ORG.ORG_NAME='Beijing Branch Office'
   CONNECT BY ORG.PARENT_ID=PRIOR ORG.ID
)
,CTE2 AS (
   SELECT
      ID,ORG_NAME,PARENT_ID,RN,
      CASE WHEN LAG(ORG_NAME,1) OVER(ORDER BY RN ASC)= 'Beijing Branch Office' OR
      LAG(ORG_NAME,1) OVER(ORDER BY RN ASC) IS NULL THEN 1 ELSE 0 END FLAG
   FROM(
      SELECT ID,ORG_NAME,PARENT_ID,ROWNUM RN
      FROM CTE1
      START WITH 1=1
      CONNECT BY CTE1.ID=PRIOR CTE1.PARENT_ID
   )
)
SELECT
   MAX(ID) ID, MAX(O_ORG_NAME) ORG_NAME,
   MAX(PARENT_NAME) PARENT_NAME
FROM(
   SELECT
      ID,O_ORG_NAME,GROUP_ID,
      WM_CONCAT(ORG_NAME) OVER
      (PARTITION BY GROUP_ID ORDER BY RN) PARENT_NAME
   FROM(
      SELECT
         ID, ORG_NAME, O_ORG_NAME,RN,
         SUM(ID) OVER (ORDER BY RN) GROUP_ID
      FROM(
         SELECT
            PARENT_ID,RN,
            CASE WHEN FLAG=1 THEN NULL ELSE ORG_NAME END ORG_NAME,
            CASE WHEN FLAG=1 THEN ID ELSE NULL END ID,
            CASE WHEN FLAG=1 THEN ORG_NAME ELSE NULL END O_ORG_NAME
         FROM CTE2
      )
   )
)
GROUP BY GROUP_ID
ORDER BY GROUP_ID

SPL:

SPL offers A.prior(F,r) function to do the recursive query to find the refences until a specific value appears:


A

1

=T("Organization.txt")

2

>A1.switch(PARENT_ID,A1:ID)

3

=A1.select@1(ORG_NAME=="Beijing Branch Office")

4

=A1.new(ID,ORG_NAME,~.prior(PARENT_ID,A3)   :PARENT_NAME)

5

=A4.select(PARENT_NAME!=null)

6

=A5.run(PARENT_NAME=PARENT_NAME.(PARENT_ID.ORG_NAME).concat@c())

A1: Import the Organization table.

A2: Objectify the ID foreign key of the superior organization to convert it into the corresponding record and achieve foreign key objectification.

A3: Get records of Beijing Branch Office.

A4: Create a new table made up of ID, organization name and the set of records of all superior organizations.

A5: Get records that has at least one superior organization, that is, those of Beijing Branch Office.

A6: Join up names of superior organizations into a comma-separated string in a circular way.

The SPL solution is logically clear and concise. It uses records of Beijing Branch Office as parameters to do the recursive search of references.

 

More SQL Comparisons

Refer to Use SPL in applications - Comparison with SQL