"In terms of structured data calculation outside the database, Stream has taken a step from scrat .."

Hoo RaqForum 19 No.
1 Reply • 234 View • 2 Years ago

From Stream to Kotlin to SPL

In terms of structured data calculation outside the database, Stream has taken a step from scratch; Kotlin has slightly enhanced this capability, but the nature of the compiled language makes it impossible to go further; To really solve the problem of structured data computing outside the database, SPL, a professional structured data computing language, is needed.

In java development, we often encounter the situation that it is inconvenient to use database but structured data calculation is needed. For a long time, Java has not provided a class library to deal with this situation. Even basic calculations such as sorting and grouping have to be hard coded by developers, and normal business logic is more difficult to implement. Until java8 launched the Stream class library, the problem of structured data calculation outside the database was finally preliminarily solved.

Next, let's review the data computing power of stream with a few examples.

Sorting: the order table has fields such as OrderID, Client, SellerId, Amount and OrderDate. First sort the Client field in reverse order, and then sort the Amount field. The key codes of stream are as follows:

record Order(int OrderID, String Client, int SellerId, double Amount, Date OrderDate) {}

Stream<Order> Orders=…..   //_Generate the order table, and the data retrieval process is omitted_

Stream<Order>  result=Orders

.sorted((sAmount1,sAmount2)->Double.compare(sAmount1.Amount,sAmount2.Amount))

.sorted((sClient1,sClient2)->CharSequence.compare(sClient2.Client,sClient1.Client));

The structured data type Stream <Order> can be sorted by using the function sorted, and the function compare can compare the size and return true or false. Note that the combination of two sorted functions is called streaming programming. The parameters of sorted use anonymous functions, namely lambda syntax, which is a kind of functional programming. The fields sorted in the code must be reversed in order to meet the business sorting order.

Group aggregation: group the order table by year and client, sum the amount in each group and count. The key codes are as follows:

Calendar cal=Calendar.getInstance();
  Map<Object, DoubleSummaryStatistics> c=Orders.collect(Collectors.groupingBy(
          r->{
              cal.setTime(r.OrderDate);
              return   cal.get(Calendar.YEAR)+"_"+r.SellerId;
              },
                Collectors.summarizingDouble(r->{
                  return r.Amount;
              })
          )
  );
      for(Object sellerid:c.keySet()){
          DoubleSummaryStatistics r   =c.get(sellerid);
          String   year_sellerid[]=((String)sellerid).split("_");
          System.out.println("group   is (year):"+year_sellerid[0]+"\t (sellerid):"+year_sellerid[1]+"\t   sum is："+r.getSum()+"\t   count is："+r.getCount());
      }

The order table can be grouped and aggregated by using the groupingBy function, supplemented by classes and functions such as collect, Collectors, summarizingDouble, DoubleSummaryStatistics. Note that the result of group aggregation is no longer a structured data object, but a map object. The function grouping only supports one grouping variable. We can use record type to combine two grouping fields into one grouping variable, but the code will become complex. In order to simplify the code, the two fields are combined into a string with underscore in here.

Association calculation: the main fields of the employee table include Eid, Dept, gender, etc. Eid is the logical foreign key of the SellerId field of the order table. Perform an inner join to the two tables, and then group by the dept field of the employee table and get the sum of the amount field.

Map<Integer, Employee> EIds = Employees.collect(Collectors.toMap(Employee::EId, Function.identity()));  
  //_Create a new OrderRelation class, in which SellerId is a single value and points to the corresponding employee object._  
  record OrderRelation(int OrderID, String Client, Employee SellerId, double   Amount, Date OrderDate){}  
  Stream<OrderRelation> ORS=Orders.map(r -> {  
      Employee e=EIds.get(r.SellerId);  
      OrderRelation or=new OrderRelation(r.OrderID,r.Client,e,r.Amount,r.OrderDate);  
      return or;  
  }).filter(e->e.SellerId!=null);  
  Map<String, DoubleSummaryStatistics> c=ORS.collect(Collectors.groupingBy(r->r.SellerId.Dept,Collectors.summarizingDouble(r->r.Amount)));  
  for(String dept:c.keySet()){  
      DoubleSummaryStatistics r =c.get(dept);  
      System.out.println("group(dept):"+dept+"   sum(Amount):"+r.getSum());  
  }

Stream does not directly support join calculation, so first calculate the corresponding relationship between employee ID and employee record, and then replace SellerId in order table with employee record; Finally, group and aggregate the order table. Note that the Eid of the original order table is of integer type, and the data type is different after replacing it with a record. A new order table OrderRelation should be generated. We also need to filter out the records with empty SellerId in the new order table to meet the definition of inner join. The calculation result is no longer a structured data object, but a map object.

From the above examples, we can see the advantages of Stream in structured data computing: it has certain computing power and can improve development efficiency. Specifically, Stream provides some basic calculation functions, and does not need to hard code when encountering corresponding tasks, and the code length is significantly shortened; Lambda syntax, a simple functional programming, is provided to simplify the writing of user-defined functions; streaming programming is provided to make multi-step calculation easy.

Although Stream has made a breakthrough contribution, its shortcomings cannot be ignored. The most fatal disadvantage is the lack of computing power. The intermediate calculation result and final result of Stream must be defined in advance, and the definition and assignment of structure are very troublesome. For example, for the new order table in the example, map can be used directly without definition in order to simplify the code, but it is not intuitive to read and use. Although Stream supports lambda syntax, the interface rules are complex, the code is not short enough, but the reading difficulty increases significantly. Stream structured objects such as record\entiry\Map are inconvenient to use, and they must be expressed by "object x.unit price*x.quantity", instead of omitting the object name as "unit price*quantity".

The fundamental reason for the insufficient computing power of Stream is that Java lacks professional structured data object and fundamental support from the bottom. Java is a compiled language. The structure of return value must be defined in advance. It can't support dynamic structure like interpreted language. Java must use a set of complex rules to implement lambda syntax, and can’t conveniently specify a parameter expression as value parameter or function parameter like interpreted language. Java structured data objects are not professional enough, which is also reflected in other aspects, for example, it does not support omitting data object names and directly referencing fields; Lack of some basic functions, such as various join calculations and set calculations; Even the supported basic functions can only be calculated with the help of multiple functions, such as group aggregation; Even the seemingly simplest calculations have problems with odd usage, such as multi field sorting.

Stream has insufficient computing power, but the demand for out of database computing will not disappear, so challengers emerge one after another, especially Kotlin. Kotlin is a development language that is fully compatible with Java ecosystem and additionally supports JavaScript. It has made significant improvements based on stream and further improved its computing power, so that it is jokingly called the most important third-party class library of Java.

Next, let's use the same examples to experience Kotlin's improvements in structured data computing.

data class Order(var OrderID: Int,var Client: String,var SellerId: Int, var Amount: Double, var OrderDate:   Date)

var Orders:List<Order> =…..//_Generate the order table, and the data retrieval process is omitted_

var resutl=Orders.sortedBy{it.Amount}.sortedByDescending{it.Client}

For the structured data type List<Order>, sorting can be realized by using the function sortedBy, and there is no need to use other auxiliary functions. Note that the sorting fields should be reversed.

Group aggregation：

data class Grp(var   OrderYear:Int,var SellerId:Int)  
data class Agg(var sumAmount: Double,var rowCount:Int)  
var result=Orders.groupingBy{Grp(it.OrderDate.year+1900,it.SellerId)}  
      .fold(Agg(0.0,0),{  
          acc, elem -> Agg(acc.sumAmount + elem.Amount,acc.rowCount+1)  
      })

.toSortedMap(compareBy<Grp> { it. OrderYear}.thenBy {it. SellerId})  
  result.forEach{println("group fields:${it.key.OrderYear}\\t${it.key.SellerId}\\t aggregate fields:${it.value.sumAmount}\\t${it.value.rowCount}") }

Use the function groupingBy to perform grouping and the function fold to perform aggregation. Note that sorting after aggregation is to keep the results consistent with SQL, not a necessary step. The calculation result is a map type, not a structured data type (data class). The grouping fields use structured type. Although the structure should be defined in advance, it is relatively convenient to use. You can also combine the two grouping fields. Although it is not necessary to define the structure in advance, the code is more complex.

Association calculation：

data class OrderNew(var OrderID:Int ,var Client:String, var SellerId:Employee ,var Amount:Double ,var OrderDate:Date)  
  val result = Orders.map {o->var emp=Employees.firstOrNull{it.EId==o.SellerId}  
      emp?.let{OrderNew(o.OrderID,o.Client,emp,o.Amount,o.OrderDate)}  
      }  
      .filter {o->o!=null}  
  data class Agg(var sumAmount: Double,var rowCount:Int)

var result1=result.groupingBy{it!!.SellerId.Dept}

    .fold(Agg(0.0,0),{

        acc, elem -> Agg(acc.sumAmount + elem!!.Amount,acc.rowCount+1)

    }).toSortedMap()

Kotlin does not directly support association, so first loop through the orders, replace SellerId with employee record, so as to indirectly realize join calculation, and finally perform group aggregation. Note that Kotlin can easily find records according to the ID without preparing the corresponding relationship between employee ID and employee record in advance. The Eid of the original order table is of integer type, and the data type is different after it is replaced by a record, thus a new order table must be generated. We need to filter out the records with empty SellerId in the new order table to meet the definition of inner join. The calculation result is no longer a structured data object, but a map object.

It can be seen from the examples that Kotlin has indeed made some improvements and has stronger computing power than Stream. Specifically, Kotlin's lambda syntax is more concise and the code is shorter; Some basic calculation functions have also been improved, and the calculation can be completed without the assistance of other functions, such as sorting; Some basic calculation functions are added, such as intersection, union and complement.

However, Kotlin has only made slight improvements, and its computing power is still seriously insufficient. Kotlin's intermediate calculation results and final results still need to be defined in advance and cannot be generated dynamically in the calculation. Kotlin's lambda syntax is still difficult to read and far less understandable than SQL. Kotlin's structured data object calculation still cannot omit the object name, and cannot be simply expressed by"unit price * quantity". In fact, these are all problems existing in Stream.

Like Stream, Kotlin has insufficient computing power. It also lacks professional structured data object and cannot support dynamic data structure. It is difficult to really simplify lambda syntax and cannot directly reference fields. Kotlin still lacks some important basic functions, such as join, and developers still have to hard code to complete the calculation. For the business algorithms composed of multiple basic calculations, the development process is still difficult.

Kotlin's computing power is limited by the ceiling of compiled language. If developers need more professional computing capability outside the database, what tools can they choose?

esProc SPL is a reliable choice.

esProc SPL is a professional open source structured data computing language with rich built-in computing functions, perfect structured data objects, and provides structured data computing power that does not depend on the database. For the calculations listed above, SPL is much simpler to code.

Sorting：

	A
1	=Orders=file("Orders.txt").import@t()
2	=Orders.sort(-Client, Amount)

Sorting can be realized by using the function sort for the structured data type table sequence. There is no need to use other functions to assist in calculation, and there is no need to reverse the fields.

Group aggregation：

=Orders.groups(year(OrderDate),Client;sum(Amount))

The function groups is used for group aggregation, and there is no need to use other functions to assist in calculation. The calculation result is also a table sequence without prior definition,

Association calculation：

=join(Orders:o,SellerId ; Employees:e,EId).groups(e.Dept; sum(o.Amount))

First use the join function to realize inner join, and then perform group aggregation. Both the intermediate result and the final result are table sequences and need not be defined in advance. You can switch join types with only a slight change, such as join@1 indicates left join, join@f indicates full join.

SPL provides a general JDBC driver. These SPL codes can be easily embedded in Java (similar to SQL) or called by Java in the form of script files (similar to stored procedures).

…
Class.forName("com.esproc.jdbc.InternalDriver");  
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");  
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("=file(\\"Orders.txt\\").import@t().sort(-Client, Amount)");
//result = statement.executeQuery("call splFileName(?)");
...

For more details, please refer to the official website, and it is not explained in detail here.

In fact, SPL has far more computing power than SQL. For example, in the following relatively complex examples, using SQL is troublesome, but using SPL is much easier.

Continuous duty: duty.xlsx records the daily duty situation. A person will usually be on duty for several continuous working days, and then someone else. Now we need to calculate the continuous duty state of each person in turn, and output the results as a two-dimensional table. Some data before and after processing are as follows:

Before processing（Duty.xlsx）

Date	Name
2018-03-01	Emily
2018-03-02	Emily
2018-03-04	Emily
2018-03-04	Johnson
2018-04-05	Ashley
2018-03-06	Emily
2018-03-07	Emily
…	…

After processing

Name	Begin	End
Emily	2018-03-01	2018-03-03
Johnson	2018-03-04	2018-03-04
Ashley	2018-03-05	2018-03-05
Emily	2018-03-06	2018-03-07
…	…	…

SQL is not good at dealing with the task of orderly grouping, and it needs to use window functions to do nested subqueries to implement, which is very difficult. SPL provides an ordered grouping function, and the key code is only one line.

	A
1	=T("D:/data/Duty.xlsx")
2	=A1.group@o(name)
3	=A2.new(name,~.m(1).date:begin,~.m(-1).date:end)

Find key clients: the library table sales stores the sales data of clients. The main fields are client and amount. Find the top n key clients whose cumulative sales account for half of the total sales, and sort them from large to small. In case of such complex calculations, SPL is usually more convenient than SQL. The code is as follows:

	A	B
1	=demo.query(“select client,amount from sales”).sort(amount:-1)	Fetch data, sort in descending order
2	=A1.cumulate(amount)	Calculate the cumulative sequence
3	=A2.m(-1)/2	The last cumulative value is the sum
4	=A2.pselect(~>=A3)	The required position (more than half)
5	=A1(to(A4))	Get values by position

In terms of computing outside the database, Stream has taken a key step from scratch; Kotlin has slightly enhanced this capability, but the nature of the compiled language makes it impossible to go further; To really solve the problem of structured data computing outside the database, SPL, a professional structured data computing language, is needed.

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/ydhVnFH9

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc