How to implement an efficient logical data warehouse? Try SPL!

 

A logical DW (data warehouse) lets users logically integrate data sources of many different types without moving the original data, while presenting itself like a physical DW. In a traditional DW, moving data creates a long data chain, which makes it hard to respond to real-time processing needs; a logical DW avoids this and is therefore well suited to scenarios where the business changes quickly. A logical DW also has cross-source computing ability. However, because it has no physical storage, a logical DW needs to map the data of each source as an SQL table in order to implement mixed computation over multiple data sources.

There is nothing wrong with the idea and application scenarios of the logical DW, but how to implement it remains open to discussion. Currently, the external interface of most logical DWs is still SQL, because almost all traditional DWs are built on SQL and logical DWs have followed suit. The benefit of SQL is that it is universal, which lowers the threshold for learning and application.

However, SQL weakens the logical ability; in other words, an SQL-based logical DW lacks support for diverse data sources.

Unlike a physical DW, a logical DW faces a much wider variety of data sources. A physical DW first loads data into the database, and that loading step makes the data conform to the constraints of the DW (SQL), so it never has to face diverse sources directly. A logical DW has no such loading step, yet many data sources do not meet those constraints, and mapping them as SQL tables does little good. As a result, logical DWs lack support for diverse data sources. In general, it is relatively easy for a logical DW to support RDB-based data sources, but difficult to support multi-layer data structures such as NoSQL, Web services and JSON, and more difficult still to support other kinds of sources such as file systems. In fact, most of today’s logical DWs support only RDBs well and support other types of data sources poorly.

The lack of support for diverse data sources also shows in functionality. We use multiple data sources because different sources have different capabilities and suit different application scenarios. Even among RDBs there are syntax differences, i.e., each database has its own dialect. These dialects are not invented arbitrarily; they exist to expose each database’s particular abilities. Unless a logical DW takes every dialect into account when it works on those databases (which is obviously impossible), abilities that can only be reached through a dialect go unused. The problem is even worse for non-SQL databases; for example, the filtering syntax of MongoDB is quite different from that of SQL. Ideally, besides automatic translation, a logical DW should also allow the native syntax of each data source to be used directly. Unfortunately, SQL does not offer this.

On the other hand, logical DWs are generally poor in physical computing ability.

Reading data from diverse data sources also raises performance issues. When the data volume is small, reading on the fly is fine. When the volume is large, however, the I/O cost becomes very high and on-the-fly reading makes performance intolerably low. To ensure computing performance, a logical DW usually provides some physical computing ability (storing data on a physical device), but for reasons such as long-standing habit there is still a big gap between logical and physical DWs in terms of adaptive storage and the related computing performance.

Essentially, a logical DW should combine a physical DW with logical ability, and physical computing ability should be its foundation. Unfortunately, a purely logical DW is very weak in physical computing and is only suitable for scenarios with small data volumes and low performance requirements. In addition, a logical DW should be more open and flexible: besides connecting different data sources for mixed computing, it should be able to fully leverage the advantages of each data source according to the computing scenario.

The current dilemma is that logical DWs are weak in physical computing ability, while physical DWs are weak in logical (multi-source) ability, so we need to combine the two and draw on the merits of each.

Based on these factors, using SPL to implement a logical DW is a better choice.

Implement logical DW in SPL

SPL is an open-source computing engine whose computing ability is sufficiently open to integrate multiple types of data sources for mixed computation. Its inherently strong physical computing ability, together with high-performance guarantee mechanisms and logical cross-source computing ability, allows SPL to implement a complete logical DW.

Logical data source ability

Currently, SPL can connect to dozens of data sources. These are not limited to RDBs; they also include NoSQL, CSV, Excel, JSON, XML, HDFS, Elasticsearch, Kafka and others.

When connecting to these data sources, SPL treats them as table sequences (small data) or cursors (big data) instead of mapping them as database tables. How the table sequence or cursor is generated is up to the data source itself: every data source offers such an interface, even though it may not, and often cannot, provide an SQL access interface with unified syntax. In this way, the ability of each data source can be fully utilized.

It is easy for SPL to perform cross-source mixed computation based on these data sources. As an example, the following code uses SPL to handle a common cross-database computing scenario (mixed computation across different types of databases):

A
1 =oracle.query("select EId,Name from employees")
2 =mysql.query("select SellerId, sum(Amount) subtotal from Orders group by SellerId")
3 =join(A1:E,EId; A2:O,SellerId)
4 =A3.new(E.Name,O.subtotal)

In this example, SPL does not read all the raw data from MySQL; instead, the grouping and aggregation are done in SQL before the data is read. As a result, the data volume of the large Orders table is significantly reduced, and much less I/O is needed when fetching the data through interfaces such as JDBC.
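
The same pushdown idea can be sketched outside SPL. The following is a minimal Python illustration, not SPL's implementation: it assumes two in-memory SQLite databases standing in for Oracle and MySQL, and the rows are made-up sample data. The large table is aggregated inside its own source, and only the small result is joined in the integration layer.

```python
import sqlite3

# Two separate in-memory databases stand in for the two sources (assumption).
hr = sqlite3.connect(":memory:")
hr.execute("CREATE TABLE employees (EId INTEGER, Name TEXT)")
hr.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

sales = sqlite3.connect(":memory:")
sales.execute("CREATE TABLE Orders (SellerId INTEGER, Amount REAL)")
sales.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 70.0), (3, 10.0)],
)

# Aggregate inside the source, so only one row per seller is transferred.
subtotals = sales.execute(
    "SELECT SellerId, SUM(Amount) FROM Orders GROUP BY SellerId"
).fetchall()

# Small dimension table: fetch it whole and index it by key.
names = dict(hr.execute("SELECT EId, Name FROM employees"))

# Cross-source inner join performed in the integration layer.
report = sorted(
    (names[seller], subtotal)
    for seller, subtotal in subtotals
    if seller in names
)
print(report)
```

Four order rows exist in the source, but only three aggregated rows are fetched; with a real Orders table the reduction would be far larger.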

As mentioned earlier, SQL translation runs into dialect issues, so many database functionalities cannot be brought into play. SPL also provides a similar translation function, which can translate standard SQL into the statements of the corresponding database. More importantly, SPL supports the direct use of each data source’s own syntax, so that the source’s advantages can be fully exploited whether it is an SQL database with its dialect or a non-SQL data source.

In addition to cross-database computation, SPL can perform mixed calculation between data sources of any type. For example, it is sometimes more cost-effective and more flexible to store cold data in the file system (where it can be stored redundantly at will) while the hot data stays in the database. A real-time query over the full data can then be coded in SPL like this:

A
1 =cold=file("/data/orders.ctx").open().cursor(area,customer,amount) / Take the cold data from file system (data before the current day)
2 =hot=db.cursor("select area,customer,amount from orders where odate>=?",date(now())) / Take the hot data from production database (data of the current day)
3 =[cold,hot].conjx()
4 =A3.groups(area,customer;sum(amount):amount) / Mixed computing to implement real-time query
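
For readers unfamiliar with cursors, the pattern is: stream the historical file, stream today's rows from the database, concatenate the two streams, and aggregate in one pass. Below is a minimal Python sketch of that pattern only; the CSV text, the SQLite table and all sample values are assumptions made for illustration, and it does not use SPL's composite table format.

```python
import csv
import io
import sqlite3
from collections import defaultdict
from itertools import chain

# Cold data: a CSV text stands in for the historical file (assumption).
COLD_CSV = "area,customer,amount\nEast,C1,100\nWest,C2,40\n"

def cold_rows():
    """Stream historical rows from the file one at a time."""
    for rec in csv.DictReader(io.StringIO(COLD_CSV)):
        yield rec["area"], rec["customer"], float(rec["amount"])

# Hot data: an in-memory SQLite table stands in for the production database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (area TEXT, customer TEXT, amount REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("East", "C1", 25.0), ("North", "C3", 5.0)],
)
hot_rows = db.execute("SELECT area, customer, amount FROM orders")

# Concatenate both streams and aggregate in a single pass.
totals = defaultdict(float)
for area, customer, amount in chain(cold_rows(), hot_rows):
    totals[(area, customer)] += amount

print(dict(totals))
```

Neither source is fully loaded into memory before aggregation starts, which is the point of using a cursor-like stream for the large cold data.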

SPL is also capable of integrating data sources other than RDB. In particular, SPL provides good support for multi-layer data structure, which makes it convenient to process the data from Web interface, IoT, and NoSQL. For example, SPL reads JSON multi-layer data and performs association query with database:

A
1 =json(file("/data/EO.json").read()) JSON data
2 =A1.conj(Orders)
3 =A2.select(Amount>1000 && Amount<=3000 && like@c(Client,"*s*")) Conditional filtering
4 =db.query@x("select ID,Name,Area from Client") Database data
5 =join(A3:o,Client;A4:c,ID) Association calculation
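
The steps in this listing (flatten the nested orders, filter, then join to a flat table) can be illustrated with a short Python sketch. The JSON layout, field names beyond those in the listing, and the client table contents are assumptions for illustration; the client-name pattern filter is omitted for brevity.

```python
import json

# Nested JSON: each employee record carries its own list of orders (sample data).
raw = """
[
  {"EId": 1, "Name": "Ann", "Orders": [
      {"OrderID": 10, "Client": "C1", "Amount": 1500},
      {"OrderID": 11, "Client": "C2", "Amount": 500}]},
  {"EId": 2, "Name": "Bob", "Orders": [
      {"OrderID": 12, "Client": "C2", "Amount": 2800}]}
]
"""
employees = json.loads(raw)

# Flatten the nested lists into one list of orders.
orders = [order for emp in employees for order in emp["Orders"]]

# Keep orders whose amount is in the range (1000, 3000].
selected = [o for o in orders if 1000 < o["Amount"] <= 3000]

# Flat client table, as it might come back from a database query.
clients = {
    "C1": {"Name": "Acme", "Area": "East"},
    "C2": {"Name": "Bolt", "Area": "West"},
}

# Join each selected order to its client record by key.
joined = [
    (o["OrderID"], clients[o["Client"]]["Name"])
    for o in selected
    if o["Client"] in clients
]
print(joined)
```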

Likewise, SPL supports NoSQL such as MongoDB:

A
1 =mongo_open("mongodb://127.0.0.1:27017/mongo")
2 =mongo_shell(A1,"test1.find()")
3 =A2.new(Orders.OrderID,Orders.Client,Name,Gender,Dept).fetch()
4 =mongo_close(A1)
5 =db.query@x("select ID,Name,Area from Client")
6 =join(A3:o, Orders.Client;A5:c,ID)

Another example is the mixed computing of RESTful data and text data:

A
1 =httpfile("http://127.0.0.1:6868/api/getData").read() RESTful data
2 =json(A1)
3 =T("/data/Client.csv") Text data
4 =join(A2:o,Client;A3:c,ClientID) Association calculation

By now, we can see that SPL provides independent computing ability that has nothing to do with data source, yet the ability of data source itself can still be utilized. Users can choose where to do the calculation, data source end or logical DW (SPL), which is where the flexibility of SPL comes in.

Physical computing ability

In the previous section, we gave several examples of SPL’s ability to integrate multiple data sources. In addition, SPL offers powerful physical computing ability.

SPL provides a professional structured data object, the table sequence, and offers a rich computing library based on it, giving SPL complete yet simple structured data processing ability.

Below are some common calculations written in SPL:

Orders.sort(Amount)				// sort
Orders.select(Amount*Quantity>3000 && like(Client,"*S*"))		// filter
Orders.groups(Client; sum(Amount))			// group
Orders.id(Client)				// distinct
join(Orders:o,SellerId ; Employees:e,EId)			// join

With procedural computation and the table sequence, SPL can implement more kinds of calculation. For example, SPL supports ordered operations more directly and thoroughly, and its grouping operation can retain the grouped subsets (i.e., a set of sets), which makes it convenient to operate further on the grouped result. SPL syntax differs from SQL in many ways; more precisely, these differences are advantages, which we will discuss later.

In addition to rich algorithms and libraries, SPL provides high-performance guarantee mechanism. As mentioned earlier, logical DW should be a combination of physical DW and logical ability, and the physical computing ability is very important. Only the combination of the two can provide sufficient performance guarantee. SPL designs a number of high-performance algorithms specifically for high-performance computing:

  • In-memory computing: binary search, sequence number positioning, position index, hash index, multi-layer sequence number positioning…

  • External storage search: binary search, hash index, sorting index, index-with-values, full-text retrieval…

  • Traversal computing: delayed cursor, multipurpose traversal, parallel multi-cursor, ordered grouping and aggregating, sequence number grouping…

  • Foreign key association: foreign key addressization, foreign key sequence-numberization, index reuse, aligned sequence, one-side partitioning…

  • Merge and join: ordered merging, merge by segment, association positioning, attached table…

  • Multidimensional analysis: partial pre-aggregation, time period pre-aggregation, redundant sorting, boolean dimension sequence, tag bit dimension…

  • Cluster computing: cluster multi-zone composite table, duplicate dimension table, segmented dimension table, redundancy-pattern fault tolerance and spare-wheel-pattern fault tolerance, load balancing…
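
To make one item from this list concrete, here is the general idea behind ordered grouping: when the data is already sorted by the grouping key, each group is a contiguous run, so aggregation needs a single pass and only one group in memory at a time. This Python sketch shows the general technique under that assumption, with made-up rows; it is not a description of SPL's internal implementation.

```python
from itertools import groupby
from operator import itemgetter

# Rows already sorted by customer (the precondition for ordered grouping).
rows = [("C1", 10), ("C1", 5), ("C2", 7), ("C3", 1), ("C3", 2)]

def ordered_group_sum(sorted_rows):
    """Sum amounts per key in one pass, holding only one group at a time."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)

result = list(ordered_group_sum(rows))
print(result)
```

If the input were not sorted by key, the same key could appear in several runs and the result would be wrong, which is why this approach depends on how the data is stored.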

Of course, neither logical nor physical computation can be separated from data storage. Organizing the data according to the computing objective (for example, sorting it by a specified field) can sometimes yield higher computing performance; conversely, some high-performance algorithms can only be implemented with support from the storage. For this reason, SPL provides high-performance file storage. Note that what SPL provides is file storage, which is completely different from the closed storage of traditional databases, and SPL is not bound to its own storage. From a logical point of view, SPL’s high-performance files are on an equal footing with any other data source, except that SPL applies engineering methods such as compression, columnar storage and indexing to them to improve performance. Moreover, SPL provides many high-performance algorithms based on this file storage.
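
As a small illustration of why storage order matters, consider a range query on data stored in key order: two binary searches locate the range, and only that slice is read. The Python sketch below uses in-memory lists with made-up values as a stand-in for an ordered file; it shows the principle only and says nothing about SPL's actual file format.

```python
from bisect import bisect_left, bisect_right

# Records stored in order of the key field (here: order date as ISO strings).
dates = ["2024-01-03", "2024-01-03", "2024-01-05", "2024-02-01", "2024-02-09"]
amounts = [10, 20, 30, 40, 50]

def sum_between(start, end):
    """Sum amounts for start <= date <= end using two binary searches."""
    lo = bisect_left(dates, start)
    hi = bisect_right(dates, end)
    return sum(amounts[lo:hi])

print(sum_between("2024-01-03", "2024-01-31"))
```

On unsorted data the same query would have to scan every record; on sorted data the cost of locating the range grows only logarithmically with the number of records.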

Physical storage enables SPL to have physical computing ability that logical DW cannot match, and also gives SPL a significant performance advantage relative to other physical DWs. In real-world applications, SPL can often achieve a performance improvement of several times to dozens of times.

Another benefit of file storage is flexibility: we can organize the data at will according to the computing objective, and avoid the situation common with databases where users cannot intervene in how data is stored. Since storage is relatively cheap, we can keep as many copies as we want; it just means a few more files. The same data can be organized in different forms (for example, ordered by different fields) to suit different computing scenarios.

SPL itself has complete and high-performance computing ability. Powerful physical computing ability, along with rich interfaces for diverse data sources, forms a complete logical DW, which is why we think SPL is more suitable for building logical DW.

The advantages of SPL don’t stop there. When building a logical DW, being lightweight and simple are also key characteristics of SPL.

More lightweight

Any mention of a DW suggests something heavy, and even a logical DW is assumed to be a server system. In fact, however, cross-source operations occur in all kinds of scenarios, and many in-application cross-source calculations essentially fall within the scope of a logical DW. For this reason, SPL supports not only independent deployment but also integration into applications.

SPL has very low hardware requirements and is light to install. It runs on any operating system that has a JVM environment with JDK 1.8 or higher, including common VMs and containers, and takes up less than 1 GB of installation space. To integrate it into an application, only a few jars need to be embedded, which is very convenient. Some reporting/BI tool vendors now also claim to support logical DWs, but the actual results are poor and far inferior to professional DWs. Embedding SPL in such a tool can make up for this shortcoming and enable logical calculations such as in-application cross-source computing.

Data source interfaces that are convenient to use, physical file storage that makes files flexible and simple to use and manage, and the agile syntax of SPL itself together make SPL very light to use as a logical DW.

It is very convenient to manage data files based on file system. Specifically, files in the file system can be managed in a multi-level directory structure, and we can set up different directories for different businesses or modules; a certain directory and subdirectories are dedicated to serving a single business, eliminating the coupling with each other; data modification will not affect other businesses; if a certain business goes down, the corresponding data (directory) can be safely deleted, making overall management very neat. Moreover, SPL has no metadata and does not require a complex management system like a database. All of these will make O&M very light.

Because it is light to install or embed, to use, and to operate and maintain, SPL is lightweight overall as a logical DW.

Lower development cost

Using SPL to perform data calculation will make it easier to develop and debug, and lower the development cost.

SPL provides syntax that supports procedural computation, which greatly simplifies complex calculations. For the same computing task, writing 100 lines of code as a single statement (SQL) and writing 100 lines of code as 100 statements (SPL) are completely different in complexity.

In addition, SPL provides an IDE that makes developing and debugging easier. In addition to the editing and debugging functions, the IDE allows us to code cell by cell, and offers a panel to view the result of each step, which is very convenient.

More importantly, SQL is not a complete language and cannot even handle some pure data tasks on its own. For example, calculating the maximum number of consecutive days that a stock keeps rising, or the more complex e-commerce funnel analysis (such calculations are not rare and often appear in practice), is extremely difficult to implement in SQL and often requires resorting to Python or Java. This makes the technology stack complex and complicates operation and maintenance.

Compared to SQL (many scenarios are difficult or even impossible to implement in SQL), SPL provides more concise syntax and more complete ability. For example, to calculate the maximum number of days that a stock keeps rising, coding in SPL needs just one statement:

stock.sort(trade_date).group@i(close_price<close_price[-1]).max(~.len())

In contrast, when this calculation is implemented in SQL, it needs to nest multiple layers of code and implement in a very roundabout way.
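
The logic itself is a simple ordered scan, which is what makes an order-aware syntax a natural fit. As a language-neutral illustration, here is a minimal Python sketch that follows the same grouping condition as the SPL statement as read here: a new group starts whenever the closing price falls below the previous day's. The prices are made-up sample data and are assumed to be already sorted by trade date.

```python
def longest_rising_run(prices):
    """Length of the longest run in which no price is below the previous one."""
    best = current = 0
    previous = None
    for price in prices:
        if previous is not None and price < previous:
            current = 0  # a drop starts a new group
        current += 1
        best = max(best, current)
        previous = price
    return best

closes = [10, 11, 12, 11, 12, 13, 14, 13]
print(longest_rising_run(closes))
```

The sample series splits into the groups [10, 11, 12], [11, 12, 13, 14] and [13], so the longest group has four days.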

In addition to conventional structured data computing library, SPL provides the ordered operation that SQL is not good at, the grouping operation that retains the grouped sets, as well as various association methods, and so on. In addition, SPL syntax provides many unique features, such as the option syntax, cascaded parameter and advanced Lambda syntax, making complex calculations easier to implement.

For more information about SPL syntax, visit: A programming language coding in a grid

Concise syntax, together with complete language ability, makes development very efficient and removes the need to resort to other technologies. The technology stack becomes simpler, everything can be done in one system, O&M is simplified, and costs are reduced.


For a logical DW, the logical ability and the physical computing ability are equally important. Only by combining the two abilities can a logical DW fully play its role. In addition, the integration degree of data sources, support degree for data types, performance guarantee, ease of use, and development and O&M costs are also important considerations. Overall, using SPL to build logical DW is a good choice.