Q&A on esProc Architecture

 

Q1. Operating environment

esProc is currently pure Java software and runs on any operating system that has a JVM of JDK 1.8 or later, including common VMs and containers.

A standard installation of esProc occupies less than 1 GB, most of which is third-party driver packages for external data sources. The core package is smaller than 15 MB and can even run on Android.

Apart from the JVM, esProc has no special requirements for the operating environment. Hard disk and memory requirements depend on the computing tasks, which vary greatly. For the same computing task, esProc usually needs fewer hardware resources than traditional databases, and this is especially true when compared with distributed databases. Adding memory, choosing CPUs with higher clock frequencies and more cores, and using SSDs usually improve computing performance.

Q2. Data storage and exchange

esProc does not have the concept of a “(data)base” as found in traditional data warehouses, nor does it have the concept of metadata, and it does not uniformly manage the data of a given theme. For esProc, data is not divided into “inside the database” and “outside the database”, and there is no explicit “import into the database” or “export out of the database” action.

Any accessible data source can be regarded as esProc data and can be calculated directly. Before calculation, there is no need to “import into the database” first. After calculation, the result data can be written to the target data source via its interface, without any deliberate “export out of the database”.

esProc has encapsulated access interfaces for common data sources, such as various relational databases (using JDBC), MongoDB, HBase, HDFS, Cassandra, ElasticSearch, Kafka, HTTP/RESTful, and others, even Salesforce and SAP BW.

The logical status of the various data sources is basically the same; the only differences are the access interface and the performance that each interface delivers. These interfaces are provided by the data source vendors, and esProc cannot influence their functionality or performance.

esProc has designed special file formats to store data for more functionality and better performance. These data files are stored in the file system, and esProc does not technically own them. The file formats are publicly disclosed (the access code is open source), so any application that can reach these files can read and write them according to the public specifications or based on the open source code. Of course, it is more convenient to access these files directly using the SPL language.

In this sense, data exchange between esProc and external data sources involves no “import into the database” or “export out of the database” action, but it may involve a “conversion” action: converting external data to esProc format files for more functionality and better performance, or converting esProc format files to external data for other applications to continue using. These conversion tasks can be implemented using SPL.

Q3. Data types supported

The data that esProc handles well includes conventional structured data, multi-layer structured data (JSON, XML, etc.), string and text data, and mathematical data such as matrices and vectors. esProc is not good at processing data such as audio, images, and videos.

In particular, esProc has strong support for JSON, XML and other multi-layer structured data, far better than that of traditional databases. So esProc works well with JSON data sources such as MongoDB and Kafka, and can easily exchange data with, and provide computing services to, HTTP/RESTful services and microservices.

esProc can also conveniently calculate data in Excel files, but it is not good at handling Excel formats.

Q4. Framework Integration and Streaming Computing

esProc can run as an independent server process like a traditional database, providing a standard JDBC driver and HTTP services for applications to call. A Java application can invoke esProc by sending SPL statements through JDBC, and calling script code on esProc is equivalent to calling a stored procedure in a relational database. Non-Java applications can access the computing services provided by esProc through the HTTP/RESTful mechanism.
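As a rough illustration of the JDBC route, the following Java sketch calls a script in the same way a stored procedure would be called. The class and method names (EsProcJdbcSketch, callSyntax, runScript) are invented for this example, and the JDBC URL and script name are assumptions supplied by the caller; the real URL format and driver class should be taken from the esProc JDBC documentation.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper, not part of esProc: calls a script through plain JDBC.
class EsProcJdbcSketch {

    // Builds stored-procedure style call text, e.g. "call orders(?,?)".
    static String callSyntax(String scriptName, int paramCount) {
        StringBuilder sb = new StringBuilder("call ").append(scriptName).append("(");
        for (int i = 0; i < paramCount; i++) {
            sb.append(i == 0 ? "?" : ",?");
        }
        return sb.append(")").toString();
    }

    // Runs the script and prints every row of the returned result set.
    // 'url' is an assumption: pass whatever URL the esProc driver expects.
    static void runScript(String url, String scriptName, Object... args) throws SQLException {
        try (Connection con = DriverManager.getConnection(url);
             CallableStatement st = con.prepareCall(callSyntax(scriptName, args.length))) {
            for (int i = 0; i < args.length; i++) {
                st.setObject(i + 1, args[i]); // JDBC parameters are 1-based
            }
            try (ResultSet rs = st.executeQuery()) {
                int cols = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int c = 1; c <= cols; c++) {
                        if (c > 1) row.append('\t');
                        row.append(rs.getObject(c));
                    }
                    System.out.println(row);
                }
            }
        }
    }
}
```

Only standard java.sql calls are used, so the same code should apply whether the driver talks to a separate server process or runs the computation inside the application.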

For applications developed in Java, esProc also provides a fully embedded approach, which encapsulates all computing functions in the JDBC driver and runs in the same process as the main application, without relying on an external independent server process.

Because esProc is pure Java software and can also run embedded, it can be seamlessly integrated into various Java frameworks and application servers, such as Spring and Tomcat. It can be scheduled and run by these frameworks, and from their point of view the logical status of esProc is exactly the same as that of a Java application written by users.

It should be pointed out that for computing frameworks (such as Spark), seamless integration is possible but has no practical value. esProc must convert data into SPL-specific data objects before it can perform calculations. The conversion takes time, the data objects of the original framework lose their purpose, and the advantages of the two types of data objects cannot be combined. The value of these computing frameworks lies mainly in their data objects (such as Spark’s RDD); if those can no longer be used, the framework itself loses its purpose. The computing power of esProc far exceeds that of common computing frameworks, so there is no need to use them.

In particular, for streaming computing frameworks (such as Flink), esProc cannot play a useful role even if it can be integrated. esProc serves many streaming computing scenarios on its own and does not need the support of a streaming computing framework. For the same computational workload, its resource consumption is usually an order of magnitude lower than that of these frameworks, and its functionality is richer.

Q5. Function extension

esProc is written in Java and provides an interface for calling static functions written in Java, which can be used to extend its functionality. esProc also exposes an interface for custom functions, allowing application programmers to write new functions in Java and mount them to esProc so that they can be used in SPL.
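A static function of the kind mentioned here can be as small as the sketch below. The class TextFunctions and the method wordCount are hypothetical names chosen for illustration. How the function is referenced from SPL (for example through an invoke-style call naming the class and method) is an assumption to verify against the SPL function reference.

```java
// Hypothetical extension class; the names are illustrative only.
// Declared without 'public' so the sketch compiles in any file; a class that is
// called from outside its package would normally be public in its own file.
class TextFunctions {

    // Counts whitespace-separated words; null or blank input counts as 0.
    public static int wordCount(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }
}
```

Keeping such functions static and free of shared state makes them safe to call from several concurrently running scripts.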

Q6. Elastic computing and computing reliability

esProc supports distributed computing and can work across multiple nodes, but this is rarely used in practice because, except in high-concurrency scenarios, a single machine can meet the data volume and response-time requirements of the vast majority of tasks.

The upcoming cloud version of esProc (which can also be deployed privately) will support automatic elastic computing. When the request volume increases, new VMs will be started automatically for computing, and when the request volume decreases, idle VMs will be shut down automatically.

The embedded esProc only provides computing services to the main application and cannot provide external services, nor can it be responsible for the reliability of external services. This is the responsibility of the main application and framework.

esProc running as independent processes supports a hot standby mechanism: JDBC chooses a less burdened service process that is still working to execute the task. The distributed computing of esProc also provides a fault tolerance solution, but large-scale clustering is not a design goal of esProc. If a node failure is found during a computing task, the task is reported as failed. The fault tolerance level of esProc only ensures that the cluster can still accept new tasks after node failures are discovered, and it is only suitable for small-scale clusters.
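The idea of picking a less burdened process that is still working can be pictured with the small Java sketch below. It is not the selection logic of the esProc JDBC driver, which the text does not describe; the Node class, its fields and the pick method are hypothetical, and the load measure (number of running tasks) is an assumption.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative only: chooses the live node with the fewest running tasks.
class NodePicker {

    // Hypothetical snapshot of one service process.
    static final class Node {
        final String address;
        final boolean alive;
        final int runningTasks;

        Node(String address, boolean alive, int runningTasks) {
            this.address = address;
            this.alive = alive;
            this.runningTasks = runningTasks;
        }
    }

    // Returns the least loaded live node, or empty if no node is alive.
    static Optional<Node> pick(List<Node> nodes) {
        return nodes.stream()
                .filter((Node n) -> n.alive)
                .min(Comparator.comparingInt((Node n) -> n.runningTasks));
    }
}
```

When the result is empty, the caller has no working process to send the task to and can only report a failure, which matches the small-cluster fault tolerance level described above.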

The service process of esProc currently does not recover automatically after a fault; an administrator has to handle it. However, it is not difficult to create a monitoring process that automates this.
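Such a monitoring process could look like the following Java sketch, which probes a TCP port and runs a start command when nothing answers. The class ServiceWatchdog, the host, the port and the start command are all assumptions supplied by the caller; the actual port and launch script depend on the installation.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

// Hypothetical watchdog, not shipped with esProc.
class ServiceWatchdog {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }

    // Polls the port until the thread is interrupted and runs startCommand
    // whenever the service does not answer. intervalMillis should be longer
    // than the service start-up time, or the command may be launched twice.
    static void watch(String host, int port, List<String> startCommand, long intervalMillis)
            throws IOException, InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            if (!isListening(host, port, 2000)) {
                new ProcessBuilder(startCommand).inheritIO().start();
            }
            Thread.sleep(intervalMillis);
        }
    }
}
```

A port probe only shows that the process accepts connections. A stricter check would submit a trivial computation and verify the result.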

The elastic computing mechanism of the esProc cloud version avoids currently failed nodes when allocating VMs, which achieves high availability to a certain extent.

Q7. Data security and reliability

As per Q2, esProc does not manage data in principle and is not responsible for data security. To some extent, it can be said that esProc does not necessarily need a security mechanism.

In principle, the data source is responsible for the security of persistent data. For esProc format data files, many file systems and VMs provide comprehensive security mechanisms (access control, encryption, etc.) that can be used directly. The cloud version of esProc also supports reading and computing on data from object storage services such as S3, and can use their security mechanisms as well.

The embedded esProc runs in the same process as the main Java application and only provides computing services to the main program. It has no external service interfaces, so it raises no security or permission issues of its own. esProc running as an independent server process uses standard TCP/IP and HTTP communication, and can be monitored and managed by professional network security products, which take the specific security measures.

esProc focuses on computing and is not responsible for the reliability of persistent storage. There are likewise professional technologies and products in this area, and esProc tries to follow standard specifications so that it can work with them. For example, data can be persisted to highly reliable object storage. esProc implements the interfaces of these storage schemes and can access such data sources for computing.

esProc is a professional computing technology that does not possess professional security capabilities. The philosophy of esProc is to cooperate with other professional technologies.

Q8. SQL compatibility

esProc is not a SQL-based computing engine. It currently supports only simple SQL that does not involve large amounts of data, and it does not guarantee performance for such SQL. For big data requirements, esProc can be considered not to support SQL, and it is of course not compatible with any SQL stored procedures.

esProc will develop dual engines that support SQL in the future, but it will still be difficult to guarantee high performance on big data; the goal is only to make existing SQL code easier to migrate to esProc.