Is there any open source package for Java that can handle huge text?

As the JDK has evolved, Java has gained better support for reading large text as a stream. The size of text that can be processed and the reading speed also improve considerably when the JVM heap size is tuned. If the text is well structured, open-source packages such as Commons CSV, OpenCSV, and SuperCSV can further simplify parsing. However, reading or parsing large text quickly is rarely the end goal. Most data that is read ends up in databases, big data processing engines, or cloud storage, where it waits to be computed on repeatedly. In practice, though, many requirements are one-off calculations, such as comparing two large texts to find the differing records, or filtering 1 TB of text data by a condition. Importing the data into a database for such requirements is time-consuming and laborious, and when suitable hardware and software are not available, Java programmers have to spend a lot of effort implementing the calculations themselves.
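To make the hand-written approach concrete, here is a minimal sketch of a streaming filter in plain Java. The class name StreamingFilter and the method filterLines are made up for illustration; only standard JDK calls are used. It reads one line at a time, so memory use stays flat regardless of file size, but grouping, joining, or comparing files would each need more code of this kind.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;

public class StreamingFilter {
    /**
     * Copies every line of `in` that satisfies `keep` to `out`.
     * Lines are read one at a time, so the whole file is never held in memory.
     * Returns the number of lines written.
     */
    public static long filterLines(Path in, Path out, Predicate<String> keep) throws IOException {
        long written = 0;
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (keep.test(line)) {
                    writer.write(line);
                    writer.newLine();
                    written++;
                }
            }
        }
        return written;
    }
}
```

A call such as StreamingFilter.filterLines(in, out, line -> line.contains(",F,")) would keep only the matching rows; the condition here is just a placeholder, not a real CSV parser.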

A more capable class library would make it convenient to read data efficiently as a stream and to complete SQL-like calculations locally, alongside the application. Among open-source options, the feature-rich and easy-to-use computing package Open-esProc is one of the more advanced, but it works differently from typical open-source packages: it encapsulates its data types and methods in a scripting language called SPL, and the Java program calls the SPL script and receives a ResultSet object.

First, a relatively simple example of using Open-esProc. To query the file emplyee.csv for female employees born on or after January 1, 1981, the SPL script is as follows:


A
1 =file("D:/emplyee.csv").cursor@tc()
2 =A1.select(BIRTHDAY>=date("1981-01-01") && GENDER=="F")
3 =A1.fetch()

SPL reads files through cursors. Like a database cursor, an SPL cursor reads data in small batches as a stream. Calculations such as filtering, grouping, and joins are then attached to the cursor, so that calculations over large data can be completed with a small amount of memory.

If the condition is not known in advance, change the code in A2 to =A1.select(${where}) so that the query condition can be passed in through the parameter where, which makes the query dynamic. If the result is too large to hold in memory, change A3 to =file("D:/result.txt").export(A2), which writes the results directly to a file.

Besides the approach above, SPL also supports querying CSV files with embedded SQL, for example: =connect().query("select * from D:/emplyee.csv where "+where)

Script files (such as condition.dfx) are deployed together with the Java application and are called from Java through the JDBC interface, so usage is similar to calling a stored procedure.

…
 ResultSet result = statement.executeQuery("call condition.dfx");
…
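For context, a fuller version of such a call might look like the sketch below. Several things here are assumptions rather than facts taken from the snippet above: the driver class com.esproc.jdbc.InternalDriver and the URL jdbc:esproc:local:// are the values commonly shown for esProc's embedded JDBC driver, the call condition(?) form assumes the script declares one parameter (the where parameter mentioned earlier), and the class name CallSplScript and helper buildCall are made up for illustration. Check the esProc documentation for the exact values before relying on them.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;

public class CallSplScript {
    // Assumed driver class and URL for the embedded esProc JDBC driver.
    static final String DRIVER = "com.esproc.jdbc.InternalDriver";
    static final String URL = "jdbc:esproc:local://";

    /** Builds a stored-procedure-style call string, e.g. "call condition(?)". */
    public static String buildCall(String scriptName, int paramCount) {
        StringBuilder sb = new StringBuilder("call ").append(scriptName).append('(');
        for (int i = 0; i < paramCount; i++) {
            sb.append(i == 0 ? "?" : ",?");
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        Class.forName(DRIVER);
        try (Connection con = DriverManager.getConnection(URL);
             CallableStatement st = con.prepareCall(buildCall("condition", 1))) {
            // Hypothetical value for the script's `where` parameter.
            st.setObject(1, "BIRTHDAY>=date(\"1981-01-01\") && GENDER==\"F\"");
            try (ResultSet rs = st.executeQuery()) {
                ResultSetMetaData meta = rs.getMetaData();
                while (rs.next()) {
                    // Print each row as tab-separated column values.
                    StringBuilder row = new StringBuilder();
                    for (int c = 1; c <= meta.getColumnCount(); c++) {
                        if (c > 1) row.append('\t');
                        row.append(rs.getObject(c));
                    }
                    System.out.println(row);
                }
            }
        }
    }
}
```

Running main requires the esProc jars and configuration on the classpath; without them, Class.forName fails with ClassNotFoundException.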

As another example, compare two large files to find the difference. Given the existing national real-estate owner summary table all.txt and the Washington property owner registration table washington.txt, check whether any Washington owners are missing from the summary table, and record the result in lost_w.txt.

The SPL script is as follows:


	A	Annotation
1	=file("e:/txt/washington.txt").cursor@s().sortx(_1)	Create the Washington data cursor and sort it.
2	=file("e:/txt/all.txt").cursor@s().sortx(_1)	Create the national data cursor and sort it.
3	=[A1,A2].mergex@d()	Merge the two sorted cursors in order. @d removes from A1 every record that also appears in A2, leaving the Washington owners missing from the summary table.
4	=file("e:/txt/lost_w.txt").export(A3)	Write the result to the file lost_w.txt.
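To show what the ordered difference in A3 amounts to, here is a minimal Java sketch of the same idea over two already-sorted inputs. It is not how SPL implements mergex@d; the class SortedDifference and its method are hypothetical, and the external sort that sortx performs on files too large for memory is left out entirely.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SortedDifference {
    /**
     * Returns the keys of `left` that do not occur in `right`.
     * Both iterators must yield keys in ascending order; each side is
     * consumed once, front to back, so neither has to fit in memory.
     */
    public static List<String> difference(Iterator<String> left, Iterator<String> right) {
        List<String> missing = new ArrayList<>();
        String r = right.hasNext() ? right.next() : null;
        while (left.hasNext()) {
            String l = left.next();
            // Advance the right side until it reaches or passes the current left key.
            while (r != null && r.compareTo(l) < 0) {
                r = right.hasNext() ? right.next() : null;
            }
            // No equal key on the right means the left key is missing there.
            if (r == null || r.compareTo(l) != 0) {
                missing.add(l);
            }
        }
        return missing;
    }
}
```

The result is collected in a list only to keep the sketch short; for a large difference, each missing key would be written to the output file as it is found.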

Data is often imported into a database in order to perform join (association) calculations. If the join can be completed inside the Java computing package, that is very convenient. For example, to join orders with employees and total the order amounts by department:


A
1 =file("E:/txt/Employees.txt").cursor@t().sortx(EId)
2 =file("E:/txt/Orders.txt").cursor@t().sortx(SellerId)
3 =joinx(A2:O,SellerId; A1:E,EId)
4 =A3.groups(E.Dept;sum(O.Amount))
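For comparison, the same join-then-aggregate result can be sketched in plain Java. This sketch swaps in a different technique: instead of the sort-merge join that joinx performs over two sorted cursors, it uses a hash lookup, which assumes the employee table is small enough to hold in memory while orders are streamed. The class JoinAndGroup, the method amountByDept, and the row layout {SellerId, Amount} are hypothetical, and skipping orders with no matching employee assumes inner-join behaviour.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

public class JoinAndGroup {
    /**
     * deptByEmployee maps EId to Dept and is held in memory.
     * orders yields rows of {SellerId, Amount} and is consumed one row at a time.
     * Returns the total Amount per Dept; orders whose seller is unknown are skipped.
     */
    public static Map<String, Double> amountByDept(Map<String, String> deptByEmployee,
                                                   Iterator<String[]> orders) {
        Map<String, Double> totals = new TreeMap<>();
        while (orders.hasNext()) {
            String[] row = orders.next();
            String dept = deptByEmployee.get(row[0]);
            if (dept == null) {
                continue; // no matching employee: drop the row (inner join)
            }
            totals.merge(dept, Double.parseDouble(row[1]), Double::sum);
        }
        return totals;
    }
}
```

When both tables are too large for memory, the hash lookup no longer works and a sort-merge approach over sorted inputs, as in the SPL script, becomes necessary.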

Large text processing often needs parallel computing to improve efficiency: each thread processes one segment of the large file, and the per-thread results are combined at the end.

A
1 =file("E:/txt/user_info_reg.csv").cursor@tcm(;4)
2 =A1.groups(id_province;count(~):cnt)

Parallelism is easy to use in SPL. @m means parallel computing, and the parameter 4 means 4-way parallel. Compared with the single-threaded code, only one cursor option and one parameter are added.
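As a rough illustration of the split-then-combine pattern that the @m option hides, the following Java sketch counts rows per key with a fixed thread pool. It works on an in-memory list of keys to stay short; for a real file, each worker would instead read its own byte range, aligned to line boundaries. The class ParallelGroupCount and the method countByKey are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelGroupCount {
    /** Counts rows per key using `workers` threads, one contiguous slice each. */
    public static Map<String, Long> countByKey(List<String> keys, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Long>>> parts = new ArrayList<>();
            int chunk = (keys.size() + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                int from = Math.min(keys.size(), w * chunk);
                int to = Math.min(keys.size(), from + chunk);
                List<String> slice = keys.subList(from, to);
                // Each worker builds its own local map, so no locking is needed.
                Callable<Map<String, Long>> task = () -> {
                    Map<String, Long> local = new HashMap<>();
                    for (String k : slice) {
                        local.merge(k, 1L, Long::sum);
                    }
                    return local;
                };
                parts.add(pool.submit(task));
            }
            // Combine the per-thread partial counts into the final result.
            Map<String, Long> total = new HashMap<>();
            for (Future<Map<String, Long>> part : parts) {
                part.get().forEach((k, v) -> total.merge(k, v, Long::sum));
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling countByKey(provinces, 4) corresponds loosely to the 4-way parallel grouping above, with the splitting, threading, and merging written out by hand.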

For more calculation examples, please refer to Use SPL in applications - File Calculation