"It is common to process data files of open formats like txt, csv, json, xml and xls in Java appl .."

Hoo RaqForum 19 No.
1 Reply • 584 View • 2 Years ago

Open-source SPL Helps JAVA Handle Files of Open Formats (txt/csv/json/xml/xls)

It is common to process data files of open formats like txt, csv, json, xml and xls in Java applications. Hardcoding in Java is extremely complicated, so we often turn to certain ready-for-use open-source packages. But each package has their weaknesses.

Parsing library. This type of class libraries enables to read an external file in Java as the latter’s internal object through a simpler coding process than hardcoded data retrieval. Common products include OpenCSV for parsing txt and csv files, SJ.json, Gson and JsonPath for parsing json files, XOM, Xerces-J, Jdom, Dom4J intended to parse xml files, and POI, which is the xls parser. JsonPath offers JsonPath syntax and Dom4J provides XPath syntax to handle simple filtering computations. But these libraries generally have weak computing ability and need to turn to hardcoding or other class libraries to accomplish computing tasks.

Spark. As Scala’s class library, Spark supports structured data files and has relatively strong computing ability. The library’s disadvantage is the lack of parsing ability and needs the assistance of a third-party class library, like spark-xml for parsing xml and spark-excel or for parsing xls. This makes the computation not as stable as using native class libraries. The Scala programming language has shortcomings. The far steeper learning curve than Java’s means high costs of learning, and too frequent releases of newer versions causes inconvenience to practical applications in practice.

Embedded databases. Parsing a file and writing it to an embedded database, like SQLite, HSQLDB and Derby, makes it possible to use SQL’s powerful computing ability. Yet, embedded databases have complicated frameworks, and data loading process is quite a hassle that results in serious delay. SQL’s powerful computing ability is not always powerful because it is only good at computing two-dimensional data but not good at handling hierarchical data like json/xml.

Certain class libraries, including file JDBC drivers like simoc csvjdbc, xiao321, Csvjdbc and xlsjdbc and DataFrame libraries such as Tablesaw and Joinery are able to compute structured data files, but they are immature, have weak computing ability and thus have low practical value.

esProc SPL is a better choice.

SPL is a JVM-based, open-source programming language. It offers simple parsing methods to read regular or irregular txt, csv, json, xml and xls files, provides specialized data objects to express two-dimensional data and hierarchical data in the uniform way, and supplies a wealth of functions that can meet various business computing requirements.

txt\csv

SPL has a variety of built-in parsing functions that enable parsing various text files using simple code, and a rich set of functions for computing parsed text files in a consistent way.

Text files of regular formats. A two-dimensional text file, like a database table, contains column names in the first row, corresponds one record to one row, and separates columns using a fixed symbol. The comma-separated csv and the tab-separated txt are two most commonly seen formats. SPL offers T function to parse any text file using one line of code:

s=T("D:\\data\\Orders.csv")

Text files of irregular formats. SPL uses import function that can work with a rich set of options to parse them. To parse a text file separated by horizontal double-dash, for instance:

s=file("D:/Orders.txt").import@t(;,"--")

Text files of irregular formats. SPL uses import function that can work with a rich set of options to parse them. To parse a text file separated by horizontal double-dash, for instance:

A wealth of functions. On a parsed text file, SPL can accomplish SQL-style computations effortless:

Filter: s.select(Amount>1000 && Amount<=3000 && like(Client,"*s*"))

Sort: s.sort(Client,-Amount)

Distinct: s.id(Client)

Group & Aggregate: s.groups(year(OrderDate);sum(Amount))

Join: join(T ("D:/data/Orders.csv"):O,SellerId; T("D:/data/Employees.txt"):E,EId)

Get topN: s.top(-3;Amount)

Get TopN in each group: s.groups(Client;top(3,Amount))

Text files of more irregular formats. Generally, such a text file cannot be directly parsed into structured data. SPL offers flexible functional syntax to obtain desired data through simple handling. In a text file, for instance, every three lines form one record and the second row of each record contains multiple fields. We are trying to rearrange the file to transform it to a structured one sorted by the 3^rd and the 4^th fields:

	A
1	=file("D:\\data.txt").import@si()
2	=A1.group((#-1)\3)
3	=A2.new(~(1):OrderID, (line=~(2).array("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate )
4	=A3.sort(_3,_4)

SPL also supports SQL syntax of SQL92 standard, covering set-oriented computations, case when, with and nested queries. To perform a grouping & aggregation, for instance:

$select year(OrderDate),sum(Amount) from D:/data/Orders.txt group by year(OrderDate)

json\xml

SPL processes hierarchical data like json and xml conveniently by accessing any hierarchical level freely and computing data in a consistent way.

Specialized hierarchical structured data object. SPL expresses the hierarchical structure of json\xml data conveniently. To read the hierarchical json string from a file and parse it, for instance:

	A
1	=file("d:\\xml\\emp_orders.json").read()
2	=json(A1)

The following screenshot shows the hierarchical structure:

undefined

It is similar to read and parse an xml string:

	A
1	=file("d:\\xml\\emp_orders.xml").read()
2	=xml(A1,"xml/row")

Access hierarchical data. SPL accesses data on a certain level through the dot and in a certain position through the subscript.

Get a set of Client field values: A2.(Client)

Get Orders field of the 10^th record (the two-dimensional table value): A2(10).Orders

Get the 5^th record of Orders field in the 10^th record: (A2(10).Orders)(5)

Compute hierarchical data. SPL computes two-dimensional data and hierarchical data in uniform code:

	A
3	=A2.conj(Orders).groups(year(OrderDate);sum(Amount))
4	=A2(10).Orders.select(Amount>1000 && Amount<=3000 && like(Client,"s"))

Handle hierarchical data downloaded from web. Besides local hierarchical data, SPL can also handle hierarchical data downloaded from web like Webservice and RESTful. To retrieve hierarchical json data from RESTful and perform a conditional query, for instance:

	A
1	=httpfile("http://127.0.0.1:6868/restful/emp_orders").read()
2	=json(A1)
3	=A2.conj(Orders)
4	=A3.select(Amount>1000 && Amount<=2000 && like@c(Client,"business"))

Many special data sources, like MongoDB, ElasticSearch and SalesForce also store data in hierarchical level. SPL can directly retrieve data from them for further computation.

xls

SPL can read/write various regular- or irregular-format xls files effortlessly with strongly encapsulated POI and compute them with built-in functions and syntax through consistent coding.

SPL still uses T function to read row-wise xls files of regular formats:

=T("d:\\Orders.xls")

And performs subsequent computations in a similar way of handling text files.
SPL uses xlsexport function to generate a regular-format, row-wise xls file. To write data table A1 to the first sheet of a new xls file and make the first row contain column names, for instance, SPL needs only one line of code:

=file("e:/result.xlsx").xlsexport@t(A1)

xlsexport function has many functionalities. It can write a data table to a specified sheet, certain rows of the data table to it, or specified columns of the data table to it:

=file("e:/scores.xlsx").xlsexport@t(A1,No,Name,Class,Maths)

xlsexport function can also be conveniently used to append data. Suppose there is an xls file containing data and we are trying to append data in data table A1 to the end of the file, with same appearance as the last row of the existing xls file:

=file("e:/scores.xlsx").xlsexport@a(A1)

SPL uses xlsimport function to read data from row-wise xls files of irregular formats. The function has rich and simplistic functionalities.

Import an xls file without column headers and detailed data starts from the first row: file("D:\\Orders.xlsx").xlsimport()

Import an xls file by skipping the title in the first two row: file("D:/Orders.xlsx").xlsimport@t(;,3)

Import xls data from the 3^rd row to the 10^th row: file("D:/Orders.xlsx").xlsimport@t(;,3:10)

Import 3 columns of an xls file: file("D:/Orders.xlsx").xlsimport@t(OrderID,Amount,OrderDate)

Import a sheet named "sales": file("D:/Orders.xlsx").xlsimport@t(;"sales")

The xlsimport function also have other functionalities like reading N rows backwards, opening an xls file using password and reading a large xls file.

xls files with extremely irregular formats. SPL uses xlscell function to read/write data in a specified range of a given sheet. To read cell A2 in sheet1, for instance:

=file("d:/Orders.xlsx").xlsopen().xlscell("C2")

SPL is capable of parsing free-format xls files with its agile syntax. One instance is to parse the following file into a standard two-dimensional table (table sequence).

undefined

The file has a very irregular format. Writing Java code directly with POI will be a heavy and time-consuming job, but SPL code is short and concise:

	A	B	C
1	=create(ID,Name,Sex,Position,Birthday,Phone,Address,PostCode)
2	=file("e:/excel/employe.xlsx").xlsopen()
3	[C,C,F,C,C,D,C,C]	[1,2,2,3,4,5,7,8]
4	for	=A3.(~/B3(#)).(A2.xlscell(~))
5		if len(B4(1))==0	break
6		>A1.record(B4)
7		>B3=B3.(~+9)

The xlscell function can also be used to write data to an irregular-format range. For instance, the blue cells in the following xls file contain irregular table headers and we are trying to fill data in the corresponding blank cells:

undefined

The POI code will be bloated and lengthy. SPL code, as shown below, is short and concise:

	A	B	C	D	E	F
1	Mengniu Funds	2017	3	58.2	364	300
2	8.5	50	200	100	400	200
3	182.6	76.3	43.7	28.5	16.4
4	120	1.07	30	0.27	90	0.8
5	154	6	4
6	=file("e:/result.xlsx")		=A6.xlsopen()
7	=C6.xlscell("B2",1;A1)		=C6.xlscell("J2",1;B1)		=C6.xlscell("L2",1;C1)
8	=C6.xlscell("B3",1;D1)		=C6.xlscell("G3",1;E1)		=C6.xlscell("K3",1;F1)
9	=C6.xlscell("B6",1;[A2:F2].concat("\t"))			=C6.xlscell("H6",1;[A3:E3].concat("\t"))
10	=C6.xlscell("B9",1;[A4:F4].concat("\t"))			=C6.xlscell("B11",1;[A5:C5].concat("\t"))
11	=A6.xlswrite(B6)

Note that row6, row9 and row11 have continuous cells where SPL condenses code to fill them together. POI, however, can only operate cell by cell.

Remarkable computational capacity

SPL supplies a great number of string functions and date functions, as well as convenient syntax to effectively simplify code for achieving complex logics that both SQL and stored procedures find hard to handle.

A great wealth of date and string functions. Besides functions for performing regular computations like getting a date before or after a specified date and string truncation, SPL offers more date and string functions that outrun SQL in both quantity and functionality:

Get the date before or after a specified number of quarters: elapse@q("2020-02-27",-3) // Return 2019-05-27

Get the date N workdays after:

workday(date("2022-01-01"),25) // Return 2022-02-04

String functions:

Check whether a string all consists of letters:

isdigit("12345") // Return true

Get a string before a specified substring:

substr@l("abCDcdef","cd") // Return abCD

Split a string into an array of substrings by vertical bar: ["aa","bb","cc"]

"aa|bb|cc".split("|") // Return

SPL also offers functions to get a date before or after a number of years, get which quarter the date belongs to, split a string according to a regular expression, get the where or select part of a SQL statement, get words from a string, split HTML by the specific marker, etc.

Convenient function syntax. SPL supports function options. This allows functions with similar functionalities to use same name and distinguishes them with different options. The basic functionality of select function is to filter data. If we need to get the first eligible record, we use @1 option:

T.select@1(Amount>1000)

SPL uses @b option to perform a quick filtering on ordered data using the binary search algorithm:

T.select@b(Amount>1000)

SPL uses @o option to perform order-based grouping on data ordered by grouping field that puts neighboring records having same grouping field values:

T.groups@o(Client;sum(Amount))

Function options can work together:

Orders.select@1b(Amount>1000)

Usually, parameters in a structured computation function are complicated. SQL, for instance, uses a lot of keywords to divide parameters of a statement into multiple groups, causing inconsistent statement structure. SPL has hierarchical parameters. It employs semicolon, comma and colon to identify parameters to three levels, and writes complicated parameters in a simple way.

join(Orders:o,SellerId ; Employees:e,EId)

Achieving complex business logics simply. SPL has excellent computational capability. It handles order-based computation, set-oriented computations, joins and stepwise computations that SQL/stored procedures find hard to handle effortlessly. To count the longest number of days when a stock rises consecutively, SPL has the following code:

	A
1	// File parsing
2	=a=0,A1.max(a=if(price>price[-1],a+1,0))

To find the first n big customers whose orders amount takes up at least half of the total and sort them by amount in descending order:

	A	B
1	//File parsing
2	=A1.sort(amount:-1)	/ Sort orders amount in descending order
3	=A2.cumulate(amount)	/ Get the sequence of cumulative amounts
4	=A3.m(-1)/2	/ Calculate the final cumulative amount, i.e. the total
5	=A3.pselect(~>=A4)	/ Find the position where the cumulative amount exceeds half of the total
6	=A2(to(A5))	/ Get target values by positions

Cross-data-source computations. SPL supports diverse data sources, including not only structured data files but various types of databases and NoSQL like Hadoop, Redis, Kafka and Cassandra, and thus can accomplish computations involving different types of sources, like a join between xls and txt:

=join(T("D:/Orders.xlsx"):O,SellerId; T("D:/Employees.txt"):E,EId)

Ease of integration

SPL offers convenient and easy to use JDBC driver. A piece of simple code, like SQL, can be directly embedded in a Java program:

Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
String str="=T(\"D:/Orders.xls\").select(Amount>1000 && Amount<=3000 && like(Client,\"*s*\"))";
ResultSet result = statement.executeQuery(str);

Ease of integration

SPL offers convenient and easy to use JDBC driver. A piece of simple code, like SQL, can be directly embedded in a Java program:

Storing computing code separately from the Java program reduces coupling. For complicated SPL code, we can first save it as a script file and then invoke it in the Java program as we call the stored procedure. This effectively reduces coupling between computing code and the front-end application.

Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("{call scriptFileName(?, ?)}");
statement.setObject(1, "2020-01-01");
statement.setObject(2, "2020-01-31");
statement.execute();

SPL is interpreted language that can achieve hot swap by placing computing code outside the Java program. An interpreted language executes in real-time without recompilation after any changes, without restarting the Java application. This makes maintenance convenient and creates a more stable system.

Though many class libraries are available for computing txt\csv\json\xml\xls files, they have their defects. SPL, as a JVM-based, open-source programming language, can parse structured data files of regular-or irregular-format, represent two-dimensional data and hierarchical data in the uniform way and perform common SQL-style computations using consistent coding. SPL boasts a richer set of string and date functions, more convenient syntax and more powerful computing capacity. It offers integration-friendly JDBC driver, supports placing algorithms in or outside the application – which can effectively decrease system coupling, and achieves hot-swappable code.

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/ydhVnFH9

Youtube 👉 https://www.youtube.com/@esProc_SPL

Promote

Hoo • 584 View • 2 Years ago

Open-source SPL Helps JAVA Handle Files of Open Formats (txt/csv/json/xml/xls)

txt\csv

json\xml

xls

Remarkable computational capacity

Ease of integration

Ease of integration

ToC