Parsing and Generating Excel Files
Excel files are frequently used. It is common that we need to parse an Excel file into structured data or store processed structured data into an Excel file during data processing and analysis. Users can use the open-source Apache Poi package to read and write Excel cell data in Java. Yet powerful as it is, the API package has a certain learning curve to overcome. Moreover, many Excel files have irregular formats and there are various file structures. These cause heavy programming work for reading and writing Excel files, and programs have low applicability. For each file structure, we need to specifically write a program.
The open-source Open esProc computing package offers easier to use and more convenient functionalities of reading and writing Excel files. It generates simple and succinct SPL scripts that only have one or two lines of code.
1. Parsing
SPL opens an Excel file using file function and reads data from it as a table sequence or a cursor using xlsimport function. When the file can fit into the memory, just read it as a table sequence. When the file is too large to fit into the memory, read it as a cursor to reduce memory usage. SPL also supports reading data in a specified cell, or data in cells of a specified range.
Here are examples of reading data from Excel files.
1.1 Read simple rowwise files
In a rowwise Excel file, each row is a record and all rows have same columns. The file is similar to a database table.
=file("e:/scores.xlsx").xlsimport@t()
The parameter in file function can be the target file’s absolute path or relative path (which is relative to the mainPath configured in SPL configuration file). The xlsimport function, by default, reads all rows and columns in the file’s first Sheet to generate a table sequence. @t option enables reading the first row as headers, and column values in this row are column names in the table sequence. The absence of @t option means no header row, and columns will be referenced using their ordinals, such as #1 and #2.
1.2 Read part of the Excel file
=file("e:/scores.xlsx").xlsimport@t(No,Name,Class,Maths)
The above statement specifies column name parameters to read data from No, Name, Class and Maths columns only.
=file("e:/scores.xlsx").xlsimport@t(;,1:100)
The statement reads all columns from row 1 to row 100 in the file’s first Sheet.
=file("e:/scores.xlsx").xlsimport@t(;,1:-10)
The statement reads all columns from the first row to the 10th from the last row in the file’s first Sheet. A negative number represents a row ordinal counted backward.
1.3 Read a specified Sheet
=file("e:/scores.xlsx").xlsimport@t(;"School2")
The statement reads all columns in a Sheet named School2.
1.4 Read Sheet information
As the above Excel file has more than one Sheets, SPL xlsopen function opens the file and generates an Excel object, which is a table sequence containing information of the Sheets.
=file("e:/scores.xlsx").xlsopen()
Below is data in the Excel object:
Each row records information of a Sheet. The column names (stname, nrows and ncols) represents sheet name, number of rows and number of columns respectively.
1.5 Read a file with free-form layout
In an Excel file with free-form layout, a record may span across multiple rows, a cell containing a column value may be next to or directly below a cell containing a column name, or there may be merged cells spanning columns or rows. Yet each record covers the same number of rows of same structure. The circular data reading will read each record according to the number of rows it covers.
To read data in specified cells, SPL xlsopen function reads the Excel file as an Excel object and then calls xlscell function to read data from the object.
Below is part of an Excel file (employee.xlsx) with the free-form layout:
SPL has the following script to read data from it:
A |
B |
C |
|
1 |
=create(ID,Name,Sex,Position,Birthday,Phone,Address,PostCode) |
||
2 |
=file(“e:/excel/employe.xlsx").xlsopen() |
||
3 |
[C,C,F,C,C,D,C,C] |
[1,2,2,3,4,5,7,8] |
|
4 |
For |
=A3.(~/B3(#)).(A2.xlscell(~)) |
|
5 |
if len(B4(1))==0 |
break |
|
6 |
>A1.record(B4) |
||
7 |
>B3=B3.(~+9) |
A1 Create an empty table sequence made up of ID, Name, Sex, Position, Birthday, Phone, Address and PostCode columns.
A2 Open the target Excel file and generate an Excel object.
A3 Define a sequence of column numbers commanding cells containing employee information.
B3 Define a sequence of row numbers commanding cells containing employee information.
A4 Read information of employees by loop using the for statement.
B4 A3.(~/B3(#)) gets the sequence of cells containing information of the current employee and then reads in the cell values to generate a sequence of employee information. The first round of reading gets [C1,C2,F2,C3,C4,D5,C7,C8] and the second round gets [C10,C11,F11,C12,C13,D14,C16,C17] and so on, by adding 9 to the row number each time.
B5 Check whether the current employee ID is null, and exit the loop and terminate the data reading if it is null.
B6 Append an employee record to A1’s table sequence.
B7 Add 9 to row numbers of the current employee to read information of the next employee.
Below is the data retrieved from cell A1:
Refer to Reading and Writing Excel Files in SPL to learn more about parsing Excel files.
2. Generation
SPL offers file function to open an Excel file, and xlsexport function to write data in a table sequence or a cursor to the Excel file. The xlsopen function opens an Excel file as an Excel object and call the object’s xlscell function to write data into the specified cells in the Excel file.
Here are examples of writing data to Excel files.
2.1 Write as simple rowwise files
Suppose cell A1 in the SPL script has the table sequence data to be written into the Excel file.
=file("e:/scores.xlsx").xlsexport@t(A1)
The above statement writes all data in A1’s table sequence into the first Sheet in scores.xlsx. @t option enables writing column names in the first row, that is, making the first row the column header row. The absence of @t option means no header row.
=file("e:/scores.xlsx").xlsexport@t(A1,No,Name,Class,Maths)
The above statement specifies column name parameters to write only No, Name, Class and Maths columns to the Excel file.
2.2 Write into the specified Sheet
=file("e:/scores.xlsx").xlsexport@t(A1;2)
The statement writes all data in A1’s table sequence to the second Sheet in scores.xlsx.
=file("e:/scores.xlsx").xlsexport@t(A1; "School2")
The statement writes all data in A1’s table sequence to the Sheet named School2 in scores.xlsx.
2.3 Append data and specify the format
=file("e:/scores.xlsx").xlsexport@a(A1)
@a means that, when the target Excel file already exists, write all data in A1’s table sequence to the end of the first Sheet in scores.xlsx according to the format of the last row.
In order to write data to an Excel file with pre-defined format, we can pre-create the file and, in each column of the last row, define the data type for the corresponding column. Then we use @a option to perform append-style writing to write data according to the defined format.
We define the file format as follows, for instance:
Then we delete data from the second row to make it empty. Data appending will identify the last non-empty row as the header row and the next row to it as the beginning of detail rows. Below is the desired Excel file fed with data:
2.4 Write as Excel files of fixed row and column formats
In an Excel file, the number of rows and columns, as well as the cell format, are fixed. There are also empty cells, to which certain data will be entered. SPL xlsopen function reads the Excel file as an Excel object and calls the object’s xlscell function to write data to specific cells.
Based on the following Excel table, a fund company’s head office requires that each branch office fill data in it and sends the table back.
Below is the SPL script for performing the data filling task:
A |
B |
C |
D |
E |
F |
|
1 |
Mengniu Funds |
2017 |
3 |
58.2 |
364 |
300 |
2 |
8.5 |
50 |
200 |
100 |
400 |
200 |
3 |
182.6 |
76.3 |
43.7 |
28.5 |
16.4 |
|
4 |
120 |
1.07 |
30 |
0.27 |
90 |
0.8 |
5 |
154 |
6 |
4 |
|||
6 |
=file("e:/excel/result.xlsx") |
=A6.xlsopen() |
||||
7 |
=C6.xlscell("B2",1;A1) |
=C6.xlscell("J2",1;B1) |
=C6.xlscell("L2",1;C1) |
|||
8 |
=C6.xlscell("B3",1;D1) |
=C6.xlscell("G3",1;E1) |
=C6.xlscell("K3",1;F1) |
|||
9 |
=C6.xlscell("B6",1;[A2:F2].concat("\t")) |
=C6.xlscell("H6",1;[A3:E3].concat("\t")) |
||||
10 |
=C6.xlscell("B9",1;[A4:F4].concat("\t")) |
=C6.xlscell("B11",1;[A5:C5].concat("\t")) |
||||
11 |
=A6.xlswrite(C6) |
Suppose the data to be filled in is already prepared (which is stored in the first five rows). In the above example Excel table, the first six empty cells in which data will be entered are separated and discontinuous. Each time data can only be entered into one cell. From the sixth row, cells to be filled in are continuous. Now we concatenate the to-be-entered data into a string separated by \t and fill in data from the specified cell in turn. When data entry is finished, we write C6’s Excel object back to result.xlsx.
Below is the finished Excel table result.xlsx:
Refer to SPL: Reading and Writing Excel Files to learn more about Excel file generation.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version