Retrieve data from complex row-based text files
1. Methods
During data processing, the data sometimes come from text files with complex formats. To retrieve required data from them, we need to consider the following points:
1 Identity the structure of data and fields to be retrieved.
2 Determine whether a line of the text file contains valid data.
3 Find the rules of retrieving each field from the valid data rows.
The rules may be different for different text data, but there is always a rule to be parsed.
2. Examples
There is a customer quotation file item.txt in text format, as shown in the figure below:
The row before the horizontal line is the complex table header, and each row after it is a quotation record with blank lines between them. What the above figure shows is just one table header and quotation record area, which will repeatedly appear in the text file. The red boxes show the Unit Price and Exp. Date field columns respectively, between which there are Quotation Number Customer Code, Customer Name field columns with spaces between the data in each column.
Now we need to retrieve the quotation data in the text file and store it to an Excel file as shown in the figure below:
1. Observe and find the rules of the text file.
The following rules can be found:
(1) Lines with less than 136 characters have no valid information and can be skipped.
(2) The required data is located from the 59th to 136th columns of each row.
(3) Split the valid information of each row by spaces. If the first split value is a number, then this row is quotation record; otherwise it can be skipped. The first split value is Unit Price column, the second one is Quotation Number column, the third is Customer Code column, the last is Contract Expiry Date column, and the fourth to the penultimate one connected by spaces is Customer Name column.
2. Write the script:
A |
B |
C |
|
1 |
=create(Customer_Code,Customer_Name,Quotation_No,Unit_Price,Contract_Expiry_Date) |
||
2 |
=file("E:/txt2excel/item.txt").read@n() |
||
3 |
for A2 |
if len(A3)<136 |
next |
4 |
=right(left(A3,136),-58) |
=B4.split@tp() |
|
5 |
if !ifnumber(C4(1)) |
next |
|
6 |
=C4.m(4:C4.len()-1).concat(" ") |
||
7 |
>A1.insert(0,C4(3),B6,C4(2),C4(1),C4(C4.len())) |
||
8 |
=file("E:/txt2excel/item.xlsx").xlsexport@t(A1) |
A1 Create the target data set.
A2 Open the quotation text file item.txt and read in the content of the file. The @n option means to read each line as a string.
A3 Loop through each line of the text file and apply the rules found previously.
B3C3 Skip the current line if the length of the line is less than 136.
B4 Retrieve the data located from the 59th to 136th columns of the row.
C4 Split the data retrieved in B4 by spaces; the t option indicates to remove the blank at both ends after the splitting, and the p option means to parse the split strings into corresponding data types.
B5C5 Skip the current line if the first value split in C4 is not a number.
B6 Concatenate the fourth to the penultimate values split in C4 as a string with spaces.
B7 Insert the third value split in C4, the second value split in B6 and C4, the first value, and the last value in the new record of A1 in order.
A8 Store all retrieved data to Excel file item.xlsx.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_Desktop/
SPL Learning Material 👉 https://c.scudata.com
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProcDesktop
Linkedin Group 👉 https://www.linkedin.com/groups/14419406/