Structured Text File Processing in SPL (II)
Binary operations
Set operations (File comparison)
Here are f1.txt and f2.txt (Sample data is listed below). Their first lines are column names. Task: calculate intersection between their Name fields.
f1.txt:
Name Dept Rachel Sales Ashley R&D Matthew Sales Alexis Sales Megan Marketing |
f2.txt:
Name Dept Emily HR Ashley R&D Matthew Sales Alexis Sales Megan Marketing |
SPL script:
A |
B |
|
1 |
=file("E:\\f1.txt").import@t() |
=file("E:\\f2.txt").import@t() |
2 |
=[A1.(Name),B1.(Name)].isect() |
isect() function calculates intersection of two sets. A1.(Name) retrieves A1’s Name column and return it as a sequence; B1.(Name) retrieves B1’s Name column and return it as a sequence. Below is the final result:
Similarly, we use union() function to calculate union, diff() function to calculate difference, and conj() function to calculate concatenation (which is equivalent to union all). You can also use operators to generate more concise statement, as shown below:
A1.(Name) ^ B1.(Name)
A1.(Name) & B1.(Name)
A1.(Name) \ B1.(Name)
A1.(Name) | B1.(Name)
In the above example, the whole text file is imported and automatically split into multiple fields and one field is retrieved for calculating intersection. To compare lines of data, just add @s option to import the text file as a single-column table sequence where the column name is Name(Tab)Dept, which contains the invisible character tab. The column name is illegal and can’t be referenced directly. In this case, we can use numbers to represent the column names to write the following code:
A |
B |
|
1 |
=file("D:\\f1.txt").import@ts() |
=file("D:\\f2.txt").import@ts() |
2 |
=A1.(#1)^B1.(#1) |
A convenient alternative is to return only the single-column detailed data as a set (sequence) by adding @i option. This way you can directly calculate the intersection with A1^B1:
A |
B |
|
1 |
=file("D:\\f1.txt").import@tis() |
=file("D:\\f2.txt").import@tis() |
2 |
=A1^B1 |
The two algorithms have same results:
Merge
Below are two text files – table1.txt and table2.txt, sorted by logical primary key ID1 and ID2 respectively. Task: Update table1 with data in table2 according to the primary keys. Rule: Replace a record’s field values in table1 with values in the corresponding record in table2 if their primary key values are same, and insert the table2’s record to table1 if their primary key values are different.
Source files:
Table1 |
Table2 |
ID1 ID2 Col1 Col2 2 B1 2B1 row 2 B2 2B2 row 3 B1 3B1 4 B3 4B3 Row |
ID1 ID2 Col1 Col2 1 B1 diff row 2 B1 diff row 2 diff diff row 3 B1 3B1 |
Here’s the updated table1:
ID1 |
ID2 |
Col1 |
Col2 |
1 |
B1 |
diff |
row |
2 |
B1 |
diff |
row |
2 |
B2 |
2B2 |
row |
2 |
diff |
diff |
row |
3 |
B1 |
3B1 |
|
4 |
B3 |
4B3 |
row |
SPL script:
A |
B |
|
1 |
=file("D:\\table1.txt").cursor@t() |
=file("D:\\table2.txt").cursor@t() |
2 |
=[B1,A1].mergex@u(ID1,ID2) |
|
3 |
=file("D:\\result.txt").export@t(A2) |
Read in the two files as cursors and update A1 with B1 by their logical primary keys. mergex() function merges multiple data sets while keeping the result set in order; @u option enables calculating their union. The merging result will be exported to a target file.
Cursors can handle files of any size. If a file is unordered, you need to sort it first by adding a sorting expression, as shown below:
[B1.sortx(ID1,ID2),A1.sortx(ID1,ID2)].mergex@u(ID1,ID2)
Order-based set operations
Both files below are ordered by Name and Dept. Task: calculate their intersection.
f1.txt:
Name Dept Alexis Sales Ashley R&D Matthew Sales Megan Marketing Rachel Sales |
f2.txt:
Name Dept Alexi Sales Ashley R&D Emily HR Matthew Sales Megan Marketing |
This is an order-based set operation. We use merge() function to calculate the intersection, which is more efficient. Below is SPL script:
A |
B |
|
1 |
=file("D:\\f1.txt").import@t() |
=file("D:\\f2.txt").import@t() |
2 |
=[B1,A1].merge@i(Name,Dept) |
|
3 |
=file("D:\\result.txt").export@t(A2) |
merge() function merges table sequences. It works with @i option to calculate intersection, @u option to calculate union, and @d option to calculate difference.
Here’s the final result:
Sort files with sort() function if they are unordered. Note that sorting and merging small files is slower than processing them with common set-operation functions. merge() is more suitable for handling large files.
Association operations
emp.txt is a tab separated text file, where EId field points to sales.txt’s SellerId field. Task: Align emp.txt’s Dept field, Name field and Gender field to sales.txt.
emp.txt:
EId |
State |
Dept |
Name |
Gender |
Salary |
Birthday |
2 |
New York |
Finance |
Ashley |
F |
11000 |
1980/07/19 |
3 |
New Mexico |
Sales |
Rachel |
F |
9000 |
1970/12/17 |
4 |
Texas |
HR |
Emily |
F |
7000 |
1985/03/07 |
5 |
Texas |
R&D |
Ashley |
F |
16000 |
1975/05/13 |
6 |
California |
Sales |
Matthew |
M |
11000 |
1984/07/07 |
7 |
Illinois |
Sales |
Alexis |
F |
9000 |
1972/08/16 |
8 |
California |
Marketing |
Megan |
F |
11000 |
1979/04/19 |
SPL script:
A |
|
1 |
=sOrder=file("D:\\sales.txt").import@t() |
2 |
=emp=file("D:\\emp.txt").import@t(EId,Name,Dept,Gender) |
3 |
=join@1(sOrder:s,SellerId;emp:e,EId) |
4 |
=A3.new(s.OrderID, s.Client, s.SellerId, s.Amount, s.OrderDate, e.Name, e.Dept, e.Gender) |
join() function performs a left join where two tables are renamed s and e respectively. @l enables a left join and @f enables a full join. By default the function performs an inner join. A4 retrieves desired fields from the joining result to form a new two-dimensional table, as shown below:
Complex calculations
Multilevel association
In the following diagram, there are 5 data files. Orders is the fact table and the other four are dimension tables. Task: Find orders where the customer and the supplier are in the same region, group those orders by cities, and count orders and sum the orders amount for each city.
The tables’ relationships are as follows:
SPL script:
A |
|
1 |
=file("D:/files/orders.txt").import@t() |
2 |
=file("D:/files/customer.txt").import@t() |
3 |
=file("D:/files/product.txt").import@t() |
4 |
=file("D:/files/supplier.txt").import@t() |
5 |
=file("D:/files/region.txt").import@t() |
6 |
=A2.switch(city,A5:city) |
7 |
=A4.switch(city,A5:city) |
8 |
=A3.switch(sid,A4:sid) |
9 |
=A1.switch(pid,A3:pid; cid,A2:cid) |
10 |
=A1.select(pid.sid.city.region==cid.city.region) |
11 |
=A10.groups(cid.cid:cid;count(oid):count,sum(price*quantity):amount) |
Import the files respectively, create associations between the fact table and each of the dimension tables, get eligible orders according to the associations, and then groups them and performs aggregations. switch() function creates foreign key style associations.
Compare files of different structures
Data.txt is a tab-delimited, six-field text file, where here field contains semicolon-separated strings. List.txt is a single-column file. Task: If a record in Data.txt where the split here field value matches any line in List.txt, output this record to result.txt.
List.txt:
Gee |
Whiz |
Lol |
Data.txt:
field1 |
field2 |
field3 |
here |
field5 |
etc |
A |
B |
2 |
Gee;Whiz;Hello |
13 |
12 |
A |
B |
2 |
Gee;Whizz;Hi |
56 |
32 |
E |
4 |
Btm;Lol |
16 |
2 |
|
T |
3 |
Whizz |
13 |
3 |
SPL script:
A |
|
1 |
=file("d:\\Data.txt").import@t() |
2 |
=file("d:\\List.txt").read@n() |
3 |
=A1.select(here.array(";")^A2!=[]) |
4 |
=file("d:\\result.txt").export@t(A3) |
select() function gets desired records by splitting each here field value into a sequence of strings with array() function and calculating its intersection with A2’s sequence using operator “^” and according to the condition that the intersection result isn’t empty.
The final result:
field1 |
field2 |
field3 |
here |
field5 |
etc |
A |
B |
2 |
Gee;Whiz;Hello |
13 |
12 |
A |
B |
2 |
Gee;Whizz;Hi |
56 |
32 |
E |
4 |
Btm;Lol |
16 |
2 |
Data extraction from a multilevel directory
D:\files is a multilevel directory where each subdirectory contains many text files. Task: Retrieve a specified line (say, the 2nd one) from each file to write to d:\result.txt.
SPL script:
A |
|
1 |
=directory@p(path) |
2 |
=A1.(file(~).cursor@s()) |
3 |
=A2.((~.skip(1),~.fetch@x(1))) |
4 |
=A3.union() |
5 |
=file("d:\\result.txt").export@a(A4) |
6 |
=directory@dp(path) |
7 |
=A6.(call("c:\\readfile.dfx",~)) |
Set the initial value of parameter path as D:\files to begin data extraction; then call the script (c:\readfile.dfx) iteratively to pass different value to parameter path each time.
directory() function gets files under the root directory specified by parameter path; @p option requires full paths and @d gets directory names only.
~.skip(1) skips the first line.
~.fetch@x(1) retrieves the line at the current position (the 2nd line) and closes the cursor.
Split a file as multiple ones
sales.txt records a large number of orders. Task: Split the file into multiple files by year and month; the file name format is year-month.txt.
SPL script:
A |
|
1 |
=file("D:\\ sales.txt").import@t() |
2 |
=A1.group(string(OrderDate,"yyyy-MM");~) |
3 |
=A2.run(file("d:\\temp\\"+#1+".txt").export(#2)) |
Group the file by year and month and loop each group to write data to a target file. Below is file 2009-01.txt:
65 YZ 8 29600.0 2009-01-06 62 JAXE 11 8134.0 2009-01-06 64 HP 13 20000.0 2009-01-02 60 PWQ 16 3430.0 2009-01-05 63 SJCH 16 5880.0 2009-01-02 61 SJCH 19 1078.0 2009-01-08 |
Use cursor to import the file if its size is greater than memory capacity. If a group is still can’t be imported into the memory wholly, use groupx() function to group the file. The code structure maintains unchanged.
Stock calculation
Stock.txt stores in-out stock records. One item may have multiple in-out records per day, or may have no records in days. The initial value for an item is 0. In means in-stock and out means out-stock. Task: Calculate stock for each item. Below is the source data:
date name quantity flag 2014-04-01 Item1 15 In 2014-04-01 Item1 4 In 2014-04-02 Item1 3 In 2014-04-02 Item1 10 Out 2014-04-03 Item1 3 In 2014-04-04 Item1 5 Out 2014-04-07 Item1 4 In 2014-04-10 Item1 2 Out 2014-04-01 Item2 20 In 2014-04-02 Item3 30 In 2014-04-03 Item3 14 Out |
SPL script:
A |
B |
|
1 |
=file("D:\\stock.txt").import@t() |
|
2 |
=A1.group(name,date;~.select(flag==”In”).sum(quantity):in,~.select(flag==”Out”).sum(quantity):out) |
|
3 |
=A2.group(name) |
=periods((t=A2.id(date)).min(),t.max(),1) |
4 |
for A3 |
=A4.align(B3,date) |
5 |
>c=0 |
|
6 |
=B4.new(A4.name:name, |
|
7 |
>A8=A8|B6 |
|
8 |
A2 calculates the sum of quantities per item per day. B3 gets complete, continuous dates according to the earliest and the latest dates. Then group records by items and loop over each group to align records to B3. B6 calculates the each item’ daily stock and stores all stocks in A8.
Below is A8’s result:
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL