Structured Text File Processing in SPL (II)

 

Binary operations

*  Set operations (File comparison)

  Here are f1.txt and f2.txt (Sample data is listed below). Their first lines are column names. Task: calculate intersection between their Name fields.

  f1.txt:

Name  Dept

Rachel  Sales

Ashley  R&D

Matthew Sales

Alexis  Sales

Megan  Marketing

 

  f2.txt:

Name  Dept

Emily  HR

Ashley  R&D

Matthew  Sales

Alexis  Sales

Megan  Marketing

 

  SPL script:


A

B

1

=file("E:\\f1.txt").import@t()

=file("E:\\f2.txt").import@t()

2

=[A1.(Name),B1.(Name)].isect()


 

  isect() function calculates intersection of two sets. A1.(Name) retrieves A1’s Name column and return it as a sequence; B1.(Name) retrieves B1’s Name column and return it as a sequence. Below is the final result:

  undefined

  Similarly, we use union() function to calculate union, diff() function to calculate difference, and conj() function to calculate concatenation (which is equivalent to union all). You can also use operators to generate more concise statement, as shown below:

  A1.(Name) ^ B1.(Name)

  A1.(Name) & B1.(Name)

  A1.(Name) \ B1.(Name)

  A1.(Name) | B1.(Name)

  In the above example, the whole text file is imported and automatically split into multiple fields and one field is retrieved for calculating intersection. To compare lines of data, just add @s option to import the text file as a single-column table sequence where the column name is Name(Tab)Dept, which contains the invisible character tab. The column name is illegal and can’t be referenced directly. In this case, we can use numbers to represent the column names to write the following code:


A

B

1

=file("D:\\f1.txt").import@ts()

=file("D:\\f2.txt").import@ts()

2

=A1.(#1)^B1.(#1)


 

  A convenient alternative is to return only the single-column detailed data as a set (sequence) by adding @i option. This way you can directly calculate the intersection with A1^B1:


A

B

1

=file("D:\\f1.txt").import@tis()

=file("D:\\f2.txt").import@tis()

2

=A1^B1


 

  The two algorithms have same results:

  undefined

 

*  Merge

  Below are two text files – table1.txt and table2.txt, sorted by logical primary key ID1 and ID2 respectively. Task: Update table1 with data in table2 according to the primary keys. Rule: Replace a record’s field values in table1 with values in the corresponding record in table2 if their primary key values are same, and insert the table2’s record to table1 if their primary key values are different.
  Source files:

Table1

Table2

ID1 ID2 Col1 Col2

2  B1  2B1  row

2  B2  2B2  row

3  B1  3B1

4  B3  4B3  Row

ID1  ID2  Col1 Col2

1  B1  diff  row

2  B1  diff  row

2  diff  diff  row

3  B1  3B1

 

  Here’s the updated table1:

ID1

ID2

Col1

Col2

1

B1

diff

row

2

B1

diff

row

2

B2

2B2

row

2

diff

diff

row

3

B1

3B1


4

B3

4B3

row

 

  SPL script:


A

B

1

=file("D:\\table1.txt").cursor@t()

=file("D:\\table2.txt").cursor@t()

2

=[B1,A1].mergex@u(ID1,ID2)


3

=file("D:\\result.txt").export@t(A2)


 

  Read in the two files as cursors and update A1 with B1 by their logical primary keys. mergex() function merges multiple data sets while keeping the result set in order; @u option enables calculating their union. The merging result will be exported to a target file.

  Cursors can handle files of any size. If a file is unordered, you need to sort it first by adding a sorting expression, as shown below:

  [B1.sortx(ID1,ID2),A1.sortx(ID1,ID2)].mergex@u(ID1,ID2)

*  Order-based set operations

  Both files below are ordered by Name and Dept. Task: calculate their intersection.

  f1.txt:

Name  Dept

Alexis  Sales

Ashley  R&D

Matthew Sales

Megan  Marketing

Rachel  Sales

 

  f2.txt:

Name  Dept

Alexi  Sales

Ashley  R&D

Emily  HR

Matthew Sales

Megan  Marketing

 

  This is an order-based set operation. We use merge() function to calculate the intersection, which is more efficient. Below is SPL script:


A

B

1

=file("D:\\f1.txt").import@t()

=file("D:\\f2.txt").import@t()

2

=[B1,A1].merge@i(Name,Dept)


3

=file("D:\\result.txt").export@t(A2)


 

  merge() function merges table sequences. It works with @i option to calculate intersection, @u option to calculate union, and @d option to calculate difference.

  Here’s the final result:

  undefined

  Sort files with sort() function if they are unordered. Note that sorting and merging small files is slower than processing them with common set-operation functions. merge() is more suitable for handling large files.

*  Association operations

  emp.txt is a tab separated text file, where EId field points to sales.txt’s SellerId field. Task: Align emp.txt’s Dept field, Name field and Gender field to sales.txt.

  emp.txt:

EId

State

Dept

Name

Gender

Salary

Birthday

2

New York

Finance

Ashley

F

11000

1980/07/19

3

New Mexico

Sales

Rachel

F

9000

1970/12/17

4

Texas

HR

Emily

F

7000

1985/03/07

5

Texas

R&D

Ashley

F

16000

1975/05/13

6

California

Sales

Matthew

M

11000

1984/07/07

7

Illinois

Sales

Alexis

F

9000

1972/08/16

8

California

Marketing

Megan

F

11000

1979/04/19

 

  SPL script:


A

1

=sOrder=file("D:\\sales.txt").import@t()

2

=emp=file("D:\\emp.txt").import@t(EId,Name,Dept,Gender)

3

=join@1(sOrder:s,SellerId;emp:e,EId)

4

=A3.new(s.OrderID,   s.Client, s.SellerId,  s.Amount,   s.OrderDate,

e.Name, e.Dept,   e.Gender)

 

  join() function performs a left join where two tables are renamed s and e respectively. @l enables a left join and @f enables a full join. By default the function performs an inner join. A4 retrieves desired fields from the joining result to form a new two-dimensional table, as shown below:

  undefined

Complex calculations

*  Multilevel association

  In the following diagram, there are 5 data files. Orders is the fact table and the other four are dimension tables. Task: Find orders where the customer and the supplier are in the same region, group those orders by cities, and count orders and sum the orders amount for each city.

  The tables’ relationships are as follows:

  undefined

  SPL script:


A

1

=file("D:/files/orders.txt").import@t()

2

=file("D:/files/customer.txt").import@t()

3

=file("D:/files/product.txt").import@t()

4

=file("D:/files/supplier.txt").import@t()

5

=file("D:/files/region.txt").import@t()

6

=A2.switch(city,A5:city)

7

=A4.switch(city,A5:city)

8

=A3.switch(sid,A4:sid)

9

=A1.switch(pid,A3:pid;   cid,A2:cid)

10

=A1.select(pid.sid.city.region==cid.city.region)

11

=A10.groups(cid.cid:cid;count(oid):count,sum(price*quantity):amount)

 

  Import the files respectively, create associations between the fact table and each of the dimension tables, get eligible orders according to the associations, and then groups them and performs aggregations. switch() function creates foreign key style associations.

*  Compare files of different structures

  Data.txt is a tab-delimited, six-field text file, where here field contains semicolon-separated strings. List.txt is a single-column file. Task: If a record in Data.txt where the split here field value matches any line in List.txt, output this record to result.txt.

  List.txt:

Gee

Whiz

Lol

 

  Data.txt:

field1

field2

field3

here

field5

etc

A

B

2

Gee;Whiz;Hello

13

12

A

B

2

Gee;Whizz;Hi

56

32

E


4

Btm;Lol

16

2

T


3

Whizz

13

3

 

  SPL script:


A

1

=file("d:\\Data.txt").import@t()

2

=file("d:\\List.txt").read@n()

3

=A1.select(here.array(";")^A2!=[])

4

=file("d:\\result.txt").export@t(A3)

 

  select() function gets desired records by splitting each here field value into a sequence of strings with array() function and calculating its intersection with A2’s sequence using operator “^” and according to the condition that the intersection result isn’t empty.

  The final result:

field1

field2

field3

here

field5

etc

A

B

2

Gee;Whiz;Hello

13

12

A

B

2

Gee;Whizz;Hi

56

32

E


4

Btm;Lol

16

2

 

*  Data extraction from a multilevel directory

  D:\files is a multilevel directory where each subdirectory contains many text files. Task: Retrieve a specified line (say, the 2nd one) from each file to write to d:\result.txt.

SPL script:


A

1

=directory@p(path)

2

=A1.(file(~).cursor@s())

3

=A2.((~.skip(1),~.fetch@x(1)))

4

=A3.union()

5

=file("d:\\result.txt").export@a(A4)

6

=directory@dp(path)

7

=A6.(call("c:\\readfile.dfx",~))

 

  Set the initial value of parameter path as D:\files to begin data extraction; then call the script (c:\readfile.dfx) iteratively to pass different value to parameter path each time.

  directory() function gets files under the root directory specified by parameter path; @p option requires full paths and @d gets directory names only.

  ~.skip(1) skips the first line.

  ~.fetch@x(1) retrieves the line at the current position (the 2nd line) and closes the cursor.

*  Split a file as multiple ones

  sales.txt records a large number of orders. Task: Split the file into multiple files by year and month; the file name format is year-month.txt.

  SPL script:


A

1

=file("D:\\   sales.txt").import@t()

2

=A1.group(string(OrderDate,"yyyy-MM");~)

3

=A2.run(file("d:\\temp\\"+#1+".txt").export(#2))

 

  Group the file by year and month and loop each group to write data to a target file. Below is file 2009-01.txt:

65  YZ  8  29600.0  2009-01-06

62  JAXE 11  8134.0  2009-01-06

64  HP  13    20000.0  2009-01-02

60  PWQ  16  3430.0  2009-01-05

63  SJCH  16  5880.0  2009-01-02

61  SJCH  19  1078.0  2009-01-08

 

  Use cursor to import the file if its size is greater than memory capacity. If a group is still can’t be imported into the memory wholly, use groupx() function to group the file. The code structure maintains unchanged.

*  Stock calculation

  Stock.txt stores in-out stock records. One item may have multiple in-out records per day, or may have no records in days. The initial value for an item is 0. In means in-stock and out means out-stock. Task: Calculate stock for each item. Below is the source data:

date name  quantity  flag

2014-04-01  Item1  15  In

2014-04-01  Item1  4  In

2014-04-02  Item1  3  In

2014-04-02  Item1  10  Out

2014-04-03  Item1  3  In

2014-04-04  Item1  5  Out

2014-04-07  Item1  4  In

2014-04-10  Item1  2  Out

2014-04-01  Item2  20  In

2014-04-02  Item3  30  In

2014-04-03  Item3  14  Out

 

  SPL script:


A

B

1

=file("D:\\stock.txt").import@t()


2

=A1.group(name,date;~.select(flag==”In”).sum(quantity):in,~.select(flag==”Out”).sum(quantity):out)

3

=A2.group(name)

=periods((t=A2.id(date)).min(),t.max(),1)

4

for A3

=A4.align(B3,date)

5


>c=0

6


=B4.new(A4.name:name,
  B3(#):date,
  c:Opening,
  in,
  (b=c+in):Total,
  out,
  (c=b-out):Close)

7


>A8=A8|B6

8



 

  A2 calculates the sum of quantities per item per day. B3 gets complete, continuous dates according to the earliest and the latest dates. Then group records by items and loop over each group to align records to B3. B6 calculates the each item’ daily stock and stores all stocks in A8.

  Below is A8’s result:

  undefined