String Split, Parse & Structuralization
【Question】
I have the following code, but there seems to be an error somewhere in it. I get output (b) but need output (c); see below. Can anyone see where I am going wrong? All files are tab-delimited.
Code:
import sys
outfile_name = sys.argv[-1]
filename1 = sys.argv[-2]
filename2 = sys.argv[-3]
fileIn1 = open(filename1, "r")
fileIn2 = open(filename2, "r")
fileOut = open(outfile_name, "w")
dict = {}
a = open(filename1)
b = open(filename2)
for line in a:
    words = line.split("\t")
    if len(words) != 1:
        target = words[0]
        for word in words[1:]:
            dict[word] = target
for line in b:
    words = line.split("\t")
    if words[0] in dict.keys() and words[1] in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + dict[words[1]] + "\n")
    elif words[0] in dict.keys() and words[1] not in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + words[1] + "\n")
    elif words[0] not in dict.keys() and words[1] in dict.keys():
        fileOut.write(words[0] + "\t" + dict[words[1]] + "\n")
    elif words[0] not in dict.keys() and words[1] not in dict.keys():
        fileOut.write(words[0] + "\t" + words[1] + "\n")
fileOut.close()
filename1:
Area_1 Area_2
A B
A C
A D
D B
D C
L B
L C
L A
D L
K A
K B
K C
K D
K L
D P
D R
L P
L R
K P
K R
A H
D H
L H
K H
B P
B R
R P
A I
D I
I L
I K
C H
I H
C H
J K
J X
J Y
J Z
K X
K Y
Y Z
K Z
X Y
X Z
M G
N T
O S
S Q
filename2:
Incident_00000001 A D L K
Incident_00000002 B P R
Incident_00000003 C F W
Incident_00000004 J I
M
N
O
Incident_00000005 Q S
X
Y
Z
G
T
output (b) - undesired output that I am getting:
Area_1 Area_2
Incident_00000001 B
Incident_00000001 C
Incident_00000001 D
Incident_00000001 B
Incident_00000001 C
Incident_00000001 B
Incident_00000001 C
Incident_00000001 A
Incident_00000001 L
K A
K B
K C
K D
K L
Incident_00000001 P
Incident_00000001 Incident_00000002
Incident_00000001 P
Incident_00000001 Incident_00000002
K P
K Incident_00000002
Incident_00000001 H
Incident_00000001 H
Incident_00000001 H
K H
Incident_00000002 P
Incident_00000002 Incident_00000002
R P
Incident_00000001 Incident_00000003
Incident_00000001 Incident_00000003
I L
I Incident_00000004
Incident_00000003 H
I H
Incident_00000003 H
Incident_00000004 Incident_00000004
Incident_00000004 X
Incident_00000004 Y
Incident_00000004 Z
K X
K Y
Y Z
K Z
X Y
X Z
M G
N T
O S
Incident_00000005 Incident_00000005
What I am looking to get (output (c)) is:
Area_1 Area_2
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000003
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000003
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000003
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000003
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000001
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000002
Incident_00000001 Incident_00000002
Incident_00000001 H
Incident_00000001 H
Incident_00000001 H
Incident_00000001 H
Incident_00000002 Incident_00000002
Incident_00000002 Incident_00000002
Incident_00000002 Incident_00000002
Incident_00000001 Incident_00000004
Incident_00000001 Incident_00000004
Incident_00000004 Incident_00000001
Incident_00000004 Incident_00000001
Incident_00000003 H
Incident_00000004 H
Incident_00000003 H
Incident_00000004 Incident_00000001
Incident_00000004 X
Incident_00000004 Y
Incident_00000004 Z
Incident_00000001 X
Incident_00000001 Y
Y Z
Incident_00000001 Z
X Y
X Z
M G
N T
O Incident_00000005
Incident_00000005 Incident_00000005
【Answer】
Filter file2 down to the lines that name an incident, then turn them into a two-column lookup table that is easy to query: an incident field that holds the value to display, and a code field that holds the set of codes listed on that line. For each code in file1, if it belongs to the code set of a file2 record, output that record's incident value; otherwise output the original code from file1.
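The lookup described above can be sketched in plain Python. This is a minimal illustration, not the asker's code: the function names `build_lookup` and `translate` and the inline sample lines are assumptions made for the example. It strips the line ending before splitting, which the code in the question does not do. That may be why output (b) leaves K untranslated in the first column, since K is the last field on its line, but this is an inference from the listings, not something the question states.

```python
def build_lookup(incident_lines):
    """Map each code to the incident named at the start of its line."""
    lookup = {}
    for line in incident_lines:
        # Strip the line ending first so the last code is "K", not "K\n".
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # a lone code with no incident: nothing to map
        incident, codes = fields[0], fields[1:]
        for code in codes:
            lookup.setdefault(code, incident)  # the first incident seen wins
    return lookup


def translate(pair_lines, lookup):
    """Replace each code by its incident, or keep the code if it is unmapped."""
    out = []
    for line in pair_lines:
        fields = line.rstrip("\r\n").split("\t")
        out.append("\t".join(lookup.get(f, f) for f in fields))
    return out


# Hypothetical sample lines taken from the listings in the question.
incidents = ["Incident_00000001\tA\tD\tL\tK\n", "Incident_00000002\tB\tP\tR\n", "M\n"]
pairs = ["Area_1\tArea_2\n", "K\tA\n", "D\tR\n", "M\tG\n"]
result = translate(pairs, build_lookup(incidents))
```

Using `lookup.get(f, f)` collapses the four `if`/`elif` branches of the original into one expression, because each column is translated independently of the other.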
The logic is simple, but the implementation involves structuring the data, set operations, and position-based string splitting, all of which have to be coded by hand in Python. A more concise alternative is SPL (Structured Process Language), which expresses the same steps in a few short lines:
  | A
1 | =file("d:/file1.txt").import@t()
2 | =file("d:/file2.txt").read@n()
3 | =A2.select(pos(~,"Incident"))
4 | =A3.new((t=~.split("\t"))(1):incident,t.to(2,):code)
5 | =A1.new(ifn(A4.select@1(code.pos(A1.Area_1)).incident,Area_1):Area_1,ifn(A4.select@1(code.pos(A1.Area_2)).incident,Area_2):Area_2)
A1: Read in file1.
A2: Read in file2 line by line.
A3: Get lines containing "Incident" from A2.
A4: Generate a table sequence consisting of incident field and code field based on A3.
~.split("\t") splits each line in A3 into a sequence on the tab separator. The first member becomes the incident value, and the remaining members form the code sequence.
A5: Generate a table sequence with Area_1 and Area_2 fields from A1's table sequence. For each field, if the file1 code is a member of the code sequence of a file2 record, the value becomes the incident value of the first such record; otherwise the original code is kept.
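For readers more familiar with Python, steps A3 to A5 can be mirrored roughly as follows. This is only a sketch under assumptions: `parse_records` and `incident_or_code` are hypothetical helper names, the sample lines are inlined instead of being read from files, and the scan over records is meant to imitate the first-match-with-fallback behaviour of select@1 and ifn rather than reproduce SPL exactly.

```python
def parse_records(lines):
    """A3 + A4: keep lines containing "Incident" and split them into
    (incident, codes) pairs."""
    records = []
    for line in lines:
        if "Incident" not in line:
            continue  # lines holding only a code are dropped, as in A3
        t = line.rstrip("\r\n").split("\t")
        records.append((t[0], t[1:]))  # first member, then the rest
    return records


def incident_or_code(code, records):
    """A5: incident of the first record whose codes contain `code`,
    falling back to `code` itself when no record matches."""
    match = next((inc for inc, codes in records if code in codes), None)
    return match if match is not None else code


# Hypothetical sample lines taken from the file2 listing.
file2_lines = ["Incident_00000004\tJ\tI\n", "M\n", "Incident_00000005\tQ\tS\n", "G\n"]
records = parse_records(file2_lines)

# A few (Area_1, Area_2) rows from file1.
rows = [("O", "S"), ("S", "Q"), ("M", "G"), ("J", "X")]
mapped = [(incident_or_code(a, records), incident_or_code(b, records)) for a, b in rows]
```

Scanning the records for every code is linear in the number of incidents. For larger files, a dictionary keyed by code would avoid the repeated scan.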