"【 Question 】 I have the following code. However there seems to be an error within it somewhere. .."

blackduckie RaqForum 28 No.
452 View • 5 Years ago

String Split, Parse & Structuralization

【Question】

I have the following code, but there seems to be an error somewhere in it. I get output (b) but need output (c); see below. Can anyone see where I am going wrong? All files are tab-delimited.

Code:

import sys

outfile_name = sys.argv[-1]
filename1 = sys.argv[-2]
filename2 = sys.argv[-3]

fileIn1 = open(filename1, "r")
fileIn2 = open(filename2, "r")
fileOut = open(outfile_name, "w")

dict = {}

a = open(filename1)
b = open(filename2)

for line in a:
    words = line.split("\t")
    if len(words) != 1:
        target = words[0]
        for word in words[1:]:
            dict[word] = target

for line in b:
    words = line.split("\t")
    if words[0] in dict.keys() and words[1] in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + dict[words[1]] + "\n")
    elif words[0] in dict.keys() and words[1] not in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + words[1] + "\n")
    elif words[0] not in dict.keys() and words[1] in dict.keys():
        fileOut.write(words[0] + "\t" + dict[words[1]] + "\n")
    elif words[0] not in dict.keys() and words[1] not in dict.keys():
        fileOut.write(words[0] + "\t" + words[1] + "\n")

fileOut.close()

filename1:

Area_1 Area_2

A B

A C

A D

D B

D C

L B

L C

L A

D L

K A

K B

K C

K D

K L

D P

D R

L P

L R

K P

K R

A H

D H

L H

K H

B P

B R

R P

A I

D I

I L

I K

C H

I H

C H

J K

J X

J Y

J Z

K X

K Y

Y Z

K Z

X Y

X Z

M G

N T

O S

S Q

filename2:

Incident_00000001 A D L K

Incident_00000002 B P R

Incident_00000003 C F W

Incident_00000004 J I

Incident_00000005 Q S

output (b) - undesired output that I am getting:

Area_1 Area_2

Incident_00000001 B

Incident_00000001 C

Incident_00000001 D

Incident_00000001 B

Incident_00000001 C

Incident_00000001 B

Incident_00000001 C

Incident_00000001 A

Incident_00000001 L

K A

K B

K C

K D

K L

Incident_00000001 P

Incident_00000001 Incident_00000002

Incident_00000001 P

Incident_00000001 Incident_00000002

K P

K Incident_00000002

Incident_00000001 H

K H

Incident_00000002 P

Incident_00000002 Incident_00000002

R P

Incident_00000001 Incident_00000003

I L

I Incident_00000004

Incident_00000003 H

I H

Incident_00000003 H

Incident_00000004 Incident_00000004

Incident_00000004 X

Incident_00000004 Y

Incident_00000004 Z

K X

K Y

Y Z

K Z

X Y

X Z

M G

N T

O S

Incident_00000005 Incident_00000005

What I am looking to get (output (c)) is:

Area_1 Area_2

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000002

Incident_00000001 H

Incident_00000002 Incident_00000002

Incident_00000001 Incident_00000004

Incident_00000004 Incident_00000001

Incident_00000003 H

Incident_00000004 H

Incident_00000003 H

Incident_00000004 Incident_00000001

Incident_00000004 X

Incident_00000004 Y

Incident_00000004 Z

Incident_00000001 X

Incident_00000001 Y

Y Z

Incident_00000001 Z

X Y

X Z

M G

N T

O Incident_00000005

Incident_00000005 Incident_00000005

【Answer】

Filter file2 and restructure it into a two-column table that is easy to query: the incident field holds the value to output, and the code field holds the set of codes to match against file1. If a code in a file1 record belongs to the code set of a file2 record, output that record's incident value; otherwise output the original code from file1.
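The matching rule can be illustrated with a minimal Python sketch that uses the first two incident records from file2 as sample data. The names `records` and `resolve` are illustrative only and do not come from the original post.

```python
# Each file2 record as (incident value, set of codes); sample data from file2.
records = [
    ("Incident_00000001", {"A", "D", "L", "K"}),
    ("Incident_00000002", {"B", "P", "R"}),
]

def resolve(code):
    """Return the incident of the first record whose code set contains
    `code`, or the code itself when no record matches."""
    return next((incident for incident, codes in records if code in codes), code)

print(resolve("A"))  # Incident_00000001
print(resolve("R"))  # Incident_00000002
print(resolve("H"))  # H (no sample record lists H, so the code is kept)
```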

The logic is simple, but the algorithm involves structuralization, set operations and order-based string splitting. In Python all of this has to be coded at a low level. A simpler alternative is SPL (Structured Process Language), which produces short, easy-to-understand code:

	A
1	=file("d:/file1.txt").import@t()
2	=file("d:/file2.txt").read@n()
3	=A2.select(pos(~,"Incident"))
4	=A3.new((t=~.split("\t"))(1):incident,t.to(2,):code)
5	=A1.new(ifn(A4.select@1(code.pos(A1.Area_1)).incident,Area_1):Area_1,ifn(A4.select@1(code.pos(A1.Area_2)).incident,Area_2):Area_2)

A1: Read in file1.

A2: Read in file2 line by line.

A3: Get lines containing "Incident" from A2.

A4: Generate a table sequence with an incident field and a code field from A3.

~.split("\t") splits each line in A3 into a sequence at the tab separators. The first member becomes the incident value, and the remaining members form the code sequence.
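The same split can be sketched in Python. `parse_incident_line` is a hypothetical helper name. Note that `str.split("\t")` leaves the line ending attached to the last field, so the sketch strips it first; without that step the last code on each line would be stored as "K\n", for example, and would never match a plain "K".

```python
def parse_incident_line(line):
    """Split a tab-delimited file2 line into (incident, list of codes)."""
    incident, *codes = line.rstrip("\r\n").split("\t")
    return incident, codes

raw = "Incident_00000001\tA\tD\tL\tK\n"
print(raw.split("\t")[-1] == "K\n")  # True: the newline stays on the last field
print(parse_incident_line(raw))      # ('Incident_00000001', ['A', 'D', 'L', 'K'])
```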

A5: Generate a table sequence with Area_1 and Area_2 fields from A1's table sequence. For each field, if the file1 code appears in the code sequence of a file2 record, the value becomes that record's incident value; otherwise the original code is kept.
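For comparison, steps A1 to A5 can be sketched in plain Python. This is an illustration under stated assumptions, not the original poster's script: `build_lookup` and `translate_pairs` are made-up names, the input is passed as lists of lines rather than read from disk, and a code listed under several incidents resolves to the first incident that lists it (as `select@1` does).

```python
def build_lookup(incident_lines):
    """Map each code to its incident (rough counterpart of A3 and A4)."""
    lookup = {}
    for line in incident_lines:
        fields = line.rstrip("\r\n").split("\t")
        if "Incident" not in line or len(fields) < 2:
            continue  # skip blank or non-incident lines
        incident, codes = fields[0], fields[1:]
        for code in codes:
            lookup.setdefault(code, incident)  # first incident wins
    return lookup

def translate_pairs(pair_lines, lookup):
    """Replace each file1 code with its incident, if any (counterpart of A5)."""
    out = []
    for line in pair_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        # dict.get(code, code) keeps the original code when there is no match
        out.append("\t".join(lookup.get(code, code) for code in fields))
    return out

# Small sample taken from the files in the question.
file2 = ["Incident_00000001\tA\tD\tL\tK\n", "Incident_00000005\tQ\tS\n"]
file1 = ["Area_1\tArea_2\n", "K\tA\n", "O\tS\n", "M\tG\n"]
for row in translate_pairs(file1, build_lookup(file2)):
    print(row)
```

With real files, open file objects can be passed in place of the lists, since iterating over a text file yields its lines.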

SPL Official Website 👉 https://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/2bkGwqTj

Youtube 👉 https://www.youtube.com/@esProc_SPL
