String Split, Parse & Structuralization

Question

I have the following code, but there seems to be an error in it somewhere. I get output (b) but need output (c); both are shown below. Can anyone see where I am going wrong? All files are tab-delimited.

Code:

import sys

outfile_name = sys.argv[-1]
filename1 = sys.argv[-2]
filename2 = sys.argv[-3]

fileIn1 = open(filename1, "r")
fileIn2 = open(filename2, "r")
fileOut = open(outfile_name, "w")

dict = {}

a = open(filename1)
b = open(filename2)

for line in a:
    words = line.split("\t")
    if len(words) != 1:
        target = words[0]
        for word in words[1:]:
            dict[word] = target

for line in b:
    words = line.split("\t")
    if words[0] in dict.keys() and words[1] in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + dict[words[1]] + "\n")
    elif words[0] in dict.keys() and words[1] not in dict.keys():
        fileOut.write(dict[words[0]] + "\t" + words[1] + "\n")
    elif words[0] not in dict.keys() and words[1] in dict.keys():
        fileOut.write(words[0] + "\t" + dict[words[1]] + "\n")
    elif words[0] not in dict.keys() and words[1] not in dict.keys():
        fileOut.write(words[0] + "\t" + words[1] + "\n")

fileOut.close()

 

filename1:

Area_1 Area_2

A  B

A  C

A  D

D  B

D  C

L  B

L  C

L  A

D  L

K  A

K  B

K  C

K  D

K  L

D  P

D  R

L  P

L  R

K  P

K  R

A  H

D  H

L  H

K  H

B  P

B  R

R  P

A  I

D  I

I  L

I  K

C  H

I  H

C  H

J  K

J  X

J  Y

J  Z

K  X

K  Y

Y  Z

K  Z

X  Y

X  Z

M  G

N  T

O  S

S  Q

 

filename2:

Incident_00000001  A  D  L  K

Incident_00000002  B  P  R

Incident_00000003  C  F  W

Incident_00000004  J  I

M

N

O

Incident_00000005  Q  S

X

Y

Z

G

T

output (b) - undesired output that I am getting:

Area_1  Area_2

Incident_00000001  B

Incident_00000001  C

Incident_00000001  D

Incident_00000001  B

Incident_00000001  C

Incident_00000001  B

Incident_00000001  C

Incident_00000001  A

Incident_00000001  L

K  A

K  B

K  C

K  D

K  L

Incident_00000001  P

Incident_00000001  Incident_00000002

Incident_00000001  P

Incident_00000001  Incident_00000002

K  P

K  Incident_00000002

Incident_00000001  H

Incident_00000001  H

Incident_00000001  H

K  H

Incident_00000002  P

Incident_00000002  Incident_00000002

R  P

Incident_00000001  Incident_00000003

Incident_00000001  Incident_00000003

I  L

I  Incident_00000004

Incident_00000003  H

I  H

Incident_00000003  H

Incident_00000004  Incident_00000004

Incident_00000004  X

Incident_00000004  Y

Incident_00000004  Z

K  X

K  Y

Y  Z

K  Z

X  Y

X  Z

M  G

N  T

O  S

Incident_00000005 Incident_00000005

 

What I am looking to get (output (c)) is:

Area_1 Area_2

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000003

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000001

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000002

Incident_00000001 Incident_00000002

Incident_00000001 H

Incident_00000001 H

Incident_00000001 H

Incident_00000001 H

Incident_00000002 Incident_00000002

Incident_00000002 Incident_00000002

Incident_00000002 Incident_00000002

Incident_00000001 Incident_00000004

Incident_00000001 Incident_00000004

Incident_00000004 Incident_00000001

Incident_00000004 Incident_00000001

Incident_00000003 H

Incident_00000004 H

Incident_00000003 H

Incident_00000004 Incident_00000001

Incident_00000004 X

Incident_00000004 Y

Incident_00000004 Z

Incident_00000001 X

Incident_00000001 Y

Y  Z

Incident_00000001 Z

X Y

X Z

M G

N T

O Incident_00000005

Incident_00000005 Incident_00000005

 

Answer

Filter file2 down to its incident lines and turn it into an easy-to-query two-column table: the incident field holds the value to output, and the code field holds the set of codes listed on that line, which will be matched against file1. For each code in a record of file1, if it belongs to the code set of a file2 record, output that record's incident value; otherwise output the original code from file1.

The logic is simple, but the implementation involves structuring the text, set operations, and position-based string splitting. In Python, all of this has to be coded by hand at a fairly low level. A simpler alternative is SPL (Structured Process Language), which expresses the task in a few short, readable lines:

     A
1    =file("d:/file1.txt").import@t()
2    =file("d:/file2.txt").read@n()
3    =A2.select(pos(~,"Incident"))
4    =A3.new((t=~.split("\t"))(1):incident,t.to(2,):code)
5    =A1.new(ifn(A4.select@1(code.pos(A1.Area_1)).incident,Area_1):Area_1,ifn(A4.select@1(code.pos(A1.Area_2)).incident,Area_2):Area_2)

A1: Read in file1.

A2: Read in file2 line by line.

A3: Get lines containing "Incident" from A2.

A4: Build a table sequence from A3 with two fields, incident and code.

~.split("\t") splits each line of A3 into a sequence at the tab separators. The first member becomes the incident value, and the remaining members form the code sequence.
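For readers following along in Python, the filtering and splitting in A3 and A4 can be sketched roughly as below. The function name parse_incidents and the sample lines are illustrative assumptions, not part of the original post. The sketch strips the line ending before splitting; without that step the last code on each line keeps its "\n", which is a likely reason the code in the question fails to match some values.

```python
def parse_incidents(lines):
    """Return a list of (incident, codes) pairs from tab-delimited lines.

    Hypothetical helper: keeps only lines containing "Incident",
    similar in spirit to the A3 filter.
    """
    table = []
    for line in lines:
        if "Incident" not in line:
            continue  # skip lines such as "M" or "X" that carry no incident
        # Remove the line ending first so the last code has no trailing "\n".
        fields = line.rstrip("\r\n").split("\t")
        incident, codes = fields[0], fields[1:]
        table.append((incident, codes))
    return table


# Sample lines modeled on filename2 (tab-delimited, newline-terminated).
sample = [
    "Incident_00000001\tA\tD\tL\tK\n",
    "Incident_00000004\tJ\tI\n",
    "M\n",
]
print(parse_incidents(sample))
# [('Incident_00000001', ['A', 'D', 'L', 'K']), ('Incident_00000004', ['J', 'I'])]
```

To use it on a real file, pass an open file object, for example parse_incidents(open("file2.txt")), assuming the file is tab-delimited as stated in the question.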

A5: Build a table sequence with fields Area_1 and Area_2 from A1's table sequence. For each field, if the code from file1 appears in the code sequence of one of file2's records, the value is that record's incident; otherwise it is the original code.
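The lookup in A5 can also be approximated in Python. This is a minimal sketch under stated assumptions: the names first_incident and relabel are hypothetical, and the incident table is written inline as a small sample rather than read from file2. Like select@1, it takes the first matching record, and like ifn, it falls back to the original code when nothing matches.

```python
def first_incident(code, table):
    """Return the incident of the first record whose codes contain `code`, else None."""
    for incident, codes in table:
        if code in codes:
            return incident
    return None


def relabel(pairs, table):
    """Replace each area code with its incident when one is found (hypothetical helper)."""
    result = []
    for area_1, area_2 in pairs:
        result.append((
            first_incident(area_1, table) or area_1,  # fall back to the original code
            first_incident(area_2, table) or area_2,
        ))
    return result


# Inline sample of the incident table, taken from rows of filename2.
table = [
    ("Incident_00000001", ["A", "D", "L", "K"]),
    ("Incident_00000002", ["B", "P", "R"]),
    ("Incident_00000005", ["Q", "S"]),
]
# A few rows from filename1.
pairs = [("A", "B"), ("K", "H"), ("O", "S"), ("M", "G")]

for row in relabel(pairs, table):
    print("\t".join(row))
```

For these four rows the sketch yields Incident_00000001/Incident_00000002, Incident_00000001/H, O/Incident_00000005 and M/G, which agrees with the corresponding rows of output (c).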