Check Whether Certain Strings Appear in a Text File

 

Question

I’m digging Python sooo much. I love it!

I have come upon my first snag: scanning a file for some strings, then correctly building separate arrays of matched, unmatched, and “empty” strings, and then printing each array.

I’ve tried this several ways with several different Python file and sequence iteration constructs (6, I think), with both UTF-8 HTML and ASCII text files.

I have had mixed results, many of them positive, but none of file.read(), file.readline(), file.readlines(), or file.xreadlines() works as expected after opening the file for reading via file = open('afile', 'r').

I can get part of a given array built and loop through it, but for some reason the array is only ever partially built: the various read*() functions do not behave in the script the way I expect, even after I tested them successfully in the interpreter!

Code to follow soon, but basically:

    tagsf = ['<tagsfloat1>', '<tagsfloat2>']
    tagso = ['<openingtag1>', '<openingtag2>']
    tagsc = ['<closingtag1>', '<closingtag2>']
    tagp = tagsf + tagso + tagsc

    doc = open('afile.html', 'r')
    page = []
    tagy = []
    tagn = []
    for lines in doc.read():
        page.append(lines)

    for line in page:
        for tag in tagp:
            if tag in tagy:
                break
            if tag in tagn:
                break
            if tag in line:
                if tag not in tagy:
                    tagy.append(tag)
            if tag not in tagn:
                tagn.append(tag)

    for tagyes in tagy:
        print tagyes, 'found'
    for tagno in tagn:
        print tagno, 'not found!'

 

All I ever get is the first tag found, or all tags NOT found!

 

Answer

The algorithm is simple. Read the file in as one large string, match it against the full list of keywords, tagall, to get the sequence of keywords that were found, tagy, and then take the difference between tagall and tagy to get tagn. Besides writing a loop in Python, you can also do this in SPL (Structured Process Language). Below is the SPL script, which is simple and easy to understand:

|   | A |
|---|---|
| 1 | =["tagsfloat1","tagsfloat2","openingtag1","openingtag2"] |
| 2 | =file("E:\\afile.html").read() |
| 3 | =A1.select(pos(A2,~)) |
| 4 | =A1\A3 |

A3 uses the select() function to loop over A1’s members and return the ones that match. The condition is whether the current member (written as ~) is contained in A2’s large string. A1\A3 then calculates the difference, that is, the members of A1 that are not in A3.
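For readers who would rather stay in Python, the same three steps (read the file as one string, select the keywords it contains, take the difference) can be sketched roughly as below. This is a minimal Python 3 sketch, not the original poster's code: the helper names split_tags and check_file, the UTF-8 default, and the sample string are assumptions made for illustration. One thing it sidesteps is that iterating over the result of read() yields single characters rather than lines, so a multi-character tag can never be "in" one of those items.

```python
from pathlib import Path


def split_tags(text, tags):
    """Return (found, missing) for the given tags, preserving their order.

    A tag counts as found if it occurs anywhere in text as a substring.
    """
    found = [tag for tag in tags if tag in text]
    missing = [tag for tag in tags if tag not in found]
    return found, missing


def check_file(path, tags, encoding="utf-8"):
    """Read the whole file as one string, then classify the tags.

    The UTF-8 default is an assumption; pass the file's real encoding.
    """
    text = Path(path).read_text(encoding=encoding)
    return split_tags(text, tags)


if __name__ == "__main__":
    tagall = ["tagsfloat1", "tagsfloat2", "openingtag1", "openingtag2"]
    # Hypothetical sample text standing in for the contents of afile.html.
    sample = "<openingtag1>some text<tagsfloat2>"
    tagy, tagn = split_tags(sample, tagall)
    for tag in tagy:
        print(tag, "found")
    for tag in tagn:
        print(tag, "not found!")
```

To run it against a real file, call check_file with the file's path and the keyword list; it returns the found and missing keywords as two lists in the same order as the input list.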