Find Text Enclosed in Quotation Marks That Contains At Least Five English Words

Problem description & analysis

Below is the text file txt.txt:

Attorney General William Barr said the volume of information compromised was "staggering" and the largest breach in U.S. history."This theft not only caused significant financial damage to Equifax but invaded the privacy of many, millions of Americans and imposed substantial costs and burdens on them as they had to take measures to protect themselves from identity theft." said Mr. Barr.

The task is to return each string enclosed in quotation marks that contains at least 5 words. Below is the desired result:

This theft not only caused significant financial damage to Equifax but invaded the privacy of many, millions of Americans and imposed substantial costs and burdens on them as they had to take measures to protect themselves from identity theft.

Solutions:

Solution 1: Conditional grouping

Write the following script (p1.dfx) in esProc:

    A
1   =file("txt.txt").read()
2   =A1.words@w().group@i(~[-1]=="\"").select(#%2==0 && ~.count(isalpha(~))>=5).(~.m(:-2).concat())
3   =file("result.txt").export(A2)

Explanation:

A1   Read the text file as a string.

A2  The words() function splits A1’s string into words; the @w option extracts all characters from the string, with runs of English letters and digits extracted as words. The group@i() function then starts a new group whenever the previous member is a quotation mark, so the even-numbered groups hold the quoted text followed by its closing quotation mark. The select() function keeps the even-numbered groups that contain at least 5 members made up of letters. Finally, for each selected group, m(:-2) takes all members except the last one (the closing quotation mark) and concat() concatenates them into a string.

A3  Export A2’s result to result.txt.
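
For readers who do not use esProc, the grouping logic in A2 can be approximated in Python. This is a minimal sketch, not a translation of the SPL internals: the function name quoted_with_min_words, the tokenizing pattern, and the use of str.isalpha() to count words are all assumptions made for illustration. It also adds a check, not present in the script above, that a group really ends with a closing quotation mark.

```python
import re


def quoted_with_min_words(text, min_words=5):
    """Return quoted passages in `text` containing at least `min_words` words.

    Hypothetical helper that mimics the split / group / select / concat steps.
    """
    # Split into tokens: runs of letters or digits are words,
    # every other character becomes a token of its own.
    tokens = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", text)

    # Start a new group whenever the previous token is a quotation mark.
    groups, current = [], []
    for tok in tokens:
        if current and current[-1] == '"':
            groups.append(current)
            current = []
        current.append(tok)
    if current:
        groups.append(current)

    results = []
    for number, group in enumerate(groups, start=1):
        # Even-numbered groups (1-based) follow an opening quotation mark.
        if number % 2 != 0 or group[-1] != '"':
            continue
        # Count only tokens made up entirely of letters.
        if sum(tok.isalpha() for tok in group) >= min_words:
            # Drop the closing quotation mark and join the rest.
            results.append("".join(group[:-1]))
    return results
```

Calling quoted_with_min_words(open("txt.txt").read()) on the sample file should keep only the long quotation, since "staggering" contains a single word.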

Solution 2: Regular expression matching

Write the following script (p1.dfx) in esProc:

    A
1   =file("txt.txt").read()
2   =A1.regex("\"([^\"]*)\"").select(~.words().len()>=5)
3   =file("result.txt").export(A2)

Explanation:

A1   Read the text file as a string.

A2  Use the regular expression to find every string enclosed in quotation marks, capturing the text between the marks, then select those that contain at least 5 words.

A3  Export A2’s result to result.txt.
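
The same match-then-filter idea can be sketched in Python with the standard re module. The pattern is the one used in A2; the function name quoted_with_min_words_re and the choice to count words as runs of ASCII letters are assumptions for illustration and may differ from how words() counts in esProc.

```python
import re


def quoted_with_min_words_re(text, min_words=5):
    """Return quoted passages in `text` containing at least `min_words` words.

    Hypothetical helper: match quoted spans first, then filter by word count.
    """
    # Capture the text between each pair of double quotation marks.
    quoted = re.findall(r'"([^"]*)"', text)
    # Keep passages with enough words (a word here is a run of ASCII letters).
    return [q for q in quoted if len(re.findall(r"[A-Za-z]+", q)) >= min_words]
```

Because the pattern pairs quotation marks from left to right, an unmatched quotation mark earlier in the text would shift every later pairing, so this sketch assumes the marks in the input are balanced.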

Q & A Collection

https://stackoverflow.com/questions/60310558/regex-to-match-quote-with-minimum-number-of-words