Big Text File Query

Question

I have a 13 GB text file. Each line contains a table-like row whose values are separated by commas. I need to search this file for a particular entry, but because of the file's size, ordinary file operations stop responding. I am using PHP's SplFileObject class for the file operations, storing the tokenized values of each row in an array and comparing them on each iteration as I read a new line.

Can anyone suggest how I should proceed, in terms of data structure or programming methodology, to handle this more efficiently?

 

Answer

If the data is static, you can load the text file into a database. That approach is unsuitable for dynamic data, and the load itself will be rather slow because the database has to maintain data consistency. Another solution is esProc SPL, which can query the text file directly. Here's the SPL script:

 

	A
1	=file("D:/employee.txt").cursor@tc()
2	=A1.select(BIRTHDAY>=date(1981-01-01) && GENDER=="F")
3	=A2.fetch()

 

The query condition in A2 is “female employees born on or after Jan. 1, 1981”. You can change the condition, or use a parameter to accept input and build a dynamic query. If the query returns a large result set, you can export it directly to a text file by changing A3 to file("D:/result.txt").export(A2).
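For readers who want to see the same streaming idea outside esProc, here is a minimal Python sketch. It is not the SPL implementation. It assumes the file starts with a header row naming the BIRTHDAY and GENDER columns, that BIRTHDAY is stored as ISO text such as 1981-01-01, and the function name filter_rows is made up for illustration. The point is that only one row is held in memory at a time, and matches go straight to an output file instead of an array.

```python
import csv
from datetime import date


def filter_rows(src_path, dst_path, born_on_or_after=date(1981, 1, 1), gender="F"):
    """Stream src_path row by row and write matching rows to dst_path.

    Returns the number of matching rows. The file is never loaded whole.
    """
    matches = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)  # first line supplies the column names
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Assumption: BIRTHDAY is ISO formatted (YYYY-MM-DD).
            born = date.fromisoformat(row["BIRTHDAY"])
            if born >= born_on_or_after and row["GENDER"] == gender:
                writer.writerow(row)
                matches += 1
    return matches
```

Calling filter_rows("D:/employee.txt", "D:/result.txt") would mirror the script above under those assumptions. A single pass like this is still a full scan of the file, so it mainly avoids the memory problem rather than the scan time.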

Besides, you can write a SQL query in esProc SPL to query a text file, like this:

 

	A
1	$select * from test.txt where BIRTHDAY>=date(1981-01-01) and GENDER='F'

 

To improve performance, you can use multithreaded processing; for details, see Parallel Computing. One more point: a database stores an imported text file as binary data. The import is slow, but querying binary data is faster than querying text. esProc also supports a binary file format (the esProc bin file) and can create an index on it. Converting a text file to a bin file in esProc SPL is much faster than importing the text file into a database. With such fast query performance, parallel processing in SPL performs better than it does in most databases. See Bin Files to learn more about esProc binary files.
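The multithreaded processing mentioned above relies on splitting the file into segments that can be scanned independently. The Python sketch below illustrates only that segmentation idea and is not how esProc implements it. It assumes a header row, plain comma-separated fields with no quoted commas, and a hypothetical column order of NAME, BIRTHDAY, GENDER; the function names are invented for illustration. Each worker owns the lines that begin inside its byte range, so no line is counted twice or skipped.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_matches_in_range(path, start, end, predicate):
    """Count lines that begin in [start, end) and satisfy predicate(fields)."""
    count = 0
    with open(path, "rb") as f:
        if start == 0:
            f.readline()  # assumption: the first line is a header row
        else:
            # Step back one byte and discard through the next newline, so a
            # line straddling `start` is left to the previous segment.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            # Assumption: no quoted fields containing commas.
            fields = line.rstrip(b"\r\n").decode("utf-8").split(",")
            if predicate(fields):
                count += 1
    return count


def parallel_count(path, predicate, workers=4):
    """Split the file into byte ranges and scan them concurrently."""
    size = os.path.getsize(path)
    step = max(1, -(-size // workers))  # ceiling division
    ranges = [(s, min(s + step, size)) for s in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda r: count_matches_in_range(path, r[0], r[1], predicate), ranges
        )
        return sum(parts)
```

A predicate such as lambda f: f[1] >= "1981-01-01" and f[2] == "F" would express the earlier condition, since ISO dates compare correctly as strings. In CPython, threads mostly help with I/O waits; for CPU-bound parsing a process pool would likely be needed to see a real speedup.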