Aggregate Query over a Large Text File

Question

I want to:

· read a few hundred GB from CSV files > convert some columns to a specific format > save to a DB. This should run a few times a day.

· query the DB with 4 parameters and order the results by an attribute (e.g. PRICE ASC), then pick the TOP entry.

Postgres performs fine in general, meaning 1-2 GB/min (using Java). Selects are also quite fast once I put an index on the 4 parameters used in the query.

Still, importing a few hundred GB will take a “long” time.
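For reference, a minimal sketch of this kind of import, using the PostgreSQL JDBC driver's CopyManager (connection details, table and column names are placeholders; the column conversion step is omitted):

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvImport {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
             Reader csv = Files.newBufferedReader(Paths.get("D:/data.csv"))) {
            // COPY streams the whole file into the table instead of inserting row by row.
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            long rows = copy.copyIn(
                    "COPY data_table (field1, field2, field3, field4, price) " + // placeholder table/columns
                    "FROM STDIN WITH (FORMAT csv, HEADER true)",
                    csv);
            System.out.println("Imported " + rows + " rows");
        }
    }
}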

Question: could it be worthwhile to try the same with a NoSQL engine like Apache Cassandra? I mean:

· could a NoSQL DB be faster for mass inserts of data?

· can I retrieve a row from NoSQL based on a few parameters as quickly as from an indexed Postgres DB?

 

Answer

The several hundred GB of data isn’t static, so every run will spend a long time importing it into the database, longer than the query itself takes, and a NoSQL engine faces the same problem.

esProc SPL can handle a CSV file directly, and it supports processing big files with cursors and multithreading. For example:

 

  A
1 =file("D:/data.csv").cursor@tmc()

 

Besides, esProc SPL offers powerful capabilities for processing (semi-)structured data, which makes the type conversion and your algorithm achievable. For example:

A2=A1.select(field1>1 && like(field2, "*some*")).total(top(4; PRICE))

This finds the records meeting the query condition, aggregates them over the cursor, and gets the top 4 by PRICE.
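If it helps to see the same logic outside SPL, here is a rough Java sketch of what that expression does (the column positions, header row and "some" filter are assumptions taken from the example): it streams the CSV once and keeps only the 4 cheapest matching rows, much like the cursor-based total(top(4;PRICE)) above.

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.PriorityQueue;

public class Top4ByPrice {
    public static void main(String[] args) throws Exception {
        // Max-heap on PRICE: the head is the most expensive of the current candidates,
        // so it is the one evicted once more than 4 rows are held.
        PriorityQueue<String[]> top4 = new PriorityQueue<>(
                Comparator.comparingDouble((String[] r) -> Double.parseDouble(r[4])).reversed());

        try (BufferedReader in = Files.newBufferedReader(Paths.get("D:/data.csv"))) {
            in.readLine();                               // skip the header row (assumption)
            String line;
            while ((line = in.readLine()) != null) {
                String[] r = line.split(",");            // column order is an assumption
                double field1 = Double.parseDouble(r[0]);
                String field2 = r[1];
                if (field1 > 1 && field2.contains("some")) {   // same filter as the SPL select
                    top4.offer(r);
                    if (top4.size() > 4) {
                        top4.poll();                     // drop the most expensive candidate
                    }
                }
            }
        }
        // Print the surviving rows from cheapest to most expensive.
        top4.stream()
            .sorted(Comparator.comparingDouble((String[] r) -> Double.parseDouble(r[4])))
            .forEach(r -> System.out.println(String.join(",", r)));
    }
}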

esProc SPL also supports writing SQL to query a text file directly. For example:

 

  A
1 $select * from data.csv where field1>1 and field2 like "*some*" order by PRICE limit 4
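Run in esProc, this statement queries the CSV file in place and should return the same four cheapest matching records as the A2 expression above, again without importing the data into a database first.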