Column Handling – Large CSV Files

Question

I have a question about how to process a delimited file with a large number of columns (>3000). I tried to extract the fields with the standard delimited file input component, but creating the schema took hours, and when I ran the job I got an error because the generated toString() method exceeded the 65535-byte limit. I eventually got the job to run, but the columns came out scrambled and I couldn't really work with them anymore.

Is it possible to split that .csv file with Talend? Is there any other way to handle it, maybe with some sort of Java code?

 

Answer

Here are your requirements:

1. The ability to process a large CSV file.
2. A schema that can access 3,000+ fields efficiently.
3. A solution that can be called from Java code.

All of them can be met with SPL (Raqsoft's Structured Process Language). SPL reads a large file through a cursor rather than loading it into memory, provides a rich library of functions for structured computations, and is easy to integrate into a Java application. Below is the SPL script for retrieving only the columns you need from a large CSV file (the @tc options tell the cursor to treat the first line as the column titles and to use the comma as the separator, so only the listed fields are read):

 

	A
1	=file("d:\\data.csv").cursor@tc(field,fieldYouNeed)

The script is easily embedded into a Java application. See How to Call an SPL Script in Java to learn more.
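For reference, here is a minimal sketch of what the Java side can look like when the script is called through the esProc JDBC driver. The driver class name, the connection URL, and the script name splitCsv are assumptions for illustration; see the article above for the exact setup in your environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class CallSplExample {
    public static void main(String[] args) throws Exception {
        // Load the esProc JDBC driver (class name assumed; check your esProc installation).
        Class.forName("com.esproc.jdbc.InternalDriver");

        // Connect to the local esProc engine (URL assumed).
        try (Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = conn.createStatement()) {

            // splitCsv is a hypothetical script name; it would contain the cursor
            // expression shown above and return the selected columns.
            ResultSet rs = st.executeQuery("call splitCsv()");

            // Print the first few rows of the selected columns.
            ResultSetMetaData md = rs.getMetaData();
            int count = 0;
            while (rs.next() && count++ < 10) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    if (i > 1) row.append(", ");
                    row.append(rs.getString(i));
                }
                System.out.println(row);
            }
        }
    }
}

If the script returns a cursor, the result set can be iterated row by row, so the full 3000-column file never has to fit in memory on the Java side.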